Using Prometheus to Monitor Complex Applications and Infrastructure


Prometheus is an open-source monitoring system designed for containerized applications in a microservice architecture. Given the increasing complexity of modern infrastructure, tracking an application's state, health, and performance is not easy, and the failure of one application can affect other applications' ability to run correctly. Prometheus tracks the condition, performance, and custom metrics of an application over time, which lets an engineer avoid unexpected situations and improve the overall functionality of a microservice infrastructure. That said, engineers can also use Prometheus to monitor monolithic infrastructure.

Before we delve into the nitty-gritty of Prometheus, we should define some important terms:

  • Metric: A number with a name, where measuring it has meaning. For example, “cpu_usage = 1000 millicores” is a metric.
  • Target: Any application exporting metrics at the ‘/metrics’ HTTP endpoint in the Prometheus format.
  • Exporter: A library or program that converts existing metrics into the Prometheus format.
  • Scrape: Pulling metrics from a target by making an HTTP request.

Prometheus Architecture

Prometheus has three major components:

  • Retrieval: Scrapes the metrics from all targets at configured intervals.
  • Time Series DB: Stores the scraped metrics as time-series data (vectors).
  • HTTP Server: Accepts queries over the time-series data as HTTP requests and returns the results as HTTP responses. The query language used here is PromQL.



Prometheus Metric Types

Prometheus provides several metric types, of which Counter, Gauge, Summary, and Histogram cover most situations. It is the application's job to expose its metrics in the predefined format that Prometheus understands, and the provided client libraries make this easy. Libraries currently exist for popular languages such as Go, Python, Java, and Ruby.

In this blog, we will use the Python client; however, these concepts translate easily to other languages.
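The Python client library is available on PyPI:

```shell
pip install prometheus-client
```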


Counter

Any value that only increases with time, such as an HTTP request count or HTTP error response count, can use a counter metric; a value that can decrease must never use one. A counter has the advantage that you can query the rate at which it increases using the rate() function.

In the example below, we count the number of calls to a Python function:

from prometheus_client import start_http_server, Counter
import random
import time

COUNTER = Counter('function_calls', 'number of times the function is called', ['module'])

def process_request(t):
    """A dummy function that takes some time."""
    COUNTER.labels('counter_demo').inc()  # increment the counter by one
    time.sleep(t)

if __name__ == '__main__':
    start_http_server(8001)  # expose metrics at http://localhost:8001/metrics
    while True:
        process_request(random.random())
This code generates a counter metric named function_calls_total (the client appends the _total suffix). We increment it every time process_request is called. Prometheus metrics carry labels to distinguish similar metrics generated by different applications; in the code above, function_calls_total has a label named module with the value counter_demo.

The output of curl http://localhost:8001/metrics will be:

# HELP function_calls_total number of times the function is called
# TYPE function_calls_total counter
function_calls_total{module="counter_demo"} 22.0
# HELP function_calls_created number of times the function is called
# TYPE function_calls_created gauge
function_calls_created{module="counter_demo"} 1.6061945420716858e+09
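With this counter being scraped, the per-second call rate can be queried in PromQL (using the metric and label names from the example above; the five-minute window is just an illustration):

```promql
rate(function_calls_total{module="counter_demo"}[5m])
```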


Gauge

Any value that can both increase and decrease with time, such as CPU usage, memory usage, or processing time, uses a gauge metric.

In the example below, we record the time taken by the latest call to process_request:


from prometheus_client import start_http_server, Gauge
import random
import time

TIME = Gauge('process_time', 'time taken for each function call', ['module'])

def process_request(t):
    """A dummy function that takes some time."""
    start = time.time()
    time.sleep(t)
    TIME.labels('gauge_demo').set(time.time() - start)  # record the latest duration

if __name__ == '__main__':
    start_http_server(8002)  # expose metrics at http://localhost:8002/metrics
    while True:
        process_request(random.random())

A gauge supports the set(x), inc(x), and dec(x) methods to set, increment, and decrement the metric by x, respectively.

The output of curl http://localhost:8002/metrics will be:

# HELP python_info Python platform information
# TYPE python_info gauge
python_info{implementation="CPython",major="3",minor="9",patchlevel="0",version="3.9.0"} 1.0
# HELP process_time time taken for each function call
# TYPE process_time gauge
process_time{module="gauge_demo"} 0.4709189918033123
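As a standalone sketch of these three methods (the metric name queue_size and its value are made up for illustration; a private registry keeps it separate from the server example above):

```python
from prometheus_client import CollectorRegistry, Gauge

registry = CollectorRegistry()  # private registry, separate from the default one
g = Gauge('queue_size', 'items waiting in the queue', ['module'], registry=registry)
q = g.labels('gauge_demo')

q.set(10)  # set the gauge to an absolute value
q.inc(5)   # 10 + 5 = 15
q.dec(3)   # 15 - 3 = 12

print(registry.get_sample_value('queue_size', {'module': 'gauge_demo'}))  # 12.0
```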


Histogram

A histogram has predefined buckets of different sizes, ranging from 0.005 to 10 by default. For example, if you want to observe the duration of every HTTP request coming to your application, each observed duration falls into one of these buckets; if a request takes ten seconds, the bucket of size ten is incremented by one. The histogram also tracks the sum of all observed durations and the total number of observations.

In the example below, we observe the duration of each call to process_request:

from prometheus_client import start_http_server, Histogram
import random
import time

h = Histogram('request_duration', 'Description of histogram', ['module'])
HIST = h.labels('histogram_demo')

@HIST.time()  # observe the duration of every call
def process_request(t):
    """A dummy function that takes some time."""
    time.sleep(t)

if __name__ == '__main__':
    start_http_server(8003)  # expose metrics at http://localhost:8003/metrics
    while True:
        process_request(random.random())

The output of curl http://localhost:8003/metrics will be:

# HELP request_duration Description of histogram
# TYPE request_duration histogram
request_duration_bucket{le="0.005",module="histogram_demo"} 0.0
request_duration_bucket{le="0.01",module="histogram_demo"} 0.0
request_duration_bucket{le="0.025",module="histogram_demo"} 0.0
request_duration_bucket{le="0.05",module="histogram_demo"} 2.0
request_duration_bucket{le="0.075",module="histogram_demo"} 3.0
request_duration_bucket{le="0.1",module="histogram_demo"} 3.0
request_duration_bucket{le="0.25",module="histogram_demo"} 5.0
request_duration_bucket{le="0.5",module="histogram_demo"} 9.0
request_duration_bucket{le="0.75",module="histogram_demo"} 11.0
request_duration_bucket{le="1.0",module="histogram_demo"} 16.0
request_duration_bucket{le="2.5",module="histogram_demo"} 16.0
request_duration_bucket{le="5.0",module="histogram_demo"} 16.0
request_duration_bucket{le="7.5",module="histogram_demo"} 16.0
request_duration_bucket{le="10.0",module="histogram_demo"} 16.0
request_duration_bucket{le="+Inf",module="histogram_demo"} 16.0
request_duration_count{module="histogram_demo"} 16.0
request_duration_sum{module="histogram_demo"} 7.188765686771258
# HELP request_duration_created Description of histogram
# TYPE request_duration_created gauge
request_duration_created{module="histogram_demo"} 1.60620555290144e+09

  • request_duration_bucket: the buckets, with sizes ranging from 0.005 to 10.
  • request_duration_sum: the sum of the durations of all function calls.
  • request_duration_count: the total number of function calls.
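These bucket counters are what make quantile queries possible in PromQL. For example, an estimated 95th-percentile call duration over the last five minutes (query shape only; adjust the window to taste):

```promql
histogram_quantile(0.95, rate(request_duration_bucket[5m]))
```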



Summary

A summary is very similar to a histogram, except that it does not store bucket information; it only keeps the sum of the observations and the count of total observations.

In the example below, we observe the duration of each call to process_request:

from prometheus_client import start_http_server, Summary
import random
import time

s = Summary('request_processing_seconds', 'Time spent processing request', ['module'])
SUMM = s.labels('pymo')

@SUMM.time()  # observe the duration of every call
def process_request(t):
    """A dummy function that takes some time."""
    time.sleep(t)

if __name__ == '__main__':
    start_http_server(8004)  # expose metrics at http://localhost:8004/metrics
    while True:
        process_request(random.random())

The output of curl http://localhost:8004/metrics will be:

# HELP request_processing_seconds Time spent processing request
# TYPE request_processing_seconds summary
request_processing_seconds_count{module="pymo"} 20.0
request_processing_seconds_sum{module="pymo"} 8.590708531
# HELP request_processing_seconds_created Time spent processing request
# TYPE request_processing_seconds_created gauge
request_processing_seconds_created{module="pymo"} 1.606206054827252e+09

Configuring Prometheus

So far, we have seen the different types of metrics and how to generate them. Now let us see how to configure Prometheus to scrape the metrics we have exposed. Prometheus uses a YAML file to define its configuration:

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  scrape_timeout: 10s

scrape_configs:
  - job_name: 'prometheus_demo'
    static_configs:
      - targets: ['localhost:8001', 'localhost:8002', 'localhost:8003', 'localhost:8004']

The four primary configuration sections used in most situations are:

  1. scrape_interval defines the interval at which Prometheus scrapes the targets.
  2. evaluation_interval determines how frequently Prometheus evaluates the rules defined in the rules file.
  3. job_name is added as a label to every time series scraped from this configuration.
  4. targets defines the list of HTTP endpoints to scrape for metrics.

Starting Prometheus

Download and extract the latest release, then start Prometheus with your configuration file:

tar xvfz prometheus-*.tar.gz
./prometheus --config.file=prometheus.yaml

Alternatively, run the official Docker image:

docker run \
    -p 9090:9090 \
    -v /path/to/prometheus.yml:/etc/prometheus/prometheus.yml \
    prom/prometheus

Accessing UI


Open http://localhost:9090 in a browser to query the collected metrics with PromQL. You can view the time-series data as graphs as well.


Configuring Alerts

If you need an alert when function_calls_total exceeds 100, you need to set up alert rules and Alertmanager. Prometheus uses another YAML file to define alert rules.

groups:
  - name: example
    rules:
      - alert: ExampleAlertName
        expr: function_calls_total{module="counter_demo"} > 100
        for: 10s
        labels:
          severity: low
        annotations:
          summary: example summary

Then reference the rules file from the main configuration:

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  scrape_timeout: 10s

rule_files:
  - alertrules.yml

scrape_configs:
  - job_name: 'prometheus_demo'
    static_configs:
      - targets: ['localhost:8001', 'localhost:8002', 'localhost:8003', 'localhost:8004']

Once you restart Prometheus with the new configuration, the alerts appear in the Alerts section of the Prometheus UI. However, these alerts reach your Slack channel or email only if you set up and configure Alertmanager.

How do you achieve concurrency with Python threads?


Threading allows a program to execute multiple sequences of instructions seemingly at once. Running several I/O operations one after another slows a program down, so threading helps achieve concurrency.

In this article, we will explore:

  1. Types of concurrency in Python
  2. Global Interpreter Lock
  3. Need for Global Interpreter Lock
  4. Thread execution model in Python 2
  5. Global Interpreter Lock in Python 3

Types of Concurrency In Python

In general, concurrency is the overlapping execution of different units of a program, which helps optimize and speed up the overall process. In Python, there are three ways to achieve concurrency:

  1. Multi-Threading
  2. Asyncio
  3. Multi-processing

We will be discussing the fundamentals of the thread execution model. Before going into those concepts, we shall first discuss Python's Global Interpreter Lock (GIL).

Global Interpreter Lock

Python threads are real system threads (POSIX threads). The host operating system fully manages the POSIX threads, also known as p-threads.

On multi-core systems, the Global Interpreter Lock (GIL) prevents the parallel execution of the p-threads of a multi-threaded Python process, ensuring that only one thread runs in the interpreter at any given time.

Why do we need Global Interpreter Lock?

The GIL simplifies the implementation of the interpreter and its memory management. To understand how, we first need to understand reference counting.

For example, in the code below, ‘b’ is not a new list; it is just another reference to the list ‘a’:

>>> a = []
>>> b = a
>>> b.append(1)
>>> a
[1]
>>> a.append(2)
>>> b
[1, 2]


Python uses a reference-count variable to track the number of references that point to an object; the memory occupied by the object is released when the count drops to zero. If threads of a process sharing the same memory try to increment and decrement this count simultaneously, memory can be leaked (never released) or released incorrectly, which leads to crashes.

One solution is to protect the reference count with a lock (a semaphore) so that it cannot be modified concurrently. However, adding a lock to every object increases the performance overhead of acquiring and releasing locks. Hence, Python has a GIL, which gives access to all resources of the process to only one thread at a time. Apart from the GIL, there are other solutions for memory management, such as the garbage collection used in the Jython interpreter.

So the primary outcome of the GIL is that, instead of parallel computing, you get pre-emptive multitasking (threading) and co-operative multitasking (asyncio).
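As a small illustration of the co-operative flavor (a sketch, not part of the original examples): asyncio overlaps waiting periods inside a single thread, because each task yields control at its await points:

```python
import asyncio
import time

async def io_task(t):
    await asyncio.sleep(t)  # cooperatively yields to the event loop

async def main():
    start = time.time()
    # Three 0.2 s waits run concurrently, so the total is ~0.2 s, not 0.6 s
    await asyncio.gather(io_task(0.2), io_task(0.2), io_task(0.2))
    return time.time() - start

elapsed = asyncio.run(main())
print(f'elapsed: {elapsed:.2f}s')
```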

Thread Execution Model in Python 2
import time

def countdown(n):
    while n > 0:
        n -= 1

COUNT = 50000000
start = time.time()
countdown(COUNT)
end = time.time()
print('Time taken in seconds -', end - start)


import time
from threading import Thread

COUNT = 50000000

def countdown(n):
    while n > 0:
        n -= 1

t1 = Thread(target=countdown, args=(COUNT//2,))
t2 = Thread(target=countdown, args=(COUNT//2,))

start = time.time()
t1.start()
t2.start()
t1.join()
t2.join()
end = time.time()
print('Time taken in seconds -', end - start)


~/Pythonpractice/blog❯ python
('Time taken in seconds -', 1.2412900924682617)
~/Pythonpractice/blog❯ python
('Time taken in seconds -', 1.8751227855682373)

Ideally, the two-threaded version should take about half the time of the single-threaded one. Yet in the example above, the multi-threaded run is slightly slower than the single-threaded run. To understand this reduction in performance, despite the work being shared between two parallel threads, we first need to discuss CPU-bound and I/O-bound threads.

CPU-bound threads perform CPU-intensive operations such as matrix multiplications or nested loops. Here, the speed of program execution depends on CPU performance.



I/O-bound threads are threads performing I/O operations such as listening to a socket or waiting for a network connection. Here, the speed of program execution depends on factors including external file systems and network connections.


Scenarios in a Python multi-threaded program

When all threads are I/O-bound

If a thread is running, it holds the GIL. When the thread hits an I/O operation, it releases the GIL, and another thread acquires it to run. This alternating execution of threads is called multitasking.
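This alternation is easy to observe: when the work is pure waiting (time.sleep releases the GIL, much like a real blocking I/O call), the threads overlap and the total wall time stays close to a single wait:

```python
import time
from threading import Thread

def io_task(t):
    time.sleep(t)  # sleep releases the GIL, like a blocking I/O call

threads = [Thread(target=io_task, args=(0.2,)) for _ in range(5)]

start = time.time()
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.time() - start

# Five 0.2 s sleeps overlap, so the total is ~0.2 s rather than 1.0 s
print(f'elapsed: {elapsed:.2f}s')
```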



Where one thread is CPU bound, and another thread is IO-bound:

A CPU-bound thread is a special case in thread execution: it releases the GIL after every few ticks and then tries to acquire it again. A tick roughly corresponds to one interpreter bytecode instruction. When releasing the GIL, the thread also signals the next thread in the execution queue (the operating system's ready queue) that the GIL has been released. Now the CPU-bound and I/O-bound threads race to acquire the GIL, and the operating system decides which thread runs. This model of executing the next thread in the queue before the previous thread completes is called pre-emptive multitasking.




In most cases, the operating system gives preference to the CPU-bound thread and allows it to reacquire the GIL, leaving the I/O-bound thread starving. In the diagram below, the CPU-bound thread has released the GIL and signaled thread 2, but even before thread 2 tries to acquire the GIL, the CPU-bound thread has reacquired it. This issue has been resolved in the Python 3 interpreter's GIL.



On a single-core system, if the CPU-bound thread reacquires the GIL, the second thread is pushed back to the ready queue and assigned some priority; Python has no control over the priorities assigned by the operating system.

On a multi-core system, if the CPU-bound thread reacquires the GIL, the second thread is not pushed back; instead, it continuously tries to acquire the GIL using another core. This behavior is called thrashing, and it reduces performance when many threads try to acquire the GIL from different cores. Python 3 also addresses the thrashing issue.



Global Interpreter Lock in Python 3

The Python 3 thread execution model has a new GIL. If only one thread is running, it continues to run until it hits an I/O operation or another thread requests it to drop the GIL. A global variable (gil_drop_request) helps implement this:

If gil_drop_request = 0, the running thread can continue until it hits I/O.

If gil_drop_request = 1, the running thread is forced to give up the GIL.

Instead of the CPU-bound thread checking after every few ticks, the second thread sends a GIL drop request by setting gil_drop_request = 1 after reaching a timeout; the first thread then drops the GIL immediately. Additionally, the second thread sends a signal to suspend the first thread's execution. This helps avoid thrashing. This check is not available in Python 2.
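The timeout that triggers gil_drop_request is exposed in Python 3 as the interpreter's switch interval, which can be inspected and tuned:

```python
import sys

print(sys.getswitchinterval())  # 0.005 (5 ms) by default

# A shorter interval makes the GIL change hands more often, which can help
# I/O-bound threads respond faster at the cost of more switching overhead.
sys.setswitchinterval(0.001)
print(sys.getswitchinterval())
```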



Missing Bits in the New GIL

While the new GIL addresses issues such as thrashing, it still has some areas for improvement:

Waiting for time out

Waiting for the timeout can make I/O responses slow, especially when there are many recurring I/O operations. An I/O-bound Python program spends considerable time waiting out the timeout to reacquire the GIL after every I/O operation, before the next input/output is ready.

Unfair GIL acquiring

As seen below, the thread that makes the GIL drop request is not necessarily the one that gets the GIL. This kind of situation can hurt I/O performance where response time is critical.



Prioritizing threads

The GIL needs to distinguish between CPU-bound and I/O-bound threads and assign priorities accordingly. High-priority threads must be able to immediately preempt low-priority threads; this would improve response time considerably. Operating systems have already solved this problem: they use timeouts to automatically adjust task priorities. A thread preempted by a timeout is penalized with a lower priority, while a thread that suspends early is rewarded with a raised priority. Incorporating this into Python would help improve thread performance.