Prom-Client aggregation causing master to choke in cluster mode #628

Open
ssg2526 opened this issue Apr 23, 2024 · 3 comments

ssg2526 commented Apr 23, 2024

I implemented a small cluster-module-based application with one master and eight workers running. I am using the AggregatorRegistry class like this:

const AggregatorRegistry = client.AggregatorRegistry;
const aggregatorRegistry = new AggregatorRegistry();

const externalMethodTimings = new client.Histogram({
    name: 'external',
    labelNames: ['status', 'method', 'url'],
    help: 'None'
});
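
The cluster bootstrap itself is not shown here; it is the usual pattern, roughly like this (a minimal sketch using Node's cluster module):

const cluster = require('cluster');

if (cluster.isMaster) {
    // One master process: forks eight workers and serves the /master
    // metrics endpoint (see below).
    for (let i = 0; i < 8; i++) {
        cluster.fork();
    }
} else {
    // Eight worker processes: each runs the express app with the /worker
    // endpoint; the cluster module shares the worker port between them (see below).
}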

Each worker has one endpoint that responds with 30 ms of latency, plus a middleware that inflates the number of metric label combinations to simulate production-like behaviour.

let express = require('express');
let app = express();

let counter = 0;
const port = 8080; // the port Apache Benchmark hits; shared by all workers via the cluster module

// Tag each request with a start time and a synthetic URL label
// (700 distinct values) to inflate the number of metric series.
app.use(function (req, res, next) {
    req.startTime = new Date().getTime();
    req.custom = req.url + "/" + counter % 700;
    counter++;
    next();
});

// Observe the request duration once the response has finished.
app.use(function (req, res, next) {
    var startAt = req.startTime;
    res.on('finish', () => {
        try {
            var diff = new Date().getTime() - startAt;
            externalMethodTimings
                .labels({ status: res.statusCode, method: req.method, url: req.custom })
                .observe(diff / 1000.0);
        } catch (err) {
            // ignore observation errors
        }
    });
    next();
});

app.get('/worker', function (req, res) {
    setTimeout(() => {
        return res.status(200).send('worker Server is running');
    }, 30);
});

app.listen(port, function () {
    console.log('worker server listening on port %d', port);
});

The master has the following code:

const master_port = 3000; // the port the metrics watcher scrapes

app.get('/master', async (req, res) => {
    try {
        var start = new Date().getTime();
        // Ask every worker for its metrics over IPC and aggregate them.
        const allMetrics = await aggregatorRegistry.clusterMetrics();
        console.log(new Date().getTime() - start);
        res.set('Content-Type', aggregatorRegistry.contentType);
        res.send(allMetrics);
    } catch (e) {
        res.statusCode = 500;
        res.send(e.message);
    }
});

app.listen(master_port, function () {
    console.log('master server listening on port %d', master_port);
});

I ran this application and put load on it with Apache Benchmark using the configuration below. I also ran a watcher that hits the cluster-metrics endpoint on the master every second to simulate Prometheus scraping. The commands are given below -

ab -n 60000 -c 50 "http://localhost:8080/worker"

watch -n 1 "curl http://localhost:3000/master > /dev/null";

With the watcher running, my 95th percentile is about 190-200 ms, and the 98th and 99th percentiles are even higher under this load. When I stop the watcher and don't hit the metrics endpoint during the load, the latencies fall back to the expected range of about 30-40 ms at the 95th percentile.
I also tried commenting out the aggregation code in prom-client and ran the same load; it was fine again. So my understanding at this point is that aggregating the metrics from the different workers blocks the master when the metrics payload is large; while the master is choked it stops routing requests to the workers, and the requests queuing in front of the workers produce the latencies reported by the benchmark tool. Further evidence for this hypothesis is that the latencies reported by the workers in the Prometheus metrics always land in the le=50ms bucket, which means that once a request reaches a worker it is always answered within the expected time.
Kindly suggest a way to resolve this. I am also happy to raise a PR if required.
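
One way to check that the aggregation is blocking the master's event loop (rather than the workers being slow) is to watch the event-loop delay inside the master while the scrape is running. A minimal sketch using Node's perf_hooks.monitorEventLoopDelay (Node 11.10+); the resolution and reporting interval are arbitrary:

const { monitorEventLoopDelay } = require('perf_hooks');

// Samples the master's event-loop delay; a large p99/max only while the
// metrics endpoint is being scraped points at the aggregation blocking the loop.
const loopDelay = monitorEventLoopDelay({ resolution: 20 });
loopDelay.enable();

setInterval(() => {
    console.log(
        `event loop delay (ms): mean=${(loopDelay.mean / 1e6).toFixed(1)} ` +
        `p99=${(loopDelay.percentile(99) / 1e6).toFixed(1)} ` +
        `max=${(loopDelay.max / 1e6).toFixed(1)}`
    );
    loopDelay.reset();
}, 5000);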

ssg2526 commented Apr 24, 2024

To give some background on why we set up this experiment: our production app is a proxy service that uses prom-client to track all of its metrics. The service runs at very high scale and serves a huge amount of traffic. During one activity we had to shut down Prometheus, and we suddenly saw the p99 latency drop from 450 ms to a mere 50 ms. Prometheus scrapes every 10 seconds in production.
That event led us to look for the root cause, and with this small simulation we identified that the master gets choked while aggregating metrics for Prometheus. I have also confirmed this by commenting out the aggregation logic and having prom-client send an empty response while the workers still send their actual metrics to the master via IPC; with that change Apache Benchmark again reports the expected latencies.

@zbjornson
Collaborator

Thanks for the detailed write-up. I did most of the cluster code in this module, but have had no time to do any work on it recently. If you see places for improvement, a PR would be welcome.

ssg2526 commented May 20, 2024

@zbjornson @siimon we were able to identify the choking points in the code and have devised a solution around them. There were two major choking points:

  1. The hash that uniquely identifies a metric is generated in the master and inserted into a Map. Inserting large string keys into a Map in JavaScript is costly, and when we timed it we found that this step was taking a long time (see the sketch after this list).
  2. When the master asks the workers for their metrics, the IPC channel gets choked while the metrics are sent from the workers to the master, and because the IPC is choked the master is unable to route incoming requests to the workers.
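
A rough sketch of the kind of measurement behind point 1; the key lengths and counts below are illustrative, not taken from the real workload:

// Compare inserting many short vs. long string keys into a Map.
function benchMapInsert(keyLength, count) {
    const suffix = 'x'.repeat(keyLength);
    const map = new Map();
    const start = process.hrtime.bigint();
    for (let i = 0; i < count; i++) {
        // Each key is unique, mimicking one Map entry per metric/label-set hash.
        map.set(i + '_' + suffix, []);
    }
    const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
    console.log(`keyLength=${keyLength} count=${count} -> ${elapsedMs.toFixed(1)} ms`);
}

benchMapInsert(16, 100000);   // short keys
benchMapInsert(2048, 100000); // long keys, closer to a serialized label set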

Proposed solution -

  • The hashing of each metric's labels is moved to the workers. Each worker hashes the data and adds it under a hash key in the metrics JSON, which distributes the hashing work across all workers.
    • The impact of this change is that the master no longer needs to hash the metrics of every worker.
  • The Map, which previously was rebuilt from scratch on every request, is made global. Only when a new metric arrives is it inserted into the Map; otherwise we read the existing value from the Map and push the new worker metric onto its array.
    • The impact of this change is that we minimise Map insertions, which are what is slow.
  • Since the IPC channel was getting choked, and with the new hashes being added at the worker level we wanted to avoid making that worse, the workers no longer send the metrics themselves over IPC. Instead each worker writes its metrics to a tmp file (named with its pid) and only sends the file name to the master over IPC; the master reads the file and then deletes it. Everything else is unchanged. If the master fails to delete a file, the worker simply overwrites it on the next Prometheus scrape (see the sketch after this list).
    • The impact of this change is that the IPC channel no longer gets choked, so request routing is not affected when the metrics payload is large.
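
A minimal sketch of the file-based hand-off described above, assuming a plain cluster setup; the message shapes, the collectWorkerMetrics helper and the file naming are illustrative and not the exact code from the PR:

const cluster = require('cluster');
const fs = require('fs');
const os = require('os');
const path = require('path');
const client = require('prom-client');

if (cluster.isWorker) {
    // Worker: serialize the local registry into a tmp file named with the pid
    // and send only the file name back over IPC, keeping the IPC payload tiny.
    process.on('message', async (msg) => {
        if (msg && msg.type === 'get-metrics') {
            const metrics = await client.register.getMetricsAsJSON();
            const file = path.join(os.tmpdir(), `metrics-${process.pid}.json`);
            fs.writeFileSync(file, JSON.stringify(metrics));
            process.send({ type: 'metrics-file', requestId: msg.requestId, file });
        }
    });
} else {
    // Master: ask every worker for its metrics file, read and delete the
    // files, then aggregate the contents with prom-client's aggregator.
    function collectWorkerMetrics() {
        const workers = Object.values(cluster.workers);
        const requestId = Date.now();
        return new Promise((resolve) => {
            const perWorker = [];
            const onMessage = (worker, msg) => {
                if (!msg || msg.type !== 'metrics-file' || msg.requestId !== requestId) return;
                perWorker.push(JSON.parse(fs.readFileSync(msg.file, 'utf8')));
                // If the unlink fails, the worker simply overwrites the file on the next scrape.
                try { fs.unlinkSync(msg.file); } catch (e) { /* ignore */ }
                if (perWorker.length === workers.length) {
                    cluster.removeListener('message', onMessage);
                    resolve(client.AggregatorRegistry.aggregate(perWorker).metrics());
                }
            };
            cluster.on('message', onMessage);
            workers.forEach((w) => w.send({ type: 'get-metrics', requestId }));
        });
    }
}

The master's /master handler would then await collectWorkerMetrics() instead of aggregatorRegistry.clusterMetrics().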

PR Link

Here are the results after making the changes mentioned above. The load and code used are the same as described at the beginning of this thread -
[Screenshots of the benchmark results after the changes, captured 2024-05-18]
