
Boulder performance testing system design


We currently do not have a solid method for detecting performance regressions other than directly watching Grafana dashboards after deploying changes to staging.

Aim

We need to develop a solution that catches these regressions before they ship to staging (where, due to non-standard load, they may not be noticed). Ideally this solution will provide:

  • Per-PR/commit metrics for performance of HTTP endpoints/RPCs (driven by GitHub webhooks)
  • Comparison between master and proposed changes
  • Historic performance (change over time)
  • Alerting on significant performance regressions (and perhaps significant improvements as well?)
  • Job queue (ability to run multiple jobs, probably serially)
  • Easy to add new performance tests/metrics

In a perfect world we'd also get:

  • Easy experimentation without needing to submit a PR
  • The ability to request a re-run via GitHub comments

Design

We should develop a system that ingests requests (likely from GitHub webhooks) to run a load test against a specific boulder commit. For each request it should spin up a docker container running boulder at that commit alongside a load-generator instance, with both outputting metrics (the built-in boulder ones) to a prometheus instance.

Once the load-generator run is finished (its settings should probably be statically configured), a set of configured prometheus queries (e.g. HTTP endpoint/RPC median/p99 latency) should be evaluated over the time period the load-generator was running (we may also want some way to tag the metrics boulder outputs per run?). These metrics should then be stored in some backend (SQLite, MySQL, InfluxDB, whatever) and compared against a previous set of metrics considered the 'master set' (the set for current boulder git master). The metrics for the commit, and the comparison against the 'master set', should then be sent back to GitHub (or just presented somewhere locally, e.g. a website).
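As a rough sketch, the metrics-collection step might look like the following, using the official Prometheus Go client (github.com/prometheus/client_golang); the collectMetrics helper and the idea of keying queries by name are illustrative, not existing boulder code:

```go
package perftest

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
	"github.com/prometheus/common/model"
)

// collectMetrics evaluates each configured query against prometheus at the
// end of the load-generator run. Range selectors inside the queries
// (e.g. [10m]) are expected to cover the window the load-generator ran for.
func collectMetrics(ctx context.Context, promAddr string, runEnd time.Time, queries map[string]string) (map[string]model.Value, error) {
	client, err := api.NewClient(api.Config{Address: promAddr})
	if err != nil {
		return nil, err
	}
	prom := promv1.NewAPI(client)

	results := make(map[string]model.Value, len(queries))
	for name, q := range queries {
		val, warnings, err := prom.Query(ctx, q, runEnd)
		if err != nil {
			return nil, fmt.Errorf("query %q failed: %w", name, err)
		}
		if len(warnings) > 0 {
			fmt.Printf("warnings for query %q: %v\n", name, warnings)
		}
		results[name] = val
	}
	return results, nil
}
```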

Likely we'll want to limit the number of tests that can run at once (probably one at first), so we'll need some way to store jobs in a queue. Given jobs will likely run for a while, the queue should be kept in some kind of stateful storage backend so that it can survive a restart of the main system. We'll probably also want a fallback method of restarting/re-requesting a job alongside the GitHub webhooks, in case a job is lost for whatever reason and we still want the data.
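A minimal sketch of what a restart-safe queue might look like, assuming a SQL backend (the schema and the nextJob helper are hypothetical, not settled):

```go
package perftest

import (
	"database/sql"
	// A driver is needed to actually run this, e.g.:
	// _ "github.com/mattn/go-sqlite3"
)

// The jobs table doubles as the queue; because rows live in the database,
// pending work survives a restart of the job manager.
const jobsSchema = `
CREATE TABLE IF NOT EXISTS jobs (
	id         INTEGER PRIMARY KEY AUTOINCREMENT,
	commit_sha TEXT NOT NULL,
	kind       TEXT NOT NULL, -- "new-test" or "new-master"
	status     TEXT NOT NULL  -- "pending", "running", "done", "failed"
);`

// nextJob claims the oldest pending job and marks it running, so a crash
// mid-run leaves an inspectable row rather than silently dropping work.
func nextJob(db *sql.DB) (id int64, commitSHA, kind string, err error) {
	row := db.QueryRow(`SELECT id, commit_sha, kind FROM jobs
		WHERE status = 'pending' ORDER BY id LIMIT 1`)
	if err = row.Scan(&id, &commitSHA, &kind); err != nil {
		return 0, "", "", err
	}
	_, err = db.Exec(`UPDATE jobs SET status = 'running' WHERE id = ?`, id)
	return id, commitSHA, kind, err
}
```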

For the system to compare metrics from a run against the 'master set', we'll also need a way to generate and store that set itself. Probably the simplest approach is to handle two different types of requests: new-test and new-master. We can then have a webhook that runs a new-master job whenever a new commit is merged to master. This job is essentially the same as the one described above, except it skips comparison and reporting and just saves the metrics as the new 'master set'.
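A sketch of how the two job kinds might share one code path; the helper functions are stubs standing in for the load-test, storage, and comparison pieces described elsewhere in this design (all names are illustrative):

```go
package perftest

import "fmt"

// Job is the unit of work the manager queues; Kind distinguishes the two
// request types described above.
type Job struct {
	Commit string
	Kind   string // "new-test" or "new-master"
}

// MetricSet is a placeholder for a stored set of query results.
type MetricSet map[string]float64

// Stubs for the pieces sketched elsewhere in this design.
func loadTestAndCollect(commit string) (MetricSet, error) { return MetricSet{}, nil }
func saveMasterSet(m MetricSet) error                     { return nil }
func loadMasterSet() (MetricSet, error)                   { return MetricSet{}, nil }
func compareAndReport(run, master MetricSet) error        { return nil }

// runJob dispatches on job kind: new-master runs the same load test but
// only stores the result as the new baseline, while new-test compares
// against the stored baseline and reports.
func runJob(j Job) error {
	metrics, err := loadTestAndCollect(j.Commit)
	if err != nil {
		return err
	}
	switch j.Kind {
	case "new-master":
		return saveMasterSet(metrics)
	case "new-test":
		master, err := loadMasterSet()
		if err != nil {
			return err
		}
		return compareAndReport(metrics, master)
	default:
		return fmt.Errorf("unknown job kind %q", j.Kind)
	}
}
```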

Configuration options

  • load-generator
    • run time
    • rate
  • prometheus queries
  • alerting thresholds
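As a sketch, these options might map onto a Go configuration struct along these lines (the field names and the idea of loading it from a static YAML/JSON file are assumptions, not settled choices):

```go
package perftest

import "time"

// Config sketches the static configuration listed above.
type Config struct {
	LoadGenerator struct {
		RunTime time.Duration // how long to generate load for
		Rate    int           // request rate for the load-generator
	}

	// Named prometheus queries evaluated over the load-test window, e.g.
	// "wfe-p99": `histogram_quantile(0.99, rate(..._bucket[10m]))`.
	PrometheusQueries map[string]string

	// Per-query regression thresholds as fractional change; 0.10 means
	// alert when a metric worsens by more than 10% versus the master set.
	AlertThresholds map[string]float64
}
```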

Architecture

  • A main golang service manages jobs (full life cycle of receive, run, output)
  • Docker
  • Prometheus

It may also need some sort of non-filesystem-based storage backend (MySQL or something) for storing metrics tagged by PR/commit.
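A hypothetical schema for that backend, tagging each stored query result by run and commit so result sets can be fetched and compared later (column names and types are illustrative):

```go
package perftest

// Illustrative results table: one row per query result per run.
const resultsSchema = `
CREATE TABLE IF NOT EXISTS results (
	run_id     VARCHAR(64)  NOT NULL, -- unique identifier for the run
	commit_sha VARCHAR(40)  NOT NULL, -- boulder commit under test
	pr         INT,                   -- originating PR, if any
	query_name VARCHAR(255) NOT NULL, -- which configured query produced this
	value      DOUBLE       NOT NULL,
	PRIMARY KEY (run_id, query_name)
);`
```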

```
             job manager
            +------------------------+
            |                        |
github+-------->git webhook ingest   |
            |          +             |
            |          |             |
            |          v             |
            |     job control+---------->docker+---->prometheus
            |                        |                   +
            |          +---------------------------------+
            |          |             |
            |          v             |
            |  metrics consolidation |
            |          +             |
            |          |             |
            |          v             |
github<---------+report generation+------->storage
            |                        |
            +------------------------+
```
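The 'git webhook ingest' box might look roughly like this, assuming GitHub push events (the payload fields ref and after are real, but the handler and the enqueue hook are illustrative, and webhook signature verification is omitted for brevity):

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// pushEvent captures the small slice of GitHub's push payload we need.
type pushEvent struct {
	Ref   string `json:"ref"`
	After string `json:"after"` // head commit SHA of the push
}

// ingestHandler turns webhook deliveries into queued jobs via the
// caller-supplied enqueue function (hypothetical here).
func ingestHandler(enqueue func(commit, kind string) error) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		var ev pushEvent
		if err := json.NewDecoder(r.Body).Decode(&ev); err != nil {
			http.Error(w, "bad payload", http.StatusBadRequest)
			return
		}
		// Pushes to master refresh the baseline; anything else is a test run.
		kind := "new-test"
		if ev.Ref == "refs/heads/master" {
			kind = "new-master"
		}
		if err := enqueue(ev.After, kind); err != nil {
			http.Error(w, "enqueue failed", http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusAccepted)
	}
}

func main() {
	http.Handle("/webhook", ingestHandler(func(commit, kind string) error {
		log.Printf("queued %s job for %s", kind, commit)
		return nil
	}))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```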

Development steps

  1. Create a set of docker images (likely using docker-compose) that hold a boulder instance, a prometheus instance, and a load-generator instance; these take a commit, run time, metrics queries, etc., run the load-generator against boulder at the specified commit, and output the results of the metrics queries
  2. Create a storage system that can store these query results linked to the commit (+ probably some unique run identifier)
  3. Create a method for comparing two sets of query results (see the sketch after this list)
  4. Create a system that can run the docker blob when requested and store the query results in the storage system
  5. Tie all of the inputs/outputs into the docker job manager
  6. (optionally) Create a frontend that allows for arbitrary viewing of query results + comparison of result sets
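For step 3, a sketch of what comparing a run against the 'master set' could look like, assuming each stored metric reduces to a single number and that higher values are worse (both assumptions, along with all names here, are illustrative):

```go
package main

import "fmt"

// Delta describes how one metric moved between the master set and a run.
type Delta struct {
	Name           string
	Master, Run    float64
	FractionChange float64 // (run - master) / master
	Regressed      bool
}

// compareSets computes per-metric change and flags regressions that exceed
// the configured threshold. A metric with no configured threshold gets 0,
// i.e. any worsening at all is flagged.
func compareSets(master, run, thresholds map[string]float64) []Delta {
	var out []Delta
	for name, m := range master {
		r, ok := run[name]
		if !ok || m == 0 {
			continue // metric missing from the run, or no baseline to divide by
		}
		frac := (r - m) / m
		out = append(out, Delta{
			Name: name, Master: m, Run: r,
			FractionChange: frac,
			Regressed:      frac > thresholds[name],
		})
	}
	return out
}

func main() {
	master := map[string]float64{"wfe-p99": 0.120}
	run := map[string]float64{"wfe-p99": 0.150}
	for _, d := range compareSets(master, run, map[string]float64{"wfe-p99": 0.10}) {
		fmt.Printf("%s: %+.0f%% change (regressed=%v)\n", d.Name, d.FractionChange*100, d.Regressed)
	}
}
```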