Site Reliability

Dani Donisa edited this page Jan 11, 2024 · 19 revisions

Here is how we ensure that our reference server, https://build.opensuse.org, functions reliably.

Logging

In our Ruby on Rails app, we make use of lograge to log to disk. System logs go to a central logging server via rsyslog.

System Health/Performance Monitoring

On our servers, we make use of icinga and many monitoring-plugins, which send infrastructure performance and health monitoring data to an InfluxDB time series database. We visualize this data on a Grafana dashboard. This dashboard is not public.

More details in System Health Monitoring.

Application Performance Monitoring

Inside our Ruby on Rails app, we make use of influxdb-rails which sends performance data to an InfluxDB time series database. We visualize this data on a Grafana dashboard reachable at https://obs-measure.opensuse.org

More details in Application Performance Monitoring.

Application Health Monitoring

Inside our Ruby on Rails app, we make use of bunny, which sends telemetry to a RabbitMQ message broker. There, a telegraf server agent reads the telemetry and stores it in an InfluxDB time series database. We visualize this data on a Grafana dashboard reachable at https://obs-measure.opensuse.org

More details in Application Health Monitoring.
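
The page does not say how the telemetry messages are encoded. One encoding telegraf can parse is InfluxDB line protocol, so the following is a minimal sketch under that assumption. The LineProtocol module and the measurement, tag, and field names used with it are hypothetical and not taken from the OBS code base.

```ruby
# Builds one InfluxDB line protocol record of the form:
#   measurement,tag_key=tag_value field_key=field_value
# Hypothetical helper for illustration only.
module LineProtocol
  module_function

  # Tag keys, tag values and field keys: escape commas, equals signs and spaces.
  def escape_key(value)
    value.to_s.gsub(/[,= ]/) { |char| "\\#{char}" }
  end

  # Measurement names: escape commas and spaces.
  def escape_measurement(value)
    value.to_s.gsub(/[, ]/) { |char| "\\#{char}" }
  end

  # Integers get an "i" suffix, floats and booleans are written as-is,
  # everything else becomes a double-quoted string.
  def format_field(value)
    case value
    when Integer then "#{value}i"
    when Float, true, false then value.to_s
    else
      escaped = value.to_s.gsub(/["\\]/) { |char| "\\#{char}" }
      %("#{escaped}")
    end
  end

  def build(measurement, tags: {}, fields: {})
    raise ArgumentError, 'at least one field is required' if fields.empty?

    head = [escape_measurement(measurement)]
    # Sorting tags by key keeps the output stable.
    tags.sort_by { |key, _| key.to_s }.each do |key, value|
      head << "#{escape_key(key)}=#{escape_key(value)}"
    end
    body = fields.map { |key, value| "#{escape_key(key)}=#{format_field(value)}" }
    "#{head.join(',')} #{body.join(',')}"
  end
end
```

For example, LineProtocol.build('build', tags: { state: 'failed', arch: 'x86_64' }, fields: { count: 1 }) returns build,arch=x86_64,state=failed count=1i. A string like this could then be used as the message payload published through bunny.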

Exception Tracking

Inside our Ruby on Rails app, we make use of airbrake which sends application exceptions to an errbit error catcher service at https://errbit.opensuse.org
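
As an illustration of how an airbrake client can be pointed at an errbit instance, here is a sketch of a Rails initializer. It is not the actual OBS configuration: the file name, the project id, the ERRBIT_PROJECT_KEY variable name, and the ignored environments are all assumptions.

```ruby
# config/initializers/airbrake.rb (hypothetical file, illustrative values)
Airbrake.configure do |config|
  # Send exceptions to the errbit service instead of the default Airbrake host.
  config.host = 'https://errbit.opensuse.org'
  # Placeholder id; use whatever your errbit app expects.
  config.project_id = 1
  # Hypothetical environment variable holding the errbit app's API key.
  config.project_key = ENV.fetch('ERRBIT_PROJECT_KEY')
  config.environment = Rails.env
  # Assumption: do not report exceptions from local development or tests.
  config.ignore_environments = %w[development test]
end
```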

Web Analytics

We don't do web analytics.

Tracing

We don't do tracing.

Incident Management

There is always at least one person "on-call". As soon as we are alerted, that person takes on incident command and holds all positions (hacking on the problem, operating the server, communicating with the users) that they have not delegated. They are free to pull in anyone they need and hand out tasks/roles to resolve the incident.

After resolving the incident, we do a root cause analysis and publish a report, based on our Post-Mortem-Template, on https://openbuildservice.org/categories/deployments/

We use priority labels for issues:

  • P1: Urgent - EVERYONE drops everything and fixes this
  • P2: High - If at all possible, assign this to yourself and fix it ASAP

Development Environment

You can run OBS and all the tools we use in our SRE stack in your development environment. To set up the stack, run:

rake docker:sre:build

This will fetch all images and configure them. Afterward, you can issue any docker compose command you would normally use by also passing the docker-compose.sre.yml file with -f. For instance, to boot up OBS including the SRE stack, you would use:

docker compose -f docker-compose.sre.yml -f docker-compose.yml up
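
Since every docker compose call needs the same two -f flags, a small wrapper can save typing. The following is a minimal sketch; sre_compose_command is a hypothetical helper, not an existing rake task or part of OBS.

```ruby
# Compose files named on this page, in the same order as the command above.
COMPOSE_FILES = %w[docker-compose.sre.yml docker-compose.yml].freeze

# Returns the argument list for a docker compose call that includes the SRE stack.
# Hypothetical helper name, for illustration only.
def sre_compose_command(*args)
  file_flags = COMPOSE_FILES.flat_map { |file| ['-f', file] }
  ['docker', 'compose', *file_flags, *args.map(&:to_s)]
end
```

Calling system(*sre_compose_command('up')) would then run the same command as shown above, and other subcommands such as ps or logs can be passed the same way.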

Configure Grafana

Go to the Grafana frontend at http://0.0.0.0:8000, log in (admin/admin), and import the 'influxdb-rails' sample dashboards (Overview, per Request, per Action), or export dashboards from, for example, obs-measure and import them.
