RFC: Bundled Observability #12793

dannykopping · 2024-03-28T11:38:30Z

dannykopping
Mar 28, 2024
Collaborator

Problem Statement

Offering a product solely on-prem burdens the user (specifically the "operator" persona) with the responsibility of running a reliable service for their own internal customers. As Coder becomes more complex, the number of possible failure modes increases, and we burden the operator with a) knowing how to identify them, b) understanding the problems, and c) fixing them.

Goals

If we provide an excellent self-service experience for operators by shipping world-class OSS observability with our software, their MTTR (Mean Time to Repair) will decrease. If users are able to help themselves instead of reaching out to us for support or filing issues, they will recover from downtime and other problems more quickly and be more delighted with the product.

Ultimately, we do not want operators to have to become subject matter experts on Coder. The easier we make their jobs, the more they will love our product, and the better they and their team will look when they deliver a reliable service.

Additionally, having all of this observability will allow us to better administer our own internal installations, thereby creating a feedback loop when we make modifications.

This should be made available to both OSS users & Enterprise customers alike.

UX

Operators should be given industry-standard OSS tools with which to observe Coder, such as Grafana, Prometheus, and Alertmanager. Using these should be as simple as port-forwarding to a service, with zero to minimal configuration.

Requirements

Initial Functional Requirements

Grafana, Prometheus, and Alertmanager should all be installable via a Helm chart, likely one separate from the main coder chart, with feature toggles. We will reuse existing charts for these components.
Dashboards, alerts, and runbooks should all be bundled with the Helm installation, but they must not be tightly coupled to it, so that a user who already has Grafana/Prometheus/Alertmanager installed can import them into their existing installation instead.
Support log aggregation via Grafana Loki with integration into Grafana.
Our existing dashboard should be expanded, and split into logical groupings around the key components of Coder (network mesh, provisioning, control plane, end-user usage, etc).

Initial Non-functional Requirements

Operators should be able to derive valuable insights into normal & abnormal operation through dashboards & alerts.

Operators should not need any explicit training, and all relevant context should be colocated with the dashboards (instruction panels) & alerts (runbooks).

I would love it if we took inspiration from what Grafana Mimir has done with their dashboards. Each dashboard is scoped to a specific sub-component of the product. They have an Overview dashboard with clear explanations of what each panel means, and the ability to drill down into each for more detail.

Eventual Requirements

Build in support for trace (Tempo or Jaeger) and profile (Pyroscope or Parca) aggregation: these will generally only be useful to us, and users who wish to troubleshoot and fix performance-related issues.

A side effect of having Prometheus in the Helm chart is that we could later use it to implement logical autoscaling with KEDA (based on, for example, provisioner latency), rather than relying solely on the resource-based (CPU/RAM) autoscaling which Kubernetes provides natively.
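To make the idea of scaling on a logical signal concrete, here is a minimal sketch of the proportional rule documented for the Kubernetes Horizontal Pod Autoscaler, applied to a provisioner latency metric. The function name, the latency figures, and the replica bounds are all hypothetical and chosen for illustration. The sketch ignores the tolerance and stabilization behaviour of a real autoscaler and does not describe how KEDA would actually be configured for Coder.

```python
import math


def desired_replicas(current_replicas: int,
                     observed_latency_s: float,
                     target_latency_s: float,
                     min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Proportional scaling: ceil(current * observed / target), clamped to bounds.

    All parameter names and defaults are illustrative assumptions.
    """
    if target_latency_s <= 0:
        raise ValueError("target_latency_s must be positive")
    raw = math.ceil(current_replicas * observed_latency_s / target_latency_s)
    # Keep the result within the configured replica range.
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # Hypothetical figures: 3 provisioners, 45s observed latency, 30s target.
    print(desired_replicas(3, 45.0, 30.0))
```

The point of the sketch is that the scaling input is a product-level signal (how long provisioning takes) rather than CPU or RAM usage, which may stay low even while provisioners are saturated.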

Scope

The goal is to deliver a bundled set of observability & reliability tools, and accompanying material.

Glossary

Grafana: popular OSS dashboarding tool

Prometheus: popular OSS metric collection & querying tool

Alertmanager: popular OSS alert management tool

Alert: comprises an expression (query which represents a suboptimal state of a system) and a notification (sent to a receiver such as Slack, PagerDuty, etc)

Runbook: a set of actions which an operator can take in response to an alert

Dashboard: collection of panels displaying information about systems from which observability signals are collected (metrics, logs, traces, profiles)

Metric: a numeric value indicating a system’s state (e.g. number of live provisioners)

Log: a line of text produced by a system which is stored in a file

Trace: a series of timing records associated with operations within a system

Profile: a set of diagnostic records indicating resource usage (CPU, RAM)
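As a rough illustration of how the Alert, Runbook, and Metric entries above relate, the sketch below models an alert as an expression over metrics plus a notification routed to a receiver. This is a toy model only, not how Prometheus or Alertmanager are implemented. The metric name `live_provisioners`, the alert name, and the runbook path are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Alert:
    name: str
    # Expression: returns True when the system is in a suboptimal state.
    expression: Callable[[Dict[str, float]], bool]
    # Receiver of the notification, e.g. "slack" or "pagerduty".
    receiver: str
    # Hypothetical reference to the runbook an operator should follow.
    runbook: str


def evaluate(alerts: List[Alert], metrics: Dict[str, float]) -> List[str]:
    """Return one notification line per alert whose expression fires."""
    notifications = []
    for alert in alerts:
        if alert.expression(metrics):
            notifications.append(
                f"[{alert.receiver}] {alert.name}: see {alert.runbook}"
            )
    return notifications


# Hypothetical alert: fire when no provisioners are live.
no_provisioners = Alert(
    name="NoLiveProvisioners",
    expression=lambda m: m.get("live_provisioners", 0) < 1,
    receiver="slack",
    runbook="runbooks/no-live-provisioners",
)

if __name__ == "__main__":
    print(evaluate([no_provisioners], {"live_provisioners": 0}))
```

Keeping the runbook reference on the alert itself mirrors the requirement that relevant context be colocated with alerts.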

mattlqx · 2024-03-28T22:34:16Z

mattlqx
Mar 28, 2024

It sounds like this is framed more about server internals than the metrics about the product, or are those included in this effort as well?

I'm all for offering an observability stack using existing OSS with easy setup through Helm. I would say though if I already have a Grafana instance and want to use it, it would be nice to offer the dashboards adhoc so we can import them.

From the product side, it would excite me to have Loki be able to search through workspace start logs so I can easily identify how widespread a given scenario is across the workspaces.

1 reply

dannykopping Apr 2, 2024
Collaborator Author

It sounds like this is framed more about server internals than the metrics about the product, or are those included in this effort as well?

We're aiming to include both. We want to give operators insight into both the low-level (compute resources) and high-level (network mesh, provisioning, control plane, end-user usage, etc) indicators, and alert when these are not within nominal range.

I would say though if I already have a Grafana instance and want to use it, it would be nice to offer the dashboards adhoc so we can import them.

Absolutely! We expect a number of users will already be using the LGTM stack (or a subset), and we'll make sure the resources like dashboards & alert definitions will not be tied to the Helm installation.

From the product side, it would excite me to have Loki

Us too 🙂

Thanks for your thoughts @mattlqx!

smolinari · 2024-04-02T12:53:58Z

smolinari
Apr 2, 2024

This sounds very promising. Hopefully it won't fall too quickly under the "Enterprise" umbrella. 🤔

I must also happily say, I've been using Coder now for a few months and I've only seen a glitch like a workspace losing connectivity overnight (which a restart fixes) like three times and other than that, knock on wood, Coder has been running flawlessly. I guess what I'm saying is, I'm not sure if this is a solution looking for a problem? 😊

Granted, my usage of Coder may be somewhat minimal compared to others so, take my point lightly. And, despite my thoughts, I still think it would be a cool thing to have for the possible rare eventuality that I have an issue with my Coder instance. 😁

Scott
