RFC: Bundled Observability #12793
Replies: 2 comments 1 reply
-
It sounds like this is framed more around server internals than metrics about the product, or are those included in this effort as well? I'm all for offering an observability stack using existing OSS with easy setup through Helm. I would say, though, that if I already have a Grafana instance and want to use it, it would be nice to offer the dashboards ad hoc so we can import them. From the product side, it would excite me to be able to use Loki to search through workspace start logs so I can easily identify how widespread a given scenario is across workspaces.
-
This sounds very promising. Hopefully it won't fall too quickly under the "Enterprise" umbrella. 🤔 I must also happily say, I've been using Coder now for a few months and I've only seen a glitch like a workspace losing connectivity overnight (which a restart fixes) like three times and other than that, knock on wood, Coder has been running flawlessly. I guess what I'm saying is, I'm not sure if this is a solution looking for a problem? 😊 Granted, my usage of Coder may be somewhat minimal compared to others so, take my point lightly. And, despite my thoughts, I still think it would be a cool thing to have for the possible rare eventuality that I have an issue with my Coder instance. 😁 Scott
-
Problem Statement
Offering a product solely on-prem burdens the user (specifically the "operator" persona) with the responsibility to run a reliable service for their own internal customers. As Coder becomes more complex, the possible failure modes increase, and we burden the operator with a) knowing how to identify them, b) understanding the problems, and c) fixing them.
Goals
If we provide an excellent self-service experience for operators through world-class OSS observability in our software, their MTTR (Mean Time to Repair) will decrease. If users are able to help themselves instead of reaching out to us for support or filing issues, they will recover from downtime and other problems more quickly and be more delighted with the product.
Ultimately we want operators who do not have to become subject matter experts on Coder. The easier we make their jobs, the more they will love our product, and the better they and their teams will look when they deliver a reliable service.
Additionally, having all of this observability will allow us to better administer our own internal installations, thereby creating a feedback loop when we make modifications.
This should be made available to both OSS users & Enterprise customers alike.
UX
Operators should be given industry-standard OSS tools with which to observe Coder, such as Grafana, Prometheus, and Alertmanager. Using these should be as simple as port-forwarding to a service, with zero to minimal configuration.
Requirements
Initial Functional Requirements
Grafana, Prometheus, and Alertmanager should all be installable via a Helm chart, likely one separate from the main coder chart, with feature toggles. We will be reusing existing charts for these components.
Dashboards, alerts, and runbooks should all be bundled with the Helm installation, although they must not be tightly coupled to it, in case a user already has Grafana/Prometheus/Alertmanager installed and wants to import these into their installation instead.
Support log aggregation via Grafana Loki with integration into Grafana.
Our existing dashboard should be expanded, and split into logical groupings around the key components of Coder (network mesh, provisioning, control plane, end-user usage, etc).
Initial Non-functional Requirements
Operators should be able to derive valuable insights into normal & abnormal operation through dashboards & alerts.
Operators should not need any explicit training, and all relevant context should be colocated with the dashboards (instruction panels) & alerts (runbooks).
I would love it if we took inspiration from what Grafana Mimir has done with their dashboards. Each dashboard is scoped to a specific sub-component of the product. They have an Overview dashboard which has clear explanations of what each panel means, with the ability to drill down into each for more detail.
Eventual Requirements
Build in support for trace (Tempo or Jaeger) and profile (Pyroscope or Parca) aggregation: these will generally only be useful to us, and users who wish to troubleshoot and fix performance-related issues.
A side-effect of having Prometheus in the Helm chart is that we could later use it to implement logical autoscaling with KEDA (i.e. scaling based on, for example, provisioner latency) rather than scaling solely on resource usage (CPU/RAM), which Kubernetes provides natively.
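To illustrate what scaling on a logical signal could look like, here is a minimal sketch of the ratio-based replica calculation that the Kubernetes Horizontal Pod Autoscaler applies to a metric (KEDA works by feeding external metrics, such as Prometheus query results, into that mechanism). The function name, the latency inputs, and the replica bounds are hypothetical and chosen only for illustration; the real autoscaler also applies a tolerance band and stabilization windows, which are omitted here.

```python
import math


def desired_replicas(current_replicas: int,
                     current_latency_s: float,
                     target_latency_s: float,
                     min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Return the replica count suggested by the HPA-style ratio formula.

    desired = ceil(current_replicas * current_metric / target_metric),
    clamped to [min_replicas, max_replicas].
    """
    if target_latency_s <= 0:
        raise ValueError("target_latency_s must be positive")
    # Ratio > 1 means the observed latency exceeds the target, so scale up.
    ratio = current_latency_s / target_latency_s
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))
```

For example, with 2 provisioner replicas, an observed latency of 30 seconds, and a target of 10 seconds, the sketch suggests scaling to 6 replicas; if latency drops below the target, the suggested count shrinks again, down to the configured minimum.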
Scope
The goal is to deliver a bundled set of observability & reliability tools, and accompanying material.
Glossary
Grafana: popular OSS dashboarding tool
Prometheus: popular OSS metric collection & querying tool
Alertmanager: popular OSS alert management tool
Alert: comprises an expression (query which represents a suboptimal state of a system) and a notification (sent to a receiver such as Slack, PagerDuty, etc)
Runbook: a set of actions which an operator can take in response to an alert
Dashboard: collection of panels displaying information about systems from which observability signals are collected (metrics, logs, traces, profiles)
Metric: a numeric value indicating a system’s state (e.g. number of live provisioners)
Log: a line of text produced by a system which is stored in a file
Trace: a series of timing records associated with operations within a system
Profile: a set of diagnostic records indicating resource usage (CPU, RAM)
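The relationship between a metric, an alert, and a runbook as defined in this glossary can be sketched roughly as follows. This is only a conceptual model: in a real deployment the expression would be a Prometheus query and Alertmanager would handle routing to receivers. The alert name, threshold, and runbook text below are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Alert:
    name: str
    expression: Callable[[float], bool]  # True when the state is suboptimal
    receiver: str                        # e.g. "slack" or "pagerduty"
    runbook: str                         # actions an operator can take


def evaluate(alert: Alert, metric_value: float) -> Optional[str]:
    """Return a notification message if the alert fires, else None."""
    if alert.expression(metric_value):
        return (f"[{alert.receiver}] {alert.name} firing "
                f"(value={metric_value}); runbook: {alert.runbook}")
    return None


# Hypothetical alert on the "number of live provisioners" metric.
no_provisioners = Alert(
    name="NoLiveProvisioners",
    expression=lambda live: live < 1,
    receiver="slack",
    runbook="Check that provisioner daemons are running; restart if needed.",
)
```

Calling `evaluate(no_provisioners, 0)` produces a notification addressed to the receiver with the runbook attached, while `evaluate(no_provisioners, 3)` returns `None` because the expression does not hold.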