Best practices for monitoring

Bernhard M. Wiedemann edited this page Feb 15, 2023 · 3 revisions

Dashboard

Layout

Create a Text panel at the very top of every dashboard on its own (unnamed) row. For example, see ResourceLoader. The purpose of this text panel is to:

  • Define in a short statement what the subject of the dashboard is. A dashboard should tell a story or answer a question.
  • Summarise in a sentence or two the flow of the data from the source to your screen.
  • Answer the question: if I show this to someone else, how long will it take them to figure out what the dashboard is about?

Avoid putting unrelated panels on a single dashboard; create separate dashboards instead. Periodically review your dashboards and remove unnecessary ones. Add only as many panels to a dashboard as can be viewed on a single screen without scrolling. We have decided to standardise on a screen size of 1440x900.
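The layout rules above can be checked mechanically against an exported dashboard. The following is a minimal sketch, assuming the dashboards are Grafana dashboards exported as JSON, where each panel has a `type` and a `gridPos` with `y` and `h` in grid units. The function name `layout_problems` is hypothetical, and the 30 px grid-unit height is an approximation that ignores page chrome and margins.

```python
GRID_UNIT_HEIGHT_PX = 30  # assumption: approximate pixel height of one grid unit
SCREEN_HEIGHT_PX = 900    # from the agreed 1440x900 screen size


def layout_problems(dashboard, screen_height_px=SCREEN_HEIGHT_PX):
    """Return a list of layout problems found in a dashboard JSON model (a dict)."""
    problems = []
    # Row headers are ignored; only real panels are checked.
    panels = [p for p in dashboard.get("panels", []) if p.get("type") != "row"]
    if not panels:
        return ["dashboard has no panels"]

    def top_edge(panel):
        pos = panel.get("gridPos", {})
        return (pos.get("y", 0), pos.get("x", 0))

    # The panel closest to the top-left corner should be the explanatory Text panel.
    if min(panels, key=top_edge).get("type") != "text":
        problems.append("topmost panel is not a Text panel")

    # The lowest panel edge, converted to pixels, should fit on one screen.
    bottom_units = max(
        p.get("gridPos", {}).get("y", 0) + p.get("gridPos", {}).get("h", 0)
        for p in panels
    )
    if bottom_units * GRID_UNIT_HEIGHT_PX > screen_height_px:
        problems.append("panels extend below a single screen")
    return problems
```

A check like this only approximates what a viewer sees, so treat a warning as a prompt to look at the dashboard at 1440x900 rather than as a hard failure.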

Settings

  • Preferred timezone: UTC.
  • Preferred range: Last 3 hours for most dashboards.
  • Auto-refresh: Provide options for 5min and 15min. If auto-refresh is on by default, use 5min as the default interval. Avoid shorter intervals so as not to cause high load.
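These settings correspond to a few top-level fields in a dashboard's JSON model. The sketch below assumes a Grafana-style model with the fields `timezone`, `time`, `refresh` and `timepicker.refresh_intervals`. The helper name `apply_recommended_settings` is hypothetical, and representing "auto-refresh off" as an empty string is an assumption that may differ between versions.

```python
def apply_recommended_settings(dashboard, auto_refresh=True):
    """Return a copy of a dashboard JSON model (a dict) with the recommended settings."""
    updated = dict(dashboard)
    updated["timezone"] = "utc"                         # preferred timezone: UTC
    updated["time"] = {"from": "now-3h", "to": "now"}   # preferred range: last 3 hours
    # Offer only the 5 minute and 15 minute refresh options, keeping other picker settings.
    updated["timepicker"] = {
        **dashboard.get("timepicker", {}),
        "refresh_intervals": ["5m", "15m"],
    }
    # Assumption: an empty string means auto-refresh is off.
    updated["refresh"] = "5m" if auto_refresh else ""
    return updated
```

For example, `apply_recommended_settings(json.load(f))` could be applied to an exported dashboard file before re-importing it, with `json` from the standard library.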

Panel

When creating a graph, keep in mind what question you want the graph to answer. If possible, try to focus on a single metric only. More metrics are usually a sign that a graph may be attempting to answer too many questions at once.

To decide which dashboard new metrics belong on, we can use common observability strategies. This helps keep dashboards uniform and makes it easier to scale your observability platform. There are two methods: USE (Utilization, Saturation, Errors), for monitoring hardware resources in infrastructure, and RED (Rate, Errors, Duration), for monitoring services. Details on these methods are available here.

To further simplify the observability platform, we can introduce hierarchies in which related panels are linked together using the Panel options -> Panel links feature.
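Panel links can also be added directly in the dashboard JSON. The following is a minimal sketch, assuming a Grafana-style panel model in which each panel carries a `links` list whose entries have `title`, `url` and `targetBlank` keys. The helper name `add_panel_link` and the URL in the usage note are hypothetical.

```python
def add_panel_link(panel, title, url, new_tab=True):
    """Return a copy of a panel JSON model (a dict) with one extra panel link."""
    updated = dict(panel)
    link = {"title": title, "url": url, "targetBlank": new_tab}
    # Keep any existing links and append the new one at the end.
    updated["links"] = list(panel.get("links", [])) + [link]
    return updated
```

For instance, `add_panel_link(overview_panel, "Per-host details", "/d/example-uid/host-details")` would link a high-level panel to a more detailed dashboard one level down in the hierarchy.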

Panel high-level checklist

  • Do all graphs have a left Y axis with a useful and correct unit?
  • Is it obvious and easy to understand what a graph represents exactly?
  • Do all the graphs have a meaningful description, title, and name?
  • Do the alert messages make sense? Do they have the correct corresponding channel?
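Parts of this checklist can be automated, although whether a graph is easy to understand or an alert message makes sense still needs a human reviewer. The following is a minimal sketch, assuming a Grafana-style panel model in which the unit is stored under `fieldConfig.defaults.unit` (older graph panels may store it elsewhere). The function name `panel_checklist_problems` is hypothetical.

```python
def panel_checklist_problems(panel):
    """Return a list of checklist items that a panel JSON model (a dict) fails."""
    problems = []
    if not (panel.get("title") or "").strip():
        problems.append("missing title")
    if not (panel.get("description") or "").strip():
        problems.append("missing description")
    # Assumption: the unit lives under fieldConfig.defaults.unit.
    unit = panel.get("fieldConfig", {}).get("defaults", {}).get("unit")
    if unit in (None, "", "none"):
        problems.append("missing unit")
    return problems
```

This only confirms that a unit is set, not that it is the correct one, so the result complements a manual review rather than replacing it.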

