
Grafana dashboard: add an uptime panel to overview #10762

Draft
wants to merge 2 commits into main

Conversation

@guoard commented Mar 18, 2024

Proposed Changes

This pull request adds an uptime panel to the RabbitMQ overview Grafana dashboard.
By incorporating this feature, users can easily track the uptime of each RabbitMQ instance.

Types of Changes

  • Bug fix (non-breaking change which fixes issue #NNNN)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause an observable behavior change in existing systems)
  • Documentation improvements (corrections, new content, etc)
  • Cosmetic change (whitespace, formatting, etc)
  • Build system and/or CI

Checklist

  • I have read the CONTRIBUTING.md document
  • I have signed the CA (see https://cla.pivotal.io/sign/rabbitmq)
  • I have added tests that prove my fix is effective or that my feature works
  • All tests pass locally with my changes
  • If relevant, I have added necessary documentation to https://github.com/rabbitmq/rabbitmq-website
  • If relevant, I have added this change to the first version(s) in release-notes that I expect to introduce it

@mkuratczyk mkuratczyk self-assigned this Mar 18, 2024
@mkuratczyk (Contributor)

Thanks a lot for contributing. Unfortunately, it doesn't work well as currently implemented: if you restart the pods, they get a new identity in this panel, so rather than an updated (shorter) uptime, you will see multiple rows for each pod:
[Screenshot: 2024-03-18 at 10:04:59]

To reproduce the problem, just run kubectl rollout restart statefulset foo and check the dashboard afterwards.

If you can fix this, I'm happy to merge.

@michaelklishin michaelklishin changed the title Add uptime panel to rabbitmq overview grafana dashboard Grafana dashboard: add an uptime panel to overview Mar 18, 2024
@guoard (Author) commented Mar 19, 2024

@mkuratczyk thank you for your time.
I pushed another commit that should fix the problem with Kubernetes StatefulSets.

@mkuratczyk (Contributor)

I'm afraid it still doesn't work when node restarts happen (which is kind of the whole point). Looking at a cluster that went through multiple node restarts, I see this:

[Screenshot: 2024-03-21 at 08:38:55]

@guoard (Author) commented Mar 24, 2024

I ran several tests on a 2-node k3s cluster with 5 RabbitMQ replicas, but I couldn't reproduce the issue you described. I'm keen to help investigate further.

First, could you verify that the Prometheus query in use matches the following:

rabbitmq_erlang_uptime_seconds * on(instance, job) group_left(rabbitmq_cluster) rabbitmq_identity_info{rabbitmq_cluster="$rabbitmq_cluster", namespace="$namespace"}

If the query matches, any additional details or steps that could help reproduce the issue would be very helpful: specific configuration, environment details, or anything else that might shed light on the problem. Thank you in advance for your help.

@guoard (Author) commented Mar 24, 2024

This is the manifest I used to run the RabbitMQ cluster:

apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: foo
spec:
  replicas: 5
  service:
    type: NodePort

@mkuratczyk (Contributor)

I can reproduce this even with a single node: just deploy it and then delete the pod to make it restart. It gets a new IP address and "becomes a new instance" (you can see the difference in the labels):
[Screenshot: 2024-03-25 at 08:46:35]

@guoard (Author) commented Mar 27, 2024

Thank you for providing additional details.

I haven't faced this issue as my monitoring setup operates outside the Kubernetes cluster, with the instance label manually defined.

It appears challenging to correlate the rabbitmq_erlang_uptime_seconds metric with rabbitmq_identity_info without a unique, stable label on the rabbitmq_identity_info metric; without one, the mapping seems infeasible.
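To illustrate my reading of the problem, here is the current panel query again with the join annotated; this is only a sketch based on the discussion above rather than something I have re-verified:

# The join key is (instance, job); on Kubernetes the instance value is the pod IP,
# so a restarted pod gets new label values and shows up as a brand-new series
# (hence the extra rows in the screenshots) instead of resetting the existing one.
rabbitmq_erlang_uptime_seconds
  * on(instance, job) group_left(rabbitmq_cluster)
    rabbitmq_identity_info{rabbitmq_cluster="$rabbitmq_cluster", namespace="$namespace"}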

If you agree with my assessment, please consider closing the PR.

@mkuratczyk mkuratczyk marked this pull request as draft March 27, 2024 08:53
@mkuratczyk (Contributor)

I think uptime would indeed be valuable on the dashboard, and I'm sure we can solve the query problem. I converted this to a draft PR and will have a look at fixing it when I have more time.

@guoard (Author) commented Mar 28, 2024

What are your thoughts on adopting the following approach?

max(max_over_time(QUERY[$__interval]))

I'm unsure of the exact query details at the moment, but this approach would let us track the maximum uptime observed within a specified interval.
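Purely as an untested sketch, substituting the panel query from my earlier comment into that shape could look roughly like this. A PromQL subquery is needed because the joined expression is not a plain selector, and grouping by rabbitmq_node assumes rabbitmq_identity_info exposes that label; how the max behaves right after a restart, while the old longer-uptime series is still inside the lookback window, would need to be verified:

# Untested sketch: assumes rabbitmq_identity_info carries a stable rabbitmq_node
# label that survives pod restarts; the subquery wraps the joined expression.
max by (rabbitmq_node) (
  max_over_time(
    (
      rabbitmq_erlang_uptime_seconds
        * on(instance, job) group_left(rabbitmq_cluster, rabbitmq_node)
          rabbitmq_identity_info{rabbitmq_cluster="$rabbitmq_cluster", namespace="$namespace"}
    )[$__interval:]
  )
)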

@michaelklishin (Member)

@mkuratczyk do you have an opinion on this approach? #10762 (comment)
