
ArgoCD healthcheck and GrafanaDashboard #1444

Open
zelig81 opened this issue Mar 10, 2024 · 5 comments
Labels: enhancement, triage/accepted

Comments

zelig81 commented Mar 10, 2024

Is your feature request related to a problem? Please describe.
I created an ArgoCD healthcheck to show whether an ArgoCD app (and its related GrafanaDashboard resource) is OK or not. If the dashboard has an issue, the error is not reflected in the status of the GrafanaDashboard object; the resource is simply not resynced.

Here is the error message from the grafana-operator that I want propagated to the object (here the dashboard JSON is broken on purpose: the title was changed to "title-break"):

2024-03-10T13:24:47Z	ERROR	GrafanaDashboardReconciler	error reconciling dashboard	{"controller": "grafanadashboard", "controllerGroup": "grafana.integreatly.org", "controllerKind": "GrafanaDashboard", "GrafanaDashboard": {"name":"tag-server.json","namespace":"ssp"}, "namespace": "ssp", "name": "tag-server.json", "reconcileID": "8ed6f103-b4e2-470a-9652-7e3c8e9a2329", "dashboard": "tag-server.json", "grafana": "external-grafana", "error": "status: 400, body: {\"message\":\"Dashboard title cannot be empty\",\"status\":\"empty-name\"}"}
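
For comparison, the GrafanaDashboard status after such a failure carries no trace of the error; as far as I can tell it only exposes fields like these (a sketch; values elided):

status:
  hash: ...        # content hash of the dashboard model
  lastResync: ...  # timestamp of the last successful resync
  uid: ...         # uid of the dashboard in Grafana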

Describe the solution you'd like
I propose updating status.lastMessage with this error message, as is already done for GrafanaDatasource (e.g. "error 400: Dashboard title cannot be empty").
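
A failed sync could then surface like this (a hypothetical sketch; the lastMessage field is assumed to behave as it already does for GrafanaDatasource):

status:
  hash: ...
  lastMessage: 'error 400: Dashboard title cannot be empty'
  lastResync: ...
  uid: ...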

Describe alternatives you've considered
Meanwhile, my workaround is to wait 5 minutes after the last resync; if the resource still has not synced, my healthcheck reports a failure to sync for more than 5 minutes and a plea to check the grafana-operator logs.

Additionally, it would be nice for the status of the object to follow the usual Kubernetes resource conventions, like:

...
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2024-01-28T07:02:32Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2024-03-10T12:45:09Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2024-03-10T12:45:09Z"
    status: "True"
    type: DashboardReady
  ...
  uid: ....
  hash: ....
  lastResync: "2024-01-28T07:02:32Z"

Additional context
My current ArgoCD healthcheck:

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
data:
  resource.customizations: |
   "grafana.integreatly.org/GrafanaD*":
      health.lua.useOpenLibs: true
      health.lua: |
        local hs = {}
        if obj.status ~= nil then
          if obj.status.lastMessage ~= nil then
            hs.status = "Degraded"
            hs.message = obj.status.lastMessage
            return hs
          end
          if obj.status.NoMatchingInstances ~= nil then
            hs.status = "Degraded"
            hs.message = "No matching grafana instances found"
            return hs
          end
          if obj.status.hash ~= nil and obj.status.lastResync ~= nil and obj.status.uid ~= nil then
            local lastResync = os.time({year=string.sub(obj.status.lastResync, 1, 4), month=string.sub(obj.status.lastResync, 6, 7), day=string.sub(obj.status.lastResync, 9, 10), hour=string.sub(obj.status.lastResync, 12, 13), min=string.sub(obj.status.lastResync, 15, 16), sec=string.sub(obj.status.lastResync, 18, 19)})
            local diff = os.difftime(os.time(), lastResync)
            if diff <= 300 then
              hs.status = "Healthy"
              hs.message = "The resource already synced at " .. obj.status.lastResync
              return hs
            else
              hs.status = "Degraded"
              hs.message = "The resource last synced more than 5 minutes ago, at " .. obj.status.lastResync .. ". Please check the grafana-operator logs for more information"
              return hs
            end
          end
        end
        hs.status = "Progressing"
        hs.message = "Still creating the Object"
        return hs
zelig81 added the enhancement and needs triage labels on Mar 10, 2024

NissesSenap commented Mar 10, 2024

Hi @zelig81, that is really nice.
We are currently working on getting alert support added to the grafana-operator as can be seen here: #1420.
In it, we are using conditions.

When it gets merged, we plan to change the other CRs to match how status is managed in the alert CR. We haven't decided exactly what they will look like.

So in short, your issue is already on our TODO list; we just hadn't created an issue for it, so this will be a great way for us to keep track of it. Thank you.

When those changes get merged, it would be great if we could work on custom health checks for ArgoCD and write some docs about it.

If you are interested in contributing the new status solution to the existing CRs, that would be great. Otherwise, a maintainer will take a stab at it in the future.

NissesSenap added the triage/accepted label and removed the needs triage label on Mar 10, 2024

zelig81 commented Mar 10, 2024

I would be delighted to assist with this. Just please let me know the format of the conditions.

zelig81 commented Mar 10, 2024

In general, we just need to understand which status/condition is relevant for each of the four ArgoCD resource states (Progressing, Healthy, Degraded, and Suspended, if relevant).

It would be nicest to have all the CR types follow this convention (Grafana, GrafanaFolder, GrafanaDatasource, GrafanaDashboard, GrafanaAlertrule...). Although it may look a little strange, it would allow us to write a single ArgoCD healthcheck to rule them all :)
And to add the healthcheck straight to https://github.com/argoproj/argo-cd; a sketch of such a check is below.
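
A hypothetical condition-based healthcheck covering every CR type might look like this. It is only a sketch: it assumes each CR eventually exposes a standard conditions list with a Ready type, which has not been decided yet:

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
data:
  resource.customizations: |
    # Assumed glob: one entry covering every grafana-operator CR.
    "grafana.integreatly.org/*":
      health.lua: |
        local hs = {}
        if obj.status ~= nil and obj.status.conditions ~= nil then
          for _, condition in ipairs(obj.status.conditions) do
            -- Assumes a standard metav1.Condition with type "Ready".
            if condition.type == "Ready" then
              if condition.status == "True" then
                hs.status = "Healthy"
                hs.message = "The resource is synced"
              else
                hs.status = "Degraded"
                hs.message = condition.message or "The resource is not ready"
              end
              return hs
            end
          end
        end
        hs.status = "Progressing"
        hs.message = "Waiting for a Ready condition"
        return hs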

@NissesSenap

@zelig81 we haven't gotten far enough to give any thought to how the status should look.

It would be great if you took a look at the alert implementation to see how it's done there and whether you think it would fit the other CRs.

We are also planning to move towards finalizers in the other CRs, which should remove the need for some of the existing statuses. It's probably a good idea to do that at the same time, but I understand that it will make this refactor a lot bigger.

So please come with a suggestion for how you think it could be done, and we will try to provide feedback as quickly as possible. The maintainers also hold weekly bug scrub meetings; if you want to join and talk about this refactoring, you are free to do that.
You can just ping me on the Kubernetes/Grafana Slack and I will send you the info.

Btw, I think it sounds like a great idea to have a healthcheck for this in the upstream Argo repo.

zelig81 commented Mar 10, 2024

Let's continue the conversation on Slack.
