
ArgoCD healthcheck and GrafanaDashboard #1444

Open
zelig81 opened this issue Mar 10, 2024 · 5 comments
Labels: enhancement, triage/accepted

Comments

zelig81 commented Mar 10, 2024

Is your feature request related to a problem? Please describe.
I created an ArgoCD healthcheck to show whether an ArgoCD app (and its related GrafanaDashboard resource) is OK or not. If the dashboard has an issue, the error is not reflected in the status of the GrafanaDashboard object; the resource is simply not resynced.

Here is the error message from the grafana-operator that I want propagated to the object (here the dashboard JSON is broken on purpose: the title was changed to "title-break"):

2024-03-10T13:24:47Z	ERROR	GrafanaDashboardReconciler	error reconciling dashboard	{"controller": "grafanadashboard", "controllerGroup": "grafana.integreatly.org", "controllerKind": "GrafanaDashboard", "GrafanaDashboard": {"name":"tag-server.json","namespace":"ssp"}, "namespace": "ssp", "name": "tag-server.json", "reconcileID": "8ed6f103-b4e2-470a-9652-7e3c8e9a2329", "dashboard": "tag-server.json", "grafana": "external-grafana", "error": "status: 400, body: {\"message\":\"Dashboard title cannot be empty\",\"status\":\"empty-name\"}"}
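
For comparison, the GrafanaDashboard status after such a failure carries no trace of the error; as far as I can tell it only exposes fields like these (a sketch; values elided):

status:
  hash: ...        # content hash of the dashboard model
  lastResync: ...  # timestamp of the last successful resync
  uid: ...         # uid of the dashboard in Grafana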

Describe the solution you'd like
I propose updating status.lastMessage with this error message, as is already done for GrafanaDatasource (e.g. "error 400: Dashboard title cannot be empty").
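
A failed sync could then surface like this (a hypothetical sketch; the lastMessage field is assumed to behave as it already does for GrafanaDatasource):

status:
  hash: ...
  lastMessage: 'error 400: Dashboard title cannot be empty'
  lastResync: ...
  uid: ...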

Describe alternatives you've considered
Meanwhile, my workaround is to wait 5 minutes after the last resync; if the resource still has not synced, my healthcheck reports a failure to sync for more than 5 minutes and a plea to check the grafana-operator logs.

Additionally, it would be nice for the status of the object to follow the usual Kubernetes resource conventions, like:

...
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2024-01-28T07:02:32Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2024-03-10T12:45:09Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2024-03-10T12:45:09Z"
    status: "True"
    type: DashboardReady
  ...
  uid: ....
  hash: ....
  lastResync: "2024-01-28T07:02:32Z"

Additional context
My current ArgoCD healthcheck:

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
data:
  resource.customizations: |
   "grafana.integreatly.org/GrafanaD*":
      health.lua.useOpenLibs: true
      health.lua: |
        local hs = {}
        if obj.status ~= nil then
          if obj.status.lastMessage ~= nil then
            hs.status = "Degraded"
            hs.message = obj.status.lastMessage
            return hs
          end
          if obj.status.NoMatchingInstances ~= nil then
            hs.status = "Degraded"
            hs.message = "No matching grafana instances found"
            return hs
          end
          if obj.status.hash ~= nil and obj.status.lastResync ~= nil and obj.status.uid ~= nil then
            local lastResync = os.time({year=string.sub(obj.status.lastResync, 1, 4), month=string.sub(obj.status.lastResync, 6, 7), day=string.sub(obj.status.lastResync, 9, 10), hour=string.sub(obj.status.lastResync, 12, 13), min=string.sub(obj.status.lastResync, 15, 16), sec=string.sub(obj.status.lastResync, 18, 19)})
            local diff = os.difftime(os.time(), lastResync)
            if diff <= 300 then
              hs.status = "Healthy"
              hs.message = "The resource already synced at " .. obj.status.lastResync
              return hs
            else
              hs.status = "Degraded"
              hs.message = "The resource last synced more than 5 minutes ago, at " .. obj.status.lastResync .. ". Please check the grafana-operator logs for more information"
              return hs
            end
          end
        end
        hs.status = "Progressing"
        hs.message = "Still creating the Object"
        return hs
zelig81 added the enhancement and needs triage labels on Mar 10, 2024

NissesSenap commented Mar 10, 2024

Hi @zelig81, that is really nice.
We are currently working on getting alert support added to the grafana-operator as can be seen here: #1420.
In it, we are using conditions.

When it gets merged, we plan to change the other CRs to match how status is managed in the alert CR. We haven't decided exactly what they will look like.

So in short, your issue is already on our TODO list; we just hadn't created an issue for it, so this will be a great way for us to keep track of it. Thank you.

When those changes get merged, it would be great if we could work on custom health checks for ArgoCD and write some docs about it.

If you are interested in contributing the new status solution to the existing CRs, that would be great. Otherwise, a maintainer will take a stab at it in the future.

NissesSenap added the triage/accepted label and removed the needs triage label on Mar 10, 2024

zelig81 commented Mar 10, 2024

I would be delighted to assist with this. Just please let me know the format of the conditions.

zelig81 commented Mar 10, 2024

In general, we just need to understand which status/condition is relevant for each of the four ArgoCD resource states (Progressing, Healthy, Degraded, and Suspended, if relevant).

It would be nicest to have all the CR types follow this convention (Grafana, GrafanaFolder, GrafanaDatasource, GrafanaDashboard, GrafanaAlertrule...). Although it may look a little strange, it would allow us to write a single ArgoCD healthcheck to rule them all :)
And to add the healthcheck straight to https://github.com/argoproj/argo-cd; a sketch of such a check is below.
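
A hypothetical condition-based healthcheck covering every CR type might look like this. It is only a sketch: it assumes each CR eventually exposes a standard conditions list with a Ready type, which has not been decided yet:

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
data:
  resource.customizations: |
    # Assumed glob: one entry covering every grafana-operator CR.
    "grafana.integreatly.org/*":
      health.lua: |
        local hs = {}
        if obj.status ~= nil and obj.status.conditions ~= nil then
          for _, condition in ipairs(obj.status.conditions) do
            -- Assumes a standard metav1.Condition with type "Ready".
            if condition.type == "Ready" then
              if condition.status == "True" then
                hs.status = "Healthy"
                hs.message = "The resource is synced"
              else
                hs.status = "Degraded"
                hs.message = condition.message or "The resource is not ready"
              end
              return hs
            end
          end
        end
        hs.status = "Progressing"
        hs.message = "Waiting for a Ready condition"
        return hs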

@NissesSenap

@zelig81 we haven't gotten far enough to give any thought to how the status should look.

It would be great if you took a look at the alert implementation to see how it's done there and whether you think it would fit the other CRs.

We are also planning to move towards finalizers in the other CRs, which should remove the need for some of the existing statuses. It's probably a good idea to do that at the same time, but I understand that it will make this refactor a lot bigger.

So please come with a suggestion for how you think it could be done, and we will try to provide feedback as quickly as possible. The maintainers also hold weekly bug scrub meetings; if you want to join and talk about this refactoring, you are free to do that.
You can just ping me on the Kubernetes/Grafana Slack and I will send you the info.

Btw, I think it sounds like a great idea to have a healthcheck for this in the upstream Argo repo.

zelig81 commented Mar 10, 2024

Let's continue the conversation on Slack.
