
[Bug] failed to update status with content for dashboard #789

Closed
ctrought opened this issue Jul 13, 2022 · 8 comments
Labels
bug Something isn't working needs triage Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

ctrought commented Jul 13, 2022

Describe the bug
Version 4.5 of the operator adds a status field to the Dashboard CRD. For large dashboards, duplicating spec.json into the status as well can exceed etcd's maximum object size, causing the update to fail. I'm not too familiar with the operator, or whether the same status content field is set for dashboards configured via URL, but I assume the error could be hit whenever a dashboard's contents are also stored in the status field and the resulting object exceeds etcd's maximum size, for any of the supported ways of configuring dashboards.

After updating the operator, this caused excessive load on the Kubernetes API server; the largest contributor to apiserver_watch_events_sizes_sum in the cluster was grafanadashboards.integreatly.org.

Grafana operator log

2022-07-12T17:52:41.576Z	ERROR	dashboard-mysql	failed to request dashboard url, falling back to config map; if specified	{"error": "failed to update status with content for dashboard namespace/mysql: rpc error: code = ResourceExhausted desc = trying to send message larger than max (2461912 vs. 2097152)"}

Version
4.5.0

To Reproduce
Steps to reproduce the behavior:

  1. Create a large dashboard (> 1 MiB, but below etcd's maximum object size) and embed it in the CR's json field.
  2. The Grafana operator will be unable to update the status field if doing so causes the object to exceed the maximum size.
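The arithmetic behind the failure can be sketched quickly (a hypothetical illustration; the 2 MiB figure is the gRPC limit from the error message above, and the dashboard payload is made up):

```python
import json

# Hypothetical large dashboard (~1.3 MiB of JSON), standing in for spec.json.
dashboard_json = json.dumps(
    {"panels": [{"title": f"panel-{i}", "query": "x" * 100} for i in range(10_000)]}
)

spec_bytes = len(dashboard_json.encode())
status_bytes = spec_bytes  # v4.5.0 copied the same content into status
grpc_max = 2 * 1024 * 1024  # 2097152, the limit in the error message

# The spec alone fits, but spec + status copy crosses the limit.
print(spec_bytes, spec_bytes + status_bytes > grpc_max)
```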

Expected behavior
The fact that the update fails due to etcd's object size limit (that's my assumption, anyway) is not really the issue. It's more that the failure caused excessive load on the cluster's API server and etcd, which caused instability on our control plane. If there were a way to handle the error more cleanly, that would be ideal; perhaps even suggest a solution so users are aware of why it's not working properly.

If using gzipJson is the suggested solution, the operator could propagate the message in the log or generate an event, and stop attempting to update the dashboard. I'm not sure if there is a way to catch it before the user applies it; perhaps an admission webhook that checks the size?
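The webhook idea could amount to a simple pre-apply size check, roughly like this (a hypothetical sketch, not operator code; the 2 MiB threshold and the function name are assumptions):

```python
import json

# Assumed threshold: the etcd/gRPC message limit seen in the error log (2 MiB).
MAX_OBJECT_BYTES = 2 * 1024 * 1024

def dashboard_too_large(cr: dict, duplicates_to_status: bool = True) -> bool:
    """Rough admission-style check: would this GrafanaDashboard CR, plus a
    status-field copy of its content, exceed the assumed etcd object limit?"""
    content = cr.get("spec", {}).get("json", "")
    base = len(json.dumps(cr).encode())
    projected = base + (len(content.encode()) if duplicates_to_status else 0)
    return projected > MAX_OBJECT_BYTES

# A ~1.5 MB inline dashboard fits on its own, but not once doubled into status.
cr = {"apiVersion": "integreatly.org/v1alpha1", "kind": "GrafanaDashboard",
      "spec": {"json": "x" * 1_500_000}}
print(dashboard_too_large(cr))
```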

Suspect component / location where the bug might be occurring:
The status field was introduced in this PR:

#689

Screenshots
[Screenshot: apiserver_watch_events_sizes_sum spiking after the upgrade]

Runtime (please complete the following information):

  • OS: CoreOS
  • Grafana Operator Version: 4.5.0
  • Environment: OpenShift 4.10
  • Deployment type: OLM
@ctrought ctrought added bug Something isn't working needs triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 13, 2022
weisdd (Collaborator) commented Jul 15, 2022

I think we should be careful with what we store in the status field, because etcd has limits on its storage size: https://etcd.io/docs/v3.3/dev-guide/limit/
@addreas perhaps this is of interest to you.

addreas (Contributor) commented Jul 15, 2022

If the content is already in the spec (either json, gzipJson, or jsonnet), it shouldn't be duplicated in the status as well. The status content is only intended as a cache for fetched dashboards. Luckily, it should be an easy fix.

Don't quite understand why the error would cause excessive load on the API server. Is the operator retrying a lot?

As for fetched dashboards, my first thought is to just gzip the content in the status. Dashboard definitions should compress pretty nicely, right?
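For what it's worth, the compression intuition is easy to check (a sketch with a made-up but realistically repetitive dashboard; the exact ratio depends on the payload):

```python
import gzip
import json

# Hypothetical dashboard: repetitive panel definitions, as in real dashboards.
dashboard = {"panels": [
    {"type": "graph", "datasource": "Prometheus",
     "targets": [{"expr": f'rate(http_requests_total{{job="job-{i}"}}[5m])'}]}
    for i in range(2000)
]}

raw = json.dumps(dashboard).encode()
compressed = gzip.compress(raw)
# Repetitive JSON structure compresses several-fold with gzip.
print(len(raw), len(compressed))
```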

ctrought (Author) commented:

> Don't quite understand why the error would cause excessive load on the API server. Is the operator retrying a lot?

I believe so; the log was filled with errors relating to updating the status. The CPU usage of the operator pod spiked quite high after the upgrade, ~0.5 cores, while it ran at 0.02 prior to the update. It could likely be reproduced by configuring a dashboard via url that exceeds the cluster's etcd size limit.

> If the content is already in the spec (either json, gzipJson, or jsonnet) it shouldn't be duplicated in the status as well. The status content is only intended as a cache for fetched dashboards. Luckily should be an easy fix.

I had another look at the original dashboard that was failing to update and found the user had set both json and url in the GrafanaDashboard CR, which would explain why the content was duplicated. Had they only configured it via url, it should have fit into the one CR. I guess the point remains that if the remote dashboard exceeded etcd's max size, the same issue could be hit, but the gzip solution you proposed sounds like it would solve that.

addreas (Contributor) commented Jul 15, 2022

I've had a quick look through and created a preliminary PR in #790. If you want to give it a spin, there's an image here: ghcr.io/addreas/grafana-operator:v4.5.0-status-content-gzip. Just a warning: that PR currently deletes any existing spec.json if there is a spec.url or spec.grafanaCom, which I'm not sure is completely sane. It does keep the CR smaller, though.

Haven't investigated the error-retry CPU usage part yet, though that PR should keep the error from occurring in the first place.

weisdd (Collaborator) commented Jul 16, 2022

@addreas The docs say the following:

> url: Url address to download a json or jsonnet string with the dashboard contents.
> Warning: If both url and json are specified then the json field will be updated with fetched.
> The dashboard fetch priority by parameter is: url > configmap > json > jsonnet.

https://github.com/grafana-operator/grafana-operator/blob/master/documentation/dashboards.md#dashboard-properties
On the one hand, it looks like the contents were not supposed to land in the status field in the first place. On the other hand, rewriting the spec could potentially lead to state drift (I haven't experimented with it yet, just a thought experiment).
Whatever decision is made in the end, the docs have to be in line with it (= updated if needed) :)

weisdd (Collaborator) commented Jul 23, 2022

@ctrought Please, let us know if you're happy with the fix. It's included in v4.5.1.

ctrought (Author) commented:

> @ctrought Please, let us know if you're happy with the fix. It's included in v4.5.1.

The proposal sounds good to me!

pb82 (Collaborator) commented Jul 26, 2022

Fixed with #790.

@pb82 pb82 closed this as completed Jul 26, 2022