
[Bug] failed to update status with content for dashboard #789

Closed
ctrought opened this issue Jul 13, 2022 · 8 comments
Labels
bug Something isn't working needs triage Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

ctrought commented Jul 13, 2022

Describe the bug
Version 4.5 of the operator adds a status field to the Dashboard CRD. For large dashboards, duplicating spec.json into the status as well can exceed etcd's maximum object size, causing the update to fail. I'm not too familiar with the operator, or whether the same status content field is set for dashboards configured via URL, but I assume the error could be hit whenever a dashboard's contents are also stored in the status field and the resulting object exceeds etcd's maximum size, for any of the supported ways of configuring dashboards.

After updating the operator, this caused excessive load on the Kubernetes API server; the largest contributor to apiserver_watch_events_sizes_sum in the cluster was grafanadashboards.integreatly.org.

Grafana operator log

2022-07-12T17:52:41.576Z	ERROR	dashboard-mysql	failed to request dashboard url, falling back to config map; if specified	{"error": "failed to update status with content for dashboard namespace/mysql: rpc error: code = ResourceExhausted desc = trying to send message larger than max (2461912 vs. 2097152)"}

Version
4.5.0

To Reproduce
Steps to reproduce the behavior:

  1. Create a large dashboard (> 1 MiB, but below etcd's maximum object size) and embed it in the CR's json field.
  2. The Grafana operator will be unable to update the status field if doing so causes the object to exceed the maximum size.
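The arithmetic behind the failure can be sketched quickly (a hypothetical illustration; the 2 MiB figure is the gRPC limit from the error message above, and the dashboard payload is made up):

```python
import json

# Hypothetical large dashboard (~1.3 MiB of JSON), standing in for spec.json.
dashboard_json = json.dumps(
    {"panels": [{"title": f"panel-{i}", "query": "x" * 100} for i in range(10_000)]}
)

spec_bytes = len(dashboard_json.encode())
status_bytes = spec_bytes  # v4.5.0 copied the same content into status
grpc_max = 2 * 1024 * 1024  # 2097152, the limit in the error message

# The spec alone fits, but spec + status copy crosses the limit.
print(spec_bytes, spec_bytes + status_bytes > grpc_max)
```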

Expected behavior
The fact that the update fails due to etcd's object size limit (that's my assumption, anyway) is not really the issue. It's more that the failure caused excessive load on the cluster's API server and etcd, which caused instability on our control plane. If there were a way to handle the error more cleanly, that would be ideal; perhaps even suggest a solution so users are aware of why it's not working properly.

If using gzipJson is the suggested solution, the operator could propagate the message in the log or generate an event, and stop attempting to update the dashboard. I'm not sure if there is a way to catch it before the user applies it; perhaps an admission webhook that checks the size?
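The webhook idea could amount to a simple pre-apply size check, roughly like this (a hypothetical sketch, not operator code; the 2 MiB threshold and the function name are assumptions):

```python
import json

# Assumed threshold: the etcd/gRPC message limit seen in the error log (2 MiB).
MAX_OBJECT_BYTES = 2 * 1024 * 1024

def dashboard_too_large(cr: dict, duplicates_to_status: bool = True) -> bool:
    """Rough admission-style check: would this GrafanaDashboard CR, plus a
    status-field copy of its content, exceed the assumed etcd object limit?"""
    content = cr.get("spec", {}).get("json", "")
    base = len(json.dumps(cr).encode())
    projected = base + (len(content.encode()) if duplicates_to_status else 0)
    return projected > MAX_OBJECT_BYTES

# A ~1.5 MB inline dashboard fits on its own, but not once doubled into status.
cr = {"apiVersion": "integreatly.org/v1alpha1", "kind": "GrafanaDashboard",
      "spec": {"json": "x" * 1_500_000}}
print(dashboard_too_large(cr))
```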

Suspect component / location where the bug might be occurring:
The status field was introduced in this PR:

#689

Screenshots
[Screenshot: apiserver_watch_events_sizes_sum spiking after the upgrade]

Runtime (please complete the following information):

  • OS: CoreOS
  • Grafana Operator Version: 4.5.0
  • Environment: OpenShift 4.10
  • Deployment type: OLM
@ctrought ctrought added bug Something isn't working needs triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 13, 2022
weisdd (Collaborator) commented Jul 15, 2022

I think we should be careful with what we store in the status field, because etcd has limits on its storage size: https://etcd.io/docs/v3.3/dev-guide/limit/
@addreas perhaps this is of interest to you.

addreas (Contributor) commented Jul 15, 2022

If the content is already in the spec (either json, gzipJson, or jsonnet), it shouldn't be duplicated in the status as well. The status content is only intended as a cache for fetched dashboards. Luckily, it should be an easy fix.

Don't quite understand why the error would cause excessive load on the API server. Is the operator retrying a lot?

As for fetched dashboards, my first thought is to just gzip the content in the status. Dashboard definitions should compress pretty nicely, right?
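For what it's worth, the compression intuition is easy to check (a sketch with a made-up but realistically repetitive dashboard; the exact ratio depends on the payload):

```python
import gzip
import json

# Hypothetical dashboard: repetitive panel definitions, as in real dashboards.
dashboard = {"panels": [
    {"type": "graph", "datasource": "Prometheus",
     "targets": [{"expr": f'rate(http_requests_total{{job="job-{i}"}}[5m])'}]}
    for i in range(2000)
]}

raw = json.dumps(dashboard).encode()
compressed = gzip.compress(raw)
# Repetitive JSON structure compresses several-fold with gzip.
print(len(raw), len(compressed))
```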

ctrought (Author) commented:

> Don't quite understand why the error would cause excessive load on the API server. Is the operator retrying a lot?

I believe so; the log was filled with errors relating to updating the status. The CPU usage of the operator pod spiked quite high after the upgrade, ~0.5 cores, while it ran at 0.02 prior to the update. It could likely be reproduced by configuring a dashboard via url that exceeds the cluster's etcd size limit.

> If the content is already in the spec (either json, gzipJson, or jsonnet) it shouldn't be duplicated in the status as well. The status content is only intended as a cache for fetched dashboards. Luckily should be an easy fix.

I had another look at the original dashboard that was failing to update and found the user had set both json and url in the GrafanaDashboard CR, which would explain why the content was duplicated. Had they only configured it via url, it should have fit into the one CR. I guess the point remains that if the remote dashboard exceeded etcd's max size, the same issue could be hit, but the gzip solution you proposed sounds like it would solve that.

addreas (Contributor) commented Jul 15, 2022

I've had a quick look through and created a preliminary PR in #790. If you want to give it a spin, there's an image here: ghcr.io/addreas/grafana-operator:v4.5.0-status-content-gzip. Just a warning: that PR currently deletes any existing spec.json if there is a spec.url or spec.grafanaCom, which I'm not sure is completely sane. It does keep the CR smaller, though.

Haven't investigated the error-retry CPU usage part yet, though that PR should keep the error from occurring in the first place.

weisdd (Collaborator) commented Jul 16, 2022

@addreas The docs say the following:

> url: Url address to download a json or jsonnet string with the dashboard contents.
> Warning: If both url and json are specified then the json field will be updated with fetched.
> The dashboard fetch priority by parameter is: url > configmap > json > jsonnet.

https://github.com/grafana-operator/grafana-operator/blob/master/documentation/dashboards.md#dashboard-properties
On the one hand, it looks like the contents were not supposed to land in the status field in the first place. On the other hand, rewriting the spec could potentially lead to state drift (I haven't experimented with it yet, just a thought experiment).
Whatever decision is made in the end, the docs have to be in line with it (= updated if needed) :)

weisdd (Collaborator) commented Jul 23, 2022

@ctrought Please, let us know if you're happy with the fix. It's included in v4.5.1.

ctrought (Author) commented:

> @ctrought Please, let us know if you're happy with the fix. It's included in v4.5.1.

The proposal sounds good to me!

pb82 (Collaborator) commented Jul 26, 2022

Fixed with #790.

@pb82 pb82 closed this as completed Jul 26, 2022