Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Graceful handling of data from Redis that fails to be parsed #607

Open
julianbrost opened this issue Jun 21, 2023 · 4 comments
Open

Graceful handling of data from Redis that fails to be parsed #607

julianbrost opened this issue Jun 21, 2023 · 4 comments

Comments

@julianbrost
Copy link
Contributor

At the moment, Icinga DB completely relies on Icinga 2 that there is no data present in Redis it can't process. If this assumption is violated, Icinga DB simply crashes at the moment. There already were a few instances where such bugs were reported, for example:

Not writing data that the receiver does not understand at all is a good idea, but in order to make Icinga DB more robust, we should implement more graceful/fine-grained error handling in Icinga DB as well.

Task

Go through all code that reads from Redis and the errors that can occur, especially where data is parsed. Check if there is a more graceful way of handling the error than just bailing out. For example, for the object sync, an individual object that fails to be parsed could just be skipped. Summarize your findings in a comment to this issue and propose changes where and how things could be handled better.

@Al2Klimov
Copy link
Member

propose changes where and how things could be handled better.

  • a health check bit in the DB telling the user to look into logs
  • a new sublogger (grep-able 👍) triggering the latter, probably even named health, for the following
  • (overdue: nothing to do here)
  • config+runtime: can't decode? – ignore it. (I guess. Object( update)s would disappear.)
  • history: here we can't do the above (ex. for state+notification) as an end event upserts its start event. But here we have History(Table)Meta we can fall back to. This gives us the object identification we can blacklist in memory for future successfully decoded events. However, ignored stuff is kept in Redis!

OK?

@Al2Klimov
Copy link
Member

Almost forgot: to be honest, I'd prefer not to do anything at all here. As Eric said, our main problem is string/uint discrepancy which we fixed in 2.14/2.13.8, not individual values being provocatively out of range. E.g. if you've set you check timeout to 2^64, well, please set it back to an actual value. Or, if you've got an actual problem, please report it.

A crash is a good motivator for the latter and a reasonable one if(!) 2.14/2.13.8 is enough which it is IMAO. In contrast, a health check is easy to ignore.

@julianbrost
Copy link
Contributor Author

  • a health check bit in the DB telling the user to look into logs
  • a new sublogger (grep-able +1) triggering the latter, probably even named health, for the following

These two points are for #608.

  • (overdue: nothing to do here)

Why? I mean this probably only takes very few inputs so likely not our biggest concern.

  • config+runtime: can't decode? – ignore it. (I guess. Object( update)s would disappear.)

For the initial sync (i.e. the one computing the delta first), would this leave the object as-is in the database or delete it (that might happen if you treat it like it wouldn't exist in Redis)?

  • history: here we can't do the above (ex. for state+notification) as an end event upserts its start event. But here we have History(Table)Meta we can fall back to. This gives us the object identification we can blacklist in memory for future successfully decoded events. However, ignored stuff is kept in Redis!

And if even that part wouldn't be decodable, it would stop the sync for the entire type? Or discard it and go on as there probably isn't hope in fixing it either?

Almost forgot: to be honest, I'd prefer not to do anything at all here. As Eric said, our main problem is string/uint discrepancy which we fixed in 2.14/2.13.8, not individual values being provocatively out of range. E.g. if you've set you check timeout to 2^64, well, please set it back to an actual value. Or, if you've got an actual problem, please report it.

I stick to "if it crashes, it's a bug", we can argue about the priority. It just doesn't look good if you can (accidentally or on purpose) input a value and this knocks out significant parts of your monitoring.

A crash is a good motivator for the latter and a reasonable one if(!) 2.14/2.13.8 is enough which it is IMAO. In contrast, a health check is easy to ignore.

What you call a good motivator potentially stops people from working with Icinga Web. If anything, that should be an argument for not having a "hide warning entirely" button.

@Al2Klimov
Copy link
Member

  • (overdue: nothing to do here)

Why? I mean this probably only takes very few inputs so likely not our biggest concern.

It even is not a concern at all. It doesn't suffer from such problems at all. By design. So there's nothing to do here.

  • config+runtime: can't decode? – ignore it. (I guess. Object( update)s would disappear.)

For the initial sync (i.e. the one computing the delta first), would this leave the object as-is in the database or delete it (that might happen if you treat it like it wouldn't exist in Redis)?

I'd do the latter.

  • more motivating to fix the error
  • less confusing than it's there, but not up to date
  • history: here we can't do the above (ex. for state+notification) as an end event upserts its start event. But here we have History(Table)Meta we can fall back to. This gives us the object identification we can blacklist in memory for future successfully decoded events. However, ignored stuff is kept in Redis!

And if even that part wouldn't be decodable, it would stop the sync for the entire type? Or discard it and go on as there probably isn't hope in fixing it either?

Oh. 😅 I'd stop the sync for the entire type. This -if occurs- is a serious problem.

If anything, that should be an argument for not having a "hide warning entirely" button.

@nilmerg Can we do this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants