Relaxing Akka.Persistence.HealthCheck #278

Open
Aaronontheweb opened this issue May 14, 2024 · 0 comments

From some conversations on Discord - it seems like this probe might be a bit too aggressive:

Hi @everyone, I have a small question about the Persistence healthchecks. I think they have changed and are now cleaning up their snapshots (https://github.com/petabridge/akkadotnet-healthcheck). That cleanup sometimes fails with a 404, and at the same time the probe seems to fail; I'm unsure whether that is because the delete failed or because the creation failed, but it brings down the container running Akka.NET. Is there a way to add fault tolerance for this? Because if I add fault tolerance to the container healthchecks, all healthchecks will get that extra tolerance, which might not be wanted.
Aaronontheweb — 04/24/2024 8:27 AM
cc @Arkatufus - we just made a bunch of bug fixes to these because they were throwing off false positives at startup @kupo1309
do you have a lot of load at startup or something, @kupo1309? Or does this probe just fail eventually, later on?
kupo1309 — 04/24/2024 8:43 AM
no, this is after days/weeks of running, it seems
so my guess is that it is in fact a transient issue in Azure Storage
we are using 1.5.18
Arkatufus — 04/24/2024 9:13 AM
@kupo1309 you can add a layer of resiliency on top of it, like requiring it to fail two or three times in a row before being killed?
kupo1309 — 04/24/2024 9:17 AM
it's 3 by default indeed
I am upping it to 10, but indeed it seems like it failed a few times, then got disassociated, and then got restarted.

TL;DR - we probably need this probe to fail several consecutive times before we mark the node as unhealthy. Failing at the first sign of trouble compounds the problems that busy systems are already having.
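A minimal sketch of what that could look like, as a wrapper around the standard `IHealthCheck` abstraction from `Microsoft.Extensions.Diagnostics.HealthChecks` that ASP.NET Core health endpoints consume. The `ConsecutiveFailureHealthCheck` type and its `threshold` parameter are hypothetical, not part of this library today:

```csharp
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical decorator: only surfaces the inner probe's failure once it has
// failed `threshold` times in a row; isolated transient failures (e.g. a
// one-off 404 from Azure Storage) are reported as Degraded instead.
public sealed class ConsecutiveFailureHealthCheck : IHealthCheck
{
    private readonly IHealthCheck _inner;
    private readonly int _threshold;
    private int _consecutiveFailures;

    public ConsecutiveFailureHealthCheck(IHealthCheck inner, int threshold = 3)
    {
        _inner = inner;
        _threshold = threshold;
    }

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        var result = await _inner.CheckHealthAsync(context, cancellationToken);
        if (result.Status == HealthStatus.Healthy)
        {
            // A single success resets the failure streak.
            Interlocked.Exchange(ref _consecutiveFailures, 0);
            return result;
        }

        var failures = Interlocked.Increment(ref _consecutiveFailures);
        return failures >= _threshold
            ? result // persistent failure: report the real unhealthy result
            : HealthCheckResult.Degraded(
                $"Probe failed {failures}/{_threshold} consecutive times; treating as transient.");
    }
}
```

Registered via something like `services.AddHealthChecks().AddCheck("akka-persistence", new ConsecutiveFailureHealthCheck(innerProbe, threshold: 3))`, this would keep one transient storage blip from taking down the container while still failing fast once the probe is genuinely stuck.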
