Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Emit telemetry failure event on handler failure #98

Merged
merged 3 commits into from Oct 14, 2021

Conversation

bryannaegele
Copy link
Contributor

We've discussed the need for a mechanism by which users could react or monitor a handler failure which subsequently results in the handler being detached. Currently handler detachments are somewhat silent in that we only log an error message. For some handlers and use cases, this could be sufficient. However, if a user is highly dependent on an event for monitoring a key metric they likely need to be alerted of this failure through other monitoring infrastructure, e.g. metric events.

For example, a user is monitoring the size of open orders in their database using poller to emit periodic events with this number and they report this out using a gauge. If the handler were to fail and detach, the gauge would be "stuck" at the last reported value with no indication in monitoring that something is amiss. The only way the user could be alerted that this failed is to be using a logging service and setting an alert looking for patterns of handlers failing. This is particularly concerning with reliance on telemetry for tracing.

By emitting a failure event we are able to give users a mechanism by which they could monitor for failures using the infrastructure they are already using for monitoring/alerting. This also allows for potential error recovery attempts by advanced library authors to attempt some recovery strategy before abandoning completely, though this is strongly cautioned against.

@arkgil
Copy link
Contributor

arkgil commented Sep 29, 2021

@bryannaegele it might be a stupid question, but should we detach handlers of the failure event? If the failure handler fails, then we're back in the situation where only the log is emitted.

@bryannaegele
Copy link
Contributor Author

@arkgil They have to subscribe to the failure event, so hopefully they're more careful there 😆 But it's a good question. I think the original motivation of detaching failed handlers because they'll likely fail again should probably still apply here. What do you think?

The main motivation here is to give users/library authors a secondary way to monitor handlers failing without having to only do it in logging solutions like Datadog where they also have no way to potentially handle that failure in-app.

@josevalim
Copy link
Contributor

Yeah, I think we should detach them otherwise they can go into an infinite loop of failure too.

@arkgil
Copy link
Contributor

arkgil commented Sep 29, 2021

That makes sense. I like it, a nice addition! 👍

src/telemetry.erl Outdated Show resolved Hide resolved
src/telemetry.erl Outdated Show resolved Hide resolved
src/telemetry.erl Outdated Show resolved Hide resolved
src/telemetry.erl Show resolved Hide resolved
arkgil
arkgil previously approved these changes Oct 1, 2021
@bryannaegele
Copy link
Contributor Author

I think we're good?

@bryannaegele bryannaegele merged commit f627244 into main Oct 14, 2021
@bryannaegele bryannaegele deleted the failed-handler-event branch October 14, 2021 16:37
@hkrutzer
Copy link

hkrutzer commented Dec 2, 2021

Could a new version be released with this change? 🙏

@josevalim
Copy link
Contributor

We need to at least wait for #102 to be addressed.

@hkrutzer
Copy link

hkrutzer commented Dec 7, 2021

@josevalim cool, looks like that one is working 🙂

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants