Add downstream dependency service name to logs and errors to improve alert insights #183215
}

const downstreamServiceResource = await getDownstreamServiceResource({
  traceId: errorGroup.traceId,
The downstream service name is resolved from a single sample trace ID, but the error could have multiple failed downstream dependencies. Ideally we'd get every downstream dependency for the given error. I'm not sure how to do that in a performant manner.
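One possible direction for the concern above, as a minimal in-memory sketch rather than the PR's implementation: instead of resolving one trace ID, look at the failed spans across several sample trace IDs and collect every distinct downstream resource. The `SampleSpan` shape, its field names, and `getFailedDownstreamResources` are hypothetical and exist only for illustration; a real version would presumably be a single aggregation query, which is not shown here and is where the performance question would actually be decided.

```typescript
// Hypothetical, simplified span shape. The field names are illustrative
// and are not taken from the PR.
interface SampleSpan {
  traceId: string;
  downstreamServiceResource?: string; // e.g. "checkout"
  outcome: 'success' | 'failure' | 'unknown';
}

// Collect the distinct downstream resources that failed in any of the
// sampled traces, ordered by how many failed spans point at each one.
function getFailedDownstreamResources(
  spans: SampleSpan[],
  sampleTraceIds: string[]
): string[] {
  const sampled = new Set(sampleTraceIds);
  const counts = new Map<string, number>();

  for (const span of spans) {
    // Ignore spans outside the sampled traces.
    if (!sampled.has(span.traceId)) continue;
    // Only failed calls to a known downstream resource are of interest.
    if (span.outcome !== 'failure' || !span.downstreamServiceResource) continue;
    counts.set(
      span.downstreamServiceResource,
      (counts.get(span.downstreamServiceResource) ?? 0) + 1
    );
  }

  // Most frequently failing dependency first.
  return Array.from(counts.entries())
    .sort((a, b) => b[1] - a[1])
    .map(([resource]) => resource);
}
```

Returning a ranked list rather than a single name would let the alert insights mention more than one failing dependency, at the cost of scanning more than one trace.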
buildkite test this
Pinging @elastic/obs-ux-infra_services-team (Team:obs-ux-infra_services)
Pinging @elastic/obs-ux-management-team (Team:obs-ux-management)
Code LGTM.
💚 Build Succeeded
Related: #183216
Changes
Get trace.id from logs and APM errors, and retrieve the downstream service name (if available)
Scenario
When running the Otel-Demo, the "checkout" service is killed on purpose. This causes the failure rate of the frontend service to increase, because it has a downstream dependency on the checkout service. This in turn causes alerts to be triggered.
When the user navigates to the alert details page and opens the insights, they should be presented with the "checkout" service as the root cause.
Before
Before this change, the alert insights did not capture that changes to the checkout service were the root cause.
After