Add downstream dependency service name to logs and errors to improve alert insights #183215

sorenlouv · 2024-05-12T12:54:41Z

Changes

Exclude APM error docs from logs and retrieve APM errors separately
Get a sample trace.id from logs and APM errors, and retrieve the downstream service name (if available)
Minor prompt tweaks
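
The first two changes can be illustrated with a minimal sketch. It is not the code from this PR: `buildLogsQuery` and `pickSampleTraceId` are hypothetical helpers, and the sketch assumes that APM error documents can be recognised by `processor.event: "error"` and that documents expose the trace id as `trace.id`.

```typescript
interface TimeRange {
  start: number; // epoch millis
  end: number; // epoch millis
}

// Hypothetical helper: builds an Elasticsearch bool query for log documents
// in a time range while excluding APM error documents.
function buildLogsQuery({ start, end }: TimeRange) {
  return {
    bool: {
      filter: [
        { range: { '@timestamp': { gte: start, lte: end, format: 'epoch_millis' } } },
      ],
      // Assumption: APM error docs carry `processor.event: "error"`.
      must_not: [{ term: { 'processor.event': 'error' } }],
    },
  };
}

// Hypothetical helper: returns the first trace id found in a list of
// documents, or undefined when none of them has one.
function pickSampleTraceId(
  docs: Array<{ trace?: { id?: string } }>
): string | undefined {
  return docs.find((doc) => doc.trace?.id)?.trace?.id;
}
```

The sample trace id returned by `pickSampleTraceId` would then be the input for looking up the downstream service name.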

Scenario

When running the Otel-Demo, the "checkout" service is killed on purpose. This causes the failure rate of the frontend service to increase because it has a downstream dependency on the checkout service. This in turn causes alerts to be triggered.

When the user navigates to the alerts details page and opens the insights, they should be presented with the "checkout" service as the root cause.

Before

Before this change, the alert insights did not capture that changes to the checkout service were the root cause.

After

apmmachine · 2024-05-12T12:54:54Z

🤖 GitHub comments


Just comment with:

/oblt-deploy : Deploy a Kibana instance using the Observability test environments.
run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)

sorenlouv · 2024-05-12T13:22:26Z

.../server/routes/assistant_functions/get_observability_alert_details_context/get_apm_errors.ts

+    }
+
+    const downstreamServiceResource = await getDownstreamServiceResource({
+      traceId: errorGroup.traceId,


The downstream service name is resolved via a single sample trace id. The error could have multiple failed downstream dependencies. Ideally we'd get every downstream dependency for the given error, but I'm not sure how to do that in a performant manner.
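
One possible direction, sketched here under stated assumptions and not taken from the PR: sample a handful of trace ids per error group instead of one, and collect the distinct destinations of failed outgoing calls. The `ExitSpan` shape and `getFailedDownstreamResources` are hypothetical; the sketch works on spans already loaded in memory and assumes each one carries the calling service, a destination resource (for example from `span.destination.service.resource`), and an outcome.

```typescript
// Hypothetical, simplified view of an outgoing (exit) span.
interface ExitSpan {
  traceId: string;
  serviceName: string; // service that made the outgoing call
  destinationResource: string; // downstream dependency that was called
  outcome: 'success' | 'failure' | 'unknown';
}

// Collects the distinct downstream resources that failed for a service
// across several sample traces, rather than a single one.
function getFailedDownstreamResources(
  spans: ExitSpan[],
  serviceName: string,
  sampleTraceIds: string[]
): string[] {
  const traceIds = new Set(sampleTraceIds);
  const resources = new Set<string>();

  for (const span of spans) {
    const isSampled = traceIds.has(span.traceId);
    const isFromService = span.serviceName === serviceName;
    if (isSampled && isFromService && span.outcome === 'failure') {
      resources.add(span.destinationResource);
    }
  }

  // Sorted so the result is stable regardless of span order.
  return Array.from(resources).sort();
}
```

Capping the number of sample trace ids keeps the lookup bounded, at the cost of possibly missing dependencies that only fail in unsampled traces.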

sorenlouv · 2024-05-12T14:58:36Z

buildkite test this

elasticmachine · 2024-05-13T08:26:52Z

Pinging @elastic/obs-ux-infra_services-team (Team:obs-ux-infra_services)

elasticmachine · 2024-05-13T08:26:52Z

Pinging @elastic/obs-ux-management-team (Team:obs-ux-management)

crespocarlos

Code LGTM.

kibana-ci · 2024-05-13T12:55:43Z

💚 Build Succeeded

Buildkite Build
Commit: cb57b71
Kibana Serverless Image: docker.elastic.co/kibana-ci/kibana-serverless:pr-183215-cb57b71e4919
Observability Deployment

Metrics [docs]

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id	before	after	diff
`observability`	286.0KB	286.3KB	+268.0B

Canvas Sharable Runtime

The Canvas "shareable runtime" is a bundle produced to enable running Canvas workpads outside of Kibana. This bundle is included in third-party webpages that embed Canvas and therefore should be as slim as possible.

id	before	after	diff
`module count`	-	5407	+5407
`total size`	-	8.8MB	+8.8MB

History

💚 Build #209419 succeeded eab4222
💛 Build #209364 was flaky 4fe52c2

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

Add downstream dependency service name to logs and errors

4fe52c2

sorenlouv mentioned this pull request May 12, 2024

Observability should more clearly indicate to the user when an outage in a service is the root cause #183216

Open

sorenlouv commented May 12, 2024


sorenlouv marked this pull request as ready for review May 13, 2024 07:02

sorenlouv requested review from a team as code owners May 13, 2024 07:02

Merge branch 'main' into improve-contextual-alert-insights

eab4222

botelastic bot added the ci:project-deploy-observability, Team:obs-ux-infra_services, and Team:obs-ux-management labels May 13, 2024

sorenlouv added the v8.15.0 and release_note:enhancement labels May 13, 2024

Merge branch 'main' into improve-contextual-alert-insights

76602aa

shahzad31 approved these changes May 13, 2024


Remove console.log

8db9b30

crespocarlos approved these changes May 13, 2024


sorenlouv added 2 commits May 13, 2024 13:47

declare errorCategory

17a1fc2

Re-write clause for matching search query

cb57b71

sorenlouv merged commit 0fda9c4 into elastic:main May 13, 2024
21 checks passed

kibanamachine added the backport:skip label May 13, 2024

sorenlouv deleted the improve-contextual-alert-insights branch May 13, 2024 13:23
