Dead lettering tests #101524

Open
carlossanlop opened this issue Apr 25, 2024 · 11 comments
Labels: area-Infrastructure, Known Build Error, untriaged

Comments

@carlossanlop
Member

carlossanlop commented Apr 25, 2024

Build Information

Build: https://dev.azure.com/dnceng-public/public/public%20Team/_build/results?buildId=654910
Build error leg or test failing: browser-wasm windows Release LibraryTests_Smoke_AOT

Error Message

{
  "ErrorMessage" : "If you’re reading this, that means the Helix work item you’re trying to find the logs for has dead-lettered.",
  "BuildRetry" : false,
  "ExcludeConsoleLog" : false
}
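As a rough illustration of how a matcher config like the JSON above could be applied, the sketch below treats `ErrorMessage` as a plain substring to look for in a log. This is an assumption about the matching behavior, not a description of the actual Build Analysis tooling, and `matches_known_error` is a hypothetical helper name.

```python
import json

# The matcher config, copied from the JSON block in this issue.
KNOWN_ERROR = json.loads("""
{
  "ErrorMessage": "If you’re reading this, that means the Helix work item you’re trying to find the logs for has dead-lettered.",
  "BuildRetry": false,
  "ExcludeConsoleLog": false
}
""")


def matches_known_error(log_text: str, config: dict) -> bool:
    """Return True if the log contains the configured message verbatim.

    Assumption: a plain, case-sensitive substring test. The real matcher
    may normalize text differently.
    """
    return config["ErrorMessage"] in log_text
```

Note that the message uses curly apostrophes (’), so under this assumption a log line typed with straight apostrophes would not match.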
If you’re reading this, that means the Helix work item you’re trying to find the logs for has dead-lettered.

What this means:

- All attempts to retry execution of this work item were unable to complete.  This can be both for infrastructure reasons (problems within Azure) or issues with the work item (for instance, causing a machine to reboot unexpectedly or killing the Helix client on the machine will force a retry).
- No further work will be done for this specific work item, and its exit code is set to an artificial -1 (since it did not complete, there is no real exit code).

Common causes:

- Disabled queue (end-of-life Helix queues are automatically forwarded to deadletter and will fail instantly)
- Unhealthy Helix Client machine(s)
- Queue has been backed up heavily by a large amount of work and was manually purged by the engineering team
- Azure issues (e.g. Service Bus is overloaded)
- Malformed payloads; if Helix cannot download and unzip all payloads successfully, work will retry until dead-lettered.

For follow up:

- Check if your Helix Queue is still enabled, either via the metadata you see by browsing to https://helix.dot.net/api/info/queues?api-version=2019-06-17 or recent emails from the .NET Engineering Infrastructure team.
- Check that all work item payloads are accessible using a browser.
- If you are sending to a non-disabled queue and find this error repeatedly occurring, please contact the dnceng team.
- If a single, specific work item dead letters and others do not, consider local debugging; it may be causing spontaneous reboot (or triggering one intentionally).
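The first follow-up step (checking whether a queue is still listed) can be scripted. The sketch below is a minimal example under stated assumptions: the URL is the one quoted above, but the response shape (a JSON array of objects) and the `QueueId` field name are assumptions that should be checked against the live response. `fetch_queues` and `find_queue` are hypothetical helper names.

```python
import json
import urllib.request

# URL quoted in the dead-letter message above.
QUEUE_INFO_URL = "https://helix.dot.net/api/info/queues?api-version=2019-06-17"


def fetch_queues(url: str = QUEUE_INFO_URL) -> list:
    """Download and parse the queue metadata (requires network access)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def find_queue(queues: list, name: str, id_field: str = "QueueId"):
    """Return the record whose id matches `name` (case-insensitive), else None.

    Assumption: each record is a JSON object and the queue name lives under
    `id_field`; adjust it to whatever the live response actually uses.
    """
    wanted = name.lower()
    for record in queues:
        if str(record.get(id_field, "")).lower() == wanted:
            return record
    return None
```

For example, `find_queue(fetch_queues(), "windows.amd64.server2022.open.rt")` looks up the queue named in the console log further down this thread. A `None` result would suggest the queue is no longer listed, which is worth confirming with the dnceng team before drawing conclusions.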

Report

Build Definition Test Pull Request
687468 dotnet/runtime System.IO.FileSystem.Manual.Tests.WorkItemExecution
687452 dotnet/runtime System.IO.FileSystem.Primitives.Tests.WorkItemExecution
687464 dotnet/runtime System.Formats.Asn1.Tests.WorkItemExecution
687448 dotnet/runtime System.Diagnostics.TraceSource.Tests.WorkItemExecution
687450 dotnet/runtime System.Formats.Tar.Manual.Tests.WorkItemExecution
687445 dotnet/runtime System.Formats.Tar.Manual.Tests.WorkItemExecution
686902 dotnet/runtime System.Threading.Tasks.Parallel.Tests.WorkItemExecution
686898 dotnet/runtime Regression_3.WorkItemExecution
686906 dotnet/runtime System.Formats.Asn1.Tests.WorkItemExecution
686914 dotnet/runtime tvOS.Device.Aot.Test.WorkItemExecution
686579 dotnet/runtime System.IO.Tests.WorkItemExecution #102558
686411 dotnet/runtime System.Diagnostics.Process.Tests.WorkItemExecution
686474 dotnet/runtime System.IO.FileSystem.DriveInfo.Tests.WorkItemExecution
686466 dotnet/runtime System.IO.FileSystem.Watcher.Tests.WorkItemExecution
686486 dotnet/runtime System.Formats.Asn1.Tests.WorkItemExecution
686484 dotnet/runtime System.Formats.Tar.Manual.Tests.WorkItemExecution
686472 dotnet/runtime System.Formats.Tar.Manual.Tests.WorkItemExecution
686463 dotnet/runtime System.Formats.Tar.Manual.Tests.WorkItemExecution
685657 dotnet/runtime Regression_3.WorkItemExecution
685661 dotnet/runtime System.Diagnostics.DiagnosticSource.Switches.Tests.WorkItemExecution
685279 dotnet/runtime System.IO.FileSystem.Manual.Tests.WorkItemExecution
685264 dotnet/runtime System.IO.FileSystem.Tests.WorkItemExecution
685334 dotnet/runtime System.Xml.Linq.Streaming.Tests.WorkItemExecution #102488
685280 dotnet/runtime System.Diagnostics.TraceSource.Tests.WorkItemExecution
685262 dotnet/runtime System.Formats.Tar.Manual.Tests.WorkItemExecution
685265 dotnet/runtime System.Formats.Tar.Manual.Tests.WorkItemExecution
685259 dotnet/runtime System.Formats.Asn1.Tests.WorkItemExecution
685238 dotnet/runtime Invariant.Tests.WorkItemExecution #101701
684980 dotnet/runtime System.IO.Tests.WorkItemExecution #102509
684125 dotnet/runtime System.IO.Net5Compat.Tests.WorkItemExecution #102558
684032 dotnet/runtime System.IO.FileSystem.Watcher.Tests.WorkItemExecution
684145 dotnet/runtime System.Runtime.Tests.WorkItemExecution #102493
684031 dotnet/runtime System.IO.FileSystem.Primitives.Tests.WorkItemExecution
684042 dotnet/runtime System.Formats.Cbor.Tests.WorkItemExecution
684091 dotnet/runtime System.Runtime.Tests.WorkItemExecution #102488
684039 dotnet/runtime System.Formats.Tar.Manual.Tests.WorkItemExecution
684035 dotnet/runtime System.Formats.Cbor.Tests.WorkItemExecution
684028 dotnet/runtime System.Formats.Tar.Manual.Tests.WorkItemExecution
683662 dotnet/runtime Regression_3.WorkItemExecution
683666 dotnet/runtime Microsoft.Extensions.Options.Tests.WorkItemExecution
683262 dotnet/runtime System.Runtime.Tests.WorkItemExecution
682820 dotnet/runtime System.IO.Net5Compat.Tests.WorkItemExecution #102497
682549 dotnet/runtime System.Runtime.Tests.WorkItemExecution
682544 dotnet/runtime System.Runtime.Tests.WorkItemExecution
682562 dotnet/runtime System.Runtime.Tests.WorkItemExecution
682557 dotnet/runtime System.Runtime.Tests.WorkItemExecution
682548 dotnet/runtime System.IO.FileSystem.Watcher.Tests.WorkItemExecution
682556 dotnet/runtime System.IO.FileSystem.Watcher.Tests.WorkItemExecution
682572 dotnet/runtime System.Diagnostics.TraceSource.Tests.WorkItemExecution
682568 dotnet/runtime System.Formats.Asn1.Tests.WorkItemExecution
682559 dotnet/runtime System.Formats.Cbor.Tests.WorkItemExecution
682546 dotnet/runtime System.Formats.Tar.Manual.Tests.WorkItemExecution
682527 dotnet/runtime System.Runtime.Tests.WorkItemExecution #102482
682464 dotnet/runtime System.Runtime.Tests.WorkItemExecution #102417
682277 dotnet/runtime Regression_3.WorkItemExecution
682021 dotnet/runtime System.Threading.Tasks.Parallel.Tests.WorkItemExecution
681982 dotnet/runtime Regression_3.WorkItemExecution
681963 dotnet/runtime Regression_3.WorkItemExecution
682003 dotnet/runtime System.Globalization.CalendarsWithConfigSwitch.Tests.WorkItemExecution
681987 dotnet/runtime Microsoft.Extensions.Configuration.Json.Tests.WorkItemExecution
681751 dotnet/runtime System.Threading.Tasks.Parallel.Tests.WorkItemExecution
681981 dotnet/runtime System.Drawing.Common.Tests.WorkItemExecution
682033 dotnet/runtime System.Runtime.Tests.WorkItemExecution #102416
681941 dotnet/runtime System.Runtime.Tests.WorkItemExecution
681930 dotnet/runtime System.Runtime.Tests.WorkItemExecution #102408
681747 dotnet/runtime Regression_3.WorkItemExecution
681829 dotnet/runtime System.Globalization.CalendarsWithConfigSwitch.Tests.WorkItemExecution
681726 dotnet/runtime Microsoft.Extensions.Logging.Generators.Roslyn3.11.Tests.WorkItemExecution
681666 dotnet/runtime System.Formats.Tar.Tests.WorkItemExecution #101295
681237 dotnet/runtime System.Runtime.Tests.WorkItemExecution
681250 dotnet/runtime System.Runtime.Tests.WorkItemExecution
681239 dotnet/runtime System.Runtime.Tests.WorkItemExecution
681226 dotnet/runtime System.Runtime.Tests.WorkItemExecution
681235 dotnet/runtime System.IO.FileSystem.Primitives.Tests.WorkItemExecution
681236 dotnet/runtime System.IO.FileSystem.Manual.Tests.WorkItemExecution
681251 dotnet/runtime System.Formats.Tar.Manual.Tests.WorkItemExecution
681227 dotnet/runtime System.Diagnostics.Tracing.Tests.WorkItemExecution
681247 dotnet/runtime System.Dynamic.Runtime.Tests.WorkItemExecution
681240 dotnet/runtime System.Diagnostics.Tracing.Tests.WorkItemExecution
681136 dotnet/runtime System.Runtime.Tests.WorkItemExecution #97402
680766 dotnet/runtime System.Runtime.Tests.WorkItemExecution #102424
680686 dotnet/runtime System.Collections.Concurrent.Tests.WorkItemExecution #98127
680559 dotnet/runtime System.Runtime.Tests.WorkItemExecution
680561 dotnet/runtime System.Runtime.Tests.WorkItemExecution
680555 dotnet/runtime System.Runtime.Tests.WorkItemExecution
680535 dotnet/runtime System.Runtime.Tests.WorkItemExecution
680551 dotnet/runtime System.IO.FileSystem.Primitives.Tests.WorkItemExecution
680544 dotnet/runtime System.IO.FileSystem.Primitives.Tests.WorkItemExecution
680536 dotnet/runtime System.Diagnostics.Tracing.Tests.WorkItemExecution
680543 dotnet/runtime System.Diagnostics.TraceSource.Tests.WorkItemExecution
680539 dotnet/runtime System.Diagnostics.Tracing.Tests.WorkItemExecution
680540 dotnet/runtime System.Formats.Tar.Manual.Tests.WorkItemExecution
680387 dotnet/runtime System.IO.Tests.WorkItemExecution #102392
680209 dotnet/runtime System.Diagnostics.Tracing.Tests.WorkItemExecution
680356 dotnet/runtime System.Runtime.Tests.WorkItemExecution #102187
680180 dotnet/runtime System.Runtime.Tests.WorkItemExecution
680201 dotnet/runtime System.Runtime.Tests.WorkItemExecution
680186 dotnet/runtime System.Runtime.Tests.WorkItemExecution
680198 dotnet/runtime System.Runtime.Tests.WorkItemExecution
680189 dotnet/runtime System.IO.FileSystem.Net5Compat.Tests.WorkItemExecution
Displaying 100 of 535 results
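To see which work items dead-letter most often, the report rows can be tallied by test name. The sketch below assumes each row is whitespace-separated as shown in the report (build id, definition, test name, optional pull request number); `tally_tests` is a hypothetical helper, not part of any .NET tooling.

```python
from collections import Counter


def tally_tests(report_lines):
    """Count how often each test name appears in the report rows.

    Each row is expected to look like
    '<build id> <definition> <test name> [#<pull request>]'.
    Rows that do not start with a numeric build id (such as the
    header row) are skipped.
    """
    counts = Counter()
    for line in report_lines:
        fields = line.split()
        if len(fields) >= 3 and fields[0].isdigit():
            counts[fields[2]] += 1
    return counts
```

Calling `tally_tests(rows).most_common(5)` on the pasted rows would list the five most frequently dead-lettered work items. Keep in mind the report only displays 100 of 535 results, so any tally reflects the visible sample only.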

Summary

24-Hour Hit Count: 6
7-Day Hit Count: 92
1-Month Count: 535
carlossanlop added the arch-wasm, os-windows, wasm-aot-test, Known Build Error, and os-browser labels on Apr 25, 2024
dotnet-issue-labeler bot added the needs-area-label label on Apr 25, 2024
Contributor

Tagging subscribers to 'arch-wasm': @lewing
See info in area-owners.md if you want to be subscribed.

dotnet-policy-service bot added the untriaged label on Apr 25, 2024
@carlossanlop
Member Author

In the same PR's build where the above dead-letter failure was found, there's another run on a similar queue, but it is Release. It does not dead-letter immediately; instead it prints a couple of lines and then dies:

Console log: 'WasmTestOnChrome-System.Runtime.Tests' from job 2736decc-83a1-4c32-9e0a-ef543e0d26f3 (windows.amd64.server2022.open.rt) using docker image mcr.microsoft.com/dotnet-buildtools/prereqs:windowsservercore-ltsc2022-helix-webassembly on a000NF0
running %HELIX_CORRELATION_PAYLOAD%\scripts\be8b1ad5c1e9498d89709f26e508c549\execute.cmd in C:\h\w\B0E6099F\w\A1EF089A\e max 3600 seconds

^ It just dies after printing the second line.

I do not want to open a KnownBuildError issue for this specific failure as it would end up grouping anything. I am concerned that people are going to get blocked on getting their PRs merged because they cannot bypass the merge on green restriction. For example, this PR is already blocked and I will only be able to merge it if I JIT elevate myself: #101498
@JulieLeeMSFT @hoyosjs @jkoritzinsky

@jkotas
Member

jkotas commented Apr 25, 2024

> I am concerned that people are going to get blocked on getting their PRs merged because they cannot bypass the merge on green restriction. For example, this PR is already blocked and I will only be able to merge it if I JIT elevate myself:

People should be able to use https://github.com/dotnet/runtime/blob/main/docs/workflow/ci/failure-analysis.md#bypassing-build-analysis . Have you tried that before JIT elevating?

@carlossanlop
Member Author

> People should be able to use https://github.com/dotnet/runtime/blob/main/docs/workflow/ci/failure-analysis.md#bypassing-build-analysis . Have you tried that before JIT elevating?

I was not aware of that. Thanks for sharing! I'll try it next time.

@agocke
Member

agocke commented Apr 25, 2024

What does dead-lettering mean in this context? What is the case where this fails?

vcsjones removed the needs-area-label label on Apr 25, 2024
@lewing
Member

lewing commented Apr 25, 2024

iirc Dead Lettering is usually infrastructure related, is that correct @steveisok ?

We often see queues fall over around branch time

If you’re reading this, that means the Helix work item you’re trying to find the logs for has dead-lettered.

What this means:

- All attempts to retry execution of this work item were unable to complete.  This can be both for infrastructure reasons (problems within Azure) or issues with the work item (for instance, causing a machine to reboot unexpectedly or killing the Helix client on the machine will force a retry).
- No further work will be done for this specific work item, and its exit code is set to an artificial -1 (since it did not complete, there is no real exit code).

Common causes:

- Disabled queue (end-of-life Helix queues are automatically forwarded to deadletter and will fail instantly)
- Unhealthy Helix Client machine(s)
- Queue has been backed up heavily by a large amount of work and was manually purged by the engineering team
- Azure issues (e.g. Service Bus is overloaded)
- Malformed payloads; if Helix cannot download and unzip all payloads successfully, work will retry until dead-lettered.

For follow up:

- Check if your Helix Queue is still enabled, either via the metadata you see by browsing to https://helix.dot.net/api/info/queues?api-version=2019-06-17 or recent emails from the .NET Engineering Infrastructure team.
- Check that all work item payloads are accessible using a browser.
- If you are sending to a non-disabled queue and find this error repeatedly occurring, please contact the dnceng team.
- If a single, specific work item dead letters and others do not, consider local debugging; it may be causing spontaneous reboot (or triggering one intentionally).

@steveisok
Member

> iirc Dead Lettering is usually infrastructure related, is that correct @steveisok ?
>
> We often see queues fall over around branch time

Correct. @ilyas1974, is queue dead-lettering manually driven, or is there some automation involved?

Contributor

Tagging subscribers to this area: @dotnet/runtime-infrastructure
See info in area-owners.md if you want to be subscribed.

lewing changed the title from "Dead lettering in wasm smoke AOT tests" to "Dead lettering tests [disproportionately wasm]" on Apr 26, 2024
lewing removed the arch-wasm label on Apr 26, 2024
@ilyas1974

For your dead-lettering question, the answer is yes: it's both a manual and an automated process. We manually dead-letter a queue so the change is immediate. We then add the dead-letter information to the Helix configuration so that it persists whenever we make changes to Helix.

@agocke
Member

agocke commented Apr 26, 2024

Ok, so is the expectation that tests should be re-run when dead-lettering happens? That is, basically, that run was invalid?

lewing changed the title from "Dead lettering tests [disproportionately wasm]" to "Dead lettering tests" on May 2, 2024
lewing removed the wasm-aot-test and os-browser labels on May 2, 2024
@lewing
Member

lewing commented May 2, 2024

I've removed the wasm references in the labels and title bits because wasm is no longer dominating the failures in any way (with the exception of preview4 which has known problems that are fixed in main)

lewing removed the os-windows label on May 2, 2024
8 participants