Builds legs getting abandoned due to agent notification issues #35223

Closed
jaredpar opened this issue Apr 20, 2020 · 12 comments
Labels
area-Infrastructure-coreclr blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms'

Comments

@jaredpar
Member

jaredpar commented Apr 20, 2020

Full message

##[error]The request: 5641504 was abandoned due to an infrastructure failure. Notification of assignment to an agent was never received.

Runfo Tracking Issue: Runtime jobs being abandoned due to infra

| Definition | Build | Kind | Job Name |
| --- | --- | --- | --- |
| runtime | 1072738 | PR 50450 | CoreCLR Product Build windows arm checked |
| runtime | 1072738 | PR 50450 | Libraries Build windows arm64 Release |
| runtime | 1072738 | PR 50450 | Build windows x64 Release SingleFile |
| runtime | 1072738 | PR 50450 | CoreCLR Product Build windows arm64 checked |
| runtime | 1072738 | PR 50450 | CoreCLR Product Build windows x64 checked |
| runtime | 1072738 | PR 50450 | CoreCLR Product Build windows arm64 release |
| runtime | 1072738 | PR 50450 | Libraries Build windows x64 Debug |
| runtime | 1072738 | PR 50450 | Libraries Build windows arm Release |
| runtime | 1072738 | PR 50450 | CoreCLR Product Build windows x64 release |
| runtime | 1072738 | PR 50450 | CoreCLR Product Build windows x86 release |
| runtime | 1072738 | PR 50450 | Libraries Build windows x86 Release |
| runtime | 1072738 | PR 50450 | CoreCLR Product Build windows x64 release PGO |
| runtime | 1072738 | PR 50450 | CoreCLR Product Build windows x86 checked |
| runtime | 1072738 | PR 50450 | CoreCLR Product Build windows arm release |
| runtime | 1072731 | PR 49906 | Mono Product Build windows x64 debug |
| runtime | 1072731 | PR 49906 | CoreCLR Product Build windows arm checked |
| runtime | 1072731 | PR 49906 | CoreCLR Product Build windows x86 checked |
| runtime | 1072731 | PR 49906 | Libraries Build windows allConfigurations x64 Debug |
| runtime | 1072731 | PR 49906 | CoreCLR Product Build windows x64 release PGO |
| runtime | 1072731 | PR 49906 | Libraries Build windows x86 Release |
| runtime | 1072731 | PR 49906 | CoreCLR Product Build windows x86 release |
| runtime | 1072731 | PR 49906 | Libraries Build windows x86 Debug |
| runtime | 1072731 | PR 49906 | CoreCLR Product Build windows x64 release |
| runtime | 1072731 | PR 49906 | CoreCLR Product Build windows arm release |
| runtime | 1072731 | PR 49906 | Libraries Build windows net48 x86 Release |
| runtime | 1072731 | PR 49906 | Mono Product Build windows x64 release |
| runtime | 1072731 | PR 49906 | Libraries Build windows arm Release |
| runtime | 1072731 | PR 49906 | Libraries Build windows x64 Debug |
| runtime | 1072731 | PR 49906 | Mono Product Build windows x86 release |
| runtime | 1072731 | PR 49906 | CoreCLR Product Build windows arm64 release |
| runtime | 1072731 | PR 49906 | CoreCLR Product Build windows x64 checked |
| runtime | 1072731 | PR 49906 | CoreCLR Product Build windows arm64 checked |
| runtime | 1072731 | PR 49906 | Mono Product Build windows x86 debug |
| runtime | 1072731 | PR 49906 | Libraries Build windows arm64 Release |
| runtime | 1072731 | PR 49906 | Build windows x64 Release SingleFile |
| runtime | 1072700 | PR 50622 | Mono Product Build windows x64 debug |
| runtime | 1072700 | PR 50622 | Libraries Build windows arm64 Release |
| runtime | 1072700 | PR 50622 | Libraries Build windows allConfigurations x64 Debug |
| runtime | 1072700 | PR 50622 | CoreCLR Product Build windows x64 release PGO |
| runtime | 1072700 | PR 50622 | Libraries Build windows x86 Release |
| runtime | 1072700 | PR 50622 | CoreCLR Product Build windows x86 release |
| runtime | 1072700 | PR 50622 | Libraries Build windows x86 Debug |
| runtime | 1072700 | PR 50622 | CoreCLR Product Build windows x64 release |
| runtime | 1072700 | PR 50622 | Libraries Build windows net48 x86 Release |
| runtime | 1072700 | PR 50622 | Mono Product Build windows x64 release |
| runtime | 1072700 | PR 50622 | Libraries Build windows arm Release |
| runtime | 1072700 | PR 50622 | Libraries Build windows x64 Debug |
| runtime | 1072700 | PR 50622 | Mono Product Build windows x86 release |
| runtime | 1072700 | PR 50622 | CoreCLR Product Build windows arm64 release |
| runtime | 1072700 | PR 50622 | Mono Product Build windows x86 debug |
| runtime | 1072700 | PR 50622 | CoreCLR Product Build windows arm release |
| runtime | 1072700 | PR 50622 | Build windows x64 Release SingleFile |
| runtime | 1072687 | PR 50732 | Libraries Build windows arm64 Release |
| runtime | 1072687 | PR 50732 | Build windows x64 Release SingleFile |
| runtime | 1072687 | PR 50732 | CoreCLR Product Build windows x86 checked |
| runtime | 1072687 | PR 50732 | Libraries Build windows allConfigurations x64 Debug |
| runtime | 1072687 | PR 50732 | CoreCLR Product Build windows x64 release PGO |
| runtime | 1072687 | PR 50732 | Libraries Build windows x86 Release |
| runtime | 1072687 | PR 50732 | CoreCLR Product Build windows x86 release |
| runtime | 1072687 | PR 50732 | Libraries Build windows x86 Debug |
| runtime | 1072687 | PR 50732 | CoreCLR Product Build windows x64 release |
| runtime | 1072687 | PR 50732 | CoreCLR Product Build windows arm release |
| runtime | 1072687 | PR 50732 | Mono Product Build windows x64 debug |
| runtime | 1072687 | PR 50732 | Mono crossaot Product Build windows x64 release |
| runtime | 1072687 | PR 50732 | CoreCLR Product Build windows arm checked |
| runtime | 1072687 | PR 50732 | Libraries Build windows net48 x86 Release |
| runtime | 1072687 | PR 50732 | Mono Product Build windows x64 release |
| runtime | 1072687 | PR 50732 | Libraries Build windows arm Release |
| runtime | 1072687 | PR 50732 | Libraries Build windows x64 Debug |
| runtime | 1072687 | PR 50732 | Mono Product Build windows x86 release |
| runtime | 1072687 | PR 50732 | CoreCLR Product Build windows arm64 release |
| runtime | 1072687 | PR 50732 | CoreCLR Product Build windows x64 checked |
| runtime | 1072687 | PR 50732 | CoreCLR Product Build windows arm64 checked |
| runtime | 1072687 | PR 50732 | Mono Product Build windows x86 debug |
| runtime | 1072679 | PR 50644 | CoreCLR Product Build windows arm64 checked |
| runtime | 1072679 | PR 50644 | CoreCLR Product Build windows x64 checked |
| runtime | 1072679 | PR 50644 | CoreCLR Product Build windows arm64 release |
| runtime | 1072679 | PR 50644 | Libraries Build windows x64 Debug |
| runtime | 1072679 | PR 50644 | Libraries Build windows arm Release |
| runtime | 1072679 | PR 50644 | CoreCLR Product Build windows arm checked |
| runtime | 1072679 | PR 50644 | CoreCLR Product Build windows arm release |
| runtime | 1072679 | PR 50644 | CoreCLR Product Build windows x64 release |
| runtime | 1072679 | PR 50644 | CoreCLR Product Build windows x86 release |
| runtime | 1072679 | PR 50644 | Libraries Build windows x86 Release |
| runtime | 1072679 | PR 50644 | CoreCLR Product Build windows x64 release PGO |
| runtime | 1072679 | PR 50644 | CoreCLR Product Build windows x86 checked |
| runtime | 1072679 | PR 50644 | Build windows x64 Release SingleFile |
| runtime | 1072679 | PR 50644 | Libraries Build windows arm64 Release |
| runtime | 1072670 | PR 50612 | Mono Product Build windows x64 debug |
| runtime | 1072670 | PR 50612 | CoreCLR Product Build windows arm checked |
| runtime | 1072670 | PR 50612 | CoreCLR Product Build windows x86 checked |
| runtime | 1072670 | PR 50612 | Libraries Build windows allConfigurations x64 Debug |
| runtime | 1072670 | PR 50612 | CoreCLR Product Build windows x64 release PGO |
| runtime | 1072670 | PR 50612 | Libraries Build windows x86 Release |
| runtime | 1072670 | PR 50612 | CoreCLR Product Build windows x86 release |
| runtime | 1072670 | PR 50612 | Libraries Build windows x86 Debug |
| runtime | 1072670 | PR 50612 | CoreCLR Product Build windows x64 release |
| runtime | 1072670 | PR 50612 | CoreCLR Product Build windows arm release |
| runtime | 1072670 | PR 50612 | Mono crossaot Product Build windows x64 release |
| runtime | 1072670 | PR 50612 | Libraries Build windows arm64 Release |

Build Result Summary

| Day Hit Count | Week Hit Count | Month Hit Count |
| --- | --- | --- |
| 6 | 6 | 6 |

@jaredpar jaredpar added the blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms' label Apr 20, 2020
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added area-Infrastructure-coreclr untriaged New issue has not been triaged by the area owner labels Apr 20, 2020
@jashook jashook removed the untriaged New issue has not been triaged by the area owner label May 4, 2020
@ilyas1974

Using runfo, in the last 200 builds, I'm seeing 2 instances of this - from April 14 and April 17.

C:\Users\ilyas>runfo search-timeline -d runtime -c 200 -pr -v "abandoned due to an infrastructure failure"
https://dev.azure.com/dnceng/public/_build/results?buildId=606419
https://dev.azure.com/dnceng/public/_build/results?buildId=600517

Evaluated 200 builds
Impacted 2 builds
Impacted 2 jobs

Unfortunately, the logs from these builds are gone and there is nothing to investigate. Do you have additional examples or should we consider this issue closed (until we see new instances)?

/cc @ViktorHofer

@ViktorHofer
Member

> Using runfo, in the last 200 builds, I'm seeing 2 instances of this - from April 14 and April 17.

Are you sure about that? The top post says: "Most recent failure 5/7/2020 1:12:00 AM".

@jaredpar
Member Author

@ilyas1974 try taking out the -d runtime portion of the query. This particular failure is happening across a broader range of build definitions.

@ilyas1974

Similar results:

C:\Users\ilyas>runfo search-timeline -c 200 -pr -v "abandoned due to an infrastructure failure"

Evaluated 200 builds
Impacted 0 bulids
Impacted 0 jobs

C:\Users\ilyas>runfo search-timeline -c 200 -project internal -v "abandoned due to an infrastructure failure"

Evaluated 200 builds
Impacted 0 bulids
Impacted 0 jobs

Do you have any additional suggestions?

@jaredpar
Member Author

Something isn't right here. I'm taking a look.

@jaredpar
Member Author

Okay figured this out. There are two sources of confusion here:

  1. Searching back 200 builds for the entire project isn't far enough. This problem doesn't hit often enough to show up in that data. You either need to go back much further or restrict the search to a particular definition. My recommendation is to use -d 104 (machine learning), as that has the most recent attempts.
  2. The machine learning team was pretty aggressively retrying their failed jobs that hit this error. My tooling, which auto-triages builds, does so immediately upon build completion and hence triaged these builds before they were retried. By the time @ilyas1974 ran his query, though, the build had been retried and the failure was gone from the latest attempt.
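
The second point can be illustrated with a small sketch. This is not runfo's implementation: the `Build` class, its `attempts` field, and the build ID below are hypothetical, and the sketch assumes a search that only inspects the most recent attempt of each build.

```python
from __future__ import annotations

from dataclasses import dataclass, field

ERROR_TEXT = "abandoned due to an infrastructure failure"


@dataclass
class Build:
    """Hypothetical stand-in for a build and its attempts (not an AzDO type)."""
    build_id: int
    # attempts[0] is attempt 1; each attempt holds the job messages recorded for it.
    attempts: list[list[str]] = field(default_factory=list)


def latest_attempt_hits(build: Build, text: str) -> bool:
    """Return True if any message in the most recent attempt contains `text`."""
    if not build.attempts:
        return False
    return any(text in message for message in build.attempts[-1])


# Attempt 1 fails with the infrastructure error; triage runs at build completion.
build = Build(
    build_id=1,  # hypothetical ID
    attempts=[["The request: 5641504 was abandoned due to an infrastructure failure."]],
)
seen_at_completion = latest_attempt_hits(build, ERROR_TEXT)

# The job is retried and succeeds; a later query only sees attempt 2.
build.attempts.append(["Job succeeded"])
seen_after_retry = latest_attempt_hits(build, ERROR_TEXT)

print(seen_at_completion, seen_after_retry)
```

Under these assumptions the same build is flagged when triaged at completion but not when queried after the retry, which matches the mismatch described above.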

To work around the second issue, I added a new argument to search-timeline and timeline: -attempt. This lets you specify the build attempt that you want to get timeline information for. The first attempt is always labeled as 1 in AzDO. Once you have an updated runfo (at least 0.5.2), you can do the following search:

P:\devops-util\runfo>runfo search-timeline -c 50 -d 104 -pr  -v "abandoned due to an infrastructure failure" -a 1
https://dev.azure.com/dnceng/public/_build/results?buildId=634214
https://dev.azure.com/dnceng/public/_build/results?buildId=634214
https://dev.azure.com/dnceng/public/_build/results?buildId=634291
https://dev.azure.com/dnceng/public/_build/results?buildId=634291
https://dev.azure.com/dnceng/public/_build/results?buildId=634291
https://dev.azure.com/dnceng/public/_build/results?buildId=634291
https://dev.azure.com/dnceng/public/_build/results?buildId=634291
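
For anyone post-processing output like the above, here is a minimal Python sketch that groups the result URLs by build. It assumes each output line corresponds to one matching timeline record, which the output itself does not state; the helper name `build_id_from_url` is made up for illustration.

```python
from collections import Counter
from urllib.parse import parse_qs, urlparse

# The URLs printed by the search above, one per output line.
output_lines = [
    "https://dev.azure.com/dnceng/public/_build/results?buildId=634214",
    "https://dev.azure.com/dnceng/public/_build/results?buildId=634214",
    "https://dev.azure.com/dnceng/public/_build/results?buildId=634291",
    "https://dev.azure.com/dnceng/public/_build/results?buildId=634291",
    "https://dev.azure.com/dnceng/public/_build/results?buildId=634291",
    "https://dev.azure.com/dnceng/public/_build/results?buildId=634291",
    "https://dev.azure.com/dnceng/public/_build/results?buildId=634291",
]


def build_id_from_url(url: str) -> int:
    """Extract the buildId query parameter from a build results URL."""
    query = parse_qs(urlparse(url).query)
    return int(query["buildId"][0])


# Count how many output lines refer to each distinct build.
hits_per_build = Counter(build_id_from_url(url) for url in output_lines)

for build_id, hits in sorted(hits_per_build.items()):
    print(f"build {build_id}: {hits} matching lines")
print(f"{len(hits_per_build)} distinct builds, {sum(hits_per_build.values())} lines")
```

Grouping this way makes it easier to see that the seven lines cover only two distinct builds.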

@ilyas1974

Thank you @jaredpar. I have created https://github.com/dotnet/core-eng/issues/9801 for our FR team to take a look and see what we can determine about these failures.

@jaredpar
Member Author

I did some more searching and found pretty much every variant of this: builds that hit it on the first attempt, on the latest attempt, and on some attempt in the middle. That's not surprising on one hand, but it does mean we probably need expanded search capabilities for certain issues.

I have a tweak available locally where I can say -attempt -1, which means "search all attempts". I still need to work on the presentation, though. I should have some builds out tonight or tomorrow.
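
A rough sketch of that selection rule in Python follows. It is only an illustration, not the runfo code: the function names and the data layout are hypothetical, attempts are assumed to be numbered 1 through N without gaps, and defaulting to the latest attempt when no value is given is also an assumption.

```python
from typing import Dict, List, Optional

ALL_ATTEMPTS = -1  # mirrors "-attempt -1" meaning "search all attempts"


def select_attempts(attempt_count: int, attempt: Optional[int]) -> List[int]:
    """Return the 1-based attempt numbers to search.

    None -> latest attempt only (assumed default)
    -1   -> every attempt
    n    -> attempt n, if the build has that many attempts
    """
    if attempt_count < 1:
        return []
    if attempt is None:
        return [attempt_count]
    if attempt == ALL_ATTEMPTS:
        return list(range(1, attempt_count + 1))
    if 1 <= attempt <= attempt_count:
        return [attempt]
    return []


def matching_attempts(
    messages_by_attempt: Dict[int, List[str]],
    text: str,
    attempt: Optional[int] = None,
) -> List[int]:
    """Return the selected attempt numbers whose messages contain `text`."""
    chosen = select_attempts(len(messages_by_attempt), attempt)
    return [
        number
        for number in chosen
        if any(text in message for message in messages_by_attempt.get(number, []))
    ]


# Hypothetical build that only hit the error on a middle attempt.
build_messages = {
    1: ["Tests failed"],
    2: ["The request was abandoned due to an infrastructure failure."],
    3: ["Job succeeded"],
}
text = "abandoned due to an infrastructure failure"

print(matching_attempts(build_messages, text))              # latest attempt only
print(matching_attempts(build_messages, text, attempt=1))   # first attempt only
print(matching_attempts(build_messages, text, attempt=-1))  # all attempts
```

With this layout, only the "all attempts" search finds a failure that occurred on a middle attempt, which is the case neither the first-attempt nor the latest-attempt search covers.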

@jashook jashook added this to the Future milestone Jul 17, 2020
@ViktorHofer
Member

We're seeing fewer of these failures and received an update from core-eng regarding a new date when the AzDO transition should be complete. See https://github.com/dotnet/core-eng/issues/9448 for more details.

@ViktorHofer
Member

Ah, I'm confused. @jaredpar, is this tracking something different than #34472?

@jaredpar
Member Author

jaredpar commented Oct 1, 2020

Yes they're different problems.

@ghost ghost moved this from Future to Untriaged in Infrastructure Backlog Feb 18, 2021
@ViktorHofer
Member

Closing as not observable anymore.

Infrastructure Backlog automation moved this from Untriaged to Done Feb 23, 2021
@runfoapp runfoapp bot removed this from the Future milestone Apr 5, 2021
@ghost ghost moved this from Done to Untriaged in Infrastructure Backlog Apr 5, 2021
@ViktorHofer ViktorHofer moved this from Untriaged to Done in Infrastructure Backlog Apr 7, 2021
@dotnet dotnet locked as resolved and limited conversation to collaborators May 14, 2021