Disable AMQP LinkAttachDetachMultipleOneSession test until it has been investigated. #5631
Conversation
/azp run cpp - core
Azure Pipelines successfully started running 1 pipeline(s).
/azp run cpp - core
Azure Pipelines successfully started running 1 pipeline(s).
Do I read it correctly that the worst passing rate for this test is 99.98%?
If so, shouldn't we actually investigate the test, or, at the very least, run it in a loop with several retries, instead of disabling it right away?
I provided the rationale and motivation in the PR description.
Nope, it's 97%; see the screenshot, which lists the 19 failures out of 721 runs. It fails at least once a day, if not more.
This is a stop-gap. Yes, we should investigate; that's why there is an existing issue tracking the investigation. While that work is ongoing and being prioritized, I will restate the answers to your question from the PR description. The test is quite complex, and it isn't clear that retries are the solution, because there could be an underlying product issue (which running in a loop would hide). That's why it requires investigation, and that will take time. Pending that effort...
@LarryOsterman please let me know if you are OK with this change going in, as the area owner.
I would rather that we take the fix for this problem, rather than disabling the test.
Sounds good, will wait for your test update. If you need me to revive this PR, to get more time, let me know.
Follow-up: #5536
The goal here is to make a concerted effort to isolate the factors that are impacting the pipeline success rate, and to discover whether we have other unknown reliability concerns once the flakiness caused by this particular test is paused. I want to improve the signal-to-noise ratio for pipeline run failures over the next 14-30 days. The other goal is to improve PR velocity.
This test has been causing our live and CI pipeline runs to fail quite often, requiring jobs to be restarted and adding delays. It has a ~97% success rate and needs to be investigated further to root-cause the issue as either a test reliability problem or an SDK correctness problem. Given how frequently we run our pipelines and the number of OSes/flavors we cover, we end up seeing failures almost daily:
https://dev.azure.com/azure-sdk/internal/_test/analytics?definitionId=1615&contextType=build
Otherwise, we have a 99.99% success rate for tests in the pipelines over the last 14 days.
It is hurting our overall live test reliability, which impacts our release processes as well.
https://dev.azure.com/azure-sdk/internal/_pipeline/analytics/stageawareoutcome?definitionId=1615&contextType=build
This is the number one cause of our pipeline flakiness (~45%), with a gtest timeout build error coming in at number two: