Disable AMQP LinkAttachDetachMultipleOneSession test until it has been investigated. #5631

@ahsonkhan ahsonkhan commented May 16, 2024

Follow-up: #5536

The goal here is to make a concerted effort to isolate the factors that are impacting the pipeline success rate and to discover whether we have other unknown reliability concerns once we pause the flakiness caused by this particular test. I want to improve the signal-to-noise ratio of pipeline failures over the next 14-30 days. The other goal is to improve PR velocity.
 
This test has been causing our live and CI pipeline runs to fail quite often, which forces jobs to be restarted and adds delays. It has a ~97% success rate and needs further investigation to determine whether the root cause is a test reliability problem or an SDK correctness issue. Given how frequently we run our pipelines and the number of OSes/flavors we have, we end up seeing failures almost daily:
https://dev.azure.com/azure-sdk/internal/_test/analytics?definitionId=1615&contextType=build

Otherwise, we have a 99.99% success rate for tests in the pipelines over the last 14 days.
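
To see why a ~97% per-run pass rate can translate into near-daily failures, here is a rough back-of-the-envelope sketch. It assumes every execution of the test fails independently with the same probability, which may not hold in practice. The function names are illustrative only, and the number of executions per day is not stated in this PR, so any value passed for it is a made-up input.

```cpp
#include <cassert>
#include <cmath>

// Observed pass rate from a failure count and a total result count.
double PassRate(int failures, int total)
{
  assert(total > 0 && failures >= 0 && failures <= total);
  return 1.0 - static_cast<double>(failures) / total;
}

// Probability that at least one of `executions` runs fails, given a per-run
// pass rate. Treating the runs as independent is an assumption.
double ProbabilityOfAnyFailure(double passRate, int executions)
{
  assert(passRate >= 0.0 && passRate <= 1.0 && executions >= 0);
  return 1.0 - std::pow(passRate, executions);
}
```

With the 19 failures out of 721 results cited later in this thread, `PassRate(19, 721)` is roughly 0.974. If the test were executed 50 times a day (a hypothetical figure), `ProbabilityOfAnyFailure(0.974, 50)` would come out at roughly 0.73, which is in line with seeing a failure on most days.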

1 test causing 19 failed test results

image

It is hurting our overall live test reliability, which impacts our release processes as well.

https://dev.azure.com/azure-sdk/internal/_pipeline/analytics/stageawareoutcome?definitionId=1615&contextType=build

image

This is the number one cause of our pipeline flakiness (~45%), with a gtest timeout build error coming in at number 2:
image

@ahsonkhan ahsonkhan added the test-reliability (Issue that causes tests to be unreliable) and AMQP (Issues related to the AMQP protocol Support in Azure Core) labels May 16, 2024
@ahsonkhan ahsonkhan self-assigned this May 16, 2024
@ahsonkhan
Member Author

/azp run cpp - core


Azure Pipelines successfully started running 1 pipeline(s).

@ahsonkhan
Member Author

/azp run cpp - core


Azure Pipelines successfully started running 1 pipeline(s).

Member

@antkmsft antkmsft left a comment

Do I read it correctly that the worst passing rate for this test is 99.98%?
If so, shouldn't we actually investigate the test, or, at the very least, run it in a loop with several retries, instead of disabling it right away?
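
For illustration, the retry loop suggested above could look something like the following minimal sketch. It is plain C++ rather than the SDK's test code or any GoogleTest API; `RetryResult`, `RunWithRetries`, and `testBody` are hypothetical names. The sketch records how many attempts were needed, so that a pass on retry stays visible.

```cpp
#include <cassert>
#include <functional>

// Outcome of running a test body with retries.
struct RetryResult
{
  bool Passed;  // true if any attempt succeeded
  int Attempts; // number of attempts actually made
};

// Runs `testBody` until it returns true or `maxAttempts` is exhausted.
// Returning the attempt count lets the caller log "passed on retry"
// instead of reporting an indistinguishable clean pass.
RetryResult RunWithRetries(const std::function<bool()>& testBody, int maxAttempts)
{
  assert(maxAttempts > 0);
  RetryResult result{false, 0};
  while (result.Attempts < maxAttempts && !result.Passed)
  {
    ++result.Attempts;
    result.Passed = testBody();
  }
  return result;
}
```

A caller would typically treat `Attempts > 1` as a signal worth logging, since it marks an intermittent failure even when the overall result is a pass.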

@ahsonkhan
Member Author

ahsonkhan commented May 17, 2024

I provided the rationale and motivation in the PR description.

> Do I read it correctly that the worst passing rate for this test is 99.98%?

Nope, it's 97%. You can look at the screenshot, which lists the 19 failures out of 721. It fails at least once a day, if not more.

> It has a ~97% success rate and needs further investigation to determine whether the root cause is a test reliability problem or an SDK correctness issue.

> If so, shouldn't we actually investigate the test, or, at the very least, run it in a loop with several retries, instead of disabling it right away?

This is a stop-gap. Yes, we should investigate. That's why we have an existing issue tracking the investigation. While that's ongoing and being prioritized, I will restate the answers to your question from the PR description. The test is quite complex, and it isn't clear that retries are the solution because there could be an underlying product issue (which running in a loop would hide). That's why it requires investigation, and that will take time. Pending that effort...

> The goal here is to make a concerted effort to isolate the factors that are impacting the pipeline success rate and to discover whether we have other unknown reliability concerns once we pause the flakiness caused by this particular test. I want to improve the signal-to-noise ratio of pipeline failures over the next 14-30 days. The other goal is to improve PR velocity.
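
The concern in this comment that running the test in a loop could hide an underlying product issue can be made concrete with a small calculation. This is only a sketch: it assumes each attempt fails independently with the same probability, and `FailureRateWithRetries` is a hypothetical helper, not part of the SDK or its test framework.

```cpp
#include <cassert>
#include <cmath>

// Probability that a test still reports a failure after up to `maxAttempts`
// tries, given the failure rate of a single attempt. Assumes the attempts
// are independent, which may not be true for a real product defect.
double FailureRateWithRetries(double singleAttemptFailureRate, int maxAttempts)
{
  assert(singleAttemptFailureRate >= 0.0 && singleAttemptFailureRate <= 1.0);
  assert(maxAttempts > 0);
  return std::pow(singleAttemptFailureRate, maxAttempts);
}
```

With 19 failures out of 721 (about 2.6% per attempt) and three attempts, the reported failure rate would drop to roughly 0.002%, so a defect that occurs about once in 40 runs would almost never surface in the pipeline.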

@ahsonkhan
Member Author

@LarryOsterman please let me know if you are OK with this change going in, as the area owner.

@LarryOsterman
Member

> @LarryOsterman please let me know if you are OK with this change going in, as the area owner.

I would rather that we take the fix for this problem, rather than disabling the test.

@ahsonkhan
Member Author

Sounds good, will wait for your test update. If you need me to revive this PR, to get more time, let me know.

@ahsonkhan ahsonkhan closed this May 20, 2024