Disable AMQP LinkAttachDetachMultipleOneSession test until it has been investigated. #5631

ahsonkhan · 2024-05-16T23:15:01Z

Follow-up: #5536

The goal here is to make a concerted effort to isolate the factors that are impacting the pipeline success rate, and to discover whether we have other unknown reliability concerns once we pause the flakiness caused by this particular test. I want to improve the signal-to-noise ratio of pipeline failures over the next 14-30 days. The other goal is to improve PR velocity.

This test has been causing our live and CI pipeline runs to fail quite often, which forces jobs to be restarted and adds delays. It has a ~97% success rate and needs further investigation to root-cause the issue as either a test reliability problem or an SDK correctness problem. Given how frequently we run our pipelines and the number of OSes/flavors we have, we end up seeing failures almost daily:
https://dev.azure.com/azure-sdk/internal/_test/analytics?definitionId=1615&contextType=build

Otherwise, we have a 99.99% success rate for tests in the pipelines over the last 14 days.
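To illustrate why a ~97% per-run success rate shows up as near-daily failures, here is a minimal sketch of the arithmetic. It assumes each execution of the test fails independently, which may not hold in practice, and the function name `ProbabilityOfAnyFailure` is hypothetical rather than anything in the SDK or pipeline tooling.

```cpp
#include <cassert>
#include <cmath>

// Probability that at least one of `runs` executions fails, given a
// per-execution pass rate in [0, 1].
// Assumption: executions are independent of each other.
double ProbabilityOfAnyFailure(double passRate, int runs)
{
  return 1.0 - std::pow(passRate, runs);
}
```

For example, with a pass rate of 0.97 and a hypothetical 20 executions per day across OS/flavor combinations, the chance of at least one failure is a little under one half.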

1 test causing 19 failed test results

It is hurting our overall live test reliability, which impacts our release processes as well.

https://dev.azure.com/azure-sdk/internal/_pipeline/analytics/stageawareoutcome?definitionId=1615&contextType=build

This is the number one cause of our pipeline flakiness (~45%), with a gtest timeout build error coming in at number 2:

ahsonkhan · 2024-05-16T23:23:40Z

/azp run cpp - core

azure-pipelines · 2024-05-16T23:24:03Z

Azure Pipelines successfully started running 1 pipeline(s).

ahsonkhan · 2024-05-16T23:40:24Z

/azp run cpp - core

azure-pipelines · 2024-05-16T23:40:46Z

Azure Pipelines successfully started running 1 pipeline(s).

antkmsft

Do I read it correctly that the worst passing rate for this test is 99.98%?
If so, shouldn't we actually investigate the test or, at the very least, run it in a loop with several retries, instead of disabling it right away?

ahsonkhan · 2024-05-17T18:03:34Z

I provided the rationale and motivation in the PR description.

> Do I read it correctly that the worst passing rate for this test is 99.98%?

Nope, it's 97%. You can look at the screenshot, which lists the 19 failures out of 721. It fails at least once a day, if not more.
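As a quick check of that figure, the pass rate can be computed directly from the two counts quoted above. This is only a sketch; the helper name `PassRate` is hypothetical.

```cpp
#include <cassert>

// Pass rate as a fraction in [0, 1], computed from the number of failed
// results and the total number of results. Returns 0 when there are no
// results, to avoid dividing by zero.
double PassRate(int failures, int total)
{
  if (total <= 0)
  {
    return 0.0;
  }
  return static_cast<double>(total - failures) / total;
}
```

With 19 failures out of 721 results, `PassRate(19, 721)` evaluates 702 / 721, which is between 97% and 98%.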

> It has a ~97% success rate and needs to be investigated further to root cause the issue as either test reliability improvement or SDK correctness.

> If so, shouldn't we actually investigate the test or, at the very least, run it in a loop with several retries, instead of disabling it right away?

This is a stop-gap. Yes, we should investigate; that's why we have an existing issue for it. While that is ongoing and being prioritized, I will restate the answers to your question from the PR description. The test is quite complex, and it isn't clear that retries are the solution, because there could be an underlying product issue (which running in a loop would hide). That's why it requires investigation, and that will take time. Pending that effort...

> The goal here is to make a concerted effort to isolate the factors that are impacting the pipeline success rate, and to discover whether we have other unknown reliability concerns once we pause the flakiness caused by this particular test. I want to improve the signal-to-noise ratio of pipeline failures over the next 14-30 days. The other goal is to improve PR velocity.
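The concern that a retry loop would hide an underlying product issue can be made concrete with a small sketch. This is not how the AMQP tests or the pipeline are implemented; `RetryOutcome` and `RunWithRetries` are hypothetical names. The idea is that a retry wrapper which only reports the final pass/fail result discards the failed attempts, whereas one that counts them keeps the flakiness visible.

```cpp
#include <cassert>
#include <functional>

// Result of running a test body with retries (hypothetical type).
struct RetryOutcome
{
  bool Passed = false;    // true if any attempt succeeded
  int Attempts = 0;       // total attempts made
  int FailedAttempts = 0; // nonzero together with Passed == true means a flaky pass
};

// Runs `body` until it returns true or `maxAttempts` is exhausted,
// recording how many attempts failed along the way.
RetryOutcome RunWithRetries(std::function<bool()> const& body, int maxAttempts)
{
  RetryOutcome outcome;
  for (int i = 0; i < maxAttempts; ++i)
  {
    ++outcome.Attempts;
    if (body())
    {
      outcome.Passed = true;
      break;
    }
    ++outcome.FailedAttempts;
  }
  return outcome;
}
```

A caller that checks only `Passed` would report success for a body that failed twice before passing; checking `FailedAttempts` as well is what lets the intermittent failure still be surfaced and investigated.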

ahsonkhan · 2024-05-20T19:10:43Z

@LarryOsterman please let me know if you are OK with this change going in, as the area owner.

LarryOsterman · 2024-05-20T19:16:04Z

> @LarryOsterman please let me know if you are OK with this change going in, as the area owner.

I would rather we take the fix for this problem than disable the test.

ahsonkhan · 2024-05-20T20:26:44Z

Sounds good, I will wait for your test update. If you need me to revive this PR to get more time, let me know.

Disable AMQP LinkAttachDetachMultipleOneSession test until it has been investigated.

89c8096

ahsonkhan added the labels test-reliability (Issue that causes tests to be unreliable) and AMQP (Issues related to the AMQP protocol Support in Azure Core) May 16, 2024

ahsonkhan self-assigned this May 16, 2024

ahsonkhan requested review from RickWinter, antkmsft, gearama and LarryOsterman as code owners May 16, 2024 23:15

Line Coverage is now at 87.8049%.

deb07e7

antkmsft reviewed May 17, 2024

ahsonkhan mentioned this pull request May 20, 2024

Remove redundant calls to gtest_discover_tests with default args in AMQP tests #5644

Merged

ahsonkhan closed this May 20, 2024
