Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Backport stable/8.0] Only check that the topology reports all replicas #10084

Merged
merged 9 commits into from Aug 17, 2022

Conversation

npepinpe
Copy link
Member

@npepinpe npepinpe commented Aug 16, 2022

Description

This PR backports #10082 to stable/8.0. At the same time, I backported two other flaky test fixes, since one included the TopologyAssert helpers used here (and possibly in further flaky test fixes).

There were no conflicts by the way.

Related issues

backports #10082, #9596, #9502

Definition of Done

Not all items need to be done depending on the issue and the pull request.

Code changes:

  • The changes are backwards compatibility with previous versions
  • If it fixes a bug then PRs are created to backport the fix to the last two minor versions. You can trigger a backport by assigning labels (e.g. backport stable/1.3) to the PR, in case that fails you need to create backports manually.

Testing:

  • There are unit/integration tests that verify all acceptance criterias of the issue
  • New tests are written to ensure backwards compatibility with further versions
  • The behavior is tested manually
  • The change has been verified by a QA run
  • The impact of the changes is verified by a benchmark

Documentation:

  • The documentation is updated (e.g. BPMN reference, configuration, examples, get-started guides, etc.)
  • New content is added to the release announcement
  • If the PR changes how BPMN processes are validated (e.g. support new BPMN element) then the Camunda modeling team should be informed to adjust the BPMN linting.

Please refer to our review guidelines.

npepinpe and others added 9 commits August 16, 2022 16:32
Backports the expanded TopologyAssert from the main branch to
stable/8.0, as well as its tests.
Instead of using the helper `isComplete`, manually check that the
topology contains all replicas, omitting the check for the partition
health. While this would be safer, there is a race condition where the
partitions may be unhealthy for a short time on start up. For this test,
we only really want to check that the nodes find each other, not that
processing works, as it should be largely unaffected.
The IPv6 test is sometimes flaky because it may take a while for a full
cluster to start: 3 brokers and one gateway. To test IPv6, however, we
don't really need so many nodes, and having 1 broker and 1 gateway is
enough: all ports are tested properly (except the management port, which
is tested already by Spring).

Additionally, this fixes that if the test fails once, the network may
not be removed properly afterwards, which causes further iterations to
fail. We ensure the network is closed here, and also we print out a
proper error if we fail to start the network on the next run.

Finally I took the time to clean up the test using new APIs provided by
zeebe-test-container.
Previously, the rule only waited for a leader of partition 1 before
returning from `startBroker`. With this change, the leader must also
report as healthy.

This is intended to fix some flaky tests where after the startup,
commands send to the leader of partition 1 failed because no command-api
handler was registered. Since registration of handlers is required for
a healthy status, this should hopefully prevent these kinds of flaky
tests.
The `ClusterRule` uses the topology to figure out if the cluster is
ready before a test starts. There was a flaky condition where it might
start when all partitions are present, but before they are healthy. This
change was pulled from #9502 - however in that PR, it was only done in
the broker module's `EmbeddedBrokerRule`.

At the same time, the rule is updated such that everywhere it was
waiting for the topology, now also uses `TopologyAssert`.
@npepinpe
Copy link
Member Author

@Zelldon - it's a lot of changes, but as I said, it's all backports with no conflicts. The only additional commit was the last one, which I've also forward ported.

zeebe-bors-camunda bot added a commit that referenced this pull request Aug 17, 2022
10086: Avoid race condition when polling for topology r=oleschoenburg a=npepinpe

## Description

This PR fixes a potential race condition when starting the `EmbeddedBrokerRule`. We poll for the topology and wait until it's complete. Unfortunately, if we're too fast and the request is rejected, the `.join()` will throw, which will be rethrown by `Awaitility` (as when using `untilAsserted` it only catches `AssertionError`), failing the test. By ignoring exceptions here we can focus on just failing when the topology is never complete (Awaitility will print out the last failure in that case).

## Related issues

related to #10084 



Co-authored-by: Nicolas Pepin-Perreault <nicolas.pepin-perreault@camunda.com>
Copy link
Member

@Zelldon Zelldon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great thanks!

@npepinpe
Copy link
Member Author

bors merge

@zeebe-bors-camunda
Copy link
Contributor

@zeebe-bors-camunda zeebe-bors-camunda bot merged commit 9531912 into stable/8.0 Aug 17, 2022
@zeebe-bors-camunda zeebe-bors-camunda bot deleted the backport-10082-to-stable/8.0 branch August 17, 2022 14:36
@Zelldon Zelldon added the version:8.1.0 Marks an issue as being completely or in parts released in 8.1.0 label Oct 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
version:8.1.0-alpha5 Marks an issue as being completely or in parts released in 8.1.0-alpha5 version:8.1.0 Marks an issue as being completely or in parts released in 8.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants