
Plans for v3.5.3 release #13894

Closed
28 tasks done
serathius opened this issue Apr 6, 2022 · 49 comments

Comments

@serathius
Member

serathius commented Apr 6, 2022

Same as with the previous release (#13518), let's agree on what work should be included in the release and track it. The list below will evolve as the community decides to cut or include fixes.

Issues/PRs that should be included for v3.5.3:

Issues/PRs not required for v3.5.3:

Please feel free to suggest additional backports/issues worth investigating so we can discuss whether they should be included.

@serathius
Member Author

cc @ptabor @ahrtr @spzala

@mrueg
Contributor

mrueg commented Apr 6, 2022

@serathius not sure if you want to add #13895 as well.

@ahrtr
Member

ahrtr commented Apr 6, 2022

I think we need 13838 as well, because it will make reproducing the data inconsistency issue easier.

@serathius
Member Author

serathius commented Apr 7, 2022

Hey @ahrtr, it would be great to have an automatic way to reproduce this issue; however, it is not required for the immediate release. #13838 includes changes that might negatively impact the overall stability of the functional tests, making it hard to merge. I don't plan to make this code mergeable, to avoid investing too much time into functional testing, as we might want to change our overall approach to correctness testing after the data inconsistency issue.

The data inconsistency issue should be fairly tested with #13838, but I don't see merging this change as blocking for v3.5.3.

@riyazcom

riyazcom commented Apr 7, 2022

@serathius is it possible to include #13460 "Support multiple values for allowed client and peer TLS identities" ?

@serathius
Member Author

@riyazcom Even though not supporting multiple client & peer TLS identities is a limitation of etcd, it is not a bug, and introducing this support would basically be an additional feature. Based on that, I don't think we should backport it.

@jayunit100

What is the manual test plan for reproducing this? Is there a k8s-centric test we can run? I was running a combination of e2es while rebooting nodes, and didn't see this occur. I think it requires unrestricted API QoS, constrained etcd memory, and a few other tricky knobs.

@serathius
Member Author

serathius commented Apr 8, 2022

@jayunit100 please read the original issue #13766 (comment); it includes a description of the reproduction and a link to the PR with the code, #13838.

Let's continue the discussion there.

@spzala
Member

spzala commented Apr 8, 2022

@serathius #13692 doesn't seem required for 3.5.3. I am adding triage comments there.

@ahrtr
Member

ahrtr commented Apr 8, 2022

13666 is just a question rather than an issue. I already provided an answer and closed it. It definitely isn't required for 3.5.3.

@ahrtr
Member

ahrtr commented Apr 8, 2022

Please also remove 13690 from the list per the discussion in the PR.

@ahrtr
Member

ahrtr commented Apr 8, 2022

Just closed 13848 with comment 13848#issuecomment-1093449532.

@yuyangbj

yuyangbj commented Apr 11, 2022

Hi etcd maintainers,

Since the community has alerted customers not to use etcd 3.5.[0-2] in production environments due to the data inconsistency issue, is there any plan to speed up the 3.5.3 release, considering some products are already blocked by the issue?

@serathius
Member Author

@yuyangbj Thanks for your interest in the v3.5.3 release. Apart from just "fixing" the issue, we also need to make sure we do proper due diligence to test and qualify the release. We would really not want to make a mistake here and be forced to send another alert. Unfortunately this takes time, and with only 3-4 people working on the release it will take us a while.

The data inconsistency issue was non-trivial, and it took us a long time to figure out a proper fix. Fortunately we finally managed to merge the fix last Friday, so we can start qualifying the release and may decide to skip some of the minor issues proposed here.

The best thing I can recommend for products blocked on the release is to stop waiting passively and get involved in the release process. We need more people to test, document, and qualify the release.

@dims
Contributor

dims commented Apr 11, 2022

@serathius Apologies!!

@serathius
Member Author

New issue that might impact the v3.5.3 release. Could someone please take a look: #13922.

@endocrimes
Contributor

#13846 shouldn't be a critical release blocker - but here's a PR #13923

@serathius
Member Author

Thanks @endocrimes, I agree it's not a critical release blocker; however, for now let's assume it will go in, as qualification will take some time.

@endocrimes
Contributor

@serathius 👍 - what's necessary for qualification? (not sure if there's a test process that gets followed or if setting up some soak/scale tests out-of-tree would be helpful for example)

@serathius
Member Author

serathius commented Apr 11, 2022

That's a great question. Apparently the v3.5 release didn't follow the same qualification as previous releases (#13839 (comment)). With maintainer rotation we lost knowledge about processes that were previously executed manually; there is no mention of qualification testing in https://etcd.io/docs/v3.3/dev-internal/release/.

It would be great to get an idea of what kinds of tests were done before. @xiang90 @jpbetz, any recommendations about the testing?

@serathius
Member Author

In addition to the qualification mentioned before, it would be good to confirm that the two data inconsistency issues found were fixed. Instructions for their reproduction:

@dims
Contributor

dims commented Apr 11, 2022

@serathius Have we in the past had some method to run k8s (presubmit?) testing with a branch of etcd? (Curious; if so, we could try that.)

@serathius
Member Author

> @serathius Have we in the past had some method to run k8s (presubmit?) testing with a branch of etcd? (Curious; if so, we could try that.)

We have run v3.5.0 in K8s presubmits for a long time (we still do); however, they didn't find any of the issues we fixed for v3.5.3, so I don't expect them to be a good qualification method.

@endocrimes
Contributor

@serathius I'm also bumping the jepsen etcd tests for 3.5.x. (Starting by running them against 3.4.x and 3.5.2 as a baseline.)

I don't think these tests run at high enough scale to catch the corruption issues we hit here, but they're potentially a nice thing to get running semi-continuously somewhere.

@yuyangbj

> @yuyangbj Thanks for your interest in the v3.5.3 release. Apart from just "fixing" the issue, we also need to make sure we do proper due diligence to test and qualify the release. We would really not want to make a mistake here and be forced to send another alert. Unfortunately this takes time, and with only 3-4 people working on the release it will take us a while.
>
> The data inconsistency issue was non-trivial, and it took us a long time to figure out a proper fix. Fortunately we finally managed to merge the fix last Friday, so we can start qualifying the release and may decide to skip some of the minor issues proposed here.
>
> The best thing I can recommend for products blocked on the release is to stop waiting passively and get involved in the release process. We need more people to test, document, and qualify the release.

Thanks Marek @serathius for the prompt response. I understand everything needs complete testing. I do not want to push the release, so I am sorry if I gave that impression. Our downstream team is actively involved in reproducing the data inconsistency issue; we will post updates here when we have test results.

@ahrtr
Member

ahrtr commented Apr 12, 2022

> New issue that might impact the v3.5.3 release. Could someone please take a look: #13922.

Just resolved in pull/13930; it appears to be a legacy issue (or an old regression) rather than a new regression.

@serathius
Member Author

With the last data inconsistency issue fixed, we can focus on qualification and cutting the scope for the release. Moving #13862 and #13846 out, as they are not critical. Authors can continue to work on them; however, we should prioritize the release.

@serathius
Member Author

serathius commented Apr 12, 2022

Current plan for qualification:

We would be happy to accept any other ideas to make sure the release is stable and no new data corruption issues are found.

@serathius
Member Author

@shalinmangar in #13839 (comment) you mentioned that you have experience with jepsen. It would be great if you could help @endocrimes, as we cannot reproduce the v3.5.2 data inconsistencies there.

@serathius
Member Author

Qualification work is almost finished; I would like to prepare to start the release process.

First, let's collect acks from the people involved: @ptabor @spzala @ahrtr @endocrimes
Please let me know if there is anything else that should be done before the release.

@endocrimes
Contributor

endocrimes commented Apr 12, 2022

So current state of jepsen testing:

  • I have bumped the existing jepsen test suite (jepsen-io/etcd) to the latest versions of the various etcd components
  • I've run the full suite against etcd 3.5.2 in various scale scenarios trying to trigger a corruption case (and failed to do so on my 32-thread/128 GB RAM workstation without running out of memory first)
  • I'm now setting up an env to run the 3.5.3 release branch through regression scenarios; should be done in an hour or so.

Future plan:

  • I've started working on tests that could potentially catch future regressions, but they probably won't be done before 3.5.3 is ready. Going to flesh this out so we can have regular testing of main somewhere.
  • Going to think more deeply about how we can do scale/soak testing for future releases (aside from just jepsen). [need to file an issue]

Update: 18:14

Watch and Lock tests failed on my first run; investigating whether it was a jepsen issue or an etcd issue.

@ptabor
Contributor

ptabor commented Apr 12, 2022

That list looks complete to me. Thank you @serathius and @ahrtr.

I think these items are below the bar (but calling them out explicitly):

  • submit & backport the data-inconsistency repro test
  • more defence in depth (e.g. verification that consistency index does not decrease)

@serathius: Out of curiosity: have you been running the repro test for several hours with patched 3.5.2, or just the O(10) min test?

@endocrimes
Contributor

Jepsen Watch test failures seem to be a jepsen issue, but I haven't run into those issues in any of the previous version runs, so I'm... curious.

Moving on to lock tests for now.

@endocrimes
Contributor

Locks (leases) seem to be consistently flaky: the lease ends up closing early, and thus the test fails. Gonna investigate further after dinner.

@endocrimes
Contributor

endocrimes commented Apr 12, 2022

Gonna look into this with fresh eyes tomorrow - getting a headache - but it feels like potentially a clojure/client issue

@dims
Contributor

dims commented Apr 12, 2022

Sooo close!!

@ahrtr
Member

ahrtr commented Apr 12, 2022

I think it's worth mentioning @serathius's PR pull/13927. Running the load test against a mounted RAM filesystem might be an easier way to reproduce the data inconsistency issue. I will give it a try.
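As a hedged illustration of the RAM-filesystem setup (the paths and the helper below are illustrative sketches, not code from pull/13927), a small Go check can confirm a candidate data directory really is tmpfs-backed before starting the load test:

```go
package main

import (
	"fmt"
	"log"
	"syscall"
)

// tmpfsMagic is Linux's TMPFS_MAGIC from statfs(2); a directory on a
// tmpfs mount is RAM-backed, so fsync is cheap and the load test can
// cycle much faster than on disk.
const tmpfsMagic = 0x01021994

// isRAMBacked reports whether path lives on a tmpfs filesystem.
func isRAMBacked(path string) (bool, error) {
	var st syscall.Statfs_t
	if err := syscall.Statfs(path, &st); err != nil {
		return false, err
	}
	return st.Type == tmpfsMagic, nil
}

func main() {
	// /dev/shm is tmpfs on most Linux systems; pointing etcd's
	// --data-dir at a directory under it is one way to get the
	// RAM-backed setup described above.
	ok, err := isRAMBacked("/dev/shm")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("/dev/shm RAM-backed:", ok)
}
```

This is Linux-specific; the same idea on other platforms would need a different filesystem probe.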

> Locks (leases) seem to be consistently flaky: the lease ends up closing early, and thus the test fails. Gonna investigate further after dinner.

Please note that I recently fixed a lease-related issue (see pull/13932) which isn't included in 3.5.2. Do you still see the flakiness on release-3.5 (3.5.3)? If yes, please file an issue and let's discuss it there.

> submit & backport the data-inconsistency repro test

@serathius already submitted a PR for this, and I will take a closer look today and tomorrow.

> more defence in depth (e.g. verification that consistency index does not decrease)

Yes, it's a good point. The consistent_index should never decrease based on the current design/implementation, but the term might. I am not totally convinced that PR pull/13903 fixes issue issues/13514, because raft should reject a message with an old term.
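A minimal sketch of the defence-in-depth idea above (hypothetical names, not etcd's actual cindex API): a setter that refuses to move the consistent index backwards, so a would-be silent corruption surfaces as an error instead:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// consistentIndexGuard is a hypothetical defence-in-depth wrapper: per
// the current design the applied consistent index must never decrease,
// so any attempt to move it backwards is reported as an error instead
// of being written silently.
type consistentIndexGuard struct {
	index atomic.Uint64
}

// Set records v, or returns an error if v would move the index backwards.
func (g *consistentIndexGuard) Set(v uint64) error {
	for {
		cur := g.index.Load()
		if v < cur {
			return fmt.Errorf("consistent index moved backwards: %d -> %d", cur, v)
		}
		if g.index.CompareAndSwap(cur, v) {
			return nil
		}
	}
}

func main() {
	var g consistentIndexGuard
	fmt.Println(g.Set(5)) // advancing is fine
	fmt.Println(g.Set(3)) // a decrease is rejected with an error
}
```

In a real server the error would presumably trigger an alarm or a panic rather than just being returned; this only illustrates the check itself.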

@spzala
Member

spzala commented Apr 13, 2022

> Qualification work is almost finished; I would like to prepare to start the release process.
>
> First, let's collect acks from the people involved: @ptabor @spzala @ahrtr @endocrimes Please let me know if there is anything else that should be done before the release.

@serathius awesome, thanks so much! It looks good to me. If we are running more tests, that's great; otherwise, given the positive feedback on the data inconsistency fix and the tests you and @ahrtr ran, +1 for the release work. We can iterate on the release with any new findings, post-release feedback, and the test improvements under discussion.

@serathius
Member Author

> @serathius: Out of curiosity: have you been running the repro test for several hours with patched 3.5.2, or just the O(10) min test?

I ran:

@serathius
Member Author

> Yes, it's a good point. The consistent_index should never decrease based on the current design/implementation, but the term might. I am not totally convinced that PR pull/13903 fixes issue issues/13514, because raft should reject a message with an old term.

You are right that raft should never allow a message with an old term; however, #13514 covers a downgrade flow that is not officially supported, went around the downgrade check, and showed us that the term in the db cannot be trusted in v3.5. The goal of #13903 was not to fix downgrades, but to remove the onlyGrow check that turned out to have been incorrectly introduced during a refactor.
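To illustrate the distinction being discussed (all names here are hypothetical sketches, not the actual #13903 code): an "only grow" guard on the persisted term silently drops legitimate writes once a stale, larger term is in the db, whereas an unconditional setter records whatever value raft has already validated:

```go
package main

import "fmt"

// termStore is a hypothetical stand-in for the persisted term
// discussed above; it is not etcd's actual storage code.
type termStore struct {
	term uint64
}

// setTermOnlyGrow mirrors the incorrect guard: when a stale, larger
// term is already in the db (e.g. left behind by an unsupported
// downgrade), legitimate smaller values are silently dropped.
func (s *termStore) setTermOnlyGrow(t uint64) {
	if t > s.term {
		s.term = t
	}
}

// setTerm is the corrected behaviour: raft has already validated the
// message, so the store records the value it is given.
func (s *termStore) setTerm(t uint64) {
	s.term = t
}

func main() {
	s := &termStore{term: 10} // stale term left behind in the db
	s.setTermOnlyGrow(5)
	fmt.Println("only-grow keeps stale term:", s.term)
	s.setTerm(5)
	fmt.Println("unconditional set records:", s.term)
}
```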

@ahrtr
Member

ahrtr commented Apr 13, 2022

We might need to get this issue 13937 sorted out before releasing 3.5.3.

@serathius
Member Author

@ahrtr Is this a new problem in v3.5? Based on the original issue it seems to be present in v3.3. For v3.5.3 our goal should be addressing new issues that were introduced in v3.5, to ensure there are no regressions when upgrading from v3.4 to v3.5.

The fact that there is a long-standing issue is worrying; however, it should not block the v3.5.3 release.

@shalinmangar

> @shalinmangar in #13839 (comment) you mentioned that you have experience with jepsen. It would be great if you could help @endocrimes, as we cannot reproduce the v3.5.2 data inconsistencies there.

Happy to help. I see @endocrimes has already made great progress. Is there an issue for Jepsen testing, or should we communicate over the etcd-dev mailing list?

@endocrimes
Contributor

@shalinmangar #13939

@serathius
Member Author

Talked with @endocrimes; the lease flakes should not impact the v3.5.3 release.
With that, I will start the release process.

@dims
Contributor

dims commented Apr 13, 2022

w00t! sounds good @serathius

@serathius
Member Author

OK, after 4 hours, with breaks for restoring sanity and eating, I managed to successfully run the release script.

@ahrtr
Member

ahrtr commented Apr 13, 2022

> @ahrtr Is this a new problem in v3.5? Based on the original issue it seems to be present in v3.3. For v3.5.3 our goal should be addressing new issues that were introduced in v3.5, to ensure there are no regressions when upgrading from v3.4 to v3.5.
>
> The fact that there is a long-standing issue is worrying; however, it should not block the v3.5.3 release.

My immediate feeling is that it's a legacy issue rather than a regression. Will take a closer look today. I don't think it's a blocker for 3.5.3 either.

@ahrtr
Member

ahrtr commented Apr 14, 2022

> @ahrtr Is this a new problem in v3.5? Based on the original issue it seems to be present in v3.3. For v3.5.3 our goal should be addressing new issues that were introduced in v3.5, to ensure there are no regressions when upgrading from v3.4 to v3.5.
>
> The fact that there is a long-standing issue is worrying; however, it should not block the v3.5.3 release.

This turns out to be a regression. But the good news is that it should not be a big problem. Please see the explanation in pull/13942.
