New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(artifacts): when committing artifacts, don't retry 409 Conflict errors #4260
Conversation
d8c7673
to
a1c0196
Compare
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## main #4260 +/- ##
==========================================
+ Coverage 74.06% 83.03% +8.97%
==========================================
Files 258 258
Lines 32863 32862 -1
==========================================
+ Hits 24339 27287 +2948
+ Misses 8524 5575 -2949
Flags with carried forward coverage won't be shown. Click here to find out more.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, it make sense to me. The only point is that for local server users the first case you described will return 409 instead of 500 (until they update to the relevant version of the server), and hence will not retry this case?
…no-retry-conflict
This is a good point! I've spent two months kinda wringing my hands about this. I think it's worth merging anyway. We only started retrying 409s in late June anyway (#3843) -- I bet the intersection of [people who are attached to this behavior] and [people who haven't upgraded their single-tenant installs and have good reasons to not do so] is small compared to the set of [people who wish we wouldn't hang indefinitely when they have multiple concurrent runs creating new versions of an artifact]. |
Ya I agree with Spencer here, the current behavior of hanging user's scripts indefinitely seems a lot worse in my opinion. I propose we merge this. |
…on/no-retry-conflict
…rrors (#4260) Co-authored-by: Katia <katia@wandb.com> Co-authored-by: Dmitry Duev <dima@wandb.com>
Fixes WB-10937
Description
Conflicts during artifact-commit can arise for two reasons:
Concurrent runs are attempting to commit new versions of the same artifact; they race to grab the next version-index; one wins, and the other gets rejected because that version-index is now taken. (see https://wandb.atlassian.net/browse/WB-10808 )
The server is wrong to return a 409 in this case: it should be a 500, because (a) it should be impossible and (b) it's transient. This is fixed in https://github.com/wandb/core/pull/10765 .
Two runs try to commit different versions at the same time; the two versions both added a file with digest
abcdef
, so each version thinks it “owns” that file. One artifact is committed first; the other, then, can’t be, because its manifest claims that it created this file which actually already exists.The server is right to return a 409 in this case, and no number of retries will help. We should not retry this.
Testing
Triggered both of the ^above failure modes:
Code:
Output:
All the artifacts were created successfully, as expected.
Code:
Output:
As expected one of the runs uploaded an artifact successfully; the other did not.