New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(artifacts): when artifact-commit 409s, retry entire artifact-creation, not just commit #4272
fix(artifacts): when artifact-commit 409s, retry entire artifact-creation, not just commit #4272
Conversation
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## spencerpearson/no-retry-conflict #4272 +/- ##
====================================================================
- Coverage 82.75% 82.54% -0.22%
====================================================================
Files 256 244 -12
Lines 32630 31585 -1045
====================================================================
- Hits 27004 26072 -932
+ Misses 5626 5513 -113
Flags with carried forward coverage won't be shown. Click here to find out more.
|
…etry-conflict-higher-level
@@ -30,7 +44,13 @@ def __call__( | |||
pass | |||
|
|||
|
|||
def _manifest_json_from_proto(manifest: "wandb_internal_pb2.ArtifactManifest") -> Dict: | |||
class ArtifactCommitError(Exception): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is your sense of having this error in the same location as other errors (i.e. errors/init.py). Honestly, we are not consistent about it, so it is fine either way...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, it makes sense to me, based on the description. The only thing is that it is a bit of a different pattern with having the retry part of the save logic also the retry function logic is a bit forced that it returns both bool and timedelta
…etry-conflict-higher-level
…-conflict-higher-level
This is not a satisfactory solution: see https://www.notion.so/wandbai/Artifact-Rebasing-6dd7364045d04141a1a1eae5a926798d |
Fixes WB-10946
Description
After #4260 merges, when the SDK tries to commit an artifact and gets back a "409 Conflict" status code, it will fail. That's better than what we do now, which is retry the
commitArtifact
request indefinitely even if there's no hope of success (see WB-10888); but it'd be even better to retry the operation at a higher level to resolve the conflict.The right way to do this would be to have the SDK somehow figure out which files are responsible for the conflict (probably by asking the server); but, as a hopefully-temporary fix, we can just restart the entire "create-upload-commit" process from the beginning.
Testing
Manually tested with:
Both those runs (1, 2) successfully uploaded their artifacts. (Before, one would win, and the other would just fail forever.)