fix(artifacts): when artifact-commit 409s, retry entire artifact-creation, not just commit #4272

@speezepearson speezepearson commented Sep 15, 2022

Fixes WB-10946

Description

After #4260 merges, when the SDK tries to commit an artifact and gets back a "409 Conflict" status code, it will fail. That is better than what we do now, which is to retry the commitArtifact request indefinitely even when there is no hope of success (see WB-10888); but it would be even better to retry the operation at a higher level, so that the conflict actually gets resolved.

The right way to do this would be to have the SDK somehow figure out which files are responsible for the conflict (probably by asking the server); but, as a hopefully-temporary fix, we can just restart the entire "create-upload-commit" process from the beginning.
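
A minimal sketch of that fallback, for illustration only. Every name here is hypothetical (`ConflictError`, `save_with_restart`, and the `create`/`upload`/`commit` callables are not the SDK's actual API), and the attempt cap and backoff are assumptions rather than what this PR implements. The point is just that a conflict at commit time reruns the whole sequence instead of only the commit step:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ConflictError(Exception):
    """Hypothetical stand-in for a commit rejected with 409 Conflict."""


def save_with_restart(
    create: Callable[[], T],
    upload: Callable[[T], None],
    commit: Callable[[T], None],
    max_attempts: int = 5,
    base_delay_s: float = 0.0,
) -> T:
    """Run create -> upload -> commit, restarting from `create` on conflict."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        artifact = create()  # a fresh draft is built on every attempt
        upload(artifact)
        try:
            commit(artifact)
            return artifact
        except ConflictError:
            if attempt == max_attempts:
                raise  # give up rather than retry forever
            # Simple exponential backoff between whole-operation attempts.
            time.sleep(base_delay_s * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Retrying only `commit` would resend the same stale request, which is why the loop wraps all three steps.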

Testing

Manually tested with:

$ F="temp.txt"; date > "$F"; for i in {1..2}; do python3 -c 'import wandb, sys; r = wandb.init(project="p1706"); a = wandb.Artifact("conflict-demo","dataset"); a.add_file(sys.argv[1], name="temp.txt"); r.log_artifact(a)' "$F" & done; wait
[1] 93566
[2] 93567
wandb: Currently logged in as: spencerpearson-wandb (wandb). Use `wandb login --relogin` to force relogin
wandb: Currently logged in as: spencerpearson-wandb (wandb). Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.13.4.dev1
wandb: Run data is saved locally in /Users/pears/src/client/wandb/run-20220915_141957-1rtg6n3e
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run eager-shape-19
wandb: ⭐️ View project at https://wandb.ai/wandb/p1706
wandb: 🚀 View run at https://wandb.ai/wandb/p1706/runs/1rtg6n3e
wandb: Tracking run with wandb version 0.13.4.dev1
wandb: Run data is saved locally in /Users/pears/src/client/wandb/run-20220915_141957-3aab449z
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run volcanic-sponge-20
wandb: ⭐️ View project at https://wandb.ai/wandb/p1706
wandb: 🚀 View run at https://wandb.ai/wandb/p1706/runs/3aab449z
wandb: Waiting for W&B process to finish... (success).
wandb: Waiting for W&B process to finish... (success).
wandb: ERROR Error while calling W&B API: An internal error occurred. Please contact support. (<Response [500]>)
wandb: ERROR Error while calling W&B API: conflict detected for file digest X466QBdbS1uW4VbfmamGzA==, rebase required (<Response [409]>)
wandb:
wandb: Synced volcanic-sponge-20: https://wandb.ai/wandb/p1706/runs/3aab449z
wandb: Synced 4 W&B file(s), 0 media file(s), 1 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20220915_141957-3aab449z/logs
wandb:
wandb: Synced eager-shape-19: https://wandb.ai/wandb/p1706/runs/1rtg6n3e
wandb: Synced 4 W&B file(s), 0 media file(s), 1 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20220915_141957-1rtg6n3e/logs
[1]-  Done                    python3 -c 'import wandb, sys; r = wandb.init(project="p1706"); a = wandb.Artifact("conflict-demo","dataset"); a.add_file(sys.argv[1], name="temp.txt"); r.log_artifact(a)' "$F"
[2]+  Done                    python3 -c 'import wandb, sys; r = wandb.init(project="p1706"); a = wandb.Artifact("conflict-demo","dataset"); a.add_file(sys.argv[1], name="temp.txt"); r.log_artifact(a)' "$F"

Both of those runs (eager-shape-19 and volcanic-sponge-20) successfully uploaded their artifacts. (Before, one would win, and the other would just fail forever.)

codecov bot commented Sep 15, 2022

Codecov Report

Merging #4272 (1c719e0) into spencerpearson/no-retry-conflict (316a38a) will decrease coverage by 0.21%.
The diff coverage is 73.58%.

Additional details and impacted files

@@                         Coverage Diff                          @@
##           spencerpearson/no-retry-conflict    #4272      +/-   ##
====================================================================
- Coverage                             82.75%   82.54%   -0.22%     
====================================================================
  Files                                   256      244      -12     
  Lines                                 32630    31585    -1045     
====================================================================
- Hits                                  27004    26072     -932     
+ Misses                                 5626     5513     -113     

Flag      Coverage Δ
functest  55.77% <73.58%> (-0.20%) ⬇️
unittest  73.11% <73.58%> (-0.40%) ⬇️

Impacted Files                       Coverage Δ
wandb/sdk/internal/sender.py         90.97% <ø> (ø)
wandb/sdk/internal/artifacts.py      79.84% <72.09%> (-7.66%) ⬇️
wandb/filesync/step_upload.py        93.60% <80.00%> (-1.45%) ⬇️
wandb/sdk/internal/internal_api.py   86.58% <0.00%> (-0.39%) ⬇️
wandb/sdk/wandb_run.py               90.89% <0.00%> (-0.25%) ⬇️
wandb/cli/cli.py                     68.80% <0.00%> (-0.10%) ⬇️
wandb/catboost/__init__.py
wandb/lightgbm/__init__.py
wandb/sacred/__init__.py
... and 11 more

@@ -30,7 +44,13 @@ def __call__(
pass


def _manifest_json_from_proto(manifest: "wandb_internal_pb2.ArtifactManifest") -> Dict:
class ArtifactCommitError(Exception):
What is your sense of putting this error in the same location as the other errors (i.e. errors/__init__.py)? Honestly, we are not consistent about it, so it is fine either way.

@kptkin kptkin left a comment

Overall, it makes sense to me, based on the description. The only thing is that it is a bit of a different pattern, with the retry being part of the save logic. Also, the retry function logic feels a bit forced, in that it returns both a bool and a timedelta.
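
To illustrate that last point, here is a sketch that is not code from this PR. The function names, the status-code handling, and the backoff values are all hypothetical. A retry predicate that returns both a bool and a timedelta could be collapsed into one that returns `Optional[timedelta]`, where `None` means "do not retry":

```python
from datetime import timedelta
from typing import Optional, Tuple


def retry_delay_tuple(status_code: int, attempt: int) -> Tuple[bool, timedelta]:
    """Shape described above: (should_retry, how_long_to_wait)."""
    if status_code == 409:
        # The delay carries no meaning when the answer is "do not retry".
        return False, timedelta(0)
    return True, timedelta(seconds=min(2 ** attempt, 60))


def retry_delay_optional(status_code: int, attempt: int) -> Optional[timedelta]:
    """Alternative shape: None means "do not retry", otherwise wait this long."""
    if status_code == 409:
        return None
    return timedelta(seconds=min(2 ** attempt, 60))
```

Whether the single-value form reads better is a style judgment; both carry the same information.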

@speezepearson speezepearson marked this pull request as ready for review September 19, 2022 22:25
@speezepearson speezepearson requested a review from a team as a code owner September 22, 2022 06:02
@speezepearson

This is not a satisfactory solution: see https://www.notion.so/wandbai/Artifact-Rebasing-6dd7364045d04141a1a1eae5a926798d

2 participants