fix(artifacts): when committing artifacts, don't retry 409 Conflict errors #4260

Merged
speezepearson merged 19 commits into main from spencerpearson/no-retry-conflict on Nov 4, 2022

Conversation

@speezepearson speezepearson commented Sep 10, 2022

Fixes WB-10937

Description

Conflicts during artifact-commit can arise for two reasons:

  1. Concurrent runs are attempting to commit new versions of the same artifact; they race to grab the next version-index; one wins, and the other gets rejected because that version-index is now taken. (see https://wandb.atlassian.net/browse/WB-10808 )

    The server is wrong to return a 409 in this case: it should be a 500, because (a) it should be impossible and (b) it's transient. This is fixed in https://github.com/wandb/core/pull/10765 .

  2. Two runs try to commit different versions at the same time; the two versions both added a file with digest abcdef, so each version thinks it “owns” that file. One artifact is committed first; the other, then, can’t be, because its manifest claims that it created this file which actually already exists.

    The server is right to return a 409 in this case, and no number of retries will help. We should not retry this.
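The difference between the two cases can be illustrated with a toy in-memory model. This is only a sketch: `ToyArtifactStore`, `Conflict`, and `commit_with_retry` are hypothetical names invented for illustration, and the logic is a simplification, not the W&B server's implementation. It shows why a retry can resolve a version-index race but cannot resolve a file-digest ownership conflict.

```python
# Toy in-memory model of the two conflict cases. All names here are
# hypothetical; this is not the W&B server's implementation.


class Conflict(Exception):
    """Stands in for an HTTP 409 response."""


class ToyArtifactStore:
    def __init__(self):
        self.versions = {}    # version index -> set of digests it created
        self.file_owner = {}  # file digest -> version index that created it

    def next_index(self):
        return len(self.versions)

    def commit(self, index, created_digests):
        # Case 1: another run already took this version index.
        if index in self.versions:
            raise Conflict(f"version index {index} is taken")
        # Case 2: the manifest claims to have created a file that already exists.
        for digest in created_digests:
            if digest in self.file_owner:
                raise Conflict(f"conflict detected for file digest {digest}")
        self.versions[index] = set(created_digests)
        for digest in created_digests:
            self.file_owner[digest] = index


def commit_with_retry(store, created_digests, stale_index=None, attempts=3):
    """Retry by re-reading the next free index; return the index used, or None."""
    index = store.next_index() if stale_index is None else stale_index
    for _ in range(attempts):
        try:
            store.commit(index, created_digests)
            return index
        except Conflict:
            index = store.next_index()  # refresh and try again
    return None


store = ToyArtifactStore()
store.commit(0, {"aaa"})

# Case 1: a run holding stale index 0 loses the race, then succeeds on retry.
case1 = commit_with_retry(store, {"bbb"}, stale_index=0)

# Case 2: digest "aaa" is already owned, so every retry fails the same way.
case2 = commit_with_retry(store, {"aaa"})

print(case1, case2)
```

In this model the first conflict disappears once the losing run picks up a fresh index, while the second is a property of the manifest itself, so repeating the same request cannot succeed.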

Testing

Triggered both of the failure modes above:

  1. Code:

    for i in {1..3}; do F="temp.$i.txt"; (date; echo $i) > "$F"; python3 -c 'import wandb, sys; r = wandb.init(project="spencerpearson-artifacts"); a = wandb.Artifact("conflict-demo","dataset"); a.add_file(sys.argv[1], name="temp.txt"); r.log_artifact(a)' "$F" & done; wait

    Output:

    <...snip...>
    wandb: Waiting for W&B process to finish... (success).
    wandb: Waiting for W&B process to finish... (success).
    wandb: Waiting for W&B process to finish... (success).
    wandb: ERROR Error while calling W&B API: An internal error occurred. Please contact support. (<Response [500]>)
    wandb: ERROR Error while calling W&B API: An internal error occurred. Please contact support. (<Response [500]>)
    <...snip...>
    wandb: ERROR Error while calling W&B API: An internal error occurred. Please contact support. (<Response [500]>)
    wandb: Synced fast-feather-161: https://wandb.ai/wandb/spencerpearson-artifacts/runs/216enedk
    <...snip...>
    wandb: Synced unique-microwave-160: https://wandb.ai/wandb/spencerpearson-artifacts/runs/12bv3kjp
    <...snip...>
    wandb: Synced soft-sunset-159: https://wandb.ai/wandb/spencerpearson-artifacts/runs/3rbajign
    <...snip...>
    [1]   Done                    python3 -c 'import wandb, sys; r = wandb.init(project="spencerpearson-artifacts"); a = wandb.Artifact("conflict-demo","dataset"); a.add_file(sys.argv[1], name="temp.txt"); r.log_artifact(a)' "$F"
    [2]-  Done                    python3 -c 'import wandb, sys; r = wandb.init(project="spencerpearson-artifacts"); a = wandb.Artifact("conflict-demo","dataset"); a.add_file(sys.argv[1], name="temp.txt"); r.log_artifact(a)' "$F"
    [3]+  Done                    python3 -c 'import wandb, sys; r = wandb.init(project="spencerpearson-artifacts"); a = wandb.Artifact("conflict-demo","dataset"); a.add_file(sys.argv[1], name="temp.txt"); r.log_artifact(a)' "$F"
    ~/s/client (spencerpearson/no-retry-conflict) $
    

    All the artifacts were created successfully, as expected.

  2. Code:

    F="temp.txt"; date > "$F"; for i in {1..2}; do WANDB_BASE_URL=http://api.wandb.test python3 -c 'import wandb, sys; r = wandb.init(project="p1706"); a = wandb.Artifact("conflict-demo","dataset"); a.add_file(sys.argv[1], name="temp.txt"); r.log_artifact(a)' "$F" & done;

    Output:

    <...snip...>
    wandb: Waiting for W&B process to finish... (success).
    wandb: Waiting for W&B process to finish... (success).
    wandb: ERROR Error while calling W&B API: An internal error occurred. Please contact support. (<Response [500]>)
    <...snip...>
    wandb: ERROR Error while calling W&B API: conflict detected for file digest +IBy2j2t2lxTKyZYkIMNMg==, rebase required (<Response [409]>)
    Exception in thread Thread-12:
    Traceback (most recent call last):
    <...snip...>
    File "/Users/pears/src/client/wandb/filesync/step_upload.py", line 222, in _maybe_commit_artifact
        self._api.commit_artifact(artifact_id)
    <...snip...>
    File "/Users/pears/.pyenv/versions/example/lib/python3.9/site-packages/requests/models.py", line 1021, in raise_for_status
        raise HTTPError(http_error_msg, response=self)
    requests.exceptions.HTTPError: 409 Client Error: Conflict for url: http://api.wandb.test/graphql
    
    <...hanging...>
    ^C
    

    As expected, one of the runs uploaded an artifact successfully; the other did not.
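The behavior seen in the two outputs above (500s retried, a 409 surfaced immediately) can be sketched as a retry wrapper with a "do not retry" predicate. This is a minimal illustration under stated assumptions, not the actual change in this PR: `HTTPStatusError`, `is_permanent`, `call_with_retry`, and `make_flaky` are hypothetical names, and the backoff parameters are arbitrary.

```python
import time


class HTTPStatusError(Exception):
    """Hypothetical stand-in for an HTTP error carrying a status code."""

    def __init__(self, status_code):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


def is_permanent(exc):
    # Assumption for this sketch: only 409 Conflict is treated as permanent.
    return isinstance(exc, HTTPStatusError) and exc.status_code == 409


def call_with_retry(fn, max_attempts=5, base_delay=0.0, no_retry=is_permanent):
    """Call fn(); retry failures with exponential backoff unless no_retry(exc)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            # Give up immediately on permanent errors or when out of attempts.
            if no_retry(exc) or attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))


def make_flaky(statuses):
    """Return (fn, calls): fn raises each status in turn, then returns 'committed'."""
    remaining = list(statuses)
    calls = []

    def fn():
        calls.append(1)
        if remaining:
            raise HTTPStatusError(remaining.pop(0))
        return "committed"

    return fn, calls


# Two transient 500s, then success: three calls in total.
fn_500, calls_500 = make_flaky([500, 500])
result_500 = call_with_retry(fn_500)

# A 409 is raised on the first call and never retried.
fn_409, calls_409 = make_flaky([409, 409])
try:
    call_with_retry(fn_409)
    status_409 = None
except HTTPStatusError as exc:
    status_409 = exc.status_code
```

Keeping the decision in a predicate means the retry loop itself stays generic; only the classification of which errors are permanent changes.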

@speezepearson speezepearson changed the title from "on commitArtifact conflict, retry at a higher level" to "when committing artifacts, don't retry 409 Conflict errors" on Sep 13, 2022
@codecov
codecov bot commented Sep 13, 2022

Codecov Report

Merging #4260 (9c1fe42) into main (a3eec65) will increase coverage by 8.97%.
The diff coverage is n/a.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #4260      +/-   ##
==========================================
+ Coverage   74.06%   83.03%   +8.97%     
==========================================
  Files         258      258              
  Lines       32863    32862       -1     
==========================================
+ Hits        24339    27287    +2948     
+ Misses       8524     5575    -2949     
Flag        Coverage Δ
functest    56.86% <ø> (+<0.01%) ⬆️
unittest    73.13% <ø> (+15.87%) ⬆️

Impacted Files                                     Coverage Δ
wandb/sdk/internal/internal_api.py                 88.91% <ø> (+8.35%) ⬆️
wandb/sdk/service/server_sock.py                   91.87% <0.00%> (-1.02%) ⬇️
wandb/sdk/lib/sock_client.py                       93.12% <0.00%> (ø)
wandb/env.py                                       74.89% <0.00%> (+0.86%) ⬆️
wandb/sdk/internal/tb_watcher.py                   88.03% <0.00%> (+0.99%) ⬆️
wandb/sdk/lib/ipython.py                           41.89% <0.00%> (+1.35%) ⬆️
wandb/sdk/lib/mailbox.py                           93.49% <0.00%> (+1.36%) ⬆️
wandb/sdk/wandb_manager.py                         94.07% <0.00%> (+1.48%) ⬆️
wandb/sdk/data_types/helper_types/image_mask.py    90.62% <0.00%> (+1.56%) ⬆️
wandb/sdk/wandb_setup.py                           88.44% <0.00%> (+2.01%) ⬆️
... and 88 more

@kptkin kptkin left a comment

Overall, it makes sense to me. The only point is that for local server users, the first case you described will return 409 instead of 500 (until they update to the relevant version of the server), and hence this case will not be retried?

@speezepearson speezepearson changed the title from "when committing artifacts, don't retry 409 Conflict errors" to "fix(artifacts): when committing artifacts, don't retry 409 Conflict errors" on Sep 15, 2022
@speezepearson

> Overall, it makes sense to me. The only point is that for local server users, the first case you described will return 409 instead of 500 (until they update to the relevant version of the server), and hence this case will not be retried?

This is a good point! I've spent two months kinda wringing my hands about this.

I think it's worth merging anyway. We only started retrying 409s in late June (#3843) -- I bet the intersection of [people who are attached to this behavior] and [people who haven't upgraded their single-tenant installs and have good reasons not to do so] is small compared to the set of [people who wish we wouldn't hang indefinitely when they have multiple concurrent runs creating new versions of an artifact].

@jlzhao27
jlzhao27 commented Nov 1, 2022

Yeah, I agree with Spencer here; the current behavior of hanging users' scripts indefinitely seems a lot worse, in my opinion. I propose we merge this.

@speezepearson speezepearson merged commit 7454cc8 into main Nov 4, 2022
@speezepearson speezepearson deleted the spencerpearson/no-retry-conflict branch November 4, 2022 05:51
@kptkin kptkin added this to the sdk-2022-12.1 milestone Nov 4, 2022
andrewtruong pushed a commit that referenced this pull request Dec 2, 2022
…rrors (#4260)

Co-authored-by: Katia <katia@wandb.com>
Co-authored-by: Dmitry Duev <dima@wandb.com>