Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(sdk): harden internal thread management in SystemMetrics #4439

Merged
merged 12 commits into from Dec 5, 2022
Merged

Conversation

dmitryduev
Copy link
Member

@dmitryduev dmitryduev commented Nov 4, 2022

Fixes WB-NNNN
Fixes #NNNN

Description

  • Saw a few cannot join thread before it is started errors in Sentry after we released 0.13.5 being triggered by an asset.finish() statement in a Colab environment. Attempted many things, but could not repro. To mitigate, added guardrails around thread joining to SystemMetrics-related code.
  • Saw another error like the following:
Cmd('git') failed due to: exit code(128)
  cmdline: git rev-parse --show-toplevel
  stderr: 'fatal: unsafe repository ('/home/<user>' is owned by someone else)

in git root probing and added some guardrails there as well.

Testing

Tried to break things with Colab, but no luck, unfortunately :(

Checklist

  • Include reference to internal ticket "Fixes WB-NNNN" and/or GitHub issue "Fixes #NNNN" (if applicable)
  • Ensure PR title compliance with the conventional commits standards

@codecov
Copy link

codecov bot commented Nov 4, 2022

Codecov Report

Merging #4439 (a0606c2) into main (202de65) will increase coverage by 0.01%.
The diff coverage is 46.66%.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #4439      +/-   ##
==========================================
+ Coverage   83.07%   83.09%   +0.01%     
==========================================
  Files         267      267              
  Lines       33739    33750      +11     
==========================================
+ Hits        28030    28044      +14     
+ Misses       5709     5706       -3     
Flag Coverage Δ
functest 55.83% <46.66%> (+0.03%) ⬆️
unittest 73.43% <46.66%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
wandb/sdk/wandb_settings.py 94.46% <ø> (ø)
wandb/sdk/lib/git.py 77.71% <40.00%> (-1.31%) ⬇️
wandb/sdk/internal/system/assets/interfaces.py 91.95% <50.00%> (-3.23%) ⬇️
wandb/sdk/internal/system/system_monitor.py 90.72% <50.00%> (-1.84%) ⬇️
wandb/integration/tensorboard/monkeypatch.py 91.35% <0.00%> (-2.47%) ⬇️
wandb/sdk/wandb_run.py 90.89% <0.00%> (+0.05%) ⬆️
wandb/sdk/internal/internal_api.py 89.79% <0.00%> (+0.51%) ⬆️
wandb/sdk/lib/printer.py 89.24% <0.00%> (+0.63%) ⬆️
wandb/sdk/lib/sock_client.py 94.17% <0.00%> (+1.05%) ⬆️
... and 1 more

@dmitryduev dmitryduev enabled auto-merge (squash) November 8, 2022 07:16
@dmitryduev dmitryduev requested a review from a team November 8, 2022 07:16
@dmitryduev dmitryduev added this to the sdk-2022-11.2 milestone Nov 10, 2022
Copy link
Member

@raubitsj raubitsj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

approve that it is likely fine, but some small things to consider.

try:
return self.repo.git.rev_parse("--show-toplevel")
except exc.GitCommandError as e:
logger.error(f"git root error: {e}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not a blocker
we should probably capture some telemetry for this if it is easy..

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm... what would we do with that information? it usually happens when the home dir of the user running the script is owned by someone else.
it's easy to do, just kind of clumsy: I'd set a flag on the GitRepo object if that happened, then would check for that in Settings as it's one of the places where it's used, and then pick it up in init. Should I still do it?

if self._process is not None:
if self._process is None:
return None
if self._process.is_alive():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

im not so sure about this is_alive - join pairing.
I think the thing that saves you is only if shutdown_event is alway set in this case? is it?

otherwise your monitor loop might still be called on a thread that was just about to be started even thbough you just nulled out the process attribute.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the is_alive check, it's useless and won't fix anything. Replaced with a try/except: if the thread has not been started for some reason (in colab, we would stop the system metrics process with a pause request, so I could imagine something going wrong just before we do that while thing were still being set up), attempting to join it won't fail, but we'd still null the attribute.

@dmitryduev dmitryduev modified the milestones: sdk-2022-11.2, sdk-2022-12.1 Dec 3, 2022
@dmitryduev dmitryduev requested review from raubitsj and a team December 5, 2022 07:53
@github-actions github-actions bot added cc-fix and removed cc-fix labels Dec 5, 2022
@dmitryduev dmitryduev merged commit 2c59875 into main Dec 5, 2022
@dmitryduev dmitryduev deleted the sm-join branch December 5, 2022 23:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants