fix(sdk): harden internal thread management in SystemMetrics #4439
Codecov Report
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4439      +/-   ##
==========================================
+ Coverage   83.07%   83.09%   +0.01%
==========================================
  Files         267      267
  Lines       33739    33750      +11
==========================================
+ Hits        28030    28044      +14
+ Misses       5709     5706       -3
Approving, since it is likely fine, but there are some small things to consider.
try:
    return self.repo.git.rev_parse("--show-toplevel")
except exc.GitCommandError as e:
    logger.error(f"git root error: {e}")
not a blocker
we should probably capture some telemetry for this if it is easy..
hmm... what would we do with that information? it usually happens when the home dir of the user running the script is owned by someone else.
it's easy to do, just kind of clumsy: I'd set a flag on the GitRepo object if that happened, then would check for that in Settings as it's one of the places where it's used, and then pick it up in init. Should I still do it?
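A rough sketch of the flag idea, for discussion only. Every name here (`GitRepoSketch`, `root_probe_failed`, the injected `probe_root` callable, the telemetry key) is hypothetical and not taken from the SDK; the real code would catch `exc.GitCommandError` rather than a broad `Exception`.

```python
from typing import Callable, Optional


class GitRepoSketch:
    """Hypothetical stand-in for the GitRepo object; not the SDK class."""

    def __init__(self, probe_root: Callable[[], str]) -> None:
        # probe_root stands in for repo.git.rev_parse("--show-toplevel").
        self._probe_root = probe_root
        # Flag that a later stage (Settings, then init) could pick up.
        self.root_probe_failed = False

    @property
    def root(self) -> Optional[str]:
        try:
            return self._probe_root()
        except Exception:  # assumption: real code catches exc.GitCommandError
            self.root_probe_failed = True
            return None


def failing_probe() -> str:
    # Simulates git refusing to report the top-level directory.
    raise RuntimeError("simulated git root failure")


repo = GitRepoSketch(failing_probe)
root = repo.root
# Whatever consumes the flag would read it after the probe has run.
telemetry = {"git_root_probe_failed": repo.root_probe_failed}
```

The clumsy part is visible even here: the flag is only meaningful after `root` has been accessed, so the consumer has to know about that ordering.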
if self._process is not None: | ||
if self._process is None: | ||
return None | ||
if self._process.is_alive(): |
I'm not so sure about this is_alive - join pairing.
I think the only thing that saves you is if shutdown_event is always set in this case? Is it?
Otherwise your monitor loop might still be called on a thread that was just about to be started, even though you just nulled out the process attribute.
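To make the concern concrete, here is a minimal sketch of the ordering that would make a late start harmless, assuming the monitor loop checks the shutdown event before doing any work. `MonitorSketch`, `samples`, and the loop body are made up for illustration and are not the actual SystemMetrics code.

```python
import threading
from typing import Optional


class MonitorSketch:
    """Hypothetical stand-in for the monitor; not the SDK's actual class."""

    def __init__(self) -> None:
        self._shutdown_event = threading.Event()
        self.samples = 0
        self._process: Optional[threading.Thread] = threading.Thread(
            target=self._monitor, daemon=True
        )

    def _monitor(self) -> None:
        # The event is checked before any work is done, so a thread that
        # starts after finish() exits without sampling anything.
        while not self._shutdown_event.is_set():
            self.samples += 1
            self._shutdown_event.wait(0.01)

    def finish(self) -> None:
        # Set the event first, then drop the reference.
        self._shutdown_event.set()
        self._process = None


monitor = MonitorSketch()
late_thread = monitor._process  # another code path still holds the thread
monitor.finish()                # attribute is nulled before start() happens
late_thread.start()             # the late start still enters _monitor
late_thread.join(timeout=1)
print(monitor.samples)
```

If `finish()` nulled the attribute without setting the event, the late thread would keep sampling with nothing left to stop it.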
Removed the is_alive check, it's useless and won't fix anything. Replaced it with a try/except: if the thread has not been started for some reason (in Colab, we would stop the system metrics process with a pause request, so I could imagine something going wrong just before we do that while things were still being set up), attempting to join it won't fail, but we'd still null the attribute.
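For reference, a self-contained sketch of the guarded join described above. The class and attribute names (`MetricsSketch`, `join_error`) are illustrative assumptions, not the code in this PR; the one fact relied on is that Python's `threading.Thread.join()` raises `RuntimeError` when the thread was never started.

```python
import threading
from typing import Optional


class MetricsSketch:
    """Hypothetical holder of a monitoring thread; names are illustrative."""

    def __init__(self) -> None:
        self._shutdown_event = threading.Event()
        # The thread simply blocks until the shutdown event is set.
        self._process: Optional[threading.Thread] = threading.Thread(
            target=self._shutdown_event.wait, daemon=True
        )
        self.join_error: Optional[RuntimeError] = None

    def start(self) -> None:
        if self._process is not None:
            self._process.start()

    def finish(self) -> None:
        if self._process is None:
            return None
        self._shutdown_event.set()
        try:
            self._process.join()
        except RuntimeError as e:
            # Raised by threading when the thread was never started.
            self.join_error = e
        # Null the attribute whether or not the join succeeded.
        self._process = None


never_started = MetricsSketch()
never_started.finish()  # join raises internally, finish() still completes

started = MetricsSketch()
started.start()
started.finish()        # normal path: thread exits and is joined
```

Both paths end with `_process` set to `None`; only the never-started one records a `RuntimeError`.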
Fixes WB-NNNN
Fixes #NNNN
Description
Seeing `cannot join thread before it is started` errors in Sentry after we released 0.13.5, triggered by an `asset.finish()` statement in a Colab environment. Attempted many things, but could not repro. To mitigate, added guardrails around thread joining to SystemMetrics-related code. There were also errors in git root probing, so added some guardrails there as well.
Testing
Tried to break things with Colab, but no luck, unfortunately :(
Checklist