Shutdown statsbeat after failure threshold is met #1127

Merged (9 commits) on Jun 14, 2022

Contributor

@lzchen lzchen commented Jun 7, 2022

Following specs

Similar to Node.js, retry only occurs on successes (200), so shutdown occurs only when 3 attempts are reached.

@hectorhdzg @heyams

@lzchen lzchen requested review from a team, aabmass, hectorhdzg and songy23 as code owners June 7, 2022 20:10
        not state.get_statsbeat_initial_success():
    # If the ingestion failure threshold is reached during statsbeat
    # initialization, return the code that signals shutdown
    if _statsbeat_failed_to_ingest():
        return -2
Collaborator

It would be nice to use some kind of constant instead of the literal value here. -2 is the shutdown signal, but looking at the code here I have no clue what -1 means.

Contributor Author

@lzchen lzchen Jun 8, 2022

-1 is the exception signal for telemetry in general. -2 is the shutdown signal for the statsbeat exporter only. I agree a constant would be better, but that would probably require a refactor of all the return signals, which I would prefer leaving to a different PR.

Collaborator

Well, using these kinds of numbers instead of enums or constants is usually considered pretty bad practice in other languages, since the code becomes harder for other developers to understand and maintain. Maybe this is the way to go in Python; just my two cents here.
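For illustration only, here is a minimal sketch of how the two codes mentioned in this thread could be given names in Python. `TransportStatus`, its member names, and `should_shutdown` are hypothetical and are not part of the exporter; only the values -1 and -2 and their meanings come from the discussion above.

```python
from enum import IntEnum


class TransportStatus(IntEnum):
    """Hypothetical names for the return codes discussed in this thread."""

    EXCEPTION = -1           # exception signal for telemetry in general
    STATSBEAT_SHUTDOWN = -2  # shutdown signal, statsbeat exporter only


def should_shutdown(result, is_stats_exporter):
    """Hypothetical helper mirroring the `result == -2` check in the diff."""
    # IntEnum members compare equal to plain ints, so a caller that still
    # returns the literal -2 would be handled the same way.
    return is_stats_exporter and result == TransportStatus.STATSBEAT_SHUTDOWN
```

Because `IntEnum` members are integers, names like these could in principle be introduced gradually without touching every return site at once, which is the kind of refactor deferred to a separate PR here.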

Is -2 introduced in this PR? If so, how much work would it take to refactor?

Contributor Author

@hectorhdzg @heyams
Created a new issue to track this refactor: #1128

@@ -143,6 +155,17 @@ def _transmit(self, envelopes):
                data = json.loads(text)
            except Exception:
                pass

            if self._is_stats_exporter() and \
                    not state.get_statsbeat_shutdown() and \
Collaborator

I can see you check whether shutdown was called in several places. Is the exporter process for Statsbeat expected to keep running after shutdown?

Contributor Author

This is for very specific race conditions in which multiple threads could be accessing the same piece of "check if we need to shutdown" logic. It also serves as a good sanity check to prevent any statsbeat logic from executing if the statsbeat exporter is already shut down.
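As a rough sketch of the race described above, assuming a module-level flag guarded by a lock: several threads may reach the "do we need to shut down" check at the same moment, but only one of them should perform the shutdown. The names `get_shutdown` and `shutdown_once` are hypothetical and do not reflect the actual `state` module.

```python
import threading

_lock = threading.Lock()
_shutdown = False  # hypothetical module-level shutdown flag


def get_shutdown():
    """Return whether shutdown has already been requested."""
    with _lock:
        return _shutdown


def shutdown_once():
    """Flip the flag; return True only for the first caller that does so."""
    global _shutdown
    with _lock:
        if _shutdown:
            # Another thread already shut down; nothing left to do.
            return False
        _shutdown = True
        return True
```

With a guard like this, checking the flag again before running statsbeat logic is cheap, and it keeps late-arriving threads from acting on an exporter that is already shut down.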

@heyams heyams commented Jun 8, 2022

@lzchen you're tagging the wrong Helen.

contrib/opencensus-ext-azure/CHANGELOG.md Outdated
@@ -71,6 +71,11 @@ def export_metrics(self, metrics):
    for batch in batched_envelopes:
        batch = self.apply_telemetry_processors(batch)
        result = self._transmit(batch)
        # If statsbeat exporter and received signal to shutdown
        if self._is_stats_exporter() and result == -2:
@heyams heyams Jun 8, 2022

_statsbeat_failed_to_ingest above could return the counter, and the code here could just check whether the counter is >= 3. -2 seems so random.

Contributor Author

@lzchen lzchen Jun 8, 2022

_statsbeat_failed_to_ingest is a private function used only within transport to handle the count and to determine whether the threshold is reached. Returning the result code back to the exporter is by design. I agree the codes are a bit random (-1, -2, etc.), but changing them can be part of a different PR. See my response here as well.
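The split of responsibilities described above could be sketched roughly as follows: a private helper owns the failure count and the threshold check, and the transport only hands a result code back to the caller. This is an assumption-laden illustration, not the merged code. `FAILURE_THRESHOLD`, `_failed_to_ingest`, and `transmit_failed` are hypothetical names, the threshold of 3 is taken from the PR description, and thread safety is left out for brevity.

```python
FAILURE_THRESHOLD = 3  # "3 attempts", per the PR description
_failure_count = 0     # hypothetical module-level counter


def _failed_to_ingest():
    """Record one failed ingestion; return True once the threshold is met."""
    global _failure_count
    _failure_count += 1
    return _failure_count >= FAILURE_THRESHOLD


def transmit_failed():
    """Stand-in for a failed transmit: pick the code handed to the exporter."""
    if _failed_to_ingest():
        return -2  # shutdown signal, statsbeat exporter only
    return -1      # general exception signal
```

Keeping the counter private means the exporter only needs to compare the returned code against -2 and never sees the count itself.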

Collaborator

@hectorhdzg hectorhdzg left a comment

LGTM

@lzchen lzchen merged commit 7cbf82f into census-instrumentation:master Jun 14, 2022
@lzchen lzchen deleted the stats branch June 14, 2022 17:21
@lzchen lzchen added the azure Microsoft Azure label Nov 9, 2022
inirudebwoy pushed a commit to inirudebwoy/opencensus-python that referenced this pull request Jan 11, 2023