
add metric for kernel restarts #1241

Open

minrk wants to merge 4 commits into main
Conversation

@minrk (Contributor) commented Mar 22, 2023

labels: kernel_name = kernel name, source = "restarter" or "user"

I don't love type as a label, but it's already used in the running-kernels metric, so it's probably more important that it match other metrics than that it use the more logical name (or kernel_name, used elsewhere in the API). Same goes for source, to distinguish API ('user') restarts from automatic restarts by the KernelRestarter ('restarter'). I also considered trigger=user|crash (or 'exit' instead of crash, which is slightly more precise).

closes #1240
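
For concreteness, a minimal sketch of the proposed counter using prometheus_client (illustrative only: the metric prefix and exact label names are discussed further down this thread):

```python
from prometheus_client import Counter

# Sketch of the proposed metric; the name and labels are still under discussion.
KERNEL_RESTART_COUNT = Counter(
    "kernel_restarts",
    "Total count of kernel restarts",
    ["kernel_name", "source"],  # source: "restarter" (auto) or "user" (API)
)

# Example increments:
# KERNEL_RESTART_COUNT.labels(kernel_name="python3", source="restarter").inc()
# KERNEL_RESTART_COUNT.labels(kernel_name="python3", source="user").inc()
```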

Commit: labels: type = kernel name, source = "restarter" or "user"
@minrk (Contributor Author) commented Mar 22, 2023

nbclassic's shimming appears to do some weird stuff, causing this same module to get imported twice under different names, with the canonical name attempting to import the deprecated name. I'll have to think about the best way to resolve that, since I think the current `try: import, except: define` pattern doesn't make sense if nbclassic is up to date and notebook is not present.
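
For reference, the pattern in question looks roughly like this (a sketch modeled on the existing metrics module; names are illustrative, not exact code from this PR):

```python
# Sketch of the "try: import, except: define" pattern in the metrics module.
try:
    # If the notebook package already registered this collector, reuse it so
    # the same metric name is not registered twice in the default registry.
    from notebook.prometheus.metrics import KERNEL_CURRENTLY_RUNNING_TOTAL
except ImportError:
    from prometheus_client import Gauge

    KERNEL_CURRENTLY_RUNNING_TOTAL = Gauge(
        "kernel_currently_running_total",
        "Number of kernels currently running, labeled by type",
        ["type"],
    )
```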

@yuvipanda (Contributor) commented:
I agree that kernel_name (or whatever we actually use in kernelspec) is a better label to use. We can add that to the existing kernel metric, and use just that (and deprecate 'type' in the existing metric) - what do you think of that? That allows us to be consistent moving forward as we add more metrics. Matching terminology between kernelspec and metrics seems important.

@yuvipanda (Contributor) commented:
I do like 'source' better, as that looks like something we can be more 'sure' of - we know the restarter restarted it, or the user restarted it. I feel 'trigger' implies more causality than exists.

Commit: …notebook

which happens when `notebook` is entirely not present, and results in the same module being imported twice, ensuring all metrics are always defined twice
@minrk (Contributor Author) commented Mar 22, 2023

We can add that to the existing kernel metric, and use just that (and deprecate 'type' in the existing metric) - what do you think of that?

I like it, except it's complicated by the notebook->jupyter server migration due to prometheus enforcing uniqueness at import time, so we can't change the notebook metric.

I think it would probably be good to put a jupyter_ or jupyter_server_ prefix on all our metrics. That may be a path forward to avoid the circular import problem, too. It's technically a backward-incompatible change, though, so I'm not sure the best way to approach it.
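
For context, prometheus_client enforces that uniqueness when the collector is defined, i.e. at import time for module-level metrics, which is why the existing notebook metric names can't simply be redefined. A minimal illustration (not code from this PR):

```python
from prometheus_client import Counter

Counter("kernel_restarts", "Total count of kernel restarts", ["kernel_name"])

# Defining a second collector with the same name in the default registry fails:
# raises ValueError: Duplicated timeseries in CollectorRegistry: ...
Counter("kernel_restarts", "Total count of kernel restarts", ["kernel_name"])
```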

@yuvipanda (Contributor) commented:
YESSS let's prefix it all. We do that for jupyterhub too.

Maybe we leave the current ones as they are, just replicate them anew with new metrics with the prefix, and get rid of the old ones in a major point release?

@codecov bot commented Mar 22, 2023

Codecov Report

Patch coverage: 58.33% and project coverage change: +1.22% 🎉

Comparison is base (60d7cca) 79.11% compared to head (44c5bb8) 80.33%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1241      +/-   ##
==========================================
+ Coverage   79.11%   80.33%   +1.22%     
==========================================
  Files          68       68              
  Lines        8263     8274      +11     
  Branches     1600     1602       +2     
==========================================
+ Hits         6537     6647     +110     
+ Misses       1304     1203     -101     
- Partials      422      424       +2     
| Impacted Files | Coverage Δ |
| --- | --- |
| jupyter_server/prometheus/metrics.py | 69.23% <50.00%> (-30.77%) ⬇️ |
| jupyter_server/services/kernels/kernelmanager.py | 81.38% <50.00%> (-0.44%) ⬇️ |
| jupyter_server/services/kernels/handlers.py | 89.18% <100.00%> (+0.30%) ⬆️ |

... and 4 files with indirect coverage changes

☔ View full report in Codecov by Sentry.

@minrk (Contributor Author) commented Mar 22, 2023

Yeah, I think that makes sense. For now, I just updated the new metric to follow that pattern, and I can do a separate PR to deprecate the old metrics.

@yuvipanda (Contributor) commented:
Should it be jupyter_ or jupyter_server_?

@minrk (Contributor Author) commented Mar 22, 2023

Should it be jupyter_ or jupyter_server_?

I literally changed it 3 times while writing it. I don't know! jupyter_ makes a little more sense to me, since you get things like jupyter_kernel_restarts, but I'm certainly open to jupyter_server. It's pretty long, especially since everything also gets a subsection_ prefix like http_, kernel_, etc.

@kevin-bates (Member) commented:
Thanks for this PR @minrk.

I think a "standard" jupyter_ prefix is best (and consistent with "factory" env names). I also agree with @yuvipanda to use the term from the kernelspec, which would be "name" over "type"; "kernel_name" works as well.

Regarding "Should the metric include the kernel ID?", I think all metrics should include a "subject" indicator when applicable (and not considered PII) and, in this case, kernel_id is extremely applicable. I could definitely see cluster admins needing to correlate the "source = restarter" restart metric to a particular kernel (and therefore user) when they see resources being depleted on various nodes because the kernel's restart (due to OOM) is bouncing around the cluster.

@minrk (Contributor Author) commented Mar 23, 2023

@kevin-bates thanks! For metrics, prometheus recommends avoiding labels with too much churn, because each unique value for a label can be costly. I think this might be a case where monitoring (prometheus) metrics and something like a Jupyter Server Event might have different levels of detail (I think we should indeed have both for restarts):

CAUTION: Remember that every unique combination of key-value label pairs represents a new time series, which can dramatically increase the amount of data stored. Do not use labels to store dimensions with high cardinality (many different label values), such as user IDs, email addresses, or other unbounded sets of values.

Kernel IDs, being UUIDs and unique for every kernel across time, seem to fit this description (kubernetes pod names also fit, but while technically unbounded, they usually grow relatively slowly on the order of prometheus data retention).

So my inclination is to exclude the kernel id for now.
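
To make the cardinality concern concrete (a hypothetical snippet, not from this PR): if kernel_id were a label, every new kernel would create a fresh time series.

```python
import uuid

from prometheus_client import CollectorRegistry, Counter

registry = CollectorRegistry()
restarts = Counter(
    "kernel_restarts",
    "Total count of kernel restarts",
    ["kernel_id"],  # hypothetical: kernel ID as a label
    registry=registry,
)

# Each distinct UUID becomes its own time series, so the series count grows
# without bound over the lifetime of the deployment.
for _ in range(1000):
    restarts.labels(kernel_id=str(uuid.uuid4())).inc()
```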

particular kernel (and therefore user)

Identifying the server is definitely useful. Metrics from each instance are still stored separately, but I think not identified by default. I believe prometheus scrape config can add the server as a label via e.g. __meta_kubernetes_pod_name (in kubernetes) or some such. I'm not sure how easy that is, or if we should somehow add self-identifying labels to all metrics in the server.

@minrk (Contributor Author) commented Mar 23, 2023

I'm also on the fence for whether to use kernel_name vs name. I think the kernel_ prefix is useful when the namespace is ambiguous, and prometheus metric labels are a flat namespace, which can merge information from multiple sources (e.g. adding kubernetes and prometheus metadata fields), so I think the kernel_ prefix may be appropriate here. I don't feel strongly, though, and matching exactly makes sense, too!

@yuvipanda (Contributor) commented:
IMO, I've somewhat strong feelings about jupyter_server vs jupyter_. jupyter as a word is super diffuse, and prefixing the metric with the software package where that is coming from is extremely useful. If I just see a jupyter_ prefix, it could be coming from literally hundreds of packages by now...


@kevin-bates (Member) commented:
For metrics, prometheus recommends avoiding labels with too much churn, because each unique value for a label can be costly.

I didn't realize prometheus had this "pattern/footprint" behavior and was going to ask about events as well. Given prometheus' recommendation I would agree that not including the kernel id makes sense - although much less helpful from a diagnostics perspective. The corresponding event will need to include the kernel_id.

Since the prometheus namespace is flat, using kernel_name as the label makes sense as well, although can 'kernel_' be inferred from the metric name, jupyter_kernel_restarts? Can you provide the result of a metric capture?

IMO, I've somewhat strong feelings about jupyter_server vs jupyter_. jupyter as a word is super diffuse, and prefixing the metric with the software package where that is coming from is extremely useful. If I just see a jupyter_ prefix, it could be coming from literally hundreds of packages by now...

It makes sense to use the package name as the prefix although I'm not ever going to propose a prefix of jupyter_enterprise_gateway_ unless we as a community adopt it as a convention. I guess I was hoping folks could infer the source of the metric based on other contextual information.
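
Regarding "Can you provide the result of a metric capture?": for illustration only (hypothetical names and values), a capture of such a counter in the Prometheus exposition format would look roughly like this, generated here with prometheus_client:

```python
from prometheus_client import CollectorRegistry, Counter, generate_latest

registry = CollectorRegistry()
restarts = Counter(
    "jupyter_kernel_restarts",
    "Total count of kernel restarts",
    ["kernel_name", "source"],
    registry=registry,
)
restarts.labels(kernel_name="python3", source="restarter").inc()

print(generate_latest(registry).decode())
# Output contains lines along the lines of:
# # HELP jupyter_kernel_restarts_total Total count of kernel restarts
# # TYPE jupyter_kernel_restarts_total counter
# jupyter_kernel_restarts_total{kernel_name="python3",source="restarter"} 1.0
```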

Development

Successfully merging this pull request may close these issues.

Add 'kernel_restarts' prometheus metric
3 participants