
add metric for kernel restarts #1241

Open

minrk wants to merge 4 commits into main
Conversation

@minrk (Contributor) commented Mar 22, 2023

labels: kernel_name = kernel name, source = "restarter" or "user"

I don't love type as a label, but it's already used in the running-kernels metric, so it's probably more important that it match other metrics than that it use the more logical name (or kernel_name, used elsewhere in the API). Same goes for source, to distinguish API ('user') restarts from automatic restarts by the KernelRestarter ('restarter'). I also considered trigger=user|crash (or 'exit' instead of crash, which is slightly more precise).

closes #1240
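
For concreteness, a minimal sketch of the proposed counter using prometheus_client (illustrative only: the metric prefix and exact label names are discussed further down this thread):

```python
from prometheus_client import Counter

# Sketch of the proposed metric; the name and labels are still under discussion.
KERNEL_RESTART_COUNT = Counter(
    "kernel_restarts",
    "Total count of kernel restarts",
    ["kernel_name", "source"],  # source: "restarter" (auto) or "user" (API)
)

# Example increments:
# KERNEL_RESTART_COUNT.labels(kernel_name="python3", source="restarter").inc()
# KERNEL_RESTART_COUNT.labels(kernel_name="python3", source="user").inc()
```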

Commit: labels: type = kernel name, source = "restarter" or "user"
@minrk (Contributor Author) commented Mar 22, 2023

nbclassic's shimming appears to do some weird stuff, causing this same module to get imported twice under different names, with the canonical name attempting to import the deprecated name. I'll have to think about the best way to resolve that, since I think the current `try: import, except: define` pattern doesn't make sense if nbclassic is up to date and notebook is not present.
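
For reference, the pattern in question looks roughly like this (a sketch modeled on the existing metrics module; names are illustrative, not exact code from this PR):

```python
# Sketch of the "try: import, except: define" pattern in the metrics module.
try:
    # If the notebook package already registered this collector, reuse it so
    # the same metric name is not registered twice in the default registry.
    from notebook.prometheus.metrics import KERNEL_CURRENTLY_RUNNING_TOTAL
except ImportError:
    from prometheus_client import Gauge

    KERNEL_CURRENTLY_RUNNING_TOTAL = Gauge(
        "kernel_currently_running_total",
        "Number of kernels currently running, labeled by type",
        ["type"],
    )
```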

@yuvipanda (Contributor) commented:
I agree that kernel_name (or whatever we actually use in kernelspec) is a better label to use. We can add that to the existing kernel metric, and use just that (and deprecate 'type' in the existing metric) - what do you think of that? That allows us to be consistent moving forward as we add more metrics. Matching terminology between kernelspec and metrics seems important.

@yuvipanda (Contributor) commented:
I do like 'source' better, as that looks like something we can be more 'sure' of - we know the restarter restarted it, or the user restarted it. I feel 'trigger' implies more causality than exists.

Commit: …notebook

which happens when `notebook` is entirely not present, and results in the same module being imported twice, ensuring all metrics are always defined twice
@minrk (Contributor Author) commented Mar 22, 2023

We can add that to the existing kernel metric, and use just that (and deprecate 'type' in the existing metric) - what do you think of that?

I like it, except it's complicated by the notebook->jupyter server migration due to prometheus enforcing uniqueness at import time, so we can't change the notebook metric.

I think it would probably be good to put a jupyter_ or jupyter_server_ prefix on all our metrics. That may be a path forward to avoid the circular import problem, too. It's technically a backward-incompatible change, though, so I'm not sure the best way to approach it.
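
For context, prometheus_client enforces that uniqueness when the collector is defined, i.e. at import time for module-level metrics, which is why the existing notebook metric names can't simply be redefined. A minimal illustration (not code from this PR):

```python
from prometheus_client import Counter

Counter("kernel_restarts", "Total count of kernel restarts", ["kernel_name"])

# Defining a second collector with the same name in the default registry fails:
# raises ValueError: Duplicated timeseries in CollectorRegistry: ...
Counter("kernel_restarts", "Total count of kernel restarts", ["kernel_name"])
```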

@yuvipanda (Contributor) commented:
YESSS let's prefix it all. We do that for jupyterhub too.

Maybe we leave the current ones as they are, just replicate them anew with new metrics with the prefix, and get rid of the old ones in a major point release?

@codecov bot commented Mar 22, 2023

Codecov Report

Patch coverage: 58.33% and project coverage change: +1.22% 🎉

Comparison is base (60d7cca) 79.11% compared to head (44c5bb8) 80.33%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1241      +/-   ##
==========================================
+ Coverage   79.11%   80.33%   +1.22%     
==========================================
  Files          68       68              
  Lines        8263     8274      +11     
  Branches     1600     1602       +2     
==========================================
+ Hits         6537     6647     +110     
+ Misses       1304     1203     -101     
- Partials      422      424       +2     
| Impacted Files | Coverage Δ |
| --- | --- |
| jupyter_server/prometheus/metrics.py | 69.23% <50.00%> (-30.77%) ⬇️ |
| jupyter_server/services/kernels/kernelmanager.py | 81.38% <50.00%> (-0.44%) ⬇️ |
| jupyter_server/services/kernels/handlers.py | 89.18% <100.00%> (+0.30%) ⬆️ |

... and 4 files with indirect coverage changes

☔ View full report in Codecov by Sentry.

@minrk (Contributor Author) commented Mar 22, 2023

Yeah, I think that makes sense. For now, I just updated the new metric to follow that pattern, and I can do a separate PR to deprecate the old metrics.

@yuvipanda (Contributor) commented:
Should it be jupyter_ or jupyter_server_?

@minrk (Contributor Author) commented Mar 22, 2023

Should it be jupyter_ or jupyter_server_?

I literally changed it 3 times while writing it. I don't know! jupyter_ makes a little more sense to me, since you get things like jupyter_kernel_restarts, but I'm certainly open to jupyter_server. It's pretty long, especially since everything also gets a subsection_ prefix like http_, kernel_, etc.

@kevin-bates (Member) commented:
Thanks for this PR @minrk.

I think a "standard" jupyter_ prefix is best (and consistent with "factory" env names). I also agree with @yuvipanda to use the term from the kernelspec, which would be "name" over "type"; "kernel_name" works as well.

Regarding "Should the metric include the kernel ID?", I think all metrics should include a "subject" indicator when applicable (and not considered PII) and, in this case, kernel_id is extremely applicable. I could definitely see cluster admins needing to correlate the "source = restarter" restart metric to a particular kernel (and therefore user) when they see resources being depleted on various nodes because the kernel's restart (due to OOM) is bouncing around the cluster.

@minrk (Contributor Author) commented Mar 23, 2023

@kevin-bates thanks! For metrics, prometheus recommends avoiding labels with too much churn, because each unique value for a label can be costly. I think this might be a case where monitoring (prometheus) metrics and something like a Jupyter Server Event might have different levels of detail (I think we should indeed have both for restarts):

CAUTION: Remember that every unique combination of key-value label pairs represents a new time series, which can dramatically increase the amount of data stored. Do not use labels to store dimensions with high cardinality (many different label values), such as user IDs, email addresses, or other unbounded sets of values.

Kernel IDs, being UUIDs and unique for every kernel across time, seem to fit this description (kubernetes pod names also fit, but while technically unbounded, they usually grow relatively slowly on the order of prometheus data retention).

So my inclination is to exclude the kernel id for now.
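
To make the cardinality concern concrete (a hypothetical snippet, not from this PR): if kernel_id were a label, every new kernel would create a fresh time series.

```python
import uuid

from prometheus_client import CollectorRegistry, Counter

registry = CollectorRegistry()
restarts = Counter(
    "kernel_restarts",
    "Total count of kernel restarts",
    ["kernel_id"],  # hypothetical: kernel ID as a label
    registry=registry,
)

# Each distinct UUID becomes its own time series, so the series count grows
# without bound over the lifetime of the deployment.
for _ in range(1000):
    restarts.labels(kernel_id=str(uuid.uuid4())).inc()
```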

particular kernel (and therefore user)

Identifying the server is definitely useful. Metrics from each instance are still stored separately, but I think not identified by default. I believe prometheus scrape config can add the server as a label via e.g. __meta_kubernetes_pod_name (in kubernetes) or some such. I'm not sure how easy that is, or if we should somehow add self-identifying labels to all metrics in the server.

@minrk (Contributor Author) commented Mar 23, 2023

I'm also on the fence for whether to use kernel_name vs name. I think the kernel_ prefix is useful when the namespace is ambiguous, and prometheus metric labels are a flat namespace, which can merge information from multiple sources (e.g. adding kubernetes and prometheus metadata fields), so I think the kernel_ prefix may be appropriate here. I don't feel strongly, though, and matching exactly makes sense, too!

@yuvipanda (Contributor) commented:
IMO, I've somewhat strong feelings about jupyter_server vs jupyter_. jupyter as a word is super diffuse, and prefixing the metric with the software package where that is coming from is extremely useful. If I just see a jupyter_ prefix, it could be coming from literally hundreds of packages by now...


@kevin-bates (Member) commented:
For metrics, prometheus recommends avoiding labels with too much churn, because each unique value for a label can be costly.

I didn't realize prometheus had this "pattern/footprint" behavior and was going to ask about events as well. Given prometheus' recommendation I would agree that not including the kernel id makes sense - although much less helpful from a diagnostics perspective. The corresponding event will need to include the kernel_id.

Since the prometheus namespace is flat, using kernel_name as the label makes sense as well, although can 'kernel_' be inferred from the metric name, jupyter_kernel_restarts? Can you provide the result of a metric capture?

IMO, I've somewhat strong feelings about jupyter_server vs jupyter_. jupyter as a word is super diffuse, and prefixing the metric with the software package where that is coming from is extremely useful. If I just see a jupyter_ prefix, it could be coming from literally hundreds of packages by now...

It makes sense to use the package name as the prefix although I'm not ever going to propose a prefix of jupyter_enterprise_gateway_ unless we as a community adopt it as a convention. I guess I was hoping folks could infer the source of the metric based on other contextual information.
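
Regarding "Can you provide the result of a metric capture?": for illustration only (hypothetical names and values), a capture of such a counter in the Prometheus exposition format would look roughly like this, generated here with prometheus_client:

```python
from prometheus_client import CollectorRegistry, Counter, generate_latest

registry = CollectorRegistry()
restarts = Counter(
    "jupyter_kernel_restarts",
    "Total count of kernel restarts",
    ["kernel_name", "source"],
    registry=registry,
)
restarts.labels(kernel_name="python3", source="restarter").inc()

print(generate_latest(registry).decode())
# Output contains lines along the lines of:
# # HELP jupyter_kernel_restarts_total Total count of kernel restarts
# # TYPE jupyter_kernel_restarts_total counter
# jupyter_kernel_restarts_total{kernel_name="python3",source="restarter"} 1.0
```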

Development

Successfully merging this pull request may close these issues.

Add 'kernel_restarts' prometheus metric
3 participants