
[Incident] Dask Gateway fails with Unavailability of Compute Engine Resources - turned out to be a transient issue #1031

Closed
5 tasks done
choldgraf opened this issue Feb 24, 2022 · 19 comments

Comments

@choldgraf
Member

choldgraf commented Feb 24, 2022

Summary

Background

The Pangeo community recently tried to update their user environment image via the Configurator. When user sessions started with the new image, Dask Gateway was no longer functional.

After the image was bumped to 2022.02.04, running the following code no longer worked properly:

from dask_gateway import GatewayCluster
cluster = GatewayCluster()
cluster.scale(20)

Looking at some of the logs in the GCP console, we saw errors about unavailable Compute Engine resources.

Resolution

Ultimately, we realized that this was likely a transient issue caused by Google Cloud running out of available compute capacity, combined with the fact that we are not paying for dedicated, guaranteed resources.

Relevant information

Actions to resolve

We'll need to collaborate with folks in the Pangeo community who are familiar with the user environment images (maybe that is @scottyhq and @rabernat? Please cc others who may have insight). We could then do some combination of:

  • Look at the diff in the user image, and try to spot anything that would obviously stop Dask Gateway from scaling: pangeo-data/pangeo-docker-images@2021.10.19...2022.02.04
  • If that doesn't work, then work backwards with earlier releases of the image, running the code snippet above to see if it works. When we find an image environment that works, then we can figure out the exact diff that broke it.
  • Work with members of the Pangeo team to identify the problem and resolve it

Incident checklist

  • Incident has been dealt with or is over
  • Sections below are filled out
  • Incident title and after-action report is cleaned up
  • All actionable items above have linked GitHub Issues

Timeline

All times in US/Pacific.

2022-02-17 - First reports

  • @rabernat reported that the Dask Gateway cluster was not scaling properly
  • After some investigation, we realized that this happened just after bumping the user image to 2022.02.04.
  • Reverting the image made things work again, but now we cannot update the user image without breaking things.

2022-02-18 - 2022-02-22 - Investigation

The 2i2c team investigated what could be going on and suspected that the issue was a change in the environment image dependencies (see links to relevant diffs etc. above). We identified several process / technical improvements that could have prevented this (see below), but were still unsure of the specific incompatibility in the image.

2022-02-22 - 2022-03-08

We investigated various possible causes of what was going wrong (see the comment thread below for details).

During this time, the Pangeo hub was functional because it was using the previous user image.

2022-03-09

After-action report

What went wrong

  • Google Cloud's compute resources available to our cluster were maxed out. Because we do not pay GCP for guaranteed cloud resources, occasionally they will not be available. When the Dask Gateway cluster was created, this resulted in the error shown above.
  • This happened to coincide with a bump in the user image for this hub, so we originally thought the error was caused by the image change.
  • After a lot of investigation, we couldn't figure out how the user environment could be causing the problem.
  • Finally, we discovered that the real cause was the resource-availability issue described above, and that it was transient.

Where we got lucky

  • We were lucky that it was possible to quickly revert the image when the new one appeared not to be working.
  • We were unlucky that the transient issue happened to coincide with a user action, which led us down the wrong debugging path.

Future action items

Process improvements

Technical improvements

@scottyhq
Contributor

Thanks for documenting this @choldgraf! No solutions off the top of my head, but some ideas to get started

I can't tell from the console messages pasted above what could be going on. In the past I've found it necessary to look at the k8s pod logs to figure out what's going on. If there is some way to expose such logs to the community hub admins, that could help.
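
As a purely illustrative sketch of what that could look like with the Kubernetes Python client (the namespace and label selector here are assumptions, and hub admins don't currently have this level of access):

# Hypothetical helper: list dask-gateway worker pods and surface why any
# pending pods aren't scheduling (e.g. no nodes available in the zone).
# Namespace and label selector are assumptions; adjust to the actual hub setup.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod(
    namespace="prod",
    label_selector="app.kubernetes.io/name=dask-gateway",
)
for pod in pods.items:
    print(pod.metadata.name, pod.status.phase)
    for cond in pod.status.conditions or []:
        if cond.reason:  # e.g. "Unschedulable", with a message explaining why
            print("   ", cond.reason, "-", cond.message)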

It might also be helpful to home in on the conda environment changes; for this particular case you can use the following URL: https://github.com/pangeo-data/pangeo-docker-images/compare/2021.10.19..2022.02.04#diff-ceee658209456cc3bd347679717bdb5d95ee7fb5a91ffa1dc6d2e2d556144987

For example, I see some possibly relevant changes: dask 2021.9.1 -> 2022.1.1 and jupyterhub-singleuser 1.4.2 -> 2.1.1, etc.

To fully test for dask-gateway compatibility, it seems like you need some CI to actually launch a small dask-gateway cluster with the new image, right? Could either be with a 'test' cluster or maybe there is even a way to securely connect to the pangeo-hub itself and programmatically launch a cluster with the new image?
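
A minimal sketch of what such a CI smoke test might look like, assuming the test runner can reach the gateway endpoint and has a JupyterHub API token (the environment variable names and auth setup below are assumptions, not how the hubs are currently wired up):

# Hypothetical CI smoke test: launch a tiny Dask Gateway cluster and run a
# trivial computation end to end. Connection details are placeholders.
import os

import dask.array as da
from dask_gateway import Gateway
from dask_gateway.auth import JupyterHubAuth

gateway = Gateway(
    address=os.environ["DASK_GATEWAY_ADDRESS"],  # placeholder for the hub's gateway URL
    auth=JupyterHubAuth(api_token=os.environ["JUPYTERHUB_API_TOKEN"]),
)

cluster = gateway.new_cluster()  # picks up whatever image the hub/gateway is configured with
cluster.scale(2)                 # keep the test small and cheap
client = cluster.get_client()

try:
    # Exercise the workers with a trivial computation.
    total = da.ones((1000, 1000), chunks=(100, 100)).sum().compute()
    assert total == 1_000_000
finally:
    cluster.shutdown()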

@scottyhq
Contributor

I was able to launch a functional cluster with the 2022.02.04 image on the AWS hub, so this seems to be due to some incompatibility between the config (autoscaler version? daskhub chart version?) and the image...
[screenshot: a working Dask Gateway cluster with the 2022.02.04 image on the AWS hub]

@choldgraf
Member Author

choldgraf commented Feb 25, 2022

Maybe the thing to do is to intentionally break the user image on the Pangeo staging hub, so that we can look at the logs once somebody tries a scaling event w/ Dask Gateway. I think that the configurator works the same way on https://staging.us-central1-b.gcp.pangeo.io/ as it does on prod, so this shouldn't be bottlenecked on having merge rights to infrastructure/.

If we can break staging by upgrading the image, then prod will stay functional and it should be easier to debug

@scottyhq I was trying to figure out which PR upgraded the version of the relevant packages you posted there, and it seems like many of them have numerous pinnings throughout the repository, and don't necessarily all change at once. What is the easiest way to answer the question "when did package X get updated in the Pangeo image, and which release is associated with it?"

@rabernat
Contributor

Just wanted to ping here to

  • apologize for not being more involved in the conversation yet... I triggered the initial issue but then got sucked into urgent ocean sciences prep work this week
  • say that I strongly support @choldgraf's plan to debug this on the staging cluster. That's exactly what it is for, IMO. No one is depending on it for any sort of day-to-day work, so we can temporarily break it all we want as we work through this issue.

What is the easiest way to answer the question "when did package X get updated in the Pangeo image, and which release is associated with it?"

https://github.com/pangeo-data/pangeo-docker-images/blame/master/pangeo-notebook/conda-linux-64.lock

@scottyhq
Contributor

What is the easiest way to answer the question "when did package X get updated in the Pangeo image, and which release is associated with it?"

I find the git blame interface confusing and somewhat slow to navigate, so I added a script (PR above) that gets you an answer a little more quickly!
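
For illustration only - this is not the script from the PR above - here is one way the idea could be sketched in Python: walk the git history of the lock file and report commits where a package's pinned version changed (the regex assumes conda-lock-style package URLs):

# Hypothetical sketch (not the script added in the PR above): report the
# commits in which a package's pinned version changed in the conda lock file.
# Assumes it is run from a clone of pangeo-data/pangeo-docker-images.
import re
import subprocess
import sys

LOCKFILE = "pangeo-notebook/conda-linux-64.lock"

def version_at(commit, package):
    """Return the pinned version of `package` in the lock file at `commit`, if any."""
    text = subprocess.run(
        ["git", "show", f"{commit}:{LOCKFILE}"],
        capture_output=True, text=True, check=True,
    ).stdout
    match = re.search(rf"/{re.escape(package)}-(\d[^-]*)-", text)
    return match.group(1) if match else None

def report_changes(package):
    commits = subprocess.run(
        ["git", "log", "--format=%h %ad", "--date=short", "--", LOCKFILE],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    previous = None
    for line in reversed(commits):  # oldest commit first
        sha, date = line.split()
        version = version_at(sha, package)
        if version != previous:
            print(f"{date} {sha}: {package} -> {version}")
            previous = version

if __name__ == "__main__":
    report_changes(sys.argv[1])  # e.g. `python find_bump.py dask` (hypothetical filename)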

@rabernat
Contributor

rabernat commented Mar 2, 2022

Thanks for the help here Scott!

My best guess would be that the 2i2c dask gateway version is incompatible with the image version. Is it possible to upgrade dask gateway just for staging? Or is there one gateway for both clusters?

FWIW, the latest image DOES work with dask gateway on the Pangeo AWS binder: https://hub.aws-uswest2-binder.pangeo.io/v2/gh/pangeo-data/pangeo-docker-images/2022.02.04

@yuvipanda
Member

Hello! I'm actually slightly confused about what happened here - dask-gateway hasn't had a new release, and that would be the most likely culprit here. I'm going to take a look on staging.

@yuvipanda
Member

yuvipanda commented Mar 10, 2022

Googling the first error message in the screenshot posted takes me to https://cloud.google.com/compute/docs/troubleshooting/troubleshooting-vm-creation, which has a lot of useful information. In particular, it states: "Because this situation is temporary and can change frequently based on fluctuating demand, try your request again later." I think this is because dask workers are preemptible instances (similar to AWS spot instances), and hence at the time this scale-up was attempted, Google Cloud was just 'full'.

I just tried the newest image (2022.02.04) on the staging cluster, and could spin up 20 dask workers and do some computation with it.

[screenshot: 20 dask workers running a computation with image 2022.02.04 on the staging cluster]

So I think the image change was pure coincidence, and what really happened was that the cloud was full and put you on hold.
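
Given that the failure mode here is "the zone is temporarily out of capacity", one illustrative pattern (just a sketch, not something the hubs do for you) is to poll for workers after scaling and try again later, rather than concluding the environment image is broken:

# Illustrative sketch: scale up, then poll for workers with a deadline so that
# a temporary capacity shortage shows up as "workers haven't arrived yet"
# rather than looking like a broken image.
import time

from dask_gateway import GatewayCluster

def scale_with_patience(n_workers=20, wait_minutes=15, poll_seconds=30):
    cluster = GatewayCluster()
    client = cluster.get_client()
    cluster.scale(n_workers)

    deadline = time.monotonic() + wait_minutes * 60
    while time.monotonic() < deadline:
        n_up = len(client.scheduler_info()["workers"])
        if n_up >= n_workers:
            print(f"All {n_workers} workers are up.")
            return cluster, client
        print(f"{n_up}/{n_workers} workers up; capacity may be temporarily exhausted...")
        time.sleep(poll_seconds)

    print("Workers did not all arrive in time; try again later before blaming the image.")
    return cluster, client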

yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Mar 10, 2022

Ref 2i2c-org#1031

See 2i2c-org#1031 (comment) for investigation of the issue there.

@yuvipanda
Member

https://staging.us-central1-b.gcp.pangeo.io has the new image, and in #1080 I'm pushing it out to the prod hub as well. I want to push this out soon, but wanted to give folks an opportunity to test as well.

@choldgraf
Member Author

choldgraf commented Mar 10, 2022

So to summarize, I think this is the situation:

  • the user image was bumped on the Pangeo hub via the configurator
  • we then tried to scale with a Dask Gateway, and got the error message posted above
  • we reasonably concluded that the error was due to something that changed in the image
  • debugging the user image didn't yield a clear answer, especially since the same image on the AWS hub worked
  • we did some more googling of the error message and found it was associated with preemptible nodes (which the dask gateway uses)
  • we concluded this issue might have been transient, and only coincidentally related to the user image
  • we tried bumping the image in the same way again, and it worked
  • we conclude that the user image does not require any changes and is compatible with our hub infrastructure

Is that right?

@yuvipanda
Member

we did some more googling of the error message and found it was associated with preemptible nodes (which the dask gateway uses)

It's associated both with preemptible nodes and with general cloud capacity. We think of cloud resources as infinite, but physical restrictions do exist - we just don't often run into them. If we really want to make sure this never happens, you can buy commitments (https://cloud.google.com/compute/docs/instances/reservations-overview) that you pay for regardless of usage. So it's possible this error would also occur without preemptible nodes - although it's less likely.

we concluded this issue might have been transient, and only coincidentally related to the user image

Based on the info we have, I think it was transient (not might have been) and unrelated to the image.

Otherwise, this is right!

@rabernat
Contributor

We just discovered a serious bug in the 2022.02.04 image (pangeo-data/pangeo-docker-images#297), so we need to go to 2022.03.08.

I would make the PR myself, but I am having trouble understanding where exactly this is configured. I browsed around
https://github.com/2i2c-org/infrastructure/tree/master/config/clusters/pangeo-hubs but couldn't figure it out.

@scottyhq
Contributor

scottyhq commented Mar 11, 2022

I browsed around
https://github.com/2i2c-org/infrastructure/tree/master/config/clusters/pangeo-hubs but couldn't figure it out.

https://us-central1-b.gcp.pangeo.io/services/configurator

Docs:
https://docs.2i2c.org/en/latest/admin/howto/configurator.html

People with admin access can change the image:

admin_users:
  - rabernat
  - jhamman
  - scottyhq
  - TomAugspurger

@yuvipanda
Member

@rabernat you can just use the configurator for now. I agree it's confusing to know when to use the configurator vs. deploying via the hub config here; that's something to sort out.

@yuvipanda
Member

However, if you wanna make a PR, the tag is specified in

@rabernat
Contributor

you can just use the configurator for now.

My understanding is that the configurator overriding the value in the chart was the source of some problems up thread. So I am reluctant to use the configurator.

However, if you wanna make a PR, the tag is specified in

This points to staging. Does it also apply to prod? There is no equivalent line in https://github.com/2i2c-org/infrastructure/blob/6e66c84176c8d7f91b1a8ad4b9d71b5f2256d076/config/clusters/pangeo-hubs/prod.values.yaml

@sgibson91
Member

sgibson91 commented Mar 11, 2022

@rabernat Short answer: yes, it also gets passed to prod.

Long answer: we've done a lot of work recently to split out the helm config so that we can explicitly isolate a single hub's config. This means we can now validate a single hub's config on deploy (and in PRs that affect config). I'm also currently working on being able to deploy hubs in parallel, not just clusters, and on running a deploy for a single hub in CI/CD based on changed filepaths. All the files that define a single hub are explicitly listed in the cluster.yaml file:

- name: staging
  display_name: "Pangeo (staging)"
  domain: staging.us-central1-b.gcp.pangeo.io
  helm_chart: daskhub
  auth0:
    enabled: false
  helm_chart_values_files:
    # The order in which you list files here is the order they will be passed
    # to the helm upgrade command in, and that has meaning. Please check
    # that you intend for these files to be applied in this order.
    - staging.values.yaml
    - enc-staging.secret.values.yaml
- name: prod
  display_name: "Pangeo (prod)"
  domain: us-central1-b.gcp.pangeo.io
  helm_chart: daskhub
  auth0:
    enabled: false
  helm_chart_values_files:
    # The order in which you list files here is the order they will be passed
    # to the helm upgrade command in, and that has meaning. Please check
    # that you intend for these files to be applied in this order.
    - staging.values.yaml
    - prod.values.yaml
    - enc-prod.secret.values.yaml

Though I think I need to tweak the new structure further such that, for a cluster like pangeo, we have common, staging and prod files to save confusion.

If you're interested, you can read more about the new config structure here https://infrastructure.2i2c.org/en/latest/topic/config.html#id1
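
To illustrate why the order of helm_chart_values_files matters, here is a rough sketch (not the actual 2i2c deployer code; the top-level hubs: key and the chart handling are assumptions) of how that list maps onto a helm upgrade invocation, where later -f files override earlier ones:

# Rough sketch only (not the 2i2c deployer): turn a hub's ordered list of
# values files into a `helm upgrade` command. Later -f files win on conflicts,
# so prod.values.yaml layers on top of staging.values.yaml.
import yaml  # pyyaml

def helm_upgrade_command(cluster_yaml_path, hub_name):
    with open(cluster_yaml_path) as f:
        cluster = yaml.safe_load(f)

    hub = next(h for h in cluster["hubs"] if h["name"] == hub_name)  # "hubs" key assumed
    cmd = ["helm", "upgrade", "--install", hub_name, hub["helm_chart"]]  # chart resolution simplified
    for values_file in hub["helm_chart_values_files"]:
        cmd += ["-f", values_file]  # order preserved from cluster.yaml
    return cmd

print(" ".join(helm_upgrade_command("cluster.yaml", "prod")))
# helm upgrade --install prod daskhub -f staging.values.yaml -f prod.values.yaml -f enc-prod.secret.values.yaml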

@yuvipanda
Member

@rabernat if you see #1031 (comment), the image change actually had nothing to do with the problem here at all. It was just a coincidence.

@choldgraf choldgraf changed the title [Incident] Pangeo environment changes are not compatible with our Dask Gateway setup [Incident] Dask Gateway fails with Unavailability of Compute Engine Resources - turned out to be a transient issue Mar 15, 2022
@choldgraf
Member Author

Hey all - I believe that this one is now resolved, and I've updated the top comment with a timeline and overview of the problem. Annoyingly, I could not think of many follow-up improvements, because the root problem here was transient and we mostly just needed to "try it again". We do have a few issues tracking the enforcement of specifications in the user environment, which should at least help narrow down potential problems in the future.

I'm going to close this one, but if anyone has suggestions for other process / tech improvements we need, please suggest them and/or open issues!
