[Incident] Dask Gateway fails with Unavailability of Compute Engine Resources - turned out to be a transient issue #1031
Comments
Thanks for documenting this @choldgraf! No solutions off the top of my head, but some ideas to get started.

I can't tell from the console messages pasted above what could be going on. In the past I've found it necessary to look at the k8s pod logs to figure out what's going on. If there is some way to expose such logs to the community hub admin, that could help.

It might also be helpful to home in on the conda environment changes; for this particular case you can use the following URL: https://github.com/pangeo-data/pangeo-docker-images/compare/2021.10.19..2022.02.04#diff-ceee658209456cc3bd347679717bdb5d95ee7fb5a91ffa1dc6d2e2d556144987
For example, I see some possibly relevant changes.

To fully test for dask-gateway compatibility, it seems like you need some CI to actually launch a small dask-gateway cluster with the new image, right? That could either be with a 'test' cluster, or maybe there is even a way to securely connect to the pangeo-hub itself and programmatically launch a cluster with the new image?
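For illustration only (this is not an existing CI job), a launch-and-compute smoke test along these lines could work, assuming dask-gateway is installed in the session and the gateway address and credentials come from the hub's defaults; whether an `image` option is actually exposed depends on the gateway configuration:

```python
# A minimal sketch of the kind of programmatic check described above -- not an
# existing CI job. Assumes dask-gateway is installed and the session is already
# authenticated against the hub's gateway (address/auth from the hub defaults).
import dask.array as da
from dask_gateway import Gateway

gateway = Gateway()
options = gateway.cluster_options()   # server-defined options; an `image` field
                                      # is only present if the gateway exposes one

cluster = gateway.new_cluster(options)
cluster.scale(2)                      # a couple of workers is enough for a smoke test
client = cluster.get_client()

# If workers come up and return a result, the image is at least compatible with
# the scheduler / gateway versions in use.
assert da.ones((1000, 1000)).sum().compute() == 1_000_000

cluster.shutdown()
```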
I was able to launch a functional cluster with the 2022.02.04 image on the AWS hub, so this seems to be due to some incompatibility between the config (autoscaler version? daskhub chart version?) and the image...
Maybe the thing to do is to intentionally break the user image on the Pangeo staging hub, so that we can look at the logs once somebody tries a scaling event w/ Dask Gateway. I think that the configurator works the same way on https://staging.us-central1-b.gcp.pangeo.io/ as it does on the prod hub. If we can break staging by upgrading the image, then prod will stay functional and it should be easier to debug.

@scottyhq I was trying to figure out which PR upgraded the version of the relevant packages you posted there, and it seems like many of them have numerous pinnings throughout the repository and don't necessarily all change at once. What is the easiest way to answer the question "when did package X get updated in the Pangeo image, and which release is associated with it?"
Just wanted to ping here to point to the git blame for the conda lock file:

https://github.com/pangeo-data/pangeo-docker-images/blame/master/pangeo-notebook/conda-linux-64.lock
I find the git blame interface confusing and somewhat slow to navigate, so I added a script (PR above) that gets you an answer a little more quickly!
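For reference (this is not the script from that PR), one way to ask "when did package X change in the lock file?" against a local clone of pangeo-data/pangeo-docker-images, run from inside the clone, might look roughly like this:

```python
# Not the script from the PR above -- just an illustration of one way to answer
# "when did package X change?", assuming a local clone of pangeo-docker-images.
# conda-lock "explicit" lockfiles list one URL per package, so a commit whose
# diff adds or removes a line containing "/<package>-" changed that package's pin.
import subprocess
import sys

LOCKFILE = "pangeo-notebook/conda-linux-64.lock"  # path inside the repo clone

def commits_changing_package(package: str) -> str:
    """Return `git log --oneline` output for commits whose lockfile diff touched
    a line mentioning the package (the pattern is intentionally loose, so e.g.
    "dask" also matches "dask-core")."""
    result = subprocess.run(
        ["git", "log", "--oneline", "-G", f"/{package}-", "--", LOCKFILE],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

if __name__ == "__main__":
    print(commits_changing_package(sys.argv[1]))
```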
Thanks for the help here, Scott! My best guess would be that the 2i2c dask-gateway version is incompatible with the image version. Is it possible to upgrade dask-gateway just for staging? Or is there one gateway for both clusters?

FWIW, the latest image DOES work with dask-gateway on the Pangeo AWS binder: https://hub.aws-uswest2-binder.pangeo.io/v2/gh/pangeo-data/pangeo-docker-images/2022.02.04
Hello! I'm actually slightly confused about what happened here - dask-gateway hasn't had a new release, and that would have been the most likely culprit. I'm going to take a look on staging.
Googling the first error message in the screenshot posted takes me to https://cloud.google.com/compute/docs/troubleshooting/troubleshooting-vm-creation, which has a lot of useful information. In particular, it states: "Because this situation is temporary and can change frequently based on fluctuating demand, try your request again later." I think this is because dask workers run on preemptible instances (similar to AWS spot instances), and hence during the time this scale-up was attempted, Google Cloud was just 'full'.

I just tried the latest image on the staging cluster, and could spin up 20 dask workers and do some computation with the newest image (2022.02.04).

So I think the image change was pure coincidence, and what really happened was that the cloud was full and put you on hold.
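As a rough sketch (assuming the hub's default dask-gateway configuration), the "did the workers actually arrive?" check can be made explicit, which helps distinguish a transient capacity problem from an image problem:

```python
# A rough sketch of checking whether requested workers actually arrive, rather
# than assuming the scale-up succeeded. Assumes dask-gateway with the hub's
# default configuration. If workers never show up, that points at capacity
# (e.g. no preemptible VMs available in the zone) rather than at the image.
import time
from dask_gateway import Gateway

gateway = Gateway()
cluster = gateway.new_cluster()
client = cluster.get_client()

def wait_for_workers(client, n_workers, timeout=600, poll=15):
    """Poll the scheduler until n_workers are connected or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if len(client.scheduler_info()["workers"]) >= n_workers:
            return True
        time.sleep(poll)
    return False

cluster.scale(20)
if not wait_for_workers(client, 20):
    # Transient cloud capacity issues usually resolve on their own; retry later
    # or check the cluster-autoscaler / GCE logs before blaming the image.
    print("Workers did not arrive in time; likely a capacity issue -- try again later.")
```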
Ref 2i2c-org#1031 See 2i2c-org#1031 (comment) for investigation of the issue there.
https://staging.us-central1-b.gcp.pangeo.io has the new image, and in #1080 I'm pushing it out to the prod hub as well. I want to push this out soon, but wanted to give folks an opportunity to test as well.
So to summarize, I think this is the situation:
Is that right?
It's associated not just with preemptible nodes but also with general cloud capacity. We think of cloud resources as infinite, but physical restrictions do exist - we just don't often run into them. If we really want to make sure this never happens, we can buy commitments (https://cloud.google.com/compute/docs/instances/reservations-overview) that we pay for regardless of usage. So it's possible this error would also occur without preemptible nodes - although it's less likely.
Based on the info we have, I think it was transient (not might have been) and unrelated to the image. Otherwise, this is right!
We just discovered a serious bug in the 2022.02.04 image (pangeo-data/pangeo-docker-images#297), so we need to go to 2022.03.08. I would make the PR myself, but I am having trouble understanding where exactly this is configured. I browsed around
https://us-central1-b.gcp.pangeo.io/services/configurator

Docs:

People with admin access can change the image:

infrastructure/config/clusters/pangeo-hubs/staging.values.yaml, lines 47 to 51 at 6e66c84
@rabernat you can just use the configurator for now. I agree it's confusing when to use the configurator vs. deploying via the hub config here; that's something to sort out.
However, if you wanna make a PR, the tag is specified in the staging values file referenced above.
My understanding is that the configurator overriding the value in the chart was the source of some problems up thread. So I am reluctant to use the configurator.
This points to staging. Does it also apply to prod? There is no equivalent line in https://github.com/2i2c-org/infrastructure/blob/6e66c84176c8d7f91b1a8ad4b9d71b5f2256d076/config/clusters/pangeo-hubs/prod.values.yaml
@rabernat Short answer: yes, it also gets applied to prod.

Long answer: we've done a lot of work recently to split out the helm config so that we can explicitly isolate a single hub's config. This means we can now validate a single hub's config on deploy (and in PRs that affect config), and I'm currently working on being able to deploy hubs in parallel, not just clusters, or to run a deploy for a single hub in CI/CD based on changed filepaths. All the files that define a single hub are explicitly listed in:

infrastructure/config/clusters/pangeo-hubs/cluster.yaml, lines 12 to 36 at 6e66c84

Though I think I need to tweak the new structure further such that, for a cluster like pangeo, we have …

If you're interested, you can read more about the new config structure here: https://infrastructure.2i2c.org/en/latest/topic/config.html#id1
@rabernat if you see #1031 (comment), the image change actually had nothing to do with the problem here at all. It was just a coincidence.
Hey all - I believe that this one is now resolved, and I've updated the top comment with a timeline and overview of the problem.

Annoyingly, I could not think of many follow-up improvements, because the root problem here was transient and we mostly just needed to "try it again". We do have a few issues to track enforcing specifications in the user environment, which should help at least narrow down the potential problems in the future.

I'm going to close this one, but if anyone has suggestions for other process / tech improvements we need, please suggest them and/or open issues!
Summary
Background
The Pangeo community recently tried to update their user environment image via the Configurator. When user sessions started with the new image, Dask Gateway was no longer functional.
When the image was bumped to 2022.02.04, launching a Dask Gateway cluster from a user session no longer worked. Looking at some of the logs in the GCP console, we saw errors about the unavailability of Compute Engine resources.
Resolution
Ultimately, we realized that this was likely a transient issue due to Google Cloud's resource limits being reached, and the fact that we are not paying for dedicated guaranteed resources.
Relevant information
Actions to resolve
We'll need to collaborate with folks in the Pangeo community who are familiar with the user environment images (maybe that is @scottyhq and @rabernat? Please cc others who may have insight). We could then do some combination of:
Timeline
All times in US/Pacific.
2022-02-17 - First reports
First reports that Dask Gateway was no longer working after the user image was bumped to 2022.02.04.
2022-02-18 - 2022-02-22 - Investigation
The 2i2c team investigated what could be going on, and suspected that the issue was a change in the environment image dependencies (see links to relevant diffs etc. above). We identified several process / technical improvements that could have prevented this (see below), but were still unsure of the specific incompatibility in the image.
2022-02-22 - 2022-03-08
We tried various investigations into what was going wrong, including:
During this time, the Pangeo hub was functional because it was using the previous user image.
2022-03-09
After-action report
What went wrong
Where we got lucky
Future action items
Process improvements
Technical improvements