
[Incident] Dask Gateway fails with Unavailability of Compute Engine Resources - turned out to be a transient issue #1031

Closed
5 tasks done
choldgraf opened this issue Feb 24, 2022 · 19 comments

Comments

@choldgraf
Member

choldgraf commented Feb 24, 2022

Summary

Background

The Pangeo community recently tried to update their user environment image via the Configurator. When user sessions started with the new image, Dask Gateway was no longer functional.

After the image was bumped to 2022.02.04, running the following code no longer worked properly:

from dask_gateway import GatewayCluster
cluster = GatewayCluster()
cluster.scale(20)

Looking at some of the logs in the GCP console, we saw errors about unavailable Compute Engine resources.

Resolution

Ultimately, we realized that this was likely a transient issue caused by Google Cloud running out of available compute capacity, combined with the fact that we are not paying for dedicated, guaranteed resources.

Relevant information

Actions to resolve

We'll need to collaborate with folks in the Pangeo community who are familiar with the user environment images (maybe that is @scottyhq and @rabernat? Please cc others who may have insight). We could then do some combination of:

  • Look at the diff in the user image, and try to spot anything that would obviously stop Dask Gateway from scaling: pangeo-data/pangeo-docker-images@2021.10.19...2022.02.04
  • If that doesn't work, then work backwards with earlier releases of the image, running the code snippet above to see if it works. When we find an image environment that works, then we can figure out the exact diff that broke it.
  • Work with members of the Pangeo team to identify the problem and resolve it

Incident checklist

  • Incident has been dealt with or is over
  • Sections below are filled out
  • Incident title and after-action report is cleaned up
  • All actionable items above have linked GitHub Issues

Timeline

All times in US/Pacific.

2022-02-17 - First reports

  • @rabernat reported that the Dask Gateway cluster was not scaling properly
  • After some investigation, we realized that this happened just after bumping the user image to 2022.02.04.
  • Reverting the image made things work again, but now we cannot update the user image without breaking things.

2022-02-18 - 2022-02-22 - Investigation

The 2i2c team investigated what could be going on and suspected that the issue was a change in the environment image dependencies (see links to relevant diffs etc. above). We identified several process / technical improvements that could have prevented this (see below), but were still unsure of the specific incompatibility in the image.

2022-02-22 - 2022-03-08

We investigated various possible causes of what was going wrong (see the comment thread below for details).

During this time, the Pangeo hub was functional because it was using the previous user image.

2022-03-09

After-action report

What went wrong

  • Google Cloud's compute resources available to our cluster were maxed out. Because we do not pay GCP for guaranteed cloud resources, occasionally they will not be available. When the Dask Gateway cluster was created, this resulted in the error shown above.
  • This happened to coincide with a bump in the user image for this hub, so we originally thought the error was caused by the image change.
  • After a lot of investigation, we couldn't figure out how the user environment could be causing the problem.
  • Finally, we discovered that the real cause was the resource-availability issue described above, and that it was transient.

Where we got lucky

  • We were lucky that it was possible to quickly revert the image when the new one appeared not to be working.
  • We were unlucky that the transient issue happened to coincide with a user action, which led us down the wrong debugging path.

Future action items

Process improvements

Technical improvements

@scottyhq
Contributor

Thanks for documenting this @choldgraf! No solutions off the top of my head, but some ideas to get started

I can't tell from the console messages pasted above what could be going on. In the past I've found it necessary to look at the k8s pod logs to figure out what's going on. If there is some way to expose such logs to the community hub admins, that could help.
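
As a purely illustrative sketch of what that could look like with the Kubernetes Python client (the namespace and label selector here are assumptions, and hub admins don't currently have this level of access):

# Hypothetical helper: list dask-gateway worker pods and surface why any
# pending pods aren't scheduling (e.g. no nodes available in the zone).
# Namespace and label selector are assumptions; adjust to the actual hub setup.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod(
    namespace="prod",
    label_selector="app.kubernetes.io/name=dask-gateway",
)
for pod in pods.items:
    print(pod.metadata.name, pod.status.phase)
    for cond in pod.status.conditions or []:
        if cond.reason:  # e.g. "Unschedulable", with a message explaining why
            print("   ", cond.reason, "-", cond.message)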

It might also be helpful to home in on the conda environment changes; for this particular case you can use the following URL: https://github.com/pangeo-data/pangeo-docker-images/compare/2021.10.19..2022.02.04#diff-ceee658209456cc3bd347679717bdb5d95ee7fb5a91ffa1dc6d2e2d556144987

For example, I see some possibly relevant changes: dask 2021.9.1 -> 2022.1.1 and jupyterhub-singleuser 1.4.2 -> 2.1.1, etc.

To fully test for dask-gateway compatibility, it seems like you need some CI to actually launch a small dask-gateway cluster with the new image, right? Could either be with a 'test' cluster or maybe there is even a way to securely connect to the pangeo-hub itself and programmatically launch a cluster with the new image?
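
A minimal sketch of what such a CI smoke test might look like, assuming the test runner can reach the gateway endpoint and has a JupyterHub API token (the environment variable names and auth setup below are assumptions, not how the hubs are currently wired up):

# Hypothetical CI smoke test: launch a tiny Dask Gateway cluster and run a
# trivial computation end to end. Connection details are placeholders.
import os

import dask.array as da
from dask_gateway import Gateway
from dask_gateway.auth import JupyterHubAuth

gateway = Gateway(
    address=os.environ["DASK_GATEWAY_ADDRESS"],  # placeholder for the hub's gateway URL
    auth=JupyterHubAuth(api_token=os.environ["JUPYTERHUB_API_TOKEN"]),
)

cluster = gateway.new_cluster()  # picks up whatever image the hub/gateway is configured with
cluster.scale(2)                 # keep the test small and cheap
client = cluster.get_client()

try:
    # Exercise the workers with a trivial computation.
    total = da.ones((1000, 1000), chunks=(100, 100)).sum().compute()
    assert total == 1_000_000
finally:
    cluster.shutdown()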

@scottyhq
Contributor

I was able to launch a functional cluster with the 2022.02.04 image on the AWS hub, so this seems to be due to some incompatibility between the config (autoscaler version? daskhub chart version?) and the image...
[screenshot: a working Dask Gateway cluster with the 2022.02.04 image on the AWS hub]

@choldgraf
Member Author

choldgraf commented Feb 25, 2022

Maybe the thing to do is to intentionally break the user image on the Pangeo staging hub, so that we can look at the logs once somebody tries a scaling event w/ Dask Gateway. I think that the configurator works the same way on https://staging.us-central1-b.gcp.pangeo.io/ as it does on prod, so this shouldn't be bottlenecked on having merge rights to infrastructure/.

If we can break staging by upgrading the image, then prod will stay functional and it should be easier to debug

@scottyhq I was trying to figure out which PR upgraded the version of the relevant packages you posted there, and it seems like many of them have numerous pinnings throughout the repository, and don't necessarily all change at once. What is the easiest way to answer the question "when did package X get updated in the Pangeo image, and which release is associated with it?"

@rabernat
Contributor

Just wanted to ping here to

  • apologize for not being more involved in the conversation yet... I triggered the initial issue but then got sucked into urgent ocean sciences prep work this week
  • say that I strongly support @choldgraf's plan to debug this on the staging cluster. That's exactly what it is for, IMO. No one is depending on it for any sort of day-to-day work, so we can temporarily break it all we want as we work through this issue.

What is the easiest way to answer the question "when did package X get updated in the Pangeo image, and which release is associated with it?"

https://github.com/pangeo-data/pangeo-docker-images/blame/master/pangeo-notebook/conda-linux-64.lock

@scottyhq
Contributor

What is the easiest way to answer the question "when did package X get updated in the Pangeo image, and which release is associated with it?"

I find the git blame interface confusing and somewhat slow to navigate, so I added a script (PR above) that gets you an answer a little more quickly!
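
For illustration only - this is not the script from the PR above - here is one way the idea could be sketched in Python: walk the git history of the lock file and report commits where a package's pinned version changed (the regex assumes conda-lock-style package URLs):

# Hypothetical sketch (not the script added in the PR above): report the
# commits in which a package's pinned version changed in the conda lock file.
# Assumes it is run from a clone of pangeo-data/pangeo-docker-images.
import re
import subprocess
import sys

LOCKFILE = "pangeo-notebook/conda-linux-64.lock"

def version_at(commit, package):
    """Return the pinned version of `package` in the lock file at `commit`, if any."""
    text = subprocess.run(
        ["git", "show", f"{commit}:{LOCKFILE}"],
        capture_output=True, text=True, check=True,
    ).stdout
    match = re.search(rf"/{re.escape(package)}-(\d[^-]*)-", text)
    return match.group(1) if match else None

def report_changes(package):
    commits = subprocess.run(
        ["git", "log", "--format=%h %ad", "--date=short", "--", LOCKFILE],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    previous = None
    for line in reversed(commits):  # oldest commit first
        sha, date = line.split()
        version = version_at(sha, package)
        if version != previous:
            print(f"{date} {sha}: {package} -> {version}")
            previous = version

if __name__ == "__main__":
    report_changes(sys.argv[1])  # e.g. `python find_bump.py dask` (hypothetical filename)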

@rabernat
Contributor

rabernat commented Mar 2, 2022

Thanks for the help here Scott!

My best guess would be that the 2i2c dask gateway version is incompatible with the image version. Is it possible to upgrade dask gateway just for staging? Or is there one gateway for both clusters?

FWIW, the latest image DOES work with dask gateway on the Pangeo AWS binder: https://hub.aws-uswest2-binder.pangeo.io/v2/gh/pangeo-data/pangeo-docker-images/2022.02.04

@yuvipanda
Member

Hello! I'm actually slightly confused about what happened here - dask-gateway hasn't had a new release, and that would be the most likely culprit here. I'm going to take a look on staging.

@yuvipanda
Member

yuvipanda commented Mar 10, 2022

Googling the first error message in the screenshot posted takes me to https://cloud.google.com/compute/docs/troubleshooting/troubleshooting-vm-creation, which has a lot of useful information. In particular, it states: "Because this situation is temporary and can change frequently based on fluctuating demand, try your request again later." I think this is because dask workers are preemptible instances (similar to AWS spot instances), and hence at the time this scale-up was attempted, Google Cloud was just 'full'.

I just tried the newest image (2022.02.04) on the staging cluster, and could spin up 20 dask workers and do some computation with it.

[screenshot: 20 dask workers running a computation with image 2022.02.04 on the staging cluster]

So I think the image change was pure coincidence, and what really happened was that the cloud was full and put you on hold.
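
Given that the failure mode here is "the zone is temporarily out of capacity", one illustrative pattern (just a sketch, not something the hubs do for you) is to poll for workers after scaling and try again later, rather than concluding the environment image is broken:

# Illustrative sketch: scale up, then poll for workers with a deadline so that
# a temporary capacity shortage shows up as "workers haven't arrived yet"
# rather than looking like a broken image.
import time

from dask_gateway import GatewayCluster

def scale_with_patience(n_workers=20, wait_minutes=15, poll_seconds=30):
    cluster = GatewayCluster()
    client = cluster.get_client()
    cluster.scale(n_workers)

    deadline = time.monotonic() + wait_minutes * 60
    while time.monotonic() < deadline:
        n_up = len(client.scheduler_info()["workers"])
        if n_up >= n_workers:
            print(f"All {n_workers} workers are up.")
            return cluster, client
        print(f"{n_up}/{n_workers} workers up; capacity may be temporarily exhausted...")
        time.sleep(poll_seconds)

    print("Workers did not all arrive in time; try again later before blaming the image.")
    return cluster, client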

yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Mar 10, 2022

Ref 2i2c-org#1031

See 2i2c-org#1031 (comment) for investigation of the issue there.

@yuvipanda
Member

https://staging.us-central1-b.gcp.pangeo.io has the new image, and in #1080 I'm pushing it out to the prod hub as well. I want to push this out soon, but wanted to give folks an opportunity to test as well.

@choldgraf
Member Author

choldgraf commented Mar 10, 2022

So to summarize, I think this is the situation:

  • the user image was bumped on the Pangeo hub via the configurator
  • we then tried to scale with a Dask Gateway, and got the error message posted above
  • we reasonably concluded that the error was due to something that changed in the image
  • debugging the user image didn't yield a clear answer, especially since the same image on the AWS hub worked
  • we did some more googling of the error message and found it was associated with preemptible nodes (which the dask gateway uses)
  • we concluded this issue might have been transient, and only coincidentally related to the user image
  • we tried bumping the image in the same way again, and it worked
  • we conclude that the user image does not require any changes and is compatible with our hub infrastructure

Is that right?

@yuvipanda
Member

we did some more googling of the error message and found it was associated with preemptible nodes (which the dask gateway uses)

It's associated both with preemptible nodes and with general cloud capacity. We think of cloud resources as infinite, but physical restrictions do exist - we just don't often run into them. If we really want to make sure this never happens, you can buy commitments (https://cloud.google.com/compute/docs/instances/reservations-overview) that you pay for regardless of usage. So it's possible this error would also occur without preemptible nodes - although it's less likely.

we concluded this issue might have been transient, and only coincidentally related to the user image

Based on the info we have, I think it was transient (not might have been) and unrelated to the image.

Otherwise, this is right!

@rabernat
Contributor

We just discovered a serious bug in the 2022.02.04 image (pangeo-data/pangeo-docker-images#297), so we need to go to 2022.03.08.

I would make the PR myself, but I am having trouble understanding where exactly this is configured. I browsed around
https://github.com/2i2c-org/infrastructure/tree/master/config/clusters/pangeo-hubs but couldn't figure it out.

@scottyhq
Contributor

scottyhq commented Mar 11, 2022

I browsed around
https://github.com/2i2c-org/infrastructure/tree/master/config/clusters/pangeo-hubs but couldn't figure it out.

https://us-central1-b.gcp.pangeo.io/services/configurator

Docs:
https://docs.2i2c.org/en/latest/admin/howto/configurator.html

People with admin access can change the image:

admin_users:
  - rabernat
  - jhamman
  - scottyhq
  - TomAugspurger

@yuvipanda
Member

@rabernat you can just use the configurator for now. I agree it's confusing to know when to use the configurator vs. deploying via the hub config here; that's something to sort out.

@yuvipanda
Member

However, if you wanna make a PR, the tag is specified in

@rabernat
Contributor

you can just use the configurator for now.

My understanding is that the configurator overriding the value in the chart was the source of some problems up thread. So I am reluctant to use the configurator.

However, if you wanna make a PR, the tag is specified in

This points to staging. Does it also apply to prod? There is no equivalent line in https://github.com/2i2c-org/infrastructure/blob/6e66c84176c8d7f91b1a8ad4b9d71b5f2256d076/config/clusters/pangeo-hubs/prod.values.yaml

@sgibson91
Member

sgibson91 commented Mar 11, 2022

@rabernat Short answer: yes, it also gets passed to prod.

Long answer: we've done a lot of work recently to split out the helm config so that we can explicitly isolate a single hub's config. This means we can now validate a single hub's config on deploy (and in PRs that affect config). I'm also currently working on being able to deploy hubs in parallel, not just clusters, and on running a deploy for a single hub in CI/CD based on changed filepaths. All the files that define a single hub are explicitly listed in the cluster.yaml file:

- name: staging
  display_name: "Pangeo (staging)"
  domain: staging.us-central1-b.gcp.pangeo.io
  helm_chart: daskhub
  auth0:
    enabled: false
  helm_chart_values_files:
    # The order in which you list files here is the order they will be passed
    # to the helm upgrade command in, and that has meaning. Please check
    # that you intend for these files to be applied in this order.
    - staging.values.yaml
    - enc-staging.secret.values.yaml
- name: prod
  display_name: "Pangeo (prod)"
  domain: us-central1-b.gcp.pangeo.io
  helm_chart: daskhub
  auth0:
    enabled: false
  helm_chart_values_files:
    # The order in which you list files here is the order they will be passed
    # to the helm upgrade command in, and that has meaning. Please check
    # that you intend for these files to be applied in this order.
    - staging.values.yaml
    - prod.values.yaml
    - enc-prod.secret.values.yaml

Though I think I need to tweak the new structure further such that, for a cluster like pangeo, we have common, staging and prod files to save confusion.

If you're interested, you can read more about the new config structure here https://infrastructure.2i2c.org/en/latest/topic/config.html#id1
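
To illustrate why the order of helm_chart_values_files matters, here is a rough sketch (not the actual 2i2c deployer code; the top-level hubs: key and the chart handling are assumptions) of how that list maps onto a helm upgrade invocation, where later -f files override earlier ones:

# Rough sketch only (not the 2i2c deployer): turn a hub's ordered list of
# values files into a `helm upgrade` command. Later -f files win on conflicts,
# so prod.values.yaml layers on top of staging.values.yaml.
import yaml  # pyyaml

def helm_upgrade_command(cluster_yaml_path, hub_name):
    with open(cluster_yaml_path) as f:
        cluster = yaml.safe_load(f)

    hub = next(h for h in cluster["hubs"] if h["name"] == hub_name)  # "hubs" key assumed
    cmd = ["helm", "upgrade", "--install", hub_name, hub["helm_chart"]]  # chart resolution simplified
    for values_file in hub["helm_chart_values_files"]:
        cmd += ["-f", values_file]  # order preserved from cluster.yaml
    return cmd

print(" ".join(helm_upgrade_command("cluster.yaml", "prod")))
# helm upgrade --install prod daskhub -f staging.values.yaml -f prod.values.yaml -f enc-prod.secret.values.yaml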

@yuvipanda
Member

@rabernat if you see #1031 (comment), the image change actually had nothing to do with the problem here at all. It was just a coincidence.

@choldgraf choldgraf changed the title [Incident] Pangeo environment changes are not compatible with our Dask Gateway setup [Incident] Dask Gateway fails with Unavailability of Compute Engine Resources - turned out to be a transient issue Mar 15, 2022
@choldgraf
Member Author

Hey all - I believe that this one is now resolved, and I've updated the top comment with a timeline and overview of the problem. Annoyingly, I could not think of many follow-up improvements, because the root problem here was transient and we mostly just needed to "try it again". We do have a few issues tracking the enforcement of specifications in the user environment, which should at least help narrow down potential problems in the future.

I'm going to close this one, but if anyone has suggestions for other process / tech improvements we need, please suggest them and/or open issues!
