Spec for a GKE Kubernetes GPU Cluster w/ Cloud Run #4

Open
minimaxir opened this issue May 18, 2019 · 6 comments
Labels: enhancement (New feature or request)

Comments

@minimaxir (Owner)

Create a k8s .yaml file spec that will stand up a cluster capable of supporting GPT-2 APIs w/ GPUs for faster serving.

Goal

  • Each Node keeps GPU utilization as high as possible.
  • Able to scale down to zero (for real, GKE is picky about that)

Proposal

  • A single f1-micro Node so the GPU Pods can scale to 0 (a single f1-micro is free).
  • The other Node is 16 vCPU / 14 GB RAM (n1-highcpu-16).
  • Each Pod uses 4 vCPU, 1 K80 GPU, and has a Cloud Run concurrency of 4.

Therefore, a single Node can accommodate up to 4 different GPT-2 APIs or the same API scaled up, which is neat.

In testing, a single K80 can generate about 20 texts at a time before going OOM, so setting a maximum of 16 should leave enough of a buffer for storing the model. If not, using T4 GPUs should give a vRAM boost.
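As a starting point, a minimal sketch of the GPU-Pod half of that spec (the Deployment name and image are placeholders; the f1-micro and GPU node pools are assumed to be created separately, with GKE's NVIDIA device plugin installed):

```yaml
# Sketch only: the GPU workload under the proposal above.
# Names and image are placeholders, not this repo's actual config.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpt2-api
spec:
  replicas: 4                     # up to 4 Pods fit on one n1-highcpu-16
  selector:
    matchLabels:
      app: gpt2-api
  template:
    metadata:
      labels:
        app: gpt2-api
    spec:
      containers:
        - name: gpt2-api
          image: gcr.io/YOUR_PROJECT/gpt2-api:latest   # placeholder image
          resources:
            requests:
              cpu: "4"            # 4 vCPU per Pod
            limits:
              nvidia.com/gpu: 1   # 1 K80 per Pod
```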

minimaxir added the enhancement label on May 18, 2019
@minimaxir (Owner, Author)

Cloud Run may not work well here because it does not allow you to configure the number of vCPUs per service.

It may be better to use raw Knative until Google adds that feature.
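For reference, raw Knative does expose that knob on the container spec. A rough sketch using the current serving.knative.dev/v1 schema (names and image are placeholders):

```yaml
# Sketch only: a Knative Service with an explicit CPU request,
# the setting Cloud Run doesn't expose. Names/image are placeholders.
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: gpt2-api
spec:
  template:
    spec:
      containerConcurrency: 4     # 4 concurrent requests per container
      containers:
        - image: gcr.io/YOUR_PROJECT/gpt2-api:latest   # placeholder image
          resources:
            requests:
              cpu: "4"            # the per-service vCPU knob
            limits:
              nvidia.com/gpu: 1
```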

@minimaxir (Owner, Author)

Interesting issue when trying to put K80s on an n1-highcpu-16:

The number of GPU dies is linked to the number of CPU cores and memory selected for this instance. For the current configuration, you can select no fewer than 2 GPU dies of this type

So T4 it is.

@minimaxir (Owner, Author)

Better solution: leverage Python's async to minimize the dedicated resources needed, so we can actually use K80s.

With gpt-2-simple, generation is done entirely on the GPU, so that might work. We might be able to get away with a 4 vCPU n1-standard-4 system (1 vCPU per Pod) and use a K80 (but still 4 concurrent users per Pod, 16 users per Node). The total cost is less than half of the original proposal.

And since each container would use only 1 vCPU, we could set it up with Cloud Run, which might be easier than working with Knative.

@minimaxir (Owner, Author)

Unfortunately, this is not as easy as expected, since a tf.Session cannot be shared between threads or processes, which dramatically reduces the async possibilities.

For the initial release I might be OK without it, especially if the GPU has high enough throughput.

@minimaxir (Owner, Author) commented May 24, 2019

Update: you can share a tf.Session, but it's not easy and might not necessarily result in a performance gain. It does, however, save GPU vRAM, which is a necessary precondition (estimated 2.5 GB ceiling when generating 4 predictions at a time, so 4 containers will fit on a 12 GB GPU).

The best architecture is still 4 vCPU + 1 GPU w/ 4 containers, but it may be worth seeing whether Cloud Run can assign each container 4 vCPUs and share threads instead (Flask's native server is threaded by default) and route accordingly, and then checking whether that causes any deadlocks.
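A rough sketch of that shared-session pattern, assuming gpt-2-simple with a finetuned checkpoint in checkpoint/run1; the route, port, and locking strategy are illustrative assumptions, not this repo's actual server code:

```python
# Sketch only: one tf.Session shared across Flask's worker threads,
# serialized with a lock so concurrent requests don't interleave runs.
import threading

from flask import Flask, jsonify, request
import gpt_2_simple as gpt2

app = Flask(__name__)

sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess)          # loads checkpoint/run1 into vRAM once
sess_lock = threading.Lock()  # assumption: serialize access to the session

@app.route("/", methods=["POST"])
def generate():
    prefix = request.get_json().get("prefix", "")
    with sess_lock:           # one generation at a time in this container
        text = gpt2.generate(sess, prefix=prefix, return_as_list=True)[0]
    return jsonify({"text": text})

if __name__ == "__main__":
    # Flask's built-in server is threaded by default as of Flask 1.0.
    app.run(host="0.0.0.0", port=8080)
```

Note that with the lock in place, threading adds no throughput within a container; that matches the observation above that the win from sharing a session is vRAM, not speed.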

@kshtzgupta1 commented Sep 5, 2019

Hi Max! Thank you so much for creating gpt-2-cloud-run. It's been really useful and inspiring for my GPT-2 webapp. For this webapp I'm trying to deploy a finetuned 345M GPT-2 model (~1.4 GB) through Cloud Run on GKE, but I'm unsure about the spec of the GKE cluster as well as what concurrency to set.

Can you please advise on the number of nodes, the machine type, and the concurrency I should use for maximum cost-effectiveness? Currently I have a concurrency of 1 along with just 1 node (n1-standard-2; 7.5 GB; 2 vCPU) and a K80 attached to that node, but I'm not sure this is the most cost-effective GKE spec.

I would really appreciate any insights on this! If it helps, I intend to deploy only this model and don't plan on having any more GPT-2 webapps.
