Spec for a GKE Kubernetes GPU Cluster w/ Cloud Run #4

Open
minimaxir opened this issue May 18, 2019 · 6 comments
Labels: enhancement (New feature or request)

Comments

@minimaxir (Owner)

Create a k8s .yaml file spec that will stand up a cluster capable of supporting GPT-2 APIs w/ GPUs for faster serving.

Goal

  • Each Node keeps GPU utilization as high as possible.
  • Able to scale down to zero (for real, GKE is picky about that)

Proposal

  • A single f1-micro Node so the GPU Pods can scale to 0 (a single f1-micro is free).
  • The other Node is 16 vCPU / 14 GB RAM (n1-highcpu-16).
  • Each Pod uses 4 vCPU, 1 K80 GPU, and has a Cloud Run concurrency of 4.

Therefore, a single Node can accommodate up to 4 different GPT-2 APIs or the same API scaled up, which is neat.

In testing, a single K80 can generate about 20 texts at a time before going OOM, so setting a maximum of 16 should leave enough of a buffer for storing the model. If not, using T4 GPUs should give a vRAM boost.
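As a starting point, a minimal sketch of the GPU-Pod half of that spec (the Deployment name and image are placeholders; the f1-micro and GPU node pools are assumed to be created separately, with GKE's NVIDIA device plugin installed):

```yaml
# Sketch only: the GPU workload under the proposal above.
# Names and image are placeholders, not this repo's actual config.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpt2-api
spec:
  replicas: 4                     # up to 4 Pods fit on one n1-highcpu-16
  selector:
    matchLabels:
      app: gpt2-api
  template:
    metadata:
      labels:
        app: gpt2-api
    spec:
      containers:
        - name: gpt2-api
          image: gcr.io/YOUR_PROJECT/gpt2-api:latest   # placeholder image
          resources:
            requests:
              cpu: "4"            # 4 vCPU per Pod
            limits:
              nvidia.com/gpu: 1   # 1 K80 per Pod
```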

minimaxir added the enhancement label on May 18, 2019
@minimaxir (Owner, Author)

Cloud Run may not work well here because it does not allow you to configure the number of vCPUs per service.

It may be better to use raw Knative until Google adds that feature.
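For reference, raw Knative does expose that knob on the container spec. A rough sketch using the current serving.knative.dev/v1 schema (names and image are placeholders):

```yaml
# Sketch only: a Knative Service with an explicit CPU request,
# the setting Cloud Run doesn't expose. Names/image are placeholders.
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: gpt2-api
spec:
  template:
    spec:
      containerConcurrency: 4     # 4 concurrent requests per container
      containers:
        - image: gcr.io/YOUR_PROJECT/gpt2-api:latest   # placeholder image
          resources:
            requests:
              cpu: "4"            # the per-service vCPU knob
            limits:
              nvidia.com/gpu: 1
```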

@minimaxir (Owner, Author)

Interesting issue when trying to put K80s on an n1-highcpu-16:

The number of GPU dies is linked to the number of CPU cores and memory selected for this instance. For the current configuration, you can select no fewer than 2 GPU dies of this type

So T4 it is.

@minimaxir (Owner, Author)

Better solution: leverage Python's async to minimize the dedicated resources needed, so we can actually use K80s.

With gpt-2-simple, generation is done entirely on the GPU, so that might work. We might be able to get away with a 4 vCPU n1-standard-4 system (1 vCPU per Pod) and use a K80 (but still 4 concurrent users per Pod, 16 users per Node). The total cost is less than half of the original proposal.

And since each container would use only 1 vCPU, we could set it up with Cloud Run, which might be easier than working with Knative.

@minimaxir (Owner, Author)

Unfortunately, this is not as easy as expected, since a tf.Session cannot be shared between threads or processes, which dramatically reduces the async possibilities.

For the initial release I might be OK without it, especially if the GPU has high enough throughput.

@minimaxir (Owner, Author) commented May 24, 2019

Update: you can share a tf.Session, but it's not easy and might not necessarily result in a performance gain. It does, however, save GPU vRAM, which is a necessary precondition (estimated 2.5 GB ceiling when generating 4 predictions at a time, so 4 containers will fit on a 12 GB GPU).

The best architecture is still 4 vCPU + 1 GPU w/ 4 containers, but it may be worth seeing whether Cloud Run can assign each container 4 vCPUs and share threads instead (Flask's native server is threaded by default) and route accordingly, and then checking whether that causes any deadlocks.
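A rough sketch of that shared-session pattern, assuming gpt-2-simple with a finetuned checkpoint in checkpoint/run1; the route, port, and locking strategy are illustrative assumptions, not this repo's actual server code:

```python
# Sketch only: one tf.Session shared across Flask's worker threads,
# serialized with a lock so concurrent requests don't interleave runs.
import threading

from flask import Flask, jsonify, request
import gpt_2_simple as gpt2

app = Flask(__name__)

sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess)          # loads checkpoint/run1 into vRAM once
sess_lock = threading.Lock()  # assumption: serialize access to the session

@app.route("/", methods=["POST"])
def generate():
    prefix = request.get_json().get("prefix", "")
    with sess_lock:           # one generation at a time in this container
        text = gpt2.generate(sess, prefix=prefix, return_as_list=True)[0]
    return jsonify({"text": text})

if __name__ == "__main__":
    # Flask's built-in server is threaded by default as of Flask 1.0.
    app.run(host="0.0.0.0", port=8080)
```

Note that with the lock in place, threading adds no throughput within a container; that matches the observation above that the win from sharing a session is vRAM, not speed.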

@kshtzgupta1 commented Sep 5, 2019

Hi Max! Thank you so much for creating gpt-2-cloud-run. It's been really useful and inspiring for my GPT-2 webapp. For this webapp I'm trying to deploy a finetuned 345M GPT-2 model (~1.4 GB) through Cloud Run on GKE, but I'm unsure about the spec of the GKE cluster as well as what concurrency to set.

Can you please advise on the number of nodes, the machine type, and the concurrency I should use for maximum cost-effectiveness? Currently I have a concurrency of 1 along with just 1 node (n1-standard-2; 7.5 GB; 2 vCPU) and a K80 attached to that node, but I'm not sure this is the most cost-effective GKE spec.

I would really appreciate any insights on this! If it helps, I intend to deploy only this model and don't plan on having any more GPT-2 webapps.
