piercefreeman/cloud-gpu-reliability

cloud-gpu-reliability

After encountering some reliability issues with on-demand provisioning of GPU resources, I put together this benchmarking harness to test AWS vs. GCP availability.

To maximize the statistical and practical significance of results:

  • Each provisioning request uses the same GPU configuration (currently a T4). GCP provides more flexibility here, since its accelerators can be attached to any hardware configuration, whereas AWS only provisions these more powerful GPUs on designated VM configurations.
  • The deployments in each trial launch at approximately the same time, and trials run roughly 48 times a day. We handle this spawning via separate threads because async support isn't yet available for the official AWS and GCP Python APIs.
  • The times of day at which trials run are chosen by random search rather than on a fixed schedule. This attempts to account for daily variability in demand from jobs that don't fit a set schedule.
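The scheduling and threading approach described above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: `sample_trial_times`, `run_trial`, and the `launchers` mapping are hypothetical names, and uniform sampling over a 24-hour window is an assumption about how the random times might be drawn.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timedelta


def sample_trial_times(day_start, trials_per_day=48, rng=None):
    """Pick random, sorted launch times within the 24 hours after day_start."""
    rng = rng or random.Random()
    offsets = sorted(rng.uniform(0, 24 * 60 * 60) for _ in range(trials_per_day))
    return [day_start + timedelta(seconds=s) for s in offsets]


def run_trial(launchers):
    """Run each provider's launcher in its own thread; map name -> success."""

    def attempt(launch):
        # Hypothetical convention: a launcher raises if provisioning fails.
        try:
            launch()
            return True
        except Exception:
            return False

    # One thread per provider so the requests start at about the same time.
    with ThreadPoolExecutor(max_workers=len(launchers)) as pool:
        futures = {name: pool.submit(attempt, fn) for name, fn in launchers.items()}
        return {name: future.result() for name, future in futures.items()}
```

Here `launchers` would map a provider name such as `"aws"` or `"gcp"` to a zero-argument callable that requests the instance; the real harness may structure this differently.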

At the risk of stating the obvious: running this locally will create cloud resources that you'll have to pay for while they run. This package takes every care to clean up resources once it creates them, but run it at your own risk.
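One common way to make cleanup robust is to tie teardown to a `finally` block, so it runs even when the trial raises. The sketch below assumes hypothetical `create` and `destroy` callables and is not necessarily how this package implements its cleanup.

```python
from contextlib import contextmanager


@contextmanager
def provisioned(create, destroy):
    """Create a resource and guarantee destroy() runs when the block exits."""
    resource = create()
    try:
        yield resource
    finally:
        # Runs on normal exit and on exceptions raised inside the with-block.
        destroy(resource)
```

A trial would then wrap its work in `with provisioned(create_instance, delete_instance) as instance:`, where both function names are placeholders. Note that this only covers failures after `create` returns; a request that fails midway may still need a separate sweep.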

Getting Started

This repo manages dependencies with Poetry. A regular pip install -e . should also work, but it might not pull in the dependency versions that have been tested.

poetry install

You'll also have to configure an .env file with your AWS and GCP credentials before running. This should be relatively straightforward given the key names specified in Settings. To encode the GCP service key, you'll have to do something like:

cat ~/personal-gcp-service-key.json | base64
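For reference, the same encoding and the matching decode step can be sketched in Python. The function names here are hypothetical, and how Settings actually reads the value is not shown in this README; the point is only that the base64 string round-trips back to the original JSON.

```python
import base64
import json


def encode_service_key(raw_json: str) -> str:
    """Base64-encode the service key JSON so it fits on one .env line."""
    return base64.b64encode(raw_json.encode("utf-8")).decode("ascii")


def decode_service_key(encoded: str) -> dict:
    """Decode the base64 value back into the service key dictionary."""
    # b64decode ignores non-alphabet characters by default, so line breaks
    # that some base64 tools insert into their output are tolerated.
    return json.loads(base64.b64decode(encoded))
```

If your `base64` command wraps its output across lines, join it into a single line before pasting it into the .env file.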

When you're ready to run the trial:

docker-compose up

Errors

GCP:

Operation type [insert] failed with message "The zone 'projects/{project}/zones/{zone}' does not have enough resources available to fulfill the request. Try a different zone, or try again later."
Resource exhausted (HTTP 429): ZONE_RESOURCE_POOL_EXHAUSTED
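When logging results, it can help to separate capacity failures like the ones above from other errors such as bad credentials or quota misconfiguration. A minimal sketch, assuming the error text is available as a string; `is_capacity_error` is a hypothetical helper and the marker strings are taken only from the messages shown above:

```python
CAPACITY_MARKERS = (
    "ZONE_RESOURCE_POOL_EXHAUSTED",
    "does not have enough resources available",
)


def is_capacity_error(message: str) -> bool:
    """Return True if the message matches a known GCP capacity failure."""
    return any(marker in message for marker in CAPACITY_MARKERS)
```

Matching on message text is brittle, so treat this as a starting point; other capacity-related messages may exist that are not listed here.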

About

A mini container to log cloud gpu reliability
