CSUA/slurm-docker-cluster

The below documentation was written as a roadmap for future implementation, and may be irrelevant/inaccurate to how Latte is operated today. Reach out to the VP of tech (vp@csua.berkeley.edu) or root staff (via Discord) if you have any questions.

About Latte

Latte is a GPU server, donated in part by NVIDIA Corp. for use by the CS community. It features 8 datacenter-class NVIDIA Tesla P100 GPUs, which offer a large speedup for machine learning and related GPU computing tasks. The Tensorflow and PyTorch libraries are available for use as well.

User Guide

Getting Started

To begin using Latte, you need a CSUA account.

To get a CSUA account, create one on the CSUA website, or visit our office in 311 Soda and an officer will create one for you.

Once you have an account, you can log into latte.csua.berkeley.edu over SSH. SSH access from off campus is not allowed, so if you are currently off campus, proxy jump through Soda via ssh <username>@latte.csua.berkeley.edu -J <username>@soda.csua.berkeley.edu. From here, you can begin setting up your jobs.
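If you connect from off campus often, the proxy jump can be made automatic with an entry in your ~/.ssh/config. The host alias below is illustrative; substitute your own CSUA username:

```
# Hypothetical ~/.ssh/config entry; replace <username> with your CSUA username
Host latte
    HostName latte.csua.berkeley.edu
    User <username>
    # Hop through soda when connecting from off campus
    ProxyJump <username>@soda.csua.berkeley.edu
```

With this in place, `ssh latte` handles the jump for you.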

For information on how to best use the server, send an email to latte@csua.berkeley.edu with the following:

  • Name
  • CSUA Username
  • Intended use

Most jobs can be run similarly on latte to how they are run on any other Linux-based machine.

Slurm is an optional feature used to manage job scheduling.

Testing Your Jobs

The slurmctld (controller) machine is meant for testing only. There are limits on the amount of compute you can use while on it.

The /datasets/ directory has some publicly available datasets to use in /datasets/share/. If you are using your own dataset, please place it in /datasets/, because the contents of /home/ are mounted over a network filesystem and will be slower.

Once your program runs correctly, you can submit it as a job.

Running Your Jobs via Slurm

To run a job, submit it using the srun command. You can read about how to use Slurm in the official Slurm documentation.

This sends the job to one of the GPU nodes and runs it there.
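As a sketch, a GPU job could be submitted either interactively with srun or as a batch script. The partition, GPU syntax, and script name below are assumptions; check `sinfo` and the cluster's Slurm configuration for the actual values:

```shell
#!/bin/bash
# job.sh -- hypothetical batch script; flags depend on the cluster's Slurm config
#SBATCH --job-name=my-training-run
#SBATCH --gres=gpu:1           # request one GPU
#SBATCH --output=slurm-%j.out  # %j expands to the job ID

python train.py
```

Submit it with `sbatch job.sh`, or get an interactive shell on a GPU node with `srun --gres=gpu:1 --pty bash`.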

Contact

If you have any questions, please email latte@csua.berkeley.edu.

Developer Guide

This repo contains the configurations used to test and deploy the Slurm Docker cluster known as Latte. The important commands can be found in the Makefile.

The cluster is created using docker-compose (specifically nvidia-docker-compose), though several other pieces of software are involved as well.

How docker-compose works

(Copied from https://docs.docker.com/compose/overview/ )

Compose is a tool for defining and running multi-container Docker applications. With Compose, you use a YAML file to configure your application’s services. Then, with a single command, you create and start all the services from your configuration.

Using Compose is basically a three-step process:

  1. Define your app’s environment with a Dockerfile so it can be reproduced anywhere.

  2. Define the services that make up your app in docker-compose.yml so they can be run together in an isolated environment.

  3. Run docker-compose up and Compose starts and runs your entire app.
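For orientation, a stripped-down compose file for a cluster like this one might look as follows. This is an illustrative fragment, not the repo's actual docker-compose.yml; image tags and service details are assumptions:

```yaml
# Illustrative fragment only -- see the repo's docker-compose.yml for the real file
version: "2.2"
services:
  mysql:
    image: mysql:5.7
    volumes:
      - var_lib_mysql:/var/lib/mysql
  slurmctld:
    image: slurm-docker-cluster:17.02.9
    command: ["slurmctld"]
    volumes:
      - etc_slurm:/etc/slurm
      - slurm_jobdir:/data
  c1:
    image: slurm-docker-cluster:17.02.9
    command: ["slurmd"]
volumes:
  var_lib_mysql:
  etc_slurm:
  slurm_jobdir:
```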

About the Makefile

The Makefile describes all the necessary commands for building and testing the cluster.
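The exact targets live in the repo's Makefile; a typical layout for a project like this might resemble the hypothetical sketch below (target names and recipes are assumptions, not the repo's actual file):

```makefile
# Illustrative targets only -- consult the actual Makefile in the repo
IMAGE := slurm-docker-cluster:17.02.9

build:
	docker build -t $(IMAGE) .

up: build
	docker-compose up -d

clean:
	docker-compose rm -sf
```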

Slurm Docker Cluster (Documentation from original Repo)

This is a multi-container Slurm cluster using docker-compose. The compose file creates named volumes for persistent storage of MySQL data files as well as Slurm state and log directories.

Containers and Volumes

The compose file will run the following containers:

  • mysql
  • slurmdbd
  • slurmctld
  • c1 (slurmd)
  • c2 (slurmd)

The compose file will create the following named volumes:

  • etc_munge ( -> /etc/munge )
  • etc_slurm ( -> /etc/slurm )
  • slurm_jobdir ( -> /data )
  • var_lib_mysql ( -> /var/lib/mysql )
  • var_log_slurm ( -> /var/log/slurm )

Building the Docker Image

Build the image locally:

$ docker build -t slurm-docker-cluster:17.02.9 .

Starting the Cluster

Run docker-compose to instantiate the cluster:

$ docker-compose up -d

Register the Cluster with SlurmDBD

To register the cluster with the slurmdbd daemon, run the register_cluster.sh script:

$ ./register_cluster.sh

Note: You may have to wait a few seconds for the cluster daemons to become ready before registering the cluster. Otherwise, you may get an error such as sacctmgr: error: Problem talking to the database: Connection refused.

You can check the status of the cluster by viewing the logs: docker-compose logs -f

Accessing the Cluster

Use docker exec to run a bash shell on the controller container:

$ docker exec -it slurmctld bash

From the shell, execute slurm commands, for example:

[root@slurmctld /]# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal*      up 5-00:00:00      2   idle c[1-2]

Submitting Jobs

The slurm_jobdir named volume is mounted on each Slurm container as /data. To see job output files on the controller, change to the /data directory on the slurmctld container and submit a job:

[root@slurmctld /]# cd /data/
[root@slurmctld data]# sbatch --wrap="uptime"
Submitted batch job 2
[root@slurmctld data]# ls
slurm-2.out

Stopping and Restarting the Cluster

$ docker-compose stop
$ docker-compose start

Deleting the Cluster

To remove all containers and volumes, run:

$ docker-compose rm -sf
$ docker volume rm slurmdockercluster_etc_munge slurmdockercluster_etc_slurm slurmdockercluster_slurm_jobdir slurmdockercluster_var_lib_mysql slurmdockercluster_var_log_slurm

About

Latte: A Slurm cluster using docker-compose
