CSUA/slurm-docker-cluster

The below documentation was written as a roadmap for future implementation, and may be irrelevant/inaccurate to how Latte is operated today. Reach out to the VP of tech (vp@csua.berkeley.edu) or root staff (via Discord) if you have any questions.

About Latte

Latte is a GPU server, donated in part by NVIDIA Corp. for use by the CS community. It features 8 datacenter-class NVIDIA Tesla P100 GPUs, which offer a large speedup for machine learning and related GPU computing tasks. The Tensorflow and PyTorch libraries are available for use as well.

User Guide

Getting Started

To begin using Latte, you need a CSUA account.

To get a CSUA account, create one on the CSUA website, or visit our office in 311 Soda and an officer will create one for you.

Once you have an account, you can log into latte.csua.berkeley.edu over SSH. SSH access from off campus is not allowed, so if you are currently off campus, proxy jump through Soda via ssh <username>@latte.csua.berkeley.edu -J <username>@soda.csua.berkeley.edu. From here, you can begin setting up your jobs.
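If you connect from off campus often, the proxy jump can be made automatic with an entry in your ~/.ssh/config. The host alias below is illustrative; substitute your own CSUA username:

```
# Hypothetical ~/.ssh/config entry; replace <username> with your CSUA username
Host latte
    HostName latte.csua.berkeley.edu
    User <username>
    # Hop through soda when connecting from off campus
    ProxyJump <username>@soda.csua.berkeley.edu
```

With this in place, `ssh latte` handles the jump for you.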

For information on how to best use the server, send an email to latte@csua.berkeley.edu with the following:

  • Name
  • CSUA Username
  • Intended use

Most jobs can be run similarly on latte to how they are run on any other Linux-based machine.

Slurm is an optional feature used to manage job scheduling.

Testing Your Jobs

The slurmctld (controller) machine is meant for testing only. There are limits on the amount of compute you can use while on it.

The /datasets/ directory has some publicly available datasets to use in /datasets/share/. If you are using your own dataset, please place it in /datasets/, because the contents of /home/ are mounted over a network filesystem and will be slower.

Once your program runs correctly, you can submit it as a job.

Running Your Jobs via Slurm

To run a job, submit it using the srun command. You can read about how to use Slurm in the official Slurm documentation.

This sends the job to one of the GPU nodes and runs it there.
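As a sketch, a GPU job could be submitted either interactively with srun or as a batch script. The partition, GPU syntax, and script name below are assumptions; check `sinfo` and the cluster's Slurm configuration for the actual values:

```shell
#!/bin/bash
# job.sh -- hypothetical batch script; flags depend on the cluster's Slurm config
#SBATCH --job-name=my-training-run
#SBATCH --gres=gpu:1           # request one GPU
#SBATCH --output=slurm-%j.out  # %j expands to the job ID

python train.py
```

Submit it with `sbatch job.sh`, or get an interactive shell on a GPU node with `srun --gres=gpu:1 --pty bash`.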

Contact

If you have any questions, please email latte@csua.berkeley.edu.

Developer Guide

This repo contains the configurations used to test and deploy the Slurm Docker cluster known as Latte. The important commands can be found in the Makefile.

The cluster is created using docker-compose (specifically nvidia-docker-compose), though several other pieces of software are involved as well.

How docker-compose works

(Copied from https://docs.docker.com/compose/overview/ )

Compose is a tool for defining and running multi-container Docker applications. With Compose, you use a YAML file to configure your application’s services. Then, with a single command, you create and start all the services from your configuration.

Using Compose is basically a three-step process:

  1. Define your app’s environment with a Dockerfile so it can be reproduced anywhere.

  2. Define the services that make up your app in docker-compose.yml so they can be run together in an isolated environment.

  3. Run docker-compose up and Compose starts and runs your entire app.
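For orientation, a stripped-down compose file for a cluster like this one might look as follows. This is an illustrative fragment, not the repo's actual docker-compose.yml; image tags and service details are assumptions:

```yaml
# Illustrative fragment only -- see the repo's docker-compose.yml for the real file
version: "2.2"
services:
  mysql:
    image: mysql:5.7
    volumes:
      - var_lib_mysql:/var/lib/mysql
  slurmctld:
    image: slurm-docker-cluster:17.02.9
    command: ["slurmctld"]
    volumes:
      - etc_slurm:/etc/slurm
      - slurm_jobdir:/data
  c1:
    image: slurm-docker-cluster:17.02.9
    command: ["slurmd"]
volumes:
  var_lib_mysql:
  etc_slurm:
  slurm_jobdir:
```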

About the Makefile

The Makefile describes all the necessary commands for building and testing the cluster.
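The exact targets live in the repo's Makefile; a typical layout for a project like this might resemble the hypothetical sketch below (target names and recipes are assumptions, not the repo's actual file):

```makefile
# Illustrative targets only -- consult the actual Makefile in the repo
IMAGE := slurm-docker-cluster:17.02.9

build:
	docker build -t $(IMAGE) .

up: build
	docker-compose up -d

clean:
	docker-compose rm -sf
```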

Slurm Docker Cluster (Documentation from original Repo)

This is a multi-container Slurm cluster using docker-compose. The compose file creates named volumes for persistent storage of MySQL data files as well as Slurm state and log directories.

Containers and Volumes

The compose file will run the following containers:

  • mysql
  • slurmdbd
  • slurmctld
  • c1 (slurmd)
  • c2 (slurmd)

The compose file will create the following named volumes:

  • etc_munge ( -> /etc/munge )
  • etc_slurm ( -> /etc/slurm )
  • slurm_jobdir ( -> /data )
  • var_lib_mysql ( -> /var/lib/mysql )
  • var_log_slurm ( -> /var/log/slurm )

Building the Docker Image

Build the image locally:

$ docker build -t slurm-docker-cluster:17.02.9 .

Starting the Cluster

Run docker-compose to instantiate the cluster:

$ docker-compose up -d

Register the Cluster with SlurmDBD

To register the cluster with the slurmdbd daemon, run the register_cluster.sh script:

$ ./register_cluster.sh

Note: You may have to wait a few seconds for the cluster daemons to become ready before registering the cluster. Otherwise, you may get an error such as sacctmgr: error: Problem talking to the database: Connection refused.

You can check the status of the cluster by viewing the logs: docker-compose logs -f

Accessing the Cluster

Use docker exec to run a bash shell on the controller container:

$ docker exec -it slurmctld bash

From the shell, execute slurm commands, for example:

[root@slurmctld /]# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal*      up 5-00:00:00      2   idle c[1-2]

Submitting Jobs

The slurm_jobdir named volume is mounted on each Slurm container as /data. To see job output files on the controller, change to the /data directory on the slurmctld container and submit a job:

[root@slurmctld /]# cd /data/
[root@slurmctld data]# sbatch --wrap="uptime"
Submitted batch job 2
[root@slurmctld data]# ls
slurm-2.out

Stopping and Restarting the Cluster

$ docker-compose stop
$ docker-compose start

Deleting the Cluster

To remove all containers and volumes, run:

$ docker-compose rm -sf
$ docker volume rm slurmdockercluster_etc_munge slurmdockercluster_etc_slurm slurmdockercluster_slurm_jobdir slurmdockercluster_var_lib_mysql slurmdockercluster_var_log_slurm

About

Latte: A Slurm cluster using docker-compose
