Iridis Useful Scripts

This repository is designed to help people using the Iridis Supercomputer at the University of Southampton. Feel free to share what scripts you are using and what might be useful for other people!

Contents

Useful external resources

Princeton University supercomputer
Note: even if the partitions are named differently, most of the examples can be easily adapted to Iridis 5.
There are a lot of examples with links to PyTorch implementations (e.g. distributed training).

Wiki page for Iridis 5: Iridis 5 University of Southampton

The HPC team is very active and willing to help. You can find them here: HPC teams

Slurm documentation page: Slurm wiki

Instead of your boring Mac terminal, you can use Termius. It is really neat! Termius

Overview of Iridis5 GPU partitions

GPU availability status

Check the availability of the GPU nodes (gtx1080, gpu, ecsstaff, ecsstudent).

  • Nodes containing Nvidia 1080ti and Tesla v100 GPUs are locked when a user is granted access. This means that even if the user uses only 1 out of the 4 available GPUs (e.g. on gtx1080 nodes), the others are not available to any other users.
  • Nodes containing Nvidia rtx8000 GPUs are shared, meaning that if a user is granted access to 1 out of the 2 available GPUs, the other GPU is still accessible to other users (this implies shared CPU and RAM).
  • The 'ecsall' partition is a resource scavenger partition (using resources that would normally not be available). Your job could be preempted!
# Run the following script to get the availability of the GPUs
./status.sh
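The contents of status.sh are not reproduced here; as a rough sketch, it wraps standard Slurm queries along these lines (the format strings and partition names below are illustrative, not the actual script):

# Hypothetical sketch of the kind of queries status.sh wraps.
# Node states per GPU partition (A/I/O/T = allocated/idle/other/total):
sinfo -p gpu,gtx1080,ecsstaff,ecsstudent -o "%20P %.5a %.11l %16F %N"

# GPUs currently requested by running jobs on a partition:
squeue -p gtx1080 -t RUNNING -o "%.10i %.9u %.8T %b"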

To make things easier, you can define an alias that runs the status.sh script.

# 1. Place the script in your $HOME folder
mv status.sh $HOME

# 2. Open the ~/.bashrc file and add the following line
vim ~/.bashrc
alias status=". $HOME/status.sh"

# 3. Source the file so the alias takes effect
. ~/.bashrc

Now in your terminal you can run:

status
# example output:

-------------------------NODE STATUS-----------------------

PARTITION             AVAIL  TIMELIMIT   NODES(A/I/O/T) NODELIST
gpu                      up 2-12:00:00         1/7/2/10 indigo[51-60]
gtx1080                  up 2-12:00:00         4/4/2/10 pink[51-60]
ecsstaff                 up 5-00:00:00          3/0/0/3 alpha[51-53]
ecsstudents              up   12:00:00          2/1/0/3 alpha[54-56]
Note: allocated/idle/other/total


--------------------------GPU STATUS-----------------------

------------------------------------------
|PARTITION|       |USED|         |NR GPUS|
------------------------------------------
ecsstudent        9                  12
ecsstaff          5                  12
Note: gtx1080 and v100 are GPUS locked to users on the node
      rtx8000 are not locked to node users

Monitor GPU usage

It is important to know whether your GPUs are running at full capacity or whether there is a CPU (data loading) bottleneck. Use the following command to see the actual GPU usage:

ssh <slurm node> # e.g. indigo51

watch -n 1 nvidia-smi

# example output:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:58:00.0 Off |                    0 |
| N/A   61C    P0   182W / 250W |  15194MiB / 16384MiB |     96%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  Off  | 00000000:D8:00.0 Off |                    0 |
| N/A   56C    P0   198W / 250W |  13248MiB / 16384MiB |     96%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    237303      C   ...nda/envs/prose/bin/python    15190MiB |
|    1   N/A  N/A    237304      C   ...nda/envs/prose/bin/python    13244MiB |
+-----------------------------------------------------------------------------+
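If you are not sure which node your job landed on (so you know where to ssh), squeue lists it; a quick sketch:

# List your running jobs and the node(s) they are on (%N is the node list):
squeue -u $USER -t RUNNING -o "%.10i %.12P %.10j %.8T %N"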

Create recycle bin on Iridis

Be careful using the 'rm' command as there is no way of getting those files back. Instead, define a new command that moves unwanted files to a directory on scratch.

# 1. Open the ~/.bashrc file and add the following line
vim ~/.bashrc
alias binrm='mv -t /scratch/<your user name>/recycle-bin/'

# 2. Run the file
. ~/.bashrc

Now in your terminal you can run:

binrm <unwanted file>

Get into the habit of using this command from now on. If you happen to make a mistake, the recycle-bin directory will keep your files for a while (before Iridis removes them).
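Note that the alias assumes the recycle-bin directory already exists ('mv -t' fails otherwise), and you may want to empty it from time to time. A small sketch (the 30-day retention period is only an example):

# Create the recycle bin once (same path as used in the alias above):
mkdir -p /scratch/<your user name>/recycle-bin

# Occasionally delete anything older than 30 days (example retention period):
find /scratch/<your user name>/recycle-bin -type f -mtime +30 -delete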

Monitoring slurm job output

If you want to monitor the output of your job in real time, you can run the 'tail' command every 1 second and keep displaying the most recent 50 lines. This can be useful for faster debugging.

# Add the following line in  ~/.bashrc 
alias analyse="watch -n 1 tail -n 50"

# example use:
analyse slurm-1294659.out

or

tail -f slurm-1294659.out

Submitting to multiple partitions at once

If you want to run your job on a partition regardless of the GPU memory and you want it to start as quickly as possible, you can submit to multiple partitions at once. The job will run on the first partition that has available resources.

In your slurm script add the following line:

#SBATCH --partition=gtx1080,ecsall,ecsstaff,gpu

Optionally you could remove 'ecsall' to prevent preemption.
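For context, here is a minimal sketch of a full submission script around that line. The resource values are only examples (they mirror the 8-CPU / 30G / 1-GPU job shown in the next section), and the module and environment names are placeholders for whatever you normally load:

#!/bin/bash
#SBATCH --partition=gtx1080,ecsall,ecsstaff,gpu   # first partition with free resources wins
#SBATCH --gres=gpu:1                              # example: one GPU
#SBATCH --cpus-per-task=8                         # example CPU count
#SBATCH --mem=30G                                 # example memory request
#SBATCH --time=12:00:00                           # example time limit

module load conda            # placeholder -- load whatever you normally use
source activate my-env       # placeholder environment name
python train.py              # placeholder command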

Checking information about a job

If you don't remember what a submitted job was for, run the following:

scontrol show job <jobid>

# example output:
JobId=3107069 JobName=flip-all-gpu
   UserId=ii1g17(81851) GroupId=fp(245) MCS_label=N/A
   Priority=3504 Nice=0 Account=ecsstaff QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:11:37 TimeLimit=1-06:00:00 TimeMin=N/A
   SubmitTime=2023-05-02T10:43:20 EligibleTime=2023-05-02T10:43:20
   AccrueTime=2023-05-02T10:43:20
   StartTime=2023-05-02T10:43:26 EndTime=2023-05-03T16:43:26 Deadline=N/A
   PreemptEligibleTime=2023-05-02T10:43:26 PreemptTime=None
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-05-02T10:43:26 Scheduler=Main
   Partition=ecsall AllocNode:Sid=cyan51:257724
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=alpha54
   BatchHost=alpha54
   NumNodes=1 NumCPUs=8 NumTasks=1 CPUs/Task=8 ReqB:S:C:T=0:0:*:*
   TRES=cpu=8,mem=30G,node=1,billing=8,gres/gpu=1
   Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
   MinCPUsNode=8 MinMemoryNode=30G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/mainfs/home/ii1g17/protein-embeddings/proemb/iridis-scripts/flip/1080ti.sh --flip_data_path=/scratch/ii1g17/protein-embeddings/data/FLIP --model_path=iridis-scripts/saved_models/multitask/v100-6gpu/3096239/iter_240000_checkpoint.pt --remote=True --split=one_vs_many
   WorkDir=/mainfs/home/ii1g17/protein-embeddings/proemb/iridis-scripts/flip
   StdErr=/mainfs/home/ii1g17/protein-embeddings/proemb/iridis-scripts/flip/slurm-3107069.out
   StdIn=/dev/null
   StdOut=/mainfs/home/ii1g17/protein-embeddings/proemb/iridis-scripts/flip/slurm-3107069.out
   Power=
   TresPerNode=gres:gpu:1

This way you can see the actual command and parameters with which the job was submitted.
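scontrol only knows about jobs that are still queued or finished very recently; for older jobs, sacct (if job accounting is enabled on the cluster) gives similar information. A sketch, using the example job id from above:

# Summary of a past job:
sacct -j 3107069 --format=JobID,JobName,Partition,State,Elapsed,MaxRSS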

Mounting scratch dir to your own device

If you need to work with data that resides on the /scratch partition, you can mount the folder on your own device.

On macOS install: sshfs

In your own device terminal type:

sshfs <username>@iridis5_a.soton.ac.uk:/scratch/<username>/ <path to where you want to mount the folder>
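A worked example with a local mount point (the paths are placeholders); unmount when you are done:

# Create a local mount point and mount your scratch directory:
mkdir -p ~/iridis-scratch
sshfs <username>@iridis5_a.soton.ac.uk:/scratch/<username>/ ~/iridis-scratch

# When finished, unmount (on macOS; on Linux use 'fusermount -u ~/iridis-scratch'):
umount ~/iridis-scratch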

Backing up scratch directory to Onedrive

The data in the scratch directory is not backed up, but in most cases you have to use that partition to store parts of your work. You can sync that directory with the OneDrive storage provided by the university (5 TB). First you need to mount OneDrive on your device so that it appears as a directory.

In your own device terminal type:

rsync -avz --stats --progress <username>@iridis5_a.soton.ac.uk:/scratch/<username>  <the directory of your OneDrive>

This will upload everything to OneDrive. You will have to re-run it to sync new files; you can automate this with a crontab job (see the sketch below). If something is deleted from your scratch directory, you can then sync it back from your OneDrive copy.
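If you want to automate the sync, a crontab entry on your own device along these lines re-runs it daily (the schedule is just an example, and the placeholders match the command above):

# Edit your local crontab with 'crontab -e' and add something like (example: 02:00 every day):
0 2 * * * rsync -az <username>@iridis5_a.soton.ac.uk:/scratch/<username> "<the directory of your OneDrive>"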
