GPU Cluster
Contents
General Information
Cluster access
To get access to the GPU cluster for a project or thesis at the AD chair, send an email to Frank (dal-ri@informatik.uni-freiburg.de), with your supervisor in the Cc.
Additionally, you must register in the Support Ticket System to be able to read the cluster's FAQ. Use your informatik.uni-freiburg.de e-mail address. Registration page: Ticket Support System
Getting help
There is a great FAQ page, which answers many questions (you need to sign in to be able to read it): FAQ
If you have problems accessing the cluster, send an email to Frank, with your supervisor and Matthias in the Cc. For any other issues, ask your supervisor and Matthias.
Logging in to the cluster
University network
To be able to log in to the cluster, you must be in the university’s network.
We recommend using the university’s VPN. See the various information pages:
https://www.rz.uni-freiburg.de/inhalt/dokumente/pdfs/anleitungen/installation-openconnect-vpn-ubuntu/ (Ubuntu only)
https://www.rz.uni-freiburg.de/services-en/netztel-en/vpn/vpn-einleitung-en?set_language=en
Alternatively, you can access the university's network by logging in to login.informatik.uni-freiburg.de via SSH:
ssh <user>@login.informatik.uni-freiburg.de
Cluster login
There are three login nodes:
- kislogin1.rz.ki.privat
- kislogin2.rz.ki.privat
- kislogin3.rz.ki.privat
Log in via SSH:
ssh <user>@kislogin1.rz.ki.privat
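If you connect from outside the university network, you can also let SSH jump through the login node automatically. A minimal sketch for ~/.ssh/config (the host alias is arbitrary; the host names are the ones listed above):
Host kislogin1
    HostName kislogin1.rz.ki.privat
    User <user>
    ProxyJump <user>@login.informatik.uni-freiburg.de
With this entry, ssh kislogin1 connects in one step.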
Status information
Use the command sfree to get status information for the cluster partitions. It shows all accessible partitions and the number of used and total GPUs per partition.
You can also watch the partitions and your jobs in the dashboard: Dashboard (log in with your TF-account)
Workspaces
Staff members automatically have access to their home directory on the cluster. Students' home directories cannot be mounted at the moment, but you can create a workspace for your project (see below).
Since the size of your home directory is limited to a few GB, we recommend using a workspace. A workspace is a directory that can be accessed from all nodes of the cluster.
Creating a workspace
Use the command ws_allocate to create a new workspace. For help, type man ws_allocate.
Example:
ws_allocate -r 10 -m <user>@informatik.uni-freiburg.de test-workspace 30
This command creates a workspace at /work/dlclarge1/<user>-test-workspace, which expires in 30 days. The maximum lifetime is 180 days. Ten days before the expiration, a notification is sent to the specified email address (the number of days should be controllable with the -r argument; for some reason, however, I get a whole lot of e-mails before my workspace expires).
Find your workspace
To list the paths of your workspaces, type ws_list.
Extending a workspace
When a workspace expires, all content is lost. To extend the workspace, use the command:
ws_allocate -x <ID> <DAYS>
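For example, to extend the workspace from the example above by another 30 days (assuming the workspace name serves as the ID):
ws_allocate -x test-workspace 30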
Running jobs
There are two ways to run your code on the cluster: interactive sessions and batch jobs.
Interactive session
Use the following command to start an interactive session with 1 GPU (by default):
srun -p alldlc_gpu-rtx2080 --pty bash
To check if you have access to the GPU, run:
python3 -c "import torch; print(torch.cuda.is_available())"
The result should be True.
You can now run arbitrary commands interactively. For example, you can start a persistent terminal with tmux to train a neural network in the background (detach from the terminal with Ctrl+b followed by d, re-attach with tmux attach -t <number>, and list all terminals with tmux ls).
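A possible workflow for this, using the tmux commands just mentioned (the session name and training script are placeholders):
tmux new -s train            # create a named tmux session
python3 train.py             # hypothetical training script, run inside the session
# detach with Ctrl+b followed by d; re-attach later with:
tmux attach -t train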
Run exit to exit the interactive session.
To access a specific node of the cluster, add the argument -w <node> to the command srun. The nodes dlcgpu02 to dlcgpu48 are available to users from the AD chair.
If your program requires a lot of RAM, you can request it with the argument --mem=<X>G, where X is the amount of RAM in GB and must not exceed 500. (Side note: when I do this, I sometimes still get assigned to a node with less free RAM than requested. If you need a lot of RAM, it can help to look for a free node in the dashboard (see above) and select it with the -w argument to avoid conflicts with other users.)
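Putting these arguments together, an interactive session on a specific node with extra RAM could be requested as follows (the node name and memory amount are only examples):
srun -p alldlc_gpu-rtx2080 -w dlcgpu02 --mem=32G --pty bash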
Submitting jobs
Instead of running an interactive session, you can submit a job as a bash file. The job will be scheduled and executed as soon as a node with the necessary resources is available.
First, write a bash file with all instructions for your job. Here is a minimal example:
#!/bin/bash
python3 -c "import torch; print(torch.cuda.is_available())"
Then, submit your job with sbatch:
sbatch -p alldlc_gpu-rtx2080 <bash_file>
The output of your job will be written to a file slurm-<jobid>.out in your current directory.
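As a sketch, common options can also be written into the bash file itself as #SBATCH header lines instead of passing them on the command line (the job name, output file name, and memory value below are placeholders, not requirements of the cluster):
#!/bin/bash
#SBATCH -p alldlc_gpu-rtx2080   # partition, as in the sbatch example above
#SBATCH -J test-job             # job name (placeholder)
#SBATCH -o slurm-%j.out         # output file; %j is replaced by the job id
#SBATCH --mem=32G               # requested RAM in GB (placeholder)
python3 -c "import torch; print(torch.cuda.is_available())"
A file like this can then be submitted with sbatch <bash_file>, without additional command-line arguments.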
To see the status of your jobs, run:
sacct --user=$USER
To list all your running jobs, run:
squeue --user=$USER
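For more detail on a single job, sacct also takes a job id and a list of standard Slurm fields (the job id is a placeholder):
sacct -j <jobid> --format=JobID,JobName,State,Elapsed,MaxRSS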
Code and data usage
You can clone your code via GitHub or SVN. Alternatively, you can copy code from another machine via SSH using either scp or rsync.
To be able to access GitHub via SSH, add the following lines to the file ~/.ssh/config:
Host github.com
    ProxyCommand ssh -q login.informatik.uni-freiburg.de nc %h %p
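With this entry in place, cloning over SSH should work as usual (the repository path is a placeholder):
git clone git@github.com:<user>/<repository>.git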
Datasets can be copied from another machine (which must be in the university's network, see above) to the workspace with scp, as follows:
scp -r <file_or_folder> kis2bat1:/work/dlclarge1/<workspace>/<path>
The argument -r means "recursive", that is, all subdirectories and files will be copied.
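rsync (mentioned above) works analogously and can resume interrupted transfers; a sketch with common flags (archive mode, verbose output, compression):
rsync -avz <file_or_folder> kis2bat1:/work/dlclarge1/<workspace>/<path>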
Virtual environment
An easy way to install Python packages is to use a virtual environment.
Go to your workspace and create the virtual environment:
python3 -m venv venv
This creates a virtual environment named "venv". Activate it with the following command (assuming you are in the virtual environment's parent directory):
source venv/bin/activate
Now you can install Python packages in the virtual environment using pip3.
Run deactivate to deactivate the virtual environment.
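Putting the pieces together, a job script can activate the virtual environment before running any Python code. A minimal sketch, assuming the workspace path from the example above and a venv created inside it:
#!/bin/bash
source /work/dlclarge1/<user>-test-workspace/venv/bin/activate
python3 -c "import torch; print(torch.cuda.is_available())"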