GPU Cluster

General Information

Cluster access

To get access to the GPU cluster for a project or thesis at the AD chair, send an email to Frank (dal-ri@informatik.uni-freiburg.de), with your supervisor in the Cc.

Additionally, you must register in the Support Ticket System to be able to read the cluster's FAQ. Use your informatik.uni-freiburg.de e-mail address. Registration page: Support Ticket System

Getting help

There is a great FAQ page, which answers many questions (you need to sign in to be able to read it): FAQ

If you have problems accessing the cluster, send an email to Frank, with your supervisor and Matthias in the Cc. For any other issues, ask your supervisor and Matthias.

Logging in to the cluster

University network

To be able to log in to the cluster, you must be in the university’s network.

We recommend using the university's VPN; see the university's VPN information pages for setup instructions.

Alternatively, you can access the university's network by logging in to login.informatik.uni-freiburg.de via SSH:

ssh <user>@login.informatik.uni-freiburg.de

Cluster login

There are three login nodes: kislogin1, kislogin2, and kislogin3 (all under rz.ki.privat).

Log in via SSH:

ssh <user>@kislogin1.rz.ki.privat
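
If you connect from outside the university network, you can let SSH hop through login.informatik.uni-freiburg.de automatically instead of logging in there first. A minimal sketch for ~/.ssh/config, assuming your username is the same on both machines (adjust the User line otherwise):

Host kislogin1.rz.ki.privat
  User <user>
  ProxyJump <user>@login.informatik.uni-freiburg.de

With this entry, ssh kislogin1.rz.ki.privat works directly from your own machine.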

Status information

Use the command sfree to get status information for the cluster partitions. It shows all accessible partitions and the number of used and total GPUs per partition.

You can also watch the partitions and your jobs in the dashboard: Dashboard (log in with your TF-account)

Workspaces

Staff members automatically have access to their home directory on the cluster. Students' home directories cannot be mounted at the moment, but you can create a workspace for your project (see below).

Since the size of your home directory is limited to a few GB, we recommend using a workspace. A workspace is a directory that can be accessed from all nodes of the cluster.

Creating a workspace

Use the command ws_allocate to create a new workspace. For help, type man ws_allocate.

Example:

ws_allocate -r 10 -m <user>@informatik.uni-freiburg.de test-workspace 30

This command creates a workspace at /work/dlclarge1/<user>-test-workspace, which expires in 30 days. The maximum lifetime is 180 days. Ten days before the expiration, a notification is sent to the specified email address (this should be controllable with the -r argument; for some reason, however, I receive a lot of reminder e-mails before my workspace expires).
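
A note on scripting: the standard workspace tools print the path of the new workspace to stdout, so you can capture it in a shell variable. A minimal sketch, assuming this behavior and the example workspace from above:

# Create the workspace and remember its path
WORKSPACE=$(ws_allocate -r 10 -m <user>@informatik.uni-freiburg.de test-workspace 30)
cd "$WORKSPACE"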

Find your workspace

To list the paths of your workspaces, type ws_list.

Extending a workspace

When a workspace expires, all content is lost. To extend the workspace, use the command:

ws_allocate -x <ID> <DAYS>
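
For example, assuming the ID is the workspace name shown by ws_list, the test workspace from above could be extended by another 30 days like this:

ws_allocate -x test-workspace 30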

Running jobs

There are two ways of running your code on the cluster: interactive sessions and batch jobs.

Interactive session

Use the following command to start an interactive session with 1 GPU (by default):

srun -p alldlc_gpu-rtx2080 --pty bash

To check if you have access to the GPU, run:

python3 -c "import torch; print(torch.cuda.is_available())"

The result should be True.

You can now run arbitrary commands interactively. For example, you can create a persistent terminal with tmux to train a neural network in the background (detach from the terminal with Ctrl+b followed by d, attach again with tmux attach -t <number>, and list all terminals with tmux ls).
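
A minimal sketch of this workflow; the session name "train" and the training script are placeholders:

tmux new -s train          # create a named terminal session
python3 train.py           # start the training inside tmux
# detach with Ctrl+b followed by d -- the training keeps running
tmux ls                    # list all sessions
tmux attach -t train       # re-attach later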

Run exit to exit the interactive session.

To access a specific node of the cluster, add the argument -w <node> to the command srun. The nodes dlcgpu02 to dlcgpu48 are available to users from the AD chair.

If your program requires a lot of RAM, you can request RAM with the argument --mem=<X>G, where X is the RAM in GB and must not exceed 500. (Side note: when I do this, I still sometimes get assigned to a node with less free RAM than specified. If you need a lot of RAM, it might be good to look for a free node in the dashboard (see above) and access it with the -w argument to avoid conflicts with other users.)
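
For example, an interactive session that requests a specific node and 64 GB of RAM (node name and memory size are placeholders):

srun -p alldlc_gpu-rtx2080 -w dlcgpu02 --mem=64G --pty bash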

Submitting jobs

Instead of running an interactive session, you can submit a job as a bash file. The job will be scheduled and executed as soon as a node with the necessary resources is available.

First, write a bash file with all instructions for your job. Here is a minimal example:

#!/bin/bash
python3 -c "import torch; print(torch.cuda.is_available())"

Then, submit your job with sbatch:

sbatch -p alldlc_gpu-rtx2080 <bash_file>

The output of your job will be written to a file slurm-<jobid>.out in your current directory.
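
Job options can also be written directly into the bash file as #SBATCH directives instead of being passed on the command line. A sketch of such a script, assuming the partition from above; the job name, resource values, and time limit are placeholders, and --gres may be unnecessary if one GPU is assigned by default:

#!/bin/bash
#SBATCH -p alldlc_gpu-rtx2080     # partition
#SBATCH --job-name=gpu-test       # job name shown in squeue
#SBATCH --output=slurm-%j.out     # output file (%j = job id)
#SBATCH --gres=gpu:1              # request one GPU
#SBATCH --mem=32G                 # RAM in GB
#SBATCH --time=24:00:00           # maximum runtime

python3 -c "import torch; print(torch.cuda.is_available())"

With the directives in the file, the job can be submitted with just sbatch <bash_file>.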

To see the status of your jobs, run:

sacct --user=$USER

To list all your running jobs, run:

squeue --user=$USER

Code and data usage

You can clone your code from GitHub or SVN. Alternatively, you can copy code from another machine via SSH using either scp or rsync.

To be able to access GitHub via SSH, add the following lines to the file ~/.ssh/config:

Host github.com
  ProxyCommand ssh -q login.informatik.uni-freiburg.de nc %h %p
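
With this entry in place, cloning over SSH works as usual; the repository path below is a placeholder:

git clone git@github.com:<user>/<repository>.git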

Datasets can be copied from another machine (which must be in the university's network, see above) to the workspace with scp, as follows:

scp -r <file_or_folder> kis2bat1:/work/dlclarge1/<workspace>/<path>

The argument -r means "recursive", that is, all subdirectories and files will be copied.
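
Alternatively, rsync can resume interrupted transfers and show progress. A sketch with commonly used flags (-a for archive mode, -v for verbose output, -P for progress and partial transfers), using the same placeholder paths as above:

rsync -avP <file_or_folder> kis2bat1:/work/dlclarge1/<workspace>/<path>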

Virtual environment

An easy way to install Python packages is to use a virtual environment.

Go to your workspace and create the virtual environment:

python3 -m venv venv

This creates a virtual environment named "venv". Activate it with the following command (assuming you are in its parent directory):

source venv/bin/activate

Now you can install python packages in the virtual environment using pip3.
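
For example (the package names are only an illustration):

pip3 install --upgrade pip
pip3 install torch numpy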

Run deactivate to deactivate the virtual environment.
