= GPU Cluster =

== General Information ==

=== Cluster access ===
To get access to the GPU cluster for a project or thesis at the AD chair, send an email to Frank (dal-ri@informatik.uni-freiburg.de), with your supervisor in Cc. Additionally, you must register in the Support Ticket System to be able to read the cluster's FAQ. Use your informatik.uni-freiburg.de e-mail address.

Registration page: [[https://osticket.informatik.uni-freiburg.de/account.php?do=create|Ticket Support System]]

=== Getting help ===
There is a great FAQ page which answers many questions (you need to sign in to be able to read it): [[https://osticket.informatik.uni-freiburg.de/kb/faq.php?cid=1|FAQ]]

If you have problems accessing the cluster, send an email to Frank, with your supervisor and Matthias in Cc. For any other issues, ask your supervisor and Matthias.

== Logging in to the cluster ==

=== University network ===
To be able to log in to the cluster, you must be in the university's network. We recommend using the university's VPN. See the various information pages:
 * https://www.rz.uni-freiburg.de/inhalt/dokumente/pdfs/anleitungen/installation-openconnect-vpn-ubuntu/ (Ubuntu only)
 * http://mopoinfo.vpn.uni-freiburg.de/node
 * https://www.rz.uni-freiburg.de/services-en/netztel-en/vpn/vpn-einleitung-en?set_language=en
 * https://wiki.uni-freiburg.de/rz/doku.php?id=vpn

Alternatively, you can access the university's network by logging in to login.informatik.uni-freiburg.de via SSH:
{{{
ssh <username>@login.informatik.uni-freiburg.de
}}}

=== Cluster login ===
There are three login nodes:
 * kislogin1.rz.ki.privat
 * kislogin2.rz.ki.privat
 * kislogin3.rz.ki.privat

Log in via SSH:
{{{
ssh <username>@kislogin1.rz.ki.privat
}}}

=== Status information ===
Use the command {{{sfree}}} to get status information for the cluster partitions. It shows all accessible partitions and the number of used and total GPUs per partition.

You can also watch the partitions and your jobs in the dashboard: [[https://kislurm-dashboard.informatik.intra.uni-freiburg.de:3000/d/spTRj8IMz/kislurm2?orgId=1|Dashboard]] (log in with your TF-account)

== Workspaces ==
Staff members automatically have access to their home directory on the cluster. Students' home directories cannot be mounted at the moment, but you can create a workspace for your project (see below). Since the size of your home directory is limited to a few GB, we recommend using a workspace. A workspace is a directory that can be accessed from all nodes of the cluster.

=== Creating a workspace ===
Use the command {{{ws_allocate}}} to create a new workspace. For help, type {{{man ws_allocate}}}. Example:
{{{
ws_allocate -r 10 -m <username>@informatik.uni-freiburg.de test-workspace 30
}}}
This command creates a workspace at {{{/work/dlclarge1/<username>-test-workspace}}}, which expires in 30 days. The maximum lifetime is 180 days. Ten days before the expiration, a notification is sent to the specified email address ~-(this should be controllable with the {{{-r}}} argument; for some reason, however, I get a whole lot of e-mails before my workspace expires)-~.

=== Find your workspace ===
To list the paths of your workspaces, type {{{ws_list}}}.

=== Extending a workspace ===
When a workspace expires, all of its content is lost. To extend a workspace, use:
{{{
ws_allocate -x <workspace-name> <days>
}}}

== Running jobs ==
There are two kinds of sessions: interactive sessions and jobs.

=== Interactive session ===
Use the following command to start an interactive session with 1 GPU (the default):
{{{
srun -p alldlc_gpu-rtx2080 --pty bash
}}}
To check whether you have access to a GPU, run:
{{{
python3 -c "import torch; print(torch.cuda.is_available())"
}}}
The result should be {{{True}}}. You can now run arbitrary commands interactively. For example, you can create a permanent terminal with {{{tmux}}} to train a neural network in the background (detach from the terminal with {{{(Ctrl+b)+d}}}, attach again with {{{tmux attach -t <session-name>}}}, and list all terminals with {{{tmux ls}}}). Run {{{exit}}} to exit the interactive session.

To access a specific node of the cluster, add the argument {{{-w <node>}}} to the {{{srun}}} command. The nodes dlcgpu02 to dlcgpu48 are available to users from the AD chair.

If your program requires a lot of RAM, you can request it with the argument {{{--mem=<X>G}}}, where X is the RAM in GB and must not exceed 500. ~-(Side note: when I do this, I still get assigned to a node with less free RAM than specified. If you need a lot of RAM, it might be good to look for a free node in the dashboard (see above) and access it with the {{{-w}}} argument to avoid conflicts with other users.)-~

=== Submitting jobs ===
Instead of running an interactive session, you can submit a job as a bash file. The job will be scheduled and executed when a node with the necessary resources is available.

First, write a bash file with all instructions for your job. Here is a minimal example:
{{{
#!/bin/bash
python3 -c "import torch; print(torch.cuda.is_available())"
}}}
Then, submit your job with {{{sbatch}}}:
{{{
sbatch -p alldlc_gpu-rtx2080 <job-script.sh>
}}}
The output of your job will be written to a file {{{slurm-<jobid>.out}}} in your current directory.

To see the status of your jobs, run:
{{{
sacct --user=$USER
}}}
To list all your running jobs, run:
{{{
squeue --user=$USER
}}}
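You can also set the partition and resource requests directly inside the job script via {{{#SBATCH}}} directives instead of passing them on the {{{sbatch}}} command line. Below is a minimal sketch; the job name, output pattern, memory, time limit, and the {{{--gres=gpu:1}}} GPU request are example values and assumptions, so adapt them to your job:
{{{
#!/bin/bash
#SBATCH -p alldlc_gpu-rtx2080     # partition, as used in the examples above
#SBATCH --job-name=my-experiment  # placeholder job name
#SBATCH -o slurm-%j.out           # write output to slurm-<jobid>.out
#SBATCH --gres=gpu:1              # request one GPU (assumption: GPUs are requested via gres)
#SBATCH --mem=32G                 # example RAM request
#SBATCH -t 24:00:00               # example time limit (hh:mm:ss)

python3 -c "import torch; print(torch.cuda.is_available())"
}}}
With such a header, the job can be submitted simply with {{{sbatch <job-script.sh>}}}.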
== Code and data usage ==
You can clone your code via GitHub or SVN. Alternatively, you can copy code from another machine via SSH using either {{{scp}}} or {{{rsync}}}.

To be able to access GitHub via SSH, add the following lines to the file {{{~/.ssh/config}}}:
{{{
Host github.com
    ProxyCommand ssh -q login.informatik.uni-freiburg.de nc %h %p
}}}

Datasets can be copied from another machine ~-(which must be in the university's network, see above)-~ to the workspace with {{{scp}}}, as follows:
{{{
scp -r <path-to-dataset> kis2bat1:/work/dlclarge1/<workspace-directory>/
}}}
The argument {{{-r}}} means "recursive", that is, all subdirectories and files will be copied.

== Virtual environment ==
An easy way to install Python packages is to use a virtual environment. Go to your workspace and create the virtual environment:
{{{
python3 -m venv venv
}}}
This creates a virtual environment named "venv". The virtual environment is activated with the following command ~-(assuming you are in the virtual environment's parent directory)-~:
{{{
source venv/bin/activate
}}}
Now you can install Python packages in the virtual environment using {{{pip3}}}. Run {{{deactivate}}} to deactivate the virtual environment.
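As a worked example, a typical setup for a PyTorch project might look as follows ~-(a sketch only; the workspace path and the installed packages are placeholders, adapt them to your project, and run the GPU check inside an interactive session, see above)-~:
{{{
# go to your workspace (placeholder path, see the workspace section above)
cd /work/dlclarge1/<username>-test-workspace

# create and activate the virtual environment
python3 -m venv venv
source venv/bin/activate

# install the packages your project needs (PyTorch is just an example)
pip3 install --upgrade pip
pip3 install torch

# verify that the GPU is visible from within the environment (on a GPU node)
python3 -c "import torch; print(torch.cuda.is_available())"

# leave the environment when you are done
deactivate
}}}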