AD Teaching Wiki: Reproducibility

This page describes our reproducibility requirements for all projects and theses supervised by someone from our group. You should read it carefully when you begin your work. If you leave it until the end, it is much more work. If you consider it right from the beginning, it is actually quite rewarding. Before you read this, you should first read the general page on projects and theses at our chair.

Access to our SVN, one of our machines, and our file system

Right after the very first meeting with your supervisor, you will be assigned the following. If for some reason this does not happen within a day, you should send an email to your supervisor and to our system administrator, Frank Dal-Ri.

1. A subfolder in our SVN with URL[projects|theses]/<firstname>-<lastname>. Authentication works via your RZ Account (initials + number).

2. The name of one of our machines, on which you can work. Authentication works via your Informatik Account (first seven letters of your family name + first letter of your given name). This is the username referred to in the next two items.

3. A directory /local/data/<username> for large datasets on a local disk of the machine which you have been assigned. The local disks are fast, so this is great for IO-heavy code (for example, a search engine which frequently reads large segments of data from disk). This directory will be deleted once you have given your presentation and received your grade.

4. A directory /nfs/students/<firstname>-<lastname> for large datasets on our network file system (NFS). Access to these files can be (and often is) significantly slower, because the data is transferred over the network. However, this directory will be kept after you have given your presentation and received your grade. It should contain a tidied-up version of all your data that is worth preserving and was too large to be uploaded to our SVN (see item 1).

Note: if you are working exclusively on your own machine (not the preferred option, but possible if you know what you are doing), then items 2 and 3 are not relevant for you.

Coding and Data Standards

Your code should be properly documented and have a consistent style, and there should be at least a simple unit test for each non-trivial function (this is easy and actually quite rewarding for properly designed functions). For the common languages C++, Java, and Python, you can find examples in our Coding Standards.

In your subfolder in our SVN, there should be a README file (for example, README.txt) in which you clearly explain how you organized your files and what can be found where. If, as part of your project or thesis, you generated valuable data (that is, data which can only be recreated with large effort or not at all), you should put this data in the folder /nfs/students/<firstname>-<lastname> (see above) and mention this in the README as well. There should also be a README file in your /nfs/students directory.

Reproducibility via Docker and Make

Your work should be made easily reproducible using Docker as described in this Docker example. In particular, there should be a file Dockerfile in the top-level directory of your SVN folder such that we can reproduce your results as follows:

svn co[projects|theses]/<firstname>-<lastname>
cd <firstname>-<lastname>
docker build -t <name> .
docker run -it -v /nfs/students/<firstname>-<lastname>:/extern/data <name>

These commands should build and run a docker container, in which everything is properly prepared for your code to run seamlessly. The purpose of the -v option is that the data from /nfs/students is available in the container via /extern/data. Note that you can make the contents of the SVN folder available in the docker container via COPY in the Dockerfile (that is, you don't have to check out the SVN folder again in the container), see the example linked to above.

In the docker container, there should be a Makefile with targets of your choice, so that we can run your various experiments or pipelines or services or whatever it is that you have done. The proper choice of targets is up to you, but the first target in your Makefile should always be help, so that just make will print some information on what can be done with your Makefile. See the Docker example linked to above for a simple example of such a Makefile.

For each of the targets, please specify the following: (1) which files are read, (2) which files are produced, (3) approximately how much time it will take (seconds, minutes, hours, or days), (4) approximately how much RAM and disk space it will need (a few KB, a few MB, many GB). If this information is too complex, it is probably a good idea to let make help print just the high-level info (which targets there are and a short description of what they do) and to have a make help-<target> which prints more detailed info for each target (the make help output should mention this).
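
One way to keep the help text manageable is to let the help and help-<target> rules call a small shell script. The following is only a sketch of that idea, not part of our setup: the function name show_help, the target names index and evaluate, and all file names, times, and sizes are invented for illustration.

```shell
#!/bin/bash
# Hypothetical help script. A Makefile rule could call "show_help" for "make help"
# and "show_help <target>" for "make help-<target>".
# All target names, paths, times, and sizes below are made up.
show_help() {
  case "${1:-}" in
    "")
      # High-level overview: which targets exist and what they do.
      echo "Targets:"
      echo "  index     build the index from the input data"
      echo "  evaluate  run the evaluation on the index"
      echo "Run 'make help-<target>' for details on a target."
      ;;
    index)
      # Detailed info: files read, files produced, time, RAM and disk.
      echo "index: build the index from the input data"
      echo "  reads:    /extern/data/input.tsv"
      echo "  produces: /extern/output/index.bin"
      echo "  time:     about 10 minutes"
      echo "  needs:    about 2 GB RAM, 500 MB disk"
      ;;
    *)
      echo "No detailed help for target '$1'" >&2
      return 1
      ;;
  esac
}
```

Calling show_help without an argument prints the overview, and show_help index prints the four items requested above for that one target.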

Important: If some of your targets produce new data, they must not overwrite data which is already in /nfs/students/<firstname>-<lastname>. If the produced data is small, it's OK to have it in the container. If the produced data is large, it's best to mount another volume specifically for output data when running the container, for example -v /path/of/choice:/extern/output.
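
A script behind one of your targets could guard against accidental overwrites as sketched below. This is a minimal sketch under assumptions: the function name write_new_file and the variable OUTPUT_DIR are hypothetical, and the default path assumes the output volume is mounted at /extern/output as in the -v example.

```shell
#!/bin/bash
# Hypothetical helper: write stdin to a new file, but never replace an existing one.
# OUTPUT_DIR is an assumed variable; it defaults to the mount point from the -v example.
OUTPUT_DIR="${OUTPUT_DIR:-/extern/output}"

write_new_file() {
  local target="$OUTPUT_DIR/$1"
  # With noclobber set, ">" fails if the target already exists.
  # The subshell keeps the option from leaking into the rest of the script.
  ( set -o noclobber; cat > "$target" ) 2>/dev/null || {
    echo "Could not create $target (it may already exist)" >&2
    return 1
  }
}
```

A target could then pipe its results through the helper, for example ./run_experiment | write_new_file results.tsv (run_experiment is a placeholder). A second run with the same file name fails and leaves the first result untouched.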

Testing your Dockerfile

Of course, you want to test whether your Dockerfile works. However, you cannot run docker build or docker run on our machines, because that would pose a security risk (with the right arguments, you could then become root on our machines).

As a remedy, we provide a wharfer command, which you can use just like docker, but without the mentioned security risks. The use of wharfer is documented here. It is already installed on our machines tapoa, atlantis, fiji, nkaba, sirba, galera, alicudi, panarea and metropolis, and you can just use it there. If you have been assigned a different machine for your work, ask for access to one of these machines once you are ready to test your Dockerfile.
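
Since wharfer is meant to be used just like docker, the build and run commands from above should carry over with only the command name replaced. A small sketch, assuming you want the same script to work both on our machines and on your own machine; the function name container_cli is hypothetical, and <name> and the paths are placeholders as before.

```shell
#!/bin/bash
# Hypothetical helper: prefer wharfer where it is installed, otherwise fall back to docker.
container_cli() {
  if command -v wharfer > /dev/null 2>&1; then
    echo wharfer
  else
    echo docker
  fi
}

CLI="$(container_cli)"
echo "Using $CLI"
# Same arguments as in the docker commands above (placeholders left as-is):
# "$CLI" build -t <name> .
# "$CLI" run -it -v /nfs/students/<firstname>-<lastname>:/extern/data <name>
```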


See DockerTroubleshooting for how to deal with some typical problems which we encountered on our machines so far.

AD Teaching Wiki: Reproducibility (last edited 2019-02-14 17:48:56 by adpult)