Differences between revisions 5 and 45 (spanning 40 versions)

This page describes our reproducibility requirements for all projects and theses supervised by someone from our group. You should read it carefully when you begin your work. If you leave it until the end, it is much more work. If you consider it right from the beginning, it is actually quite rewarding. Before you read this, you should first read the general page on projects and theses at our chair.

Contents

Access to our SVN, one of our machines, and our file system
Coding and Data Standards
Reproducibility via Docker and Make
Testing your Dockerfile
Troubleshooting

Access to our SVN, one of our machines, and our file system

Right after the very first meeting with your supervisor you will be assigned the following. If, for some reason, this does not happen within a day, you should send an email to your supervisor and our system administrator Frank Dal-Ri.

1. A subfolder in our SVN with URL https://ad-svn.informatik.uni-freiburg.de/student-[projects|theses]/<firstname>-<lastname>. Authentication works via your RZ Account (initials + number).

2. The name of one of our machines, on which you can work. Authentication works via your Informatik Account (first seven letters of your family name + first letter of your given name). This is the username referred to in the next two items.

3. A directory /local/data/<username> for large datasets on a local disk of the machine which you have been assigned. The local disks are fast, so this is great for IO-heavy code (for example, a search engine which frequently reads large segments of data from disk). This directory will be deleted once you have given your presentation and received your grade.

4. A directory /nfs/students/<firstname>-<lastname> for large datasets on our network file system (NFS). Access to these files can be (and often is) significantly slower, because data packets are routed via the network. However, this directory will be kept after you have given your presentation and received your grade. It should contain a tidied-up version of all your data that is worth preserving and was too large to be uploaded to our SVN (see item 1).

Note: if you are working exclusively on your own machine (not the preferred option, but possible if you know what you are doing), then items 2 and 3 are not relevant for you.

Coding and Data Standards

Your code should be properly documented and have a consistent style, and there should be at least a simple unit test for each non-trivial function (this is easy and actually quite rewarding for properly designed functions). For the common languages C++, Java, and Python, you can find examples in our Coding Standards.

In your subfolder in our SVN, there should be a README.txt or README.md in which you clearly explain how you organized your files and what can be found where. If you generated valuable data as part of your project or thesis (that is, data that can only be recreated with great effort or not at all), you should put this data in the folder /nfs/students/<firstname>-<lastname> (see above) and mention this in the README as well. There should also be a README file in your /nfs/students directory.
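
As a starting point, a README can be generated as a skeleton and then filled in by hand. The following is only a minimal sketch, not a required format: the function name write_readme_skeleton and the layout of the generated file are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch: write a README.md skeleton that lists every top-level entry of a
# directory, so that each entry can then be described by hand.
# The function name and the README layout are hypothetical, not a standard.
write_readme_skeleton() {
  dir="$1"
  readme="$dir/README.md"
  # Refuse to overwrite a README that already exists.
  if [ -e "$readme" ]; then
    echo "README.md already exists in $dir" >&2
    return 1
  fi
  {
    echo "# Contents of $(basename "$dir")"
    echo
    for entry in "$dir"/*; do
      # Skip an unexpanded pattern and the README that is being written.
      [ -e "$entry" ] || continue
      if [ "$entry" = "$readme" ]; then continue; fi
      echo "- $(basename "$entry"): TODO describe"
    done
  } > "$readme"
}
```

Calling write_readme_skeleton with the top-level directory of the SVN checkout, or with your /nfs/students directory, produces one line per file or subfolder; the TODO markers still have to be replaced by real descriptions.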

Reproducibility via Docker and Make

Your work should be made easily reproducible using Docker as described in this Docker example. In particular, there should be a file Dockerfile in the top-level directory of your SVN folder such that we can reproduce your results as follows:

svn co https://ad-svn.informatik.uni-freiburg.de/student-[projects|theses]/<firstname>-<lastname>
cd <firstname>-<lastname>
docker build -t <name> .
docker run -it -v /nfs/students/<firstname>-<lastname>:/extern/data <name>

These commands should build and run a docker container, in which everything is properly prepared for your code to run seamlessly. The purpose of the -v option is that the data from /nfs/students is available in the container via /extern/data. Note that you can make the contents of the SVN folder available in the docker container via COPY in the Dockerfile (that is, you don't have to check out the SVN folder again in the container), see the example linked to above.

In the docker container, there should be a Makefile with targets of your choice, so that we can run your various experiments or pipelines or services or whatever it is that you have done. The proper choice of targets is up to you, but the first target in your Makefile should always be help, so that just make will print some information on what can be done with your Makefile. See the Docker example linked to above for a simple example of such a Makefile.

For each of the targets, please specify the following: (1) which files are read, (2) which files are produced, (3) approximately how much time it will take (seconds, minutes, hours, or days), (4) approximately how much RAM and disk space it will need (a few KB, a few MB, many GB). If this information is too complex, it's probably a good idea to let make help just print the high-level info (which targets there are and a short description of what they do) and have a make help-<target> which prints more detailed info for each target (and the make help page should mention that).
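
The time and disk figures asked for above can be measured rather than guessed. The following is a minimal sketch under assumptions: the helper name measure_target is hypothetical, and it reports only wall-clock time and the size of one output directory. RAM usage is not covered and has to be observed separately, for example with top while the target runs.

```shell
#!/bin/sh
# Sketch: run a command and report the elapsed wall-clock time and the size
# of an output directory. The helper name "measure_target" is hypothetical.
# Usage: measure_target <output-dir> <command> [args...]
measure_target() {
  out_dir="$1"
  shift
  start="$(date +%s)"
  "$@" || return $?                 # run the command, pass on its failure
  end="$(date +%s)"
  echo "elapsed: $((end - start)) s"
  # du -sk prints the size in KB followed by the path; keep only the number.
  echo "disk: $(du -sk "$out_dir" | cut -f1) KB"
}
```

Inside the container, a call such as measure_target /extern/output make followed by a target name then gives numbers that can be copied into the help text for that target.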

Important: If some of your targets produce new data, they must not overwrite data which is already in /nfs/students/<firstname>-<lastname>. If the produced data is small, it's OK to have it in the container. If the produced data is large, it's best to mount another volume specifically for output data when running the container, for example -v /path/of/choice:/extern/output.
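
One way to guard against accidental overwrites is to mount the input data read-only and give new output its own volume. The following is a sketch under assumptions: the helper name docker_run_cmd and its three example arguments are placeholders, and it only prints the command so that it can be inspected first. The :ro suffix is standard Docker volume syntax; whether wharfer accepts it unchanged is not stated on this page.

```shell
#!/bin/sh
# Sketch: print a "docker run" call that mounts the preserved input data
# read-only and a separate host directory for newly produced data.
# The helper name and all argument values used below are placeholders.
docker_run_cmd() {
  student="$1"      # stands for <firstname>-<lastname>
  output_dir="$2"   # host directory that receives newly produced data
  image="$3"        # stands for the <name> given to "docker build -t"
  # ":ro" asks Docker to mount the volume read-only inside the container.
  echo "docker run -it" \
    "-v /nfs/students/$student:/extern/data:ro" \
    "-v $output_dir:/extern/output" \
    "$image"
}

docker_run_cmd firstname-lastname /path/of/choice my-image
```

With this layout, targets read from /extern/data and write to /extern/output, so a target that tries to write into the input directory fails instead of silently replacing preserved data.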

Testing your Dockerfile

Of course, you want to test whether your Dockerfile works. However, you cannot run docker build or docker run on our machines, because that would pose a security risk (with the right arguments, you could then become root on our machines).

As a remedy, we provide a wharfer command, which you can use just like docker, but without the mentioned security risks. The use of wharfer is documented here. On our machines tapoa, atlantis, fiji, nkaba, sirba, galera, alicudi, panarea and metropolis it is already installed and you can just use it. If you have been assigned a different machine for your work, ask for access to one of these machines once you are ready to test your Dockerfile.
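
Since wharfer is used just like docker, the build and run calls from the section above carry over with only the command name changed. The following sketch picks whichever tool is installed and prints the two calls; the function name container_tool, the image name my-image and the folder name firstname-lastname are placeholders chosen for illustration.

```shell
#!/bin/sh
# Sketch: use wharfer where it is installed and fall back to docker elsewhere
# (for example on your own machine). The function name is hypothetical.
container_tool() {
  if command -v wharfer >/dev/null 2>&1; then
    echo wharfer
  else
    echo docker
  fi
}

tool="$(container_tool)"
# Print the build and run calls with the chosen tool;
# "my-image" and "firstname-lastname" are placeholders.
echo "$tool build -t my-image ."
echo "$tool run -it -v /nfs/students/firstname-lastname:/extern/data my-image"
```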

Troubleshooting

See DockerTroubleshooting for how to deal with some typical problems which we encountered on our machines so far.

-  ⇤ ← Revision 5 as of 2018-05-15 21:37:41 → 
  Size: 2356
  Editor: Hannah Bast
  Comment:
+   ← Revision 45 as of 2019-02-14 17:48:56 → ⇥
  Size: 7126
  Editor: adpult
  Comment:
-Deletions are marked like this.
+Additions are marked like this.
 Line 1:
-This page describes our reproducibility requirements for ''all'' projects and theses supervised by someone from our group.
+#acl Claudius Korzen:read,write Patrick Brosi:read,write Niklas Schnelle:read,write Markus Näther:read,write All:read

This page describes our reproducibility requirements for ''all'' projects and theses supervised by someone from our group. You should read it carefully when you ''begin'' your work. If you leave it until the end, it is much more work. If you consider it right from the beginning, it is actually quite rewarding. Before you read this, you should first read the [[http://ad-wiki.informatik.uni-freiburg.de/teaching/BachelorAndMasterProjectsAndTheses|general page on projects and theses at our chair]].
-Line 5:
+Line 7:
-= Access to our SVN and our file system =
+= Access to our SVN, one of our machines, and our file system =
-Line 7:
+Line 9:
-Right after the very first meeting with your supervisor you will be assigned the following. If for some reason, this does not happen within a day, you should send an email to your supervisor ''and'' our system adminstrator [[https://ad.informatik.uni-freiburg.de/staff/dal-ri|Frank Dal-Ri]].
+Right after the very first meeting with your supervisor you will be assigned the following. If for some reason, this does not happen within a day, you should send an email to your supervisor ''and'' our system administrator [[https://ad.informatik.uni-freiburg.de/staff/dal-ri|Frank Dal-Ri]].
-Line 11:
+Line 13:
-. A subfolder in our SVN with URL <i>https://ad-svn.informatik.uni-freiburg.de/student-[projects|theses]/&lt;firstname&gt;-&lt;lastname&gt;</i>. Authentication works via your RZ Account (initials + number).</p>
<p style="color: darkblue">
2. The name of one of our machines, on which you can work. Authentication works via your Informatik Account (first seven letters of your family name + first leter of your given name).</p>
+. A <b>subfolder in our SVN</b> with URL https://ad-svn.informatik.uni-freiburg.de/student-[projects|theses]/&lt;firstname&gt;-&lt;lastname&gt;. Authentication works via your RZ Account (initials + number).</p>
 Line 16:
-. On this machine, two folders, where &lt;username&gt; is the user name from your Informatik Account (see 2):<br/>
3.1 <i>/local/data/&lt;username&gt;</i> : A directory for large datasets on a local disk of the machine which you have been assigned. The local disks are fast, so this is great for IO-heavy code (for example, a search engine which frequently reads large segments of data from disk). This directory will be deleted, once you have given your presentation and received your grade.<br/>
3.2 <i>/nfs/students/&lt;username&gt;</i>  A directory for large datasets on our network file system (NFS). Access to these files can be (and often is) significantly slower, because data packets are routed via the network. However, this directory will be kept after you have given your presentation and received your grade. It should contain a tidied up version of all your data that is worth preserving and was too large to be uploaded to our SVN (see item 1).</p>
+. The <b>name of one of our machines</b>, on which you can work. Authentication works via your Informatik Account (first seven letters of your family name + first letter of your given name). This is the username referred to in the next two items.</p>

<p style="color: darkblue">
3. A <b>directory /local/data/&lt;username&gt;</b> for large datasets on a local disk of the machine which you have been assigned. The local disks are fast, so this is great for IO-heavy code (for example, a search engine which frequently reads large segments of data from disk). This directory will be deleted, once you have given your presentation and received your grade.</p>

<p style="color: darkblue">
4. A <b>directory /nfs/students/&lt;firstname-lastname&gt;</b> for large datasets on our network file system (NFS). Access to these files can be (and often is) significantly slower, because data packets are routed via the network. However, this directory will be kept after you have given your presentation and received your grade. It should contain a tidied up version of all your data that is worth preserving and was too large to be uploaded to our SVN (see item 1).</p>
-Line 21:
+Line 25:
-= Location of your code and small datasets =
+Note: if you are working exclusively on your own machine (not the preferred option, but possible if you know what you are doing), then items 2 and 3 are not relevant for you.
-Line 23:
+Line 27:
-Your complete code should be in our SVN. Small datasets (of total size less than, say, 500 MB) should also be in our SVN. Large datasets should be in /nfs/students/<username>
+= Coding and Data Standards =
-Line 25:
+Line 29:
+Your code should be properly documented, it should have a consistent style, and there should be at least a simple unit test for each non-trivial function (this is easy and actually quite rewarding for properly designed functions). For the common languages C++, Java, and Python you find examples in our [[https://daphne.informatik.uni-freiburg.de/CodingStandards/svn|Coding Standards]].
-Line 26:
+Line 31:
-There should be a ''README.txt'' or ''README.md'' in each directory and sub-directory, briefly explaining the contents of the respective directory.
+In your subfolder in our SVN, there should be a ''README.txt'' or ''README.md'', in which you clearly explain how you organized your files and what can be found where. If as part of your project or thesis you generated valuable data (= data, which can only be recreated with large effort or not at all), you should put this data in the folder '''/nfs/students/&lt;firstname-lastname&gt;''' (see above) and mention this in the README as well. There should be a README file in your ''/nfs/students'' directory as well.
-Line 28:
+Line 33:
-Data:
+= Reproducibility via Docker and Make =

Your work should be made easily reproducible using ''Docker'' as described in this [[http://ad-wiki.informatik.uni-freiburg.de/teaching/DockerExample|Docker example]]. In particular, there should be a file '''Dockerfile''' in the top-level directory of your SVN folder such that we can reproduce your results as follows:

{{{
svn co https://ad-svn.informatik.uni-freiburg.de/student-[projects|theses]/<firstname>-<lastname>
cd <firstname>-<lastname>
docker build -t <name> .
docker run -it -v /nfs/students/<firstname>-<lastname>:/extern/data <name>
}}}

These commands should build and run a docker container, in which everything is properly prepared for your code to run seamlessly. The purpose of the -v option is that the data from ''/nfs/students'' is available in the container via ''/extern/data''. Note that you can make the contents of the SVN folder available in the docker container via COPY in the Dockerfile (that is, you don't have to check out the SVN folder again in the container), see the example linked to above.

In the docker container, there should be a ''Makefile'' with targets of your choice, so that we can run your various experiments or pipelines or services or whatever it is that you have done. The proper choice of targets is up to you, but the first target in your Makefile should always be ''help'', so that just ''make'' will print some information on what can be done with your Makefile. See the Docker example linked to above for a simple example of such a Makefile.

For each of the targets, please specify the following: (1) which files are read, (2) which files are produced, (3) approximately how much time it will take (seconds, minutes, hours, or days), (4) approximately how much RAM and disk space it will need (a few KB, a few MB, many GB). If this information is too complex, it's probably a good idea to let ''make help'' just print the high-level info (which targets there are and a short description of what they do) and have a ''make help-<target>'' which prints more detailed info for each target (and the ''make help'' page should mention that).

'''Important:''' If some of your targets produce new data, they must not overwrite data which is already in ''/nfs/students/<firstname>-<lastname>''. If the produced data is small, it's OK to have it in the container. If the produced data is large, it's best to mount another volume specifically for output data when running the container, for example ''-v /path/of/choice:/extern/output''.

= Testing your Dockerfile =

Of course, you want to test whether your Dockerfile works. However, you cannot run ''docker build'' or ''docker run'' on our machines, because that would pose a security risk (with the right arguments, you could then become ''root'' on our machines).

As a remedy, we provide a ''wharfer'' command, which you can use just like ''docker'', but without the mentioned security risks. The use of wharfer is [[https://github.com/ad-freiburg/wharfer#using-wharfer|documented here]]. On our machines ''tapoa'', ''atlantis'', ''fiji'', ''nkaba'', ''sirba'', ''galera'', ''alicudi'', ''panarea'' and ''metropolis'' it is already installed and you can just use it. If you have been assigned a different machine for your work, ask for access to one of these machines once you are ready to test your Dockerfile.

= Troubleshooting =

See [[DockerTroubleshooting]] for how to deal with some typical problems which we encountered on our machines so far.