6586
Comment:
|
23548
|
Deletions are marked like this. | Additions are marked like this. |
Line 1: | Line 1: |
Bachelor's and Master's Projects and Theses at the Chair for Algorithms and Data Structures. ''Note: This is a new page, created on 07-02-2017. The old page, which was outdated and in a terrible format, can still be found [[BachelorAndMasterProjectsAndThesesOld|here]].'' |
#acl Claudius Korzen:read,write Patrick Brosi:read,write Axel Lehmann:read,write Björn Buchhold:read,write Niklas Schnelle:read,write Markus Näther:read,write All:read This page describes how Bachelor's and Master's Projects and Theses work at the Chair for Algorithms and Data Structures. If you are interested in doing a project or theses with us, you should read this page (as well as the pages linked from it) very carefully. {{{#!html <!-- # ''Note: This is a new page, created on 07-02-2017. The old page, which was outdated and in a terrible format, can still be found [[BachelorAndMasterProjectsAndThesesOld|here]].'' --> }}} |
Line 6: | Line 11: |
= Application for a project or theses = | = Application for a project or thesis = If you are interested in doing a project or theses with us, please '''carefully read this page first''' and then '''send an e-mail to the prospective supervisor with the following information:''' |
Line 8: | Line 14: |
If you are interested in doing a project or theses with us, please carefully read this page first and then send an e-mail to the prospective supervisor with the following information: | 1. An acknowledgement that you have carefully read this whole page (not just the first section). 2. A list of the courses you have already taken with us. 3. A transcript of the grades of the courses you have take so far. 4. A very short description of your interests and your strengths (concerning work on a project/thesis). 5. If you have an own project in mind (not necessary): a very short description of the goal and the scientific merit. |
Line 10: | Line 20: |
1. Acknowledge that you have read this whole page (not just the first section) carefully. 2. Which courses have you already taken with us. 3. A very short description of your interests and your strengths (concerning work on a project/thesis). 4. A transcript of the grades of the courses you have take so far. 5. If you have an own project in mine (not necessary): a very short description of the goal and the scientific merit. |
== Thesis with supervision from an external company == |
Line 16: | Line 22: |
= Deliverables = | If you want to do your thesis with a company, please provide the following '''additional''' (to the list above) information in a '''concise''' format. Concise means that you should not write more than a paragraph for each of the items below and that your text should be concrete and understandable for a non-expert. |
Line 18: | Line 24: |
'''B.Sc. and M.Sc. thesis:''' (1) written thesis; (2) oral presentation (20 minutes + question); (3) well-documented version of your code and data, which allow reproducibility of your results, in our SVN. Here are some [[ThesesGuideline|guidelines on how to write a good thesis]]. See the list at the end of this page for many examples (of submitted theses and the accompanying presentation). | a. What is the expected / aimed at outcome? a. How does your approach and the expected result differ from the state of the art? See the section on "Related Work" in our [[ThesesGuidelines|guidelines]]. a. How do you plan to evaluate your work. See the sections on "Theoretical Analysis" and "Empirical Analysis" in our [[ThesesGuidelines|guidelines]]. a. Supervision must be provided by the company and the supervisor should provide an evaluation report at the end of the thesis, in a format to be discussed with us. |
Line 20: | Line 29: |
''' B.Sc. and M.Sc. projects:''' (1) a project website; (2) well-documented version of your code and data, which allow reproducibility of your results, in our SVN. See the list at the end of this page for many examples (of project websites). | = Deliverables of a project or thesis = |
Line 22: | Line 31: |
= Available projects and theses = | Note that the following are requirements, not just guidelines. |
Line 24: | Line 33: |
Click on the titles for a more detailed description of the project, including background info, goals, and first steps. | == Deliverables of a Bachelor's or Master's PROJECT == |
Line 26: | Line 35: |
[[BachelorAndMasterProjectsAndTheses/TokenizationRepair|Tokenization Repair (project and/or thesis)]]: Interesting and well-defined problem, the solution of which is relevant in a variety of information retrieval scenarios. Simple rule-based solutions come to mind easily, but machine learning is key to get very good results. A background in machine learning, or a strong willingness to aquire one as part of the project/thesis, is therefore mandatory for this project. Supervised by [[https://ad.informatik.uni-freiburg.de/staff/bast|Hannah Bast]]. | 1. '''Your code and your data''', which allows reproducibility of your results. We have very specific requirements for this. They are explained in detail on our [[Reproducibility]] page, together with a simple working example. |
Line 28: | Line 37: |
[[BachelorAndMasterProjectsAndTheses/SparqlTextUI|A User Interface for the QLever SPARQL+Text engine (project or thesis)]]: This is well-suited as a project (B.Sc. or M.Sc.) but also provides ample opportunity for continuation with a theses (B.Sc. or M.Sc). You should be fond of good user interfaces and have a good taste concerning layout and colors and such things. You should also like knowledge bases and big datasets and searching in them. Supervised by [[https://ad.informatik.uni-freiburg.de/staff/bast|Hannah Bast]]. | 2. '''A project website.''' The pages must be in a subfolder ''www'' in your folder in our SVN. They should not require any special installation, but work just by copying them somewhere. Note that this does not exclude interactive elements, as long as all the required (!JavaScript) code is also contained in the ''www'' subfolder. There are no strict requirements for the format of the page, but the pages should be well-structured, informative and pleasant to read. Here is a long (incomplete) list of [[https://ad.informatik.uni-freiburg.de/publikationen/bachelor_master_projekte|examples projects]] which have already been completed at our chair. |
Line 30: | Line 40: |
[[BachelorAndMasterProjectsAndTheses/FrequencyMap|Generating Regular-Interval Maps from Public Transit Data (project or thesis, ongoing)]]: We are looking for well-structured input data for our [[http://panarea.informatik.uni-freiburg.de/gtfs-lines/|transit-map drawing algorithm]]. The goal of this topic is to provide the transit-mapper with a graph from which it can render regular-intervall timetable maps ("Taktfahrplankarten") like [[http://www.sma-partner.com/images/Downloads/NGCH-2015.pdf|this]]. You should be a fan of huge datasets, timetable data and of course graphs (but who isn't). As a bachelor or master project with the possibility of continuation as a thesis. Supervised by [[https://ad.informatik.uni-freiburg.de/staff/brosi|Patrick Brosi]]. | == Deliverables of a Bachelor's or Master's THESIS == |
Line 32: | Line 42: |
[[BachelorAndMasterProjectsAndTheses/GtfsMerger|Merging Overlapping GTFS Feeds (Bachelor project or thesis, ongoing)]]: Many transportation companies publish their timetable data either directly as GTFS feeds or in formats that can be converted to GTFS. As soon as you have two GTFS feeds (two sets of timetable data) that cover either the same or adjacent areas, the problem of duplicate trips arises. You should develop a tool that merges two or more GTFS feeds and solves duplication / fragmentation issues. As a bachelor project or thesis. Supervised by [[https://ad.informatik.uni-freiburg.de/staff/brosi|Patrick Brosi]]. | 1. '''Your code and your data''', which allows reproducibility of your results. We have very specific requirements for this. They are explained in detail on our [[Reproducibility]] page, together with a simple working example. |
Line 34: | Line 44: |
[[BachelorAndMasterProjectsAndTheses/GtfsBrowser|GTFS Browser Web App (Bachelor project or thesis)]]: Develop a web-application that can be used to analyze huge GTFS datasets. There are already some tools available (for example, [[https://github.com/google/transitfeed/wiki/ScheduleViewer|ScheduleViewer]]) but they all feel and look quite clumsy, are incredible slow and cannot handle large datasets. Supervised by [[https://ad.informatik.uni-freiburg.de/staff/brosi|Patrick Brosi]]. | 2. '''A written thesis.''' A PDF of the thesis as well as all the sources (tex files, bib files, figures, etc.) should be in the SVN, in a separate folder ''thesis''. See our [[http://ad-wiki.informatik.uni-freiburg.de/teaching/ThesesGuidelines|Guidelines for how to write a proper thesis]]. Here is a [[https://ad.informatik.uni-freiburg.de/publikationen/bachelor_master_arbeiten|long list of examples]].<<BR>> 3. '''An oral presentation.''' A PDF of the slides as well as all the sources (if you use tex: like for the thesis, if you use !PowerPoint: the PPTX file) should be in the SVN, in a separate folder ''presentation''. The maximal duration of the presentation is 20 minutes. There is ample time for questions afterwards. Place: Building 51, 2nd Floor, Room 024 (our "Küche"). You should be there 15 minutes earlier for setup and testing.<<BR>> {{{#!html <!-- For both thesis and projects, your codebase and data '''must''' fulfill the following requirements (please read carefully and check twice before you submit your thesis): 1. '''Instructions on how to install and run your code.''' The standard is that you provide a ''Dockerfile'' together with your code, and the Docker image can be build with ''docker build'' and run with ''docker run'' (you should specify how exactly in your README file). If you want to deviate from this standard, you must discuss this with your supervisor early on in your project / thesis work. You can test your ''Dockerfile'' on our machine ''tapoa'' as described [[https://github.com/ad-freiburg/wharfer|in this document (Section "Using wharfer")]]. If you don't have access to ''tapoa'', write an email to [[https://ad.informatik.uni-freiburg.de/staff/dal-ri|Frank Dal-Ri]] with Cc to your supervisor. 2. '''A README file in the top-level directory.''' This README file should provide an overview over everything that's in the repository, so that anyone looking at your repository can quickly find they are looking for. In particular: where do I find the installation instructions, where do I find the data, where do I find the thesis and the presentation, how do I run the experiments. 3. '''A README file in each other directory.''' Each subfolder (and subsubfolder, etc) should contain a README file that explains how the folder is organized. If there are files in the folder (and not just other subfolders), you should explain which kind of information is in those files and what the format of those files is. Also make intelligent use of suffixes. For example, a file with tab-separated values should have the suffix ''.tsv'', a pure text file should have the suffix ''.txt'' and so on. Also pay attention to consistent names: don't call your files ''one_file'', ''fileTwo'', ''Third file'', ''file_4'' (you get the idea). 4. '''Instructions on how to reproduce your experiments.''' 5. '''List of data files which are not part of the code base.''' Many projects / theses involve data that is too big to upload in our SVN. You should explain precisely, how this data can be obtained, and where on our file system you have stored in (typically somwhere under ''/nfs/raid3/<username>'' ... don't forget the README files in those folders, too). --> }}} = Grading scheme for the theses = Your final grade for the thesis will be an average of four grades, one for each of the following four aspects: 1. Quality of the conceptual/theoretical work. This includes aspects such as: a. How well were the ideas thought out / the details worked out a. How independent was the work a. How meaningful and interesting/useful were the results. 1. Quality of the implementation work. This includes aspects such as: a. Are the installation instructions easy to understand a. Can we run your code and reproduce your results a. Is the code well documented, did you adhere to a proper coding style, are there unit tests (these are all basic requirements in every course offered by our chair). 1. Quality of the evaluation. This includes aspects such as: a. Is the experimental setup well described a. Is the selection of datasets reasonable (there should be at least two, with different characteristics) a. Is there a comparison with a reasonable baseline or competitor method a. Are the results correct a. Are the results properly discussed 1. Quality of the write-up. This includes the aspects described in our [[ThesesGuidelines|guidelines for writing a thesis]] |
Line 37: | Line 88: |
In the first meeting with the supervisor, create a Google Doc where you copy and fill out the following template. The Google Doc must be named ''Firstname Lastname'' and it should be shared with at least the following people: Student, Supervisor(s), Hannah Bast (bast.hannah@gmail.com), Frank Dal-Ri (nirlad@gmail.com, our system adiminstrator), Heike Hägle (hhaegle@gmail.com, our secretary). |
== First meeting == In the first meeting with the supervisor, create a Google Doc where you copy and fill out the following template. The Google Doc must be named ''Firstname Lastname'' and it should be shared with at least the following people: Student, Supervisor(s), Hannah Bast ( bast.hannah@gmail.com ), Frank Dal-Ri ( nirlad@gmail.com , our system administrator), Heike Hägle ( hhaegle@gmail.com , our secretary). |
Line 41: | Line 92: |
Type: [one of: Bachelor Project, Bachelor Thesis, Master Project, Master Thesis] | |
Line 43: | Line 95: |
RZ Account: [initials + number, e.g. ''tf437''] Department Account: [usually the first seven letters of the given name + the initial of the first name, e.g. ''friedt''] |
RZ Account: [initials + number, e.g. ''es437''] Department Account: [usually the first seven letters of the given name + the initial of the first name, e.g. ''scissore''] |
Line 46: | Line 98: |
SVN: [subdirectory in student-projects or student-theses, named firstname-lastname, e.g. ''edward-scissorhands'']] | |
Line 50: | Line 103: |
Line 56: | Line 110: |
Line 59: | Line 112: |
== Follow-up meetings == It usually makes sense to have a few follow-up / progress meetings over the course of the projects. They usually take place, after certain milestones have been reached. Students should come well-prepared to these meetings. Here is a checklist: |
|
Line 60: | Line 115: |
1. Concise(!!) list of what has been done in the Google Doc (at the top) 1. Concise list of questions, if you have any 1. Code written so far should be in the SVN, not (only) on your local computer 1. All relevant files should be on our file system, not (only) on your local computer 1. A working demo of whatever you have done so far 1. The demo should run on our servers, not just locally on your computer 1. Bring your laptop (just in case there is uncommitted stuff) |
|
Line 61: | Line 123: |
= Available and ongoing projects and theses = {{attachment:available.png|Available|align="top"}} = Available; {{attachment:ongoing.png|Ongoing|align="top"}} = Ongoing Click on the titles for a more detailed description of the project, including background info, goals, and first steps. Note that for some topics, it can make sense if several people work on them concurrently. So ''ongoing'' does not necessarily mean, that the topic is no longer available. {{attachment:available.png|Available|align="top"}} [[BachelorAndMasterProjectsAndTheses/GtfsMerger|Merging Overlapping GTFS Feeds (Bachelor project or thesis)]]: Many transportation companies publish their timetable data either directly as GTFS feeds or in formats that can be converted to GTFS. As soon as you have two GTFS feeds (two sets of timetable data) that cover either the same or adjacent areas, the problem of duplicate trips arises. You should develop a tool that merges two or more GTFS feeds and solves duplication / fragmentation issues. As a bachelor project or thesis. Supervised by [[https://ad.informatik.uni-freiburg.de/staff/brosi|Patrick Brosi]]. {{attachment:ongoing.png|Ongoing|align="top"}} [[BachelorAndMasterProjectsAndTheses/GtfsBrowser|GTFS Browser Web App (Bachelor project or thesis)]]: Develop a web-application that can be used to analyze huge GTFS datasets. There are already some tools available (for example, [[https://github.com/google/transitfeed/wiki/ScheduleViewer|ScheduleViewer]]) but they all feel and look quite clumsy, are incredible slow and cannot handle large datasets. Supervised by [[https://ad.informatik.uni-freiburg.de/staff/brosi|Patrick Brosi]]. {{attachment:ongoing.png|Ongoing|align="top"}} [[BachelorAndMasterProjectsAndTheses/DehyphenationAndGuessingLigatures|Merging Hyphenated Words & Guessing Ligatures (Master thesis, ongoing)]]: PDF is a layout-based format: it specifies the positions and fonts of the individual characters, of which the text is composed, but usually does not provide any information about words, paragraphs and sections. You should write code, that (1) merges hyphenated words with respect to ''compound words'' and (2) "guesses" characters (e.g., ligatures) that can't be identified as textual content. Supervised by [[https://ad.informatik.uni-freiburg.de/staff/korzen|Claudius Korzen]]. {{attachment:ongoing.png|Ongoing|align="top"}} [[BachelorAndMasterProjectsAndTheses/HomepageAnalyzer|Extract and Analyze Scientist's Homepages (project and/or thesis)]]: Extract a large number of scientist's homepages from the [[http://commoncrawl.org|CommonCrawl]] web crawl. Extract the central information from these pages, including: name, profession, gender, affiliation. It will be relativel straightforward to get results of medium quality. The challenge is to achieve results of high quality. Machine learning will be crucial to achieve that. Exploring suitable methods is part of the challenge. Supervised by [[https://ad.informatik.uni-freiburg.de/staff/bast|Hannah Bast]]. {{attachment:ongoing.png|Ongoing|align="top"}} [[BachelorAndMasterProjectsAndTheses/TokenizationRepair|Tokenization Repair (project and/or thesis)]]: Interesting and well-defined problem, the solution of which is relevant in a variety of information retrieval scenarios. Simple rule-based solutions come to mind easily, but machine learning is key to get very good results. A background in machine learning, or a strong willingness to aquire one as part of the project/thesis, is therefore mandatory for this project. Supervised by [[https://ad.informatik.uni-freiburg.de/staff/bast|Hannah Bast]]. {{attachment:ongoing.png|Available|align="top"}} [[BachelorAndMasterProjectsAndTheses/SparqlTextUI|Combined SPARQL+Text search (project or thesis)]]: This is well-suited as a project (B.Sc. or M.Sc.) but also provides ample opportunity for continuation with a theses (B.Sc. or M.Sc). You should be fond of good user interfaces and have a good taste concerning layout and colors and such things. You should also like knowledge bases and big datasets and searching in them. Supervised by [[https://ad.informatik.uni-freiburg.de/staff/bast|Hannah Bast]]. {{attachment:available.png|Available|align="top"}} [[BachelorAndMasterProjectsAndTheses/SynonymFinder|Synonym Finder (project and/or theses)]]: Find synonyms for all entities from Freebase or !WikiData or Wikipedia. Evaluate the quality and compare it to existing synonym databases like !CrossWikis. Motivation: Most entities are known under several names. For example, a person like "Ellen !DeGeneres" is also known as just "Ellen". A profession like "Astronaut" is also known as "Spaceman", "Spacewoman" or "Cosmsonaut". Knowing these synonyms is key for many NLP (Natural Lanuage Processing) problems. Including complex problems like semantic search and question answering. Supervised by [[https://ad.informatik.uni-freiburg.de/staff/bast|Hannah Bast]]. {{attachment:ongoing.png|Ongoing|align="top"}} [[BachelorAndMasterProjectsAndTheses/OsmSearch|A Search Engine for OpenStreetMap Data (project and/or thesis)]]: Implement the backend for a search engine for [[http://www.openstreetmap.org|OpenStreetMap]] data. Your server should load an arbitrary .osm file and answer (fuzzy) string searches over the entire dataset. This is a nice project to familiarize yourself with different index structures. Continuation as a thesis is possible. Preferably you have visited our lecture [[InformationRetrievalWS1617|Information Retrieval]], but this is not required. Code should be written in C++. Supervised by [[https://ad.informatik.uni-freiburg.de/staff/brosi|Patrick Brosi]]. {{attachment:ongoing.png|Ongoing|align="top"}} [[BachelorAndMasterProjectsAndTheses/TabularInformationExtraction|Tabular Information Extraction (project and/or thesis)]]: Extract data from a knowledge base in a tabular format. This could for example be a list of cities with columns for the country they are in, population count and coordinates but really anything fitting in a table should be possible. Think of the typically hand crafted summary tables on Wikipedia. This is well-suited as a project (B.Sc. or M.Sc.) with possible continuation as a thesis. You should be interested in knowledge bases, big datasets and searching in them. Supervised by [[https://ad.informatik.uni-freiburg.de/staff/schnelle|Niklas Schnelle]]. {{attachment:ongoing.png|Available|align="top"}} [[BachelorAndMasterProjectsAndTheses/ConversationalAqqu|Conversational Aqqu (project and/or thesis)]]: In this project you will be working on an extension to our Question Answering system Aqqu. This extension will enable follow-up questions and thus a more natural interface. Supervised by [[https://ad.informatik.uni-freiburg.de/staff/schnelle|Niklas Schnelle]]. {{attachment:available.png|Available|align="top"}} [[http://ad-wiki.informatik.uni-freiburg.de/teaching/BachelorAndMasterProjectsAndTheses/QAonWikiData|Question Anwering on WikiData]]: make our question answering work on [[https://www.wikidata.org|WikiData]]. !WikiData is currently growing fast and will become the new Freebase. It's an exciting dataset. Supervised by [[https://ad.informatik.uni-freiburg.de/staff/bast|Hannah Bast]]. {{attachment:available.png|Available|align="top"}} [[http://ad-wiki.informatik.uni-freiburg.de/teaching/BachelorAndMasterProjectsAndTheses/Chatbot|A Simple chat bot]]: Build a simple chat bot using deep learning. For a recent example of such a chatbot, see [[https://www.washingtonpost.com/news/to-your-health/wp/2017/12/03/the-woebot-will-see-you-now-the-rise-of-chatbot-therapy|Woebot]]. This topic gives you a lot of freedom concerning the outcome, but it is also free in the sense that you have to research yourself what is out there already and what can realistically be achieved in six months yourself. Supervised by [[https://ad.informatik.uni-freiburg.de/staff/bast|Hannah Bast]]. {{attachment:available.png|Available|align="top"}} [[http://ad-wiki.informatik.uni-freiburg.de/teaching/BachelorAndMasterProjectsAndTheses/QAErrorCorrection|Error Correction for Question Answering]]: Design and build a system that accepts a (relatively simple) question in natural language and automatically corrects typos etc. This should be realized with a character-based language model learned using deep learning (e.g., with an RNN). Supervised by [[https://ad.informatik.uni-freiburg.de/staff/bast|Hannah Bast]]. {{attachment:ongoing.png|Available|align="top"}} [[http://ad-wiki.informatik.uni-freiburg.de/teaching/BachelorAndMasterProjectsAndTheses/ScientistFinder|Crawling and Analysis of Scientist Homepages]]: Design and build a system that crawls as many university pages and finds as many person mentions as possible and creates a neat knowledge base from this information (Name of Scientist, Affiliation, Gender, Title). In previous work, we tried extracting this information from the CommonCrawl archive, but that turned out to be much too incomplete and unreliable. This is a challenging problem with large practical value. Supervised by [[https://ad.informatik.uni-freiburg.de/staff/schnelle|Niklas Schnelle]]. {{attachment:available.png|Available|align="top"}} [[http://ad-wiki.informatik.uni-freiburg.de/teaching/BachelorAndMasterProjectsAndTheses/BitcoinTradingApp|Bitcoin trading app]] Design and implement a reasonably clever (not to reckless and not to conservative) algorithm for trading bitcoins. Evaluate on historical data and implement an API and a web app to monitor on live data. Take into account all the actual costs incurred (like taxes and fees). Analyse the return of investment that can be reasonably expected. Supervised by [[https://ad.informatik.uni-freiburg.de/staff/bast|Hannah Bast]]. {{attachment:available.png|Available|align="top"}} [[http://ad-wiki.informatik.uni-freiburg.de/teaching/BachelorAndMasterProjectsAndTheses/ClueWebEntityRecognition|Entity Recognition on a Web Corpus]] Design and implement a named-entity recognizer for a web-size corpus. Supervised by [[https://ad.informatik.uni-freiburg.de/staff/bast|Hannah Bast]]. {{attachment:available.png|Available|align="top"}} [[http://ad-wiki.informatik.uni-freiburg.de/teaching/BachelorAndMasterProjectsAndTheses/ClueWebContextDecomposition|Context Decomposition of a Web Corpus]] Decompose the sentences of a given web-size corpus into their semantic components, using fancy technology developed in our group. Supervised by [[https://ad.informatik.uni-freiburg.de/staff/bast|Hannah Bast]]. {{attachment:available.png|Available|align="top"}} [[http://ad-wiki.informatik.uni-freiburg.de/teaching/BachelorAndMasterProjectsAndTheses/CircularArcTransitMaps|Circular Arc Transit Maps]] The goal of this project is to reproduce the results of [[http://www1.informatik.uni-wuerzburg.de/fileadmin/10030100/_temp_/circularArcMetro_01.pdf|this poster presentation]], but with our tool [[http://loom.cs.uni-freiburg.de|LOOM]]. Supervised by [[https://ad.informatik.uni-freiburg.de/staff/brosi|Patrick Brosi]]. {{attachment:available.png|Available|align="top"}} [[http://ad-wiki.informatik.uni-freiburg.de/teaching/BachelorAndMasterProjectsAndTheses/RiverMap|River Maps]] The goal of this project is to use our tool [[http://loom.cs.uni-freiburg.de|LOOM]] to render maps of rivers from OSM data. Each river segment should consist of all rivers that contributed to this river so far (for example, beginning at Mannheim, the Neckar should be part of the segment that makes up the Rhine). Think of a single river as a single subway line starting at the source of that river, and the Rhine, for example, as dozens of small subway lines next to each other. Supervised by [[https://ad.informatik.uni-freiburg.de/staff/brosi|Patrick Brosi]]. {{attachment:available.png|Available|align="top"}} [[http://ad-wiki.informatik.uni-freiburg.de/teaching/BachelorAndMasterProjectsAndTheses/MailSearch|Mail Search]] The goal of this project is a fast and efficient search in very large mail archives. The subtasks are: (1) Write an efficient parser that reads one or files in MBOX format and produces a CSV file with one line per mail and columns for the various structured and unstructured parts of an email (from, to, subject, date, body, ...); (2) take proper care of encoding issues, which are a major issue when dealing with a large number of emails; (3) setup an instance of CompleteSearch! for the data from the CSV file; (4) provide a simple and effective search unterface using the instance from 3 as a backend. Supervised by [[https://ad.informatik.uni-freiburg.de/staff/bast|Hannah Bast]]. |
|
Line 63: | Line 174: |
This page describes how Bachelor's and Master's Projects and Theses work at the Chair for Algorithms and Data Structures. If you are interested in doing a project or theses with us, you should read this page (as well as the pages linked from it) very carefully.
Contents
Application for a project or thesis
If you are interested in doing a project or theses with us, please carefully read this page first and then send an e-mail to the prospective supervisor with the following information:
- An acknowledgement that you have carefully read this whole page (not just the first section).
- A list of the courses you have already taken with us.
- A transcript of the grades of the courses you have take so far.
- A very short description of your interests and your strengths (concerning work on a project/thesis).
- If you have an own project in mind (not necessary): a very short description of the goal and the scientific merit.
Thesis with supervision from an external company
If you want to do your thesis with a company, please provide the following additional (to the list above) information in a concise format. Concise means that you should not write more than a paragraph for each of the items below and that your text should be concrete and understandable for a non-expert.
- What is the expected / aimed at outcome?
How does your approach and the expected result differ from the state of the art? See the section on "Related Work" in our guidelines.
How do you plan to evaluate your work. See the sections on "Theoretical Analysis" and "Empirical Analysis" in our guidelines.
- Supervision must be provided by the company and the supervisor should provide an evaluation report at the end of the thesis, in a format to be discussed with us.
Deliverables of a project or thesis
Note that the following are requirements, not just guidelines.
Deliverables of a Bachelor's or Master's PROJECT
1. Your code and your data, which allows reproducibility of your results. We have very specific requirements for this. They are explained in detail on our Reproducibility page, together with a simple working example.
2. A project website. The pages must be in a subfolder www in your folder in our SVN. They should not require any special installation, but work just by copying them somewhere. Note that this does not exclude interactive elements, as long as all the required (JavaScript) code is also contained in the www subfolder. There are no strict requirements for the format of the page, but the pages should be well-structured, informative and pleasant to read. Here is a long (incomplete) list of examples projects which have already been completed at our chair.
Deliverables of a Bachelor's or Master's THESIS
1. Your code and your data, which allows reproducibility of your results. We have very specific requirements for this. They are explained in detail on our Reproducibility page, together with a simple working example.
2. A written thesis. A PDF of the thesis as well as all the sources (tex files, bib files, figures, etc.) should be in the SVN, in a separate folder thesis. See our Guidelines for how to write a proper thesis. Here is a long list of examples.
3. An oral presentation. A PDF of the slides as well as all the sources (if you use tex: like for the thesis, if you use PowerPoint: the PPTX file) should be in the SVN, in a separate folder presentation. The maximal duration of the presentation is 20 minutes. There is ample time for questions afterwards. Place: Building 51, 2nd Floor, Room 024 (our "Küche"). You should be there 15 minutes earlier for setup and testing.
Grading scheme for the theses
Your final grade for the thesis will be an average of four grades, one for each of the following four aspects:
- Quality of the conceptual/theoretical work. This includes aspects such as:
- How well were the ideas thought out / the details worked out
- How independent was the work
- How meaningful and interesting/useful were the results.
- Quality of the implementation work. This includes aspects such as:
- Are the installation instructions easy to understand
- Can we run your code and reproduce your results
- Is the code well documented, did you adhere to a proper coding style, are there unit tests (these are all basic requirements in every course offered by our chair).
- Quality of the evaluation. This includes aspects such as:
- Is the experimental setup well described
- Is the selection of datasets reasonable (there should be at least two, with different characteristics)
- Is there a comparison with a reasonable baseline or competitor method
- Are the results correct
- Are the results properly discussed
Quality of the write-up. This includes the aspects described in our guidelines for writing a thesis
Instructions for students and supervisors
First meeting
In the first meeting with the supervisor, create a Google Doc where you copy and fill out the following template. The Google Doc must be named Firstname Lastname and it should be shared with at least the following people: Student, Supervisor(s), Hannah Bast ( bast.hannah@gmail.com ), Frank Dal-Ri ( nirlad@gmail.com , our system administrator), Heike Hägle ( hhaegle@gmail.com , our secretary).
Type: [one of: Bachelor Project, Bachelor Thesis, Master Project, Master Thesis] Short working title of project/thesis: [this may change later] RZ Account: [initials + number, e.g. ''es437''] Department Account: [usually the first seven letters of the given name + the initial of the first name, e.g. ''scissore''] Primary e-mail adress: SVN: [subdirectory in student-projects or student-theses, named firstname-lastname, e.g. ''edward-scissorhands'']] Special RAM requirements: Special Disk space requirements: Actual beginning of work: Planned end of work: Goal: Subgoals: First steps: Deadline for first steps:
The first steps should be something relative fail-safe which can just "be done". If the first steps are not even remotely finished by the agreed-upon deadline, we might not want to supervise you further. Supervision is a considerable effort for us and we care a lot about it. In particular, we spent a lot of thought and effort on formulating interesting topics (and goals, and subgoals, and first steps) which are interesting and relevant and related to our actual research. We care about the results of these projects/theses, so you should also care working on them.
Follow-up meetings
It usually makes sense to have a few follow-up / progress meetings over the course of the projects. They usually take place, after certain milestones have been reached. Students should come well-prepared to these meetings. Here is a checklist:
- Concise(!!) list of what has been done in the Google Doc (at the top)
- Concise list of questions, if you have any
- Code written so far should be in the SVN, not (only) on your local computer
- All relevant files should be on our file system, not (only) on your local computer
- A working demo of whatever you have done so far
- The demo should run on our servers, not just locally on your computer
- Bring your laptop (just in case there is uncommitted stuff)
Available and ongoing projects and theses
= Available; = Ongoing
Click on the titles for a more detailed description of the project, including background info, goals, and first steps. Note that for some topics, it can make sense if several people work on them concurrently. So ongoing does not necessarily mean, that the topic is no longer available.
Merging Overlapping GTFS Feeds (Bachelor project or thesis): Many transportation companies publish their timetable data either directly as GTFS feeds or in formats that can be converted to GTFS. As soon as you have two GTFS feeds (two sets of timetable data) that cover either the same or adjacent areas, the problem of duplicate trips arises. You should develop a tool that merges two or more GTFS feeds and solves duplication / fragmentation issues. As a bachelor project or thesis. Supervised by Patrick Brosi.
GTFS Browser Web App (Bachelor project or thesis): Develop a web-application that can be used to analyze huge GTFS datasets. There are already some tools available (for example, ScheduleViewer) but they all feel and look quite clumsy, are incredible slow and cannot handle large datasets. Supervised by Patrick Brosi.
Merging Hyphenated Words & Guessing Ligatures (Master thesis, ongoing): PDF is a layout-based format: it specifies the positions and fonts of the individual characters, of which the text is composed, but usually does not provide any information about words, paragraphs and sections. You should write code, that (1) merges hyphenated words with respect to compound words and (2) "guesses" characters (e.g., ligatures) that can't be identified as textual content. Supervised by Claudius Korzen.
Extract and Analyze Scientist's Homepages (project and/or thesis): Extract a large number of scientist's homepages from the CommonCrawl web crawl. Extract the central information from these pages, including: name, profession, gender, affiliation. It will be relativel straightforward to get results of medium quality. The challenge is to achieve results of high quality. Machine learning will be crucial to achieve that. Exploring suitable methods is part of the challenge. Supervised by Hannah Bast.
Tokenization Repair (project and/or thesis): Interesting and well-defined problem, the solution of which is relevant in a variety of information retrieval scenarios. Simple rule-based solutions come to mind easily, but machine learning is key to get very good results. A background in machine learning, or a strong willingness to aquire one as part of the project/thesis, is therefore mandatory for this project. Supervised by Hannah Bast.
Combined SPARQL+Text search (project or thesis): This is well-suited as a project (B.Sc. or M.Sc.) but also provides ample opportunity for continuation with a theses (B.Sc. or M.Sc). You should be fond of good user interfaces and have a good taste concerning layout and colors and such things. You should also like knowledge bases and big datasets and searching in them. Supervised by Hannah Bast.
Synonym Finder (project and/or theses): Find synonyms for all entities from Freebase or WikiData or Wikipedia. Evaluate the quality and compare it to existing synonym databases like CrossWikis. Motivation: Most entities are known under several names. For example, a person like "Ellen DeGeneres" is also known as just "Ellen". A profession like "Astronaut" is also known as "Spaceman", "Spacewoman" or "Cosmsonaut". Knowing these synonyms is key for many NLP (Natural Lanuage Processing) problems. Including complex problems like semantic search and question answering. Supervised by Hannah Bast.
A Search Engine for OpenStreetMap Data (project and/or thesis): Implement the backend for a search engine for OpenStreetMap data. Your server should load an arbitrary .osm file and answer (fuzzy) string searches over the entire dataset. This is a nice project to familiarize yourself with different index structures. Continuation as a thesis is possible. Preferably you have visited our lecture Information Retrieval, but this is not required. Code should be written in C++. Supervised by Patrick Brosi.
Tabular Information Extraction (project and/or thesis): Extract data from a knowledge base in a tabular format. This could for example be a list of cities with columns for the country they are in, population count and coordinates but really anything fitting in a table should be possible. Think of the typically hand crafted summary tables on Wikipedia. This is well-suited as a project (B.Sc. or M.Sc.) with possible continuation as a thesis. You should be interested in knowledge bases, big datasets and searching in them. Supervised by Niklas Schnelle.
Conversational Aqqu (project and/or thesis): In this project you will be working on an extension to our Question Answering system Aqqu. This extension will enable follow-up questions and thus a more natural interface. Supervised by Niklas Schnelle.
Question Anwering on WikiData: make our question answering work on WikiData. WikiData is currently growing fast and will become the new Freebase. It's an exciting dataset. Supervised by Hannah Bast.
A Simple chat bot: Build a simple chat bot using deep learning. For a recent example of such a chatbot, see Woebot. This topic gives you a lot of freedom concerning the outcome, but it is also free in the sense that you have to research yourself what is out there already and what can realistically be achieved in six months yourself. Supervised by Hannah Bast.
Error Correction for Question Answering: Design and build a system that accepts a (relatively simple) question in natural language and automatically corrects typos etc. This should be realized with a character-based language model learned using deep learning (e.g., with an RNN). Supervised by Hannah Bast.
Crawling and Analysis of Scientist Homepages: Design and build a system that crawls as many university pages and finds as many person mentions as possible and creates a neat knowledge base from this information (Name of Scientist, Affiliation, Gender, Title). In previous work, we tried extracting this information from the CommonCrawl archive, but that turned out to be much too incomplete and unreliable. This is a challenging problem with large practical value. Supervised by Niklas Schnelle.
Bitcoin trading app Design and implement a reasonably clever (not to reckless and not to conservative) algorithm for trading bitcoins. Evaluate on historical data and implement an API and a web app to monitor on live data. Take into account all the actual costs incurred (like taxes and fees). Analyse the return of investment that can be reasonably expected. Supervised by Hannah Bast.
Entity Recognition on a Web Corpus Design and implement a named-entity recognizer for a web-size corpus. Supervised by Hannah Bast.
Context Decomposition of a Web Corpus Decompose the sentences of a given web-size corpus into their semantic components, using fancy technology developed in our group. Supervised by Hannah Bast.
Circular Arc Transit Maps The goal of this project is to reproduce the results of this poster presentation, but with our tool LOOM. Supervised by Patrick Brosi.
River Maps The goal of this project is to use our tool LOOM to render maps of rivers from OSM data. Each river segment should consist of all rivers that contributed to this river so far (for example, beginning at Mannheim, the Neckar should be part of the segment that makes up the Rhine). Think of a single river as a single subway line starting at the source of that river, and the Rhine, for example, as dozens of small subway lines next to each other. Supervised by Patrick Brosi.
Mail Search The goal of this project is a fast and efficient search in very large mail archives. The subtasks are: (1) Write an efficient parser that reads one or files in MBOX format and produces a CSV file with one line per mail and columns for the various structured and unstructured parts of an email (from, to, subject, date, body, ...); (2) take proper care of encoding issues, which are a major issue when dealing with a large number of emails; (3) setup an instance of CompleteSearch! for the data from the CSV file; (4) provide a simple and effective search unterface using the instance from 3 as a backend. Supervised by Hannah Bast.
Completed projects and theses
List of completed B.Sc. and M.Sc. theses (each with the written thesis and presentation)
List of completed B.Sc. and M.Sc. projects (each with a project website)