#acl Claudius Korzen:read,write Patrick Brosi:read,write Axel Lehmann:read,write Natalie Prange:read,write All:read

This page gives an overview of available, ongoing, and completed Bachelor's and Master's projects and theses at the Chair for Algorithms and Data Structures.

<>

= Available projects and theses =

== Extensions for QLever, our SPARQL engine for large knowledge graphs ==

QLever is a SPARQL engine that can efficiently handle knowledge graphs with up to 100 billion triples on a standard PC. QLever is an open-source project, full of clever(!) algorithms and data structures, and written in beautiful C++20 (at least the newer parts of the code). It is being very actively developed by our group. Supervised by [[https://ad.informatik.uni-freiburg.de/staff/bast|Hannah Bast]] and [[https://ad.informatik.uni-freiburg.de/staff/kalnbach|Johannes Kalmbach]].

<>

== Geo-Queries on QLever ==

Our SPARQL engine QLever currently supports some basic geo queries, such as https://qlever.cs.uni-freiburg.de/wikidata/53fVvz . However, only a few such queries are supported, and they are not efficient. One reason is that coordinates are currently stored as strings, which are only parsed at runtime. Another (related) reason is that there is currently no data structure (like an R-tree) to narrow down the amount of expensive computation. This is a nice project and thesis for people who like to design and implement efficient algorithms for problems that actually matter. Supervised by [[https://ad.informatik.uni-freiburg.de/staff/bast|Hannah Bast]].

== Question Answering on Wikidata ==

[[https://www.wikidata.org|Wikidata]] is the largest general-purpose knowledge graph, with a current size of 17 billion triples. One way to search Wikidata is via a SPARQL engine, see [[https://query.wikidata.org|Wikidata's own query service]] or an instance of [[https://qlever.cs.uni-freiburg.de/wikidata|our own SPARQL engine]]. SPARQL is great for asking complex queries in a precise manner, but for many simpler queries, it is desirable to just ask a question in natural language. For example: "all German politicians and their current party". The goal of this project or thesis is to enable such question answering on Wikidata. The basic principle is to automatically generate SPARQL queries that interpret the given natural-language question and then rank these queries via deep learning to find the best interpretation. Supervised by [[https://ad.informatik.uni-freiburg.de/staff/bast|Hannah Bast]].

<>

== Wikidata Easy ==

[[https://www.wikidata.org|Wikidata]] is the largest general-purpose knowledge graph, with a current size of 17 billion triples. Unlike Wikipedia, Wikidata uses alphanumeric identifiers, from which it is not apparent what the entity is about. For example, the identifier for Albert Einstein is https://www.wikidata.org/wiki/Q937 . The goal of this project is to create a version of a subset of Wikidata with human-readable (Wikipedia-like) identifiers. Supervised by [[https://ad.informatik.uni-freiburg.de/staff/bast|Hannah Bast]].
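The core of such a mapping can be bootstrapped from the English labels. Below is a minimal sketch, assuming a SPARQL endpoint that serves Wikidata (the endpoint URL and the underscore slug scheme are illustrative assumptions); label collisions are disambiguated by appending the Q-id:

{{{#!python
import requests

# Endpoint URL is an assumption; any SPARQL endpoint serving Wikidata works.
ENDPOINT = "https://qlever.cs.uni-freiburg.de/api/wikidata"

QUERY = """
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?entity ?label WHERE {
  VALUES ?entity { wd:Q937 wd:Q64 }
  ?entity rdfs:label ?label .
  FILTER (LANG(?label) = "en")
}
"""

def human_readable_ids(endpoint, query):
    """Map Wikidata Q-ids to Wikipedia-like identifiers derived from labels."""
    response = requests.get(endpoint, params={"query": query},
                            headers={"Accept": "application/sparql-results+json"})
    response.raise_for_status()
    mapping, taken = {}, set()
    for row in response.json()["results"]["bindings"]:
        qid = row["entity"]["value"].rsplit("/", 1)[-1]
        slug = row["label"]["value"].replace(" ", "_")
        if slug in taken:                    # labels are not unique ...
            slug = "%s_(%s)" % (slug, qid)   # ... so disambiguate with the Q-id
        taken.add(slug)
        mapping[qid] = slug
    return mapping

print(human_readable_ids(ENDPOINT, QUERY))
# e.g. {'Q937': 'Albert_Einstein', 'Q64': 'Berlin'}
}}}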
<>

= Ongoing projects and theses =

== Determine the natural type for each Wikidata entity ==

Each entity has a "natural" primary type. For example, Albert Einstein is a "person", gold is a "substance", Berlin is a "city", the Sagrada Familia is a "building", etc. There is also a secondary, more specific type. For example, more specifically, Albert Einstein is a "scientist", gold is a "chemical element", Berlin is a "capital", and the Sagrada Familia is a "church". Now, a knowledge graph like Wikidata knows not only these types for each entity, but also dozens of other more or less specific types. For example, Albert Einstein is also a "quantum physicist" and an "organism". The goal of this thesis or project is to compute a natural primary and secondary type for each entity automatically. Having such types is very useful for many natural language processing tasks. Supervised by [[https://ad.informatik.uni-freiburg.de/staff/bast|Hannah Bast]].

== Electricity grid state estimation with Graph Neural Networks ==

In this thesis, you will work with Graph Neural Networks to estimate the state of an electricity grid. Supervised by [[mailto:jasmin.montalbano@ise.fraunhofer.de|Jasmin Montalbano (Fraunhofer ISE)]] and [[https://ad.informatik.uni-freiburg.de/staff/hertel|Matthias Hertel]].

<>

== Search Engine for Medical Reports ==

Implement a search system for medical reports (a combination of unstructured and structured data) for the Department of Radiology at the Medical Center of the University of Freiburg, using (and developing further) the fancy search technology available at our chair. Supervised by Ben Wilhelm (Department of Radiology), [[https://ad.informatik.uni-freiburg.de/staff/prange|Natalie Prange]], and [[https://ad.informatik.uni-freiburg.de/staff/bast|Hannah Bast]].

<>

== Mail Search ==

It's 2021 and mail search still sucks, especially if you have a lot of mail. The goal of this project is to enable fast and intelligent search on very large mail archives. Supervised by [[https://ad.informatik.uni-freiburg.de/staff/bast|Hannah Bast]].

<>

== Extracting schedule data from PDF timetables ==

Many public transit agencies already publish their schedule data as GTFS feeds or in some proprietary format. The goal of this project and thesis is to extract machine-readable schedule data from PDF timetables published for print, like [[https://www.rmv.de/c/fileadmin/import/timetable/RMV_Linienfahrplan_G_10_ab_15.12.19_bis_12.12.20_ohne.pdf|this one]]. We expect that good results can already be achieved by using an off-the-shelf tool for extracting tabular data from PDFs (pdftotext). Full station names and coordinates can be extracted from OSM data. An interest in geo data, schedule data, and Hidden Markov Models is beneficial. Supervised by [[https://ad.informatik.uni-freiburg.de/staff/brosi|Patrick Brosi]].

<>

== Annotating OpenStreetMap data with elevation data ==

[[http://www.openstreetmap.org|OpenStreetMap]] comes with the attribute {{{height}}} to capture the elevation above sea level for elements. This attribute, however, is seldom used. The goal of this project is to develop a tool which takes external elevation data from the SRTM project (https://wiki.openstreetmap.org/wiki/SRTM) and annotates the complete OSM data with elevation information (that is, each node should be given a {{{height}}} attribute). The resolution and accuracy of the elevation data make this a challenging task: for example, rivers should not appear to flow uphill, nodes on bridges and in tunnels should not be assigned the ground elevation, and obvious errors (streets suddenly dropping 5 meters) should be smoothed out. Supervised by [[https://ad.informatik.uni-freiburg.de/staff/brosi|Patrick Brosi]].
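A minimal sketch of the per-node lookup, assuming a single SRTM3 tile is already loaded into an array (reading real .hgt files and handling data voids are left out); the running median at the end illustrates the kind of smoothing needed against outliers:

{{{#!python
import numpy as np

# Stand-in for a real SRTM3 tile: 1201x1201 samples covering 1 deg x 1 deg,
# row 0 being the northern edge. Loading actual .hgt data is omitted here.
TILE = np.zeros((1201, 1201), dtype=np.int16)
TILE_LAT, TILE_LON = 47.0, 7.0  # south-west corner of the tile

def elevation(lat, lon):
    """Bilinear interpolation of the SRTM sample grid at (lat, lon)."""
    y = (TILE_LAT + 1.0 - lat) * 1200
    x = (lon - TILE_LON) * 1200
    y0, x0 = min(int(y), 1199), min(int(x), 1199)
    dy, dx = y - y0, x - x0
    return ((1 - dy) * (1 - dx) * TILE[y0, x0]
            + (1 - dy) * dx * TILE[y0, x0 + 1]
            + dy * (1 - dx) * TILE[y0 + 1, x0]
            + dy * dx * TILE[y0 + 1, x0 + 1])

def annotate_way(nodes, window=5):
    """Assign a height to each (lat, lon) node of a way, then smooth
    outliers with a running median over `window` consecutive nodes."""
    raw = [elevation(lat, lon) for lat, lon in nodes]
    half = window // 2
    return [float(np.median(raw[max(0, i - half):i + half + 1]))
            for i in range(len(raw))]

print(annotate_way([(47.5, 7.5), (47.5001, 7.5001), (47.5002, 7.5002)]))
}}}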
<>

== Map-Matching Mobile Phones to Public Transit Vehicles ==

Consider the following scenario: you are sitting in a public transit vehicle and you are interested in the available connections at the next stop, or the current delay of your vehicle, or you have changed your desired destination and want to calculate a new route, using the public transit vehicle you are currently in as a starting point. To do all this, your device has to know which vehicle it is currently travelling in. This is a map-matching problem, but the "map" (the positions of all vehicles in the network) is not static, but highly dynamic. The goal of this thesis and project is to develop an app + a dynamic map-matching backend which "snaps" the device to the most likely public transit vehicle. Once such a link is established, the device can be used to improve the real-time information for the vehicle (the Google project "Pigeon Transit" did exactly that). An interest in geo data and schedule data is beneficial. Supervised by [[https://ad.informatik.uni-freiburg.de/staff/brosi|Patrick Brosi]].

<>

== Simplified Package Delivery App ==

The goal of this bachelor project is to create a web application to streamline package deliveries in large institutions / companies / campuses without a central concierge service. Supervised by [[https://ad.informatik.uni-freiburg.de/staff/brosi|Patrick Brosi]].

<>

== Question Answering on Wikidata ==

Make our question answering work on [[https://www.wikidata.org|Wikidata]]. Wikidata is currently growing fast and will become the new Freebase. It's an exciting dataset. Supervised by [[https://ad.informatik.uni-freiburg.de/staff/bast|Hannah Bast]].

<>

== Spelling Correction by Neural Machine Translation ==

In this project or thesis, you will use state-of-the-art neural machine translation methods to develop a spell checker. Misspelled text is treated as an artificial language, and the task is to train a neural network that “translates” it to correctly spelled text. Supervised by [[https://ad.informatik.uni-freiburg.de/staff/hertel|Matthias Hertel]].

<>

== Spelling Correction and Autocompletion for Mobile Devices ==

Automatic spelling correction and text completion methods help billions of smartphone users every day. The goal of this project or thesis is to develop an Android keyboard which gives accurate corrections and completions efficiently. Supervised by [[https://ad.informatik.uni-freiburg.de/staff/hertel|Matthias Hertel]].

<>

== Fast and Accurate Sentence Segmentation ==

Sentence boundary detection (sometimes called sentence segmentation or sentence splitting) is the task of determining where sentences start and end in a given text. While this seems like a simple task at first glance, and rule-based methods work in many cases, there are difficult corner cases where current methods fail. The goal of this project or thesis is to develop a segmentation method based on deep learning, which beats traditional methods in accuracy and is fast enough to be applicable to big datasets. Supervised by [[https://ad.informatik.uni-freiburg.de/staff/hertel|Matthias Hertel]].

<>

== Disaggregation of Energy Data ==

Manufacturers need to know the electricity consumption profiles of their machines. However, electricity consumption is usually only measured as an aggregate over the entire factory. The goal of this project or thesis is to use Machine Learning and Deep Learning to predict the per-machine consumption from the aggregated measurement. Supervised by [[https://ad.informatik.uni-freiburg.de/staff/hertel|Matthias Hertel]] and [[mailto:benedikt.koepfer@ise.fraunhofer.de|Benedikt Köpfer (Fraunhofer ISE)]].
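One possible baseline, sketched below under strong assumptions: sliding windows over the aggregate signal and per-machine training data being available; the synthetic profiles merely stand in for real measurements:

{{{#!python
import numpy as np
from sklearn.neural_network import MLPRegressor

def windows(series, size):
    """Cut a 1-D aggregate signal into overlapping sliding windows."""
    return np.stack([series[i:i + size] for i in range(len(series) - size + 1)])

# Synthetic stand-in data: the aggregate is the sum of two machine profiles.
t = np.arange(1000)
machine_a = 1.0 + np.sin(t / 20.0)          # e.g. a cyclically running machine
machine_b = (t % 100 < 50).astype(float)    # e.g. a machine with on/off phases
aggregate = machine_a + machine_b

SIZE = 32
X = windows(aggregate, SIZE)
# Target: each machine's consumption at the centre of its window.
centre = np.arange(SIZE // 2, SIZE // 2 + len(X))
y = np.column_stack([machine_a[centre], machine_b[centre]])

model = MLPRegressor(hidden_layer_sizes=(64,), max_iter=1000, random_state=0)
model.fit(X[:800], y[:800])
print("held-out R^2:", model.score(X[800:], y[800:]))
}}}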
<>

== Ant Colony Optimization for Distribution Grid Planning ==

The energy system transformation, with more and more decentralized electricity generation, raises the need for electricity grid reinforcements. The goal of this project or thesis is to evaluate and improve an ant-colony optimization algorithm for automated distribution grid planning. Supervised by [[https://www.ise.fraunhofer.de/en/about-us/staff-profiles/biener-wolfgang.html|Wolfgang Biener]] (Fraunhofer ISE) and [[https://ad.informatik.uni-freiburg.de/staff/hertel|Matthias Hertel]].

<>

= Completed projects and theses (very incomplete list) =

For a detailed description of completed projects, see our [[https://ad-blog.informatik.uni-freiburg.de/|AD Blog]]. For details about completed theses, see our [[https://ad.informatik.uni-freiburg.de/publikationen/bachelor_master_arbeiten|website]].

== Entity Linking and Coreference Resolution in C++ ==

At our chair, we have developed an Entity Linking and Coreference Resolution tool that yields high-quality linking results on Wikipedia. However, the tool is currently implemented in Python, and linking the entire Wikipedia (~ 6M articles) takes several days. The goal of this project is to efficiently re-implement our existing tool in C++ in order to reduce the run time while maintaining the quality of the linking results. Supervised by [[https://ad.informatik.uni-freiburg.de/staff/prange|Natalie Prange]].

<>

== End-to-end Entity Linking with BERT ==

Traditionally, the task of Entity Linking is split into three subtasks: 1) Named Entity Recognition, 2) Candidate Selection, and 3) Named Entity Disambiguation. In reality, these tasks are heavily connected, and while many recent approaches focus on Named Entity Disambiguation, steps 1) and 2) have a huge influence on the performance of an EL system. Your task is to implement an end-to-end Named Entity Recognition and Disambiguation system using a “siamese” BERT architecture. Supervised by [[https://ad.informatik.uni-freiburg.de/staff/prange|Natalie Prange]].

<>

== Scientist finder ==

Extract a large number of scientists' homepages from the [[http://commoncrawl.org/|CommonCrawl]] web crawl. Extract the central information from these pages, including: name, profession, gender, affiliation. It will be relatively straightforward to get results of medium quality. The challenge is to achieve results of high quality. Machine learning will be crucial to achieve that. Exploring suitable methods is part of the challenge. Supervised by [[https://ad.informatik.uni-freiburg.de/staff/bast|Hannah Bast]].

<>

== Entity Linking with BERT Embeddings ==

Recent approaches to Entity Linking make use of BERT word embeddings. In contrast to traditional word embeddings like !Word2Vec, BERT word embeddings are contextualized. That is, the same word has different embeddings, depending on the context it occurs in. This is an interesting property to exploit when trying to disambiguate an entity mention in a text against a knowledge base. The goal of this project is to implement a neural network that uses BERT word embeddings to link entities to their entries in the knowledge base Wikidata. Supervised by [[https://ad.informatik.uni-freiburg.de/staff/prange|Natalie Prange]].
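A minimal sketch of the underlying idea, assuming the Hugging Face transformers library, mean pooling of the last hidden layer, and made-up candidate descriptions; a real system would score against precomputed embeddings of Wikidata entries:

{{{#!python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(text):
    """Mean-pool BERT's last hidden layer into one contextual vector."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape (1, tokens, 768)
    return hidden.mean(dim=1).squeeze(0)

# A mention in context, disambiguated against two candidate descriptions
# (both descriptions are made up for illustration).
mention = embed("Paris announced new bike lanes along the Seine.")
candidates = {
    "Paris (capital of France)": embed("Paris is the capital city of France."),
    "Paris (plant genus)": embed("Paris is a genus of flowering plants."),
}
scores = {name: torch.cosine_similarity(mention, vec, dim=0).item()
          for name, vec in candidates.items()}
print(max(scores, key=scores.get))  # expected: the French capital
}}}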
<>

== Merging Overlapping GTFS Feeds (Bachelor project or thesis) ==

Many transportation companies publish their timetable data either directly as GTFS feeds or in formats that can be converted to GTFS. As soon as you have two GTFS feeds (two sets of timetable data) that cover either the same or adjacent areas, the problem of duplicate trips arises. You should develop a tool that merges two or more GTFS feeds and solves duplication / fragmentation issues. As a bachelor project or thesis. Supervised by [[https://ad.informatik.uni-freiburg.de/staff/brosi|Patrick Brosi]].

<>

== Tokenization Repair (project and/or thesis) ==

An interesting and well-defined problem, the solution of which is relevant in a variety of information retrieval scenarios. Simple rule-based solutions come to mind easily, but machine learning is key to get very good results. A background in machine learning, or a strong willingness to acquire one as part of the project/thesis, is mandatory for this project. Supervised by [[https://ad.informatik.uni-freiburg.de/staff/bast|Hannah Bast]].

<>

== Combined SPARQL+Text search (project or thesis) ==

This is well-suited as a project (B.Sc. or M.Sc.) but also provides ample opportunity for continuation with a thesis (B.Sc. or M.Sc.). You should be fond of good user interfaces and have a good taste concerning layout, colors, and such things. You should also like knowledge bases and big datasets and searching in them. Supervised by [[https://ad.informatik.uni-freiburg.de/staff/bast|Hannah Bast]].

<>

== Synonym Finder (project and/or thesis) ==

Find synonyms for all entities from Freebase, Wikidata, or Wikipedia. Evaluate the quality and compare it to existing synonym databases like !CrossWikis. Motivation: most entities are known under several names. For example, a person like "Ellen !DeGeneres" is also known as just "Ellen". A profession like "Astronaut" is also known as "Spaceman", "Spacewoman", or "Cosmonaut". Knowing these synonyms is key for many NLP (Natural Language Processing) problems, including complex problems like semantic search and question answering. Supervised by [[https://ad.informatik.uni-freiburg.de/staff/bast|Hannah Bast]].

<>

== A Search Engine for OpenStreetMap Data (project and/or thesis) ==

Implement the backend for a search engine for [[http://www.openstreetmap.org|OpenStreetMap]] data. Your server should load an arbitrary .osm file and answer (fuzzy) string searches over the entire dataset. This is a nice project to familiarize yourself with different index structures. Continuation as a thesis is possible. Preferably, you have attended our lecture [[InformationRetrievalWS1617|Information Retrieval]], but this is not required. Code should be written in C++. Supervised by [[https://ad.informatik.uni-freiburg.de/staff/brosi|Patrick Brosi]].

<>

== Tabular Information Extraction (project and/or thesis) ==

Extract data from a knowledge base in a tabular format. This could, for example, be a list of cities with columns for the country they are in, their population count, and their coordinates, but really anything that fits in a table should be possible. Think of the typically hand-crafted summary tables on Wikipedia. This is well-suited as a project (B.Sc. or M.Sc.) with possible continuation as a thesis. You should be interested in knowledge bases, big datasets, and searching in them. Supervised by [[https://ad.informatik.uni-freiburg.de/staff/schnelle|Niklas Schnelle]].
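A minimal sketch of one way to produce such a table via SPARQL, assuming a Wikidata endpoint (the URL is an assumption) and printing tab-separated rows:

{{{#!python
import csv
import sys

import requests

# Endpoint URL is an assumption; any SPARQL endpoint serving Wikidata works.
ENDPOINT = "https://qlever.cs.uni-freiburg.de/api/wikidata"

# Cities (P31 city) with their country (P17) and population (P1082).
QUERY = """
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?city ?country ?population WHERE {
  ?c wdt:P31 wd:Q515 ; rdfs:label ?city ;
     wdt:P17 ?cc ; wdt:P1082 ?population .
  ?cc rdfs:label ?country .
  FILTER (LANG(?city) = "en") FILTER (LANG(?country) = "en")
}
LIMIT 20
"""

rows = requests.get(ENDPOINT, params={"query": QUERY},
                    headers={"Accept": "application/sparql-results+json"}
                    ).json()["results"]["bindings"]
writer = csv.writer(sys.stdout, delimiter="\t")
writer.writerow(["city", "country", "population"])
for row in rows:
    writer.writerow([row[k]["value"] for k in ("city", "country", "population")])
}}}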
<>

== Conversational Aqqu (project and/or thesis) ==

In this project, you will be working on an extension to our Question Answering system Aqqu. This extension will enable follow-up questions and thus a more natural interface. Supervised by [[https://ad.informatik.uni-freiburg.de/staff/schnelle|Niklas Schnelle]].

<>

== A Simple chat bot ==

Build a simple chat bot using deep learning. For a recent example of such a chatbot, see [[https://www.washingtonpost.com/news/to-your-health/wp/2017/12/03/the-woebot-will-see-you-now-the-rise-of-chatbot-therapy|Woebot]]. This topic gives you a lot of freedom concerning the outcome, but it is also free in the sense that you have to research yourself what is already out there and what can realistically be achieved in six months. Supervised by [[https://ad.informatik.uni-freiburg.de/staff/bast|Hannah Bast]].

<>

== Error Correction for Question Answering ==

Design and build a system that accepts a (relatively simple) question in natural language and automatically corrects typos etc. This should be realized with a character-based language model learned using deep learning (e.g., with an RNN). Supervised by [[https://ad.informatik.uni-freiburg.de/staff/bast|Hannah Bast]].

<>

== Crawling and Analysis of Scientist Homepages ==

Design and build a system that crawls as many university pages as possible, finds as many person mentions as possible, and creates a neat knowledge base from this information (Name of Scientist, Affiliation, Gender, Title). In previous work, we tried extracting this information from the !CommonCrawl archive, but that turned out to be much too incomplete and unreliable. This is a challenging problem with large practical value. Supervised by [[https://ad.informatik.uni-freiburg.de/staff/schnelle|Niklas Schnelle]].

<>

== Bitcoin trading app ==

Design and implement a reasonably clever (not too reckless and not too conservative) algorithm for trading bitcoins. Evaluate it on historical data and implement an API and a web app to monitor it on live data. Take into account all the actual costs incurred (like taxes and fees). Analyze the return on investment that can reasonably be expected. Supervised by [[https://ad.informatik.uni-freiburg.de/staff/bast|Hannah Bast]].

<>

== Entity Recognition on a Web Corpus ==

Design and implement a named-entity recognizer for a web-size corpus. Supervised by [[https://ad.informatik.uni-freiburg.de/staff/bast|Hannah Bast]].

<>

== Context Decomposition of a Web Corpus ==

Decompose the sentences of a given web-size corpus into their semantic components, using fancy technology developed in our group. Supervised by [[https://ad.informatik.uni-freiburg.de/staff/bast|Hannah Bast]].

<>

== Extracting Words from Text Documents with Complex Layouts (bachelor thesis) ==

Design and implement a (learning-based) system for extracting words from layout-based text documents (e.g., PDF documents), which is a surprisingly difficult (but not super-hard) task. The reason is that the text is typically only provided character-wise (and not word-wise), so that word boundaries must be derived from, e.g., analyzing the spacing between the characters. Another challenge is that the layout of a text document can be arbitrarily complex, with text arranged in multiple columns and different alignments, so that special care is required to not mix up text from different columns. Supervised by [[https://ad.informatik.uni-freiburg.de/staff/korzen|Claudius Korzen]].
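A minimal sketch of the gap-analysis core, assuming characters come with bounding-box coordinates and a font size (the data layout and the 0.25 threshold are illustrative assumptions; column separation, as noted above, needs extra care and is not handled here):

{{{#!python
from dataclasses import dataclass

@dataclass
class Char:
    text: str
    x0: float    # left edge
    x1: float    # right edge
    y: float     # baseline
    size: float  # font size

def group_into_words(chars, gap_factor=0.25):
    """Group characters on the same line into words: a horizontal gap wider
    than gap_factor * font size starts a new word."""
    # Sort by baseline, then by x, so characters of one line are consecutive.
    chars = sorted(chars, key=lambda c: (round(c.y, 1), c.x0))
    words, current, prev = [], "", None
    for c in chars:
        new_line = prev is not None and abs(c.y - prev.y) > 0.5 * c.size
        wide_gap = prev is not None and c.x0 - prev.x1 > gap_factor * c.size
        if current and (new_line or wide_gap):
            words.append(current)
            current = ""
        current += c.text
        prev = c
    if current:
        words.append(current)
    return words

demo = [Char("H", 0, 6, 100, 10), Char("i", 7, 9, 100, 10),
        Char("a", 15, 20, 100, 10), Char("l", 21, 23, 100, 10)]
print(group_into_words(demo))  # ['Hi', 'al']
}}}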
<>

== Extracting Special Characters from Layout-Based Text Documents (bachelor thesis) ==

Design and implement a (learning-based) system for extracting ''ligatures'' (like fi or ffi) and ''characters with diacritics'' (like á and è) from layout-based text documents (e.g., PDF documents). The challenge here is that such characters can be ''drawn'' into the text, in which case they need to be recognized by analyzing their shapes. Supervised by [[https://ad.informatik.uni-freiburg.de/staff/korzen|Claudius Korzen]].

<>

== GTFS Browser Web App (Bachelor project or thesis) ==

Develop a web application that can be used to analyze huge GTFS datasets. There are already some tools available (for example, [[https://github.com/google/transitfeed/wiki/ScheduleViewer|ScheduleViewer]]), but they all feel and look quite clumsy, are incredibly slow, and cannot handle large datasets. Supervised by [[https://ad.informatik.uni-freiburg.de/staff/brosi|Patrick Brosi]].

<>

== Automated Generation of Rail Noise Maps ==

Given rail schedule data which has been fully map-matched (that is, for each vehicle we know the exact path it takes through a geographical network), it is possible to automatically generate rail noise maps from this data. In this project or bachelor thesis, you are provided with map-matched schedule data from which you should generate such noise maps. Your maps should respect high buildings or large forests, which may act as a noise shield. The noise intensity should be estimated from the expected vehicle speed. Your maps should offer the ability to show the noise for adjustable time spans. An interest in geo data and schedule data is beneficial. Supervised by [[https://ad.informatik.uni-freiburg.de/staff/brosi|Patrick Brosi]].

<>

== River Maps ==

The goal of this project is to use our tool [[http://loom.cs.uni-freiburg.de|LOOM]] to render maps of rivers from OSM data. Each river segment should consist of all rivers that have contributed to this river so far (for example, beginning at Mannheim, the Neckar should be part of the segment that makes up the Rhine). Think of a single river as a single subway line starting at the source of that river, and of the Rhine, for example, as dozens of small subway lines next to each other. Supervised by [[https://ad.informatik.uni-freiburg.de/staff/brosi|Patrick Brosi]].

<>

== Schedule Map Matching with Graphhopper ==

In recent years, the Graphhopper routing engine has been extended with a map-matching functionality based on [[https://www.ismll.uni-hildesheim.de/lehre/semSpatial-10s/script/6.pdf|this Microsoft paper]]. The goal of this project or thesis is to use Graphhopper to implement a map-matching tool which produces the correct geographical routes (shapes) for schedule data given in the GTFS format. The tool should be evaluated (speed and quality) against our own schedule map matching tool pfaedle. Supervised by [[https://ad.informatik.uni-freiburg.de/staff/brosi|Patrick Brosi]].
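A minimal sketch of the HMM core from the cited paper (Newson and Krumm): Gaussian emission probabilities in the GPS-to-candidate distance, transition probabilities penalizing detours between consecutive candidates, combined with Viterbi. The parameter values and the input layout are illustrative assumptions:

{{{#!python
import math

SIGMA = 5.0  # GPS noise in metres (illustrative value)
BETA = 2.0   # transition scale in metres (illustrative value)

def emission(dist_m):
    """P(GPS point | candidate edge): Gaussian in the point-to-edge distance."""
    return math.exp(-0.5 * (dist_m / SIGMA) ** 2)

def transition(route_m, line_m):
    """P(candidate i -> j): exponential in |route length - straight line|."""
    return math.exp(-abs(route_m - line_m) / BETA)

def viterbi(cand_dists, route_dists, line_dists):
    """cand_dists[t][i]: distance of GPS point t to candidate edge i.
    route_dists[t][i][j]: network route length from candidate i at t to
    candidate j at t+1. line_dists[t]: straight-line distance t -> t+1.
    Long traces should use log probabilities to avoid underflow."""
    probs = [emission(d) for d in cand_dists[0]]
    back = []
    for t in range(1, len(cand_dists)):
        new_probs, pointers = [], []
        for j, d in enumerate(cand_dists[t]):
            def score(i):
                return probs[i] * transition(route_dists[t - 1][i][j],
                                             line_dists[t - 1])
            best = max(range(len(probs)), key=score)
            pointers.append(best)
            new_probs.append(score(best) * emission(d))
        probs = new_probs
        back.append(pointers)
    path = [max(range(len(probs)), key=probs.__getitem__)]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]  # most likely candidate index per GPS point

# Toy example: 3 GPS points, 2 candidate edges each -> [0, 0, 1].
print(viterbi(
    cand_dists=[[3.0, 40.0], [4.0, 35.0], [30.0, 5.0]],
    route_dists=[[[10, 60], [60, 10]], [[12, 14], [55, 12]]],
    line_dists=[11.0, 13.0],
))
}}}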