AD Teaching Wiki:

This page gives an overview of available, ongoing and completed Bachelor's and Master's Projects and Theses at the Chair for Algorithms and Data Structures.

Contents

  1. Available projects and theses
    1. Extensions for QLever, our SPARQL engine for large knowledge graphs
    2. Geo-Queries on QLever
    3. Question Answering on WikiData
    4. Wikidata Easy
    5. Determine the natural type for each Wikidata entity
  2. Ongoing projects and theses
    1. Electricity grid state estimation with Graph Neural Networks
    2. Search Engine for Medical Reports
    3. Mail Search
    4. Extracting schedule data from PDF timetables
    5. Annotating OpenStreetMap data with elevation data
    6. Map-Matching Mobile Phones to Public Transit Vehicles
    7. Simplified Package Delivery App
    8. Question Answering on WikiData
    9. Spelling Correction by Neural Machine Translation
    10. Spelling Correction and Autocompletion for Mobile Devices
    11. Fast and Accurate Sentence Segmentation
    12. Disaggregation of Energy Data
    13. Ant Colony Optimization for Distribution Grid Planning
  3. Completed projects and theses (very incomplete list)
    1. Entity Linking and Coreference Resolution in C++
    2. End-to-end Entity Linking with BERT
    3. Scientist finder
    4. Entity Linking with BERT Embeddings
    5. Merging Overlapping GTFS Feeds (Bachelor project or thesis)
    6. Tokenization Repair (project and/or thesis)
    7. Combined SPARQL+Text search (project or thesis)
    8. Synonym Finder (project and/or theses)
    9. A Search Engine for OpenStreetMap Data (project and/or thesis)
    10. Tabular Information Extraction (project and/or thesis)
    11. Conversational Aqqu (project and/or thesis)
    12. A Simple chat bot
    13. Error Correction for Question Answering
    14. Crawling and Analysis of Scientist Homepages
    15. Bitcoin trading app
    16. Entity Recognition on a Web Corpus
    17. Context Decomposition of a Web Corpus
    18. Extracting Words from Text Documents with Complex Layouts (bachelor thesis)
    19. Extracting Special Characters from Layout-Based Text Documents (bachelor thesis)
    20. GTFS Browser Web App (Bachelor project or thesis)
    21. Automated Generation of Rail Noise Maps
    22. River Maps
    23. Schedule Map Matching with Graphhopper

Available projects and theses

Extensions for QLever, our SPARQL engine for large knowledge graphs

QLever is a SPARQL engine that can efficiently handle knowledge graphs with up to 100 billion triples on a standard PC. QLever is an open source project, full of clever(!) algorithms and data structures, and written in beautiful C++20 (at least the newer parts of the code). It is being very actively developed by our group. Supervised by Hannah Bast and Johannes Kalmbach.

- Show / hide details

Type: Bachelor or Master thesis, also possible in combination with a project.

Background info: Have a look at our QLever demo and the links at the bottom, in particular, to our GitHub repository and various publications.

Goal: There are lots of things still to be done for Qlever, like all kinds of missing features and things going not as fast as they could or should. If you like backend development on an open-source project, and if you like or want to learn how to write efficient and beautiful code in C++, this is for you.

It makes sense to start this as a (practical) project and then delve deeper scientifically in your theses. Suitable for both bachelor and master students.

Geo-Queries on QLever

Our SPARQL engine QLever currently supports some basic geo queries such as https://qlever.cs.uni-freiburg.de/wikidata/53fVvz . However, only few such queries are supported and they are not efficient. One reason is that coordinates are currently stored as strings, which are only parsed at runtime. Another (connected) reason is that there is currently no data structure (like an R-tree) to narrow down the amount of expensive computation. This is a nice project and thesis for people who like to design and implement efficient algorithms for problems that actually matter. Supervised by Hannah Bast.

Question Answering on WikiData

Wikidata is the largest general-purpose knowledge graph, with a current size of 17 billion triples. One way to search on Wikidata is via a SPARQL engine, see Wikidata's own query service or an instance of our own SPARQL engine. SPARQL is great for asking complex queries in a precise manner, but for many simpler queries, it is desirable to just ask a question in natural language. For example: "all German politicians and their current party". The goal of this project or thesis is to enable such question answering on Wikidata. The basic principle is to automatically SPARQL queries that interpret the given natural-language question and then rank these queries via deep learning to find the best interpretation. Supervised by Hannah Bast.

- Show / hide details

Type: Bachelor or Master thesis, also possible in combination with a project.

Background info: our question answering system Aqqu anwers questions formulated in natural language from a knowledge base. Currently, that knowledge base is Freebase, which is still the world's largest open source general-purpose knowledge base. Unfortunately, Freebase was discontinued in 2016.

Goal: make our question answering work on WikiData. WikiData is currently growing fast and will become the new Freebase. It's an exciting dataset.

Wikidata Easy

Wikidata is the largest general-purpose knowledge graph, with a current size of 17 billion triples. Unlike for Wikipedia, Wikidata's identifiers have alpha-numerical codes, from which it is not apparent what the entity is about. For example, the identifier for Albert Einstein is https://www.wikidata.org/wiki/Q937 . The goal of this project is to create a version of a subset of Wikidata with human-readable (Wikipedia-like) identifiers. Supervised by Hannah Bast.

- Show / hide details

Type: Interesting and well-defined problem. It will be no problem to get some result. One challenge is to map as much as possible from Wikidata to a simpler format. Another challenge is to find a good human-readable name for each entities (that is disjoint from all other entity names).

Goal: Core-information from Wikidata with human-readable identifiers, so that it is more accessible for others.

Step 1: First version that is stitched together from carefully chosen CONSTRUCT queries, making use of Wikidatas links to Wikipedia (and exploiting that Wikipedia names are unique).

Step 2: Tackle Wikidata entities that are not mapped to Wikipedia and how to find unique and meaningful names for these. Also what about predicate names?

Determine the natural type for each Wikidata entity

Each entity has a "natural" primary type. For example, Albert Einstein is a "person", gold is a "substance", Berlin is a "city", the Sagrada Familia is a "building", etc. There is also a secondary, more specifiy type. For example, more specifically Albert Einstein is a "scientist", gold is a "chemical element", Berlin is a "capital", and the Sagrada Familia is a "church". Now a knowledge graph like Wikidata knows not only these types for each entity, but also dozens of others more or less specific types. For example, Albert Einstein is also a "quantum physicist" and an "organism". The goal of this thesis or project is to compute a natural primary and secondary type for each entity automatically. Having such types is very useful for many natural language processing tasks. Supervised by Hannah Bast.

Ongoing projects and theses

Electricity grid state estimation with Graph Neural Networks

In this thesis you will work with Graph Neural Networks to estimate the state of an electricity grid. Supervised by Jasmin Montalbano (Fraunhofer ISE) and Matthias Hertel.

- Show / hide details

Your interests/skills: Deep Learning, Graph Neural Networks, Electricity Networks

Details:
Electricity grids are increasingly heavily used due to the rise of power-consuming technologies like electric vehicles and heat pumps, which are needed for the energy transition towards a carbon-free energy supply. To guarantee the safety and stability of the grid, grid operators must know the grid's state at every moment in time. Only few measuring stations exist in the grid, and therefore the voltage must be estimated at grid nodes without measurements to infer the full state of the grid. Graph Neural Networks (GNNs) are suited to work on heterogeneous graph structures like an electrical grid. This was already tested at Fraunhofer ISE. In this thesis, you are going to extend and improve the GNN model by Fraunhofer ISE. The performance of GNNs with different layer types, hyperparameter settings and training data will be tested.

Search Engine for Medical Reports

Implement a search system for medical reports (combination of unstructured and structured data) for the Department of Radiology at the Medical Center of the University of Freiburg, using (and developing further) the fancy search technology available at our chair. Supervised by Ben Wilhelm (Department of Radiology), Natalie Prange, and Hannah Bast.

- Show / hide details

Type: Master project

Your interests / skills: C++, Natural Language Processing, Search Systems

Our SPARQL engine QLever supports search in structured data, but also in text. The text search part is still underdeveloped, both regarding what is possible and the efficiency. It would be great for this project, as well as for other projects, to develop it further.

Beyond that, there are also many other interesting and non-trivial aspects that can be considered, such as entity recognition and linking, synonym support, proper handling of negations, etc.

It's 2021 and mail search still sucks, especially if you have a lot of mail. The goal of this is to enable fast and intelligent search on very large mail archives. Supervised by Hannah Bast.

- Show / hide details

Type: Bachelor or Master thesis, also possible in combination with a project.

Background info: Are you happy with the search capabilities of your mailing program? In particular: if you use Thunderbird, are you happy with Thunderbird's search capabilities? I'm not and I think we can do better.

Goal: Subtasks are: (1) Write an efficient parser that reads one or files in MBOX format and produces a CSV file with one line per mail and columns for the various structured and unstructured parts of an email (from, to, subject, date, body, ...); (2) take proper care of encoding issues, which are a major issue when dealing with a large number of emails; (3) setup an instance of CompleteSearch for the data from the CSV file; (4) provide a simple and effective search interface using the instance from 3 as a backend; (5) implement a plugin for Thunderbird.

Extracting schedule data from PDF timetables

Many public transit agencies already publish their schedule data as GTFS feeds or in some proprietary format. The goal of this project and thesis is to extract machine readable schedule data from PDF timetables published for print like this one. We expect that good results can already be achieved by using an off-the-shelf tool for extracting tabular data from PDFs (pdftotext). Full station names and coordinates can be extracted from OSM data. An interest in Geo data, schedule data and Hidden Markov Models is beneficial. Supervised by Patrick Brosi.

- Show / hide details

As a combination of bachelor project / bachelor thesis.

Goal: Given a collection of PDF timetables like this one, extract machine-readable schedule data (in the GTFS format) from it.

Requirements:

  1. Your tool should not require any additional user input.
  2. You should write a tool that can be used from the command line.
  3. The output format should be valid GTFS.

Annotating OpenStreetMap data with elevation data

OpenStreetMap comes with the attribute height to capture the elevation above sea level for elements. This attribute, however, is seldomly used. The goal of this project is to develop a tool which takes external elevation data from the SRTM project (https://wiki.openstreetmap.org/wiki/SRTM) and annotates the complete OSM data with elevation information (that is, each node should be given a height attribute). The elevation data resolution and accuracy make this a challenging task: for example, rivers should not sometimes flow uphill, nodes on bridges and tunnels should not be assigned the ground elevation, and obvious errors (streets suddenly dropping 5 meters) should be smoothened. Supervised by Patrick Brosi.

- Show / hide details

As a combination of bachelor project / bachelor thesis.

Goal: Given OSM data in one of the usual formats (OSM-XML, pbf, o5m) and external elevation data from the SRTM project, annotate the OSM data with height information per node.

Requirements:

  1. Obtaining the SRTM data for large areas is not trivial and requires a special tool (at least this is what we found).
  2. You should write a tool that can be used from the command line.
  3. The output format should be valid OSM.

Optional requirements:

  1. Error correction for bridges and tunnels
  2. Error correction for buildings (the height of a building should not include the building height, as would be the case in the SRTM satellite data)

Map-Matching Mobile Phones to Public Transit Vehicles

Consider the following scenario: you are sitting in a public transit vehicle and you are interested in the available connections at the next stop, or the current delay of your vehicle, or you have changed your desired destination and want to calculate a new route, using the public transit vehicle your are currently in as a starting point. To do all this, your device has to know which vehicle it is currently travelling in. This is a map-matching problem, but the "map" (the positions of all vehicles in the network) is not static, but highly dynamic. The goal of this thesis and project is to develop an app + a dynamic map-matching backend which "snaps" the device to the most likely public transit vehicle. Once such a link is established, the device can be used to improve the real-time information for the vehicle (the Google project "Pigeon Transit" did exactly that). An interest in Geo data and schedule data is beneficial. Supervised by Patrick Brosi.

- Show / hide details

As a combination of bachelor project / bachelor thesis.

Goal: Given a stream of GPS coordinates (each with a timestamp), static schedule data and optional realtime schedule data, match the GPS device to the most likely public transit schedule or decide that the device is not on public transit at all.

Requirements:

  1. Your tool should be implemented as a backend service with a HTTP interface
  2. A simple GUI should be provided, which either accepts a list of coordinate / timestamp pairs or already uses the GPS stream delivered by the device

Optional requirements:

  1. Track the link between the public transit vehicle and the device. If the device is still following the vehicle course, but with a different position than it currently should have, update the real-time information (delay) accordingly.
  2. Provide a chat for riders of the same vehicle.

Simplified Package Delivery App

The goal of this bachelor project is to create a web application to streamline package deliveries in large institutions / companies / campuses without a central concierge service. Supervised by Patrick Brosi.

- Show / hide details

You interests/skills: web programming, user interfaces

A common problem in large institutions / companies / campuses without a central concierge service are missing packages. Someone ordered a package, some other person accepted it, and it is now lost. The goal of this project is to create a web application which can be used by such institutions to make it easier to find lost packages and to avoid frequent institution-wide mails asking for package whereabouts.

Goal 1 Develop a web app in which users can enter packages that they are expecting, and packages they have received. The list should be fuzzy-searchable. If you add a package you have received, the app should auto-suggest possible matching candidates.

Goal 2 Connect the app to one or multiple tracking APIs (like the DHL API) to show the current status of packages.

Goal 3 Implement email notifications.

Extension to a bachelor thesis is possible. The thesis should then result in a polished web application which can be easily configured and deployed. A usable instance for our faculty should be deployed. A usability study should also be conducted.

Question Answering on WikiData

Make our question answering work on WikiData. WikiData is currently growing fast and will become the new Freebase. It's an exciting dataset. Supervised by Hannah Bast.

- Show / hide details

Type: Bachelor or Master thesis, also possible in combination with a project.

Background info: our question answering system Aqqu anwers questions formulated in natural language from a knowledge base. Currently, that knowledge base is Freebase, which is still the world's largest open source general-purpose knowledge base. Unfortunately, Freebase was discontinued in 2016.

Goal: make our question answering work on WikiData. WikiData is currently growing fast and will become the new Freebase. It's an exciting dataset.

Spelling Correction by Neural Machine Translation

In this project or thesis you will use state-of-the-art neural machine translation methods to develop a spell checker. Misspelled text is treated as an artificial language, and the task is to train a neural network that “translates” to correctly spelled text. Supervised by Matthias Hertel.

- Show / hide details

Your interests/skills: Deep Learning, Natural Language Processing, Neural Machine Translation

Very recently, big improvements in neural machine translation were achieved with encoder-decoder networks with attention and Transformer networks. Like machine translation, spelling correction is a sequence-to-sequence transformation task, where the input language is misspelled English and the target language is correctly spelled English. In this project or thesis, you are going to solve this task with deep learning.

While the idea looks very promising, there are many details one has to get right to achieve good performance: finding (or generating) good training data, choosing the right representations for the input and target language and many hyperparameters for the model design and training.

You should be familiar with at least one deep learning framework for python (TensorFlow or pytorch).

Step 1: Start simple: take a text corpus of your choice and inject a few random characters into each sentence. Build and train a small encoder-decoder network that learns to remove the wrong characters from the sentences.
Steps 2 to N: From here you can proceed in many directions. Extend the types of randomized errors (e.g. character deletions, replacements and swaps, as well as merges and splits of words) and find collections of real misspellings to train and evaluate on; try different models (recurrent models without and with attention, Transformer) and optimize their hyperparameters; test different input and output representations (characters, subwords, words or embeddings); and many more.

Goal: In the end you will need to compare the performance of your approach to at least one baseline method and one existing spell checker, and provide insight into what works well (or not) and ideally why it works (or why it does not).

Spelling Correction and Autocompletion for Mobile Devices

Automatic spelling correction and text completion methods help billions of smartphone users every day. The goal of this project or thesis is to develop an Android keyboard which gives accurate corrections and completions efficiently. Supervised by Matthias Hertel.

- Show / hide details

Your interests/skills: Android programming, efficient algorithms, text processing

The standard tools for spelling correction and autocompletion for mobile devices are limited in their functionality and accuracy. You will explore why this is the case, and intend to develop a better tool than the existing ones.

Goal 1 of the project is to develop a simple Android keyboard with spelling correction and autocompletion functionalities. The material from the Information Retrieval lecture about Fuzzy Search will be helpful, and you will need to build a (small) n-gram language model (no Machine Learning required, basic statistics is enough).
Goal 2 is to understand the limitations of existing methods: which RAM and speed constraints follow from the mobile device setting?
Goal 3 is to improve your method as much as you can, while satisfying the constraints.

Extension to a thesis is possible: a proper evaluation and comparison to baselines and existing tools will be required, and you will work on further achievements: improve the accuracy and efficiency of your method, use more complex (possibly neural network) language models, make your tool adaptive to previous user input, deploy a usability study with real humans, and everything you find interesting and want to work on.

Fast and Accurate Sentence Segmentation

Sentence boundary detection (sometimes called sentence segmentation or sentence splitting) is the task of determining where sentences start and end in a given text. While this seems a simple task at first glance, and rule-based methods work in many cases, there are difficult corner cases where current methods fail. The goal of this project or thesis is to develop a segmentation method based on deep learning, which beats traditional methods in accuracy and is fast enough to be applicable to big datasets. Supervised by Matthias Hertel.

- Show / hide details

Your interests/skills: Deep Learning, Natural Language Processing

Consider the following example text, which consists of two sentences:
Prename M. Lastname (brn. 1930 - ?) was a U.S. citizen working for A.B.C. Inc. She owned buildings Pl. Someplace Nos. 1-3, e.g. No. 1.
This is an invented example, but each of the difficult cases in the text happens to appear in some Wikipedia article. Neither the sentence segmenter of SpaCy nor the one of the NLTK toolkit perform the correct split after “Inc.” The two segmenters split the text into 6 and 9 sentences respectively, which is 5 and 8 wrong splits.

We expect that the problem can be solved almost perfectly with machine learning techniques. Furthermore, a learning approach has the advantage that it can be used for multiple languages, simply by training it on different datasets. However, the training dataset has to reflect the difficult cases and must be of appropriate size. As an additional requirement, we want to be able to segment large text corpora (e.g. the entire Wikipedia) in little time, which rules out the usage of big machine learning models.

You could begin your work as follows:

  1. Find existing sentence segmenters (e.g. from NLTK, SpaCy, CoreNLP, OpenNLP, Stanza). Play with them and detect cases where the segmentation goes wrong.

  2. As always: review the research to see which solutions exist already.
  3. Get labeled training and test data. Check whether you find existing datasets or generate a dataset by yourself.
    1. Find corpora which are sentencized. Training data can be generated by attaching randomly selected sentences. You have to ensure that enough difficult examples are in the training data.
    2. Select difficult examples from Wikipedia and label them manually (only suitable for evaluation, not for training where you need many examples).
  4. Implement and test your approach(es).
    1. As a baseline, see how far you can come with handcrafted rules.
    2. Use statistical methods to automatically generate such rules and exceptions (like the “Punkt” tokenizer behind the NLTK sentence segmentation).
    3. We expect the best segmentations from a (bidirectional) recurrent neural network. However, this will be the slowest model. (Advanced: a Transformer should in principle work equally well or better, and is faster.)
    4. Little context could be enough to solve the problem, e.g. 5 characters to the left and right of a potential split position. Therefore, a convolutional neural network or dense neural network could solve the task, which are faster than a recurrent model.
    5. Can we transform our learned model into a set of rules, or a regular expression? Long training times are acceptable, if we can predict fast.

If the problem is too easy, we can make it harder by introducing noise (typos, missing spaces, missing punctuation), or focus on a less clean domain, for example Tweets.

References

Disaggregation of Energy Data

Manufacturers need to know electricity consumption profiles of their machines. However, electricity consumption is usually only measured as an aggregate over the entire factory. The goal of this project or thesis is to use Machine Learning and Deep Learning to predict the per-machine consumption from the aggregated measurement. Supervised by Matthias Hertel and Benedikt Köpfer (Fraunhofer ISE).

- Show / hide details

Your interests/skills: Basic knowledge and practical experience with Machine Learning and Deep Learning is required (e.g. having completed the ML and DL lectures). Interests in energy data, data analytics and statistics are helpful.

Details:
Manufacturers are interested in the electricity consumption profiles on the level of individual machines. This enables them to predict their consumption when a machine gets replaced or a new machine is added, to reduce the peak consumption of the factory, and to scale on-site batteries, among other usages.

However, electricity consumption data is usually only available as a single time series for a factory. You will develop methods to disaggregate the factory-wide measurement into time series for individual machines, based on a small dataset with machine-level measurements.

Your tasks are as follows:

  1. Find and summarize relevant literature about energy data disaggregation.
  2. Familiarize yourself with the data and task.
  3. Develop baseline methods and get a feeling for the difficulty of the task.
  4. Implement the most promising Machine Learning and Deep Learning methods from the literature.
  5. Do a thorough analysis of the performance of different methods and develop ideas for future improvements.

You will get a paid contract with Fraunhofer ISE. For more information see their job offer.

Ant Colony Optimization for Distribution Grid Planning

The energy system transformation, with more and more decentralized electricity generation, raises the need for electricity grid reinforcements. The goal of this project or thesis is to evaluate and improve an ant-colony optimization algorithm for automated distribution grid planning. Supervised by Wolfgang Biener (Fraunhofer ISE) and Matthias Hertel.

- Show / hide details

Your interests/skills: Graph Algorithms, Electricity Networks, Optimization

As a result of the energy system transformation, the decentralized generation of electricity is becoming more important. Additional grid stress is caused by Power2X technologies as electric vehicles, electrolyzers and heat pumps. This is a challenge especially for distribution grids. In the next years those grid structures need to be adapted to fulfill their duty – a reliable, environmentally friendly and economic electricity supply. State of the art grid planning is based on worst-case assumptions that lead to an oversizing of assets. This is necessary because planners would be overburdened with time series based robust planning tasks. Therefore Fraunhofer ISE develops algorithms for automated optimal network planning, which can take into account all relevant factors for grid reinforcement. By this we aim to reduce necessary grid reinforcement and avoid a delay of the energy system transformation due to grid bottlenecks.

In the project and/or thesis an ant-colony algorithm for distribution grid reinforcement shall be tested and improved. The algorithm optimizes medium and low voltage grid structures. The goal is to test the algorithm at grid structures provided from our partners and to evaluate its results. Based on convergence analysis it shall be decided if heuristics to improve convergence properties shall be implemented or a different ant-colony algorithm should be implemented. Moreover options to use parallelization shall be tested.

Task description

  • Getting familiar with the topic of distribution grid planning
  • Getting familiar with ant-colony algorithms
  • Getting familiar with the existing optimization framework
  • Applying the algorithm on our real world test cases
  • Analyzing resulting grids
  • Integration of heuristics into the algorithm
  • Evaluation and documentation of the results

Completed projects and theses (very incomplete list)

For a detailed description of completed projects see our AD Blog. For details about completed theses see our Website.

Entity Linking and Coreference Resolution in C++

At our chair, we have developed an Entity Linking and Coreference Resolution tool that yields high quality linking results on Wikipedia. However, the tool is currently implemented in Python and linking the entire Wikipedia (~ 6M articles) takes several days. The goal of this project is to efficiently re-implement our existing tool in C++ in order to reduce the run time while maintaining the quality of the linking results. Supervised by Natalie Prange.

- Show / hide details

Type: Bachelor or Master project, with a possible extension to a thesis.

Your interests / skills: Efficient programming in C++, Natural Language Processing

Entity Linking and Coreference Resolution are important steps in the pipeline for downstream NLP tasks like Question Answering or Relation Extraction. Entity Linking on Wikipedia articles is different from Entity Linking on other corpora since intra-Wiki hyperlinks that are manually created by the authors of an article already provide some links "for free". Our existing rule-based system uses these hyperlinks together with entity information extracted from Wikidata such as aliases, popularity, type or gender to link mentions in a given Wikipedia article to Wikidata entities.

Goal: Reduce the run time of our Entity Linking + Coreference Resolution tool while maintaining the quality of the linking results, by efficiently re-implementing the tool in C++.

End-to-end Entity Linking with BERT

Traditionally, the task of Entity Linking is split into three subtasks: 1) Named Entity Recition 2) Candidate Selection 3) Named Entity Disambiguation. In reality, these tasks are heavily connected and while many recent approaches focus on Named Entity Disambiguation, steps 1) and 2) have a huge influence on the performance of an EL system. Your task is to implement an end-to-end Named Entity Recognition and Disambiguation system using a “siamese” BERT architecture. Supervised by Natalie Prange.

- Show / hide details

Your interests/skills: Deep Learning, Natural Language Processing, Knowledge Bases

Step 1: Implement a BERT model with two heads: one for mention detection and one to generate mention embeddings. Find a proper way to embed entities from a knowledge base (e.g. Wikipedia2vec). Then perform candidate selection using similarity search between mention embeddings and entity embeddings.

Step 2: Look for ways to improve the performance and especially the time consumption of your system.

Step 3: Thoroughly evaluate your system. Compare it with at least two other approaches (one state of the art system and one baseline) on different benchmarks. Analyse your results in detail.

Scientist finder

Extract a large number of scientist's homepages from the CommonCrawl web crawl. Extract the central information from these pages, including: name, profession, gender, affiliation. It will be relatively straightforward to get results of medium quality. The challenge is to achieve results of high quality. Machine learning will be crucial to achieve that. Exploring suitable methods is part of the challenge. Supervised by Hannah Bast.

- Show / hide details

Type: Interesting and well-defined problem. It will be relatively straightforward to get results of medium quality. The challenge is to achieve results of high quality. Machine learning will be crucial to achieve that. Exploring suitable methods is part of the challenge.

Goal: Extract a large number of scientist's homepages from the CommonCrawl web crawl. Extract the central information from these pages, including: name, profession, gender, affiliation.

Step 1: Download a large chunk of university webpages (e.g. all universities in Europe or in the US or even in the whole world).

Step 2: Design, implement and run a classifier that identifies people's homepages. A first version can be rule-based. For a classifier of high-quality machine learning will eventually be needed.

Step 3: Extract basic information, like the name, the gender, the profession, and the affiliation. A first version can be rule-based. For result with high precision and recall (which is the goal), machine learning will eventually be needed.

Entity Linking with BERT Embeddings

Recent approaches to Entity Linking make use of BERT word embeddings. In contrast to traditional word embeddings like Word2Vec, BERT word embeddings are contextualized. That is, the same word has different embeddings, depending on the context it occurs in. This is an interesting property to exploit when trying to disambiguate an entity mention in a text against a knowledge base. The goal of this project is to implement a neural network that uses BERT word embeddings to link entities to their entries in the knowledge base WikiData. Supervised by Natalie Prange.

- Show / hide details

Your interests/skills: Deep Learning, Natural Language Processing, Knowledge Bases

Step 1: Choose a meaningful example (in the context of Entity Linking) and visualize BERT word embeddings. We hope to find that BERT word embeddings are useful to differentiate between identical names that refer to different entities. This can be used to motivate our BERT approach to Entity Linking.

Step 2: Implement a neural network that uses BERT word embeddings to link entities to their entries in the knowledge base WikiData. This could be a neural network that takes BERT word embeddings of a recognized entity mention in a text and word embeddings that represent a candidate entity in the knowledge base (candidate entities can be extracted from the knowledge base using the entity’s label and its aliases) and decides whether the mention should be linked to the entity.

Step 3: Compare your approach against state-of-the-art approaches and at least one baseline. What happens if you use Word2Vec / GloVe instead of BERT word embeddings? Evaluate your approach on at least 2 benchmarks (e.g. AIDA-CoNLL, Wikipedia with links, our benchmark, ....).

Merging Overlapping GTFS Feeds (Bachelor project or thesis)

Many transportation companies publish their timetable data either directly as GTFS feeds or in formats that can be converted to GTFS. As soon as you have two GTFS feeds (two sets of timetable data) that cover either the same or adjacent areas, the problem of duplicate trips arises. You should develop a tool that merges two or more GTFS feeds and solves duplication / fragmentation issues. As a bachelor project or thesis. Supervised by Patrick Brosi.

- Show / hide details

Type: Bachelor project or Bachelor thesis. You should be fond of train schedules and have a basic understanding of GIS and geometrical operations. Programming language of your choice, but performance should be good enough to handle very big datasets.

Background info: Many transportation companies publish their timetable data either directly as GTFS feeds or in formats that can be converted to GTFS. As soon as you have two GTFS feeds (two sets of timetable data) that cover either the same or adjacent areas, the problem of duplicate trips arises. Consider a schedule published by Deutsche Bahn containing only trains, and a schedule published by the VAG Freiburg, containing busses, trams and the Breisgau-S-Bahn. The S-Bahn is contained in both datasets, but most probably with different IDs, different line names ("BSB" vs "Breisgau-S-Bahn" vs "Zug" vs ...), different station IDs, different station coordinates... consider an even more complicated example, where a train schedule published by DB contains a train from Amsterdam to Zürich. The train is contained in the DB-Feed from Amsterdam to Basel SBB (where it crosses the Swiss Border), but the part in Switzerland is missing. Another dataset, published by the Swiss Federal Railways, contains the same train, but starting only at Basel Bad Bf (the last German station) and ending at Zurich. A third dataset, published by the Nederlandse Spoorwegen, contains the train from Amsterdam to the first station in Germany. If you want to use all the three feeds together, several problems appear: the train is represented two times between Amsterdam and the first station in Germany, two times between Basel Bad Bf and Basel SBB and the information that you can travel through Basel SBB into Switzerland without having to change trains is completely lost.

Goal: Your input will be two or more GTFS-feeds, your output will be a single, merged GTFS feed that solves all the problems described above. Specifically, you should analyze the data, think of some equivalency measurements for trips (for example, if a train called "ICE 501" arrives in Basel Bad Bf at 15:44 in feed A and a train "ICE501" departs from Basel Bad Bf at 15:49 in feed B, it is most likely that this is the same train) and merge trips / routes that belong to the same vehicle. Another example: if two trains in A and B serve exactly the same stations at exactly the same arrival / departure times, this is also most likely the same train. You should think of some testing mechanism that makes sure that indeed every connection that was possible in feed A and feed B is still possible in the merged feed, that is no information was lost. Given some overlapping feed that appears in different qualities on both feeds, your tool should also automatically decide which (partial) representation has the better quality (for example, in feed A, no geographical information on the train route ('shape') is present, but in feed B, it is, so use the shape information from feed B). You tool should be able to handle huge datasets (for example, the entire schedule [trains, busses, trams, ferries etc.] of Germany).

Tokenization Repair (project and/or thesis)

Interesting and well-defined problem, the solution of which is relevant in a variety of information retrieval scenarios. Simple rule-based solutions come to mind easily, but machine learning is key to get very good results. A background in machine learning, or a strong willingness to aquire one as part of the project/thesis, is mandatory for this project. Supervised by Hannah Bast.

- Show / hide details

Type: Interesting and well-defined problem, the solution of which is relevant in a variety of information retrieval scenarios. Simple rule-based solutions come to mind easily, but machine learning is key to get very good results. A background in machine learning, or a strong willingness to acquire one as part of the project/thesis, is therefore mandatory for this project.

Background info: Breaking a natural-langue text into words is a basic first step for almost any system that does something sophisticated with text (we only consider word-based languages in this project/theses, and not single-symbol-based languages like Chinese). This is called tokenization. If the text is formatted perfectly, this is an (almost) trivial problem: just take the words as maximal strings of word characters. However, text is almost never formatted perfectly. The following kinds of formatting errors are frequently encountered in almost any text:

  1. Hyphenation due to a line break (e.g. Our algo-[newline]rithm runs in linear time.).

  2. Wrong spaces (e.g. Our algo rithm runs in linear time.).

  3. Missing spaces (e.g. Our algorithm runsin linear time.)

  4. Garbled characters (e.g. Our algorifm runs in linear time.)

Most text processing systems simple break for such errors. For example, a typical search engine will simply not find the word algorithm in the sentences 1, 2, and 4 above. A POS-tagger will not recognize runsin as a verb followed by a preposition, which can also affect the correctness of the other POS-tags in the sentence. More complex NLP routines, like parsing, are known to be very non-robust to such errors and may produce completely wrong results for the whole sentence.

Goal: design, implement, and evaluate a system that recognizes and fixes errors of the types 1, 2, and 3 in a given text. Check if 4 can be solved with similar techniques. If yes, also solve it. If no, characterize how it is a different problem (requiring different techniques). The first meaningful steps in this project are as follows.

Step 1: Search the literature for solutions to this problem. It is an important problem, but has received muss less research attention so far then other problems in NLP / text processing / information retrieval.

Step 2: Design simple baseline algorithms (not based on learning or using a very simple kind of learning), design a first small and simple (but not too small and simple) benchmark and evaluate your baseline algorithms on this benchmark. This will give you a feeling of how hard the problem is and where the actual problems lie. Actually your ideas may go in the wrong direction or you optimize the wrong aspects.

Step 3: Familiarize yourself with the kind of machine learning techniques that are useful in such a scenario. A neural network might be useful later on, but is most probably not the first think you should try (if it gives bad results, you will not know if its because the problem is so hard or because you used it in the wrong way, so you first need a baseline from a more classical learning algorithm).

Step 4: Come up with a learning-based algorithm and evaluate and see that you beat the baseline.

Combined SPARQL+Text search (project or thesis)

This is well-suited as a project (B.Sc. or M.Sc.) but also provides ample opportunity for continuation with a theses (B.Sc. or M.Sc). You should be fond of good user interfaces and have a good taste concerning layout and colors and such things. You should also like knowledge bases and big datasets and searching in them. Supervised by Hannah Bast.

- Show / hide details

Type: This is well-suited as a project (B.Sc. or M.Sc.) but also provides ample opportunity for continuation with a theses (B.Sc. or M.Sc). You should be fond of good user interfaces and have a good taste concerning layout and colors and such things. You should also like knowledge bases and big datasets and searching in them.

Background info: Broccoli is our current system for convenient semantic search on text combined with a knowledge base. Here is a demo. The demo is described in [1]. Broccoli works with a special backend, which only supports a particular subset of queries (in a nutshell: treelike SPARQL queries with a text-search component). The backend also has its own peculiar and non-standard syntax. The backend is described in [4]. In the meantime, we have developed a full-featured SPARQL+Text engine, called QLever [1]. QLever supports full SPARQL queries with a text-search component. However, the current UI is very simplistic and hard to use: it's a essentially a text field, where one can enter a SPARQL query, and then one gets a table with the result tuples. Here is a link to an internal demo (only works with an IP address from the technical faculty), which searches the Freebase knowledge base in combination with the ClueWeb12 web corpus. An example query for copy&paste is given under [5].

Goal: design a front-end for QLever that makes it easy to ask SPARQL+Text queries, also without prior knowledge of the entity and predicate names of the underlying knowledge base (in the demos above: Freebase). The user interface should be simpler than that for Broccoli.

Subgoal 1: Broccoli does not know synonyms. For example, when you type occupation there will be no relation match because the name of the relation in Freebase is profession. Or if you type acted in there will be no relation match because the name of the relation in Freebase is Film performance. Basic synonyms are very important for usability.

Subgoal 2: Broccoli has four boxes with suggestions, see the demo linked above. This confused users a lot. One of the main tasks will be to devise a simpler interface, but that is still powerful enough to formulate all queries.

Subgoal 3: Broccoli currently shows only a list of entities with text snippets. A result from a general SPARQL query can consist of an arbitrary number of columns (for example, a person, their profession, their gender, and some text snippets where they are mentioned together with certain words).

Subgoal 4: One problem with the current Broccoli interface is that the user has to know what is a predicate and what is an object. However, this depends a lot on how the statements are represented in the knowledge base. Which is not something the user can or is supposed to know a priori. TODO: give an example. It would be better to provide completions for a predicate combined with an object at the same time.

[1] H. Bast and B. Buchhold. QLever: a Query Engine for Efficient SPARQL+Text Search. Paper currently under review. Code and setup instructions on GitHub

[2] H. Bast, F. Bäurle, B. Buchhold, E. Haußmann. Semantic Full-Text Search with Broccoli (Demo paper). SIGIR 2014: 1265-1266

[3] H. Bast, F. Bäurle, B. Buchhold, E. Haußmann. Broccoli: Semantic Full-Text Search at your Fingertips (System and Evaluation Paper). CoRR abs/1207.2615 (2012)

[4] H. Bast and B. Buchhold. An index for efficient semantic full-text search. CIKM 2013: 369-378

[5] Example SPARQL+Text query (astronauts who walked on the moon) for copy&paste in the text field of our internal QLever demo:

PREFIX fb: <http://rdf.freebase.com/ns/>
SELECT ?person ?name TEXT(?t) SCORE(?t) WHERE {
  ?person fb:people.person.profession ?profession .
  ?profession fb:type.object.name.en "Astronaut" .
  ?person fb:type.object.name.en ?name .
  ?person <in-text> ?t .
  ?t <in-text> "walk* moon"
}
LIMIT 100
ORDER BY DESC(SCORE(?t))

Synonym Finder (project and/or theses)

Find synonyms for all entities from Freebase or WikiData or Wikipedia. Evaluate the quality and compare it to existing synonym databases like CrossWikis. Motivation: Most entities are known under several names. For example, a person like "Ellen DeGeneres" is also known as just "Ellen". A profession like "Astronaut" is also known as "Spaceman", "Spacewoman" or "Cosmsonaut". Knowing these synonyms is key for many NLP (Natural Lanuage Processing) problems. Including complex problems like semantic search and question answering. Supervised by Hannah Bast.

- Show / hide details

Background info: Most entities are known under several names. For example, a person like "Ellen DeGeneres" is also known as just "Ellen". A profession like "Astronaut" is also known as "Spaceman", "Spacewoman" or "Cosmsonaut". Knowing these synonyms is key for many NLP (Natural Lanuage Processing) problems. Including complex problems like semantic search and question answering.

Goal: Find synonyms for all entities from Freebase or WikiData or Wikipedia. Evaluate the quality and compare it to existing synonym databases like CrossWikis.

Step 1: Play around with CrossWikis and analyze the main error sources.

Step 2: Collect ideas for constructing a dataset with higher quality.

A Search Engine for OpenStreetMap Data (project and/or thesis)

Implement the backend for a search engine for OpenStreetMap data. Your server should load an arbitrary .osm file and answer (fuzzy) string searches over the entire dataset. This is a nice project to familiarize yourself with different index structures. Continuation as a thesis is possible.||Preferably you have visited our lecture Information Retrieval, but this is not required. Code should be written in C++. Supervised by Patrick Brosi.

- Show / hide details

Background info: OpenStreetMap is a collaborative mapping project and basically consists of geographically located points ("nodes"), lines ("ways") and groups of them ("relations"). All objects can be further described by means of key=value pairs, which are mainly used to give them certain attributes (for example "name"="Freiburg im Breisgau"). There are a number of search engines available for this data (see http://wiki.openstreetmap.org/wiki/Search_engines), and many of them are open source. The "best" engine at the moment is Nominatim, which is also used by the official OSM map page. It has a debug interface for testing (http://nominatim.openstreetmap.org/). Nominatim builds on a PostgreSQL database.

Goal: You should develop a search engine for OSM data that aims to be comparable in speed and quality to Nominatim. It does not matter of you don't reach this goal, but you should strive for it nevertheless.

Optional goal: Build a small, exemplary web-application that uses your backend to provide a searchable map.

Step 1: Familiarize yourself with raw OSM data. Download some smaller .osm file and open it in a text editor (preferrably vim, other editors tend to dislike text files that are multiple GB in size). Search around in the file. Try to find some city. Try to find a popular tourist attraction. Try to find a place that sells mexican food. Play around with the OSM toolkit (osmfilter, osmconvert ...) Should not take more than 1-2 days.

Step 2: Start coding. Build a parser for OSM data (keep memory effiency in mind).

Step 3: Select one or multiple attributes (for example, "name") and build an inverted index with it. Write tests for it. Think of a useful interface for your index class and put it into an abstract class.

Step 4: Build a HTTP server (use for example your code from the Information Retrieval lecture as a starting point) that takes an instance of the abstract index class and answers queries

Step 5: Realize that the results from Nominatim are still infinitely superior to yours

Step 6: Think of ways to change this :)

Step 7: Goto 5.

Tabular Information Extraction (project and/or thesis)

Extract data from a knowledge base in a tabular format. This could for example be a list of cities with columns for the country they are in, population count and coordinates but really anything fitting in a table should be possible. Think of the typically hand crafted summary tables on Wikipedia. This is well-suited as a project (B.Sc. or M.Sc.) with possible continuation as a thesis. You should be interested in knowledge bases, big datasets and searching in them. Supervised by Niklas Schnelle.

- Show / hide details

Type: Interesting and well-defined problem with broad applicability in data science, knowledge representation and knowledge base exploration

Background info: Knowledge Bases such as Freebase, WikiData and DBpedia contain giant amounts of knowledge data encompassing vast fields from book characters to geographic and political entities. Especially for data for which ontological knowledge such as Germany <is-a> Country is available it can often be useful to represent (parts) of the data in a tabular format. One particularly user relevant application for this is in automatically generated and updated tables on wikis as well as custom reduced data sets for data science.

Assume we wish to examine the distribution of cities on the globe. If we had a table of the form [City] [Country] [Population Count] [Latitude] [Longitude] this would be an easy task and as we just showed this table is trivially specified in a concise, structured format. However, trying to get this data out of a SPARQL based Freebase frontend can be quite challenging even if we are sure that as in this case the data is in there. Try writing a SPARQL query for our Qlever instance and you will see, if you don't believe us it's possible send me a mail.

This gap between the availability of a simple concise description, queryable datasets and the effort necessary to extract the result is what we aim to close.

Goal: design, implement, and evaluate a system that given a concise, structured table definition generates the resulting table using a knowledge base as its data source. While the definition is concise and structured note, that there is some fuzzyness to the category names which should not need to match the exact name of the associated knowledge base entity since these have often non-human-readable format, unexpected names and/or require a detour via mediator entities.

Step 0: Search the literature for solutions to this problem, familiarize yourself with the available knowledge base systems, SPARQL and try your hand at a medium number of manually designed queries and example tables. Start designing a simple definition for a table description format, this is not set in stone however. DOCUMENT YOUR QUERIES AND DISCOVERED PROBLEMS

Step 1: Design and implement a baseline version using for example exact entity names with a rule based approach. Design a simple but useful benchmark against which your system can be evaluated. This will give you an idea of where you stand, what kind of errors are still present. This also gives you the opportunity to evaluate if and where your approach may have gone in the wrong direction.

Step 2: Using more advanced techniques such as simple machine learning algorithms tackle the problems discovered in the previous step. Handle synonyms for the categories and possibly allow for additional data filtering. If necessary improve the performance of query generation. Design and implement a web frontend allowing easy interaction with your system both for human and machine users.

Conversational Aqqu (project and/or thesis)

In this project you will be working on an extension to our Question Answering system Aqqu. This extension will enable follow-up questions and thus a more natural interface. Supervised by Niklas Schnelle.

- Show / hide details

Type

Project and/or thesis, preferably with the thesis building on a project. You will be working with a relatively large existing codebase making for a very real-world like project.

Background

Our Question Answering (QA) system Aqqu available online as well as via Telegram currently treats every question as self-contained and isolated.

Especially on a conversation based interface like Telegram this can feel quite unnatural, for example instead of being able to have the following conversation:

User: Where was Albert Einstein born?

AqquBot: Albert Einstein, place of birth: Ulm

User: And where did he die?

AqquBot: Albert Einstein, place of death: Princeton

The second question has to repeat the name to work asking for example And where did Albert Einstein die?

Goal

The goal of the project will be to design and implement the necessary changes to Aqqu enabling such conversational sessions. To keep the backend question answering system stateless this should be implemented using some kind of conversation context tracking mechanism on the client (Web UI or service bridge (e.g. Telegram)).

Subgoals

  • Research related work on conversational interfaces
  • Become familiar with the existing Aqqu system and code base
  • Create a dataset with normal and follow-up questions for evaluation, ideally based on an existing dataset for QA (e.g. WebQSP, SimpleQuestions, Free917)

  • Design and implement a simple conversation context extension for the Aqqu API e.g. tracking previous entities and registering them with pronouns like "sh e", "he", "it"
  • Design and implement a UI using the aforementioned API extension with a chat like interface

NOTE: For the purpose of this project/thesis you may regard it as out-of-scope to make the answers use natural language and may stick to answer lists

A Simple chat bot

Build a simple chat bot using deep learning. For a recent example of such a chatbot, see Woebot. This topic gives you a lot of freedom concerning the outcome, but it is also free in the sense that you have to research yourself what is out there already and what can realistically be achieved in six months yourself. Supervised by Hannah Bast.

- Show / hide details

Type: Bachelor or Master thesis, also possible in combination with a project. This topic gives you a lot of freedom concerning the outcome, but it is also free in the sense that you have to research yourself what is out there already and what can realistically be achieved in six months yourself.

Background info: An example of a recent successful chatbot is Woebot.

Goal: Design and build a chatbot using NLP and Deep Learning. The behavior should be non-trivial.

Error Correction for Question Answering

Design and build a system that accepts a (relatively simple) question in natural language and automatically corrects typos etc. This should be realized with a character-based language model learned using deep learning (e.g., with an RNN). Supervised by Hannah Bast.

- Show / hide details

Type: Project or Thesis, preferably both in succession. Nice topic if you like Deep Learning and Natural Language Processing. The topic is also relatively "stand-alone" in that it does not rely on other complex systems but you can design and build everything yourself from the ground up.

Goal: Design and build a system that accepts a (relatively simple) question in natural language and automatically corrects typos etc. For example, given the question "wo inveted pyton", output "who invented python". This should be realized with a character-based language model learned using deep learning (e.g., with an RNN). One question is whether solving this problem for questions is easier or harder than solving it for arbitrary sentences.

[1] Aqqu Materials + Paper + Link zur Demo

[2] WebQuestions Benchmark WQSP (verbesserte Version)

[3] Free 917 Benchmark

[4] Simple Questions

[5] Problematic questions (gives a good feeling, why QA is so hard)

Crawling and Analysis of Scientist Homepages

Design and build a system that crawls as many university pages and finds as many person mentions as possible and creates a neat knowledge base from this information (Name of Scientist, Affiliation, Gender, Title). In previous work, we tried extracting this information from the CommonCrawl archive, but that turned out to be much too incomplete and unreliable. This is a challenging problem with large practical value. Supervised by Niklas Schnelle.

- Show / hide details

If you try to find a homepage of one of our staff by clicking through from the University of Freiburg main page, you will find that this is close to impossible. Thus anyone with finite patience will resort to googling, ecosiaing or binging for the name.

Now, clearly we shouldn't have to rely on giant, general purpose, search engines just to get contact information for a supervisor, some paper's author or even their secretary. So in this project you will built a domain specific search index for all the people involved with Universities be it scientists, IT support or secretaries. The resulting index should be created in a format that can be used with QLever, a SPARQL+Text search engine developed at our chair.

Like all proper search engines you will built a crawler, a software system that reads web pages and extracts search terms to be indexed. In this case the web pages will be restricted to Universities (and possibly similar institutions) and the search terms are restricted to names. To achieve this you will index (the URLs of) all the mentions of people's names your crawler can find on each university web page.

Possible Implementation Steps

Document what you have done and why you did things this way. Use this to backtrack where you went wrong/too complicated when you get stuck. Also collect good examples for relevant AND irrelevant mentions (pages) of persons, which you can later use to evaluate your system (all throughout the thesis/project)

  1. Collect a list of University web page URLs
    • You can make use of an existing Knowledge Base such as WikiData which allows listing Universities and their web pages using a relatively simple SPARQL query

    • You can use our local instances of the above Knowledge Bases

  2. Built a simple web crawler for extracting visible text from university web pages
    • You can either use a higher level crawling framework like Scrapy, just an HTML parser like Beautiful Soup (with lower level HTTP libraries such as requests) or completely roll your own (which might be really difficult, time consuming and frustrating, you have been warned).

    • You probably want to restrict crawling to the (sub-)domain(s) of the university and to a configurable depth by using limited breadth first search.
    • You will pronbably want to restrict your crawling to static HTML pages
  3. Evaluate the coverage of the crawler with respect to scientist homepages
  4. For each page of user visible text from the above step you will identify all mentions of names in the text
    • In the first version you can make use of very simple rules like all capitalized words after "Dr\." or "Prof\. [Dr\.]?" and using a list of name parts (first, middle, last names)
    • In later refinements you can make use of natural language techniques like POS tagging, parsing or machine learning
  5. For each mention your crawler finds you will save the name (possibly normalized) as well as the URL of the current page in a format that can be turned into a searchable index by one of the tools used at our chair (see above)
    • To improve ranking you may also want to save information such as if the name is found in a heading
    • You may optionally also store the context where the name was found. If you do, it too should be in a format QLever can make use of
  6. Develop a scoring system that should assign a score to each mention. Ideally when sorting according to this score pages that are most relevant to a name such as that person's homepage should come first.
    • This could include things like the name being mentioned in a heading, BM25 like relevancy scores, the name also appearing in the URL and more
  7. Evaluate your index using the search system and ranking evaluation techniques typical for Information Retrieval

Bitcoin trading app

Design and implement a reasonably clever (not to reckless and not to conservative) algorithm for trading bitcoins. Evaluate on historical data and implement an API and a web app to monitor on live data. Take into account all the actual costs incurred (like taxes and fees). Analyse the return of investment that can be reasonably expected. Supervised by Hannah Bast.

- Show / hide details

Entity Recognition on a Web Corpus

Design and implement a named-entity recognizer for a web-size corpus. Supervised by Hannah Bast.

- Show / hide details

Goal: Design and implement a simple but effective named-entity recognizer for a web-size corpus, namely ClueWeb12 [1]. The following features should be supported:

1. Recognize literal mentions (trivial). For example, recognize "Angela Merkel" as https://en.wikipedia.org/wiki/Angela_Merkel .
2. Recognize partial mentions of entities, which have been mentioned literally before. For example, recognize "Merkel" as https://en.wikipedia.org/wiki/Angela_Merkel, after she has been mentioned with her full name before.
3. Recognize mentions of entities via pronouns (he, she, it, ...), which have been mentioned literally before. For example, recognize "she" as https://en.wikipedia.org/wiki/Angela_Merkel if she has been mentioned before. Take the gender into account. That is, "she" should be identified as the last mention of a female entity.
4. Recognize mentions of the form "the <TYPE>", after a mention with the full entity name. For example, "the film" should be recognized as https://en.wikipedia.org/wiki/The_Matrix, if that film has been mentioned with its full name before.

[1] http://lemurproject.org/clueweb12/ We have purchased this dataset and it's available on our file system.

Context Decomposition of a Web Corpus

Decompose the sentences of a given web-size corpus into their semantic components, using fancy technology developed in our group. Supervised by Hannah Bast.

- Show / hide details

Goal: Decompose the sentences of a given web-size corpus into their semantic components. Requirements:

1. It should work on ClueWeb12 [1]

2. The decomposition should be based on CSD-IE, developed in our group [2].

3. Our current implmentation is based on a rather slow parser. This should be switched to the much faster spaCy parser [3].

4. The output format should be compatible with !QLever (easy), our own SPARQL+Text search engine [4].

[1] http://lemurproject.org/clueweb12/ We have purchased this dataset and it's available on our file system.

[2] CSD-IE Paper: http://filicudi.informatik.uni-freiburg.de:6543/publications , CSD-IE Demo: http://filicudi.informatik.uni-freiburg.de:6543 (university-internal ID)

[3] https://spacy.io/

[4] https://github.com/ad-freiburg/QLever#51-input-data (README, Section 5.1 Input Data)

Extracting Words from Text Documents with Complex Layouts (bachelor thesis)

Design and implement a (learning-based) system for extracting words from layout-based text documents (e.g., PDF documents), which is a surprisingly difficult (but not super-hard) task. The reason is that the text is typically only provided character-wise (and not word-wise) so that word boundaries must be derived from e.g., analyzing the spacings between the characters. Another challenge is that the layout of a text document can be arbitrarily complex, with text arranged in multiple columns and different alignments so that special care is required to not mix up text from different columns. Supervised by Claudius Korzen.

- Show / hide details

Type: An interesting and practical bachelor thesis. A basic understanding of Machine Learning is required; knowledge of Deep Learning is desirable. The preferred programming language is Python.

Background info: Many text documents are layout-based and contain text that can consist of different fonts and can be arranged in multiple columns. Think of books, newspapers or scientific research papers. There are so called Manhattan layouts and non-Manhattan layouts. In Manhattan layouts, all columns are rectangular and aligned parallel to the page borders. In non-Manhattan layouts, the columns are of no fixed shape and can be aligned arbitrarily. Here are two examples:

manhattan-layout.jpg and non-manhattan-layout.jpg

The left document shows a Manhattan-layout, the right document a non-Manhattan-layout. In the non-Manhattan layout, note in particular the two text blocks in the middle of the page. The decision that both text blocks are not part of the other two columns (but are two distinct text blocks) might be easy for a human reader, but is quite hard for computers.

Goal: Extract words column-wise from both, Manhattan layouts and non-Manhattan layouts, using machine learning techniques; without mixing words from different columns.

Challenge 1: In non-Manhattan layouts, the columns can be of arbitrary form and can be placed at any position, see above.

Challenge 2: Layout-based documents (in particular: PDF documents) often provide the text only character-wise (but not word-wise). There are further typically no whitespaces. Thus, the boundaries of words must be derived from e.g., analyzing the spacings between the characters. But the spacings can vary from line to line and even within a line there is no fixed rule to determine the extent of a word from the spacings alone. Here is an example:

word-boundaries.jpg

The blue boxes are the bounding boxes of the characters. Note the small spacings between most of the characters and in particular, the direct adjacency of the two f's (between which there is actually a word boundary). Based on the spacings alone, the characters might be extracted as the following six words: "s e r iffo n t". However, they should be extracted as the two words "serif font".

Challenge 3: Internally, layout-based documents can store the characters in an order interleaving between the columns. If the characters are processed in that order, extracting the words correctly is obviously impossible.

Subgoal 1: Search the literature for work related to this problem.

Subgoal 2: Design a (simple) baseline algorithm (e.g., using a rule-based approach, but not machine learning). Design a small benchmark and evaluate your baseline algorithms on this benchmark. The motivation behind this step is to give you a feeling of how hard the problem is and where the actual problems lie.

Subgoal 3: Design, implement and train a (supervised) machine learning model for a fast ( ! ) identification of columns. This can be based on a neural network or any other well performing maching learning technique. This includes to create large and appropriate training data needed to train the model.

Subgoal 4: Similarily, design, implement and train a language model for identifying the words contained in each column, e.g., based on the characters themselves and further features like the spacings between characters or their fonts.

Subgoal 5: A thorough evaluation of your learning-based approach, including a comparison to your baseline approach.

Supervision by Claudius Korzen

Extracting Special Characters from Layout-Based Text Documents (bachelor thesis)

Design and implement a (learning-based) system for extracting ligatures (like fi or ffi) and characters with diacritics (like á and è) from layout-based text documents (e.g., PDF documents). The challenge here is that such characters can be drawn into the text, in which case they need to be recognized by analyzing their shapes. Supervised by Claudius Korzen.

- Show / hide details

Type: An interesting and practical bachelor thesis. A basic understanding of Machine Learning is required; knowledge in Deep Learning is desirable. The preferred programming language is Python.

Background info: Text documents (e.g., books, newspapers or scientific publications) can contain ligatures (like fi or ffi) which are one character in the document, but actually represent multiple characters. Further, text documents can also contain characters with diacritics (like à or é) which are often two characters in the document (in particular often in PDF documents) but actually represent a single character. Translating those characters correctly is crucial for applications like search, because words that has not been identified correctly will simply not be found. Here are two examples:

ligature.jpg and diacritic.jpg

In the left figure, note the single rectangle around the ligature ffi, meaning that it is a single character. If not translated correctly, the word might be extracted as ecient (without the ligature) or e?cient (with a placeholder "?" instead of the ligature). In the right figure, note the two rectangles of the characters é, û and è, meaning that each of these characters is represented by two characters (the base character and the diacritic). If not translated correctly, the words might be extracted as cr´eme br^ul`ee.

Goal: Translate ligatures and characters with diacritics into the characters they actually represent using machine learning techniques.

Challenge 1: Ligatures can be represented by single characters which need to be split; and characters with diacritics can be represented by two characters which need to be merged, see above.

Challenge 2: Ligatures and characters with diacritics can be drawn into the document (in particular often in case of PDF documents), in which case they are more of a graphic nature than a textual nature. This requires to translate the characters based on e.g., analyzing their shapes.

Subgoal 1: Search the literature for work related to this problem.

Subgoal 2: Design a (simple) baseline algorithm (e.g., using a rule-based approach, but not machine learning). Design a small benchmark and evaluate your baseline algorithms on this benchmark. The motivation behind this step is to give you a feeling of how hard the problem is and where the actual problems lie.

Subgoal 3: Design, implement and train a convolutional (?) neural network for a fast ( ! ) translation of ligatures and characters with diacritics, based on e.g., their shapes and their neighboring characters. This includes to create large and appropriate training data needed to train the network.

Subgoal 4: A thorough evaluation of your learning-based approach, including a comparison to your baseline approach.

Supervision by Claudius Korzen

GTFS Browser Web App (Bachelor project or thesis)

Develop a web-application that can be used to analyze huge GTFS datasets. There are already some tools available (for example, ScheduleViewer) but they all feel and look quite clumsy, are incredible slow and cannot handle large datasets. Supervised by Patrick Brosi.

- Show / hide details

Type: Bachelor project or Bachelor thesis. You should have a basic understanding of geometrical operations and spatial access indices. Programming language of your choice (C++, Java, Go, ....), Python discouraged because of performance issues.

Background info: There are already some tools available for analyzing GTFS feeds (for example, ScheduleViewer) but they all feel and look quite clumsy, are incredible slow and cannot handle bigger datasets. We are looking for a reliable tool to get a first, quick look on some new GTFS feed and to analyze them in detail - maybe even edit them (see optional goal).

Goal: You develop a web application (a HTML/JS frontend and a server backend) that displays a map on which the trips, the stations (and possibly the fare zones) contained in the feed are rendered. If you click on them, some detailed information should appear (exact timetable, running days etc). It would also be nice if problems in the feed are highlighted in some way. You application should offer the possibility to upload zipped GTFS feeds or to fetch a GTFS feed from some URL entered in the GUI. Analyzing multiple feeds should be possible. You application should also look "nice" and you should take care that it offers great usability.

Optional Goal: Some editing methods.

Automated Generation of Rail Noise Maps

Given rail schedule data which has been fully map-matched (for which for each vehicle we know the exact path it takes through a geographical network), it is possible to automatically generate rail noise maps from this data. In this project or bachelor thesis, you are provided with map-matched schedule data from which you should generate such noise maps. Your maps should respect high buildings or large forests which may act as a noise shield. The noise intensity should be estimated by the expected vehicle speed. Your maps should offer the ability to show the noise for adjustable time spans. An interest in Geo data and schedule data is beneficial. Supervised by Patrick Brosi.

- Show / hide details

River Maps

The goal of this project is to use our tool LOOM to render maps of rivers from OSM data. Each river segment should consist of all rivers that contributed to this river so far (for example, beginning at Mannheim, the Neckar should be part of the segment that makes up the Rhine). Think of a single river as a single subway line starting at the source of that river, and the Rhine, for example, as dozens of small subway lines next to each other. Supervised by Patrick Brosi.

- Show / hide details

As a combination of bachelor project / bachelor thesis.

A river segment (for example, the Rhine between Breisach and Strasbourg) can be thought of as consisting of many small streams and other rivers (tributaries) that contributed to this river. A nice way to visualize this would be a metro map, where each river is a single metro line and multiple (possibly hundreds) of small parallel lines make up a segment of a larger river.

Goal: Generate a graph of the rivers in OSM and try to deduce which river segment is made up of which smaller streams. Render this graph in a pleasing way, or use our tool LOOM to do so.

Requirements:

  1. Generate a geographical graph, as GeoJSON, in which nodes are "river intersections" (where two or more rivers join) and edges are river segments between those intersections. Each edge should have a list of rivers that contributed to it as an attribute.
  2. You should write a tool that can be used from the command line.
  3. The output format should be the GeoJSON format used by LOOM.

Optional:

  1. Write your own code to render this graph.
  2. Write a web application to display this graph, where you can click on each river and display some useful information.

Schedule Map Matching with Graphhopper

The Graphhopper routing engine has been extended with a map-matching functionality which is based on this Microsoft paper in recent years. The goal of this project or thesis is to use Graphhopper to implement a map-matching tool which produces the correct geographical routes (shapes) for schedule data given in the GTFS format. The tool should be evaluated (speed and quality) against our own schedule map matching tool pfaedle. Supervised by Patrick Brosi.

- Show / hide details

AD Teaching Wiki: ListOfProjectAndThesisTopics (last edited 2024-04-16 09:33:22 by Natalie Prange)