
This page gives an overview of available, ongoing and completed Bachelor's and Master's Projects and Theses at the Chair for Algorithms and Data Structures.

Available projects and theses

Description

Your interests / skills

Supervisor

Extracting schedule data from PDF timetables: Many public transit agencies already publish their schedule data as GTFS feeds or in some proprietary format. The goal of this project or thesis is to extract machine-readable schedule data from PDF timetables published for print, like this one. We expect that good results can already be achieved by using an off-the-shelf tool for extracting tabular data from PDFs (e.g., pdftotext). Full station names and coordinates can be extracted from OSM data. (A sketch of the extraction step follows below this entry.)

Geo data, schedule data, Hidden Markov Models

Patrick Brosi
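
To illustrate the first step, here is a minimal Python sketch (assuming pdftotext is installed; the file name and the two-space column heuristic are only illustrative):

import re
import subprocess

def pdf_to_rows(pdf_path):
    # Run pdftotext in layout mode so that the column structure of the
    # printed timetable is preserved in the text output ("-" = stdout).
    text = subprocess.run(["pdftotext", "-layout", pdf_path, "-"],
                          capture_output=True, text=True, check=True).stdout
    # In layout mode, runs of two or more spaces usually separate columns.
    rows = []
    for line in text.splitlines():
        cells = re.split(r"\s{2,}", line.strip())
        if len(cells) > 1:
            rows.append(cells)
    return rows

# Hypothetical usage:
# for row in pdf_to_rows("timetable.pdf"):
#     print(row)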

Map-Matching Mobile Phones to Public Transit Vehicles: Consider the following scenario: you are sitting in a public transit vehicle and you are interested in the available connections at the next stop, or the current delay of your vehicle, or you have changed your desired destination and want to calculate a new route, using the public transit vehicle you are currently in as a starting point. To do all this, your device has to know which vehicle it is currently travelling in. This is a map-matching problem, but the "map" (the positions of all vehicles in the network) is not static but highly dynamic. The goal of this project and thesis is to develop an app plus a dynamic map-matching backend which "snaps" the device to the most likely public transit vehicle (see the sketch below this entry). Once such a link is established, the device can be used to improve the real-time information for the vehicle.

Geo data, schedule data.

Patrick Brosi
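
A minimal sketch of the matching idea, with made-up coordinates and vehicle IDs; real positions would come from a live vehicle-position feed, and a real implementation would use proper geographic distances:

import math

def dist(a, b):
    # Euclidean distance; good enough for this illustration.
    return math.hypot(a[0] - b[0], a[1] - b[1])

def most_likely_vehicle(phone_trace, vehicle_traces, sigma=30.0):
    # Score each vehicle by a Gaussian emission log-probability over the
    # distances between phone and vehicle at matching timestamps.
    best, best_score = None, float("-inf")
    for vid, trace in vehicle_traces.items():
        score = sum(-(dist(p, v) ** 2) / (2 * sigma ** 2)
                    for p, v in zip(phone_trace, trace))
        if score > best_score:
            best, best_score = vid, score
    return best

# Hypothetical example with two vehicles:
phone = [(0, 0), (10, 0), (20, 0)]
vehicles = {"tram_1": [(1, 1), (11, 0), (19, 1)],
            "bus_7": [(0, 50), (10, 52), (20, 55)]}
print(most_likely_vehicle(phone, vehicles))  # -> tram_1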

Automated Generation of Rail Noise Maps: Given rail schedule data which has been fully map-matched (i.e., for each vehicle we know the exact path it takes through a geographical network), it is possible to automatically generate rail noise maps from this data. In this project or bachelor thesis, you are provided with map-matched schedule data from which you should generate such noise maps. Your maps should respect high buildings or large forests, which may act as noise shields. The noise intensity should be estimated from the expected vehicle speed (see the sketch below this entry). Your maps should offer the ability to show the noise for adjustable time spans.

Geo data, schedule data.

Patrick Brosi
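
One possible (uncalibrated) noise model could look like the following sketch; the emission and attenuation formulas are placeholders, not values from the acoustics literature:

import math

def noise_at(point, samples):
    # samples: list of (position, expected_speed_kmh) along the
    # map-matched vehicle paths for the selected time span.
    energy = 0.0
    for pos, speed in samples:
        d = max(1.0, math.hypot(point[0] - pos[0], point[1] - pos[1]))
        # Placeholder model: louder at higher speed, free-field decay
        # with distance; shielding by buildings/forests would go here.
        level_db = 10 * math.log10(1 + speed) - 20 * math.log10(d)
        energy += 10 ** (level_db / 10)  # sum sound energies, not dB values
    return 10 * math.log10(energy) if energy > 0 else 0.0

# Hypothetical example: one fast and one slow train sample.
print(noise_at((0, 0), [((100, 0), 120), ((30, 0), 40)]))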

Ongoing projects and theses

Description

Your interests / skills

Supervisor

Question Answering on WikiData: Make our question answering work on WikiData. WikiData is currently growing fast and will become the new Freebase. It's an exciting dataset.

Hannah Bast

River Maps: The goal of this project is to use our tool LOOM to render maps of rivers from OSM data. Each river segment should consist of all rivers that contributed to this river so far (for example, beginning at Mannheim, the Neckar should be part of the segment that makes up the Rhine). Think of a single river as a single subway line starting at the source of that river, and the Rhine, for example, as dozens of small subway lines next to each other.

Patrick Brosi

Schedule Map Matching with Graphhopper: The Graphhopper routing engine has in recent years been extended with a map-matching functionality based on this Microsoft paper. The goal of this project or thesis is to use Graphhopper to implement a map-matching tool which produces the correct geographical routes (shapes) for schedule data given in the GTFS format (see the sketch below this entry). The tool should be evaluated (speed and quality) against our own schedule map-matching tool pfaedle.

Patrick Brosi
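
For orientation, the paper's approach (Gaussian emission probabilities, exponential transition probabilities) can be sketched as follows; the parameter values are rough defaults, and candidate positions and route distances would come from Graphhopper:

import math

def emission_log_prob(gps_to_candidate_m, sigma_m=4.07):
    # How well a candidate road position explains a GPS/stop location.
    return (-0.5 * (gps_to_candidate_m / sigma_m) ** 2
            - math.log(sigma_m * math.sqrt(2 * math.pi)))

def transition_log_prob(route_dist_m, straight_dist_m, beta_m=3.0):
    # Consecutive fixes should not require detours: penalize the gap
    # between on-road route distance and straight-line distance.
    return -abs(route_dist_m - straight_dist_m) / beta_m - math.log(beta_m)

# A Viterbi pass over these scores yields the most likely road path,
# from which the GTFS shape can then be generated.
print(emission_log_prob(5.0), transition_log_prob(210.0, 200.0))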

Completed projects and theses

For a detailed description of completed projects see our AD Blog. For details about completed theses see our Website.

Description

Your interests / skills

Supervisor

Merging Overlapping GTFS Feeds (Bachelor project or thesis): Many transportation companies publish their timetable data either directly as GTFS feeds or in formats that can be converted to GTFS. As soon as you have two GTFS feeds (two sets of timetable data) that cover the same or adjacent areas, the problem of duplicate trips arises. You should develop a tool that merges two or more GTFS feeds and solves duplication / fragmentation issues.

Patrick Brosi

Extract and Analyze Scientists' Homepages (project and/or thesis): Extract a large number of scientists' homepages from the CommonCrawl web crawl. Extract the central information from these pages, including: name, profession, gender, affiliation. It will be relatively straightforward to get results of medium quality. The challenge is to achieve results of high quality. Machine learning will be crucial to achieve that. Exploring suitable methods is part of the challenge.

Hannah Bast

Tokenization Repair (project and/or thesis): Interesting and well-defined problem, the solution of which is relevant in a variety of information retrieval scenarios. Simple rule-based solutions come to mind easily, but machine learning is key to get very good results.

A background in machine learning, or a strong willingness to acquire one as part of the project/thesis, is mandatory for this project.

Hannah Bast

Combined SPARQL+Text search (project or thesis): This is well-suited as a project (B.Sc. or M.Sc.) but also provides ample opportunity for continuation with a thesis (B.Sc. or M.Sc.).

You should be fond of good user interfaces and have good taste concerning layout, colors and such things. You should also like knowledge bases and big datasets and searching in them.

Hannah Bast

Synonym Finder (project and/or thesis): Find synonyms for all entities from Freebase, WikiData or Wikipedia (see the sketch below this entry). Evaluate the quality and compare it to existing synonym databases like CrossWikis. Motivation: Most entities are known under several names. For example, a person like "Ellen DeGeneres" is also known as just "Ellen". A profession like "Astronaut" is also known as "Spaceman", "Spacewoman" or "Cosmonaut". Knowing these synonyms is key for many NLP (Natural Language Processing) problems, including complex problems like semantic search and question answering.

Hannah Bast
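
One simple source of synonym candidates, sketched below under the assumption that a Wikipedia XML dump is available locally: redirect titles are often synonyms of their target page.

import re

REDIRECT = re.compile(r"#REDIRECT\s*\[\[([^\]|#]+)", re.IGNORECASE)

def redirect_synonyms(dump_lines):
    # Yields (synonym, canonical_title) pairs from a Wikipedia dump:
    # each redirect page title is a candidate synonym for its target.
    title = None
    for line in dump_lines:
        m = re.search(r"<title>(.*?)</title>", line)
        if m:
            title = m.group(1)
        m = REDIRECT.search(line)
        if m and title:
            yield title, m.group(1).strip()

# Hypothetical example lines from a dump:
lines = ["<title>Spaceman</title>", "#REDIRECT [[Astronaut]]"]
print(list(redirect_synonyms(lines)))  # [('Spaceman', 'Astronaut')]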

A Search Engine for OpenStreetMap Data (project and/or thesis): Implement the backend for a search engine for OpenStreetMap data. Your server should load an arbitrary .osm file and answer (fuzzy) string searches over the entire dataset (see the sketch below this entry). This is a nice project to familiarize yourself with different index structures. Continuation as a thesis is possible.

Preferably you have attended our lecture Information Retrieval, but this is not required. Code should be written in C++.

Patrick Brosi
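
The project expects C++, but the core idea of, e.g., a q-gram index for fuzzy search can be sketched in a few lines of Python (names and the overlap threshold are illustrative):

from collections import defaultdict

def qgrams(s, q=3):
    # Pad so that prefixes and suffixes also produce q-grams.
    s = "$" * (q - 1) + s.lower() + "$" * (q - 1)
    return {s[i:i + q] for i in range(len(s) - q + 1)}

class FuzzyIndex:
    def __init__(self):
        self.index = defaultdict(set)  # q-gram -> ids of names containing it
        self.names = []

    def add(self, name):
        nid = len(self.names)
        self.names.append(name)
        for g in qgrams(name):
            self.index[g].add(nid)

    def search(self, query, min_overlap=0.5):
        # Count shared q-grams and keep names with enough overlap.
        counts = defaultdict(int)
        qs = qgrams(query)
        for g in qs:
            for nid in self.index[g]:
                counts[nid] += 1
        return [self.names[nid] for nid, c in counts.items()
                if c / len(qs) >= min_overlap]

idx = FuzzyIndex()
idx.add("Freiburg im Breisgau")
idx.add("Frankfurt am Main")
print(idx.search("Freiburg"))  # -> ['Freiburg im Breisgau']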

Tabular Information Extraction (project and/or thesis): Extract data from a knowledge base in a tabular format. This could, for example, be a list of cities with columns for the country they are in, population count and coordinates, but really anything fitting in a table should be possible. Think of the typically hand-crafted summary tables on Wikipedia. This is well-suited as a project (B.Sc. or M.Sc.) with possible continuation as a thesis.

You should be interested in knowledge bases, big datasets and searching in them.

Niklas Schnelle

Conversational Aqqu (project and/or thesis): In this project you will be working on an extension to our Question Answering system Aqqu. This extension will enable follow-up questions and thus a more natural interface.

Niklas Schnelle

A Simple Chat Bot: Build a simple chat bot using deep learning. For a recent example of such a chatbot, see Woebot. This topic gives you a lot of freedom concerning the outcome, but it is also free in the sense that you have to research for yourself what is already out there and what can realistically be achieved in six months.

Hannah Bast

Error Correction for Question Answering: Design and build a system that accepts a (relatively simple) question in natural language and automatically corrects typos etc. This should be realized with a character-based language model learned using deep learning, e.g., with an RNN (see the sketch below this entry).

Hannah Bast
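
A minimal model skeleton, sketched here in PyTorch as one possible framework; training loop, data and the actual decoding of corrections are omitted:

import torch
import torch.nn as nn

class CharLM(nn.Module):
    # Predicts the next character from the previous ones; such a model
    # can score candidate corrections of a misspelled question.
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):              # x: (batch, seq_len) of char ids
        h, _ = self.rnn(self.embed(x))
        return self.out(h)             # (batch, seq_len, vocab_size) logits

# Hypothetical usage with random character ids:
model = CharLM(vocab_size=128)
ids = torch.randint(0, 128, (1, 20))
print(model(ids).shape)  # torch.Size([1, 20, 128])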

Crawling and Analysis of Scientist Homepages: Design and build a system that crawls as many university pages as possible, finds all person mentions, and creates a neat knowledge base from this information (name of scientist, affiliation, gender, title). In previous work, we tried extracting this information from the CommonCrawl archive, but that turned out to be much too incomplete and unreliable. This is a challenging problem with large practical value.

Niklas Schnelle

Bitcoin trading app: Design and implement a reasonably clever (neither too reckless nor too conservative) algorithm for trading bitcoins. Evaluate it on historical data and implement an API and a web app to monitor it on live data. Take into account all the actual costs incurred (like taxes and fees). Analyze the return on investment that can reasonably be expected.

Hannah Bast

Entity Recognition on a Web Corpus: Design and implement a named-entity recognizer for a web-size corpus.

Hannah Bast

Context Decomposition of a Web Corpus: Decompose the sentences of a given web-size corpus into their semantic components, using fancy technology developed in our group.

Hannah Bast

Mail Search: The goal of this project is fast and efficient search in very large mail archives. The subtasks are: (1) Write an efficient parser that reads one or more files in MBOX format and produces a CSV file with one line per mail and columns for the various structured and unstructured parts of an email (from, to, subject, date, body, ...); (2) take proper care of encoding issues, which are a major issue when dealing with a large number of emails; (3) set up an instance of CompleteSearch for the data from the CSV file; (4) provide a simple and effective search interface using the instance from (3) as a backend. (A sketch of subtask (1) follows below this entry.)

Hannah Bast
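
A minimal sketch of subtask (1) using Python's standard mailbox and csv modules (file names are placeholders; multipart messages and the many encodings in the wild are exactly where the real work starts):

import csv
import mailbox

def mbox_to_csv(mbox_path, csv_path):
    # One line per mail, with columns for the main header fields
    # and the plain-text body.
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["from", "to", "subject", "date", "body"])
        for msg in mailbox.mbox(mbox_path):
            # Multipart messages are skipped here for brevity; handling
            # them properly is part of the task.
            body = msg.get_payload(decode=True) if not msg.is_multipart() else b""
            writer.writerow([
                msg.get("From", ""), msg.get("To", ""),
                msg.get("Subject", ""), msg.get("Date", ""),
                (body or b"").decode("utf-8", errors="replace"),
            ])

# mbox_to_csv("archive.mbox", "mails.csv")  # hypothetical paths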

Extracting Words from Text Documents with Complex Layouts (bachelor thesis): Design and implement a (learning-based) system for extracting words from layout-based text documents (e.g., PDF documents), which is a surprisingly difficult (but not super-hard) task. The reason is that the text is typically only provided character-wise (and not word-wise), so that word boundaries must be derived from, e.g., analyzing the spacing between the characters (see the sketch below this entry). Another challenge is that the layout of a text document can be arbitrarily complex, with text arranged in multiple columns and different alignments, so that special care is required to not mix up text from different columns.

Claudius Korzen
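
A minimal sketch of the spacing heuristic, with a made-up character representation (glyph, x-position, width); real input would come from a PDF parsing library:

def group_into_words(chars, gap_factor=0.3):
    # chars: list of (glyph, x, width) on one text line, sorted by x.
    # Start a new word whenever the gap to the previous character
    # exceeds a fraction of the average character width.
    if not chars:
        return []
    avg_w = sum(w for _, _, w in chars) / len(chars)
    words, current = [], chars[0][0]
    for (g_prev, x_prev, w_prev), (g, x, w) in zip(chars, chars[1:]):
        gap = x - (x_prev + w_prev)
        if gap > gap_factor * avg_w:
            words.append(current)
            current = g
        else:
            current += g
    words.append(current)
    return words

# Hypothetical character boxes for "Our al":
chars = [("O", 0, 5), ("u", 5, 5), ("r", 10, 5), ("a", 20, 5), ("l", 25, 5)]
print(group_into_words(chars))  # ['Our', 'al']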

Extracting Special Characters from Layout-Based Text Documents (bachelor thesis): Design and implement a (learning-based) system for extracting ligatures (like fi or ffi) and characters with diacritics (like á and è) from layout-based text documents (e.g., PDF documents). The challenge here is that such characters can be drawn into the text, in which case they need to be recognized by analyzing their shapes.

Claudius Korzen

GTFS Browser Web App (Bachelor project or thesis): Develop a web application that can be used to analyze huge GTFS datasets. There are already some tools available (for example, ScheduleViewer), but they all feel and look quite clumsy, are incredibly slow and cannot handle large datasets.

Patrick Brosi

Detailed topic descriptions

Description

Your interests / skills

Supervisor

Merging Overlapping GTFS Feeds (Bachelor project or thesis): Many transportation companies publish their timetable data either directly as GTFS feeds or in formats that can be converted to GTFS. As soon as you have two GTFS feeds (two sets of timetable data) that cover the same or adjacent areas, the problem of duplicate trips arises. You should develop a tool that merges two or more GTFS feeds and solves duplication / fragmentation issues.


Type: Bachelor project or Bachelor thesis. You should be fond of train schedules and have a basic understanding of GIS and geometrical operations. Programming language of your choice, but performance should be good enough to handle very big datasets.

Background info: Many transportation companies publish their timetable data either directly as GTFS feeds or in formats that can be converted to GTFS. As soon as you have two GTFS feeds (two sets of timetable data) that cover the same or adjacent areas, the problem of duplicate trips arises. Consider a schedule published by Deutsche Bahn containing only trains, and a schedule published by VAG Freiburg containing buses, trams and the Breisgau-S-Bahn. The S-Bahn is contained in both datasets, but most probably with different IDs, different line names ("BSB" vs. "Breisgau-S-Bahn" vs. "Zug" vs. ...), different station IDs, different station coordinates, and so on. Consider an even more complicated example, where a train schedule published by DB contains a train from Amsterdam to Zürich. The train is contained in the DB feed from Amsterdam to Basel SBB (where it crosses the Swiss border), but the part in Switzerland is missing. Another dataset, published by the Swiss Federal Railways, contains the same train, but starting only at Basel Bad Bf (the last German station) and ending at Zürich. A third dataset, published by the Nederlandse Spoorwegen, contains the train from Amsterdam to the first station in Germany. If you want to use all three feeds together, several problems appear: the train is represented twice between Amsterdam and the first station in Germany, twice between Basel Bad Bf and Basel SBB, and the information that you can travel through Basel SBB into Switzerland without having to change trains is completely lost.

Goal: Your input will be two or more GTFS feeds; your output will be a single, merged GTFS feed that solves all the problems described above. Specifically, you should analyze the data, think of some equivalence measures for trips (for example, if a train called "ICE 501" arrives at Basel Bad Bf at 15:44 in feed A and a train "ICE501" departs from Basel Bad Bf at 15:49 in feed B, it is most likely the same train) and merge trips / routes that belong to the same vehicle. Another example: if two trains in A and B serve exactly the same stations at exactly the same arrival / departure times, this is also most likely the same train (see the sketch below). You should think of some testing mechanism that makes sure that every connection that was possible in feed A and feed B is still possible in the merged feed, i.e., that no information was lost. Given a trip that appears with different quality in both feeds, your tool should also automatically decide which (partial) representation has the better quality (for example, if in feed A no geographical information on the train route ('shape') is present, but in feed B it is, use the shape information from feed B). Your tool should be able to handle huge datasets (for example, the entire schedule [trains, buses, trams, ferries etc.] of Germany).
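
A minimal sketch of such an equivalence test, with simplified trip objects (a name and a list of (station, arrival, departure) tuples); the name normalization and the matching rule are only illustrative:

import re

def norm(name):
    # Normalize train names like "ICE 501" vs. "ICE501".
    return re.sub(r"\s+", "", name).lower()

def same_trip(trip_a, trip_b):
    # Two trips are considered merge candidates if their names match
    # after normalization and their shared stations have identical times.
    if norm(trip_a["name"]) != norm(trip_b["name"]):
        return False
    stops_a = {s: (arr, dep) for s, arr, dep in trip_a["stops"]}
    shared = [s for s, _, _ in trip_b["stops"] if s in stops_a]
    return bool(shared) and all(
        stops_a[s] == (arr, dep)
        for s, arr, dep in trip_b["stops"] if s in stops_a)

a = {"name": "ICE 501", "stops": [("Basel Bad Bf", "15:44", "15:49"),
                                  ("Basel SBB", "15:57", "16:00")]}
b = {"name": "ICE501", "stops": [("Basel Bad Bf", "15:44", "15:49"),
                                 ("Zürich HB", "16:55", None)]}
print(same_trip(a, b))  # True -> candidates for merging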

Patrick Brosi

Extract and Analyze Scientists' Homepages (project and/or thesis): Extract a large number of scientists' homepages from the CommonCrawl web crawl. Extract the central information from these pages, including: name, profession, gender, affiliation. It will be relatively straightforward to get results of medium quality. The challenge is to achieve results of high quality. Machine learning will be crucial to achieve that. Exploring suitable methods is part of the challenge.


Type: Interesting and well-defined problem. It will be relatively straightforward to get results of medium quality. The challenge is to achieve results of high quality. Machine learning will be crucial to achieve that. Exploring suitable methods is part of the challenge.

Goal: Extract a large number of scientists' homepages from the CommonCrawl web crawl. Extract the central information from these pages, including: name, profession, gender, affiliation.

Step 1: Download a large chunk of university webpages (e.g. all universities in Europe or in the US or even in the whole world).

Step 2: Design, implement and run a classifier that identifies people's homepages. A first version can be rule-based (see the sketch below). For a high-quality classifier, machine learning will eventually be needed.
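
A minimal rule-based first version of such a classifier could look like this sketch; the cue phrases are illustrative and would be replaced by a learned model:

import re

CUES = ["publications", "curriculum vitae", "research interests",
        "teaching", "professor", "ph.d"]

def looks_like_homepage(html_text):
    # Count how many typical homepage cue phrases occur in the page;
    # two or more is taken as a (weak) positive signal.
    text = re.sub(r"<[^>]+>", " ", html_text).lower()
    hits = sum(1 for cue in CUES if cue in text)
    return hits >= 2

page = "<h1>Jane Doe</h1><p>Professor of CS. Research interests: IR.</p>"
print(looks_like_homepage(page))  # True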

Step 3: Extract basic information, like the name, the gender, the profession, and the affiliation. A first version can be rule-based. For results with high precision and recall (which is the goal), machine learning will eventually be needed.

Hannah Bast

Tokenization Repair (project and/or thesis): Interesting and well-defined problem, the solution of which is relevant in a variety of information retrieval scenarios. Simple rule-based solutions come to mind easily, but machine learning is key to get very good results.


Type: Interesting and well-defined problem, the solution of which is relevant in a variety of information retrieval scenarios. Simple rule-based solutions come to mind easily, but machine learning is key to get very good results. A background in machine learning, or a strong willingness to acquire one as part of the project/thesis, is therefore mandatory for this project.

Background info: Breaking a natural-language text into words is a basic first step for almost any system that does something sophisticated with text (we only consider word-based languages in this project/thesis, and not single-symbol-based languages like Chinese). This is called tokenization. If the text is formatted perfectly, this is an (almost) trivial problem: just take the words as maximal strings of word characters. However, text is almost never formatted perfectly. The following kinds of formatting errors are frequently encountered in almost any text:

  1. Hyphenation due to a line break (e.g. Our algo-[newline]rithm runs in linear time.).

  2. Wrong spaces (e.g. Our algo rithm runs in linear time.).

  3. Missing spaces (e.g. Our algorithm runsin linear time.)

  4. Garbled characters (e.g. Our algorifm runs in linear time.)

Most text processing systems simply break for such errors. For example, a typical search engine will simply not find the word algorithm in the sentences 1, 2, and 4 above. A POS tagger will not recognize runsin as a verb followed by a preposition, which can also affect the correctness of the other POS tags in the sentence. More complex NLP routines, like parsing, are known to be very non-robust to such errors and may produce completely wrong results for the whole sentence.

Goal: Design, implement, and evaluate a system that recognizes and fixes errors of the types 1, 2, and 3 in a given text. Check if 4 can be solved with similar techniques. If yes, also solve it. If no, characterize how it is a different problem (requiring different techniques). The first meaningful steps in this project are as follows.

Step 1: Search the literature for solutions to this problem. It is an important problem, but has received much less research attention so far than other problems in NLP / text processing / information retrieval.

Step 2: Design simple baseline algorithms (not based on learning, or using a very simple kind of learning), design a first small and simple (but not too small and simple) benchmark, and evaluate your baseline algorithms on this benchmark. This will give you a feeling for how hard the problem is and where the actual problems lie. Otherwise your ideas may go in the wrong direction, or you may optimize the wrong aspects.
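
As a concrete starting point, a dictionary-based baseline for error type 3 (missing spaces) can be sketched as follows; the tiny dictionary is only for illustration:

def insert_spaces(text, words):
    # Dynamic programming: split `text` into the fewest dictionary
    # words possible; returns None if no full segmentation exists.
    best = [None] * (len(text) + 1)
    best[0] = []
    for i in range(1, len(text) + 1):
        for j in range(max(0, i - 20), i):
            if best[j] is not None and text[j:i] in words:
                cand = best[j] + [text[j:i]]
                if best[i] is None or len(cand) < len(best[i]):
                    best[i] = cand
    return " ".join(best[-1]) if best[-1] is not None else None

words = {"our", "algorithm", "runs", "in", "linear", "time"}
print(insert_spaces("runsin", words))           # "runs in"
print(insert_spaces("algorithmrunsin", words))  # "algorithm runs in"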

Step 3: Familiarize yourself with the kinds of machine learning techniques that are useful in such a scenario. A neural network might be useful later on, but is most probably not the first thing you should try (if it gives bad results, you will not know if that is because the problem is so hard or because you used it in the wrong way, so you first need a baseline from a more classical learning algorithm).

Step 4: Come up with a learning-based algorithm, evaluate it, and see that you beat the baseline.

A background in machine learning, or a strong willingness to acquire one as part of the project/thesis, is mandatory for this project.

Hannah Bast

Combined SPARQL+Text search (project or thesis): This is well-suited as a project (B.Sc. or M.Sc.) but also provides ample opportunity for continuation with a thesis (B.Sc. or M.Sc.).


Type: This is well-suited as a project (B.Sc. or M.Sc.) but also provides ample opportunity for continuation with a thesis (B.Sc. or M.Sc.). You should be fond of good user interfaces and have good taste concerning layout, colors and such things. You should also like knowledge bases and big datasets and searching in them.

Background info: Broccoli is our current system for convenient semantic search on text combined with a knowledge base. Here is a demo. The demo is described in [2]. Broccoli works with a special backend, which only supports a particular subset of queries (in a nutshell: tree-like SPARQL queries with a text-search component). The backend also has its own peculiar and non-standard syntax. The backend is described in [4]. In the meantime, we have developed a full-featured SPARQL+Text engine, called QLever [1]. QLever supports full SPARQL queries with a text-search component. However, the current UI is very simplistic and hard to use: it is essentially a text field where one can enter a SPARQL query, and then one gets a table with the result tuples. Here is a link to an internal demo (only works with an IP address from the technical faculty), which searches the Freebase knowledge base in combination with the ClueWeb12 web corpus. An example query for copy&paste is given under [5].

Goal: Design a front-end for QLever that makes it easy to ask SPARQL+Text queries, also without prior knowledge of the entity and predicate names of the underlying knowledge base (in the demos above: Freebase). The user interface should be simpler than that of Broccoli.

Subgoal 1: Broccoli does not know synonyms. For example, when you type occupation there will be no relation match because the name of the relation in Freebase is profession. Or if you type acted in there will be no relation match because the name of the relation in Freebase is Film performance. Basic synonyms are very important for usability.

Subgoal 2: Broccoli has four boxes with suggestions, see the demo linked above. This confused users a lot. One of the main tasks will be to devise a simpler interface that is still powerful enough to formulate all queries.

Subgoal 3: Broccoli currently shows only a list of entities with text snippets. A result from a general SPARQL query can consist of an arbitrary number of columns (for example, a person, their profession, their gender, and some text snippets where they are mentioned together with certain words).

Subgoal 4: One problem with the current Broccoli interface is that the user has to know what is a predicate and what is an object. However, this depends a lot on how the statements are represented in the knowledge base, which is not something the user can or is supposed to know a priori. It would be better to provide completions for a predicate combined with an object at the same time.

[1] H. Bast and B. Buchhold. QLever: a Query Engine for Efficient SPARQL+Text Search. CIKM 2017. Code and setup instructions on GitHub

[2] H. Bast, F. Bäurle, B. Buchhold, E. Haußmann. Semantic Full-Text Search with Broccoli (Demo paper). SIGIR 2014: 1265-1266

[3] H. Bast, F. Bäurle, B. Buchhold, E. Haußmann. Broccoli: Semantic Full-Text Search at your Fingertips (System and Evaluation Paper). CoRR abs/1207.2615 (2012)

[4] H. Bast and B. Buchhold. An index for efficient semantic full-text search. CIKM 2013: 369-378

[5] Example SPARQL+Text query (astronauts who walked on the moon) for copy&paste in the text field of our internal QLever demo:

PREFIX fb: <http://rdf.freebase.com/ns/>
SELECT ?person ?name TEXT(?t) SCORE(?t) WHERE {
  ?person fb:people.person.profession ?profession .
  ?profession fb:type.object.name.en "Astronaut" .
  ?person fb:type.object.name.en ?name .
  ?person <in-text> ?t .
  ?t <in-text> "walk* moon"
}
ORDER BY DESC(SCORE(?t))
LIMIT 100

You should be fond of good user interfaces and have a good taste concerning layout and colors and such things. You should also like knowledge bases and big datasets and searching in them.

Hannah Bast

