Named Entity Recognition (project and/or thesis)
Type: An interesting and well-defined classical text processing problem with broad applicability in knowledge extraction and in combining structured and unstructured data. While rule-based approaches have been studied and may be feasible in very simple scenarios, machine learning is necessary for real-world noisy text that uses synonyms and references. A background in machine learning, or a strong willingness to acquire one as part of the project/thesis, is therefore mandatory for this project.
Background info: For many text processing tasks, it is an essential prerequisite to know which tokens in a text refer to a specific (i.e. named) entity such as a person, date or topic. Furthermore, it is often necessary to link such an entity occurrence to the corresponding entity or concept in a knowledge base. This is complicated by the fact that many entities share a name but differ in meaning, often within the same class, such as different people with the same name.
As an example, in a review of software engineering literature one might find the following title:
"Software Requirements and Design: The Work of Michael Jackson" (from here)
For this title, an entity recognizer would have to match "Michael Jackson" to the computer scientist instead of the singer of the same name.
This task is known as Named Entity Recognition, and considerable effort has been expended on it. Nevertheless, the growing availability of large amounts of structured information in the form of knowledge bases such as Freebase, together with advances in machine learning, allows for new approaches, especially when targeting unconstrained, non-domain-specific data.
Goal: Design, implement, and evaluate a system for recognizing knowledge base entities in general unconstrained text.
Step 0: Search the literature for existing approaches to this problem and familiarize yourself with the available knowledge base data sets. Design and implement a baseline version which recognizes contiguous ranges of tokens as possible entity occurrences. For example, in the sentence "Trump met Angela Merkel in the White House" it should match "Trump", "Angela Merkel" and "White House". Note that, unlike in this example, you cannot rely on correct casing alone. This could, for example, be implemented using a BIO scheme in which each token is tagged as either the Beginning, Inside or Outside of an entity. While a rule-based approach may be a useful first step, this task will likely already benefit greatly from machine learning techniques. It may also be useful at this stage to incorporate matches against a list of entities as a feature.
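To make the BIO scheme concrete, the following is a minimal rule-based sketch and not a prescribed design. It assumes a small, hypothetical entity list (GAZETTEER) stored in lowercase and tags tokens by greedy longest match, so it does not depend on casing. The names bio_tag, spans_from_bio and max_len are illustrative only. In a learned tagger, the same lookup result could serve as one feature among many.

```python
from typing import List, Set, Tuple


def bio_tag(tokens: List[str], gazetteer: Set[Tuple[str, ...]], max_len: int = 4) -> List[str]:
    """Tag tokens as B/I/O by greedy longest match against a lowercase entity list."""
    tags = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    i = 0
    while i < len(tokens):
        match_len = 0
        # Try the longest candidate span first so "white house" beats "white".
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(lowered[i:i + n]) in gazetteer:
                match_len = n
                break
        if match_len:
            tags[i] = "B"
            for j in range(i + 1, i + match_len):
                tags[j] = "I"
            i += match_len
        else:
            i += 1
    return tags


def spans_from_bio(tokens: List[str], tags: List[str]) -> List[str]:
    """Collect the contiguous token ranges marked by a B followed by zero or more I tags."""
    spans, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


# Hypothetical entity list; a real system would derive this from the knowledge base.
GAZETTEER = {("trump",), ("angela", "merkel"), ("white", "house")}

# Deliberately lowercased input to show that casing is not required.
tokens = "trump met angela merkel in the white house".split()
tags = bio_tag(tokens, GAZETTEER)
spans = spans_from_bio(tokens, tags)
```

Here tags is one label per token and spans holds the three entity occurrences from the example sentence. A pure lookup like this cannot find entities missing from the list, which is one reason a learned model is likely to do better.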
Step 1: Extend your baseline version to find, for each entity occurrence, a set of knowledge base entities it may refer to. This should incorporate some kind of fuzzy matching, allowing for example "ellen" to match Ellen DeGeneres.
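One possible way to approach the fuzzy matching is sketched below using difflib.SequenceMatcher from the Python standard library. The entity names in KB_NAMES, the functions similarity and candidates, and the threshold of 0.8 are all assumptions made for illustration. The sketch compares a mention against the full entity name and against each word of the name, so that a first name alone can still produce a match.

```python
from difflib import SequenceMatcher
from typing import List


def similarity(mention: str, name: str) -> float:
    """Best string similarity between the mention and the full name or any single word of it."""
    mention = mention.lower()
    name = name.lower()
    parts = [name] + name.split()
    return max(SequenceMatcher(None, mention, part).ratio() for part in parts)


def candidates(mention: str, kb_names: List[str], threshold: float = 0.8) -> List[str]:
    """Return knowledge base names whose similarity reaches the threshold, best first."""
    scored = [(similarity(mention, name), name) for name in kb_names]
    scored.sort(key=lambda pair: -pair[0])  # stable sort keeps KB order among ties
    return [name for score, name in scored if score >= threshold]


# Hypothetical knowledge base entries used only for this illustration.
KB_NAMES = ["Ellen DeGeneres", "Ellen Page", "Angela Merkel", "Michael Jackson"]

ellen_matches = candidates("ellen", KB_NAMES)
typo_matches = candidates("elen", KB_NAMES)
```

Scanning every name this way is linear in the size of the knowledge base, so a real data set would probably need an index (for example over character n-grams) to narrow the search first.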
Step 2: Design and implement a machine-learning-based approach to rank the possible matches from Step 1 by their likelihood given the context.
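As a point of comparison for the learned ranker, the sketch below ranks candidates by the bag-of-words cosine similarity between the mention's context and a short description of each candidate. This is a simple non-learned stand-in, not the machine learning approach the step asks for, although its score could be used as one input feature of such a model. The descriptions in CANDIDATES are made up for illustration, and the names tokenize, cosine and rank are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> Counter:
    """Lowercase the text and count its alphanumeric word tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank(context: str, descriptions: Dict[str, str]) -> List[str]:
    """Order candidate entities by how similar their description is to the context."""
    ctx = tokenize(context)
    return sorted(descriptions, key=lambda name: cosine(ctx, tokenize(descriptions[name])), reverse=True)


# Illustrative candidate descriptions; real ones would come from the knowledge base.
CANDIDATES = {
    "Michael Jackson (singer)": "american singer songwriter and dancer known for pop music",
    "Michael Jackson (computer scientist)": "computer scientist known for software requirements and design methods",
}

context = "Software Requirements and Design: The Work of Michael Jackson"
ranking = rank(context, CANDIDATES)
```

With these descriptions the computer scientist is ranked first, because his description shares more words with the title than the singer's does. A learned ranker could combine this score with other signals such as entity popularity or the string similarity from Step 1.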
NOTE: Co-reference analysis and other sentence structure analysis may be regarded as out of scope for this project.