#acl All:read

= Named Entity Recognition (project and/or thesis) =

'''Type:''' Interesting and well-defined classical text processing problem with
broad applicability in knowledge extraction and in the combination of structured
and unstructured data. While rule-based approaches have been studied and may be
feasible in very simple scenarios, machine learning is necessary for noisy
real-world text that uses synonyms and references. A background in machine
learning, or a strong willingness to acquire one as part of the project/thesis,
is therefore mandatory for this project.

'''Background info:''' For many tasks in text processing, it is an essential
prerequisite to know which tokens in a text refer to a specific (i.e. named)
entity such as a person, date or topic. Furthermore, it is often
necessary to link such an entity occurrence to the respective entity or
concept in a knowledge base. This is complicated by the fact that many
entities share a name but differ in meaning, often with both entities
in the same class, such as different people with the same name.

As an example, in a review of software engineering literature one might find the following
title:

"Software Requirements and Design: The Work of Michael Jackson" (from [[http://www.research.att.com/people/Zave_Pamela/custom/indexCustom.html|here]])

Here, an entity recognizer would have to match "Michael Jackson"
to the [[https://en.wikipedia.org/wiki/Michael_A._Jackson|computer scientist]]
instead of the singer of the same name.

This task is known as Named Entity Recognition, and considerable effort has
been expended on it. Nevertheless, the growing availability of large amounts
of structured information in the form of knowledge bases such as Freebase,
as well as advances in machine learning, allows for new approaches, especially
when targeting unconstrained, non-domain-specific data.

'''Goal''': Design, implement, and evaluate a system for recognizing knowledge
base entities in general unconstrained text.

'''Step 0''': Search the literature for existing approaches to this problem and
familiarize yourself with the available knowledge base data sets. Design and
implement a baseline version which recognizes contiguous ranges of tokens as
possible entity occurrences. For example, in the sentence "Trump met Angela
Merkel in the White House" it should match "Trump", "Angela Merkel" and "White
House". Note that, unlike in the example, you cannot rely on correct casing
alone. This could, for example, be implemented using a BIO scheme where tokens
are tagged as either '''B'''eginning, '''I'''nside or '''O'''utside of an entity. While
a rule-based approach may be a useful first step, this task will likely already
benefit greatly from machine learning techniques. It may also be useful
to incorporate matching against a list of entities as a feature here.
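
To make the BIO scheme concrete, the following is a minimal rule-based sketch, not a recommended final design. It tags tokens by greedy longest match against a small list of entity names and ignores casing. The function name `bio_tag`, the `max_len` parameter and the three-entry list are illustrative assumptions; a real system would draw names from the knowledge base and would likely use the match only as one feature of a learned tagger.

```python
def bio_tag(tokens, entity_names, max_len=4):
    """Tag each token as B, I or O by greedy longest match (hypothetical helper)."""
    # Lowercase everything so that matching does not depend on casing.
    entries = {tuple(w.lower() for w in name.split()) for name in entity_names}
    lowered = [t.lower() for t in tokens]
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        # Try the longest span first so "angela merkel" beats a shorter match.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(lowered[i:i + n]) in entries:
                tags[i] = "B"
                for j in range(i + 1, i + n):
                    tags[j] = "I"
                i += n
                break
        else:
            # No entry starts at this token; leave it tagged O.
            i += 1
    return tags


# Toy entity list and a lowercased version of the example sentence.
entity_names = ["Trump", "Angela Merkel", "White House"]
tokens = "trump met angela merkel in the white house".split()
tags = bio_tag(tokens, entity_names)
for token, tag in zip(tokens, tags):
    print(token, tag)
```

Exact list matching of this kind misses unseen names and cannot decide whether an ambiguous word is used as a name, which is where learned models are expected to help.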

'''Step 1''': Extend your baseline version to find, for each entity occurrence,
a set of entities in the knowledge base that it may refer to. This should
incorporate some kind of fuzzy matching, allowing for example "ellen" to match
[[https://en.wikipedia.org/wiki/Ellen_DeGeneres|Ellen DeGeneres]].
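
One possible way to sketch the fuzzy matching is token-wise approximate string comparison with Python's standard `difflib` module. The function name `find_candidates`, the cutoff of 0.8 and the small name list are assumptions chosen for illustration; a real knowledge base would need an index rather than a linear scan over all names.

```python
import difflib


def find_candidates(mention, entity_names, cutoff=0.8):
    """Return names in which every mention token approximately matches some token."""
    mention_tokens = mention.lower().split()
    result = []
    for name in entity_names:
        name_tokens = name.lower().split()
        # get_close_matches returns an empty list if no token is similar enough.
        if all(
            difflib.get_close_matches(tok, name_tokens, n=1, cutoff=cutoff)
            for tok in mention_tokens
        ):
            result.append(name)
    return result


# Toy stand-in for knowledge base entity names.
entity_names = ["Ellen DeGeneres", "Ellen Page", "Angela Merkel"]
print(find_candidates("ellen", entity_names))
print(find_candidates("elen degeneres", entity_names))  # tolerates a typo
```

Matching per token rather than on the whole string lets a partial mention such as a first name still reach the full entity name, at the cost of producing more candidates for Step 2 to rank.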

'''Step 2''': Design and implement a machine learning based approach to rank
possible matches from Step 1 by their likelihood given the context.
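
As a starting point for the ranking, the sketch below scores each candidate by the cosine similarity between a bag of words of the mention's context and a bag of words of a short entity description. This is a hand-written similarity and not a learned model; in a machine learning approach it could serve as one input feature. The function names and the two descriptions are made up for illustration and are not taken from any knowledge base.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count lowercase alphabetic tokens in a text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(context, candidate_descriptions):
    """Return candidate names sorted by descending similarity to the context."""
    ctx = bag_of_words(context)
    scored = [
        (cosine(ctx, bag_of_words(description)), name)
        for name, description in candidate_descriptions.items()
    ]
    return [name for _, name in sorted(scored, reverse=True)]


context = "Software Requirements and Design: The Work of Michael Jackson"
# Illustrative descriptions only (assumed, not real knowledge base text).
candidate_descriptions = {
    "Michael Jackson (computer scientist)":
        "computer scientist known for software design and requirements methods",
    "Michael Jackson (singer)":
        "singer, songwriter and dancer known for pop music",
}
ranking = rank(context, candidate_descriptions)
print(ranking)
```

With these toy descriptions, the context shares "software", "requirements" and "design" with the computer scientist, so that candidate is ranked first.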

'''NOTE''': Coreference analysis and other sentence structure analysis may be
regarded as out of scope for this project.