Goal: Design and implement a simple but effective named-entity recognizer for a web-scale corpus, namely ClueWeb12 [1]. The following features should be supported:
1. Recognize literal mentions (trivial). For example, recognize "Angela Merkel" as https://en.wikipedia.org/wiki/Angela_Merkel .
2. Recognize partial mentions of entities that have been mentioned literally before. For example, recognize "Merkel" as https://en.wikipedia.org/wiki/Angela_Merkel once she has been mentioned by her full name.
3. Recognize pronoun mentions (he, she, it, ...) of entities that have been mentioned literally before. For example, recognize "she" as https://en.wikipedia.org/wiki/Angela_Merkel if she has been mentioned before. Take gender into account: "she" should be resolved to the most recently mentioned female entity.
4. Recognize mentions of the form "the <TYPE>" that follow a mention with the full entity name. For example, recognize "the film" as https://en.wikipedia.org/wiki/The_Matrix if that film has been mentioned by its full name before.
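
The four features above can be sketched as a single left-to-right pass over one document. This is only a minimal illustration under stated assumptions, not a prescribed design: the Entity record, the toy KB, the gender and type_noun fields, the STOPWORDS list, and the recognize function are all hypothetical names, and a real system would derive names, genders, and types from an external resource rather than a hand-written list.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    name: str       # full name as it appears in text
    url: str        # identifier the mention is linked to
    gender: str     # assumed attribute: "female", "male", or "neuter"
    type_noun: str  # assumed attribute: noun used in "the <TYPE>"


# Toy knowledge base (hypothetical), just enough for the examples in the list.
KB = [
    Entity("Angela Merkel", "https://en.wikipedia.org/wiki/Angela_Merkel",
           "female", "politician"),
    Entity("The Matrix", "https://en.wikipedia.org/wiki/The_Matrix",
           "neuter", "film"),
]

PRONOUN_GENDER = {"he": "male", "she": "female", "it": "neuter"}
STOPWORDS = {"the", "a", "an", "of"}  # never treated as partial names


def recognize(text, kb=KB):
    """Return (surface form, url) pairs in order of appearance."""
    by_name = {e.name: e for e in kb}
    # Longest names first, so a full name wins over any shorter alternative.
    name_alt = "|".join(
        re.escape(n) for n in sorted(by_name, key=len, reverse=True))
    type_alt = "|".join(re.escape(t) for t in sorted({e.type_noun for e in kb}))
    pattern = re.compile(
        rf"\b(?P<full>{name_alt})\b"
        rf"|\b[Tt]he\s+(?P<type>{type_alt})\b"
        rf"|(?P<word>\w+)"
    )

    seen = []      # entities introduced by full name, most recent mention last
    mentions = []

    def latest(pred):
        # Most recently mentioned entity that satisfies the predicate.
        return next((e for e in reversed(seen) if pred(e)), None)

    for m in pattern.finditer(text):
        if m.group("full"):                       # feature 1: literal mention
            entity = by_name[m.group("full")]
        elif m.group("type"):                     # feature 4: "the <TYPE>"
            noun = m.group("type")
            entity = latest(lambda e: e.type_noun == noun)
        else:
            word = m.group("word")
            gender = PRONOUN_GENDER.get(word.lower())
            if gender:                            # feature 3: pronoun
                entity = latest(lambda e: e.gender == gender)
            elif word.lower() in STOPWORDS:
                entity = None
            else:                                 # feature 2: partial mention
                entity = latest(lambda e: word in e.name.split())
        if entity is not None:
            seen.append(entity)                   # any mention refreshes recency
            mentions.append((m.group(0), entity.url))
    return mentions


if __name__ == "__main__":
    doc = ("Angela Merkel praised The Matrix. "
           "Merkel said she liked the film, and it won awards.")
    for surface, url in recognize(doc):
        print(f"{surface!r} -> {url}")
```

In this sketch, partial mentions, pronouns, and "the <TYPE>" phrases resolve only to entities that were already introduced by their full name in the same document, so "She liked the film." on its own yields no mentions. Because each document is processed independently in one pass, the same function could in principle be applied to the documents of a large corpus in parallel.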
[1] http://lemurproject.org/clueweb12/ We have purchased this dataset and it's available on our file system.