BachelorAndMasterProjectsAndTheses/TokenizationRepair

Type: Interesting and well-defined problem, the solution of which is relevant in a variety of information retrieval scenarios. Simple rule-based solutions come to mind easily, but machine learning is key to get very good results. A background in machine learning, or a strong willingness to acquire one as part of the project/thesis, is therefore mandatory for this project.

Background info: Breaking a natural-langue text into words is a basic first step for almost any system that does something sophisticated with text (we only consider word-based languages in this project/theses, and not single-symbol-based languages like Chinese). This is called tokenization. If the text is formatted perfectly, this is an (almost) trivial problem: just take the words as maximal strings of word characters. However, text is almost never formatted perfectly. The following kinds of formatting errors are frequently encountered in almost any text:

Hyphenation due to a line break (e.g. Our algo-[newline]rithm runs in linear time.).
Wrong spaces (e.g. Our algo rithm runs in linear time.).
Missing spaces (e.g. Our algorithm runsin linear time.)
Garbled characters (e.g. Our algorifm runs in linear time.)

Most text processing systems simple break for such errors. For example, a typical search engine will simply not find the word algorithm in the sentences 1, 2, and 4 above. A POS-tagger will not recognize runsin as a verb followed by a preposition, which can also affect the correctness of the other POS-tags in the sentence. More complex NLP routines, like parsing, are known to be very non-robust to such errors and may produce completely wrong results for the whole sentence.

Goal: design, implement, and evaluate a system that recognizes and fixes errors of the types 1, 2, and 3 in a given text. Check if 4 can be solved with similar techniques. If yes, also solve it. If no, characterize how it is a different problem (requiring different techniques). The first meaningful steps in this project are as follows.

Step 1: Search the literature for solutions to this problem. It is an important problem, but has received muss less research attention so far then other problems in NLP / text processing / information retrieval.

Step 2: Design simple baseline algorithms (not based on learning or using a very simple kind of learning), design a first small and simple (but not too small and simple) benchmark and evaluate your baseline algorithms on this benchmark. This will give you a feeling of how hard the problem is and where the actual problems lie. Actually your ideas may go in the wrong direction or you optimize the wrong aspects.

Step 3: Familiarize yourself with the kind of machine learning techniques that are useful in such a scenario. A neural network might be useful later on, but is most probably not the first think you should try (if it gives bad results, you will not know if its because the problem is so hard or because you used it in the wrong way, so you first need a baseline from a more classical learning algorithm).

Step 4: Come up with a learning-based algorithm and evaluate and see that you beat the baseline.

AD Teaching Wiki: BachelorAndMasterProjectsAndTheses/TokenizationRepair (last edited 2020-08-18 15:53:18 by Natalie Prange)