BachelorAndMasterProjectsAndTheses/WordExtraction

Type: An interesting and practical bachelor thesis. A basic understanding of Machine Learning is required; knowledge of Deep Learning is desirable. The preferred programming language is Python.

Background info: Many text documents are layout-based: their text can use different fonts and can be arranged in multiple columns. Think of books, newspapers or scientific research papers. There are so-called Manhattan layouts and non-Manhattan layouts. In Manhattan layouts, all columns are rectangular and aligned parallel to the page borders. In non-Manhattan layouts, the columns have no fixed shape and can be aligned arbitrarily. Here are two examples:

[Two example documents: a Manhattan layout (left) and a non-Manhattan layout (right)]

The left document shows a Manhattan layout, the right document a non-Manhattan layout. In the non-Manhattan layout, note in particular the two text blocks in the middle of the page. Deciding that neither of these text blocks is part of the other two columns (but that they are two distinct text blocks) might be easy for a human reader, but is quite hard for a computer.

Goal: Extract words column-wise from both Manhattan layouts and non-Manhattan layouts using machine learning techniques, without mixing words from different columns.

Challenge 1: In non-Manhattan layouts, the columns can be of arbitrary form and can be placed at any position, see above.

Challenge 2: Layout-based documents (in particular: PDF documents) often provide the text only character-wise (but not word-wise). Furthermore, there are typically no whitespace characters. Thus, the boundaries of words must be derived, e.g., by analyzing the spacings between the characters. But the spacings can vary from line to line, and even within a line there is no fixed rule to determine the extent of a word from the spacings alone. Here is an example:

The blue boxes are the bounding boxes of the characters. Note the small spacings between most of the characters and in particular, the direct adjacency of the two f's (between which there is actually a word boundary). Based on the spacings alone, the characters might be extracted as the following six words: "s e r iffo n t". However, they should be extracted as the two words "serif font".
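To make the failure mode concrete, here is a minimal sketch of a purely spacing-based word splitter. The Char class, the function split_by_gap, the threshold and all coordinates are hypothetical and only chosen to mimic the example above; they are not taken from a real PDF or a real library.

```python
from dataclasses import dataclass


@dataclass
class Char:
    text: str
    x0: float  # left edge of the bounding box
    x1: float  # right edge of the bounding box


def split_by_gap(chars, threshold):
    """Split one text line into words: start a new word whenever the
    horizontal gap to the previous character exceeds `threshold`."""
    words = []
    current = ""
    prev = None
    for ch in sorted(chars, key=lambda c: c.x0):
        if prev is not None and ch.x0 - prev.x1 > threshold:
            words.append(current)
            current = ""
        current += ch.text
        prev = ch
    if current:
        words.append(current)
    return words


# Made-up bounding boxes: small gaps between most characters,
# and no gap at all between the two f's (the actual word boundary).
line = [
    Char("s", 0.0, 4.0), Char("e", 5.0, 9.0), Char("r", 10.0, 13.0),
    Char("i", 14.0, 15.8), Char("f", 16.0, 19.0), Char("f", 19.0, 22.0),
    Char("o", 22.2, 26.0), Char("n", 27.0, 31.0), Char("t", 32.0, 35.0),
]

print(split_by_gap(line, threshold=0.5))
# ['s', 'e', 'r', 'iffo', 'n', 't']
```

With these assumed coordinates, no threshold yields "serif font": the gap at the true word boundary is the smallest gap in the line, so any threshold that splits there also splits everywhere else.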

Challenge 3: Internally, layout-based documents can store the characters in an order that interleaves the columns. If the characters are simply processed in that order, words from different columns get mixed and cannot be extracted correctly.
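The effect of an interleaved storage order can be illustrated with a small sketch. It assumes that each character already carries a column id, which is exactly the information that has to be computed in this thesis; the record layout (text, x, y, column id) and the function names are hypothetical, and y is assumed to grow downwards.

```python
from collections import defaultdict

# Hypothetical character records in storage order: (text, x, y, column_id).
# The two columns are interleaved character by character.
stored = [
    ("a", 0, 0, 0), ("x", 100, 0, 1),
    ("b", 10, 0, 0), ("y", 110, 0, 1),
]


def naive_text(chars):
    """Concatenate the characters in the order in which they are stored."""
    return "".join(text for text, _, _, _ in chars)


def column_wise_text(chars):
    """Group the characters by column, then order them by line (y) and
    position within the line (x)."""
    by_column = defaultdict(list)
    for text, x, y, column in chars:
        by_column[column].append((y, x, text))
    return {
        column: "".join(text for _, _, text in sorted(items))
        for column, items in by_column.items()
    }


print(naive_text(stored))        # axby
print(column_wise_text(stored))  # {0: 'ab', 1: 'xy'}
```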

Subgoal 1: Search the literature for work related to this problem.

Subgoal 2: Design a (simple) baseline algorithm (e.g., using a rule-based approach, but not machine learning). Design a small benchmark and evaluate your baseline algorithm on this benchmark. The motivation behind this step is to give you a feeling for how hard the problem is and where the actual problems lie.
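One possible way to score a baseline on such a benchmark is to compare the extracted words of a column with the expected words and compute precision, recall and F1. This is only a sketch of one conceivable measure; the choice of metric is part of the thesis, and the function name word_scores is hypothetical.

```python
from collections import Counter


def word_scores(expected, extracted):
    """Return (precision, recall, F1) of the extracted words, treating
    both word lists as multisets (word order is ignored)."""
    exp, ext = Counter(expected), Counter(extracted)
    true_positives = sum((exp & ext).values())  # multiset intersection
    precision = true_positives / sum(ext.values()) if ext else 0.0
    recall = true_positives / sum(exp.values()) if exp else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# The spacing-based result from Challenge 2 contains no correct word.
print(word_scores(["serif", "font"], ["s", "e", "r", "iffo", "n", "t"]))
# (0.0, 0.0, 0.0)
```

Because this measure ignores word order, it does not detect words that are correct but assigned to the wrong column; a benchmark for this task would need to score each column separately or use an order-aware measure.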

Subgoal 3: Design, implement and train a (supervised) machine learning model for a fast (!) identification of columns. This can be based on a neural network or any other well-performing machine learning technique. This includes creating the large and appropriate training data needed to train the model.

Subgoal 4: Similarly, design, implement and train a language model for identifying the words contained in each column, e.g., based on the characters themselves and further features like the spacings between characters or their fonts.
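As a rough illustration of what such features could look like, the following sketch derives a feature dictionary for each pair of adjacent characters; a model could then decide per pair whether a word boundary lies in between. The Char class, the feature names and the example values are assumptions made for this sketch, not a prescribed design.

```python
from dataclasses import dataclass


@dataclass
class Char:
    text: str
    x0: float   # left edge of the bounding box
    x1: float   # right edge of the bounding box
    font: str   # font name
    size: float  # font size


def pair_features(left, right):
    """Features for the decision: is there a word boundary between
    `left` and `right`?"""
    return {
        # Gap relative to the font size, so it is comparable across lines.
        "gap_rel": (right.x0 - left.x1) / left.size,
        "font_change": left.font != right.font,
        "size_change": left.size != right.size,
        # The characters themselves, as input for a language model.
        "bigram": left.text + right.text,
    }


# The two adjacent f's from Challenge 2, with made-up font information.
f1 = Char("f", 16.0, 19.0, "Times", 10.0)
f2 = Char("f", 19.0, 22.0, "Times", 10.0)
print(pair_features(f1, f2))
# {'gap_rel': 0.0, 'font_change': False, 'size_change': False, 'bigram': 'ff'}
```

In this case the geometric features give no hint of a boundary, so the decision would have to come from the characters themselves.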

Subgoal 5: A thorough evaluation of your learning-based approach, including a comparison to your baseline approach.

Supervision by Claudius Korzen

AD Teaching Wiki: BachelorAndMasterProjectsAndTheses/WordExtraction (last edited 2020-08-24 09:53:52 by Natalie Prange)