Differences between revisions 1 and 4 (spanning 3 versions)

Accurate Word Extraction from Documents with Complex Layouts

Type: An interesting and practical bachelor thesis. You should have a basic understanding of machine learning and, optionally, deep learning. Python is the preferred programming language.

Background info: Many text documents are layout-based: their text can use different fonts and can be arranged in multiple columns. Think of books, newspapers or scientific research papers. There are so-called Manhattan layouts and non-Manhattan layouts. In Manhattan layouts, all columns are rectangular and aligned parallel to the page borders. In non-Manhattan layouts, the columns have no fixed shape and can be aligned arbitrarily. Here are two examples:

TODO: Figure: Manhattan Layout and Non-Manhattan layout TODO: Explain the figure.

Goal: Extract words column by column from both Manhattan and non-Manhattan layouts, without mixing words from different columns.

Challenge 1: Layout-based documents (in particular, PDF documents) often provide the text only character by character, not word by word, and typically contain no whitespace characters. The word boundaries must therefore be derived from other cues, for example by analyzing the spacings between the characters. But the spacings can vary from line to line, and even within a line there is no fixed rule that determines the extent of a word from the spacings alone. Here is an example (figure word-boundaries.jpg):

The blue boxes are the bounding boxes around the characters. Note the small spacings between most of the characters and, in particular, the direct adjacency of the two f's (between which there is actually a word boundary). Based on the spacings alone, the words might be extracted as "s e r iffo n t" instead of "serif font".
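
To make the problem concrete, here is a minimal Python sketch of a purely spacing-based word splitter. It is only an illustration of why the naive approach fails, not a proposed solution. The Char class, the make_line helper and all gap values are hypothetical and chosen to mimic the example above; they are not measured from the figure.

```python
from dataclasses import dataclass


@dataclass
class Char:
    text: str
    x_min: float  # left edge of the character's bounding box
    x_max: float  # right edge of the character's bounding box


def split_by_gap(chars, threshold):
    """Split the characters of one line into words.

    A new word starts whenever the horizontal gap between two
    consecutive bounding boxes is larger than `threshold`.
    """
    words = []
    current = ""
    prev = None
    for ch in chars:
        if prev is not None and ch.x_min - prev.x_max > threshold:
            words.append(current)
            current = ""
        current += ch.text
        prev = ch
    if current:
        words.append(current)
    return words


def make_line(text, gaps, width=5.0):
    """Build hypothetical bounding boxes from a text and the gaps between its characters."""
    chars = []
    x = 0.0
    for i, c in enumerate(text):
        if i > 0:
            x += gaps[i - 1]
        chars.append(Char(c, x, x + width))
        x += width
    return chars


# Assumed gaps for "serif font" without whitespace: wide gaps inside the
# words, but almost no gap around the two adjacent f's.
gaps = [1.0, 1.0, 1.0, 0.2, 0.0, 0.2, 1.0, 1.0]
line = make_line("seriffont", gaps)
words = split_by_gap(line, threshold=0.5)
print(words)
```

With these assumed gaps, no single threshold separates "serif" from "font": the gap at the real word boundary is the smallest one in the line.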

Challenge 2: Internally, layout-based documents can store the characters in an order that interleaves the columns. If the characters are processed in that order, characters from different columns get mixed and the words cannot be extracted correctly.
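
The effect of an interleaved storage order can be shown with a small Python sketch. It assumes a toy two-column Manhattan page whose gutter position is already known, which is exactly the information that the column identification step would have to provide. The tuple format (character, x, y), the coordinates and the convention that y grows from top to bottom are assumptions made for this illustration.

```python
def read_in_stored_order(chars):
    """Concatenate the characters exactly in the order they are stored."""
    return "".join(text for text, _, _ in chars)


def read_column_wise(chars, gutter_x):
    """Read a two-column page column by column.

    Each character is assigned to the left or right column by comparing
    its x position with the (assumed known) gutter position. Within a
    column, characters are ordered top to bottom, then left to right.
    """
    left = [ch for ch in chars if ch[1] < gutter_x]
    right = [ch for ch in chars if ch[1] >= gutter_x]
    result = []
    for column in (left, right):
        ordered = sorted(column, key=lambda ch: (ch[2], ch[1]))
        result.append("".join(ch[0] for ch in ordered))
    return result


# Hypothetical page: the left column contains "ab" / "cd", the right
# column contains "wx" / "yz". The stored order alternates between the
# columns line by line.
stored = [
    ("a", 10, 0), ("b", 20, 0), ("w", 110, 0), ("x", 120, 0),
    ("c", 10, 1), ("d", 20, 1), ("y", 110, 1), ("z", 120, 1),
]

mixed = read_in_stored_order(stored)
columns = read_column_wise(stored, gutter_x=100)
print(mixed)
print(columns)
```

A fixed gutter position only works for this toy case. In non-Manhattan layouts the columns cannot be separated by a single vertical line, which is why a learned column identification is part of the thesis.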

Subgoal 1: Design, implement and train a machine learning model for fast (!) column identification. Create large and meaningful training data. TODO

Subgoal 2: Design, implement and train a language model for identifying words in each column, in combination with features like the spacings between characters or their fonts. Create large and meaningful training data. TODO
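
As a rough idea of how a language model can help with word identification, the following Python sketch segments a character sequence with a toy unigram model and dynamic programming. This is a simple stand-in for illustration, not the model to be developed in the thesis. The vocabulary, its probabilities and the penalty for unknown words are invented for the example; a real model would be trained on a large corpus and would also use features such as spacings and fonts.

```python
import math

# Hypothetical toy unigram probabilities (not trained on real data).
UNIGRAM = {"serif": 0.02, "font": 0.03, "ser": 0.001, "if": 0.05, "on": 0.05}

# Assumed penalty per character for words that are not in the vocabulary.
UNKNOWN_LOG_PROB_PER_CHAR = math.log(1e-6)


def word_log_prob(word):
    """Log probability of a single word under the toy unigram model."""
    if word in UNIGRAM:
        return math.log(UNIGRAM[word])
    return UNKNOWN_LOG_PROB_PER_CHAR * len(word)


def segment(text, max_word_len=20):
    """Find the most probable split of `text` into words.

    best[i] holds the best log probability of any segmentation of
    text[:i]; back[i] holds the start index of the last word in it.
    """
    best = [0.0] + [-math.inf] * len(text)
    back = [0] * (len(text) + 1)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_word_len), end):
            score = best[start] + word_log_prob(text[start:end])
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Follow the back pointers from the end of the text to recover the words.
    words = []
    end = len(text)
    while end > 0:
        start = back[end]
        words.append(text[start:end])
        end = start
    return words[::-1]


result = segment("seriffont")
print(result)
```

In this toy setting the model prefers "serif" followed by "font" even though the two f's are directly adjacent. Spacing or font features could be added as further terms in the score of each candidate word.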

- Revision 1 as of 2019-05-03 11:55:47
  Size: 1977
  Editor: adpult
  Comment:
+ Revision 4 as of 2019-05-03 13:30:22
  Size: 2407
  Editor: adpult
  Comment:
 Line 3:
-'''Type''': An interesting and practical bachelor thesis. You should have a basic understanding of Machine Learning and optionally Deep Learning. Programming language Python is preferred.
+'''Type''': An interesting and practical bachelor thesis. You should have a basic understanding of Machine Learning and optionally, Deep Learning. Programming language Python is preferred.
 Line 5:
-'''Background info''': Many text documents are layout-based and contain text that can consists of many different fonts and can be arranged in any number of columns. Think of books, newspapers or scientific research papers. There are so called ''Manhattan'' layouts and ''non-Manhattan'' layouts. In Manhattan layouts, the shapes of columns are rectangular and aligned parallel to the page boundaries. In non-Manhattan layouts, the columns are of any shape and be aligned arbitrarily. Here are two examples:
+'''Background info''': Many text documents are layout-based and contain text that can consist of different fonts and can be arranged in multiple columns. Think of books, newspapers or scientific research papers. There are so called ''Manhattan'' layouts and ''non-Manhattan'' layouts. In Manhattan layouts, all columns are rectangular and aligned parallel to the page borders. In non-Manhattan layouts, the columns are of no fixed shape and can be aligned arbitrarily. Here are two examples:
 Line 7:
-TODO
+TODO: Figure: Manhattan Layout and Non-Manhattan layout
TODO: Explain the figure.
-Line 13:
+Line 14:
-TODO
+{{attachment:word-boundaries.jpg||width=200}}
-Line 15:
+Line 16:
-''Challenge 2'': Internally, layout-based documents can store the characters in an order interleaving between the columns. If the characters are processed in that order, extracting the words is obviously impossible.
+The blue boxes are the bounding boxes around the characters. Note the small spacings between most of the characters and in particular, the direct adjacency of the two f's (between which there is actually a word boundary). Based on the spacings alone, the words might be extracted as ''s e r iffo n t''.
-Line 17:
+Line 18:
-'''Subgoal 1''': Design, implement and train a machine learning model for fast ( ! ) column identification. Create large and meaningful training data.
+''Challenge 2'': Internally, layout-based documents can store the characters in an order interleaving between the columns. If the characters are processed in that order, extracting the words correctly is obviously impossible.
-Line 19:
+Line 20:
-'''Subgoal 1''': Design, implement and train a language model for identifying words in each column, in combination with features like the spacings between characters or their fonts. Create large and meaningful training data.
+'''Subgoal 1''': Design, implement and train a machine learning model for fast ( ! ) column identification. Create large and meaningful training data. TODO

'''Subgoal 1''': Design, implement and train a language model for identifying words in each column, in combination with features like the spacings between characters or their fonts. Create large and meaningful training data. TODO