Differences between revisions 3 and 12 (spanning 9 versions)

Accurate Word Extraction from Documents with Complex Layouts

Type: An interesting and practical project. A basic understanding of Machine Learning is required; knowledge of Deep Learning is desirable. The preferred programming language is Python.

Background info: Words in text documents can be hyphenated (especially often in layouts with multiple columns). Such a word appears broken into two parts at different positions in the text, and the parts need to be merged into a single word when text is extracted from the document (e.g., for indexing and searching the documents). The hyphen between the two parts can be mandatory even after merging, namely when the word is a compound of two words and the hyphen is used to combine them. Here is an example:

[Figure: hyphenation.jpg]

Note the parts highlighted in blue. The word high-quality is a compound word, so the hyphen between high and quality needs to be retained after merging the parts. In contrast, the word benchmark is not a compound word, so the hyphen needs to be removed when the parts are merged.

Goal: Merge hyphenated words using machine learning techniques, taking into account that a word can be a compound word, in which case the hyphen between the two parts of the word needs to be retained when the parts are merged.

Challenge 1: Deciding whether the hyphen between the parts of a hyphenated word needs to be retained (because the word is a compound word, see above).

Subgoal 1: Search the literature for related work on this problem.

Subgoal 2: Design a (simple) baseline algorithm (e.g., using a rule-based approach, but not machine learning). Design a small benchmark and evaluate your baseline algorithm on it. The motivation behind this step is to give you a feeling for how hard the problem is and where the actual difficulties lie.
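
As a rough illustration of what such a rule-based baseline could look like, here is a minimal sketch in Python. It is only one possible starting point, not a prescribed solution. The function name `merge_hyphenated` and the small `vocabulary` set are assumptions made for this example; in practice the vocabulary would come from a word list or corpus of your choice.

```python
def merge_hyphenated(prefix, suffix, vocabulary):
    """Merge a word that was split at a line end, e.g. ("bench-", "mark").

    `vocabulary` is a set of known lowercase words (hypothetical input).
    Returns the merged word, with or without the hyphen.
    """
    head = prefix.rstrip("-")
    joined = head + suffix            # e.g. "benchmark"
    compound = head + "-" + suffix    # e.g. "high-quality"

    # Rule 1: the hyphenated form is itself a known word.
    if compound.lower() in vocabulary:
        return compound
    # Rule 2: the form without the hyphen is a known word.
    if joined.lower() in vocabulary:
        return joined
    # Rule 3: both parts are known words on their own, so a compound is likely.
    if head.lower() in vocabulary and suffix.lower() in vocabulary:
        return compound
    # Fallback: assume an ordinary line-break hyphen and drop it.
    return joined


# Toy vocabulary, for illustration only.
vocabulary = {"benchmark", "bench", "mark", "high", "quality"}
print(merge_hyphenated("bench-", "mark", vocabulary))    # benchmark
print(merge_hyphenated("high-", "quality", vocabulary))  # high-quality
```

The order of the rules matters: checking the joined form before the two separate parts keeps "benchmark" from being treated as a compound of "bench" and "mark". Cases where these rules fail are exactly the ones worth collecting in the benchmark.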

Subgoal 3: Design, implement, and train a machine learning model for merging hyphenated words. The model can be based on a neural network or any other well-performing machine learning technique. This includes creating a large and appropriate training dataset for the model.
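
One conceivable way to obtain a large amount of training data is to generate it synthetically from plain text: words that already contain a hyphen are split at that hyphen (label: retain), and all other words are split at an artificial position (label: remove). The sketch below illustrates this idea under simplifying assumptions. The function name `make_examples` is hypothetical, and the split position is chosen at random, whereas real line-break hyphenation follows syllable boundaries.

```python
import random


def make_examples(words, rng):
    """Create (prefix, suffix, retain_hyphen) training triples from whole words."""
    examples = []
    for word in words:
        if "-" in word:
            # Compound word: split at its own hyphen, which must be retained.
            head, tail = word.split("-", 1)
            if head and tail:
                examples.append((head + "-", tail, True))
        elif len(word) >= 4:
            # Ordinary word: simulate a line-break hyphen at a random inner
            # position (assumption: no syllable-aware splitting).
            cut = rng.randint(2, len(word) - 2)
            examples.append((word[:cut] + "-", word[cut:], False))
    return examples


rng = random.Random(0)  # fixed seed for reproducible data
for example in make_examples(["high-quality", "benchmark", "extraction"], rng):
    print(example)
```

Whether such synthetic examples are representative of real documents is something the evaluation should check.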

Subgoal 4: Thoroughly evaluate your approach, including a comparison to your baseline approach.
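
A simple way to compare approaches is to run each of them on the same benchmark and report the fraction of correctly merged words. The following sketch assumes a benchmark given as (prefix, suffix, expected word) triples. The names `evaluate` and `always_join` are made up for this example, and accuracy is only one of several metrics that could be reported.

```python
def evaluate(merge_fn, benchmark):
    """Return the fraction of benchmark entries that merge_fn merges correctly.

    `benchmark` is a list of (prefix, suffix, expected_word) triples.
    """
    if not benchmark:
        return 0.0
    correct = sum(
        1 for prefix, suffix, expected in benchmark
        if merge_fn(prefix, suffix) == expected
    )
    return correct / len(benchmark)


def always_join(prefix, suffix):
    """Trivial reference method: always drop the hyphen."""
    return prefix.rstrip("-") + suffix


# Tiny benchmark built from the example above, for illustration only.
benchmark = [
    ("bench-", "mark", "benchmark"),
    ("high-", "quality", "high-quality"),
]
print(evaluate(always_join, benchmark))  # 0.5
```

Since most line-break hyphens are not part of a compound, it can also be informative to report results separately for the two cases (hyphen retained, hyphen removed).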

Supervision by Claudius Korzen.

-  ⇤ ← Revision 3 as of 2019-05-05 22:23:42 → 
  Size: 1515
  Editor: adpult
  Comment:
+   ← Revision 12 as of 2019-05-05 22:49:35 → ⇥
  Size: 2470
  Editor: adpult
  Comment:
-Deletions are marked like this.
+Additions are marked like this.
 Line 5:
-'''Background info''': Words in text documents can be hyphenated (especially frequent in layouts with multiple columns), in which case they appear broken in two parts at different positions in the text and need to be merged to single words on extracting text from the documents (e.g., for indexing and searching the documents). The hyphen between two parts of a word can be mandatory even after merging when the word is a compound of two words and the hyphen is used to combine them. Here is an example:
+'''Background info''': Words in text documents can be hyphenated (especially often in layouts with multiple columns). Such a word appears broken into two parts at different positions in the text, and the parts need to be merged into a single word when text is extracted from the document (e.g., for indexing and searching the documents). The hyphen between the two parts can be mandatory even after merging, namely when the word is a compound of two words and the hyphen is used to combine them. Here is an example:
<<BR>><<BR>>
{{attachment:hyphenation.jpg||width=500,align=bottom}}
<<BR>><<BR>>
Note the parts highlighted in blue. The word ''high-quality'' is a compound word, so the hyphen between ''high'' and ''quality'' needs to be retained after merging the parts. In contrast, the word ''benchmark'' is not a compound word, so the hyphen needs to be removed when the parts are merged.<<BR>><<BR>>
-Line 7:
+Line 11:
-TODO: Example.
+'''Goal''': Merge hyphenated words using machine learning techniques, taking into account that a word can be a compound word, in which case the hyphen between the two parts of the word needs to be retained when the parts are merged.
-Line 9:
+Line 13:
-Note the parts highlighted in blue. The word ''high-quality'' is a compound word and thus, the hyphen between ''high'' and ''quality'' needs to be retained after merging the parts. In contrast, the word ''benchmark'' is not a compound word and thus, the hyphen needs to be removed.
+''Challenge 1'': Deciding whether the hyphen between the parts of a hyphenated word needs to be retained (because the word is a compound word, see above).
-Line 11:
+Line 15:
-'''Goal''': Merging hyphenated words by using machine learning techniques; taking into account that a word can  be a compound word, in which case the hyphen between the two parts of the word needs to be retained on merging the parts.
+'''Subgoal 1''': Search the literature for related work on this problem.
-Line 13:
+Line 17:
-''Challenge 1'': TODO
+'''Subgoal 2''': Design a (simple) baseline algorithm (e.g., using a rule-based approach, but not machine learning). Design a small benchmark and evaluate your baseline algorithm on it. The motivation behind this step is to give you a feeling for how hard the problem is and where the actual difficulties lie.
-Line 15:
+Line 19:
-''Challenge 2'': TODO
+'''Subgoal 3''': Design, implement, and train a machine learning model for merging hyphenated words. The model can be based on a neural network or any other well-performing machine learning technique. This includes creating a large and appropriate training dataset for the model.
-Line 17:
+Line 21:
-'''Subgoal 1''': TODO

'''Subgoal 2''': TODO
+'''Subgoal 4''': Thoroughly evaluate your approach, including a comparison to your baseline approach.