Differences between revisions 2 and 19 (spanning 17 versions)

Accurate Extraction of Special Characters

Type: An interesting and practical bachelor thesis. A basic understanding of Machine Learning is required; knowledge of Deep Learning is desirable. The preferred programming language is Python.

Background info: Text documents (e.g., books, newspapers or scientific publications) can contain ligatures (like fi or ffi), which are a single character in the document but actually represent multiple characters. Further, text documents can also contain characters with diacritics (like à or é), which are often stored as two characters in the document (particularly often in PDF documents) but actually represent a single character. Translating those characters correctly is crucial for applications like search, because words that have not been identified correctly will simply not be found. Here are two examples:

[Figure: ligature.jpg] and [Figure: diacritic.jpg]

In the left figure, note the single rectangle around the ligature ffi, meaning that it is a single character. If not translated correctly, the word efficient might be extracted as ecient (without the ligature) or e?cient (with a placeholder ? instead of the ligature). In the right figure, note the two rectangles around each of the characters é, û and è, meaning that each of these characters is represented by two characters (the base character and the diacritic). If not translated correctly, the words might be extracted as cr´eme br^ul`ee.

Goal: Translating ligatures and characters with diacritics into the characters they actually represent using machine learning techniques.

Challenge 1: Ligatures can be represented by single characters, which need to be split, and characters with diacritics can be represented by two characters, which need to be merged (see above).
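
For characters that are already stored as proper Unicode code points, the splitting and merging described in Challenge 1 can be illustrated with Python's standard unicodedata module. This is only a minimal sketch. It assumes that the ligature is stored as a dedicated ligature code point and that the diacritic is stored as a combining mark after its base letter; text extracted from real PDF documents often violates both assumptions.

```python
import unicodedata

# Assumption: the ligature is stored as the single code point U+FB03 (ffi).
with_ligature = "e\ufb03cient"
# NFKC applies the compatibility decomposition, which splits the ligature.
split = unicodedata.normalize("NFKC", with_ligature)

# Assumption: the diacritic is a combining mark that follows the base letter.
decomposed = "cre\u0301me"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT
# NFC composes the base letter and the combining mark into one code point.
merged = unicodedata.normalize("NFC", decomposed)

print(len(with_ligature), "->", len(split))   # number of characters before/after
print(len(decomposed), "->", len(merged))
```

Note that NFKC also rewrites many other compatibility characters, so applying it to a whole document may change more than just the ligatures.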

Challenge 2: Ligatures and characters with diacritics can be drawn into the document (particularly often in PDF documents), in which case they are of a graphical rather than a textual nature. This requires translating the characters based on, e.g., an analysis of their shapes.

Subgoal 1: Search the literature for related work to this problem.

Subgoal 2: Design a (simple) baseline algorithm (e.g., using a rule-based approach, but not machine learning). Design a small benchmark and evaluate your baseline algorithm on this benchmark. The motivation behind this step is to give you a feeling for how hard the problem is and where the actual problems lie.
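
A rule-based baseline and a benchmark in the sense of Subgoal 2 could look roughly like the following sketch. All names (baseline_translate, accuracy, BENCHMARK) are hypothetical, the benchmark entries are made up for illustration, and the rule that a spacing accent precedes its base letter is an assumption taken from the cr´eme br^ul`ee example above; it will not hold for every document.

```python
import unicodedata

# Explicit table for the common Latin ligature code points.
LIGATURES = {
    "\ufb00": "ff", "\ufb01": "fi", "\ufb02": "fl",
    "\ufb03": "ffi", "\ufb04": "ffl",
}

# Assumed mapping from spacing accents (as they may appear in extracted
# text) to the corresponding Unicode combining marks.
SPACING_TO_COMBINING = {
    "\u00b4": "\u0301",  # acute
    "`": "\u0300",       # grave
    "^": "\u0302",       # circumflex
}


def baseline_translate(text: str) -> str:
    """Hypothetical rule-based baseline; a sketch, not a definitive solution."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch in LIGATURES:
            out.append(LIGATURES[ch])
        elif ch in SPACING_TO_COMBINING and nxt.isalpha():
            # Assumption: the accent precedes its base letter.
            out.append(nxt + SPACING_TO_COMBINING[ch])
            i += 1  # skip the base letter as well
        else:
            out.append(ch)
        i += 1
    # Compose base letters and combining marks into single code points.
    return unicodedata.normalize("NFC", "".join(out))


def accuracy(benchmark) -> float:
    """Fraction of (extracted, expected) pairs that are translated exactly."""
    hits = sum(baseline_translate(src) == tgt for src, tgt in benchmark)
    return hits / len(benchmark)


# Tiny made-up benchmark; the last entry cannot be solved by these rules.
BENCHMARK = [
    ("e\ufb03cient", "efficient"),
    ("cr\u00b4eme br^ul`ee", "cr\u00e9me br\u00fbl\u00e8e"),
    ("e?cient", "efficient"),
]
print(accuracy(BENCHMARK))
```

The rules fail as soon as the ligature has been replaced by a placeholder, and they would wrongly merge an accent-like character that happens to precede a letter (as in a^b). Failure cases of this kind are what the benchmark is meant to expose.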

Subgoal 3: Design, implement and train a convolutional (?) neural network for a fast ( ! ) translation of ligatures and characters with diacritics, based on, e.g., their shapes and their neighboring characters. This includes creating the large and appropriate training data needed to train the network.
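
One conceivable way to obtain training data for Subgoal 3 is to start from clean words and to simulate the extraction errors synthetically. The sketch below does this for the textual side only; the function names corrupt and make_pairs are hypothetical, and the placement of the spacing accent before its base letter is again an assumption. A shape-based model would additionally need rendered glyph images, which are not covered here.

```python
import unicodedata

# Longest ligatures first, so that "ffi" is not consumed as "ff" + "i".
LIGATURES = [("ffi", "\ufb03"), ("ffl", "\ufb04"), ("ff", "\ufb00"),
             ("fi", "\ufb01"), ("fl", "\ufb02")]

# Assumed mapping from combining marks to spacing accents.
COMBINING_TO_SPACING = {"\u0301": "\u00b4", "\u0300": "`", "\u0302": "^"}


def corrupt(clean: str) -> str:
    """Simulate how a clean word might look after text extraction."""
    # Decompose accented letters into base letter + combining mark.
    decomposed = unicodedata.normalize("NFD", clean)
    out = []
    for ch in decomposed:
        if ch in COMBINING_TO_SPACING and out:
            # Assumption: the spacing accent is emitted before its base letter.
            out.insert(len(out) - 1, COMBINING_TO_SPACING[ch])
        else:
            out.append(ch)
    noisy = "".join(out)
    # Merge letter sequences into single ligature code points.
    for letters, ligature in LIGATURES:
        noisy = noisy.replace(letters, ligature)
    return noisy


def make_pairs(words):
    """Build (input, target) pairs for supervised training."""
    return [(corrupt(word), word) for word in words]


pairs = make_pairs(["efficient", "cr\u00e8me", "br\u00fbl\u00e9e"])
for noisy, clean in pairs:
    print(repr(noisy), "->", repr(clean))
```

Applied to a large word list or text corpus, the same idea yields many pairs cheaply, but the network then only learns the error patterns that the generator simulates.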

Subgoal 4: A thorough evaluation of your learning-based approach, including a comparison to your baseline approach.
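
For the comparison, one possible metric (an assumption, the topic description does not prescribe one) is the character error rate: the edit distance between the translated and the expected text, divided by the length of the expected text. A minimal sketch with made-up placeholder outputs:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]


def char_error_rate(predictions, references) -> float:
    """Total edit distance divided by the total reference length."""
    errors = sum(edit_distance(p, r) for p, r in zip(predictions, references))
    total = sum(len(r) for r in references)
    return errors / total


# Placeholder strings for illustration only, not measured results.
references = ["efficient", "cr\u00e8me"]
output_a = ["ecient", "cr`eme"]
output_b = ["efficient", "creme"]
print(char_error_rate(output_a, references))
print(char_error_rate(output_b, references))
```

Unlike exact-match accuracy, this metric gives partial credit to a translation that gets most characters of a word right.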

Supervision by Claudius Korzen

- Revision 2 as of 2019-05-05 20:45:31
  Size: 3053
  Editor: adpult
  Comment:
+ Revision 19 as of 2019-05-05 22:53:14
  Size: 3117
  Editor: adpult
  Comment:
-Deletions are marked like this.
+Additions are marked like this.
 Line 1:
-= Accurate Translation of Special Characters into Text =
+= Accurate Extraction of Special Characters =
 Line 6:
-Further, text documents can also contain characters with diacritics (like à or é) which are often two characters in the document (in particular often in PDF documents) but actually represent a single character. Here are two examples:
+Further, text documents can also contain characters with diacritics (like à or é) which are often two characters in the document (in particular often in PDF documents) but actually represent a single character. 
Translating those characters correctly is crucial for applications like search, because words that have not been identified correctly will simply not be found.
Here are two examples:
-Line 8:
+Line 10:
-TODO: Example for ligature and diacritics.
+{{attachment:ligature.jpg||height=30,align=bottom}} and
{{attachment:diacritic.jpg||height=30,align=bottom}}
-Line 10:
+Line 13:
-Translating those characters correctly is crucial for applications like search, because words that has not been identified correctly will simply not be found.
Consider the left figure. Note the single rectangle around the ligature ffi, meaning that it is a single character.
+In the left figure, note the single rectangle around the ligature ffi, meaning that it is a single character.
-Line 13:
+Line 15:
-In the right figure, note the two rectangles of the characters ́é, û and è, meaning that each of these characters is represented by two characters (the base character and the diacritic).
+In the right figure, note the two rectangles of the characters é, û and è, meaning that each of these characters is represented by two characters (the base character and the diacritic).
-Line 20:
+Line 22:
-''Challenge 2'': Ligatures and characters with diacritics can be ''drawn'' into the document (in particular often in case of PDF documents), in which case they are more of a graphic nature than a textual nature. This requires to translate the characters based on e.g., their shapes and the neighboring characters.
+''Challenge 2'': Ligatures and characters with diacritics can be ''drawn'' into the document (particularly often in PDF documents), in which case they are of a graphical rather than a textual nature. This requires translating the characters based on, e.g., an analysis of their shapes.
-Line 22:
+Line 24:
-'''Subgoal 1''': Design, implement and train a (supervised) machine learning model for a fast ( ! ) identification of columns. This implies to create large and diverse training data.
+'''Subgoal 1''': Search the literature for related work to this problem.
-Line 24:
+Line 26:
-'''Subgoal 2''': Design, implement and train a language model in order to identify the words contained in each column, based on the characters themselves and further features like the spacings between characters or their fonts. This implies again to create large and diverse training data.
+'''Subgoal 2''': Design a (simple) baseline algorithm (e.g., using a rule-based approach, but not machine learning). Design a small benchmark and evaluate your baseline algorithm on this benchmark. The motivation behind this step is to give you a feeling for how hard the problem is and where the actual problems lie.
-Line 26:
+Line 28:
-'''Subgoal 3''': A thorough evaluation of both approaches.
+'''Subgoal 3''': Design, implement and train a convolutional (?) neural network for a fast ( ! ) translation of ligatures and characters with diacritics, based on, e.g., their shapes and their neighboring characters.
This includes creating the large and appropriate training data needed to train the network.
-Line 28:
+Line 31:
-Translating those characters correctly is crucial if you want to extract text from the documents.
If not translated correctly, the word might be extracted as "ecient" (without the ligature) or "e?cient"
(with some placeholder “?”  instead of the ligature).
 are reot translated, the words might be extracted as e.g., “cr ́eme brˆul`ee”.
+'''Subgoal 4''': A thorough evaluation of your learning-based approach, including a comparison to your baseline approach.
<<BR>><<BR>>
Supervision by [[http://ad.informatik.uni-freiburg.de/staff/korzen|Claudius Korzen]]