Accurate Translation of Special Characters into Text
Type: An interesting and practical bachelor thesis. A basic understanding of Machine Learning is required; knowledge of Deep Learning is desirable. The preferred programming language is Python.
Background info: Text documents (e.g., books, newspapers or scientific publications) can contain ligatures (like fi or ffi), which are a single character in the document but actually represent multiple characters. Text documents can also contain characters with diacritics (like à or é), which are often stored as two characters in the document (particularly in PDF documents) but actually represent a single character. Here are two examples:
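[Figure, left: a word containing the ligature ffi. Figure, right: words containing the accented characters é, û and è.]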
Translating these characters correctly is crucial for applications like search, because words that have not been identified correctly will simply not be found. Consider the left figure. Note the single rectangle around the ligature ffi, meaning that it is a single character. If it is not translated correctly, the word might be extracted as ecient (without the ligature) or e?cient (with a placeholder ? instead of the ligature). In the right figure, note the two rectangles around each of the characters é, û and è, meaning that each of these characters is represented by two characters (the base character and the diacritic). If they are not translated correctly, the words might be extracted as cr´eme br^ul`ee.
Goal: Translating ligatures and characters with diacritics into the characters they actually represent using machine learning techniques.
Challenge 1: Ligatures can be represented by single characters, which need to be split, and characters with diacritics can be represented by two characters, which need to be merged (see above).
Challenge 2: Ligatures and characters with diacritics can be drawn into the document (particularly in PDF documents), in which case they are graphical rather than textual in nature. This requires translating the characters based on, e.g., their shapes and their neighboring characters.
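For the purely textual case of Challenge 1, the splitting and merging can be illustrated with a small rule-based baseline. This is only a sketch and not the machine learning approach the thesis asks for: the function names `split_ligatures` and `merge_diacritics` and the accent table are assumptions made for this example, and the code assumes that the accent is extracted directly before its base letter, as in cr´eme above.

```python
import unicodedata

# Spacing accents as they may appear in extracted text, mapped to the
# matching Unicode combining marks (illustrative, not exhaustive).
SPACING_TO_COMBINING = {
    "\u00b4": "\u0301",  # acute accent
    "`": "\u0300",       # grave accent
    "^": "\u0302",       # circumflex accent
    "\u02c6": "\u0302",  # modifier letter circumflex accent
}


def split_ligatures(text):
    """Expand the Latin ligatures U+FB00..U+FB06 into their letters."""
    out = []
    for ch in text:
        if "\ufb00" <= ch <= "\ufb06":
            # NFKC applies the compatibility decomposition of the ligature.
            out.append(unicodedata.normalize("NFKC", ch))
        else:
            out.append(ch)
    return "".join(out)


def merge_diacritics(text):
    """Merge a spacing accent with the letter that follows it."""
    out = []
    i = 0
    while i < len(text):
        mark = SPACING_TO_COMBINING.get(text[i])
        if mark and i + 1 < len(text) and text[i + 1].isalpha():
            # NFC composes base letter + combining mark where possible.
            out.append(unicodedata.normalize("NFC", text[i + 1] + mark))
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


print(split_ligatures("e\ufb03cient"))
print(merge_diacritics("cr\u00b4eme br^ul`ee"))
```

Such rules break down quickly: a ^ in a formula like x^y would be merged by mistake, and characters that are drawn as graphics (Challenge 2) never reach this code at all. This is why the translation should take shapes and neighboring characters into account.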
Subgoal 1: Design, implement and train a convolutional (?) neural network for a fast (!) translation of ligatures and characters with diacritics, based on, e.g., their shapes and their neighboring characters.
Subgoal 2: A thorough evaluation of your approach.
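One possible starting point for the evaluation is to compare the extracted words against a ground truth, both at word level and at character level. The following is a minimal sketch under that assumption; the metric choice and the names `edit_distance`, `word_accuracy` and `character_error_rate` are illustrative and not prescribed by the topic.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute or match
            ))
        prev = cur
    return prev[-1]


def word_accuracy(predicted, expected):
    """Fraction of words that were extracted exactly right."""
    if len(predicted) != len(expected) or not expected:
        raise ValueError("need two non-empty lists of equal length")
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)


def character_error_rate(predicted, expected):
    """Total edit distance divided by the number of expected characters."""
    if len(predicted) != len(expected) or not expected:
        raise ValueError("need two non-empty lists of equal length")
    errors = sum(edit_distance(p, e) for p, e in zip(predicted, expected))
    return errors / sum(len(e) for e in expected)


predicted = ["efficient", "e?cient", "ecient"]
expected = ["efficient", "efficient", "efficient"]
print(word_accuracy(predicted, expected))
print(character_error_rate(predicted, expected))
```

Word accuracy reflects the search use case (a word is either found or not), while the character error rate shows how far off the wrong extractions are. Since the network is supposed to be fast, measuring the runtime per document would be a natural third criterion.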