Accurate Translation of Special Characters into Text
Type: An interesting and practical bachelor thesis. A basic understanding of Machine Learning is required; knowledge of Deep Learning is desirable. The preferred programming language is Python.
Background info: Text documents (e.g., books, newspapers or scientific publications) can contain ligatures (like fi or ffi), which are a single character in the document but actually represent multiple characters. Text documents can also contain characters with diacritics (like à or é), which are often stored as two characters (in particular in PDF documents) but actually represent a single character. Translating those characters correctly is crucial for applications like search, because words that have not been extracted correctly will simply not be found. Here are two examples:
[Figure: ligature.jpg] and [Figure: diacritic.jpg]
In the left figure, note the single rectangle around the ligature ffi, meaning that it is a single character. If it is not translated correctly, the word might be extracted as ecient (without the ligature) or e?cient (with a placeholder ? instead of the ligature). In the right figure, note the two rectangles for each of the characters é, û and è, meaning that each of them is represented by two characters (the base character and the diacritic). If they are not translated correctly, the words might be extracted as cr´eme br^ul`ee.
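The effect on search can be reproduced with plain Python strings. The following is a minimal sketch, assuming the extracted text contains the Unicode ligature U+FB03 and a combining grave accent U+0300; the variable names are illustrative only.

```python
# "efficient" where "ffi" is stored as the single code point U+FB03
with_ligature = "e\ufb03cient"

# "crème" where the grave accent is stored as a separate combining mark U+0300
decomposed = "cre\u0300me"

# The same word with the precomposed character U+00E8
precomposed = "cr\u00e8me"

# The ligature counts as one character, so the word has 7 code points, not 9
print(len(with_ligature))

# A plain substring search for the expected word fails
print("efficient" in with_ligature)

# The two spellings of "crème" look identical but compare as different strings
print(len(decomposed), len(precomposed), decomposed == precomposed)
```

Both strings render as the intended word, yet a search index built from them would not match the queries "efficient" or "crème".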
Goal: Translating ligatures and characters with diacritics into the characters they actually represent using machine learning techniques.
Challenge 1: Ligatures can be represented by single characters, which need to be split, and characters with diacritics can be represented by two characters, which need to be merged (see above).
Challenge 2: Ligatures and characters with diacritics can be drawn into the document (in particular in PDF documents), in which case they are graphical rather than textual in nature. This requires translating the characters based on, e.g., their shapes and their neighboring characters.
Subgoal 1: Design, implement and train a (possibly convolutional) neural network for a fast (!) translation of ligatures and characters with diacritics, based on, e.g., their shapes and their neighboring characters.
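For the purely textual case, splitting and merging can be sketched with Python's standard unicodedata module. This is a rule-based baseline under stated assumptions, not the machine learning approach the thesis asks for: the function names and the accent table are hypothetical, and the code assumes that the diacritic is extracted as a spacing accent placed before its base letter, as in the example above.

```python
import unicodedata

# Assumed mapping from spacing accents, as they may appear in extracted text,
# to Unicode combining marks. Illustrative only, not exhaustive.
SPACING_TO_COMBINING = {
    "\u00b4": "\u0301",  # acute accent
    "`": "\u0300",       # grave accent
    "^": "\u0302",       # circumflex
    "\u02c6": "\u0302",  # modifier letter circumflex
    "\u00a8": "\u0308",  # diaeresis
}


def split_ligatures(text):
    """Expand the Latin presentation-form ligatures U+FB00..U+FB06."""
    out = []
    for ch in text:
        if "\ufb00" <= ch <= "\ufb06":
            # NFKC replaces a ligature by the letters it represents
            out.append(unicodedata.normalize("NFKC", ch))
        else:
            out.append(ch)
    return "".join(out)


def merge_diacritics(text):
    """Merge a spacing accent with the letter that follows it."""
    out = []
    i = 0
    while i < len(text):
        mark = SPACING_TO_COMBINING.get(text[i])
        if mark and i + 1 < len(text) and text[i + 1].isalpha():
            # NFC composes base letter + combining mark into one character
            out.append(unicodedata.normalize("NFC", text[i + 1] + mark))
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


print(split_ligatures("e\ufb03cient"))      # efficient
print(merge_diacritics("cr`eme br^ul\u00b4ee"))  # crème brûlée
```

Such fixed rules misfire whenever a character like ^ or ` is meant literally, and they do not apply at all when the characters are drawn rather than stored as text, which is where a learned model is expected to help.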
Subgoal 2: A thorough evaluation of your approach.
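One possible starting point for the evaluation is to compare extracted words against a ground truth. The sketch below assumes two metrics that the topic description does not prescribe: word-level accuracy and a character error rate based on edit distance. The function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def evaluate(predicted, expected):
    """Return word accuracy and character error rate for aligned word lists."""
    if len(predicted) != len(expected) or not expected:
        raise ValueError("word lists must be non-empty and of equal length")
    exact = sum(p == e for p, e in zip(predicted, expected))
    distance = sum(edit_distance(p, e) for p, e in zip(predicted, expected))
    chars = sum(len(e) for e in expected)
    return {
        "word_accuracy": exact / len(expected),
        "char_error_rate": distance / chars,
    }


print(evaluate(["efficient", "ecient"], ["efficient", "efficient"]))
```

Word accuracy reflects the search use case (a word is either found or not), while the character error rate shows how far off the wrong extractions are.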