Revision 25 as of 2020-08-24 09:54:50
= Accurate Translation of Special Characters into Text =
Type: An interesting and practical bachelor thesis. A basic understanding of machine learning is required; knowledge of deep learning is desirable. The preferred programming language is Python.
Background info: Text documents (e.g., books, newspapers or scientific publications) can contain ligatures (like fi or ffi) which are one character in the document, but actually represent multiple characters. Further, text documents can also contain characters with diacritics (like à or é) which are often two characters in the document (in particular in PDF documents) but actually represent a single character. Translating those characters correctly is crucial for applications like search, because words that have not been identified correctly will simply not be found. Here are two examples:
[Figure: ligature.jpg] and [Figure: diacritic.jpg]
In the left figure, note the single rectangle around the ligature ffi, meaning that it is a single character. If not translated correctly, the word "efficient" might be extracted as "ecient" (without the ligature) or "e?cient" (with a placeholder "?" instead of the ligature). In the right figure, note the two rectangles around each of the characters é, û and è, meaning that each of these characters is represented by two characters (the base character and the diacritic). If not translated correctly, the words might be extracted as "cr´eme br^ul`ee".
Goal: Translate ligatures and characters with diacritics into the characters they actually represent using machine learning techniques.
Challenge 1: Ligatures can be represented by single characters that need to be split, and characters with diacritics can be represented by two characters that need to be merged (see above).
Challenge 2: Ligatures and characters with diacritics can be drawn into the document (in particular in PDF documents), in which case they are graphical rather than textual in nature. This requires translating the characters based on, e.g., an analysis of their shapes.
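To make Challenge 1 concrete, here is a minimal sketch using Python's standard `unicodedata` module. It assumes the easy case in which the document already contains proper Unicode code points (a ligature code point such as U+FB03, or a base character followed by a combining mark). The function names `split_ligatures` and `merge_diacritics` are illustrative and not part of the posting. Characters that are drawn into the document (Challenge 2) are not covered by this sketch.

```python
import unicodedata


def split_ligatures(text: str) -> str:
    """Expand Latin ligature code points (U+FB00..U+FB06) into their letters."""
    out = []
    for ch in text:
        if "\ufb00" <= ch <= "\ufb06":
            # NFKC applies the compatibility decomposition, e.g. U+FB03 -> "ffi".
            out.append(unicodedata.normalize("NFKC", ch))
        else:
            out.append(ch)
    return "".join(out)


def merge_diacritics(text: str) -> str:
    """Compose a base character and a following combining mark (NFC)."""
    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    print(split_ligatures("e\ufb03cient"))    # one ligature code point in the input
    print(merge_diacritics("cre\u0300me"))    # "e" followed by a combining grave accent
```

Restricting the expansion to the ligature range is a deliberate choice in this sketch: applying NFKC to the whole text would also rewrite unrelated characters such as superscripts.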
Subgoal 1: Search the literature for work related to this problem.
Subgoal 2: Design a (simple) baseline algorithm (e.g., using a rule-based approach, but not machine learning). Design a small benchmark and evaluate your baseline algorithm on this benchmark. The motivation behind this step is to give you a feeling for how hard the problem is and where the actual problems lie.
Subgoal 3: Design, implement and train a convolutional (?) neural network for a fast (!) translation of ligatures and characters with diacritics, based on, e.g., their shapes and their neighboring characters. This includes creating the large and appropriate training data set needed to train the network.
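As a rough illustration of what such a baseline and benchmark could look like, the sketch below applies two hand-written rules: it expands ligature code points, and it merges a spacing accent with the letter that follows it (as in "cr´eme"). The accent table, the function names and the three benchmark entries are assumptions made for this example and are not prescribed by the posting.

```python
import unicodedata

# Assumed mapping from spacing accents (as they may appear in extracted text)
# to the corresponding Unicode combining marks.
SPACING_TO_COMBINING = {
    "\u00b4": "\u0301",  # acute accent
    "`": "\u0300",       # grave accent
    "^": "\u0302",       # circumflex
    "\u00a8": "\u0308",  # diaeresis
}


def baseline_translate(text: str) -> str:
    """Rule-based translation: expand ligatures, merge accent + following letter."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch in SPACING_TO_COMBINING and nxt.isalpha():
            # Rule 1: accent followed by a letter -> one composed character.
            out.append(unicodedata.normalize("NFC", nxt + SPACING_TO_COMBINING[ch]))
            i += 2
        elif "\ufb00" <= ch <= "\ufb06":
            # Rule 2: ligature code point -> its letters.
            out.append(unicodedata.normalize("NFKC", ch))
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def accuracy(translate, benchmark) -> float:
    """Fraction of benchmark entries that are translated exactly."""
    correct = sum(translate(source) == expected for source, expected in benchmark)
    return correct / len(benchmark)


# Tiny hand-made benchmark of (extracted text, expected text) pairs.
BENCHMARK = [
    ("e\ufb03cient", "efficient"),
    ("cr\u00b4eme", "cr\u00e9me"),
    ("e?cient", "efficient"),  # placeholder case, not solvable by these rules
]

if __name__ == "__main__":
    print(accuracy(baseline_translate, BENCHMARK))
```

The third benchmark entry is included on purpose: the rules cannot recover a ligature that was replaced by a placeholder, which is the kind of case that shows where the actual problems lie.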
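One conceivable way to obtain text-level training pairs is to start from clean words and corrupt them synthetically, imitating the extraction errors described above. The sketch below does this for dropped or replaced ligatures and for diacritics split off from their base character. All names and the choice of corruptions are assumptions for illustration. A shape-based network would additionally need rendered glyph images, which this sketch does not produce.

```python
import unicodedata

# Letter sequences commonly typeset as ligatures, longest first (assumed list).
LIGATURES = ["ffi", "ffl", "ff", "fi", "fl"]

# Assumed mapping from combining marks to the spacing accents seen in bad extractions.
COMBINING_TO_SPACING = {"\u0301": "\u00b4", "\u0300": "`", "\u0302": "^"}


def corrupt_ligatures(word: str):
    """Yield variants where the first matching ligature is dropped or replaced by '?'."""
    for lig in LIGATURES:
        pos = word.find(lig)
        if pos != -1:
            rest = word[pos + len(lig):]
            yield word[:pos] + rest
            yield word[:pos] + "?" + rest
            return


def corrupt_diacritics(word: str):
    """Return the word with accents placed before their base letter, or None."""
    out = []
    changed = False
    for ch in unicodedata.normalize("NFD", word):
        if ch in COMBINING_TO_SPACING and out:
            base = out.pop()
            out.append(COMBINING_TO_SPACING[ch] + base)
            changed = True
        else:
            out.append(ch)
    # Recompose any marks that were not mapped, so only the intended change remains.
    return unicodedata.normalize("NFC", "".join(out)) if changed else None


def make_training_pairs(words):
    """Build (corrupted, clean) pairs from a list of clean words."""
    pairs = []
    for word in words:
        for corrupted in corrupt_ligatures(word):
            pairs.append((corrupted, word))
        corrupted = corrupt_diacritics(word)
        if corrupted is not None:
            pairs.append((corrupted, word))
    return pairs


if __name__ == "__main__":
    for pair in make_training_pairs(["efficient", "cr\u00e8me", "text"]):
        print(pair)
```

Because the corruptions are generated from clean text, the amount of training data is limited only by the size of the word list, although whether synthetic errors match those of real documents would have to be checked.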
Subgoal 4: A thorough evaluation of your learning-based approach, including a comparison to your baseline approach.
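For the comparison in Subgoal 4, two simple measures that could be reported for every approach are the fraction of exactly translated words and a character error rate based on edit distance. The sketch below is one possible way to compute them. The metric choice, the function names and the `identity` stand-in system are assumptions for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def evaluate(translate, benchmark) -> dict:
    """Compute word accuracy and character error rate of a translation function."""
    exact = 0
    errors = 0
    chars = 0
    for source, expected in benchmark:
        predicted = translate(source)
        exact += predicted == expected
        errors += edit_distance(predicted, expected)
        chars += len(expected)
    return {
        "word_accuracy": exact / len(benchmark),
        "char_error_rate": errors / chars,
    }


def identity(text: str) -> str:
    """Stand-in system that returns the extracted text unchanged."""
    return text


if __name__ == "__main__":
    benchmark = [("ecient", "efficient"), ("text", "text")]
    print(evaluate(identity, benchmark))
```

Passing the translation function as a parameter allows the same benchmark to be run against the baseline and the learning-based approach, so the resulting numbers are directly comparable.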
Supervision by Claudius Korzen (http://ad.informatik.uni-freiburg.de/staff/korzen)