Accurate Extraction of Special Characters
Type: An interesting and practical bachelor thesis. A basic understanding of Machine Learning is required; knowledge of Deep Learning is desirable. The preferred programming language is Python.
Background info: Text documents (e.g., books, newspapers or scientific publications) can contain ligatures (like fi or ffi), which are one character in the document but actually represent multiple characters. Further, text documents can also contain characters with diacritics (like à or é), which are often two characters in the document (particularly in PDF documents) but actually represent a single character. Translating those characters correctly is crucial for applications like search, because words that have not been identified correctly will simply not be found. Here are two examples:
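[Figure, left: a word containing the ligature ffi] and [Figure, right: words containing the characters é, û and è]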
In the left figure, note the single rectangle around the ligature ffi, meaning that it is a single character. If not translated correctly, the word might be extracted as ecient (without the ligature) or e?cient (with a placeholder "?" instead of the ligature). In the right figure, note the two rectangles of the characters é, û and è, meaning that each of these characters is represented by two characters (the base character and the diacritic). If not translated correctly, the words might be extracted as cr´eme br^ul`ee.
Goal: Translate ligatures and characters with diacritics into the characters they actually represent, using machine learning techniques.
Challenge 1: Ligatures can be represented by single characters, which need to be split, and characters with diacritics can be represented by two characters, which need to be merged (see above).
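To make the split and merge operations concrete, here is a minimal rule-based sketch using only Python's standard `unicodedata` module. It rests on several assumptions that are not part of the project description: ligatures arrive as the Unicode code points U+FB00 to U+FB06, diacritics arrive as a spacing accent placed before the base letter (as in the cr´eme example above), and the function names and the accent table are hypothetical and far from exhaustive.

```python
import unicodedata

# Spacing accents as they may appear in extracted text, mapped to the
# Unicode combining marks they stand for (illustrative, not exhaustive).
SPACING_TO_COMBINING = {
    "\u00b4": "\u0301",  # acute accent      -> combining acute
    "`": "\u0300",       # grave accent      -> combining grave
    "^": "\u0302",       # circumflex accent -> combining circumflex
    "\u00a8": "\u0308",  # diaeresis         -> combining diaeresis
}


def split_ligatures(text: str) -> str:
    """Replace each ligature code point (U+FB00..U+FB06) by its letters."""
    return "".join(
        unicodedata.normalize("NFKD", ch) if "\ufb00" <= ch <= "\ufb06" else ch
        for ch in text
    )


def merge_diacritics(text: str) -> str:
    """Merge 'spacing accent + base letter' pairs into one character."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch in SPACING_TO_COMBINING and nxt.isalpha():
            # Compose base letter + combining mark into a precomposed
            # character where Unicode defines one.
            combined = nxt + SPACING_TO_COMBINING[ch]
            out.append(unicodedata.normalize("NFC", combined))
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


print(split_ligatures("e\ufb03cient"))         # efficient
print(merge_diacritics("cr\u00b4eme br^ul`ee"))
```

Rules like these break down quickly: a literal ^ or ` in front of a letter would be merged by mistake, and a ligature that was extracted as "?" or dropped entirely cannot be recovered by a lookup table at all.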
Challenge 2: Ligatures and characters with diacritics can be drawn into the document (particularly in PDF documents), in which case they are more graphical than textual in nature. This requires translating the characters based on, e.g., an analysis of their shapes.
Subgoal 1: Search the literature for related work to this problem.
Subgoal 2: Design a (simple) baseline algorithm (e.g., using a rule-based approach, but not machine learning). Design a small benchmark and evaluate your baseline algorithm on this benchmark. The motivation behind this step is to give you a feeling for how hard the problem is and where the actual problems lie.
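As a rough illustration of what such a benchmark and its evaluation might look like, the following sketch measures word-level accuracy of a translation function over pairs of extracted and expected text. The names `nfkc_baseline`, `word_accuracy` and `BENCHMARK` are hypothetical, the three benchmark entries are made up for illustration, and plain Unicode NFKC normalization merely stands in for a real baseline. Words are compared by position, which is a simplification.

```python
import unicodedata
from typing import Callable, Iterable, Tuple


def nfkc_baseline(text: str) -> str:
    """Stand-in baseline: Unicode compatibility normalization (NFKC)."""
    return unicodedata.normalize("NFKC", text)


def word_accuracy(
    translate: Callable[[str], str],
    benchmark: Iterable[Tuple[str, str]],
) -> float:
    """Fraction of expected words reproduced exactly, compared by position."""
    correct = 0
    total = 0
    for extracted, expected in benchmark:
        predicted_words = translate(extracted).split()
        expected_words = expected.split()
        total += len(expected_words)
        correct += sum(p == e for p, e in zip(predicted_words, expected_words))
    return correct / total if total else 0.0


# Tiny illustrative benchmark: (extracted text, expected text).
BENCHMARK = [
    ("e\ufb03cient \ufb01le", "efficient file"),  # ligatures as single code points
    ("cre\u0301me", "cr\u00e9me"),                # base letter + combining accent
    ("cr\u00b4eme", "cr\u00e9me"),                # spacing accent before the letter
]

print(f"word accuracy: {word_accuracy(nfkc_baseline, BENCHMARK):.2f}")
```

The third entry is deliberately one that normalization alone does not fix, which is the kind of case the benchmark is meant to surface.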
Subgoal 3: Design, implement and train a convolutional (?) neural network for a fast (!) translation of ligatures and characters with diacritics, based on, e.g., their shapes and their neighboring characters. This includes creating the large and appropriate training data set needed to train the network.
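One possible way to obtain training data for the textual case is to corrupt clean text synthetically, so that each corrupted string comes with its known correct translation. The sketch below is only an assumption about how this could be done, not a prescribed method: `corrupt`, `LIGATURES` and the `rate` parameter are hypothetical, and the approach covers neither drawn characters nor the network itself.

```python
import random
import unicodedata

# Letter sequences and their ligature code points, longest first so that
# "ffi" is preferred over "ff" and "fi".
LIGATURES = [
    ("ffi", "\ufb03"),
    ("ffl", "\ufb04"),
    ("ff", "\ufb00"),
    ("fi", "\ufb01"),
    ("fl", "\ufb02"),
]


def corrupt(clean: str, rng: random.Random, rate: float = 0.5) -> str:
    """Return a copy of `clean` with ligatures inserted and accented
    characters decomposed, each with probability `rate`."""
    out = []
    i = 0
    while i < len(clean):
        for letters, ligature in LIGATURES:
            if clean.startswith(letters, i) and rng.random() < rate:
                out.append(ligature)
                i += len(letters)
                break
        else:
            ch = clean[i]
            # NFD splits a precomposed character into base + combining mark.
            decomposed = unicodedata.normalize("NFD", ch)
            if len(decomposed) > 1 and rng.random() < rate:
                out.append(decomposed)
            else:
                out.append(ch)
            i += 1
    return "".join(out)


rng = random.Random(42)  # fixed seed for reproducible samples
clean_words = ["efficient", "file", "cr\u00e8me", "office"]
training_pairs = [(corrupt(word, rng), word) for word in clean_words]
```

Each pair holds the corrupted input first and the clean target second. Realistic training data would also need the failure modes seen in practice, such as placeholders or dropped characters.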
Subgoal 4: A thorough evaluation of your learning-based approach, including a comparison to your baseline approach.
Supervision by Claudius Korzen