= Dehyphenation of hyphenated words & Guessing Ligatures =

This topic consists of two subtasks, both related to high-quality text extraction from PDF files, which is one of our research topics. 
The main challenge in extracting text from PDF is that PDF is a layout-based format: it specifies the positions and fonts of the individual characters that make up the text, but usually does not provide any information about words, paragraphs and sections. 
In particular, we are interested in the accurate identification of words and paragraphs, which is crucial for search applications.

'''Type''': Master's thesis. You should have advanced programming skills (C++ or Java; Python is discouraged because of performance issues) and a basic understanding of how to write efficient code.<<BR>><<BR>>


=== Task 1: Dehyphenation of hyphenated words ===

'''Background info''': In PDF files, words are often hyphenated at line breaks. For illustration, in the snippet:

''"In this paper, we show how to construct a high-quality bench-<<BR>>
mark of principally arbitrary size from parallel TeX and PDF data."'' 

the word "benchmark" is hyphenated. Usually, a PDF file does not provide any information relating the two parts; it simply does not record that the syllables "bench-" and "mark" belong together. 
When extracting them, we want to "dehyphenate" them, i.e. to remove the hyphen and merge the syllables. 
For example, we don't want to extract the word "bench- mark" or "bench-mark", but "benchmark". <<BR>>
Dehyphenation must take into account so-called ''compound words'' that contain mandatory hyphens, like ''"high-quality"'' in the snippet above. If a compound word is hyphenated, the syllables should be merged, but the hyphen must be kept if it is mandatory. For example, consider a small modification of the snippet above:

''"In this paper, we show how to construct a high-<<BR>>
quality benchmark of principally arbitrary size from parallel TeX and PDF data."''

This time, the word "high-quality" is hyphenated. 
We want to dehyphenate this word without removing the hyphen, i.e. we do not want to extract "high- quality" or "highquality", but "high-quality".

'''Goal''': Write a function that, given a hyphenated word, dehyphenates it while handling compound words as explained above. 
One approach to solving this problem is to use a ''huge'' dictionary of words (for example, based on [[http://lemurproject.org/clueweb12/|ClueWeb12]]) to look up a hyphenated word and decide whether it is a compound word or not. 
In general, the dictionary won't fit into memory completely, because its size will be > 50 GB. Hence, the challenge here is to find a trade-off between (1) memory consumption and (2) response times.
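
The lookup logic described in the goal could be sketched as follows. This is only a minimal illustration, not a reference solution: the names `Dictionary` and `dehyphenate` are hypothetical, the dictionary is a small in-memory set standing in for the real (much larger) one, and the tie-breaking rules (merge without hyphen when both or neither form is known) are assumptions that a real solution would probably replace, e.g. with word frequencies.

```cpp
#include <string>
#include <unordered_set>

// Hypothetical stand-in for the real dictionary: a small in-memory set of words.
using Dictionary = std::unordered_set<std::string>;

// Joins the two parts of a word that was split at a line break.
// `head` is the part before the line break, including the trailing hyphen
// (e.g. "bench-"); `tail` is the part on the next line (e.g. "mark").
std::string dehyphenate(const std::string& head, const std::string& tail,
                        const Dictionary& dict) {
  // If the first part does not end with a hyphen, there is nothing to decide;
  // the parts are simply concatenated.
  if (head.empty() || head.back() != '-') return head + tail;

  const std::string withHyphen = head + tail;  // e.g. "bench-mark"
  const std::string withoutHyphen =
      head.substr(0, head.size() - 1) + tail;  // e.g. "benchmark"

  const bool knownWith = dict.count(withHyphen) > 0;
  const bool knownWithout = dict.count(withoutHyphen) > 0;

  // Assumption: keep the hyphen only if the hyphenated form is in the
  // dictionary and the merged form is not. In all other cases (merged form
  // known, both known, or neither known) the hyphen is removed.
  if (knownWith && !knownWithout) return withHyphen;
  return withoutHyphen;
}
```

With a dictionary containing "benchmark" and "high-quality", calling `dehyphenate("bench-", "mark", dict)` yields "benchmark", while `dehyphenate("high-", "quality", dict)` yields "high-quality".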
 
'''Step 1:''' Build a reasonable dictionary as explained above, for example from ClueWeb12, that (1) consumes as little memory as possible (the dictionary will be of size > 50 GB) and (2) provides fast response times (well below 1 ms per request).

'''Step 2:''' Evaluate the accuracy of your results and the performance of your code.<<BR>><<BR>>
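
One possible shape for such an evaluation is sketched below, assuming a list of labelled test cases (the two parts of a split word plus the expected result). The names `TestCase`, `EvalResult`, `DehyphenateFn` and `evaluate` are hypothetical; the function under test is passed in as a callback, so any implementation can be measured. Accuracy is the fraction of correct answers, and performance is reported as the mean wall-clock time per call.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// One labelled test case (hypothetical format).
struct TestCase {
  std::string head;      // part before the line break, e.g. "bench-"
  std::string tail;      // part after the line break, e.g. "mark"
  std::string expected;  // correct result, e.g. "benchmark"
};

struct EvalResult {
  double accuracy;    // fraction of correctly handled cases, in [0, 1]
  double meanMicros;  // mean wall-clock time per case, in microseconds
};

// The function under test: takes the two parts and returns the merged word.
using DehyphenateFn =
    std::function<std::string(const std::string&, const std::string&)>;

EvalResult evaluate(const std::vector<TestCase>& cases,
                    const DehyphenateFn& fn) {
  if (cases.empty()) return {0.0, 0.0};

  std::size_t correct = 0;
  const auto start = std::chrono::steady_clock::now();
  for (const TestCase& c : cases) {
    // Note: the measured time includes the string comparison.
    if (fn(c.head, c.tail) == c.expected) ++correct;
  }
  const auto end = std::chrono::steady_clock::now();

  const double totalMicros =
      std::chrono::duration<double, std::micro>(end - start).count();
  const double n = static_cast<double>(cases.size());
  return {static_cast<double>(correct) / n, totalMicros / n};
}
```

For meaningful timings, the test set should be large and the dictionary should be loaded the same way as in the intended application, since a few calls on a warm cache say little about real response times.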


=== Task 2: Guessing Ligatures ===