Merging Hyphenated Words & Guessing Ligatures

This topic consists of 2 subtasks, which are both related to high-quality text extraction from PDF files, which is one of our research topics. The main challenge on text extraction from PDF is that PDF is a layout-based format: it specifies the positions and fonts of the individual characters, of which the text is composed, but usually does not provide any information about words, paragraphs and sections. In particular, we are interested in the accurate identification of words and paragraphs, which is crucial for search applications.

Type: Master thesis. You should have advanced programming skills (C++ or Java, Python is discouraged because of performance issues) and a basic understanding how to write efficient code.

Task 1: Merging Hyphenated Words

Background info: In PDF files, in case of line breaks, words are often hyphenated. For illustration, in the snippet:

"In this paper, we show how to construct a high-quality bench-
mark of principally arbitrary size from parallel TeX and PDF data."

the word "benchmark" is hyphenated. Usually, PDF files does not provide any relation information, it simply does not know that the syllables "bench-" and "mark" belong together. On extracting them, we want to "dehyphenate" them, i.e. to remove the hyphen and to merge the syllables. For example, we don't want to extract the word "bench- mark" or "bench-mark", but "benchmark".
The dehyphenation should be done with respect to so called compound words containing mandatory hyphens, like "high-quality" in the snippet above. In case of a compound word is hyphenated, the syllables should be merged, but without removing the hyphen if the hyphen is mandatory. For example, consider a small modification of the snippet above:

"In this paper, we show how to construct a high-
quality benchmark of principally arbitrary size from parallel TeX and PDF data."

This time, the word "high-quality" is hyphenated. We want to dehyphenate this word without removing the hyphen, i.e. we do not want to extract "high- quality" or "highquality", but "high-quality".

Goal: Write a function that, given a hyphenated word, dehyphenates the word -- with respect to compound words as explained above. One approach to solve this problem is to have a *huge* dictionary of words (for example, based on ClueWeb12) in order to lookup a hyphenated word and to decide if the word is a compound word or not. In general, the dictionary won't fit into memory completely, because its size will be > 50GB. Hence, the challenge here is to find a trade-off between (1) memory consumption and (2) response times.

Step 1: Build a reasonable dictionary as explained above, for example from ClueWeb12, that (1) consumes as less memory as possible (the dictionary will be of size > 50GB) and (2) provides fast response times (<< 1 ms per request).

Step 2: Evaluate the accuracy of your results and the performance of your code.

Task 2: Guessing Ligatures

This task is based on Task 1.

Background info: In general, a ligature is a typographical letter, where two or more letters are joined to a single letter. A typical example is "fi", which is often a single letter in PDF, but actually represents the two letters "f" and "i". See the wikipedia page for more details about ligatures, in particular this example for an illustration how ligatures look like. On extraction, we don't want to extract ligatures as single letters, but the individual characters. Unfortunately, PDF is a complicated format and knows different fonts and different encodings. In particular, in case of so called Type-3 fonts, ligatures are "drawn" into the PDF. That means, that ligatures aren't actually letters, but -- broadly speaking -- graphics. This makes it impossible, to identify the individual letters as textual content. For an illustration, consider the word "scientific" with ligature "fi". On extracting the word, we only get something like "scienti?c", with "?" as the placeholder for the ligature.

Goal: "Guess" the correct letters for the "?" in a postprocessing step, that is "f" and "i" in the example above. The idea to solve this issue is in a "trial and error" style. In principal, there is only a finite number of possible ligatures. You could try one after the other and check, if the resulting word is valid by looking it up in the dictionary from Task 1. For example, consider again the word "scienti?c" and the possible ligatures "ffi", "fl", "fi", etc. To find the correct letters for the "?", try each ligature one after the other:

try ffi : scientiffic --> not a valid word --> discard
try fl : scientiflc --> not a valid word --> discard
try fi : scientific --> valid word --> accept
...

The challenge here is to handle possible ambiguities: If two different ligatures result in valid words, you have to choose the "best" one (and think about it, how "best" must be defined).

Step 1: Write a function that, given a word with unrecognizable ligature(s), like "scienti?c", guesses the correct letters that are represented by the ligature(s). In particular, your function should also be able to handle words with two (or even more) ligatures, like "?re?y" representing the word "firefly". Further, your function should be able to handle ambiguities as explained above reasonably.

Step 2: Evaluate the accuracy of your results and the performance of your code.

AD Teaching Wiki: BachelorAndMasterProjectsAndTheses/DehyphenationAndGuessingLigatures (last edited 2017-04-04 07:57:41 by adpult)