AD Teaching Wiki

Revision 3 as of 2019-05-05 22:23:42

Accurate Word Extraction from Documents with Complex Layouts

Type: An interesting and practical project. A basic understanding of Machine Learning is required; knowledge of Deep Learning is desirable. The preferred programming language is Python.

Background info: Words in text documents can be hyphenated at line ends (this is especially frequent in layouts with multiple columns). Such a word appears broken into two parts at different positions in the text, and the parts need to be merged into a single word when extracting text from the document (e.g., for indexing and searching the documents). The hyphen between the two parts must be retained even after merging when the word is a compound of two words and the hyphen is what joins them. Here is an example:

TODO: Example.

Note the parts highlighted in blue. The word high-quality is a compound word, so the hyphen between high and quality needs to be retained after merging the parts. In contrast, the word benchmark is not a compound word, so the hyphen needs to be removed.
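To make the two cases concrete, here is a minimal sketch of a dictionary-lookup baseline. It is not the machine learning approach this project aims for; it only illustrates the decision that has to be made. The function name `merge_hyphenated` and the tiny `vocabulary` set are hypothetical and chosen for illustration only.

```python
def merge_hyphenated(first_part, second_part, vocabulary):
    """Merge two parts of a word that was split at a line break.

    first_part is the text before the line break, including the trailing
    hyphen (e.g. "bench-"); second_part is the text after it (e.g. "mark").
    vocabulary is a set of known lowercase words (an assumption of this
    sketch; a real system would need a far larger word list or a model).
    """
    prefix = first_part.rstrip("-")
    joined = prefix + second_part
    hyphenated = prefix + "-" + second_part

    # Case 1: the hyphenated form is a known compound, keep the hyphen.
    if hyphenated.lower() in vocabulary:
        return hyphenated
    # Case 2: the joined form is a known word, drop the hyphen.
    if joined.lower() in vocabulary:
        return joined
    # Case 3: both parts are known words on their own, so a compound is
    # plausible; keep the hyphen.
    if prefix.lower() in vocabulary and second_part.lower() in vocabulary:
        return hyphenated
    # Fallback: assume an ordinary line-break hyphen and drop it.
    return joined


# Hypothetical toy vocabulary covering the two words discussed above.
vocabulary = {"benchmark", "high-quality", "high", "quality"}

print(merge_hyphenated("bench-", "mark", vocabulary))
print(merge_hyphenated("high-", "quality", vocabulary))
```

With this toy vocabulary, the first call drops the hyphen and the second keeps it. Such a lookup fails for words missing from the vocabulary, which is one motivation for a learned model.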

Goal: Merge hyphenated words using machine learning techniques, taking into account that a word can be a compound word, in which case the hyphen between the two parts of the word needs to be retained when merging the parts.
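One possible way to frame the goal as a supervised learning problem is binary classification over pairs of word parts: label 1 means "keep the hyphen", label 0 means "remove it". The sketch below generates such labeled pairs from a list of correctly spelled words. This framing, the function name `make_training_examples`, and the parameter `min_part_len` are assumptions made for illustration, not part of the project description. For simplicity, the sketch splits unhyphenated words at every interior position, whereas real line-break hyphenation only occurs at syllable boundaries.

```python
def make_training_examples(words, min_part_len=2):
    """Create (first_part, second_part, label) triples from whole words.

    label 1: the parts were joined by a real hyphen (keep it on merging).
    label 0: the parts come from an artificial split (remove the hyphen).
    """
    examples = []
    for word in words:
        if "-" in word:
            # Each interior hyphen of a compound is a possible line-break
            # position where the hyphen must be retained.
            for i, char in enumerate(word):
                if char == "-" and 0 < i < len(word) - 1:
                    examples.append((word[:i], word[i + 1:], 1))
        else:
            # Simulate line-break hyphenation at every interior position
            # that leaves at least min_part_len characters on each side.
            for i in range(min_part_len, len(word) - min_part_len + 1):
                examples.append((word[:i], word[i:], 0))
    return examples


examples = make_training_examples(["high-quality", "benchmark"])
for first_part, second_part, label in examples:
    print(first_part, second_part, label)
```

The resulting triples could then be turned into features (for example, character n-grams of both parts) and fed to any classifier; which features and which model work well is left open here.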

Challenge 1: TODO

Challenge 2: TODO

Subgoal 1: TODO

Subgoal 2: TODO

Supervision by Claudius Korzen.
