BachelorAndMasterProjectsAndTheses/ClueWebContextDecomposition

Goal: Decompose the sentences of a given web-size corpus into their semantic components.

Requirements:

1. It should work on ClueWeb12 [1].

2. The decomposition should be based on CSD-IE, developed in our group [2].

3. Our current implementation is based on a rather slow parser. This should be switched to the much faster spaCy parser [3].

4. The output format should be compatible with QLever, our own SPARQL+Text search engine [4]. This part is easy.
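
To make requirement 4 more concrete, here is a minimal sketch of how decomposed contexts could be written out as tab-separated rows, one word per line. Everything in it is an assumption for illustration: the function names are hypothetical, the column layout (word, entity flag, context id, score) should be checked against the QLever documentation before use, and the two example contexts are written by hand, not produced by CSD-IE.

```python
from typing import Iterable, List, Tuple

# One output row: (word, is_entity, context_id, score).
# ASSUMPTION: this column layout is illustrative only; verify it against
# the text-input format that QLever actually expects.
Row = Tuple[str, int, int, int]

# Punctuation stripped from the start and end of each token.
_PUNCTUATION = ".,;:!?\"'()"


def contexts_to_rows(contexts: Iterable[str], first_context_id: int = 0) -> List[Row]:
    """Turn each context (one semantic component of a sentence) into word rows.

    Every context gets its own id, so that a text search only matches words
    that occur together in the same semantic component.
    """
    rows: List[Row] = []
    for offset, context in enumerate(contexts):
        context_id = first_context_id + offset
        for token in context.split():
            word = token.strip(_PUNCTUATION).lower()
            if word:  # skip tokens that consisted of punctuation only
                rows.append((word, 0, context_id, 1))
    return rows


def format_rows(rows: Iterable[Row]) -> str:
    """Serialize rows as tab-separated lines, one row per line."""
    return "".join("\t".join(str(field) for field in row) + "\n" for row in rows)


if __name__ == "__main__":
    # Hand-written decomposition of one sentence into two contexts
    # (hypothetical, for illustration only).
    example_contexts = [
        "Ruth Gabriel was born in San Fernando.",
        "Ruth Gabriel, daughter of Ana Diosdado",
    ]
    print(format_rows(contexts_to_rows(example_contexts)), end="")
```

Assigning one id per context rather than per sentence is the design choice that matters here: it is what lets the search engine benefit from the decomposition.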

[1] http://lemurproject.org/clueweb12/ We have purchased this dataset and it's available on our file system.