Wikidata
Two versions (prepared for the Information Retrieval WS 17/18 course) can be found under vulcano:/local/raid/ad/korzen
It's not trivial to understand how Wikidata handles n-ary predicates and how to obtain the names of entities and predicates. See here for various explanations on how to work with Wikidata.
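As a concrete starting point, here is a minimal Python sketch against the public SPARQL endpoint (https://query.wikidata.org/sparql). Wikidata reifies an n-ary fact as a statement node: the p: prefix leads from the entity to the statement, ps: to its main value, pq: to its qualifiers, and rdfs:label gives human-readable names. Q42, P26, and P580 (Douglas Adams, spouse, start time) are just illustrative identifiers.

```python
import requests

# One "spouse" fact, reified as a statement node so the start-time
# qualifier can attach to it; labels come from rdfs:label.
QUERY = """
SELECT ?spouseLabel ?start WHERE {
  wd:Q42 p:P26 ?stmt .
  ?stmt ps:P26 ?spouse .
  OPTIONAL { ?stmt pq:P580 ?start . }
  ?spouse rdfs:label ?spouseLabel .
  FILTER(LANG(?spouseLabel) = "en")
}
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "dataset-index-example/0.1"},
)
resp.raise_for_status()
for row in resp.json()["results"]["bindings"]:
    print(row["spouseLabel"]["value"], row.get("start", {}).get("value", ""))
```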
Freebase RDF Dump
The last available Freebase dump, together with scripts for common conversions:
/nfs/raid2/haussmae/freebase/latest
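A minimal sketch of one such conversion: rewriting the full URIs in the dump into the familiar /m/... MID notation. The dump is gzipped N-Triples with tab-separated terms; the filename below is an assumption (the official release was named freebase-rdf-latest.gz), so check the directory for the actual name.

```python
import gzip
import re

# Full Freebase URIs look like <http://rdf.freebase.com/ns/m.02mjmr>;
# the local part maps to MID notation by turning dots into slashes: /m/02mjmr.
NS = re.compile(r"^<http://rdf\.freebase\.com/ns/([^>]+)>$")

def shorten(term: str) -> str:
    m = NS.match(term)
    return "/" + m.group(1).replace(".", "/") if m else term

# ASSUMPTION: filename of the dump in the directory above.
with gzip.open("freebase-rdf-latest.gz", "rt", encoding="utf-8") as f:
    for line in f:
        s, p, o = line.rstrip("\n").split("\t")[:3]  # 4th field is the closing "."
        print(shorten(s), shorten(p), shorten(o))
        break  # remove to stream the full multi-billion-triple dump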
ClueWeb 12 Full
/nfs/raid1/clueweb12
Elmar extracted sentences containing FACC1 entities:
/nfs/raid1/haussmae/clueweb_extracted_sentences
Elmar also parsed the sentences with spaCy and identified binary and mediated Freebase relations between FACC1 entities:
/nfs/raid5/haussmae/keyword-translation-data/clueweb_sentences
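(In Freebase, "mediated" relations are n-ary facts that pass through a CVT node rather than linking two entities directly.) The exact pipeline is not documented here; as a rough sketch of the spaCy parsing step, assuming a standard English model (the model name and the sentence are illustrative):

```python
import spacy

# ASSUMPTION: model name is illustrative; any English spaCy model works.
nlp = spacy.load("en_core_web_sm")

doc = nlp("Barack Obama was born in Honolulu.")
for tok in doc:
    # Token, dependency label, and syntactic head -- the kind of structure
    # used to decide which Freebase relation connects two FACC1 entities.
    print(f"{tok.text}\t{tok.dep_}\t{tok.head.text}")
```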
Elmar also filtered these sentences down to questions only:
/nfs/raid5/haussmae/qa-completion-data/questions
ClueWeb 09 B
/nfs/raid1/clueweb09_b
FACC1
Google's entity annotations on ClueWeb 09 and 12:
/nfs/raid2/haussmae/freebase_annotations
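The annotation files are tab-separated; the column layout below follows the FACC1 release notes as I recall them (doc id, encoding, mention, byte offsets, two posteriors, MID). Treat it as an assumption and verify against the README shipped with the data; the path in the usage comment is hypothetical.

```python
from typing import Iterator, NamedTuple

class Annotation(NamedTuple):
    doc_id: str       # TREC identifier of the ClueWeb document
    encoding: str     # document encoding (usually UTF-8)
    mention: str      # surface form of the entity mention
    start: int        # byte offset of the mention's first byte
    end: int          # byte offset one past its last byte
    p_full: float     # posterior given mention and context (ASSUMED order)
    p_context: float  # posterior given context alone (ASSUMED order)
    mid: str          # Freebase MID, e.g. /m/02mjmr

def read_facc1(path: str) -> Iterator[Annotation]:
    with open(path, encoding="utf-8") as f:
        for line in f:
            row = line.rstrip("\n").split("\t")
            yield Annotation(row[0], row[1], row[2],
                             int(row[3]), int(row[4]),
                             float(row[5]), float(row[6]), row[7])

# Hypothetical usage:
# for ann in read_facc1("clueweb09-en0000-00-00000.anns.tsv"):
#     print(ann.mid, ann.mention)
```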
Other corpora
/nfs/raid2/haussmae/datasets
- 30M-QA (2016): 30M question-answer pairs generated from Freebase, described in http://arxiv.org/abs/1603.06807
- Crosswikis (2012): anchor text statistics from Wikipedia and the Web useful for entity disambiguation
- Reddit (2007-2015): 1.6B comments from Reddit
- OntoNotes 5.0 (2013): LDC corpus with POS, parse, NER annotations, etc.
- Yahoo! Answers (2007): 4.4M questions and answers
- ukWaC (2008): 2-billion-word corpus constructed from the Web (.uk domain), POS-tagged and lemmatized
- AOL queries (2006): ~20M web queries collected from ~650K users over three months
- Relation extraction (2013): 10K "place of birth" and 40K "attending institution" judgments for relations in Wikipedia
- GOV topic distillation (2003): data from the TREC 2003 topic distillation task
- Ohsumed (1987-1991): ~300K titles and/or abstracts from 270 medical journals
- VALEX (2006): subcategorization frame (SCF) and frequency information for 6,397 English verbs