Wikidata
Two versions (prepared for the Information Retrieval WS 17/18 course) can be found under vulcano:/local/raid/ad/korzen
It's not trivial to understand how Wikidata handles n-ary predicates and how to obtain the names of entities and predicates. See here for various explanations on how to work with Wikidata.
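As a concrete starting point, here is a minimal Python sketch against the public SPARQL endpoint (https://query.wikidata.org/sparql). Wikidata reifies an n-ary fact as a statement node: the p: prefix leads from the entity to the statement, ps: to its main value, pq: to its qualifiers, and rdfs:label gives human-readable names. Q42, P26, and P580 (Douglas Adams, spouse, start time) are just illustrative identifiers.

```python
import requests

# One "spouse" fact, reified as a statement node so the start-time
# qualifier can attach to it; labels come from rdfs:label.
QUERY = """
SELECT ?spouseLabel ?start WHERE {
  wd:Q42 p:P26 ?stmt .
  ?stmt ps:P26 ?spouse .
  OPTIONAL { ?stmt pq:P580 ?start . }
  ?spouse rdfs:label ?spouseLabel .
  FILTER(LANG(?spouseLabel) = "en")
}
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "dataset-index-example/0.1"},
)
resp.raise_for_status()
for row in resp.json()["results"]["bindings"]:
    print(row["spouseLabel"]["value"], row.get("start", {}).get("value", ""))
```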
Freebase RDF Dump
The last available Freebase dump, together with scripts for common conversions:
/nfs/raid2/haussmae/freebase/latest
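A minimal sketch of one such conversion: rewriting the full URIs in the dump into the familiar /m/... MID notation. The dump is gzipped N-Triples with tab-separated terms; the filename below is an assumption (the official release was named freebase-rdf-latest.gz), so check the directory for the actual name.

```python
import gzip
import re

# Full Freebase URIs look like <http://rdf.freebase.com/ns/m.02mjmr>;
# the local part maps to MID notation by turning dots into slashes: /m/02mjmr.
NS = re.compile(r"^<http://rdf\.freebase\.com/ns/([^>]+)>$")

def shorten(term: str) -> str:
    m = NS.match(term)
    return "/" + m.group(1).replace(".", "/") if m else term

# ASSUMPTION: filename of the dump in the directory above.
with gzip.open("freebase-rdf-latest.gz", "rt", encoding="utf-8") as f:
    for line in f:
        s, p, o = line.rstrip("\n").split("\t")[:3]  # 4th field is the closing "."
        print(shorten(s), shorten(p), shorten(o))
        break  # remove to stream the full multi-billion-triple dump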
ClueWeb 12 Full
/nfs/raid1/clueweb12
Elmar extracted sentences containing FACC1 entities:
/nfs/raid1/haussmae/clueweb_extracted_sentences
Elmar also parsed the sentences with spaCy and identified binary and mediated Freebase relations between FACC1 entities:
/nfs/raid5/haussmae/keyword-translation-data/clueweb_sentences
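(In Freebase, "mediated" relations are n-ary facts that pass through a CVT node rather than linking two entities directly.) The exact pipeline is not documented here; as a rough sketch of the spaCy parsing step, assuming a standard English model (the model name and the sentence are illustrative):

```python
import spacy

# ASSUMPTION: model name is illustrative; any English spaCy model works.
nlp = spacy.load("en_core_web_sm")

doc = nlp("Barack Obama was born in Honolulu.")
for tok in doc:
    # Token, dependency label, and syntactic head -- the kind of structure
    # used to decide which Freebase relation connects two FACC1 entities.
    print(f"{tok.text}\t{tok.dep_}\t{tok.head.text}")
```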
Elmar also filtered these sentences down to questions only:
/nfs/raid5/haussmae/qa-completion-data/questions
ClueWeb 09 B
/nfs/raid1/clueweb09_b
FACC1
Google's entity annotations on ClueWeb 09 and 12:
/nfs/raid2/haussmae/freebase_annotations
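The annotation files are tab-separated; the column layout below follows the FACC1 release notes as I recall them (doc id, encoding, mention, byte offsets, two posteriors, MID). Treat it as an assumption and verify against the README shipped with the data; the path in the usage comment is hypothetical.

```python
from typing import Iterator, NamedTuple

class Annotation(NamedTuple):
    doc_id: str       # TREC identifier of the ClueWeb document
    encoding: str     # document encoding (usually UTF-8)
    mention: str      # surface form of the entity mention
    start: int        # byte offset of the mention's first byte
    end: int          # byte offset one past its last byte
    p_full: float     # posterior given mention and context (ASSUMED order)
    p_context: float  # posterior given context alone (ASSUMED order)
    mid: str          # Freebase MID, e.g. /m/02mjmr

def read_facc1(path: str) -> Iterator[Annotation]:
    with open(path, encoding="utf-8") as f:
        for line in f:
            row = line.rstrip("\n").split("\t")
            yield Annotation(row[0], row[1], row[2],
                             int(row[3]), int(row[4]),
                             float(row[5]), float(row[6]), row[7])

# Hypothetical usage:
# for ann in read_facc1("clueweb09-en0000-00-00000.anns.tsv"):
#     print(ann.mid, ann.mention)
```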
Other corpora
/nfs/raid2/haussmae/datasets
- 30M-QA (2016): 30M question-answer pairs generated from Freebase, described in http://arxiv.org/abs/1603.06807
- Crosswikis (2012): anchor text statistics from Wikipedia and the Web useful for entity disambiguation
- Reddit (2007-2015): 1.6B comments from Reddit
- OntoNotes 5.0 (2013): LDC corpus with POS, parse, NER annotations, etc.
- Yahoo! Answers (2007): 4.4M questions and answers
- ukWaC (2008): 2-billion-word corpus constructed from the Web (.uk domain), POS-tagged and lemmatized
- AOL queries (2006): ~20M web queries collected from ~650K users over three months
- Relation extraction (2013): 10K "place of birth" and 40K "attending institution" judgments for relations in Wikipedia
- GOV topic distillation (2003): data from the TREC 2003 topic distillation task
- Ohsumed (1987-1991): ~300K titles and/or abstracts from 270 medical journals
- VALEX (2006): subcategorization frame (SCF) and frequency information for 6,397 English verbs