Freebase RDF Dump

The last Freebase dump that was made available:

/nfs/raid2/haussmae/freebase/latest
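
A minimal sketch for streaming triples from the dump, assuming it uses the gzip-compressed, tab-separated N-Triples layout of the public Freebase RDF dumps (subject, predicate, object, then a final "."). The file name inside the directory above is not stated here, so the path passed to iter_triples is a placeholder, and parse_triple and iter_triples are illustrative names, not existing tools.

```python
import gzip
from typing import Iterator, Optional, Tuple

Triple = Tuple[str, str, str]


def parse_triple(line: str) -> Optional[Triple]:
    """Split one tab-separated N-Triples line into (subject, predicate, object).

    Assumption: fields are separated by tabs and the line ends with a lone ".".
    Lines that do not match this shape are reported as None.
    """
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 4 or parts[3] != ".":
        return None
    return parts[0], parts[1], parts[2]


def iter_triples(path: str) -> Iterator[Triple]:
    """Stream triples from a gzip-compressed dump without loading it into memory."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            triple = parse_triple(line)
            if triple is not None:
                yield triple
```

Usage would look like `for s, p, o in iter_triples(dump_path): ...`, where dump_path points at the actual dump file in the directory above. Streaming line by line matters because the full dump is far too large to hold in memory.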

ClueWeb 12 Full

/nfs/raid1/clueweb12

Elmar extracted sentences containing entities from FACC1:

/nfs/raid1/haussmae/clueweb_extracted_sentences

Elmar also parsed the sentences using spaCy and identified binary and mediated relations in Freebase between FACC1 entities:

/nfs/raid5/haussmae/keyword-translation-data/clueweb_sentences

Elmar also filtered the above to only questions:

/nfs/raid5/haussmae/qa-completion-data/questions

ClueWeb 09 Category B

/nfs/raid1/clueweb09_b

Google's entity annotations on ClueWeb 09 and 12 (FACC1)

/nfs/raid2/haussmae/freebase_annotations

Other datasets

/nfs/raid2/haussmae/datasets

30M-QA (2016): 30M question-answer pairs for Freebase, described in http://arxiv.org/abs/1603.06807
Crosswikis (2012): anchor text statistics from Wikipedia and the Web, useful for entity disambiguation
Reddit (2007-2015): 1.6B Reddit comments
OntoNotes 5.0 (2013): LDC corpus with POS, parse, NER, and other annotations
Yahoo! Answers (2007): 4.4M questions and answers
ukWaC (2008): 2-billion-word corpus constructed from the Web (.uk domain), POS-tagged and lemmatized
AOL queries (2006): ~20M web queries collected from ~650K users over three months
Relation extraction (2013): 10K place-of-birth and 40K attending-institution judgments for relations in Wikipedia
GOV topic distill (2003): data from the TREC 2003 topic distillation task
OHSUMED (1987-1991): ~300K titles and/or abstracts from 270 medical journals
VALEX (2006): subcategorization frame (SCF) and frequency information for 6,397 English verbs