Freebase RDF Dump
The last Freebase dump that was made available:
/nfs/raid2/haussmae/freebase/latest
ClueWeb 12 Full
/nfs/raid1/clueweb12
Elmar extracted sentences containing entities from FACC1:
/nfs/raid1/haussmae/clueweb_extracted_sentences
Elmar also parsed these sentences with spaCy and identified binary and mediated Freebase relations between FACC1 entities:
/nfs/raid5/haussmae/keyword-translation-data/clueweb_sentences
Elmar also filtered the above down to questions only:
/nfs/raid5/haussmae/qa-completion-data/questions
ClueWeb 09 B
/nfs/raid1/clueweb09_b
FACC1
Google's entity annotations on ClueWeb 09 and 12:
/nfs/raid2/haussmae/freebase_annotations
Other corpora
/nfs/raid2/haussmae/datasets
- 30M-QA (2016): 30M question-answer pairs for Freebase, described in http://arxiv.org/abs/1603.06807
- Crosswikis (2012): anchor text statistics from Wikipedia and the Web, useful for entity disambiguation
- Reddit (2007-2015): 1.6B comments from Reddit
- OntoNotes 5.0 (2013): LDC corpus with POS, parse, NER annotations etc.
- Yahoo! Answers (2007): 4.4M questions and answers
- ukWaC (2008): 2 billion word corpus constructed from the Web (.uk domain), POS-tagged and lemmatized
- AOL queries (2006): ~20M web queries collected from ~650k users over three months
- Relation extraction (2013): 10K place of birth and 40K attending institution judgments for relations in Wikipedia
- GOV topic distill (2003): data from the TREC 2003 topic distillation task
- Ohsumed (1987-1991): ~300K titles and/or abstracts from 270 medical journals
- Valex (2006): subcategorization frame (SCF) and frequency information for 6,397 English verbs
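
The locations above are spread over several NFS mounts, so it can help to check which of them are visible from the current machine before starting a job. The following is a minimal sketch, assuming Python 3; the names `DATA_PATHS` and `check_paths` are illustrative and not part of any existing tooling.

```python
from pathlib import Path

# Locations listed on this page (copied verbatim; adjust if mounts change).
DATA_PATHS = [
    "/nfs/raid2/haussmae/freebase/latest",
    "/nfs/raid1/clueweb12",
    "/nfs/raid1/haussmae/clueweb_extracted_sentences",
    "/nfs/raid5/haussmae/keyword-translation-data/clueweb_sentences",
    "/nfs/raid5/haussmae/qa-completion-data/questions",
    "/nfs/raid1/clueweb09_b",
    "/nfs/raid2/haussmae/freebase_annotations",
    "/nfs/raid2/haussmae/datasets",
]


def check_paths(paths):
    """Map each path to True if it exists on this machine, else False."""
    return {p: Path(p).exists() for p in paths}


if __name__ == "__main__":
    # Print one status line per location.
    for path, found in check_paths(DATA_PATHS).items():
        print(("ok       " if found else "MISSING  ") + path)
```

A path reported as missing may only mean that the corresponding RAID is not mounted on the machine in use.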