Wikidata
Two versions (prepared for the Information Retrieval WS 17/18 course) can be found under vulcano:/local/raid/ad/korzen
It is not trivial to understand how Wikidata handles n-ary predicates or how to obtain the names of entities and predicates. See here for various explanations of how to work with Wikidata.
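As a minimal sketch of both points (n-ary facts via statement nodes, names via the label service), the query below runs against the public Wikidata SPARQL endpoint rather than the dumps above; the concrete IDs Q567 (Angela Merkel), P39 (position held) and P580 (start time) are only illustrative.

    # Sketch: n-ary facts in Wikidata are modelled as statement nodes (prefix p:)
    # whose main value (ps:) and qualifiers (pq:) can be queried separately;
    # human-readable names come from the label service. Uses the public endpoint,
    # not the dumps above. Q567/P39/P580 are illustrative IDs.
    import requests

    WDQS = "https://query.wikidata.org/sparql"
    QUERY = """
    SELECT ?positionLabel ?start WHERE {
      wd:Q567 p:P39 ?stmt .              # statement nodes for "position held" (P39)
      ?stmt ps:P39 ?position .           # main value of each statement
      OPTIONAL { ?stmt pq:P580 ?start }  # qualifier: start time (P580)
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    """

    resp = requests.get(WDQS, params={"query": QUERY, "format": "json"},
                        headers={"User-Agent": "ad-wiki-example/0.1"})
    resp.raise_for_status()
    for row in resp.json()["results"]["bindings"]:
        print(row["positionLabel"]["value"], row.get("start", {}).get("value", ""))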
Freebase RDF Dump
The last Freebase dump that was made available, along with scripts for common conversions:
/nfs/raid2/haussmae/freebase/latest
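For reference, a minimal sketch of streaming the dump to pull out English entity names; it assumes the usual gzipped, tab-separated N-Triples layout and a file named freebase-rdf-latest.gz inside that directory (check the actual file name, and prefer the existing conversion scripts for real work).

    # Sketch only: stream English entity names out of the Freebase RDF dump.
    # Assumes the usual dump layout -- gzipped N-Triples with tab-separated fields
    # (<subject> \t <predicate> \t <object> \t .) -- and a file named
    # freebase-rdf-latest.gz; check the directory for the actual file name.
    import gzip

    DUMP = "/nfs/raid2/haussmae/freebase/latest/freebase-rdf-latest.gz"
    NAME_PRED = "<http://rdf.freebase.com/ns/type.object.name>"

    with gzip.open(DUMP, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) < 3:
                continue  # skip malformed lines rather than crashing on them
            subj, pred, obj = parts[0], parts[1], parts[2]
            if pred == NAME_PRED and obj.endswith("@en"):
                print(subj, obj)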
ClueWeb 12 Full
/nfs/raid1/clueweb12
Elmar extracted sentences containing entities from FACC1:
/nfs/raid1/haussmae/clueweb_extracted_sentences
Elmar also parsed the sentences using spaCy and identified binary and mediated relations in Freebase between FACC1 entities (a rough sketch of this kind of parsing follows at the end of this section):
/nfs/raid5/haussmae/keyword-translation-data/clueweb_sentences
Elmar also filtered the above to only questions:
/nfs/raid5/haussmae/qa-completion-data/questions
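The following is only a toy illustration of dependency-based relation extraction with spaCy, not Elmar's actual pipeline; the example sentence and the extraction heuristics are made up for the sketch.

    # Toy illustration of dependency-based extraction with spaCy (not the real pipeline).
    # Assumes: pip install spacy && python -m spacy download en_core_web_sm
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Barack Obama was born in Honolulu.")

    for verb in (t for t in doc if t.pos_ == "VERB"):
        subjects = [c for c in verb.children if c.dep_ in ("nsubj", "nsubjpass")]
        objects = [c for c in verb.children if c.dep_ in ("dobj", "attr")]
        # prepositional objects ("born in Honolulu") hang off a prep child of the verb
        for prep in (c for c in verb.children if c.dep_ == "prep"):
            objects.extend(c for c in prep.children if c.dep_ == "pobj")
        for s in subjects:
            for o in objects:
                print(s.text, verb.lemma_, o.text)   # e.g. "Obama bear Honolulu"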
ClueWeb 09 B
/nfs/raid1/clueweb09_b
FACC1
Google's entity annotations on ClueWeb 09 and 12
/nfs/raid2/haussmae/freebase_annotations
Other corpora
/nfs/raid2/haussmae/datasets
- 30M-QA (2016): 30M question-answer pairs for Freebase, described in http://arxiv.org/abs/1603.06807
- Crosswikis (2012): anchor text statistics from Wikipedia and the Web useful for entity disambiguation
- Reddit (2007-2015): 1.6B comments from reddit
- Ontonotes 5.0 (2013): LDC corpus with POS, parse, NER annotations etc.
- Yahoo! Answers (2007): 4.4M questions and answers
- UkWac (2008): 2 billion word corpus constructed from the Web (.uk domain), POS-tagged and lemmatized
- AOL queries (2006): ~20M web queries collected from ~650k users over three months.
- Relation extraction (2013): 10K place of birth and 40K attending institution judgments for relations in Wikipedia
- GOV topic distill (2003): data from the TREC 2003 topic distillation task
- Ohsumed (1987-1991): ~300K titles and/or abstracts from 270 medical journals
- Valex (2006): subcategorization frame (SCF) and frequency information for 6,397 English verbs
Deutsche Bahn Schedule
The complete timetables for Germany since 2012 are in
/nfs/datasets/bahn
IMPORTANT: we are legally bound to make sure that this data does not leave our chair! This is why this folder can only be read by group bahn.
For each year, there are 2 versions:
/nfs/datasets/bahn/rohdaten-<YEAR>
/nfs/datasets/bahn/bahn<YEAR>_gtfs
The former are the "HAFAS Rohdaten" (raw data) downloaded from the Bahn FTP server. For some years, two datasets were published (an additional one with the suffix -nvbw).
The latter are the GTFS feeds generated from the raw data. The feeds are partitioned in the same way as the raw data (usually per "Verkehrsverbund" or internal Bahn subdivision such as Fernverkehr, Regio BW, Regio Bayern, etc.).
The conversion is quite complicated and has to be started by hand, because Deutsche Bahn makes minor undocumented changes to the raw data format each year. The file endings of the raw data also change frequently (no ending, .txt, ...), and the encoding is not very stable either. The HAFAS format does not have a concept of vehicle types, only "Zuggattungen" (types of train), so the "Zuggattung" -> vehicle type mappings must also be adjusted each year (some of them are quite stable, for example 'BUS' or 'STR', others change frequently). The conversion to GTFS is done with a script Patrick wrote during his time at geOps, which is documented here.
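To illustrate the kind of mapping that has to be maintained, here is a minimal sketch; it is not the geOps script, and the concrete code assignments are assumptions that would have to be re-checked against each year's raw data.

    # Illustrative sketch only -- not the geOps conversion script.
    # GTFS route_type values: 0 = tram, 1 = subway, 2 = rail, 3 = bus, 4 = ferry.
    # The concrete Zuggattung -> route_type assignments below are assumptions
    # and have to be re-checked against each year's raw data.
    ZUGGATTUNG_TO_ROUTE_TYPE = {
        "BUS": 3,   # bus (fairly stable, as noted above)
        "STR": 0,   # Strassenbahn / tram (fairly stable)
        "ICE": 2,   # long-distance rail
        "RE": 2,    # regional rail
        "S": 2,     # S-Bahn (could arguably be 1, depending on the feed)
    }

    def route_type_for(zuggattung: str) -> int:
        """Return the GTFS route_type for a Zuggattung, failing loudly on unknown codes."""
        try:
            return ZUGGATTUNG_TO_ROUTE_TYPE[zuggattung.strip().upper()]
        except KeyError:
            # Unknown codes are exactly what forces a manual update each year.
            raise ValueError(f"Unmapped Zuggattung: {zuggattung!r}") from None

    if __name__ == "__main__":
        for code in ["BUS", "STR", "ICE"]:
            print(code, "->", route_type_for(code))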