AD Research Wiki:

Wikidata

Two versions (prepared for the Information Retrieval WS 17/18 course) can be found under vulcano:/local/raid/ad/korzen

It's not trivial to understand how Wikidata handles n-ary predicates and how to obtain the names of entities and predicates. See here for various explanations on how to work with Wikidata.
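As a rough illustration of the n-ary modelling, the sketch below uses a handful of hand-written toy triples in the shape of the Wikidata RDF export: the `wdt:` prefix gives the plain binary ("truthy") triple, while `p:` leads to a statement node that carries the value (`ps:`) and its qualifiers (`pq:`). Names come from `rdfs:label` on the `wd:` entity, for properties as well as items. The statement node name, the helper functions and the concrete values are assumptions made for this example; nothing here is read from the dumps above.

```python
# Toy triples in the shape of the Wikidata RDF export (prefixes abbreviated).
# The statement node "s:Q42-stmt1" and the values are illustrative only.
TRIPLES = [
    ("wd:Q42", "rdfs:label", "Douglas Adams"),
    ("wd:Q691283", "rdfs:label", "St John's College"),
    ("wd:P69", "rdfs:label", "educated at"),
    ("wd:P582", "rdfs:label", "end time"),
    # "Truthy" shortcut: a plain binary triple without qualifiers.
    ("wd:Q42", "wdt:P69", "wd:Q691283"),
    # Full form: entity -> statement node -> value plus qualifiers.
    ("wd:Q42", "p:P69", "s:Q42-stmt1"),
    ("s:Q42-stmt1", "ps:P69", "wd:Q691283"),
    ("s:Q42-stmt1", "pq:P582", "1974"),
]


def objects(subject, predicate):
    """Return all objects of triples matching subject and predicate."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def label(term):
    """Map a wd:/wdt:/p:/ps:/pq: term to the label of its wd: entity."""
    if ":" not in term:
        return term
    local = term.split(":", 1)[1]
    found = objects("wd:" + local, "rdfs:label")
    return found[0] if found else term


def nary_statements(entity, prop):
    """Collect (subject, predicate, value, qualifiers) for one property."""
    results = []
    for node in objects(entity, "p:" + prop):
        value = objects(node, "ps:" + prop)[0]
        qualifiers = {
            label(p): o
            for s, p, o in TRIPLES
            if s == node and p.startswith("pq:")
        }
        results.append(
            (label(entity), label("wd:" + prop), label(value), qualifiers)
        )
    return results


if __name__ == "__main__":
    for statement in nary_statements("wd:Q42", "P69"):
        print(statement)
```

The same traversal (entity, then `p:`, then `ps:` and `pq:`) is what a SPARQL query against the full data would have to spell out; the `wdt:` shortcut is enough when qualifiers do not matter.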

Freebase RDF Dump

The last available Freebase dump, together with scripts for common conversions:

/nfs/raid2/haussmae/freebase/latest

ClueWeb 12 Full

/nfs/raid1/clueweb12

Elmar extracted sentences containing entities from FACC1:

/nfs/raid1/haussmae/clueweb_extracted_sentences

Elmar also parsed the sentences using spaCy and identified binary and mediated relations in Freebase between FACC1 entities:

/nfs/raid5/haussmae/keyword-translation-data/clueweb_sentences

Elmar also filtered the above down to questions only:

/nfs/raid5/haussmae/qa-completion-data/questions

ClueWeb 09 B

/nfs/raid1/clueweb09_b

FACC 1

Google's entity annotations on ClueWeb 09 and 12

/nfs/raid2/haussmae/freebase_annotations

Other corpora

/nfs/raid2/haussmae/datasets

Deutsche Bahn Schedule

The entire timetables for Germany since 2012 are in

/nfs/datasets/bahn

IMPORTANT: we are legally bound to make sure that this data does not leave our chair! This is why this folder can only be read by group bahn.

For each year, there are 2 versions:

/nfs/datasets/bahn/rohdaten-<YEAR>
/nfs/datasets/bahn/bahn<YEAR>_gtfs

The former are the "HAFAS Rohdaten" (raw data) downloaded from the Bahn FTP server. For some years, two datasets (an additional one with suffix -nvbw) were published.

The latter are the GTFS feeds generated from the raw data. The feeds are partitioned in the same way as the raw data (usually per "Verkehrsverbund" or internal Bahn subdivision such as Fernverkehr, Regio BW, Regio Bayern, etc.).
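A minimal sketch for locating one year's data and taking a first look at a feed. It assumes that each partition is an unpacked directory containing the standard GTFS text files (in particular routes.txt with a route_type column); whether the feeds are stored unpacked or zipped is not stated here, so adjust as needed. The helper names are made up for this example.

```python
import csv
from collections import Counter
from pathlib import Path

BAHN_ROOT = Path("/nfs/datasets/bahn")


def dataset_dirs(year, root=BAHN_ROOT):
    """Return (raw data dir, GTFS dir) for a year, using the naming above."""
    return root / f"rohdaten-{year}", root / f"bahn{year}_gtfs"


def route_type_counts(feed_dir):
    """Count routes per GTFS route_type in one feed's routes.txt.

    Assumes feed_dir is an unpacked GTFS feed (hypothetical layout).
    """
    counts = Counter()
    routes_file = Path(feed_dir) / "routes.txt"
    # utf-8-sig also tolerates a byte order mark at the start of the file.
    with open(routes_file, newline="", encoding="utf-8-sig") as f:
        for row in csv.DictReader(f):
            counts[row["route_type"]] += 1
    return counts
```

In GTFS, route_type 0 is tram, 2 is rail and 3 is bus, so the counts give a quick sanity check of what a partition contains.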

The conversion is quite complicated and has to be started by hand, because Deutsche Bahn makes minor undocumented changes to the raw data format each year. The file endings of the raw data also change frequently (no ending, .txt, ...), and the encoding is not stable either. The HAFAS format does not have a concept of vehicle types, only "Zuggattungen" (types of train), so the mapping "Zuggattung" -> vehicle type must also be adjusted each year (some entries are quite stable, for example 'BUS' or 'STR'; others change frequently). The conversion to GTFS is done with a script Patrick wrote during his time at geOps, which is documented here.
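The two yearly adjustment points mentioned above can be pictured roughly as follows. This is not the actual conversion script: the mapping table, the function names and the list of candidate encodings are assumptions for illustration. Only 'BUS' and 'STR' are taken from the text; 'ICE' and 'RE' are added as plausible rail examples. The target values are the standard GTFS route_type codes.

```python
# Standard GTFS route_type codes used in this sketch.
GTFS_TRAM, GTFS_RAIL, GTFS_BUS = 0, 2, 3

# Hypothetical per-year table; the real script's table is not shown here.
ZUGGATTUNG_TO_ROUTE_TYPE = {
    "BUS": GTFS_BUS,
    "STR": GTFS_TRAM,
    "ICE": GTFS_RAIL,  # assumed entry
    "RE": GTFS_RAIL,   # assumed entry
}


def route_type_for(zuggattung, mapping=ZUGGATTUNG_TO_ROUTE_TYPE):
    """Look up the GTFS route_type, failing loudly on unknown Zuggattungen."""
    key = zuggattung.strip().upper()
    if key not in mapping:
        raise KeyError(
            f"unmapped Zuggattung {key!r}; extend the table for this year"
        )
    return mapping[key]


def decode_raw(data, encodings=("utf-8", "cp1252", "latin-1")):
    """Decode raw bytes by trying candidate encodings in order.

    The candidate list is an assumption. latin-1 accepts any byte
    sequence, so with the defaults it acts as a last resort.
    """
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings matched")
```

Raising on an unknown Zuggattung, rather than falling back to a default type, makes new or renamed categories visible as soon as a new year's data is converted.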

AD Research Wiki: Datasets (last edited 2018-12-17 17:13:50 by Patrick Brosi)