Building a Blazegraph index for Freebase (Log Hannah 30.04.2020)

I followed the instructions in this user manual on MediaWiki and some tips from Johannes (who has done this before).

== Step 1: Download and run the service ==

Once the commands below have run, the service is up and available at http://panarea.informatik.privat:9998/bigdata/

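# Note: this URL is the Maven search page that lists the releases; the 0.3.2 dist zip has to be downloaded from the results there
wget "https://search.maven.org/search?q=g:org.wikidata.query.rdf%20AND%20a:service"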
unzip wikidata-query-service-0.3.2-dist.zip
cd service-0.3.2
vim runBlazegraph.sh
--> edit HOST (localhost -> panarea.informatik.privat)
--> edit PORT (9999 -> 9998)
--> remove the two -Xlog options in the GC_LOGS definition (after the colon) in the if part
./runBlazegraph.sh > blazegraph-freebase.build-log.txt &
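
Since runBlazegraph.sh is started in the background, it can help to wait until the endpoint actually answers before moving on. The following is a minimal sketch, not part of the original setup: the helper name wait_for_service, the number of tries, and the 2-second pause are my own assumptions.

```shell
# Hypothetical helper: poll a URL until it responds or the tries run out.
# $1 = URL to check, $2 = maximum number of tries (default 30, an assumption)
wait_for_service() {
  url=$1
  tries=${2:-30}
  i=0
  while [ "$i" -lt "$tries" ]; do
    # --fail makes curl return a non-zero exit code on HTTP errors
    if curl --silent --fail --output /dev/null "$url"; then
      echo "service is up: $url"
      return 0
    fi
    i=$((i + 1))
    sleep 2
  done
  echo "service did not respond: $url" >&2
  return 1
}

# Usage (not executed here):
#   wait_for_service http://panarea.informatik.privat:9998/bigdata/
```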

== Step 2: Divide the input data into chunks ==

I simply did this with split. I also tried the munge.sh script afterwards, but that didn't work for me: it created the first file in the series, but its size stayed at 0 and it never moved on. I have no idea why.

cat /nfs/datasets/clueweb_freebase_qlever/freebase-rdf-latest-english-sortable-measurements.nt \
  | split -a 3 --numeric-suffixes=1 --additional-suffix=.ttl -l 10000000 --filter='gzip > $FILE.gz' - freebase-
watch -n 1 "ls -lht freebase-*"
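
To see which file names the split options produce without processing the full dump, the same command can be tried on a tiny input. This is only a sketch under assumptions: it needs GNU split, and the 25-line input, the 10-line chunk size, and the sample- prefix are made up for the dry run.

```shell
# Small-scale dry run of the chunking command (GNU split assumed).
# Hypothetical input: 25 numbered lines instead of the real .nt file.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

seq 1 25 \
  | split -a 3 --numeric-suffixes=1 --additional-suffix=.ttl -l 10 --filter='gzip > $FILE.gz' - sample-

# Lists sample-001.ttl.gz, sample-002.ttl.gz, sample-003.ttl.gz
ls sample-*

# Sanity check: all chunks together must contain as many lines as the input
total=$(gzip -dc sample-*.ttl.gz | wc -l | tr -d ' ')
echo "lines in all chunks: $total"
```

The same line-count check can be run on the real freebase-* chunks and compared against wc -l of the input file.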

== Step 3: Load the data into the Blazegraph instance ==

vim loadData.sh
--> edit HOST (http://localhost:9999 -> http://panarea.informatik.privat:9998)
--> edit FORMAT (wikidump-%09d.ttl.gz -> freebase-%03d.ttl.gz)
./loadData.sh -n wdq
# the wdq namespace seems to be hard-coded, didn't bother to find out how to change it
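
FORMAT is a printf-style pattern, so %03d has to match the three-digit suffixes that split produced in Step 2. The sketch below only illustrates this naming logic and is not the actual loadData.sh code; the helpers chunk_name and count_chunks are hypothetical. Counting the consecutive chunks before loading is a cheap way to notice a gap in the series.

```shell
# Sketch of the naming logic only, not the actual loadData.sh code.
FORMAT='freebase-%03d.ttl.gz'   # same pattern as set in loadData.sh

# Hypothetical helper: print the file name for chunk index $1
chunk_name() {
  printf "$FORMAT" "$1"
}

# Hypothetical helper: count consecutive chunk files in directory $1,
# starting at index 1 and stopping at the first missing file
count_chunks() {
  i=1
  while [ -f "$1/$(chunk_name "$i")" ]; do
    i=$((i + 1))
  done
  echo $((i - 1))
}

chunk_name 1; echo     # prints freebase-001.ttl.gz
chunk_name 42; echo    # prints freebase-042.ttl.gz

# Usage (not executed here):
#   count_chunks .
```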

== Step 4: Ask queries ==

The service has been up and running since Step 1 at http://panarea.informatik.privat:9998/bigdata/

Here is an example of how to get a result via an HTTP request. Note the -H option, which requests the result as JSON (the default is XML). Also note --data-urlencode, which allows writing the query as a plain string. With the --data options curl sends a POST request; adding -G would send the same parameters as a GET request. The request parameter maxQueryTimeMillis specifies a timeout in milliseconds (1 second in the example).

curl -H "Accept: application/json" http://panarea.informatik.privat:9998/bigdata/sparql --data-urlencode 'query=SELECT (COUNT(*) AS ?count) WHERE { ?s ?p ?o }' --data maxQueryTimeMillis=1000
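
For repeated queries, the call can be wrapped in a small function. This is a minimal sketch under assumptions: the helper name sparql_get and its default timeout are hypothetical, and it uses curl's -G option so that the parameters are sent as a GET request instead of a POST.

```shell
# Endpoint from the example above
ENDPOINT="http://panarea.informatik.privat:9998/bigdata/sparql"

# Hypothetical helper: send a SPARQL query as a GET request.
# $1 = query string, $2 = timeout in milliseconds (default 1000, an assumption)
sparql_get() {
  curl -G -H "Accept: application/json" "$ENDPOINT" \
    --data-urlencode "query=$1" \
    --data "maxQueryTimeMillis=${2:-1000}"
}

# Usage (not executed here):
#   sparql_get 'SELECT (COUNT(*) AS ?count) WHERE { ?s ?p ?o }' 1000
```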