Building a Blazegraph index for Freebase (Log Hannah 30.04.2020)
I followed the instructions from this user manual on MediaWiki and some tips from Johannes (who has done this before).
Step 1: Download and run the service
After this is done, the service is up and available under http://panarea.informatik.privat:9998/bigdata/
wget https://search.maven.org/search?q=g:org.wikidata.query.rdf%20AND%20a:service
unzip wikidata-query-service-0.3.2-dist.zip
cd service-0.3.2
vim runBlazegraph.sh
  --> edit HOST (localhost -> panarea.informatik.privat)
  --> edit PORT (9999 -> 9998)
  --> remove the two -Xlog options in the GC_LOGS definition (after the colon) in the if part
./runBlazegraph.sh > blazegraph-freebase.build-log.txt &
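One simple way to confirm that the service actually came up, assuming the URL above (the SPARQL endpoint itself is queried in Step 4):

curl -s -o /dev/null -w "%{http_code}\n" http://panarea.informatik.privat:9998/bigdata/   # should print 200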
Step 2: Divide the input data into chunks
I simply did this with split. I also tried the munge.sh script afterwards, but that didn't work for me (it created the first file in the series, but its size remained 0 forever and it never moved on; I have no idea why).
cat /nfs/datasets/clueweb_freebase_qlever/freebase-rdf-latest-english-sortable-measurements.nt \
  | split -a 3 --numeric-suffixes=1 --additional-suffix=.ttl -l 10000000 --filter='gzip > $FILE.gz' - freebase-
watch -n 1 "ls -lht freebase-*"
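A simple (if slow) way to verify that the chunks cover the whole input is to compare line counts; the two numbers should match:

zcat freebase-*.ttl.gz | wc -l
wc -l /nfs/datasets/clueweb_freebase_qlever/freebase-rdf-latest-english-sortable-measurements.nt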
Step 3: Load the data into the Blazegraph instance
vim loadData.sh
  --> edit HOST (http://localhost:9999 -> http://panarea.informatik.privat:9998)
  --> edit FORMAT (wikidump-%09d.ttl.gz -> freebase-%03d.ttl.gz)
./loadData.sh -n wdq   # the wdq namespace seems to be hard-coded, didn't bother to find out how to change it
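Loading takes a while. A rough way to monitor progress, assuming loadData.sh is run from the service-0.3.2 directory, is to watch the journal file (which holds the index, see Step 5) and the log file from Step 1:

watch -n 60 "ls -lh wikidata.jnl"
tail -f blazegraph-freebase.build-log.txt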
Step 4: Ask queries
Already after Step 1, the service is up and running under http://panarea.informatik.privat:9998/bigdata/
Here is an example of how to get a result via a GET or POST request. Note the -H option, which makes the result come back as JSON (the default is XML). Also note --data-urlencode, which allows writing the query as a plain string. The URL parameter maxQueryTimeMillis specifies a timeout (1 second in the example).
curl -H "Accept: application/json" http://panarea.informatik.privat:9998/bigdata/sparql --data-urlencode 'query=SELECT (COUNT(*) AS ?count) WHERE { ?s ?p ?o }' --data maxQueryTimeMillis=1000
Step 5: Rebuilding the index
To rebuild an index, simply kill the runBlazegraph process (Step 1) and remove the file wikidata.jnl (it contains the index). The latter is the only file that is created/changed during the index building process (I checked). Note that it is no problem for the integrity of the index to kill the runBlazegraph process and then start it again, as long as the file wikidata.jnl is left intact. However, if runBlazegraph is killed while loadData.sh is running, the batch that was being processed at the time of the killing will be lost.
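Putting Steps 1 and 3 together, a full rebuild boils down to the following sketch (it assumes you are in the service-0.3.2 directory and that runBlazegraph.sh was started as a background job of the current shell; otherwise kill it by PID):

kill %1                                                     # stop the runBlazegraph.sh job from Step 1
rm wikidata.jnl                                             # remove the index
./runBlazegraph.sh > blazegraph-freebase.build-log.txt &    # start the service again (Step 1)
./loadData.sh -n wdq                                        # reload the data (Step 3)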