= Building a Blazegraph index for Freebase (Log Hannah 30.04.2020) =

I followed the instructions on [[https://www.mediawiki.org/wiki/Wikidata_Query_Service/User_Manual#Installing|this user manual on MediaWiki]] and some tips from Johannes (who has done this before).

== Step 1: Download and run the service ==

After this is done, the service is up and available under http://panarea.informatik.privat:9998/bigdata/

{{{
wget https://search.maven.org/search?q=g:org.wikidata.query.rdf%20AND%20a:service
# (the URL above is the Maven Central search page; the dist zip is downloaded from there)
unzip wikidata-query-service-0.3.2-dist.zip
cd service-0.3.2
vim runBlazegraph.sh
--> edit HOST (localhost -> panarea.informatik.privat)
--> edit PORT (9999 -> 9998)
--> remove the two -Xlog options in the GC_LOGS definition (after the colon) in the if part
./runBlazegraph.sh > blazegraph-freebase.build-log.txt &
}}}

== Step 2: Divide the input data into chunks ==

I simply did this with split. I also tried the munge.sh script afterwards, but that didn't work for me (it created the first file in the series, but its size remained 0 forever and it never moved on; I have no idea why).

{{{
cat /nfs/datasets/clueweb_freebase_qlever/freebase-rdf-latest-english-sortable-measurements.nt \
  | split -a 3 --numeric-suffixes=1 --additional-suffix=.ttl -l 10000000 --filter='gzip > $FILE.gz' - freebase-
watch -n 1 "ls -lht freebase-*"
}}}

== Step 3: Load the data into the Blazegraph instance ==

{{{
vim loadData.sh
--> edit HOST (http://localhost:9999 -> http://panarea.informatik.privat:9998)
--> edit FORMAT (wikidump-%09d.ttl.gz -> freebase-%03d.ttl.gz)
./loadData.sh -n wdq
# the wdq namespace seems to be hard-coded, I didn't bother to find out how to change it
}}}

== Step 4: Ask queries ==

Already after Step 1, the service is up and running under http://panarea.informatik.privat:9998/bigdata/

Here is an example of how to get a result via a GET or POST request. Note the -H option, which yields the result as JSON (the default is XML). Also note the --data-urlencode option, which allows writing the query as a plain string. The parameter maxQueryTimeMillis allows specifying a timeout (1 second in this example). A sketch that reuses this query to watch the load progress is appended after Step 5.

{{{
curl -H "Accept: application/json" http://panarea.informatik.privat:9998/bigdata/sparql \
  --data-urlencode 'query=SELECT (COUNT(*) AS ?count) WHERE { ?s ?p ?o }' \
  --data maxQueryTimeMillis=1000
}}}

== Step 5: Rebuilding the index ==

To rebuild an index, simply kill the ''runBlazegraph'' process (Step 1) and remove the file ''wikidata.jnl'' (it contains the index). The latter is the only file that is created or changed during the index building process (I checked). Note that it is no problem for the integrity of the index to kill the ''runBlazegraph'' process and then start it again, as long as the file ''wikidata.jnl'' is left intact. However, if the ''runBlazegraph'' process is killed while ''loadData.sh'' is running, the batch that was being processed at the time of the kill will be lost. A minimal command sequence for a full rebuild is sketched below.
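This is only a sketch of the rebuild sequence, assuming the setup from Step 1; the pkill pattern and the location of ''wikidata.jnl'' in the service directory are assumptions, so check your own setup first.

{{{
cd service-0.3.2
pkill -f runBlazegraph                                      # stop the running service (assumed process pattern)
rm wikidata.jnl                                             # remove the old index (assumed to live here)
./runBlazegraph.sh > blazegraph-freebase.build-log.txt &    # start a fresh, empty instance
./loadData.sh -n wdq                                        # reload all chunks from Step 2
}}}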
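As a small usage follow-up to Steps 3 and 4: while ''loadData.sh'' is running, the count query from Step 4 can be polled to watch the index grow. This is only a sketch; the 10-second interval is an arbitrary choice, and counting all triples on every poll is not free.

{{{
# poll the triple count via the SPARQL endpoint from Step 4
watch -n 10 'curl -s -H "Accept: application/json" http://panarea.informatik.privat:9998/bigdata/sparql --data-urlencode "query=SELECT (COUNT(*) AS ?count) WHERE { ?s ?p ?o }"'
}}}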