Building a Blazegraph index for Freebase (Log Hannah 30.04.2020)
I followed the instructions from this user manual on MediaWiki and some tips from Johannes (who has done this before).
Step 1: Download and run the service
After this is done, the service is up and available under http://panarea.informatik.privat:9998/bigdata/
wget https://search.maven.org/search?q=g:org.wikidata.query.rdf%20AND%20a:service
unzip wikidata-query-service-0.3.2-dist.zip
cd service-0.3.2
vim runBlazegraph.sh
  --> edit HOST (localhost -> panarea.informatik.privat)
  --> edit PORT (9999 -> 9998)
  --> remove the two -Xlog options in the GC_LOGS definition (after the colon) in the if part
./runBlazegraph.sh > blazegraph-freebase.build-log.txt &
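One simple way to confirm that the service actually came up, assuming the URL above (the SPARQL endpoint itself is queried in Step 4):

curl -s -o /dev/null -w "%{http_code}\n" http://panarea.informatik.privat:9998/bigdata/   # should print 200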
Step 2: Divide the input data into chunks
I simply did this with split. I also tried the munge.sh script afterwards, but that didn't work for me (it created the first file in the series, but its size remained 0 forever and it never moved on; I have no idea why).
cat /nfs/datasets/clueweb_freebase_qlever/freebase-rdf-latest-english-sortable-measurements.nt \
  | split -a 3 --numeric-suffixes=1 --additional-suffix=.ttl -l 10000000 --filter='gzip > $FILE.gz' - freebase-
watch -n 1 "ls -lht freebase-*"
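A simple (if slow) way to verify that the chunks cover the whole input is to compare line counts; the two numbers should match:

zcat freebase-*.ttl.gz | wc -l
wc -l /nfs/datasets/clueweb_freebase_qlever/freebase-rdf-latest-english-sortable-measurements.nt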
Step 3: Load the data into the Blazegraph instance
vim loadData.sh
  --> edit HOST (http://localhost:9999 -> http://panarea.informatik.privat:9998)
  --> edit FORMAT (wikidump-%09d.ttl.gz -> freebase-%03d.ttl.gz)
./loadData.sh -n wdq   # the wdq namespace seems to be hard-coded, didn't bother to find out how to change it
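Loading takes a while. A rough way to monitor progress, assuming loadData.sh is run from the service-0.3.2 directory, is to watch the journal file (which holds the index, see Step 5) and the log file from Step 1:

watch -n 60 "ls -lh wikidata.jnl"
tail -f blazegraph-freebase.build-log.txt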
Step 4: Ask queries
Already after Step 1, the service is up and running under http://panarea.informatik.privat:9998/bigdata/
Here is an example of how to get a result via a GET or POST request. Note the -H option, which makes the result come back as JSON (the default is XML). Also note --data-urlencode, which allows writing the query as a plain string. The URL parameter maxQueryTimeMillis specifies a timeout (1 second in the example).
curl -H "Accept: application/json" http://panarea.informatik.privat:9998/bigdata/sparql --data-urlencode 'query=SELECT (COUNT(*) AS ?count) WHERE { ?s ?p ?o }' --data maxQueryTimeMillis=1000
Step 5: Rebuilding the index
To rebuild an index, simply kill the runBlazegraph process (Step 1) and remove the file wikidata.jnl (it contains the index). The latter is the only file that is created/changed during the index building process (I checked). Note that it is no problem for the integrity of the index to kill the runBlazegraph process and then start it again, as long as the file wikidata.jnl is left intact. However, if runBlazegraph is killed while loadData.sh is running, the batch that was being processed at the time of the killing will be lost.
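Putting Steps 1 and 3 together, a full rebuild boils down to the following sketch (it assumes you are in the service-0.3.2 directory and that runBlazegraph.sh was started as a background job of the current shell; otherwise kill it by PID):

kill %1                                                     # stop the runBlazegraph.sh job from Step 1
rm wikidata.jnl                                             # remove the index
./runBlazegraph.sh > blazegraph-freebase.build-log.txt &    # start the service again (Step 1)
./loadData.sh -n wdq                                        # reload the data (Step 3)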