<<TableOfContents(2)>>
= Building a Blazegraph index for Freebase (Log Hannah 30.04.2020) =
I followed the instructions from this user manual on MediaWiki and some tips from Johannes (who has done this before).
== Step 1: Download and run the service ==
Once the commands below have run, the service is up and available under http://panarea.informatik.privat:9998/bigdata/
{{{
wget https://search.maven.org/search?q=g:org.wikidata.query.rdf%20AND%20a:service
unzip wikidata-query-service-0.3.2-dist.zip
cd service-0.3.2
vim runBlazegraph.sh
--> edit HOST (localhost -> panarea.informatik.privat)
--> edit PORT (9999 -> 9998)
--> remove the two -Xlog options in the GC_LOGS definition (after the colon) in the if part
./runBlazegraph.sh > blazegraph-freebase.build-log.txt &
}}}
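To check that the endpoint actually responds, a simple ASK query can be sent. This is a minimal sketch, assuming the SPARQL endpoint lives at /bigdata/sparql (as in the query example in Step 4); right after Step 1 the store is still empty, so the answer will be false.

{{{
# Sanity check (sketch): should return a boolean SPARQL result once the service is up
curl -s http://panarea.informatik.privat:9998/bigdata/sparql \
  --data-urlencode 'query=ASK { ?s ?p ?o }'
}}}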
== Step 2: Divide the input data into chunks ==
I simply did this with split. I also tried the munge.sh script afterwards, but that didn't work for me (it created the first file in the series, but the size remained 0 forever and it didn't move on; I have no idea why).
{{{
cat /nfs/datasets/clueweb_freebase_qlever/freebase-rdf-latest-english-sortable-measurements.nt \
  | split -a 3 --numeric-suffixes=1 --additional-suffix=.ttl -l 10000000 --filter='gzip > $FILE.gz' - freebase-
watch -n 1 "ls -lht freebase-*"
}}}
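Before loading, it can be worth verifying that the chunks round-trip to the original line count. A minimal sketch, assuming zcat is available and all chunks sit in the current directory:

{{{
# Check (sketch): the concatenated chunks should have as many lines as the input file
zcat freebase-*.ttl.gz | wc -l
wc -l /nfs/datasets/clueweb_freebase_qlever/freebase-rdf-latest-english-sortable-measurements.nt
}}}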
== Step 3: Load the data into the Blazegraph instance ==
{{{
vim loadData.sh
--> edit HOST (http://localhost:9999 -> http://panarea.informatik.privat:9998)
--> edit FORMAT (wikidump-%09d.ttl.gz -> freebase-%03d.ttl.gz)
./loadData.sh -n wdq  # the wdq namespace seems to be hard-coded, didn't bother to find out how to change it
}}}
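For reference, loadData.sh appears to issue one SPARQL UPDATE LOAD request per chunk. Here is a hedged sketch of loading a single chunk by hand; it assumes the namespace endpoint is /bigdata/namespace/wdq/sparql and that Blazegraph resolves the file:// URI on the server side (the path is a placeholder):

{{{
# Sketch: load one chunk into the wdq namespace via SPARQL UPDATE
curl -X POST http://panarea.informatik.privat:9998/bigdata/namespace/wdq/sparql \
  --data-urlencode 'update=LOAD <file:///path/to/freebase-001.ttl.gz>'
}}}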
== Step 4: Ask queries ==
The service is already up and running after Step 1, under http://panarea.informatik.privat:9998/bigdata/
Here is an example of how to get a result via a GET or POST request. Note the -H option, which requests the result as JSON (the default is XML). Also note --data-urlencode, which allows writing the query as a plain string. The URL parameter maxQueryTimeMillis specifies a timeout (1 second in the example below).
{{{
curl -H "Accept: application/json" http://panarea.informatik.privat:9998/bigdata/sparql \
  --data-urlencode 'query=SELECT (COUNT(*) AS ?count) WHERE { ?s ?p ?o }' \
  --data maxQueryTimeMillis=1000
}}}
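To extract just the value from the JSON response, the standard SPARQL 1.1 JSON results layout (results.bindings) can be parsed, e.g. with jq. A minimal sketch, assuming jq is installed:

{{{
# Sketch: print only the count from the JSON response
curl -s -H "Accept: application/json" http://panarea.informatik.privat:9998/bigdata/sparql \
  --data-urlencode 'query=SELECT (COUNT(*) AS ?count) WHERE { ?s ?p ?o }' \
  | jq -r '.results.bindings[0].count.value'
}}}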