Log of joint meeting of Johannes and Hannah to install Virtuoso and build an index for Freebase Easy
Build an index
Pull the latest docker image, run Virtuoso in the background, and isql in the foreground as follows (on galera).
cd /local/data/virtuoso
docker pull openlink/virtuoso-opensource-7
docker run -dt -e DBA_PASSWORD=dba -p 1111:1111 -p 8890:8890 -v $(pwd):/database --name virtuoso openlink/virtuoso-opensource-7
sudo vim virtuoso.ini [comment in high-memory settings for NumberOfBuffers and MaxDirtyBuffers]
docker exec -i virtuoso isql 1111
SQL> ld_dir('.', 'fbeasy.clean.ttl', 'https://fbeasy.cs.uni-freiburg.de');
SQL> rdf_loader_run();
SQL> checkpoint;
To load multiple files, just use a wildcard pattern in the second argument of ld_dir, for example: *.ttl.
Note that the checkpoint; instruction at the end is needed to keep the loaded data after a container restart.
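The three load statements can be generated for any file pattern with a small shell helper. This is only a sketch: the function name bulk_load_sql is made up for illustration, and it does nothing but print the statements, so the output can be checked before it is sent to isql.

```shell
# Hypothetical helper: print the bulk-load statements for a file pattern.
# Arguments: directory (as seen inside the container), file pattern, graph IRI.
bulk_load_sql() {
  dir=$1
  pattern=$2
  graph=$3
  printf "ld_dir('%s', '%s', '%s');\n" "$dir" "$pattern" "$graph"
  printf 'rdf_loader_run();\n'
  printf 'checkpoint;\n'
}

# Print the statements for all Turtle files in the current directory.
bulk_load_sql . '*.ttl' 'https://fbeasy.cs.uni-freiburg.de'
```

Assuming isql reads statements from standard input, the output could be piped into "docker exec -i virtuoso isql 1111" to run the load non-interactively. We did not try this during the meeting.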
The instructions for bulk loading follow http://vos.openlinksw.com/owiki/wiki/VOS/VirtBulkRDFLoader . The docker image is described under https://hub.docker.com/r/openlink/virtuoso-opensource-7 .
Bulk load started on 19-01-2021 at 01:00 CET. At 03:30 CET it was at 150M triples -> estimated total loading time 5 hours.
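The estimate is a simple linear extrapolation. As a sketch (the helper name is made up, and the total of 300M triples is an assumption, namely what a 5 hour estimate implies at the observed rate of 150M triples in 150 minutes):

```shell
# Hypothetical helper: estimate the total loading time in minutes from the
# number of triples loaded so far, the minutes elapsed, and the expected total.
estimate_total_minutes() {
  loaded=$1
  elapsed_min=$2
  total=$3
  # total / (loaded / elapsed) rewritten for integer arithmetic
  echo $(( total * elapsed_min / loaded ))
}

# 150M triples after 150 minutes (01:00 to 03:30), assumed total of 300M triples.
estimate_total_minutes 150000000 150 300000000   # prints 300, i.e. 5 hours
```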
The Virtuoso SPARQL editor is then available under http://galera:8890/sparql . A good check whether everything works is always the following query, which returns the number of triples. Note that this also works during bulk upload and can be used to track the progress of the upload.
SELECT COUNT(*) WHERE { ?s ?p ?o }
Start the server with an existing index
In the folder with the virtuoso.db, run the following:
docker run -dt -e DBA_PASSWORD=dba -p 1111:1111 -p 8890:8890 -v $(pwd):/database --name virtuoso openlink/virtuoso-opensource-7
The Virtuoso SPARQL editor is then available under http://galera:8890/sparql . The endpoint can also be queried from the command line like this:
curl -Gs -H "Accept: application/json" http://galera:8890/sparql --data-urlencode "query=SELECT (COUNT(*) AS ?count) WHERE { ?s ?p ?o }"
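The response is a JSON document in the SPARQL results format, so the count has to be picked out of it. A minimal sketch follows, with made-up function names and a made-up sample response. Plain grep is used to avoid extra dependencies; a real JSON parser would be more robust.

```shell
# Hypothetical helper: pull the first integer "value" out of a SPARQL JSON
# result read from stdin.
extract_count() {
  grep -o '"value" *: *"[0-9]*"' | head -n 1 | grep -o '[0-9][0-9]*'
}

# Hypothetical wrapper around the curl call above (defined, but not invoked here).
triple_count() {
  curl -Gs -H "Accept: application/json" http://galera:8890/sparql \
    --data-urlencode "query=SELECT (COUNT(*) AS ?count) WHERE { ?s ?p ?o }" \
    | extract_count
}

# Offline demonstration on a made-up response in the SPARQL JSON results shape.
sample='{"head":{"vars":["count"]},"results":{"bindings":[{"count":{"type":"typed-literal","datatype":"http://www.w3.org/2001/XMLSchema#integer","value":"150000000"}}]}}'
printf '%s\n' "$sample" | extract_count
```

During a bulk load, something like "while sleep 60; do triple_count; done" should then print the progress once per minute.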
Explain
Here is an example. The compilation and execution plan is similar to what the QLever UI shows when clicking on "Analyze".
docker exec -i virtuoso isql 1111
SQL> SET SPARQL_TRANSLATE ON;
SQL> PREFIX fbe: <> SELECT ?p (COUNT(?entity) AS ?count) WHERE { ?entity fbe:is-a <Person> . ?entity ?p ?o } GROUP BY ?p ORDER BY DESC(?count) LIMIT 10;
SQL> SET SPARQL_TRANSLATE OFF;
... will show the corresponding SQL query ...
SQL> SET EXPLAIN ON;
SQL> ... copy&paste SQL query ...
... will show the compilation plan ...
SQL> SET EXPLAIN OFF;
SQL> SET PROFILE ON;
SQL> ... copy&paste SQL query ...
... will show the compilation and execution plan ...
SQL> SET PROFILE OFF;
Without docker (first trial, which eventually failed)
Installation
We followed the instructions on http://vos.openlinksw.com/owiki/wiki/VOS/VOSUbuntuNotes . The following command worked on galera (Ubuntu 18.04). During the installation I entered a password, as strongly recommended in the instructions (which say that it will not work otherwise). A message then reported that something went wrong with storing the password, but the subsequent steps worked anyway.
sudo apt install virtuoso-opensource
After the installation, the web page is immediately live under http://galera:8890
Index Build
On the console, just type the following (after copying the TTL file to the respective location):
isql-vt
SQL> DB.DBA.TTLP_MT (file_to_string_output ('/local/data/virtuoso/fbeasy.clean.ttl'), '', 'http://freebase-easy.cs.uni-freiburg.de');
The index build ran fine for a while, at a rate of 0.5M to 1M triples per minute, but then it stalled, galera:8890 became unresponsive, and I could not stop the server from the command line either. Maybe the reason was simply that I had not set NumberOfBuffers and MaxDirtyBuffers to higher values in virtuoso.ini. In any case, I then found the docker image on Docker Hub and switched to that, see above.
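For orientation, here is a rough way to derive values for the two settings from the available RAM. The rule of thumb used (about two thirds of the RAM in buffers of 8 KB each, and about three quarters of those buffers allowed to be dirty) is an assumption, and the helper name is made up. The commented-out presets that ship in virtuoso.ini should be preferred where they fit.

```shell
# Rule-of-thumb sketch (assumption, not taken from our setup):
# use about 2/3 of the RAM for buffers of roughly 8000 bytes each,
# and allow about 3/4 of those buffers to be dirty.
suggest_buffers() {
  ram_gb=$1
  buffers=$(( ram_gb * 1000000000 * 2 / 3 / 8000 ))
  dirty=$(( buffers * 3 / 4 ))
  printf 'NumberOfBuffers = %d\nMaxDirtyBuffers = %d\n' "$buffers" "$dirty"
}

# Suggested values for a machine with 8 GB of RAM.
suggest_buffers 8
```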
NOTE 1: The TTL file was produced from fbeasy.ttl with the command below, which drops all lines that contain control characters (other than tab) and all lines longer than 1000 characters. Both kinds of lines make the load fail. Control characters give the error message "Error 37000: [Virtuoso Driver][Virtuoso Server]SP029: TURTLE RDF loader, line 5684442: Invalid characters in angle-bracketed name". URIs that are longer than 1900 bytes give the error message "Error 23000: [Virtuoso Driver][Virtuoso Server]SR133: Can not set NULL to not nullable column".
grep -Pv "[\x00-\x08\x0a-\x1f]" fbeasy.ttl | awk 'length($0) <= 1000' > fbeasy.clean.ttl
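To see what the filter above does, it can be tried on a few sample lines. This is a sketch with a made-up function name and made-up input, and it assumes a grep with -P support (GNU grep).

```shell
# The filter from above, wrapped in a function that reads stdin (hypothetical name).
clean_ttl() {
  grep -Pv "[\x00-\x08\x0a-\x1f]" | awk 'length($0) <= 1000'
}

# Three sample lines: a normal one, one containing a control character
# (octal 001), and one with 1001 characters. Only the first one survives.
long_line=$(awk 'BEGIN { for (i = 0; i < 1001; i++) printf "x" }')
printf '<a> <b> <c> .\n<a\001> <b> <c> .\n%s\n' "$long_line" | clean_ttl
```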
NOTE 2: We first tried isqlw-vt (the Unicode-enabled variant), but that always returned an obscure error about a hostname that could not be resolved.