Diff for "Projects/Broccoli" - AD Research Wiki

Differences between revisions 42 and 50 (spanning 8 versions)

TODO: write a short overview of the parts of this page, so that one knows where to jump / read for a particular use case.

Contents

Current Broccoli version
1. Start the service on elba
Code
1. Compilation
Building an Index
Set up a Broccoli instance for different data
Image Cache
Image Service
Mediator Only Index (CIKM)

Current Broccoli version

Built by Björn at the beginning of August 2016 from the Wikipedia version of August 2016 (2.8B postings) and the latest Freebase dump (freebase-rdf-latest, 372M statements extracted à la Freebase Easy).

Start the service on elba

cd /local/data2/broccoli/16-09-26
./start.sh

The script defines all files as variables. Changing something (e.g. other custom scores or triple scores) should be self-explanatory.

For reference (btw: I have no idea how to write proper shell scripts, don't judge):

#!/bin/bash

# Directory with the server binary and all index files (trailing slash required).
dir='/local/data2/broccoli/16-09-26/'
binary='ServerMain.2017-02-13'

# Base name of the index; the ontology files use the same name plus a suffix.
index='semantic-wikipedia-full-jul16'
kbsuffix='-ontology'
kb="$index$kbsuffix"
port='6002'

stopwords='semantic-wikipedia.stop-words'
triplescores='triple-scores.14oct'
mapping='semantic-wikipedia-full-jul16-ontology.url-mapping'
customscores='custom-scores.txt'

# One log file per start date.
date=$(date +%Y-%m-%d)
log="server-log.$date"

call="$dir$binary"
args=(-p "$port" -o "$dir$kb" -m "$dir$mapping" -t "$dir$triplescores" -s "$dir$stopwords" -c "$dir$customscores" "$dir$index")

echo "Starting broccoli instance and writing output into $dir$log"
echo "Starting now will be available within the minute."
# Run the server in the background; its standard output goes to the log file.
"$call" "${args[@]}" > "$dir$log" &
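If the instance is ever started from a Python tool instead of start.sh, the same argument list can be assembled as in the following sketch. The flags and file names are the ones from the script above; the function names build_server_command and log_file_name are made up for illustration and do not exist in the repository.

```python
from datetime import date
from pathlib import PurePosixPath


def build_server_command(data_dir, binary, index, port, stopwords,
                         triplescores, mapping, customscores,
                         kb_suffix="-ontology"):
    """Assemble the ServerMain argument list in the same order as start.sh.

    Hypothetical helper: only the flags (-p, -o, -m, -t, -s, -c) and the
    trailing index argument are taken from the script on this page.
    """
    base = PurePosixPath(data_dir)
    return [
        str(base / binary),
        "-p", str(port),
        "-o", str(base / (index + kb_suffix)),
        "-m", str(base / mapping),
        "-t", str(base / triplescores),
        "-s", str(base / stopwords),
        "-c", str(base / customscores),
        str(base / index),
    ]


def log_file_name(day):
    """Return the per-day log file name used by start.sh (server-log.YYYY-MM-DD)."""
    return "server-log." + day.isoformat()
```

The returned list can be handed to subprocess.Popen together with an open log file; that part is left out here because it depends on how the instance should be supervised.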

Code

Code is in https://ad-websvn.informatik.uni-freiburg.de/broccoli/. The Code for CSD is in the subfolder: https://ad-websvn.informatik.uni-freiburg.de/broccoli/nlp/

Compilation

make all -j

Ignore possible lint problems.

Building an Index

First, obtain a Wikipedia XML and a Freebase dump (usually stored in /nfs/raid5/broccoli/...) and make sure the correct files are referenced in broccoli/Makefile and broccoli/freebase/Makefile. Give your index a proper name using the variable DBTAIL in broccoli/Makefile.

Create ontology.txt

If you use an existing RDF3X DB (as is usually the case), make sure it is referenced correctly in broccoli/freebase/Makefile and run only the following (inside the broccoli folder):

    make get-freebase-ontology

Otherwise run:

    make -C freebase/ build-db
    make get-freebase-ontology

Create cas0.zip

This is the first of two parts of our UIMA Chain. For an explanation see Broccoli Uima. In this first step, the Wikipedia XML is parsed (from XML to a UIMA model), text is tokenized and parse trees are constructed. Since running an off-the-shelf parser is computationally expensive, this part uses the asynchronous scale-out and is run on many of our machines.

Make sure all paths are set correctly in paths.mak

On FILICUDI (!) run the following in the broccoli/ folder. If you want to start the broker elsewhere, the Makefile has to be adjusted.

    make deploy-broker

Then, on the server you want to have the reader and writer (the "main" part), run

    make deploy-reader

To get things going, run the following on as many PCs (and servers) as possible:

    make deploy-senna

Create a broccoli index

Make sure to copy/move the cas0.zip you built to the expected location, or reference it in broccoli/Makefile, then run:

    make build-txt build-index

The first target calls the second part of our UIMA chain. It performs entity recognition and our NLP (CSD) and then writes a words- and a docs-file. The second target builds all necessary binary indices (and files like vocabularies) from that words- and docs-file.

Start the server

    make start PORT=<PORT>

Set up a Broccoli instance for different data

With parts of our chain

Understand the process above and the UIMA framework (see official documentation) and modify accordingly.

From scratch

There is now a special folder broccoli/example-data with files that can be used as a blueprint for your own data.

Produce the following files:

1. A Knowledge Base file, name: $(DBTAIL)-ontology.txt

Tab-separated, one line per triple: subject<TAB>predicate<TAB>object<TAB>. For example broccoli/example-data/example-ontology.txt:

Björn Buchhold  is-a    PhD Student     .
Björn Buchhold  is-a    Person  .
Björn Buchhold  Country of nationality  Germany .

Values use XML Schema notation, e.g.:

Some Song    Length    "19.0"^^<http://www.w3.org/2001/XMLSchema#float>    .

IMPORTANT: The name has to end with -ontology.txt and there has to be at least one triple with a relation is-a and one triple with some other relation (for historical, technical reasons).
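A minimal sketch of how such ontology lines could be produced and checked from Python. The helper names format_triple, float_literal and check_relations are hypothetical; only the line layout, the XML Schema float notation and the is-a rule are taken from the description above.

```python
def format_triple(subject, predicate, obj):
    """Return one ontology line: subject<TAB>predicate<TAB>object<TAB>."""
    for field in (subject, predicate, obj):
        if "\t" in field or "\n" in field:
            raise ValueError("fields must not contain tabs or newlines: %r" % field)
    return "\t".join((subject, predicate, obj, "."))


def float_literal(value):
    """Render a float value in the XML Schema notation shown above."""
    return '"%s"^^<http://www.w3.org/2001/XMLSchema#float>' % value


def check_relations(triples):
    """Check the rule above: at least one is-a triple and one other relation.

    triples is an iterable of (subject, predicate, object) tuples.
    """
    predicates = {predicate for _, predicate, _ in triples}
    return "is-a" in predicates and len(predicates - {"is-a"}) >= 1
```

Running check_relations before writing the file catches the "only is-a triples" case early, which otherwise only shows up when the index is built.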

2. A file with text postings, named: $(DBTAIL).words-by-contexts.txt

Tab-separated, one line per posting: word<TAB>contextId<TAB>score<TAB>position. Entities use underscores for spaces and are prefixed with :e:, e.g.:

this                    1       1       0
is                      1       1       1
just                    1       1       2
an                      1       1       3
example                 1       1       4
I                       2       5       0
:e:Björn_Buchhold       2       5       0
hope                    2       1       1
it                      2       1       2
helps                   2       1       3
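The posting lines can be generated with a few lines of Python. This is only a sketch: the function names are made up, and the default score of 1 is an assumption (the example above also uses a score of 5 for some postings). Note that an entity posting is written in addition to the word posting and shares its position, as for "I" and :e:Björn_Buchhold above.

```python
def entity_token(name):
    """Turn an entity name into its wordsfile form, e.g. ':e:Björn_Buchhold'."""
    return ":e:" + name.replace(" ", "_")


def format_posting(word, context_id, score, position):
    """Return one posting line: word<TAB>contextId<TAB>score<TAB>position."""
    return "\t".join((word, str(context_id), str(score), str(position)))


def postings_for_context(context_id, tokens, default_score=1):
    """Yield one posting line per token, with positions counted from 0.

    Assumption: every token gets the same score. Entity postings have to be
    added separately at the position of the word that mentions the entity.
    """
    for position, word in enumerate(tokens):
        yield format_posting(word, context_id, default_score, position)
```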

3. A file with info to display for hits, named: $(DBTAIL).docs-by-contexts.txt

Tab-separated, one line per context. contextId<TAB>URL<TAB>Title<TAB>Text to display (positions from wordsfile are separated by @@)<TAB>Which positions belong to the context (used for grey vs black highlighting in the UI)<TAB>context-range (in terms of position for grey/black highlighting in the UI)<TAB>parse-tree<TAB>all-context-boundaries-in-sentence.

The last three columns can be filled with dummy/empty information (contexts 0-maxpos, empty parse tree). The number of tabs must remain the same, though. The UI may need to be adjusted to cope with an empty parse tree.

See example.docs-by-contexts.txt for something that is rather easy to understand (context 1 has positions 0-5, context 2 has positions 0-3):

1       Example_Document        http://example.com      This@@ is@@ just@@ an@@ example@@.      0-5     NoParseTree     0-5
1       Example_Document        http://example.com      I@@ hope@@ it@@ helps@@.        0-3     NoParseTree     0-3

For real-world data with parse-tree information, this can get quite complex (see first sentence in the normal Broccoli data):

1       http://en.wikipedia.org/wiki/Alain_Connes       Alain Connes    Alain Connes@@ (@@;@@ born@@ 1@@ April@@ 1947@@)@@ is@@ a@@ French@@ mathematician@@,@@ currently@@ Professor@@ at@@ the@@ Collège de France@@,@@ IHÉS@@,@@ The Ohio State University@@ and@@ Vanderbilt University@@.  0-1,7-11,13-17  0_Alain_NNP_*_(S1,(S,(NP_(ENUM,(C,(CH 0_Connes_NNP_*_)_) 1_(_*_*_*_* 2_;_:_*_(S1,(NP_(C* 3_born_VBN_*_(NP_* 4_1_CD_*_*_* 5_April_NNP_*_*_* 6_1947_NN_*_),),)_) 7_)_*_*_*_* 8_is_VBZ_*_(VP_* 9_a_DT_*_(NP,(NP_* 10_French_JJ_*_*_* 11_mathematician_NN_*_)_* 12_,_,_*_*_* 13_currently_RB_*_(ADVP,)_* 14_Professor_NNP_*_(NP,(NP,)_* 15_at_IN_*_(PP_* 16_the_DT_*_(NP,(NP_(ENUM,(C 17_Collège_NNP_*_*_* 17_de_IN_*_*_* 17_France_NNP_*_)_) 18_,_,_*_*_* 19_IHÉS_NNP_*_(NP,)_(C,) 20_,_,_*_*_* 21_The_DT_*_(NP_(C 21_Ohio_NNP_*_*_* 21_State_NNP_*_*_* 21_University_NNP_*_)_) 22_and_CC_*_*_* 23_Vanderbilt_NNP_*_(NP_(C 23_University_NNP_*_),),),),),)_),) 24_._._*_),)_),)    0-1,7-11,13-17;0-1,7-11,13-15,19-19;0-1,7-11,13-15,21-21;0-1,7-11,13-15,23-23;0-0,3-6
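Since the number of tabs must be the same on every line, a quick consistency check before building the index can save time. The following is a sketch with a hypothetical function name; it deliberately does not hard-code the number of columns and only checks that all lines agree with the first one and start with a numeric context id.

```python
def check_docs_lines(lines):
    """Return a list of problems found in docs-by-contexts lines (empty if none).

    Assumptions: the first line defines the expected number of tab-separated
    fields, and the first field of every line is a numeric context id.
    """
    problems = []
    expected_fields = None
    for number, line in enumerate(lines, start=1):
        # Only strip the newline, so that empty trailing columns are kept.
        fields = line.rstrip("\n").split("\t")
        if expected_fields is None:
            expected_fields = len(fields)
        elif len(fields) != expected_fields:
            problems.append("line %d: %d fields, expected %d"
                            % (number, len(fields), expected_fields))
        if not fields[0].isdigit():
            problems.append("line %d: first field is not a context id" % number)
    return problems
```

Usage: open the docs file with encoding="utf-8" and pass the file object directly, e.g. check_docs_lines(open(path, encoding="utf-8")).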

4. Create the following empty files

Fill them with actual data if you have it available. Broccoli also works with empty files (see example-data).

touch $(DBTAIL)-ontology.entity-scores.noabs
touch $(DBTAIL)-ontology.name-mapping
touch $(DBTAIL)-ontology.reverse-relations
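When the data files are produced by a Python script anyway, the empty files can be created in the same run. A small sketch; the function name and the constant are hypothetical, the three suffixes are the ones from the touch commands above.

```python
from pathlib import Path

# Suffixes of the three auxiliary files listed above.
AUXILIARY_SUFFIXES = (
    "-ontology.entity-scores.noabs",
    "-ontology.name-mapping",
    "-ontology.reverse-relations",
)


def create_empty_auxiliary_files(data_directory, dbtail):
    """Create the auxiliary files if they are missing and return their paths.

    Like touch, this never truncates a file that already has content.
    """
    created = []
    for suffix in AUXILIARY_SUFFIXES:
        path = Path(data_directory) / (dbtail + suffix)
        path.touch(exist_ok=True)
        created.append(path)
    return created
```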

Build an index

Call make build-index and set the variables DATA_DIRECTORY and DBTAIL:

make build-index DATA_DIRECTORY=/home/buchholb/broccoli/example-data DBTAIL=example

Start a server instance

Call make start and set the variables PORT, DATA_DIRECTORY and DBTAIL:

make start PORT=6001 DATA_DIRECTORY=/home/buchholb/broccoli/example-data DBTAIL=example

Deploy a user interface (to a tomcat webserver)

Go to

cd broccoli/userinterface

Edit the following files to match your data (adjust: instance name, Tomcat location, maybe username+password, server, port):

vim build.properties
vim war/server_properties.xml

Compile and deploy:

ant build war deploy

Testing if everything works

If you used the files in example-data, a query for PhD-Student occurs-with helps should yield a result.

Image Cache

The image cache service of the current instance (on elba) runs on filicudi. It is located under /var/www/freebase-imgsvc. It is simply a checkout of http://ad-svn.informatik.uni-freiburg.de/broccoli/freebase-imgsvc/.

The actual directory where the images are cached is located at /nfs/raid5/broccoli/freebase-thumb-cache.

NEW 22-03-2017: Robin Krahl has written a new version of the script wpthumbsvc.php that asks the Mediawiki API of the English Wikipedia. The script greps the file /nfs/raid5/broccoli/freebase-thumb-cache.mid-to-wikipedia.unique-mids, which contains the last mid from the file freebase-thumb-cache.mid-to-wikipedia and the corresponding Wikipedia name (this is usually the canonical Wikipedia name of the entity).

To add individual images (e.g. for demos), use the script below. It needs access to the raid so that it can write to the cache folder, and the image URL has to have a file extension for convert to work. It was tested on filicudi; it does not work on stromboli because the code requires Python version >= 3.3.

python3 ~/broccoli/img-hack/image_to_cache.py --mid <MID> --img 'http://...'
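The two preconditions mentioned above (file extension in the image URL, Python >= 3.3) can be checked before calling the script. This is a sketch with hypothetical function names; it is not part of image_to_cache.py.

```python
import os.path
import sys
from urllib.parse import urlparse


def has_file_extension(image_url):
    """Return True if the path part of the URL ends in a file extension.

    Only the path is inspected, so a query string such as ?size=200 is ignored.
    """
    path = urlparse(image_url).path
    return os.path.splitext(path)[1] != ""


def python_is_recent_enough(version_info=sys.version_info):
    """Return True if the interpreter is at least Python 3.3."""
    return tuple(version_info[:2]) >= (3, 3)
```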

Image Service

Florian's code has a mechanism for removing outdated images. It also removes images from the cache whose lookup now returns a 404 Not Found, which effectively removes all images from the cache after the shutdown of the Freebase API. This should be corrected. Here is the guilty piece of code from https://ad-websvn.informatik.uni-freiburg.de/broccoli/freebase-imgsvc/fbthumbsvc.php:

// If no image could be found (404 error) create a 404 cache file for the
// current id, return a 404 error and end the script.
if ($return_status_code == 404)
{
  // If there still was an expired cache file then remove it now!
  if ($cachefile_exists)
  {
      unlink($cachefile_path);
  }

  // Create a 404 cache file for the current id.
  touch($cachefile_path . '_404');
 
  returnMissingError();
}
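One possible correction, sketched in Python rather than PHP to show only the decision logic: on a 404, an existing cache file is kept and served instead of being unlinked, and a _404 marker is written only when nothing is cached. This is an assumption about how the behaviour could be fixed, not the change that was actually made to fbthumbsvc.php; the function name and return values are hypothetical.

```python
from pathlib import Path


def handle_not_found(cachefile_path):
    """Decide what to do when the upstream lookup returns a 404.

    Returns 'serve-cached' if a (possibly expired) cache file exists and
    'missing' otherwise. Unlike the PHP code above, the cache file is
    never deleted here.
    """
    cachefile = Path(cachefile_path)
    if cachefile.exists():
        # Keep the old image: the upstream API may be gone for good.
        return "serve-cached"
    # Nothing cached: remember the miss with a _404 marker file.
    Path(str(cachefile) + "_404").touch()
    return "missing"
```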

Mediator Only Index (CIKM)

An index that contains mediators (used for the CIKM presentation) is available in /nfs/raid5/haussmae/demos/broccoli_mediators_no_text.

To start it (on filicudi, port 7099; should work as any user that can read the files):

/home/haussmae/demos/broccoli_mediators_no_text/ServerMain \
    -p 7099 \
    -o /home/haussmae/demos/broccoli_mediators_no_text/semantic-wikipedia-scientists-ontology \
    -s /home/haussmae/demos/broccoli_mediators_no_text/semantic-wikipedia.stop-words \
    /home/haussmae/demos/broccoli_mediators_no_text/semantic-wikipedia-scientists \
    -m /home/haussmae/demos/broccoli_mediators_no_text/semantic-wikipedia-scientists-ontology.url-mapping

The user interface for backend filicudi:7099 is available at http://filicudi.informatik.uni-freiburg.de:6222/BroccoliCIKM (no UI hack) and http://filicudi.informatik.uni-freiburg.de:6222/BroccoliCIKM2 (UI hack). The UI hack makes specific mediator names readable in the query graph (and only there). The hack adjusts the nameLabel variable in the file src/de/uni/freiburg/broccoli/client/ui/BreadcrumbLabel.java of userinterface (in the broccoli repository).

-  ⇤ ← Revision 42 as of 2017-03-22 17:32:07 → 
  Size: 10350
  Editor: Hannah Bast
  Comment:
+   ← Revision 50 as of 2017-04-05 12:03:46 → ⇥
  Size: 11889
  Editor: Björn Buchhold
  Comment:
-Deletions are marked like this.
+Additions are marked like this.
 Line 1:
-= Broccoli =

<<TableOfContents(1)>>



== Code ==

Code is in [[https://ad-websvn.informatik.uni-freiburg.de/broccoli/]].
The Code for CSD is in the subfolder: [[https://ad-websvn.informatik.uni-freiburg.de/broccoli/nlp/]]

=== Compilation ===

{{{
make all -j
}}}

Ignore possible lint problems.
+TODO: write some text that explains the contents parts of this section, so that one knows where to jump / read for a particular use case.

<<TableOfContents(3)>>
-Line 22:
+Line 8:
-Built by Björn beginning of August 2016 (TODO: copy to elba), Wikipedia version from August 2016 (2.8B postings), latest Freebase dump (freebase-rdf-latest, 372M statements extracted a la Freebase Easy).
+Built by Björn beginning of August 2016, Wikipedia version from August 2016 (2.8B postings), latest Freebase dump (freebase-rdf-latest, 372M statements extracted a la Freebase Easy).
-Line 65:
+Line 51:
-== UIMA Chain ==

For an explanation see [[Broccoli Uima]]
+== Code ==

Code is in [[https://ad-websvn.informatik.uni-freiburg.de/broccoli/]].
The Code for CSD is in the subfolder: [[https://ad-websvn.informatik.uni-freiburg.de/broccoli/nlp/]]

=== Compilation ===

{{{
make all -j
}}}

Ignore possible lint problems.
-Line 71:
+Line 67:
-First, obtain a Wikipedia XML and a Freebase dump (usually stored in /nfs/raid5/broccoli/...) and make sure these correct files are referenced in broccoli/Makefile and broccoli/freebase/Makefile.
+First, obtain a Wikipedia XML and a Freebase dump (usually stored in /nfs/raid5/broccoli/...) and make sure the correct files are referenced in broccoli/Makefile and broccoli/freebase/Makefile.
-Line 89:
+Line 85:
+This is the first of two parts of our UIMA Chain.
For an explanation see [[Broccoli Uima]].
In this first step, the Wikipedia XML is parsed (from XML to a UIMA model), text is tokenized and parse trees are constructed.
Since running an off-the-shelf parser is computationally expensive, this part uses the asynchronous scale-out and is run on many of our machines.
-Line 91:
+Line 93:
-on FILICUDI!! run (in the broccoli fodler)
+on FILICUDI!! (if you want to start the broker elsewhere, the Makefile has to be adjusted),
run in the broccoli/ folder:
-Line 97:
+Line 100:
-Then, on the server you want to have the reader and writer, run
+Then, on the server you want to have the reader and writer (the "main" part), run
-Line 115:
+Line 118:
+The first target calls the second part of our UIMA chain. It performs entity recognition and our NLP (CSD) and then writes a words- and a docs-file.
The second target builds all necessary binary indices (and files like vocabularies) from that words- and docs-file.
-Line 238:
+Line 243:
+The image cache service of the current instance (on elba) runs on ''filicudi''. It is located under ''/var/www/freebase-imgsvc''. It is simply a checkout of http://ad-svn.informatik.uni-freiburg.de/broccoli/freebase-imgsvc/.

The actually directory where the images are cached is located at ''/nfs/raid5/broccoli/freebase-thumb-cache''.

NEW 22-03-2017: [[https://docs.google.com/document/d/1yGYNuEYb1jvWK-qrRp2io1kb6LE6urNwABuc3oplGFE|Robin Krahl]] has written a new version of the script ''wpthumbsvc.php'' that asks the [[https://en.wikipedia.org/w/api.php|Mediawiki API of the English Wikipedia]]. The script greps the file ''/nfs/raid5/broccoli/freebase-thumb-cache.mid-to-wikipedia.unique-mids'', which contains the last mid from the file ''freebase-thumb-cache.mid-to-wikipedia'' and the corresponding Wikipedia name (this is usally the canonical Wikipedia name of the entity).
-Line 247:
+Line 261:
-!HiWi project Kai Haase: see [[https://docs.google.com/document/d/1xVXSGWG9kB92LAfsbdq7-0p2fHbJFYbv1fKPqqq_eGQ|Google Doc]].

TODO: Florians code has a mechanism for removing outdated images, which also removes images in the cache which now return a 404 not found (which effectively removes all images from the cache after the shutdown of the Freebase API). This should be corrected. Here is the guilty piece of code from https://ad-websvn.informatik.uni-freiburg.de/broccoli/freebase-imgsvc/fbthumbsvc.php:
+Florians code has a mechanism for removing outdated images, which also removes images in the cache which now return a 404 not found (which effectively removes all images from the cache after the shutdown of the Freebase API). This should be corrected. Here is the guilty piece of code from https://ad-websvn.informatik.uni-freiburg.de/broccoli/freebase-imgsvc/fbthumbsvc.php: