= Broccoli =

== Code ==
Code is in [[https://ad-websvn.informatik.uni-freiburg.de/broccoli/]]. The code for CSD is in the subfolder [[https://ad-websvn.informatik.uni-freiburg.de/broccoli/nlp/]].

=== Compilation ===
{{{
make all -j
}}}
Ignore possible lint problems.

== Current Broccoli version ==
Built by Björn at the beginning of August 2016 (TODO: copy to elba), Wikipedia version from August 2016 (2.8B postings), latest Freebase dump (freebase-rdf-latest, 372M statements extracted à la Freebase Easy).

== UIMA Chain ==
For an explanation see [[Broccoli Uima]].

== Building an Index ==
First, obtain a Wikipedia XML dump and a Freebase dump (usually stored in /nfs/raid5/broccoli/...) and make sure the correct files are referenced in broccoli/Makefile and broccoli/freebase/Makefile. Give your index a proper name using the variable DBTAIL in broccoli/Makefile.

=== Create ontology.txt ===
If you use an existing RDF3X DB (as is usually the case), make sure it is referenced correctly in broccoli/freebase/Makefile and only run (inside the broccoli folder):
{{{
make get-freebase-ontology
}}}
Otherwise run:
{{{
make -C freebase/ build-db
make get-freebase-ontology
}}}

=== Create cas0.zip ===
Make sure all paths are set correctly in paths.mak on FILICUDI! Run (in the broccoli folder):
{{{
make deploy-broker
}}}
Then, on the server that should host the reader and writer, run:
{{{
make deploy-reader
}}}
To get things going, run the following on as many PCs (and servers) as possible:
{{{
make deploy-senna
}}}

=== Create a broccoli index ===
Either copy/move the cas0.zip you built or reference its location in broccoli/Makefile, then run:
{{{
make build-txt build-index
}}}

=== Start the server ===
{{{
make start PORT=
}}}

== Set up a Broccoli instance for different data ==

=== With parts of our chain ===
Understand the process above and the UIMA framework (see official documentation) and modify accordingly.

=== From scratch ===
The folder broccoli/example-data contains files that can be used as a blueprint for your own data.

==== Produce the following files: ====
''' 1. A knowledge base file, named: $(DBTAIL)-ontology.txt '''

Tab-separated, one line per triple: subject<TAB>predicate<TAB>object. For example, broccoli/example-data/example-ontology.txt:
{{{
Björn Buchhold is-a PhD Student .
Björn Buchhold is-a Person .
Björn Buchhold Country of nationality Germany .
}}}
Values use XML Schema notation, e.g.:
{{{
Some Song Length "19.0"^^ .
}}}
IMPORTANT: The name has to end with -ontology.txt, and there has to be at least one triple with the relation is-a and one triple with some other relation (for historical, technical reasons).

''' 2. A file with text postings, named: $(DBTAIL).words-by-contexts.txt '''

Tab-separated, one line per posting: word<TAB>contextId<TAB>score<TAB>position. Entities use underscores for spaces and are prefixed with :e:, e.g.:
{{{
this 1 1 0
is 1 1 1
just 1 1 2
an 1 1 3
example 1 1 4
I 2 5 0
:e:Björn_Buchhold 2 5 0
hope 2 1 1
it 2 1 2
helps 2 1 3
}}}

''' 3. A file with info to display for hits, named: $(DBTAIL).docs-by-contexts.txt '''

Tab-separated, one line per context, with the following columns:
 1. contextId
 1. URL
 1. Title
 1. Text to display (positions from the words file are separated by @@)
 1. Which positions belong to the context (used for grey vs. black highlighting in the UI)
 1. parse-tree
 1. all-context-boundaries-in-sentence

The last two columns can be filled with dummy/empty information (contexts 0-maxpos, empty parse tree). The number of tabs must remain the same, though. The UI may need to be adjusted to cope with an empty parse tree. See example.docs-by-contexts.txt for something that is rather easy to understand (context 1 has positions 0-5, context 2 has positions 0-3):
{{{
1 Example_Document http://example.com This@@ is@@ just@@ an@@ example@@. 0-5 NoParseTree 0-5
1 Example_Document http://example.com I@@ hope@@ it@@ helps@@. 0-3 NoParseTree 0-3
}}}
For real-world data with parse-tree information, this can get quite complex (see the first sentence in the normal Broccoli data):
{{{
1 http://en.wikipedia.org/wiki/Alain_Connes Alain Connes Alain Connes@@ (@@;@@ born@@ 1@@ April@@ 1947@@)@@ is@@ a@@ French@@ mathematician@@,@@ currently@@ Professor@@ at@@ the@@ Collège de France@@,@@ IHÉS@@,@@ The Ohio State University@@ and@@ Vanderbilt University@@. 0-1,7-11,13-17 0_Alain_NNP_*_(S1,(S,(NP_(ENUM,(C,(CH 0_Connes_NNP_*_)_) 1_(_*_*_*_* 2_;_:_*_(S1,(NP_(C* 3_born_VBN_*_(NP_* 4_1_CD_*_*_* 5_April_NNP_*_*_* 6_1947_NN_*_),),)_) 7_)_*_*_*_* 8_is_VBZ_*_(VP_* 9_a_DT_*_(NP,(NP_* 10_French_JJ_*_*_* 11_mathematician_NN_*_)_* 12_,_,_*_*_* 13_currently_RB_*_(ADVP,)_* 14_Professor_NNP_*_(NP,(NP,)_* 15_at_IN_*_(PP_* 16_the_DT_*_(NP,(NP_(ENUM,(C 17_Collège_NNP_*_*_* 17_de_IN_*_*_* 17_France_NNP_*_)_) 18_,_,_*_*_* 19_IHÉS_NNP_*_(NP,)_(C,) 20_,_,_*_*_* 21_The_DT_*_(NP_(C 21_Ohio_NNP_*_*_* 21_State_NNP_*_*_* 21_University_NNP_*_)_) 22_and_CC_*_*_* 23_Vanderbilt_NNP_*_(NP_(C 23_University_NNP_*_),),),),),)_),) 24_._._*_),)_),) 0-1,7-11,13-17;0-1,7-11,13-15,19-19;0-1,7-11,13-15,21-21;0-1,7-11,13-15,23-23;0-0,3-6
}}}

''' 4. Create the following empty files '''

Fill them with actual data if you have it available.
It works with empty files (see example-data):
{{{
touch $(DBTAIL)-ontology.entity-scores.noabs
touch $(DBTAIL)-ontology.name-mapping
touch $(DBTAIL)-ontology.reverse-relations
}}}

=== Build an index ===
Call make build-index and set the variables DATA_DIRECTORY and DBTAIL:
{{{
make build-index DATA_DIRECTORY=/home/buchholb/broccoli/example-data DBTAIL=example
}}}

=== Start a server instance ===
Call make start and set the variables PORT, DATA_DIRECTORY and DBTAIL:
{{{
make start PORT=6001 DATA_DIRECTORY=/home/buchholb/broccoli/example-data DBTAIL=example
}}}

=== Deploy a user interface (to a Tomcat webserver) ===
Go to the user interface folder:
{{{
cd broccoli/userinterface
}}}
Edit the files with your data (adjust: instance name, Tomcat location, maybe username and password, server, port):
{{{
vim build.properties
vim war/server_properties.xml
}}}
Compile and deploy:
{{{
ant build war deploy
}}}

=== Testing if everything works ===
If you used the files in example-data, a query for PhD-Student occurs-with helps should yield a result.

== Image Cache ==
To add individual images (for demos), run the command below. The script needs access to the raid so that it can write to the cache folder, and the image URL has to have a file extension for convert to work. Tested on '''filicudi'''; it does not work on stromboli because the code requires Python >= 3.3:
{{{
python3 ~/broccoli/img-hack/image_to_cache.py --mid --img 'http://...'
}}}

== Image Service ==
!HiWi project Kai Haase: see [[https://docs.google.com/document/d/1xVXSGWG9kB92LAfsbdq7-0p2fHbJFYbv1fKPqqq_eGQ|Google Doc]].

TODO: Florian's code has a mechanism for removing outdated images, which also removes cached images whose source now returns a 404 Not Found. Since the shutdown of the Freebase API, this effectively removes all images from the cache. This should be corrected.
Here is the guilty piece of code from https://ad-websvn.informatik.uni-freiburg.de/broccoli/freebase-imgsvc/fbthumbsvc.php:
{{{
// If no image could be found (404 error) create a 404 cache file for the
// current id, return a 404 error and end the script.
if ($return_status_code == 404) {
  // If there still was an expired cache file then remove it now!
  if ($cachefile_exists) {
    unlink($cachefile_path);
  }
  // Create a 404 cache file for the current id.
  touch($cachefile_path . '_404');
  returnMissingError();
}
}}}

== Mediator Only Index (CIKM) ==
An index that contains mediators (used for the CIKM presentation) is available in /nfs/raid5/haussmae/demos/broccoli_mediators_no_text. To start it (on filicudi, port 7099; should work as any user that can read the files):
{{{
/home/haussmae/demos/broccoli_mediators_no_text/ServerMain -p 7099 \
  -o /home/haussmae/demos/broccoli_mediators_no_text/semantic-wikipedia-scientists-ontology \
  -s /home/haussmae/demos/broccoli_mediators_no_text/semantic-wikipedia.stop-words \
  /home/haussmae/demos/broccoli_mediators_no_text/semantic-wikipedia-scientists \
  -m /home/haussmae/demos/broccoli_mediators_no_text/semantic-wikipedia-scientists-ontology.url-mapping
}}}
The user interface for backend filicudi:7099 is available at http://filicudi.informatik.uni-freiburg.de:6222/BroccoliCIKM (no UI hack) and http://filicudi.informatik.uni-freiburg.de:6222/BroccoliCIKM2 (UI hack). The UI hack makes specific mediator names readable in the query graph (and only there). The hack adjusts the nameLabel variable in the file src/de/uni/freiburg/broccoli/client/ui/BreadcrumbLabel.java of userinterface (in the broccoli repository).
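
== Checking a words-by-contexts file ==
When setting up an instance from scratch, it may help to sanity-check $(DBTAIL).words-by-contexts.txt before calling make build-index. The following is a minimal sketch, not part of the Broccoli code base: the function names check_posting and check_file are hypothetical, and it assumes that contextId, score and position are non-negative integers (as in the example data; real score values may differ). It only checks the four tab-separated fields and the :e: entity convention described for that file.

```python
import re
import sys

ENTITY_PREFIX = ":e:"          # prefix for entity postings, as described above
INTEGER = re.compile(r"[0-9]+")  # assumption: numeric fields are plain integers


def check_posting(line, lineno=0):
    """Return a list of problems found in one tab-separated posting line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 4:
        return [f"line {lineno}: expected 4 tab-separated fields, got {len(fields)}"]
    word, context_id, score, position = fields
    problems = []
    if not word:
        problems.append(f"line {lineno}: empty word")
    # Entities must use underscores instead of spaces.
    if word.startswith(ENTITY_PREFIX) and " " in word:
        problems.append(f"line {lineno}: entity contains a space: {word!r}")
    for name, value in (("contextId", context_id),
                        ("score", score),
                        ("position", position)):
        if not INTEGER.fullmatch(value):
            problems.append(f"line {lineno}: {name} is not an integer: {value!r}")
    return problems


def check_file(path):
    """Check every line of a words-by-contexts file and collect all problems."""
    problems = []
    with open(path, encoding="utf-8") as postings:
        for lineno, line in enumerate(postings, start=1):
            problems.extend(check_posting(line, lineno))
    return problems


if __name__ == "__main__":
    # Usage: python3 check_words.py example.words-by-contexts.txt
    for problem in check_file(sys.argv[1]):
        print(problem)
```

Run it with the path to the words file as its only argument; it prints one line per problem and nothing if the file passes these checks. A typical error it catches is a file that was written with spaces instead of tabs.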