Differences between revisions 1 and 16 (spanning 15 versions)
Revision 1 as of 2007-08-11 01:33:18
Size: 1879
Editor: vpn-114
Comment:
Revision 16 as of 2008-12-22 15:32:33
Size: 6710
Editor: mpiat1403
Comment: How to parse the XML input
Line 1: Line 1:
== 1. Completion Server (C++) == Complete``Search has three components
Line 3: Line 3:
=== 1.1 Get it === '''1. Index Building''' Tools for building, from a collection of documents, the index files required for fast query processing

'''2. Completion Server:''' This is a (multi-threaded) C++ program that listens to queries on a specified port, and answers with the help of the previously built index.

'''3. Web Interface:''' [PHP code (apache) + Java``Script (client), TODO: explain]

The following sections briefly explain how to install and work with each of these components.


[[Anchor(indexbuilding)]]

== 1 Index building ==

=== 1.1 Parsing ===
Line 6: Line 19:
svn checkout file://KM/ir/SVN/autocompletion svn checkout file:///KM/ir/SVN/xmlparser
vim <db>.parse.cpp
g++ -o <db>.parse <db>.parse.cpp xmlparser/XmlParser.o
<db>.parse <db>.xml
Line 9: Line 25:
=== 1.2 Compile it === This should produce a <db>.words file and a <db>.docs file. Actually it doesn't matter how they are produced, as long as they adhere to the format [TODO: explain].


[''Joachim: The XML input format for the parser is described on CompleteSearch/DocumentFormats. Copy the file exampleParser.cpp to <db>.parser.cpp, adjust the makefile and look at the code. The most important method is getItem("elementname"), with which you can extract the contents of an <elementname> and produce one line in the .words file per word contained in this element. If the XML input contains several consecutive <elementname>s, you can iterate over these with getItems("elementname").'']

=== 1.2 Sorting ===

{{{
sort -k1,1 -k2,2n -k4,4n <db>.words_unsorted > <db>.words
sort -k1,1n <db>.docs_unsorted > <db>.docs
}}}

If the <db>.words and <db>.docs files are not sorted according to the format specifications [TODO: link], this is one way to sort them. If the files were produced with the help of xmlparser above (or by any parser which just outputs things in the order it encounters them), sorting is necessary.

=== 1.3 Words Index (<db>.hybrid and <db>.vocabulary) ===

{{{
make buildIndex
buildIndex HYB <db>.words
}}}

For good efficiency it is actually a bit more tricky (a good choice of block boundaries is important). TODO: take this burden from the user, and incorporate it into buildIndex

=== 1.4 Docs Index (<db>.docs.DB) ===

{{{
make buildDocsDB
buildDocsDB <db>.docs
}}}

Run without arguments for detailed usage information about command line options etc.



== 2. Completion Server (C++) ==

=== 2.1 Get it ===

{{{
svn checkout file:///KM/ir/SVN/autocompletion
}}}

=== 2.2 Compile it ===
Line 15: Line 73:
=== 1.3 Run it === === 2.3 Run it ===
Line 25: Line 83:
See Section 4 on how to produce these files See Section [#indexbuilding Index building] on how to produce these files
Line 27: Line 85:
== 2. Web Server (Apache + PHP) == == 3. Web Interface ==
Line 29: Line 87:
=== 2.1 Get it === === 3.1 Get it ===
Line 32: Line 90:
svn checkout file://KM/ir/SVN/autocomplete-php svn checkout file:///KM/ir/SVN/autocomplete-php
Line 37: Line 95:
2.1.1 application-independent code: the ''autocomplete'' directory 3.1.1 application-independent code: the ''autocomplete'' directory
Line 39: Line 97:
2.1.2 application-specific code: the files ''index.php'', ''autocomplete.css'', ''autocomplete_config.php'' 3.1.2 application-specific code: the files ''index.php'', ''autocomplete.css'', ''autocomplete_config.php''
Line 41: Line 99:
2.1.3 web-server configuration details: the file ''STILL_HAS_TO_BE_WRITTEN.txt'' 3.1.3 web-server configuration details: the file ''STILL_HAS_TO_BE_WRITTEN.txt''
Line 43: Line 101:
=== 2.2 Configure it === === 3.2 Configure it ===
Line 63: Line 121:


== Installing under Windows ==

=== Getting the source code ===

This section tells you how to install a subproject such as {{{autocompletion}}}.

Install the [http://tortoisesvn.tigris.org/ Tortoise] SVN client.

Open Windows Explorer, press the right mouse button, and select "SVN Checkout".

(Tortoise is fully integrated into Explorer. If the "SVN Checkout" menu does not appear after the right click, go to any non-special directory such as "My Documents".)

As the URL of the repository, enter

{{{
svn+ssh://ziegler@contact.mpi-sb.mpg.de/KM/ir/SVN/autocompletion
}}}

where you have to replace {{{ziegler}}} with your login name on contact and {{{autocompletion}}} with the subproject that you want to check out.

As the checkout directory, give the path to the directory that you want the subproject to be checked out to; this path should end with the name of the subproject, for example,

{{{
D:\MyProjects\autocompletion
}}}


You will notice that Tortoise asks you 3 times for your password. Later, when you do your daily work with Tortoise, you will notice that this happens on ''each and every action that accesses the remote repository''. If this is annoying to you (and it surely will be), you can follow the instructions given in the
[http://tortoisesvn.net/ssh_howto TortoiseSVN SSH Howto]. See also this
[http://subversion.tigris.org/faq.html#ssh-auth-cache FAQ]: "When I access a repository using svn+ssh, my password is not cached in ~/.subversion/auth/. How do I avoid having to type it so often?"

A short summary of this procedure:

 * On any Unix host, cd to your ~/.ssh
 * {{{$ ssh-keygen -b 1024 -t dsa -f mykey}}}
 * {{{$ cat mykey.pub >> authorized_keys}}}
 * Transfer the file {{{mykey}}} to your Windows machine into a directory only readable by you, and remove it on the Unix host
 * On your Windows machine, install the programs puttygen and pageant from [http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html here]
 * Open puttygen, load the file {{{mykey}}}, and save the generated key as file {{{mykey.ppk}}}
 * Open pageant. It will be running in the system tray. Go to "Add Key", enter {{{mykey.ppk}}}
 * Start working with Tortoise without having to enter any password
 * You must keep pageant running while working with Tortoise

To perform any of the usual SVN client actions on a file, right click the filename in Explorer and select the TortoiseSVN submenu.

CompleteSearch has three components:

1. Index Building: Tools for building, from a collection of documents, the index files required for fast query processing

2. Completion Server: This is a (multi-threaded) C++ program that listens to queries on a specified port, and answers with the help of the previously built index.

3. Web Interface: [PHP code (apache) + JavaScript (client), TODO: explain]

The following sections briefly explain how to install and work with each of these components.

Anchor(indexbuilding)

1 Index building

1.1 Parsing

svn checkout file:///KM/ir/SVN/xmlparser
vim <db>.parse.cpp
g++ -o <db>.parse <db>.parse.cpp xmlparser/XmlParser.o
<db>.parse <db>.xml

This should produce a <db>.words file and a <db>.docs file. Actually it doesn't matter how they are produced, as long as they adhere to the format [TODO: explain].

[Joachim: The XML input format for the parser is described on CompleteSearch/DocumentFormats. Copy the file exampleParser.cpp to <db>.parser.cpp, adjust the makefile and look at the code. The most important method is getItem("elementname"), with which you can extract the contents of an <elementname> and produce one line in the .words file per word contained in this element. If the XML input contains several consecutive <elementname>s, you can iterate over these with getItems("elementname").]

1.2 Sorting

sort -k1,1 -k2,2n -k4,4n <db>.words_unsorted > <db>.words
sort -k1,1n <db>.docs_unsorted > <db>.docs

If the <db>.words and <db>.docs files are not sorted according to the format specifications [TODO: link], this is one way to sort them. If the files were produced with the help of xmlparser above (or by any parser which just outputs things in the order it encounters them), sorting is necessary.
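
To illustrate what the sort keys for the words file do, here is a minimal sketch that applies the same keys to four made-up records. The column layout (word, document id, score, position, separated by tabs), the helper names sort_words and is_words_sorted, the demo file names, and the LC_ALL=C setting are all assumptions for this demo; the real format specification is still a TODO above.

```shell
#!/bin/sh
# Sketch only: the record layout below is an assumption, not the documented format.

# Sort a words file with the same keys as above:
# column 1 (word) as text, then columns 2 and 4 as numbers.
# LC_ALL=C is an assumption that makes the demo independent of the locale.
sort_words() {
    LC_ALL=C sort -k1,1 -k2,2n -k4,4n "$1"
}

# Return success if the file is already in that order (sort -c only checks).
is_words_sorted() {
    LC_ALL=C sort -c -k1,1 -k2,2n -k4,4n "$1" 2>/dev/null
}

# Four hypothetical records: word, doc id, score, position.
tmpdir=$(mktemp -d)
printf 'search\t10\t1\t3\nindex\t2\t1\t7\nsearch\t2\t1\t5\nindex\t2\t1\t1\n' \
    > "$tmpdir/demo.words_unsorted"

if is_words_sorted "$tmpdir/demo.words_unsorted"; then
    cp "$tmpdir/demo.words_unsorted" "$tmpdir/demo.words"
else
    sort_words "$tmpdir/demo.words_unsorted" > "$tmpdir/demo.words"
fi
cat "$tmpdir/demo.words"
```

With these records, the two "index" lines come first (position 1 before position 7), followed by "search" in document 2 and then in document 10, because the document id is compared as a number rather than as text.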

1.3 Words Index (<db>.hybrid and <db>.vocabulary)

make buildIndex
buildIndex HYB <db>.words

For good efficiency it is actually a bit more tricky (a good choice of block boundaries is important). TODO: take this burden from the user, and incorporate it into buildIndex

1.4 Docs Index (<db>.docs.DB)

make buildDocsDB
buildDocsDB <db>.docs

Run without arguments for detailed usage information about command line options etc.

2. Completion Server (C++)

2.1 Get it

svn checkout file:///KM/ir/SVN/autocompletion

2.2 Compile it

make startCompletionServer

2.3 Run it

startCompletionServer -p <port> -l <log file> <db>.hybrid

Call it without arguments for detailed usage information.

The server requires three files: <db>.hybrid (the HYB index in some binary format), <db>.vocabulary (list of distinct words in sorted order), <db>.docs.db (url + title + raw text of the documents)

See Section [#indexbuilding Index building] on how to produce these files
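
Before starting the server, it may help to check that the three files are actually present. The following is a minimal sketch, not part of the CompleteSearch tools: the function name missing_index_files is hypothetical, and the suffix of the docs file is a parameter because this page writes it both as docs.DB and docs.db; the default chosen here is an assumption.

```shell
#!/bin/sh
# Sketch only: print the index files for <db> that do not exist yet.
# $1 = path prefix of the database (the <db> part)
# $2 = suffix of the docs file (assumed default: docs.DB)
missing_index_files() {
    db=$1
    docs_suffix=${2:-docs.DB}
    for f in "$db.hybrid" "$db.vocabulary" "$db.$docs_suffix"; do
        [ -f "$f" ] || printf '%s\n' "$f"
    done
}

# Demo with a hypothetical database prefix in a temporary directory:
# two of the three files exist, so only the docs file is reported.
tmpdir=$(mktemp -d)
touch "$tmpdir/demo.hybrid" "$tmpdir/demo.vocabulary"
missing=$(missing_index_files "$tmpdir/demo")
if [ -n "$missing" ]; then
    echo "missing index files:"
    echo "$missing"
fi
```

An empty result means all three files were found, and only then would it make sense to call startCompletionServer as shown above.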

3. Web Interface

3.1 Get it

svn checkout file:///KM/ir/SVN/autocomplete-php

contains three components:

3.1.1 application-independent code: the autocomplete directory

3.1.2 application-specific code: the files index.php, autocomplete.css, autocomplete_config.php

3.1.3 web-server configuration details: the file STILL_HAS_TO_BE_WRITTEN.txt

3.2 Configure it

mkdir <some dir>
cp index.php autocomplete.css autocomplete_config.php <some dir>

<some dir> should be a directory served by your web server. The search engine will then be accessible under http://<server name>/.../<some dir>
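
Since the config file is meant to be edited after copying, repeating the plain cp later would overwrite those edits. The following sketch is one possible way to make the copy step repeatable; the function name install_app_files and the demo directories are hypothetical, and the approach is an assumption, not something the autocomplete-php checkout provides.

```shell
#!/bin/sh
# Sketch only: copy the three application-specific files into a served directory.
# $1 = directory of the checkout that holds the files, $2 = target directory
install_app_files() {
    src=$1
    dest=$2
    mkdir -p "$dest" || return 1
    cp "$src/index.php" "$src/autocomplete.css" "$dest/" || return 1
    # Keep an already edited config file instead of overwriting it.
    if [ ! -e "$dest/autocomplete_config.php" ]; then
        cp "$src/autocomplete_config.php" "$dest/" || return 1
    fi
}

# Demo with placeholder files in temporary directories.
demo_src=$(mktemp -d)
demo_dest=$(mktemp -d)/search
for name in index.php autocomplete.css autocomplete_config.php; do
    echo "placeholder" > "$demo_src/$name"
done
install_app_files "$demo_src" "$demo_dest"
ls "$demo_dest"
```

Running the function a second time refreshes index.php and autocomplete.css but leaves an existing autocomplete_config.php untouched.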

vim autocomplete_config.php

The config file is just a list of variable assignments. It is well-documented and clearly says [TODO: not yet] which parts you must adapt (few), and which you can just leave at their default values (you can always come back and change them later).

vim <apache config file>

Adapt it as explained in the web-server configuration file from 3.1.3 above. You must have Apache + PHP installed (PHP4 and PHP5 are both fine). [TODO: give details on how to install it. For Debian, it's a simple package libapache2-mod-php4. For Windows there is an installer [TODO].]

Installing under Windows

Getting the source code

This section tells you how to install a subproject such as autocompletion.

Install the [http://tortoisesvn.tigris.org/ Tortoise] SVN client.

Open Windows Explorer, press the right mouse button, and select "SVN Checkout".

(Tortoise is fully integrated into Explorer. If the "SVN Checkout" menu does not appear after the right click, go to any non-special directory such as "My Documents".)

As the URL of the repository, enter

svn+ssh://ziegler@contact.mpi-sb.mpg.de/KM/ir/SVN/autocompletion

where you have to replace ziegler with your login name on contact and autocompletion with the subproject that you want to check out.

As the checkout directory, give the path to the directory that you want the subproject to be checked out to; this path should end with the name of the subproject, for example,

D:\MyProjects\autocompletion

You will notice that Tortoise asks you 3 times for your password. Later, when you do your daily work with Tortoise, you will notice that this happens on each and every action that accesses the remote repository. If this is annoying to you (and it surely will be), you can follow the instructions given in the [http://tortoisesvn.net/ssh_howto TortoiseSVN SSH Howto]. See also this [http://subversion.tigris.org/faq.html#ssh-auth-cache FAQ]: "When I access a repository using svn+ssh, my password is not cached in ~/.subversion/auth/. How do I avoid having to type it so often?"

A short summary of this procedure:

  • On any Unix host, cd to your ~/.ssh
  • $ ssh-keygen -b 1024 -t dsa -f mykey

  • $ cat mykey.pub >> authorized_keys

  • Transfer the file mykey to your Windows machine into a directory only readable by you, and remove it on the Unix host

  • On your Windows machine, install the programs puttygen and pageant from [http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html here]

  • Open puttygen, load the file mykey, and save the generated key as file mykey.ppk

  • Open pageant. It will be running in the system tray. Go to "Add Key", enter mykey.ppk

  • Start working with Tortoise without having to enter any password
  • You must keep pageant running while working with Tortoise
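
The step that appends the public key to authorized_keys adds a duplicate line each time it is repeated. The following sketch shows one possible way to make that step repeatable; the function name add_authorized_key, the fake key line, and the temporary demo paths are assumptions for illustration. It does not generate a key itself.

```shell
#!/bin/sh
# Sketch only: append a public key to authorized_keys unless it is already there.
# $1 = public key file (e.g. mykey.pub)
# $2 = authorized_keys file (assumed default: ~/.ssh/authorized_keys)
add_authorized_key() {
    pubkey_file=$1
    auth_file=${2:-$HOME/.ssh/authorized_keys}
    mkdir -p "$(dirname "$auth_file")" || return 1
    touch "$auth_file" || return 1
    # -F: fixed strings, -x: whole-line match, -q: no output, -f: patterns from file
    if ! grep -qxF -f "$pubkey_file" "$auth_file"; then
        cat "$pubkey_file" >> "$auth_file" || return 1
    fi
    # Restrict the file to its owner.
    chmod 600 "$auth_file"
}

# Demo with a fake key line in a temporary directory (not a real key).
tmpdir=$(mktemp -d)
echo 'ssh-dss AAAAfakekeydata demo@example' > "$tmpdir/mykey.pub"
add_authorized_key "$tmpdir/mykey.pub" "$tmpdir/ssh/authorized_keys"
add_authorized_key "$tmpdir/mykey.pub" "$tmpdir/ssh/authorized_keys"
wc -l < "$tmpdir/ssh/authorized_keys"
```

After both calls the demo file still contains the key only once. Note that OpenSSH 7.0 and later disable DSA keys by default, so on current hosts a different key type than the one in the summary above may be needed.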

To perform any of the usual SVN client actions on a file, right click the filename in Explorer and select the TortoiseSVN submenu.

CompleteSearch: completesearch/Installation (last edited 2009-01-19 14:57:16 by mpino1301)