Differences between revisions 3 and 4
Revision 3 as of 2007-08-11 01:52:52
Size: 3631
Editor: vpn-114
Comment:
Revision 4 as of 2007-08-11 01:55:53
Size: 3666
Editor: vpn-114
Comment:

CompleteSearch has three components:

1. Index Building: Tools for building, from a collection of documents, the index files required for fast query processing.

2. Completion Server: This is a (multi-threaded) C++ program that listens to queries on a specified port, and answers with the help of the previously built index.

3. User Interface (web-browser based): PHP code (Apache) + JavaScript (client) [TODO: explain]

The following sections briefly explain how to install and work with each of these components.

1. Index building

1.1 Parsing

svn checkout file://KM/ir/SVN/xmlparser
vim <db>.parse.cpp
g++ <db>.parse.cpp

Running the compiled parser should produce a <db>.words file and a <db>.docs file. How these files are produced does not matter, as long as they adhere to the format [TODO: explain].

1.2 Sorting

sort -k1,1 -k2,2n -k4,4n <db>.words_unsorted > <db>.words
sort -k1,1n <db>.docs_unsorted > <db>.docs

If the <db>.words and <db>.docs files are not sorted according to the format specifications [TODO: link], the commands above are one way to sort them. If the files were produced with the help of xmlparser above (or by any parser which just outputs things in the order it encounters them), sorting is necessary.
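To see what the sort keys above do, here is a small sketch on made-up data. The real <db>.words format is not specified on this page, so the four tab-separated columns (word, doc id, score, position) and the file name demo.words are assumptions for illustration only. The sketch also uses sort -c to check whether a file is already in the required order.

```shell
# Hypothetical sample data; assumed columns: word, doc id, score, position.
tmp=$(mktemp -d)
printf 'search\t2\t0\t7\nindex\t10\t0\t1\nindex\t2\t0\t5\nindex\t2\t0\t3\n' \
  > "$tmp/demo.words_unsorted"

# Same keys as above: field 1 as text, then fields 2 and 4 as numbers.
sort -k1,1 -k2,2n -k4,4n "$tmp/demo.words_unsorted" > "$tmp/demo.words"

# sort -c exits with a non-zero status if the file is not in the given order.
if sort -c -k1,1 -k2,2n -k4,4n "$tmp/demo.words"; then
  sorted_ok=yes
else
  sorted_ok=no
fi
echo "sorted: $sorted_ok"
cat "$tmp/demo.words"
```

Note that the numeric keys place doc id 2 before doc id 10, which a plain text sort would not do. The ordering of the first field depends on the current locale, so it may be worth checking which collation the index builder expects.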

1.3 Words Index (<db>.hybrid and <db>.vocabulary)

make buildIndex
buildIndex HYB <db>.words

For good efficiency it is actually a bit more tricky (a good choice of block boundaries is important). TODO: take this burden from the user, and incorporate it into buildIndex.

1.4 Docs Index (<db>.docs.DB)

make buildDocsDB
buildDocsDB <db>.docs

Run without arguments for detailed usage information about command line options etc.
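The sorting and index-building steps of Section 1 can be chained in a small wrapper. The following is only a sketch: build_all is a hypothetical helper that is not part of the repository, it assumes buildIndex and buildDocsDB have already been built with make and are on the PATH, and it uses sort -o instead of output redirection so that a dry run can print the commands.

```shell
# build_all <db> [echo]
# Hypothetical helper: sorts the parser output and builds both indexes.
# Pass "echo" as the second argument to print the commands instead of running them.
build_all() {
  db=$1
  run=${2:-}
  $run sort -k1,1 -k2,2n -k4,4n -o "$db.words" "$db.words_unsorted" &&
  $run sort -k1,1n -o "$db.docs" "$db.docs_unsorted" &&
  $run buildIndex HYB "$db.words" &&
  $run buildDocsDB "$db.docs"
}

# Dry run for a database called "mydb" (an example name): only prints the commands.
build_all mydb echo
```

Calling build_all mydb without the second argument would execute the four steps in order and stop at the first one that fails.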

2. Completion Server (C++)

2.1 Get it

svn checkout file://KM/ir/SVN/autocompletion

2.2 Compile it

make startCompletionServer

2.3 Run it

startCompletionServer -p <port> -l <log file> <db>.hybrid

Call it without arguments for detailed usage information.

The server requires three files: <db>.hybrid (the HYB index in a binary format), <db>.vocabulary (the list of distinct words in sorted order), and <db>.docs.db (URL + title + raw text of the documents).

See Section 1 on how to produce these files.
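Before starting the server it can help to check that all three files are in place. The sketch below assumes that the three files share the same <db> prefix and uses the suffixes exactly as written in the sentence above; check_index_files is a hypothetical helper, not part of the autocompletion code, and the docs suffix may need adjusting to match what buildDocsDB actually writes.

```shell
# check_index_files <db>
# Hypothetical helper: prints each missing file, returns non-zero if any is absent.
check_index_files() {
  db=$1
  missing=0
  for suffix in hybrid vocabulary docs.db; do
    if [ ! -f "$db.$suffix" ]; then
      echo "missing: $db.$suffix"
      missing=1
    fi
  done
  return "$missing"
}

# Demo with placeholder files in a temporary directory ("mydb" is an example name).
demo=$(mktemp -d)
touch "$demo/mydb.hybrid" "$demo/mydb.vocabulary"
check_index_files "$demo/mydb" || echo "not ready"
```

In practice the check could guard the start command, for example check_index_files mydb && startCompletionServer -p 8888 -l mydb.log mydb.hybrid, where the port and log file name are arbitrary example values.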

3. User Interface (web-browser based)

3.1 Get it

svn checkout file://KM/ir/SVN/autocomplete-php

The checkout contains three components:

3.1.1 application-independent code: the autocomplete directory

3.1.2 application-specific code: the files index.php, autocomplete.css, autocomplete_config.php

3.1.3 web-server configuration details: the file STILL_HAS_TO_BE_WRITTEN.txt

3.2 Configure it

mkdir <some dir>
cp index.php autocomplete.css autocomplete_config.php <some dir>

<some dir> should be a directory served by your web server. The search engine will then be accessible under http://<server name>/.../<some dir>

vim autocomplete_config.php

The config file is just a list of variable assignments. It is well-documented and clearly says [TODO: not yet] which parts you must adapt (few), and which you can just leave at their default values (you can always come back and change them later).

vim <apache config file>

Adapt it as explained in the web-server configuration file from 3.1.3 above. You must have Apache + PHP installed (PHP4 and PHP5 are both fine). [TODO: give details on how to install it. On Debian, it is a single package, libapache2-mod-php4. For Windows there is an installer [TODO].]

CompleteSearch: completesearch/Installation (last edited 2009-01-19 14:57:16 by mpino1301)