Diff for "CompletionServer"

Differences between revisions 3 and 4

CompleteSearch Completion Server

startCompletionServer [options] <base-name>.hybrid

This starts the CompletionServer.

There are several options available, which should provide most of the necessary functionality.

Explicit Server Options

--zero-fork Run the server in the foreground, and output everything to the console, which is convenient for testing. The default mode is to run as a background process and write all output to a log file.
--no-double-fork Fork only once; the process will run forever or until the server is killed.
--multi-threaded Run in multithreaded mode (default: process one query after the other, which is still the recommended mode).
--auto-restart Automatically restart the server if it crashes (requires double fork mode, which is the default).
--kill <port number> Stop the server running at the specified port.
--kill-running-server If there is a running server, kill it before starting the new one.
--port <port number> Specify the port on which the server listens (default is 8888).
--pid-file <file name> Specify the name of the file containing the process id. A leading ~ will be replaced by the home directory, the first %s by the host name, and the second %s by the port (default is ~/.completesearch_<hostname>_<port>).
--locale <encoding> Set LC_ALL to this string, irrespective of special "!encoding:..." word in index.
--maps-directory <dir> Specify the directory containing the maps utf8.map and iso8859-1.map (default is the execution directory).
--index-type [INV|HYB] Type of index (default: guess from index file name).
-e <docs file> Name of file containing excerpts info (default: <db>.docs.DB).
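
The substitution rules described for --pid-file can be illustrated with a small sketch. It is not taken from the server code: the helper name expand_pid_file_name is hypothetical, and the pattern ~/.completesearch_%s_%s is only inferred from the documented default.

```python
import os


def expand_pid_file_name(pattern, host, port):
    """Expand a --pid-file pattern (hypothetical helper, not the server's code)."""
    # The first %s becomes the host name, the second %s becomes the port.
    name = pattern.replace("%s", host, 1).replace("%s", str(port), 1)
    # A leading ~ becomes the home directory of the current user.
    if name.startswith("~"):
        name = os.path.expanduser("~") + name[1:]
    return name


# Assumed default pattern, inferred from ~/.completesearch_<hostname>_<port>.
print(expand_pid_file_name("~/.completesearch_%s_%s", "myhost", 8888))
```

A pattern without ~ or %s is returned unchanged.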

Query Processing Options

--normalize-words Normalize all non-facet words. This makes it possible to find Müller even if muller is requested. It is recommended to also set the option --use-suffix-for-exact-query. Note that, to achieve the intended behaviour, the index must be parsed with the same option. See also CsvParser.
--word-part-separator-backend <character> The parts of special words like :facet:year:* used to be separated by a colon. In ASCII, however, the colon lies between the digits and the letters, which might cause problems when reading word ranges from the words file. No problems are expected, but it is still recommended to use a character that precedes the digits, like '!' (now the default). The words file must be built with the same delimiter. See also CsvParser.
--query-timeout <timeout> Specify the maximum time a request may be processed, to prevent critical queries from bringing the server to a standstill (default is 5000 ms).
--word-part-separator-frontend <character> Specify the separator used in the API to request special queries like :facet:year:1993 (default is ':').
--use-suffix-for-exact-query Makes it possible to find müller when normalization is enabled. Otherwise it is necessary to look for müller:* instead of müller.
--disable-cdata-tags Recommended if the info field of each document is valid XML and any invalid XML is already escaped using CDATA. Otherwise the whole output will be escaped using CDATA.
-E On error the error message is appended to the response and sent to the client.
--document-root <path on filesystem> Makes it possible to request files, e.g. HTML pages, located under the given path by requesting <host>:<port>/<someHTML>. This feature is disabled by default.
--exe-command <command> If specified, a request with the query parameter exe=<someValue> executes the command <command><someValue>.
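
The ASCII-ordering argument given for --word-part-separator-backend can be checked directly. The following sketch only inspects code points and sort order. The helper name sorts_before_digits and the sample words are made up for illustration.

```python
def sorts_before_digits(separator):
    """Return True if the separator precedes the digit '0' in ASCII order."""
    return ord(separator) < ord("0")


# Code points: '!' = 33, '0'..'9' = 48..57, ':' = 58, 'A' = 65, 'a' = 97.
for ch in "!09:Aa":
    print(repr(ch), ord(ch))

# With ':' the separated word sorts between digit and letter continuations,
# with '!' it sorts before both.
print(sorted(["a:b", "a1", "aB"]))
print(sorted(["a!b", "a1", "aB"]))
```

Here sorts_before_digits("!") is true and sorts_before_digits(":") is false, which matches the recommendation to prefer '!'.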
Cache/history sizes must be greater than 0 and are given in one of the forms: n meaning n bytes, nK meaning n kilobytes, nM meaning n megabytes, nG meaning n gigabytes.
- --max-size-history <size> Set the history size (default: 32 megabytes).
- --max-queries-history <size> Keep at most this many queries in the history (default: 200; note: the current implementation is quadratic).
- --cache-size-excerpts <size> Sets the cache size for the excerpts generator (default: 16 megabytes).
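
The size notation above can be sketched as a small parser. This is an illustration, not the server's own parsing code: the function name parse_size is hypothetical, and it assumes that a kilobyte means 1024 bytes and that the suffixes are upper case, neither of which is stated here.

```python
import re

# Assumption: binary multipliers (1K = 1024 bytes).
_UNITS = {"": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(text):
    """Parse sizes of the form n, nK, nM or nG into a number of bytes."""
    match = re.fullmatch(r"([0-9]+)([KMG]?)", text)
    if match is None:
        raise ValueError(f"invalid size: {text!r}")
    size = int(match.group(1)) * _UNITS[match.group(2)]
    # Sizes must be greater than 0.
    if size <= 0:
        raise ValueError(f"size must be greater than 0: {text!r}")
    return size


print(parse_size("32M"))  # the documented default history size, in bytes
```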
--cleanup-query-before-processing Clean up the query before processing by correcting the order of the characters ^, * and ~ and removing repeated occurrences of special characters like #, . and *.
--how-to-rank-docs <rankingType> Specify how to rank documents (0 = by score, 1 = by doc id, 2 = by word id, followed by a = ascending or d = descending; default is 0d).
--how-to-rank-words <rankingType> Specify how to rank words (0 = by score, 1 = by doc count, 2 = by occ count, 3 = by word id, followed by a = ascending or d = descending, default is 0d).
--score-aggregations <aggregation> Specify score aggregation by a 4-letter string over the alphabet {S,M,B}, see explanations below.
- There are currently three types of score aggregation, S = sum, M = max, B = sum with bonus for proximity and exact word match. There are two aggregations for doc scores (same completion, different completion) and two aggregations for word scores (same doc, different doc).
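
The value formats of --how-to-rank-docs, --how-to-rank-words and --score-aggregations can be sketched as simple validators. All names below are hypothetical. The sketch assumes the letters are case sensitive and deliberately does not say which of the four positions of the aggregation string belongs to which aggregation, since that order is not stated explicitly here.

```python
DOC_RANKINGS = {"0": "score", "1": "doc id", "2": "word id"}
WORD_RANKINGS = {"0": "score", "1": "doc count", "2": "occ count", "3": "word id"}
ORDERS = {"a": "ascending", "d": "descending"}
AGGREGATIONS = {"S": "sum", "M": "max", "B": "sum with bonus"}


def parse_ranking(spec, rankings):
    """Split a ranking type like '0d' into (criterion, order)."""
    if len(spec) != 2 or spec[0] not in rankings or spec[1] not in ORDERS:
        raise ValueError(f"invalid ranking type: {spec!r}")
    return rankings[spec[0]], ORDERS[spec[1]]


def parse_score_aggregations(spec):
    """Decode a 4-letter string over {S, M, B} into aggregation names."""
    if len(spec) != 4 or any(letter not in AGGREGATIONS for letter in spec):
        raise ValueError(f"invalid score aggregations: {spec!r}")
    return [AGGREGATIONS[letter] for letter in spec]


print(parse_ranking("0d", DOC_RANKINGS))
print(parse_score_aggregations("SMSB"))
```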

Logging Options

--log-file <logfile> Specify file name for the log messages (default is <base-name>.log).
--show-query-result Log information about the query result.
--verbosity <verbosity level> Set the log verbosity, especially for debugging (1 = normal, 2 = high, 3 = highest; default is 1).
--no-statistics Don't write time statistics to the log file.
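
To show how the server and logging options combine, here is a minimal sketch that assembles a command line for a test run. It uses only options documented on this page. The function build_command and the index name example.hybrid are hypothetical, and the sketch assumes that option values are passed as separate arguments, as the usage line suggests.

```python
import shlex


def build_command(index_file, port=8888, foreground=False, log_file=None, verbosity=1):
    """Assemble an argument list for startCompletionServer (illustrative only)."""
    if verbosity not in (1, 2, 3):
        raise ValueError("verbosity must be 1, 2 or 3")
    args = ["startCompletionServer"]
    if foreground:
        # Run in the foreground and print everything to the console.
        args.append("--zero-fork")
    args += ["--port", str(port)]
    if log_file is not None:
        args += ["--log-file", log_file]
    args += ["--verbosity", str(verbosity)]
    # The index file comes last, after all options.
    args.append(index_file)
    return args


print(shlex.join(build_command("example.hybrid", foreground=True)))
```

The resulting list can be passed to subprocess.run once the executable is available on the PATH.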

The following options exist but are not yet explained in depth; their descriptions are copied from the source code.

--use-generalized-edit-distance-slow Use generalized edit distance to rank the word-ids (slow!).
--read-custom-scores (-0)
To enable synonym search, use enable-synonym-search.
To enable fuzzy search, use enable-fuzzy-search. This makes it possible to find e.g. algorithm even when requesting the misspelled algoritm~ (the tilde is essential).
--fuzzy-normalize-words (-W)
--use-baseline-fuzzysearch (-B)

For more details, a look at the code that processes these command line options might be helpful. You can find the code in file https://ad-websvn.informatik.uni-freiburg.de/completesearch/codebase/server/StartCompletionServer.cpp.

CompleteSearch: CompletionServer (last edited 2016-07-15 15:44:26 by Hannah Bast)

-  ⇤ ← Revision 3 as of 2014-01-08 15:51:41 → 
  Size: 6913
  Editor: Ina Baumgarten
  Comment:
+   ← Revision 4 as of 2014-03-13 14:48:57 → ⇥
  Size: 7310
  Editor: Ina Baumgarten
  Comment:
-Deletions are marked like this.
+Additions are marked like this.
 Line 9:
-There are several options available, which should provide any
+There are several options available, which should provide most of the necessary functionality.
 Line 11:
+==== Explicit Server Options ====
-Line 12:
+Line 13:
+ * '''''--no-double-fork''''' Single fork, process will run forever or until server killed.
 * '''''--multi-threaded''''' Run in multithreaded mode (default: process one query after the other; still recommended).  
 * '''''--auto-restart''''' Automatically restart the server, if it crashes (requires double fork mode, which is default).
 * '''''--kill''' <port number>''    Stop the server running at the specified port.
 * '''''--kill-running-server''''' If there is a running server, do kill it before starting the new one.
-Line 13:
+Line 19:
- * '''''--kill''' <port number>''    Stop the server running at the specified port.
 * '''''--log-file''' <logfile>''        Specify file name for the log messages (default is <base-name>.log).
 * '''''--verbosity''' <verbosity level>''   Set the log verbosity, especially for debugging (1 = normal, 2 = high, 3 = highest; default is 1).
 * '''''--document-root''' <path on filesystem>'' Allows to request e.g. HTML pages located under the given path by requesting ''<host>:<port>/<someHTML>''. Per default this feature is disabled.
+ * '''''--pid-file''' <file name>'' Specifiy name of file containing the process id. Leading ~ will be replaced by the home directory, first %s will be replaced by host name, second %s will be replaced by port (default is ~/.completesearch_<hostname>_<port>).
 * '''''--locale''' <encoding>'' Set LC_ALL to this string, irrespective of special "!encoding:..." word in index.
 * '''''--maps-directory''' <dir>'' Specify the directory containing the maps ''utf8.map'' and ''iso8859-1.map'' (default is the execution directory).
 * '''''--index-type''' [INV|HYB]'' Type of index (default: guess from index file name).
 * '''''-e''''' <docs file> Name of file containing excerpts info (default: <db>.docs.DB).

==== Query Processing Options ====
 * '''''--normalize-words''''' Normalize all non-facet words. This allows to find ''Müller'', even if ''muller'' is requested. It's recommended to also set the option ''--use-suffix-for-exact-query''. Take care, to achieve the intended behaviour, it's necessary to parse with the same option. See also [[CsvParser]].
 * '''''--word-part-separator-backend''' <character>''  We used to separate the words in special words like '':facet:year:*'' by using the colon. We noticed, that the colon is positioned between numbers and letters in the ascii code, which might lead to problems on reading word ranges from the words file. There should not occur any problems, but it's still recommended to use a character which is positioned in front of numbers, like '!' (the default now). It's necessary to build the words file with the same delimuter. See also [[CsvParser]].
-Line 19:
+Line 30:
- * '''''--maps-directory''' <dir>'' Specify the directory containing the maps ''utf8.map'' and ''iso8859-1.map'' (default is the execution directory).
-Line 22:
+Line 32:
- * '''''--kill-running-server''''' If there is a running server, do kill it before starting the new one.

Some of the options depend on your index and don't work, if the index files itself don't support them. It's recommended to use the generic [[CsvParser]] with the same options to create proper index files.
 * Use '''''--normalize-words''''' to normalize all non-facet words. This allows to find ''Müller'', even if ''muller'' is requested. It's recommended to also set the option ''--use-suffix-for-exact-query''.
 * It's possible provide different outputs (info fields) for one document by using '''''--info-delimiter''' <info-delimiter>''. This can be reasonable, if you want to return different columns (e.g. <document-as-xml> and <document-as-html>) in different situations. It's possible to request the different outputs by using the query parameter ''p=<pos>'', whereas pos defines if it's the first of the given outputs (p=0), the second outputs (p=1), etc.
 * We used to separate the words in special words like '':facet:year:*'' by using the colon. We noticed, that the colon is positioned between numbers and letters in the ascii code, which might lead to problems on reading word ranges from the words file. There should not occur any problems, but it's still recommended to use a character which is positioned in front of numbers, like '!' (the default now). The separator can be specified by using '''''word-part-separator-backend''' <character>''.
 * To enable fuzzy search, use '''''enable-fuzzy-search'''''. This allows to find e.g. ''algorithm'' even by requesting the wrong written ''algoritm~'' (the tilde is essential).

Existing options, which are not yet explained in depth, but copied from the source code.
 * '''''--auto-restart''''' Automatically restart the server if it crashes (requires double fork mode).
 * '''''--no-statistics''''' (-V)
 * '''''--index-type''' [INV|HYB]'' Type of index (default: guess from index file name).
 * '''''--no-double-fork''''' No double fork, process will run forever or until server killed.
 * '''''--multi-threaded''''' Run in multithreaded mode (default: process one query after the other; still recommended).
+ * '''''-E'''''  On error the error message is appended to the response and sent to the client.
 * '''''--document-root''' <path on filesystem>'' Allows to request e.g. HTML pages located under the given path by requesting ''<host>:<port>/<someHTML>''. Per default this feature is disabled.
 * '''''--exe-command''' <command>'' If specified, the usage of the query parameter ''exe=<someValue>'' leads to the execution of the command <command><someValue>.
 * Cache/history sizes must be greater than 0 and are given in one of the form:  ''n'' meaning ''n bytes'', ''nK''  meaning ''n kilobytes'', ''nM''  meaning ''n megabytes'', ''nG''  meaning ''n gigabytes''.
   * '''''--max-size-history''' <size>'' Set the history size (default: 32 megabytes).
   * '''''--max-queries-history''' <size>'' At most that many queries in history (default: 200; note: current impl. is quadratic).
   * '''''--cache-size-excerpts''' <size>''  Sets the cache size for the excerpts generator (default: 16 megabytes).
 * '''''--cleanup-query-before-processing''''' Cleanup query before processing by correcting the order of the letters ^, * and ~ and erasing multiple interpretable characters like #, . and *.
-Line 39:
+Line 43:
-   * There are currently three types of score aggregation, S = sum, M = max, B = sum with bonus for proximity and exact word match. There are two aggregations for doc scores (same completion, different completion) and two aggregations for word scores (same doc, different doc)-
 * '''''--pid-file''' <file name>'' Specifiy name of file containing the process id, leading ~ will be replaced by home dir, first %s will be replaced by host name, second %s will be replaced by port (default is ~/.completesearch_<hostname>_<port>).                           
 * '''''--exe-command''' <command>''
 * '''''--locale''' <encoding>'' Set LC_ALL to this string, irrespective of special "!encoding:..." word in index.
 * '''''--enable-synonym-search''''' (-S)
 * '''''--fuzzy-normalize-words''''' (-W) 
 * '''''--show-query-result''''' (-Q)
+   * There are currently three types of score aggregation, S = sum, M = max, B = sum with bonus for proximity and exact word match. There are two aggregations for doc scores (same completion, different completion) and two aggregations for word scores (same doc, different doc).


==== Logging Options ====
 * '''''--log-file''' <logfile>''        Specify file name for the log messages (default is <base-name>.log).
 * '''''--show-query-result''''' Log information about the query result.
 * '''''--verbosity''' <verbosity level>''   Set the log verbosity, especially for debugging (1 = normal, 2 = high, 3 = highest; default is 1).
 * '''''--no-statistics''''' Don't write time statistics to the log file.

{{{#!wiki comment 
* It's possible provide different outputs (info fields) for one document by using '''''--info-delimiter''' <info-delimiter>''. This can be reasonable, if you want to return different columns (e.g. <document-as-xml> and <document-as-html>) in different situations. It's possible to request the different outputs by using the query parameter ''p=<pos>'', whereas pos defines if it's the first of the given outputs (p=0), the second outputs (p=1), etc.
}}}

Existing options, which are not yet explained in depth, but copied from the source code.
-Line 47:
+Line 58:
+ * '''''--read-custom-scores''''' (-0)
 * To enable synonym search, use '''''enable-synonym-search'''''.
 * To enable fuzzy search, use '''''enable-fuzzy-search'''''. This allows to find e.g. ''algorithm'' even by requesting the wrong written ''algoritm~'' (the tilde is essential).
 * '''''--fuzzy-normalize-words''''' (-W)
-Line 48:
+Line 63:
- * '''''--cleanup-query-before-processing''''' Cleanup query before processing.
 * '''''--read-custom-scores''''' (-0)
 * Cache/history sizes must be greater than 0 and are given in one of the form:  n meaning n bytes, nK  meaning n kilobytes, nM  meaning n megabytes, nG  meaning n gigabytes.
   * '''''--cache-size-excerpts''' <size>''  Sets the cache size for the excerpts generator (default: 16 megabytes).
   * '''''--max-size-history''' <size>'' Set the history size (default: 32 megabytes).
   * '''''--max-queries-history''' <size>'' At most that many queries in history (default: 200; note: current impl. is quadratic).
 * '''''-E'''''   On error, send single hit with error message (will be seen in browser then).
 * '''''-e''''' <docs file> Name of file containing excerpts info (default: <db>.docs.DB).
 * '''''-T'''''                   Do not turn title from the docs t: field into link, but send it verbatim.