Diff for "CsvParser" - completesearch

Differences between revisions 6 and 12 (spanning 6 versions)

CompleteSearch CSV Parser

We have two general-purpose parsers for producing the <base-name>.words and <base-name>.docs files required for building an index and then searching with CompleteSearch: a CSV Parser and an XML Parser.

The CSV Parser is located in the parser sub-directory of the CompleteSearch codebase. It can be invoked by

./CsvParserMain [many options] --base-name=<base name>

It then looks for a file <base-name>.csv and produces <base-name>.words and <base-name>.docs according to the specified options.

For a short description of the most important options, call the program without arguments:

./CsvParserMain

Here is an explanation of the most typical options, using an example CSV (we are actually using TSV, i.e. tab-separated, files). The TSV can have as many columns as necessary. Each line describes one document.

author         year      title
Paul Heinz     2010      How to squeeze lemons.
Paula Müller   2012      A performance evaluation of algorithms about squeezing lemons.

Specifying --full-text=author,title sets which columns are searched for a simple request like q=lemons.

It is possible to restrict the search to specific columns by using filters and facets. These are defined with --filter=author,year,title and --facets=author,year. The difference is that a filter word is created for each single word in the column, whereas facet words help to find specific phrases. Small examples: q=:filter:author:paul* finds all documents written by someone whose name contains a word starting with paul (Paul, Paula, etc.); q=:facet:author:Paul_Heinz finds the specific author Paul Heinz; q=:filter:title:lemons :facet:author:* returns all authors who have written an article about lemons.
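
The difference between filter words and facet words can be illustrated with a minimal Python sketch. It is not the parser's code: the function names are hypothetical, and both the tokenization and the lowercasing of filter words are assumptions inferred from the example queries above.

```python
import re


def filter_words(field, value):
    """One word per token in the field (hypothetical helper).

    Assumption: tokens are runs of word characters and are lowercased.
    """
    tokens = re.findall(r"\w+", value)
    return [f":filter:{field}:{token.lower()}" for token in tokens]


def facet_word(field, value):
    """One word for the whole field value, spaces replaced by underscores."""
    return f":facet:{field}:{value.replace(' ', '_')}"


print(filter_words("author", "Paul Heinz"))
print(facet_word("author", "Paul Heinz"))
```

A prefix query such as q=:filter:author:paul* would then match the first filter word, while q=:facet:author:Paul_Heinz matches the single facet word.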

If a document has, e.g., several authors, it is possible to write them in one field, separated by the character specified with within-field-separator=#. If author is defined as a facet but not all values should be stored as facets, use --no-show-facet-prefix=* and mark the values to be left out with that prefix. Example: Paula Müller#*Frida Peterson#*Hugo Schmitt.
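
One possible reading of the example field, as a minimal Python sketch: the field is split at the separator, and values starting with the prefix are kept as values but not stored as facets. The function name is hypothetical and the exact behaviour of the parser is an assumption here.

```python
def split_field(field_value, separator="#", no_show_prefix="*"):
    """Split a multi-value field (hypothetical helper).

    Returns (all_values, facet_values). Assumption: a value starting with
    no_show_prefix is kept as a value but not stored as a facet.
    """
    all_values, facet_values = [], []
    for raw in field_value.split(separator):
        hidden = raw.startswith(no_show_prefix)
        value = raw[len(no_show_prefix):] if hidden else raw
        all_values.append(value)
        if not hidden:
            facet_values.append(value)
    return all_values, facet_values


values, facets = split_field("Paula Müller#*Frida Peterson#*Hugo Schmitt")
print(values)
print(facets)
```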

Use show=year,title to specify which fields are returned for a hit. If you want to provide several outputs, specify an info-delimiter=~ (a character that must not occur within your documents) and separate the outputs with semicolons: show=year,title;year,author;year. You can fetch them by adding the query parameter p=0 for year and title, p=1 for year and author, p=2 for year, and so on.
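
How a show value with several outputs maps to the p parameter can be sketched in a few lines of Python. This only illustrates the splitting of the option string; the function name is hypothetical, and how the parser stores the outputs using the info-delimiter is not shown.

```python
def parse_show(spec):
    """Split a show value into output groups (hypothetical helper).

    Semicolons separate the outputs, commas separate the fields of one output.
    """
    return [group.split(",") for group in spec.split(";")]


groups = parse_show("year,title;year,author;year")
for p, fields in enumerate(groups):
    print(f"p={p}: {fields}")
```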

All words except the facet words can be normalized by using --normalize-words. In this case it is possible to find müller by searching for muller or mueller, depending on the mapping in your maps (utf8.map, iso8859-1.map). Set the option --encoding to either utf-8 or iso to specify the encoding of your file. The directory that contains the maps can be set with --maps-directory=codebase/utility/. The default is the directory from which the binaries are called.
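
The effect of the mapping can be illustrated with a minimal Python sketch. The dictionaries below are made-up stand-ins; the real format of utf8.map and iso8859-1.map is not described here, and the lowercasing is an assumption.

```python
def normalize(word, mapping):
    """Replace each character according to mapping (hypothetical helper).

    Assumption: words are lowercased before the mapping is applied.
    """
    return "".join(mapping.get(ch, ch) for ch in word.lower())


# Two hypothetical mappings; which one applies depends on the map file in use.
print(normalize("Müller", {"ü": "u"}))
print(normalize("Müller", {"ü": "ue"}))
```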

It is possible to define a field-specific score by using --score=author:3,year:1 (a comma-separated list of <fieldName>:<score> pairs).

A field-specific ordering can be defined with --ordering, which takes a comma-separated list of <fieldname>:<ordering> pairs. We support three ordering types: literal, date, and precision. For example, --ordering=author:literal,price:5.2,release:date produces words like :ordering:author:Paul_Heinz, :ordering:price:0013582 (the number 135.8264 in 5.2 precision), and :ordering:release:20130824 (for the input word 24.8.2013). More information on the supported date formats can be found in the file StringConverter.cpp, in the method StringConverter::convertDate.
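
The three example words above can be reproduced with a minimal Python sketch. It is not the code from StringConverter.cpp: the function names are hypothetical, the precision case assumes that surplus decimals are cut off rather than rounded (which matches 135.8264 becoming 0013582), and the date case handles only the day.month.year form of the example.

```python
from decimal import Decimal, ROUND_DOWN


def literal_word(field, value):
    """Literal ordering: the value itself, spaces replaced by underscores."""
    return f":ordering:{field}:{value.replace(' ', '_')}"


def precision_word(field, number, spec):
    """Precision ordering, e.g. spec '5.2' = 5 integer and 2 decimal digits.

    Assumption: surplus decimals are truncated, not rounded.
    """
    int_digits, frac_digits = (int(part) for part in spec.split("."))
    scaled = (Decimal(number) * 10 ** frac_digits).to_integral_value(rounding=ROUND_DOWN)
    width = int_digits + frac_digits
    return f":ordering:{field}:{int(scaled):0{width}d}"


def date_word(field, text):
    """Date ordering for input of the form day.month.year only (assumption)."""
    day, month, year = (int(part) for part in text.split("."))
    return f":ordering:{field}:{year:04d}{month:02d}{day:02d}"


print(literal_word("author", "Paul Heinz"))
print(precision_word("price", "135.8264", "5.2"))
print(date_word("release", "24.8.2013"))
```

Zero-padding to a fixed width makes the lexicographic order of the words agree with the numeric or chronological order of the values.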

For more details, a look at the code that processes these command line options might be helpful. It is spread over two files. The code for the general (CSV-independent) options is in the file ParserBase.cpp, method ParserBase::parseCommandLineOptions. The code for the CSV-specific options is in the file CsvParserOptions.cpp, method CsvParserOptions::parseCommandLineOptions.

-  ⇤ ← Revision 6 as of 2013-10-18 17:06:01 → 
  Size: 1315
  Editor: Ina Baumgarten
  Comment:
+   ← Revision 12 as of 2013-11-14 16:25:03 → ⇥
  Size: 4677
  Editor: Ina Baumgarten
  Comment:
-Deletions are marked like this.
+Additions are marked like this.
 Line 1:
-#acl Ina Baumgarten:read,write
+#acl Ina Baumgarten:read,write All:read
 Line 21:
-For more details, a look at the code that processes these command line options might be helpful. You find them in two files. The code for the general (CSV independent) options can be found in file [[https://ad-svn.informatik.uni-freiburg.de/completesearch/codebase/parser/ParserBase.cpp|ParserBase.cpp]] method ''!ParserBase::parseCommandLineOptions''. The code for the CSV-specific options can be found in file [[https://ad-svn.informatik.uni-freiburg.de/completesearch/codebase/parser/CsvParserOptions.cpp|CsvParserOptions.cpp]]  method ''!CsvParserOptions::parseCommandLineOptions''.
+Here an explanation for the most typical options by using an example CSV (actually we are using TSVs).
The TSV can have as many columns as necessary. Each line describes a document.

{{{
author         year      title
Paul Heinz     2010      How to squeeze lemons.
Paula Müller   2012      A performance evaluation of algorithms about squeezing lemons.
}}}

By specifying '''''--full-text'''=author,title'', it is set, which columns shall be searched on a simple request like ''q=lemons''.

It is possible to restrict the search to specific columns by using filters and facets.
Those can be defined by using '''''--filter'''=author,year,title'' and '''''--facets'''=author,year''. The difference between both of them is, that there is a filter word for each single word in the column, whereas facets words help to find specific phrases. Small examples: ''q=:filter:author:paul*'' finds all documents who are written by someone whose name is Paul, Paula, etc., ''q=:facet:author:Paul_Heinz'' finds the specific author Paul Heinz, ''q=:filter:title:lemons :facet:author:*'' returns the all authors who have written an article about lemons.

If there are e.g. several authors for one document, it it possible to write them in one field by separating them with the specified '''''within-field-separator'''=#''. If author is defined as a facet and not all words shall be stored as facet, use '''''--no-show-facet-prefix'''=*'' to avoid adding them. Example: ''Paula Müller#*Frida Peterson#*Hugo Schmitt''.

Use '''''show='''year,title'' to specify which fields should be returned on a hit. If you want to provide various outputs, specify an '''''info-delimiter'''=~'' - which may not occur within your documents - and define the various outputs, by using a semicolon: ''show=year,title;year,author;year''. You can fetch them by adding the query parameter ''p=0'' for year and title, ''p=1'' for author, year and title, ''p=2'' for year and so on.

All words, beside the facet words, can be normalized by using '''''--normalize-words'''''. In this case it's possible to find ''müller'' by searching for ''muller'' or ''mueller'', which depends on the mapping in your maps (utf8.map, iso8859-1.map). Set the option '''''--encoding''''' to either ''utf-8'' or ''iso'' to specify the encoding of your file. The path to the directory, which contains the maps, can be defined by using '''''--maps-directory'''=codebase/utility/''. Default is the directory in which the binaries are called.

It is possible to define a field specific score by using '''''--score='''author:3,year:1'' (comma-separated list of <fieldName>:<score> pairs).

A field specific ordering can be defined by specifying a comma-separated list of <fieldname>:<ordering> pairs. We support three different ordering types: order by ''literal'', ''date'' or precision. For example '''''--ordering='''author:literal,price:5.2,release:date'' produces words like '':ordering:author:Paul_Heinz'', '':ordering:price:0013582'' (number 135.8264 in 5.2 precision) '':ordering:release:20130824'' (for the input word 24.8.2013). For more information on the supported date formats can be found in the file [[http://ad-svn.informatik.uni-freiburg.de/wsvn/completesearch/codebase/utility/StringConverter.cpp|StringConverter.cpp]] in method ''StringConverter::convertDate''.

For more details, a look at the code that processes these command line options might be helpful. You find them in two files. The code for the general (CSV independent) options can be found in file [[http://ad-svn.informatik.uni-freiburg.de/wsvn/completesearch/codebase/parser/ParserBase.cpp|ParserBase.cpp]] method ''!ParserBase::parseCommandLineOptions''. The code for the CSV-specific options can be found in file [[http://ad-svn.informatik.uni-freiburg.de/wsvn/completesearch/codebase/parser/CsvParserOptions.cpp|CsvParserOptions.cpp]]  method ''!CsvParserOptions::parseCommandLineOptions''.