CompleteSearch CSV Parser

We have two general-purpose parsers for producing the <base-name>.words and <base-name>.docs files required for building an index and then searching with CompleteSearch: a CSV Parser and an XML Parser.

The CSV Parser is located in the parser sub-directory of the CompleteSearch codebase. It can be invoked by

./CsvParserMain [many options] --base-name=<base-name>

It then looks for a file <base-name>.csv and produces <base-name>.words and <base-name>.docs according to the specified options.

For a short description of the most important options, call the program without arguments:

./CsvParserMain

Here is an explanation of the most typical options, using an example CSV (actually we are using TSVs, i.e. tab-separated files). The TSV can have as many columns as necessary. The first line names the columns; each following line describes one document.

author         year      title
Paul Heinz     2010      How to squeeze lemons.
Paula Müller   2012      A performance evaluation of algorithms about squeezing lemons.

The option --full-text=author,title specifies which columns are searched for a simple request like q=lemons.

It is possible to restrict the search to specific columns by using filters and facets. These are defined with --filter=author,year,title and --facets=author,year. The difference between the two is that there is one filter word for each single word in the column, whereas there is one facet word for the whole column value, which helps to find specific phrases. Small examples: q=:filter:author:paul* finds all documents written by someone whose name is Paul, Paula, etc.; q=:facet:author:Paul_Heinz finds the specific author Paul Heinz; q=:filter:title:lemons :facet:author:* returns all authors who have written an article about lemons.
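The difference between filter words and facet words can be illustrated with a small Python sketch. It is not the parser's code: the function names are hypothetical, and lowercasing the filter words and splitting on non-word characters are assumptions inferred from the query examples above (paul* matching Paul and Paula, and the facet word Paul_Heinz).

```python
import re

def filter_words(field, value):
    """One filter word per single word of the field value.
    Lowercasing and the tokenization rule are assumptions of this sketch."""
    tokens = re.findall(r"\w+", value.lower())
    return [f":filter:{field}:{token}" for token in tokens]

def facet_word(field, value):
    """One facet word for the whole field value, with spaces
    replaced by underscores as in :facet:author:Paul_Heinz."""
    return f":facet:{field}:" + "_".join(value.split())

row = {"author": "Paul Heinz", "year": "2010", "title": "How to squeeze lemons."}
print(filter_words("author", row["author"]))
print(facet_word("author", row["author"]))
```

A prefix query such as q=:filter:author:paul* can match the per-word filter words of both example documents, while the facet word only matches the complete author name.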

If a document has several values for one field, e.g. several authors, it is possible to write them in one field by separating them with the character specified by --within-field-separator=#. It is necessary to specify which fields might contain the within-field separator by using the option --allow-multiple-items=author,editor.

The CsvParser supports three different input formats, namely json, xml and plain text. You can specify the input format of each field by using --field-format=author:text,title:xml,year:text,description:json. By default each field is handled as text. For facets it is recommended to use text, since CDATA sections and xml entities are currently not handled.

Use --show=year,title to specify which fields should be returned on a hit. If you want to provide several alternative outputs, specify an --info-delimiter=~ (a character which must not occur within your documents) and separate the show lists with semicolons: --show=year,title;year,author;year. You can fetch them by adding the query parameter p=0 for year and title, p=1 for year and author, p=2 for year, and so on. Moreover, it is possible to specify the output format of each show list by appending :json or :xml (xml is the default anyway). Example: --show=year,title:json;year,author:xml;year. Be aware that the CompletionServer supports a conversion from xml to json, but not from json to xml. So, if json is used as the output format, it is not possible to return an xml result.
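How a show specification maps to the values of the query parameter p can be sketched as follows. This is only an illustration of the syntax described above; parse_show is a hypothetical helper, not a function of the CompleteSearch codebase.

```python
def parse_show(spec, default_format="xml"):
    """Split a show specification into (fields, format) pairs.
    The list index of each pair is the value of the query parameter p."""
    outputs = []
    for part in spec.split(";"):
        # An optional ":json" or ":xml" suffix applies to the whole show list.
        fields, has_format, fmt = part.partition(":")
        field_list = [name for name in fields.split(",") if name]
        outputs.append((field_list, fmt if has_format else default_format))
    return outputs

outputs = parse_show("year,title:json;year,author:xml;year")
for p, (fields, fmt) in enumerate(outputs):
    print(f"p={p}: fields={fields} format={fmt}")
```

For the example specification this yields three outputs: year and title as json for p=0, year and author as xml for p=1, and year in the default format for p=2.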

If you want to prevent an item from being added as a facet or returned within a show field, use --no-show-prefix=*. Example: Paula Müller#*Frida Peterson#*Hugo Schmitt. If the given field is specified as a facet, only Paula Müller is added to the words file. If the given field is part of a show list, only Paula Müller is returned to the user.
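The combined effect of the within-field separator and the no-show prefix on the example field can be sketched like this. The function name visible_items is hypothetical, and the sketch only covers the facet and show side described above; it makes no claim about how the hidden items are treated elsewhere.

```python
def visible_items(field_value, separator="#", no_show_prefix="*"):
    """Split a multi-item field and keep only the items that do not
    start with the no-show prefix."""
    items = field_value.split(separator)
    return [item for item in items if not item.startswith(no_show_prefix)]

field = "Paula Müller#*Frida Peterson#*Hugo Schmitt"
print(visible_items(field))
```

For the example field, only Paula Müller remains as a facet or show item.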

All words except the facet words can be normalized by using --normalize-words. In this case it is possible to find müller by searching for muller or mueller, depending on the mapping in your maps (utf8.map, iso8859-1.map).
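The idea behind map-based normalization can be shown with a toy example. The mapping below is a hypothetical in-memory stand-in; it does not reproduce the contents or the file format of utf8.map or iso8859-1.map, and lowercasing is an assumption of this sketch.

```python
import unicodedata

# Hypothetical character mapping, standing in for a real map file.
HYPOTHETICAL_MAP = {"ü": "u", "ö": "o", "ä": "a", "ß": "ss"}

def normalize(word, char_map):
    """Lowercase a word and replace each mapped character."""
    # NFC makes sure 'ü' is a single code point before the lookup.
    composed = unicodedata.normalize("NFC", word.lower())
    return "".join(char_map.get(ch, ch) for ch in composed)

print(normalize("Müller", HYPOTHETICAL_MAP))
print(normalize("Müller", {**HYPOTHETICAL_MAP, "ü": "ue"}))
```

With the first mapping the indexed word becomes muller, with the second it becomes mueller, which is why the spelling that finds müller depends on the map in use.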

Set the option --encoding to either utf-8 or iso to specify the encoding of your file. The path to the directory that contains the maps can be defined by using --maps-directory=codebase/utility/. The default is the directory from which the binaries are called.

It is possible to define a field-specific score by using --score=author:3,year:1 (a comma-separated list of <fieldName>:<score> pairs).

A field-specific ordering can be defined by specifying a comma-separated list of <fieldname>:<ordering> pairs. We support three different ordering types: order by literal, by date or by precision. For example, --ordering=author:literal,price:5.2,release:date produces words like :ordering:author:Paul_Heinz (literal), :ordering:price:0013582 (the number 135.8264 in 5.2 precision, i.e. five digits before and two digits after the decimal point) and :ordering:release:20130824 (for the input word 24.8.2013). More information on the supported date formats can be found in the file StringConverter.cpp in the method StringConverter::convertDate.
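The three ordering types can be reproduced for the examples above with a short sketch. It is not taken from StringConverter.cpp: the function names are hypothetical, only the day.month.year date format of the example is handled, only non-negative numbers are considered, and truncating (rather than rounding) the extra decimals is inferred from the example value 0013582.

```python
from datetime import datetime
from decimal import Decimal, ROUND_DOWN

def fixed_width(value, int_digits, frac_digits):
    """Encode a non-negative number with a fixed number of integer and
    fractional digits and no decimal point, so that the lexicographic
    order of the words equals the numeric order."""
    scaled = (Decimal(value) * (10 ** frac_digits)).quantize(
        Decimal(1), rounding=ROUND_DOWN)
    return str(int(scaled)).zfill(int_digits + frac_digits)

def ordering_word(field, ordering, value):
    """Build an :ordering: word for one field value."""
    if ordering == "literal":
        key = "_".join(value.split())
    elif ordering == "date":
        # Only the format of the example (day.month.year) is handled here.
        key = datetime.strptime(value, "%d.%m.%Y").strftime("%Y%m%d")
    else:
        # A precision such as "5.2": five integer digits, two decimals.
        int_digits, frac_digits = (int(n) for n in ordering.split("."))
        key = fixed_width(value, int_digits, frac_digits)
    return f":ordering:{field}:{key}"

print(ordering_word("author", "literal", "Paul Heinz"))
print(ordering_word("price", "5.2", "135.8264"))
print(ordering_word("release", "date", "24.8.2013"))
```

Zero-padding to a fixed width is what makes such words sortable as plain strings: 0013582 sorts before 0100000, whereas the unpadded 13582 would sort after 100000.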

For more details, a look at the code that processes these command line options might be helpful. It is spread over two files. The code for the general (CSV-independent) options can be found in the file ParserBase.cpp, in the method ParserBase::parseCommandLineOptions. The code for the CSV-specific options can be found in the file CsvParserOptions.cpp, in the method CsvParserOptions::parseCommandLineOptions.