CompleteSearch CSV Parser

We have two general-purpose parsers for producing the <base-name>.words and <base-name>.docs files required for building an index and then searching with CompleteSearch: a CSV Parser and an XML Parser.

The CSV Parser is located in the parser sub-directory of the CompleteSearch codebase. It can be invoked with:

./CsvParserMain [many options] --base-name=<base name>

It then looks for a file <base-name>.csv and produces <base-name>.words and <base-name>.docs according to the specified options.

For a short description of the most important options, call the program without arguments:

./CsvParserMain

Here is an explanation of the most typical options, using an example CSV file (actually, we are using TSV files, i.e. tab-separated values). The file can have as many columns as necessary. The first line names the columns; each following line describes one document.

author         year      title
Paul Heinz     2010      How to squeeze lemons.
Paula Müller   2012      A performance evaluation of algorithms about squeezing lemons.
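
The table above is aligned with spaces for readability; in the actual file the columns are separated by tabs. The following is a minimal Python sketch that writes the example as a TSV file. The base name papers is hypothetical (the parser would then look for papers.csv), and UTF-8 is assumed as the file encoding.

```python
# Hypothetical base name; the parser looks for <base-name>.csv.
BASE_NAME = "papers"

# Header row followed by one row per document, as in the example above.
ROWS = [
    ["author", "year", "title"],
    ["Paul Heinz", "2010", "How to squeeze lemons."],
    ["Paula Müller", "2012",
     "A performance evaluation of algorithms about squeezing lemons."],
]


def example_tsv_text(rows=ROWS):
    """Return the rows as tab-separated text, one line per row."""
    return "".join("\t".join(row) + "\n" for row in rows)


def write_example_tsv(base_name=BASE_NAME):
    """Write the example to <base_name>.csv and return the file name."""
    path = base_name + ".csv"
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write(example_tsv_text())
    return path


if __name__ == "__main__":
    print(write_example_tsv())
```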

The option --full-text=author,title specifies which columns are searched by a simple query like q=lemons.

It is possible to restrict the search to specific columns by using filters and facets. These are defined with --filter=author,year,title and --facets=author,year. The difference between the two is that there is a filter word for each single word in the column, whereas facet words cover a whole phrase (e.g. a full name) and thus help to find specific phrases. Some small examples: q=:filter:author:paul* finds all documents written by someone whose name starts with Paul (Paul, Paula, etc.); q=:facet:author:Paul_Heinz finds the specific author Paul Heinz; q=:filter:title:lemons :facet:author:* returns all authors who have written an article about lemons.
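
To make the difference between filter words and facet words concrete, here is a small Python sketch that derives both kinds of words from one field value. It is only an illustration modelled on the query examples above: the function names are hypothetical, the lowercasing and the removal of punctuation are inferred from the examples, and the actual entries the parser writes to the .words file may look different.

```python
import re


def filter_words(column, value):
    """One word per token in the field, as in q=:filter:<column>:<word>."""
    # Assumption: tokens are lowercased and punctuation is dropped.
    tokens = re.findall(r"\w+", value.lower())
    return [f":filter:{column}:{token}" for token in tokens]


def facet_word(column, value):
    """One word for the whole field, as in q=:facet:<column>:<Phrase>."""
    # Assumption: spaces inside the phrase become underscores.
    return f":facet:{column}:{'_'.join(value.split())}"


if __name__ == "__main__":
    print(filter_words("author", "Paul Heinz"))
    print(facet_word("author", "Paul Heinz"))
```

With this sketch, the author field Paul Heinz yields the two filter words :filter:author:paul and :filter:author:heinz, but only the single facet word :facet:author:Paul_Heinz.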

If a document has several values for one field, e.g. several authors, it is possible to write them in one field, separated by the character specified with within-field-separator=#. If author is defined as a facet and not all of these values shall be stored as facets, use --no-show-facet-prefix=* and mark the values that should not be added with this prefix. Example: Paula Müller#*Frida Peterson#*Hugo Schmitt.
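
The following Python sketch illustrates how such a multi-valued field could be interpreted. It rests on an assumption about the semantics: values that start with the no-show-facet prefix are taken to be excluded from the facets, while all values remain part of the field. The function names are hypothetical, and this is not the parser's actual code.

```python
def all_values(field, separator="#", no_show_prefix="*"):
    """Split the field and strip the marker prefix from marked values."""
    values = [v for v in field.split(separator) if v]
    return [v[len(no_show_prefix):] if v.startswith(no_show_prefix) else v
            for v in values]


def facet_values(field, separator="#", no_show_prefix="*"):
    """Keep only the values that are not marked with the prefix."""
    values = [v for v in field.split(separator) if v]
    return [v for v in values if not v.startswith(no_show_prefix)]


if __name__ == "__main__":
    field = "Paula Müller#*Frida Peterson#*Hugo Schmitt"
    print(all_values(field))
    print(facet_values(field))
```

Under this reading, the example field contains three authors, of which only Paula Müller would be stored as a facet.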

Use show=year,title to specify which fields are returned for a hit. If you want to provide several output variants, specify an info-delimiter=~, which must not occur within your documents, and define the variants separated by semicolons: show=year,title;year,author;year. You can select a variant by adding the query parameter p: p=0 for year and title, p=1 for year and author, p=2 for year, and so on.
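
The relation between the semicolon-separated groups of the show value and the query parameter p can be sketched in Python as follows. The function names and the dictionary representation of a document are hypothetical; the sketch only shows that p is a zero-based index into the groups, in the order in which they are written, and says nothing about how the parser stores the variants internally.

```python
def parse_show(spec):
    """Split a show value into one list of field names per variant."""
    return [group.split(",") for group in spec.split(";")]


def select_output(document, spec, p=0):
    """Return the fields of variant number p for one document."""
    fields = parse_show(spec)[p]
    return {name: document[name] for name in fields}


if __name__ == "__main__":
    doc = {"author": "Paul Heinz", "year": "2010",
           "title": "How to squeeze lemons."}
    spec = "year,title;year,author;year"
    for p in range(3):
        print(p, select_output(doc, spec, p))
```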

All words except the facet words can be normalized by using --normalize. In this case it is possible to find müller by searching for muller or mueller, depending on the mapping in your map files (utf8.map, iso8859-1.map). Set the option --encoding to either utf-8 or iso to specify the encoding of your file. The path to the directory that contains the maps can be set with --maps-directory=codebase/utility/. The default is the directory from which the binaries are called.
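
Putting the options described so far together, the following Python sketch assembles a command line for the example file. The helper build_command and the base name papers are hypothetical. Only options that appear above with a leading -- are used; that they can all be combined in this way is an assumption, so check the output of ./CsvParserMain without arguments before relying on it.

```python
import shlex


def build_command(base_name, full_text=(), filters=(), facets=(),
                  normalize=False, encoding=None, maps_directory=None):
    """Return the argument list for a CsvParserMain call (not executed)."""
    args = ["./CsvParserMain"]
    if full_text:
        args.append("--full-text=" + ",".join(full_text))
    if filters:
        args.append("--filter=" + ",".join(filters))
    if facets:
        args.append("--facets=" + ",".join(facets))
    if normalize:
        args.append("--normalize")
    if encoding:
        args.append("--encoding=" + encoding)
    if maps_directory:
        args.append("--maps-directory=" + maps_directory)
    args.append("--base-name=" + base_name)
    return args


if __name__ == "__main__":
    command = build_command(
        "papers",  # hypothetical base name, i.e. the input is papers.csv
        full_text=["author", "title"],
        filters=["author", "year", "title"],
        facets=["author", "year"],
        normalize=True,
        encoding="utf-8",
        maps_directory="codebase/utility/",
    )
    # Print the command in a form that can be pasted into a shell.
    print(shlex.join(command))
```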

For more details, it may help to look at the code that processes these command-line options. It is located in two files. The code for the general (CSV-independent) options is in the method ParserBase::parseCommandLineOptions in the file ParserBase.cpp. The code for the CSV-specific options is in the method CsvParserOptions::parseCommandLineOptions in the file CsvParserOptions.cpp.