raw data

The raw data may be anything: it may reside in a set of files, it may be packed into a single (possibly huge) XML file, etc.

It is the job of the parser to process the raw data and produce the following files, required for index building.


<collection_name>.vocabulary

List of all words that occur at least once in a document.

Consists of one line per distinct word, in the following format:

<word written in ASCII><TAB><frequency rank>

Note: The lines are sorted by word, that is, by the first column.

Note: A frequency rank of i means that this word is the ith most frequent. The most frequent word has rank 1.

Note: The <collection_name>.vocabulary files we have so far lack the frequency ranks, but the new excerpt generator needs them; see <collection_name>.docs_binary below.
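
The vocabulary format described above could be produced roughly as follows. This is only a minimal sketch, not the actual parser: the function name build_vocabulary_lines is hypothetical, words are obtained by splitting on whitespace, frequency is taken to be the total number of occurrences across all documents, and ties in frequency are broken alphabetically. None of these choices is specified on this page.

```python
from collections import Counter


def build_vocabulary_lines(documents):
    """Return lines of the form '<word><TAB><frequency rank>', sorted by word.

    Assumptions (not taken from the spec): whitespace tokenization,
    frequency = total occurrence count, ties broken alphabetically.
    """
    counts = Counter(word for text in documents for word in text.split())

    # Rank 1 goes to the most frequent word.
    by_frequency = sorted(counts, key=lambda w: (-counts[w], w))
    rank = {word: i for i, word in enumerate(by_frequency, start=1)}

    # The file itself is sorted by word (first column), not by rank.
    return [f"{word}\t{rank[word]}" for word in sorted(rank)]


lines = build_vocabulary_lines(["the cat sat", "the dog sat", "the end"])
for line in lines:
    print(line)
```

For ASCII words, Python's default string ordering matches byte order; if the index builder expects a different collation, the final sort would need to be adapted.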

<collection_name>.docs (TODO: rename to <collection_name>.docs_ascii)

Intermediate list of documents, produced by the parser and used for building <collection_name>.docs.DB

Consists of one line per document, in the following format:

<doc id><TAB>u:<url of document><TAB>t:<title of document><TAB>H:<full text of document without linebreaks>

Note: The format comes from the [http://www.htdig.org htDig search engine], which we used for crawling and parsing in the initial stages of the CompleteSearch project.
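
As an illustration of the line format above, the following sketch writes and reads one such line. The helper names format_docs_line and parse_docs_line are hypothetical, and collapsing every whitespace run (including tabs and linebreaks) into a single space is an assumption made here so that the title and text cannot break the tab-separated layout; the real parser may handle this differently.

```python
def _flatten(s):
    # Assumption: collapse tabs, linebreaks and repeated spaces into one space.
    return " ".join(s.split())


def format_docs_line(doc_id, url, title, text):
    """Build '<doc id><TAB>u:<url><TAB>t:<title><TAB>H:<text>' for one document."""
    return f"{doc_id}\tu:{url}\tt:{_flatten(title)}\tH:{_flatten(text)}"


def parse_docs_line(line):
    """Split one docs line back into (doc_id, url, title, text), all as strings."""
    fields = line.rstrip("\n").split("\t", 3)
    if len(fields) != 4:
        raise ValueError("expected 4 tab-separated fields")
    doc_id, url, title, text = fields
    for field, prefix in ((url, "u:"), (title, "t:"), (text, "H:")):
        if not field.startswith(prefix):
            raise ValueError(f"field does not start with {prefix!r}")
    return doc_id, url[2:], title[2:], text[2:]


line = format_docs_line(1, "http://example.org/a", "A title", "first line\nsecond line")
print(line)
print(parse_docs_line(line))
```

The document id is kept as a string on parsing, since this page does not say whether ids are numeric.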

<collection_name>.docs_binary (NEW: needed for the new excerpt generator to be written by Joachim)

Like <collection_name>.docs_ascii above, but the line format is

<doc id><TAB>u:<url of document><TAB>t:<title of document><TAB>H:<text of documents