completesearch/DocumentFormats

raw data

The raw data may be anything: it may reside in a bunch of files, it may be packed in a single (possibly huge) XML file, etc.

It is the job of the parser to process the raw data and produce the following files, required for index building.

<collection_name>.vocabulary

List of all words which occur at least once in a document.

Consists of one line per distinct word, in the following format:

<word written in ASCII><TAB><frequency rank>

Note: The lines are sorted by word, that is, by the first column.

Note: A frequency rank of i means that this word is the ith most frequent. The most frequent word has rank 1.
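
As an illustration of the line format and the two notes above, here is a minimal sketch in Python. It is not the project's parser: the function name `vocabulary_lines` is hypothetical, and it assumes that words are split on whitespace, that frequency means the total number of occurrences over all documents, and that ties in frequency are broken alphabetically. None of these choices is specified on this page.

```python
from collections import Counter

def vocabulary_lines(documents):
    """Return lines of the form '<word><TAB><frequency rank>', sorted by word."""
    # Assumption: words are whitespace-separated tokens and frequency is the
    # total number of occurrences across all documents.
    counts = Counter(word for text in documents for word in text.split())
    # Rank 1 goes to the most frequent word; ties are broken alphabetically
    # (an assumption, the page does not say how ties are handled).
    by_frequency = sorted(counts, key=lambda w: (-counts[w], w))
    rank = {word: i for i, word in enumerate(by_frequency, start=1)}
    # Output is sorted by the first column, i.e. by word.
    return [f"{word}\t{rank[word]}" for word in sorted(rank)]

lines = vocabulary_lines(["the cat sat", "the dog sat", "the end"])
# lines == ["cat\t3", "dog\t4", "end\t5", "sat\t2", "the\t1"]
```

Keeping the ranking step separate from the final sort makes it clear that the file order (by word) and the rank (by frequency) are independent of each other.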

Note: The <collection_name>.vocabulary files we have so far do not contain frequency ranks, but the new excerpt generator needs them; see <collection_name>.docs_binary below.

<collection_name>.docs (TODO: rename to <collection_name>.docs_ascii)

Intermediate list of documents, produced by the parser and used for building <collection_name>.docs.DB

Consists of one line per document, in the following format:

<doc id><TAB>u:<url of document><TAB>t:<title of document><TAB>H:<full text of document without linebreaks>

Note: The format comes from the [http://www.htdig.org htDig search engine], which we used for crawling and parsing in the initial stages of the CompleteSearch project.
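
The one-line-per-document format can be written and read back along the following lines. This is only a sketch under stated assumptions: `format_docs_line` and `parse_docs_line` are hypothetical helper names, the doc id is treated as an opaque string, and tabs inside a field are collapsed to spaces together with line breaks so that the tab separators stay unambiguous (the page itself only requires the text to be free of line breaks).

```python
def _one_line(value):
    # Collapse every run of whitespace (line breaks, tabs, spaces) to one space.
    # Replacing tabs is an assumption made to keep the field separators unambiguous.
    return " ".join(value.split())

def format_docs_line(doc_id, url, title, text):
    """Build '<doc id><TAB>u:<url><TAB>t:<title><TAB>H:<full text>'."""
    return f"{doc_id}\tu:{_one_line(url)}\tt:{_one_line(title)}\tH:{_one_line(text)}"

def parse_docs_line(line):
    """Split one line into (doc_id, url, title, text), checking the field prefixes."""
    fields = line.rstrip("\n").split("\t", 3)
    if len(fields) != 4:
        raise ValueError("expected 4 tab-separated fields")
    doc_id, url, title, text = fields
    if not (url.startswith("u:") and title.startswith("t:") and text.startswith("H:")):
        raise ValueError("missing u:, t: or H: prefix")
    return doc_id, url[2:], title[2:], text[2:]

line = format_docs_line("17", "http://example.org/a", "A title", "first line\nsecond\tline")
record = parse_docs_line(line + "\n")
```

Here `line` is a single line without a trailing newline, and `record` contains the four fields with the `u:`, `t:` and `H:` prefixes removed.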

<collection_name>.docs_binary (NEW: needed for the new excerpt generator to be written by Joachim)

Like <collection_name>.docs_ascii above, but the line format is

<doc id><TAB>u:<url of document><TAB>t:<title of document><TAB>H:<text of documents