raw data
The raw data may be anything: it may reside in a set of files, it may be packed into a single (possibly huge) XML file, etc.
It is the job of the parser to process the raw data and produce the following files, which are required for index building.
<collection_name>.vocabulary
List of all words that occur at least once in some document.
Consists of one line per distinct word, in the following format:
<word written in ASCII><TAB><frequency rank>
Note: The lines are sorted by word, that is, by the first column.
Note: A frequency rank of i means that this word is the i-th most frequent. The most frequent word has rank 1.
Note: In the <collection_name>.vocabulary files we have so far, the frequency ranks are missing. But we need them for the new excerpt generator; see <collection_name>.docs_binary below.
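The vocabulary format above can be illustrated with a minimal Python sketch. It is not the project's parser: the function name vocabulary_lines is hypothetical, words are obtained by simple whitespace splitting, "frequency" is taken to mean the total number of occurrences over all documents, and ties in frequency are broken alphabetically. All of these are assumptions made only for the example.

```python
from collections import Counter


def vocabulary_lines(documents):
    """Return the lines of a .vocabulary file for an iterable of document texts.

    Hypothetical helper: tokenization, the meaning of "frequency" and the
    tie-breaking rule are assumptions, not taken from the real parser.
    """
    counts = Counter()
    for text in documents:
        counts.update(text.split())  # assumption: whitespace tokenization

    # Rank 1 is the most frequent word; equal counts are ordered by word.
    by_frequency = sorted(counts, key=lambda word: (-counts[word], word))
    rank = {word: i for i, word in enumerate(by_frequency, start=1)}

    # The file itself is sorted by word, that is, by the first column.
    return [f"{word}\t{rank[word]}" for word in sorted(rank)]


if __name__ == "__main__":
    for line in vocabulary_lines(["b a b", "c b a"]):
        print(line)
```

Note that the ranks have to be computed over the whole collection before any line can be written, because the output order (by word) differs from the ranking order (by frequency).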
<collection_name>.docs (TODO: rename to <collection_name>.docs_ascii)
Intermediate list of documents, produced by the parser and used for building <collection_name>.docs.DB
Consists of one line per document, in the following format:
<doc id><TAB>u:<url of document><TAB>t:<title of document><TAB>H:<full text of document without linebreaks>
Note: The format comes from the [http://www.htdig.org htDig search engine], which we used for crawling and parsing in the initial stages of the CompleteSearch project.
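As an illustration of the line format above, here is a minimal Python sketch that writes and reads one such line. The function names format_docs_line and parse_docs_line are hypothetical, and collapsing every run of whitespace (tabs and line breaks included) into a single space is an assumption about how a field is kept on one line; the actual parser may handle this differently.

```python
def format_docs_line(doc_id, url, title, text):
    """Build one line of the .docs file from the fields of a document."""

    def one_line(field):
        # Assumption: replace each run of whitespace (tabs and line breaks
        # included) by a single space, so TAB stays a safe separator.
        return " ".join(field.split())

    return f"{doc_id}\tu:{one_line(url)}\tt:{one_line(title)}\tH:{one_line(text)}"


def parse_docs_line(line):
    """Split one .docs line into (doc id, url, title, full text), all strings."""
    fields = line.rstrip("\n").split("\t", 3)
    if len(fields) != 4:
        raise ValueError("expected four TAB-separated fields")
    doc_id, url, title, text = fields
    if not (url.startswith("u:") and title.startswith("t:") and text.startswith("H:")):
        raise ValueError("missing u:, t: or H: prefix")
    # Strip the two-character field prefixes.
    return doc_id, url[2:], title[2:], text[2:]


if __name__ == "__main__":
    line = format_docs_line(17, "http://example.org/a", "A title", "first\nsecond")
    print(parse_docs_line(line))
```

Limiting the split to three TABs means the full text always ends up in the last field, even if a producer left a stray TAB in it.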
<collection_name>.docs_binary (NEW: needed for the new excerpt generator to be written by Joachim)
Like <collection_name>.docs_ascii above, but the line format is
<doc id><TAB>u:<url of document><TAB>t:<title of document><TAB>H:<text of documents