raw data

The raw data may take any form: it may reside in many separate files, it may be packed into a single (possibly huge) XML file, etc.

It is the job of the parser to process the raw data and produce the following files, which are required for index building.

<collection_name>.vocabulary

List of all words that occur at least once in some document.

Consists of one line per distinct word, in the following format:

<word written in ASCII><TAB><frequency rank>

Note: The lines are sorted by word, that is, by the first column.

Note: A frequency rank of i means that this word is the ith most frequent word. The most frequent word has rank 1.

Note: In the <collection_name>.vocabulary files we have so far, the frequency ranks are missing, but we need them for the new excerpt generator; see <collection_name>.docs_binary below.

Note: For the new excerpt generator, the vocabulary should also contain non-words (sequences of non-word characters between words). All non-words should come after all words, and they should be easily distinguishable from the words by a special prefix. The non-words have their own frequency ranks, that is, the most frequent non-word has frequency rank 1, etc.
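As a purely hypothetical illustration (the words, the ranks, and the special prefix are made up here; we use % as the non-word prefix and _ to stand for a space character inside a non-word), a tiny vocabulary file might look like this:

```
and<TAB>2
complete<TAB>3
the<TAB>1
%.<TAB>2
%_<TAB>1
```

Note how the words are sorted alphabetically and all non-words come after all words, each group with its own frequency ranks starting at 1.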

<collection_name>.docs (TODO: rename to <collection_name>.docs_ascii)

Intermediate list of documents, produced by the parser and used for building <collection_name>.docs.DB.

Consists of one line per document, in the following format:

<doc id><TAB>u:<url of document><TAB>t:<title of document><TAB>H:<full text of document without linebreaks>

Note: The format comes from the [http://www.htdig.org htDig search engine], which we used for crawling and parsing in the initial stages of the CompleteSearch project.
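For illustration, a single (entirely made-up) line of this file might look as follows, with <TAB> standing for an actual tab character:

```
42<TAB>u:http://www.example.com/page.html<TAB>t:An example document<TAB>H:This is the full text of the example document written on a single line
```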

<collection_name>.docs_binary (NEW: needed for the new excerpt generator to be written by Joachim)

Like <collection_name>.docs_ascii above, but the line format is

<doc id><TAB>u:<url of document><TAB>t:<title of document><TAB>H:<id-encoded text of document>

Note: In the id-encoded text, each word and each non-word is encoded by its frequency rank according to <collection_name>.vocabulary. Each id is written as an integer of type DocId (defined in Globals.h; currently an unsigned int, which on a 32-bit machine is a 4-byte integer).
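The binary writing of the ids can be sketched as follows (the function name is made up; the typedef is assumed to match the one in Globals.h):

```cpp
#include <cstdio>
#include <vector>

// Assumed to match the typedef in Globals.h (currently an unsigned int,
// i.e. 4 bytes on a 32-bit machine).
typedef unsigned int DocId;

// Hypothetical helper: write the id-encoded text of one document, that is,
// its sequence of frequency ranks, as raw binary integers of type DocId.
void writeIdEncodedText(FILE* out, const std::vector<DocId>& ranks)
{
  fwrite(ranks.data(), sizeof(DocId), ranks.size(), out);
}
```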

<collection_name>.docs.DB

Compressed version of <collection_name>.docs_ascii or <collection_name>.docs_binary, augmented by an index that allows random access to a particular document via its id.

Consists of four parts.

The first part consists of one line per document, in the following format:

<doc id><TAB>u:<url of document><TAB>t:<title of document><TAB>H:<variable-byte compressed version of frequency-encoded text of document>

Note: The last part is derived as follows from the id-encoded text from <collection_name>.docs: each id is converted to the corresponding frequency rank and then written in variable-byte encoding. For the non-words, two (TODO: or more?) bits are reserved for capitalization info. It's all pretty much like in the CTS paper by Turpin et al., SIGIR'07 (TODO: upload the paper and link to it). CTS stands for Compressed Token System.
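The variable-byte encoding itself is standard: seven payload bits per byte, least-significant group first, with the high bit marking the last byte of a number (conventions differ; this is one common variant, and the capitalization bits mentioned above are omitted). A minimal sketch, with hypothetical function names:

```cpp
#include <cstddef>
#include <vector>

// Encode one frequency rank in variable-byte format: seven bits per byte,
// least-significant group first; the high bit is set on the final byte.
void vbyteEncode(unsigned int x, std::vector<unsigned char>& out)
{
  while (x >= 128)
  {
    out.push_back(static_cast<unsigned char>(x & 127));
    x >>= 7;
  }
  out.push_back(static_cast<unsigned char>(x | 128));
}

// Decode one number starting at position i; advances i past that number.
unsigned int vbyteDecode(const std::vector<unsigned char>& in, size_t& i)
{
  unsigned int x = 0;
  unsigned int shift = 0;
  while (true)
  {
    unsigned char b = in[i++];
    x |= static_cast<unsigned int>(b & 127) << shift;
    if (b & 128) return x;
    shift += 7;
  }
}
```

For example, the rank 300 is encoded in two bytes (44, then 130), while any rank below 128 takes a single byte.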

The second part consists of an array of n+1 offsets, where n is the number of documents:

<offset of line 1><offset of line 2>...<offset of line n><offset of offset list>

Note: The offset of line i is the byte position of the ith line in the file. In particular, the first offset is 0. Each offset is an integer of type off_t (which takes 8 bytes if we compile with -D_LARGE_FILE_OFFSETS etc.)

The third part consists of an array of the n document ids

<id of document 1><id of document 2>...<id of document n>

Note: Each id is of type DocId, see above.

The fourth part consists of a single integer that specifies the number of documents:

<number of documents>

Note: This integer is of type unsigned int, which on a 32-bit machine takes 4 bytes.
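Putting parts two to four together, the lookup of a document by its id can be sketched as follows. This is a simplified in-memory version with hypothetical names, operating on the two arrays after they have been read from the end of the .docs.DB file; it also assumes the doc id array is sorted (if it is not, a linear scan or a hash map would be needed instead):

```cpp
#include <cstddef>
#include <sys/types.h>  // off_t
#include <vector>

// Assumed to match the typedef in Globals.h.
typedef unsigned int DocId;

// Hypothetical helper: given the offset array (n+1 entries, part two) and
// the doc id array (n entries, part three), compute the byte range
// [begin, end) of the line for the given doc id in part one.
// Returns false if no document with that id exists.
bool findDocumentRange(const std::vector<off_t>& offsets,
                       const std::vector<DocId>& docIds,
                       DocId id, off_t& begin, off_t& end)
{
  // Binary search for id among the n document ids.
  size_t lo = 0, hi = docIds.size();
  while (lo < hi)
  {
    size_t mid = (lo + hi) / 2;
    if (docIds[mid] < id) lo = mid + 1;
    else hi = mid;
  }
  if (lo == docIds.size() || docIds[lo] != id) return false;
  begin = offsets[lo];      // offset of that line; the first offset is 0.
  end   = offsets[lo + 1];  // valid because the array has n+1 entries.
  return true;
}
```

Note how the (n+1)-th offset (the offset of the offset list itself) serves as the end position of the last line, so no special case is needed for the last document.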