completesearch/DocumentFormats

(Hint: Also see the [wiki:completesearch/DocumentFormats/Discussion discussion] page.)

The raw data

The raw data may be anything: it may reside in a bunch of files, it may be packed in a single (possibly huge) XML file, etc.

It is the job of the parser to process the raw data and produce the following files, required for index building.

The parser that we have been using for all the recent collections (e.g., wikipedia) takes as input one single large XML file and considers everything between a pair of tags on the first nesting level as one document. For example, the wikipedia xml (which can be downloaded from [http://download.wikimedia.org/enwiki here]), has the format

<mediawiki>...<page>...</page>... ...<page>...</page></mediawiki>

and everything between <page>...</page> is taken as a document.

<collection_name>.vocabulary

List of all words that occur at least once in a document.

Consists of one line per distinct word, in the following format:

<word written in ASCII><TAB><lexical rank><TAB><frequency rank><TAB><parse rank>

Note: A lexical rank of i means that this word is the i-th smallest lexicographically. The lexicographically smallest word has lexical rank 1.

Note: A frequency rank of i means that this word is the i-th most frequent. The most frequent word has frequency rank 1.

Note: A parse rank of i means that this word is the i-th distinct word encountered while parsing. The very first word of the first document has parse rank 1.

Note: In the <collection_name>.vocabulary files we have so far, the frequency ranks are missing. But we need them for the new excerpt generator; see <collection_name>.docs_binary below.

Note: For the new excerpt generator, the vocabulary should also contain non-words (sequences of non-word characters between words). All non-words should come after all words, they should be easily distinguishable from the words by a special prefix, and they should have their own frequency ranks, that is, the most frequent non-word should have a frequency rank of 1, etc.

Note: The order of the lines does not matter.

Note: In the format we have so far, there is only the first column, and the lines are sorted lexicographically so that the lexical rank is equal to the line number. Information about the other two ranks was missing so far, but we need it for the new <collection_name>.docs.DB format; see below.

<collection_name>.docs (TODO: rename to <collection_name>.docs_ascii)

Intermediate list of documents, produced by parser, used for building <collection_name>.docs.DB

Consists of one line per document, in the following format:

<doc id><TAB>u:<url of document><TAB>t:<title of document><TAB>H:<full text of document without linebreaks>

Note: The format comes from the [http://www.htdig.org ht://Dig search engine], which we used for crawling and parsing in the initial stages of the CompleteSearch project

<collection_name>.docs_binary (NEW: needed for the new excerpt generator to be written by Joachim)

Like <collection_name>.docs_ascii above, but the line format is

<doc id><TAB>u:<url of document><TAB>t:<title of document><TAB>H:<id-encoded text of document>

Note: In the id-encoded text each word and each non-word is encoded by its frequency rank according to <collection_name>.vocabulary. Each id is written as an integer of type DocId (in Globals.h, currently an unsigned int, which on a 32-bit machine is a 4-byte integer)

<collection_name>.docs.DB

Compressed version of <collection_name>.docs_ascii or <collection_name>.docs_binary augmented by an index that allows random access to a particular document via its id.

Consists of four parts.

The first part consists one line per document, in the following format

<doc id><TAB>u:<url of document><TAB>t:<title of document><TAB>H:<variable-byte compressed version of frequency-encoded text of document>

Note: The last part is derived as follows from the id-encoded text from <collection_name>.docs: Each id is converted to the corresponding frequency rank, and the written in variable-byte encoded format. For the non-words, two (TODO: or more?) bits are reserved for capitalization info. It's all pretty much like in the CTS-paper by Turpin et al. (TODO: upload the paper and link to it), SIGIR'07. CTS stands for Compressed Token System.

The second part consists of an array of n+1 offsets, where n is the number of documents

<offset of line 1><offset of line 2>...<offset of line n><offset of offset list>

Note: The offset of line i is the position of the i-th line in the document. In particular, the first offset is 0. Each offset is an integer of type off_t (which, if we compile with -D_LARGE_FILE_OFFSETS etc., takes 8 bytes).

The third part consists of an array of the n document ids.

<id of document 1><id of document 2>...<id of document n>

Note: Each id is of type DocId, see above.

The fourth part consists of a single integer, that specifies the number of documents:

<number of documents>

Note: This integer is of type unsigned int, which on a 32-bit machine takes 4 bytes

<collection_name>.words

TODO: describe and explain. Line format is

<word><TAB><doc id><TAB><score><TAB><position>

[Joachim: At present, the score is always 0. The positions start with position 1.]

<collection_name>.hybrid

TODO: describe. It's the main index data structure, for the fast processing of prefix queries. It's in compressed binary format. It's great

CompleteSearch: completesearch/DocumentFormats (last edited 2012-01-25 20:32:40 by Hannah Bast)