raw data
The raw data may be anything: it may reside in a bunch of files, it may be packed in a single (possibly huge) XML file, etc.
It is the job of the parser to process the raw data and produce the following files, required for index building.
<collection_name>.vocabulary
List of all words which occur at least once in a document.
Consists of one line per distinct word, in the following format:
<word written in ASCII><TAB><frequency rank>
Note: The lines are sorted by word, that is, by the first column
Note: A frequency rank of i means that this word is the ith most frequent. The most frequent word has rank 1.
Note: In the <collection_name>.vocabulary files we have so far, the frequency ranks are missing. But we need them for the new excerpt generator; see <collection_name>.docs_binary below
Note: For the new excerpt generator, the vocabulary should also contain non-words (sequences of non-word characters between words). All non-words should come after all words, they should be easily distinguishable from the words by a special prefix, and they should have their own frequency ranks, that is, the most frequent non-word should have a frequency rank of 1, etc.
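The vocabulary format above can be illustrated with a minimal Python sketch. It is not the project's parser: the function name `vocabulary_lines` is hypothetical, ties in frequency are broken alphabetically (an assumption, since the format does not say how ties are ranked), and non-words are left out.

```python
from collections import Counter

def vocabulary_lines(tokens):
    """Return '<word><TAB><frequency rank>' lines, sorted by word."""
    counts = Counter(tokens)
    # Rank 1 is the most frequent word; ties are broken alphabetically (assumption).
    by_frequency = sorted(counts, key=lambda w: (-counts[w], w))
    rank = {word: i + 1 for i, word in enumerate(by_frequency)}
    # The file itself is sorted by word, i.e. by the first column.
    return [f"{word}\t{rank[word]}" for word in sorted(rank)]

lines = vocabulary_lines(["the", "cat", "the", "dog", "the", "cat"])
```

Here "the" occurs three times and gets rank 1, although it comes last in the word-sorted output.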
<collection_name>.docs (TODO: rename to <collection_name>.docs_ascii)
Intermediate list of documents, produced by parser, used for building <collection_name>.docs.DB
Consists of one line per document, in the following format:
<doc id><TAB>u:<url of document><TAB>t:<title of document><TAB>H:<full text of document without linebreaks>
Note: The format comes from the [http://www.htdig.org ht://Dig search engine], which we used for crawling and parsing in the initial stages of the CompleteSearch project
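As an illustration of the line format, the following Python sketch splits one line into its four fields. The function name `parse_docs_line` is hypothetical, and the sketch assumes that the doc id is a decimal integer and that neither the URL nor the title contains a TAB.

```python
def parse_docs_line(line):
    """Split '<doc id>\\tu:<url>\\tt:<title>\\tH:<text>' into a 4-tuple."""
    # Split on the first three TABs only, so TABs inside the text survive.
    doc_id, url, title, text = line.rstrip("\n").split("\t", 3)
    if not (url.startswith("u:") and title.startswith("t:") and text.startswith("H:")):
        raise ValueError("unexpected field prefix")
    # Strip the two-character field markers u:, t: and H:.
    return int(doc_id), url[2:], title[2:], text[2:]

record = parse_docs_line("17\tu:http://example.org/a\tt:A title\tH:some text\n")
```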
<collection_name>.docs_binary (NEW: needed for the new excerpt generator to be written by Joachim)
Like <collection_name>.docs_ascii above, but the line format is
<doc id><TAB>u:<url of document><TAB>t:<title of document><TAB>H:<id-encoded text of document>
Note: In the id-encoded text each word and each non-word is encoded by its frequency rank according to <collection_name>.vocabulary. Each id is written as an integer of type DocId (in Globals.h, currently an unsigned int, which on a 32-bit machine is a 4-byte integer)
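The id encoding might look roughly like the Python sketch below. Several details are assumptions that the description above leaves open: tokens are found with the regular expression `\w+|\W+`, words and non-words are looked up in two separate rank tables, and each id is packed as a 4-byte little-endian unsigned integer. The name `id_encode` is hypothetical, and the sketch does not address how a reader tells word ids from non-word ids.

```python
import re
import struct

def id_encode(text, word_rank, nonword_rank):
    """Replace every word and non-word by its frequency rank, 4 bytes each."""
    ids = []
    # Alternating runs of word characters and non-word characters (assumption).
    for token in re.findall(r"\w+|\W+", text):
        table = word_rank if re.match(r"\w", token) else nonword_rank
        ids.append(table[token])
    # "<I" = little-endian unsigned 32-bit integer (assumed layout of DocId).
    return struct.pack(f"<{len(ids)}I", *ids)

encoded = id_encode("the cat.", {"the": 1, "cat": 2}, {" ": 1, ".": 2})
```

The four tokens "the", " ", "cat" and "." become four ids, so the encoded text is 16 bytes long.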
<collection_name>.docs.DB
Compressed version of <collection_name>.docs_ascii or <collection_name>.docs_binary augmented by an index that allows random access to a particular document via its id.
Consists of four parts.
The first part consists of one line per document, in the following format:
<doc id><TAB>u:<url of document><TAB>t:<title of document><TAB>H:<variable-byte compressed version of frequency-encoded text of document>
Note: The last part is derived from the id-encoded text from <collection_name>.docs as follows: each id is converted to the corresponding frequency rank and then written in variable-byte encoding. For the non-words, two (TODO: or more?) bits are reserved for capitalization info. This closely follows the CTS paper by Turpin et al., SIGIR'07 (TODO: upload the paper and link to it). CTS stands for Compressed Token System.
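To make the variable-byte step concrete, here is a Python sketch of one common variant: seven payload bits per byte, low-order bits first, with the high bit set on every byte except the last byte of a number. Which variant the index builder actually uses is not stated here, so this byte layout is an assumption, and the capitalization bits for non-words are not handled.

```python
def vbyte_encode(numbers):
    """Encode non-negative integers, 7 bits per byte, low-order bits first."""
    out = bytearray()
    for n in numbers:
        while n >= 0x80:
            out.append((n & 0x7F) | 0x80)  # high bit set: more bytes follow
            n >>= 7
        out.append(n)                      # high bit clear: last byte of this number
    return bytes(out)

def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    numbers, current, shift = [], 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            numbers.append(current)
            current, shift = 0, 0
    return numbers

encoded = vbyte_encode([1, 127, 128, 300])
```

Ranks up to 127 fit in a single byte, which is why encoding by frequency rank pays off: the most frequent tokens get the shortest codes.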
The second part consists of an array of n+1 offsets, where n is the number of documents:
<offset of line 1><offset of line 2>...<offset of line n><offset of offset list>
Note: The offset of line i is the byte position at which the ith line starts in the file. In particular, the first offset is 0. Each offset is an integer of type off_t (which, if we compile with -D_LARGE_FILE_OFFSETS etc., takes 8 bytes).
The third part consists of an array of the n document ids:
<id of document 1><id of document 2>...<id of document n>
Note: Each id is of type DocId, see above.
The fourth part consists of a single integer that specifies the number of documents:
<number of documents>
Note: This integer is of type unsigned int, which on a 32-bit machine takes 4 bytes