== raw data ==

The raw data may be anything: it may reside in a bunch of files, it may be packed in a single (possibly huge) XML file, etc. It is the job of the '''parser''' to process the raw data and produce the following files, required for index building.

== <collection_name>.vocabulary ==

List of all words which occur at least once in a document. Consists of one line per distinct word, in the following format:

{{{
<word written in ASCII><TAB><frequency rank>
}}}

Note: The lines are sorted by word, that is, by the first column.

Note: A frequency rank of ''i'' means that this word is the ''i''th most frequent. The most frequent word has rank 1.

Note: In the <collection_name>.vocabulary files we have so far, the frequency ranks are missing. But we need them for the new excerpt generator; see <collection_name>.docs_binary below.

Note: For the new excerpt generator, the vocabulary should also contain ''non-words'' (sequences of non-word characters between words). All non-words should come after all words, they should be easily distinguishable from the words by a special prefix, and they should have their own frequency ranks, that is, the most frequent non-word should have a frequency rank of 1, etc.
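The vocabulary format above can be illustrated with a minimal sketch. It is not the project's parser: the function name `build_vocabulary`, the word pattern `[A-Za-z0-9]+`, and the alphabetical tie-breaking between equally frequent words are all assumptions made for the example, and non-words are left out because their prefix is not specified here.

```python
import re
from collections import Counter

# Assumption: a "word" is a maximal run of ASCII letters and digits.
WORD_RE = re.compile(r"[A-Za-z0-9]+")


def build_vocabulary(texts):
    """Return vocabulary lines '<word><TAB><frequency rank>', sorted by word."""
    counts = Counter()
    for text in texts:
        counts.update(WORD_RE.findall(text))
    # Rank 1 is the most frequent word; ties are broken alphabetically
    # (an assumption, the page does not say how ties are resolved).
    by_frequency = sorted(counts, key=lambda word: (-counts[word], word))
    rank = {word: i for i, word in enumerate(by_frequency, start=1)}
    # The file itself is sorted by word, not by rank.
    return [f"{word}\t{rank[word]}" for word in sorted(rank)]
```

For example, `build_vocabulary(["b a b", "c b a"])` yields one line each for `a`, `b` and `c`, with `b` carrying rank 1 because it occurs most often.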
== <collection_name>.docs (TODO: rename to <collection_name>.docs_ascii) ==

Intermediate list of documents, produced by the parser and used for building <collection_name>.docs.DB. Consists of one line per document, in the following format:

{{{
<doc id><TAB>u:<url of document><TAB>t:<title of document><TAB>H:<full text of document without linebreaks>
}}}

Note: The format comes from the [http://www.htdig.org ht``Dig search engine], which we used for crawling and parsing in the initial stages of the CompleteSearch project.

== <collection_name>.docs_binary (NEW: needed for the new excerpt generator to be written by Joachim) ==

Like <collection_name>.docs_ascii above, but the line format is:

{{{
<doc id><TAB>u:<url of document><TAB>t:<title of document><TAB>H:<id-encoded text of document>
}}}

Note: In the id-encoded text, each word and each non-word is encoded by its frequency rank according to <collection_name>.vocabulary. Each id is written as an integer of type ''DocId'' (in ''Globals.h'', currently an ''unsigned int'', which on a 32-bit machine is a 4-byte integer).

== <collection_name>.docs.DB ==

Compressed version of <collection_name>.docs_ascii or <collection_name>.docs_binary, augmented by an index that allows random access to a particular document via its id. Consists of four parts.

The first part consists of one line per document, in the following format:

{{{
<doc id><TAB>u:<url of document><TAB>t:<title of document><TAB>H:<variable-byte compressed version of frequency-encoded text of document>
}}}

Note: The last field (after ''H:'') is derived as follows from the id-encoded text from <collection_name>.docs: Each id is converted to the corresponding frequency rank, and then written in variable-byte encoded format. For the non-words, two (TODO: or more?) bits are reserved for capitalization info. This closely follows the CTS paper by Turpin et al., SIGIR'07 (TODO: upload the paper and link to it). CTS stands for ''Compressed Token System''.
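To make the variable-byte step concrete, here is a small sketch of one common variant: 7 payload bits per byte, low-order group first, high bit set on every byte except the last of a number. The exact byte layout used in docs.DB is not specified on this page, so this layout and the names `vbyte_encode` and `vbyte_decode` are assumptions; the capitalization bits are ignored.

```python
def vbyte_encode(numbers):
    """Encode non-negative integers with 7 payload bits per byte."""
    out = bytearray()
    for n in numbers:
        if n < 0:
            raise ValueError("variable-byte encoding needs non-negative integers")
        # Emit low-order 7-bit groups; the high bit marks "more bytes follow".
        while n >= 0x80:
            out.append((n & 0x7F) | 0x80)
            n >>= 7
        out.append(n)  # final byte of this number, high bit clear
    return bytes(out)


def vbyte_decode(data):
    """Inverse of vbyte_encode: return the list of encoded integers."""
    numbers = []
    value = 0
    shift = 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7  # continuation byte, keep accumulating
        else:
            numbers.append(value)
            value = 0
            shift = 0
    return numbers
```

Because frequent tokens have small ranks, they fit in a single byte under this scheme: any rank below 128 takes one byte, and any rank below 16384 takes two.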

The second part consists of an array of ''n+1'' offsets, where ''n'' is the number of documents:

{{{
<offset of line 1><offset of line 2>...<offset of line n><offset of offset list>
}}}

Note: The offset of line ''i'' is the position of the ''i''th line in the file. In particular, the first offset is 0. Each offset is an integer of type ''off_t'' (which, if we compile with -D_LARGE_FILE_OFFSETS etc., takes 8 bytes).

The third part consists of an array of the ''n'' document ids:

{{{
<id of document 1><id of document 2>...<id of document n>
}}}

Note: Each id is of type ''DocId'', see above.

The fourth part consists of a single integer that specifies the number of documents:

{{{
<number of documents>
}}}

Note: This integer is of type ''unsigned int'', which on a 32-bit machine takes 4 bytes.
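Since the document count comes last, a reader can locate the index by working backwards from the end of the file. The sketch below shows that idea under explicit assumptions that this page does not confirm: little-endian byte order, 8-byte ''off_t'' offsets, and 4-byte ''DocId'' and count fields. The function name `read_docs_db_index` is hypothetical.

```python
import struct

OFFSET_SIZE = 8  # assumption: off_t compiled as a 64-bit integer
ID_SIZE = 4      # assumption: DocId is a 4-byte unsigned int
COUNT_SIZE = 4   # assumption: the document count is a 4-byte unsigned int


def read_docs_db_index(data):
    """Return (offsets, doc_ids) parsed from the tail of a docs.DB image."""
    if len(data) < COUNT_SIZE:
        raise ValueError("file too short to hold a document count")
    # Fourth part: the number of documents, stored at the very end.
    (n,) = struct.unpack_from("<I", data, len(data) - COUNT_SIZE)
    # Third part: n document ids, directly before the count.
    ids_start = len(data) - COUNT_SIZE - n * ID_SIZE
    # Second part: n + 1 offsets, directly before the ids.
    offsets_start = ids_start - (n + 1) * OFFSET_SIZE
    if offsets_start < 0:
        raise ValueError("file too short for its document count")
    offsets = struct.unpack_from(f"<{n + 1}q", data, offsets_start)
    doc_ids = struct.unpack_from(f"<{n}I", data, ids_start)
    return list(offsets), list(doc_ids)
```

With the two lists in hand, the line of the ''i''th document spans the bytes from `offsets[i]` up to `offsets[i + 1]`, which is why ''n+1'' offsets are stored rather than ''n''.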