Diff for "completesearch/DocumentFormats"

Differences between revisions 2 and 32 (spanning 30 versions)

CategoryCategory

CompleteSearch: completesearch/DocumentFormats (last edited 2012-01-25 20:32:40 by Hannah Bast)

-  ⇤ ← Revision 2 as of 2007-08-24 19:11:20 → 
  Size: 4112
  Editor: infno1613
  Comment:
+   ← Revision 32 as of 2011-06-30 03:42:15 → ⇥
  Size: 228
  Editor: camel
  Comment:
-Deletions are marked like this.
+Additions are marked like this.
 Line 1:
-== raw data ==

The raw data may be anything: it may be spread over many files, it may be packed into a single (possibly huge) XML file, etc.

It is the job of the '''parser''' to process the raw data and produce the following files, required for index building.

== <collection_name>.vocabulary ==

List of all words which occur at least once in a document.

Consists of one line per distinct word, in the following format:

{{{
<word written in ASCII><TAB><frequency rank>
}}}

Note: The lines are sorted by word, that is, by the first column.

Note: A frequency rank of ''i'' means that this word is the ''i''th most frequent. The most frequent word has rank 1.

Note: In the <collection_name>.vocabulary files we have so far, the frequency ranks are missing, but we need them for the new excerpt generator; see <collection_name>.docs_binary below.

Note: For the new excerpt generator, the vocabulary should also contain ''non-words'' (sequences of non-word characters between words). All non-words should come after all words, they should be easily distinguishable from the words by a special prefix, and they should have their own frequency ranks, that is, the most frequent non-word should have a frequency rank of 1, etc.
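
A minimal sketch of how such vocabulary lines could be produced from a token stream. The function name `build_vocabulary` and the alphabetical tie-breaking between equally frequent words are assumptions made for illustration, not part of the format, and non-words are left out here.

```python
from collections import Counter

def build_vocabulary(words):
    """Return lines '<word><TAB><frequency rank>', sorted by word."""
    counts = Counter(words)
    # Rank 1 goes to the most frequent word; ties are broken
    # alphabetically (an assumption, the format does not say).
    by_frequency = sorted(counts, key=lambda w: (-counts[w], w))
    rank = {w: i + 1 for i, w in enumerate(by_frequency)}
    # The file itself is sorted by word (first column), not by rank.
    return [f"{w}\t{rank[w]}" for w in sorted(rank)]

lines = build_vocabulary("the cat and the dog and the bird".split())
```

Here "the" occurs three times and gets rank 1, yet it appears last in `lines` because the lines are ordered by word.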

== <collection_name>.docs (TODO: rename to <collection_name>.docs_ascii) ==

Intermediate list of documents, produced by the parser and used for building <collection_name>.docs.DB.

Consists of one line per document, in the following format:

{{{
<doc id><TAB>u:<url of document><TAB>t:<title of document><TAB>H:<full text of document without linebreaks>
}}}

Note: The format comes from the [http://www.htdig.org ht``Dig search engine], which we used for crawling and parsing in the initial stages of the CompleteSearch project.
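
A line in this format could be split into its fields roughly as follows. This is only a sketch: the name `parse_docs_line` is hypothetical, and it assumes that the doc id is a decimal integer and that the url and title contain no TAB characters.

```python
def parse_docs_line(line):
    """Split one docs line into its id, url, title and text fields."""
    # Split on the first three TABs only, so TABs inside the text survive.
    doc_id, url, title, text = line.rstrip("\n").split("\t", 3)
    if not (url.startswith("u:") and title.startswith("t:")
            and text.startswith("H:")):
        raise ValueError("unexpected field prefix")
    # Assumption: the doc id is written as a decimal integer.
    return {"id": int(doc_id), "url": url[2:],
            "title": title[2:], "text": text[2:]}

record = parse_docs_line(
    "17\tu:http://example.org/a\tt:A title\tH:some text\twith tab\n")
```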

== <collection_name>.docs_binary (NEW: needed for the new excerpt generator to be written by Joachim) ==

Like <collection_name>.docs_ascii above, but the line format is

{{{
<doc id><TAB>u:<url of document><TAB>t:<title of document><TAB>H:<id-encoded text of document>
}}}

Note: In the id-encoded text, each word and each non-word is encoded by its frequency rank according to <collection_name>.vocabulary. Each id is written as an integer of type ''DocId'' (defined in ''Globals.h''; currently an ''unsigned int'', which takes 4 bytes on a 32-bit machine).
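
To illustrate the id encoding, a sketch that maps tokens to their frequency ranks and packs each rank as a 4-byte unsigned integer. The name `id_encode`, the little-endian byte order and the pre-tokenized input are assumptions; the real byte order is whatever the machine writing the file uses.

```python
import struct

def id_encode(tokens, rank):
    """Encode tokens as a sequence of 4-byte ids (their frequency ranks)."""
    ids = [rank[token] for token in tokens]
    # "<I" = little-endian 4-byte unsigned int (byte order is an assumption).
    return struct.pack(f"<{len(ids)}I", *ids)

encoded = id_encode(["the", "cat"], {"the": 1, "cat": 4})
```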

== <collection_name>.docs.DB ==

Compressed version of <collection_name>.docs_ascii or <collection_name>.docs_binary augmented by an index that allows random access to a particular document via its id.

Consists of four parts.

The first part consists of one line per document, in the following format:

{{{
<doc id><TAB>u:<url of document><TAB>t:<title of document><TAB>H:<variable-byte compressed version of frequency-encoded text of document>
}}}

Note: The last field is derived as follows from the id-encoded text from <collection_name>.docs: each id is converted to the corresponding frequency rank and then written in variable-byte encoding. For the non-words, two (TODO: or more?) bits are reserved for capitalization info. It is all pretty much as in the CTS paper by Turpin et al., SIGIR'07 (TODO: upload the paper and link to it). CTS stands for ''Compressed Token System''.
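
For readers unfamiliar with variable-byte encoding, here is a sketch of one common variant: 7 payload bits per byte, least significant group first, with the high bit set on every byte except the last. Which variant the index builder uses is not stated here, so the bit layout and the names `vbyte_encode` and `vbyte_decode` are assumptions; the capitalization bits are not modelled.

```python
def vbyte_encode(n):
    """Encode one non-negative integer in variable-byte format."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        n >>= 7
    out.append(n)                      # final byte, continuation bit clear
    return bytes(out)

def vbyte_decode(data):
    """Decode a concatenation of variable-byte encoded integers."""
    values, current, shift = [], 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7                 # more bytes of this integer follow
        else:
            values.append(current)
            current, shift = 0, 0
    return values

ranks = [1, 300, 127, 128]
stream = b"".join(vbyte_encode(r) for r in ranks)
```

Small ranks, which belong to the most frequent words, take a single byte, which is where the compression comes from. Note that an encoded byte can coincide with TAB or newline (rank 10 encodes as the byte 0x0A), so a reader would presumably locate documents via the offsets rather than by splitting on line breaks.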

The second part consists of an array of ''n+1'' offsets, where ''n'' is the number of documents:

{{{
<offset of line 1><offset of line 2>...<offset of line n><offset of offset list>
}}}

Note: The offset of line ''i'' is the position of the ''i''th line in the file. In particular, the first offset is 0. Each offset is an integer of type ''off_t'' (which takes 8 bytes if we compile with large-file support, e.g. -D_FILE_OFFSET_BITS=64).

The third part consists of an array of the ''n'' document ids:

{{{
<id of document 1><id of document 2>...<id of document n>
}}}

Note: Each id is of type ''DocId'', see above.

The fourth part consists of a single integer that specifies the number of documents:

{{{
<number of documents>
}}}

Note: This integer is of type ''unsigned int'', which on a 32-bit machine takes 4 bytes.
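
The four parts can be read back to front: the document count at the very end gives the sizes of the id array and the offset array, and the offsets then locate a document's line. A sketch under stated assumptions: 8-byte offsets, 4-byte ids and count, little-endian byte order, and a linear search over the ids. The names `write_docs_db` and `read_doc` are hypothetical.

```python
import io
import struct

OFF = struct.Struct("<q")  # assumed: off_t as 8-byte little-endian integer
ID = struct.Struct("<I")   # assumed: DocId / unsigned int as 4 bytes

def write_docs_db(out, docs):
    """Write (doc_id, line_bytes) pairs followed by the three index parts."""
    offsets = []
    for _, line in docs:
        offsets.append(out.tell())
        out.write(line)
    offsets.append(out.tell())          # offset of the offset list itself
    for offset in offsets:
        out.write(OFF.pack(offset))     # part 2: n+1 offsets
    for doc_id, _ in docs:
        out.write(ID.pack(doc_id))      # part 3: n document ids
    out.write(ID.pack(len(docs)))       # part 4: number of documents

def read_doc(f, wanted_id):
    """Return the raw line of the document with the given id, or None."""
    f.seek(-ID.size, io.SEEK_END)
    n = ID.unpack(f.read(ID.size))[0]
    index_size = (n + 1) * OFF.size + n * ID.size + ID.size
    f.seek(-index_size, io.SEEK_END)
    offsets = [OFF.unpack(f.read(OFF.size))[0] for _ in range(n + 1)]
    ids = [ID.unpack(f.read(ID.size))[0] for _ in range(n)]
    if wanted_id not in ids:
        return None
    i = ids.index(wanted_id)            # linear search, for simplicity
    f.seek(offsets[i])
    # Line i ends where line i+1 (or the offset list) begins.
    return f.read(offsets[i + 1] - offsets[i])

buf = io.BytesIO()
write_docs_db(buf, [(7, b"7\tu:a\tt:A\tH:x\n"), (9, b"9\tu:b\tt:B\tH:y\n")])
```

Storing ''n+1'' offsets rather than ''n'' means the length of the last line needs no special case: it is the difference between the last two offsets.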
----
CategoryCategory