1671
Comment:
|
373
|
Deletions are marked like this. | Additions are marked like this. |
Line 1: | Line 1: |
== raw data ==
The raw data may be anything: it may reside in a bunch of files, it may be packed in a single (possibly huge) XML file, etc. It is the job of the '''parser''' to process the raw data and produce the following files, which are required for index building.

== <collection_name>.vocabulary ==
List of all words which occur at least once in a document. Consists of one line per distinct word, in the following format:
{{{
<word written in ASCII><TAB><frequency rank>
}}}
Note: The lines are sorted by word, that is, by the first column.

Note: A frequency rank of ''i'' means that this word is the ''i''th most frequent. The most frequent word has rank 1.

Note: In the <collection_name>.vocabulary files we have so far, the frequency ranks are missing. But we need them for the new excerpt generator; see <collection_name>.docs_binary below.

== <collection_name>.docs (TODO: rename to <collection_name>.docs_ascii) ==
Intermediate list of documents, produced by the parser and used for building <collection_name>.docs.DB. Consists of one line per document, in the following format:
{{{
<doc id><TAB>u:<url of document><TAB>t:<title of document><TAB>H:<full text of document without linebreaks>
}}}
Note: The format comes from the [http://www.htdig.org ht``Dig search engine], which we used for crawling and parsing in the initial stages of the CompleteSearch project.

== <collection_name>.docs_binary (NEW: needed for the new excerpt generator to be written by Joachim) ==
Like <collection_name>.docs_ascii above, but the line format is
{{{
<doc id><TAB>u:<url of document><TAB>t:<title of document><TAB>H:<text of documents
}}} |
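To make the two line formats concrete, here is a minimal Python sketch, not the project's actual parser. It writes vocabulary lines with frequency ranks and reads one docs line back into its fields. The function names, the whitespace tokenization, the alphabetical tie-breaking between equally frequent words, and the example URL are all assumptions made for illustration; the page does not specify any of them.

```python
from collections import Counter


def build_vocabulary_lines(documents):
    """Return lines '<word><TAB><frequency rank>', sorted by word.

    `documents` is an iterable of plain-text strings. Splitting on
    whitespace is an assumption; a real parser may tokenize differently.
    """
    counts = Counter(word for text in documents for word in text.split())
    # Rank 1 is the most frequent word. Ties are broken alphabetically,
    # which is an assumption (the format description gives no rule).
    by_frequency = sorted(counts, key=lambda w: (-counts[w], w))
    rank = {word: i for i, word in enumerate(by_frequency, start=1)}
    # The file itself is sorted by word, i.e. by the first column.
    return [f"{word}\t{rank[word]}" for word in sorted(counts)]


def parse_docs_line(line):
    """Split one docs line into (doc_id, url, title, text).

    Expects '<doc id><TAB>u:<url><TAB>t:<title><TAB>H:<text>'.
    The doc id is returned as a string, unchanged.
    """
    # At most 3 splits, so TABs inside the text field would be preserved.
    doc_id, url, title, text = line.rstrip("\n").split("\t", 3)
    for field, prefix in ((url, "u:"), (title, "t:"), (text, "H:")):
        if not field.startswith(prefix):
            raise ValueError(f"expected prefix {prefix!r} in {field!r}")
    return doc_id, url[2:], title[2:], text[2:]
```

For example, `build_vocabulary_lines(["b a b", "c b a"])` yields one line per distinct word with `b` at rank 1, and `parse_docs_line` raises `ValueError` when a field lacks its `u:`, `t:` or `H:` prefix.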