Differences between revisions 1 and 42 (spanning 41 versions)
Revision 1 as of 2007-08-24 18:41:14
Size: 1671
Editor: infno1613
Comment:
Revision 42 as of 2011-07-28 21:59:37
Size: 317
Editor: p57B0BAF6
Comment:
== raw data ==

The raw data may be anything: it may reside in a collection of files, it may be packed into a single (possibly huge) XML file, etc.

It is the job of the '''parser''' to process the raw data and produce the following files, required for index building.

== <collection_name>.vocabulary ==

List of all words which occur at least once in a document.

Consists of one line per distinct word, in the following format:

{{{
<word written in ASCII><TAB><frequency rank>
}}}

Note: The lines are sorted by word, that is, by the first column.

Note: A frequency rank of ''i'' means that this word is the ''i''th most frequent. The most frequent word has rank 1.

Note: The <collection_name>.vocabulary files we have so far lack the frequency ranks, but the new excerpt generator needs them; see <collection_name>.docs_binary below.
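
The vocabulary format described above can be sketched in a few lines of Python. This is only an illustration, not the project's parser: the function name `vocabulary_lines`, the whitespace tokenization, and the alphabetical tie-breaking between equally frequent words are all assumptions.

```python
from collections import Counter


def vocabulary_lines(documents):
    """Return '<word><TAB><frequency rank>' lines, sorted by word.

    `documents` is an iterable of plain-text strings. Splitting on
    whitespace is an assumption; a real parser defines its own notion
    of a word.
    """
    counts = Counter(word for text in documents for word in text.split())
    # Rank 1 is the most frequent word. Ties are broken alphabetically,
    # which is an assumption; the format description does not cover ties.
    by_frequency = sorted(counts, key=lambda word: (-counts[word], word))
    rank = {word: i for i, word in enumerate(by_frequency, start=1)}
    # The output lines are sorted by the first column, i.e. by word.
    return [f"{word}\t{rank[word]}" for word in sorted(rank)]
```

For example, `vocabulary_lines(["b a b", "c b a"])` gives one line each for `a`, `b` and `c`, in that order, with `b` carrying rank 1 because it occurs most often.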

== <collection_name>.docs (TODO: rename to <collection_name>.docs_ascii) ==

Intermediate list of documents, produced by the parser and used for building <collection_name>.docs.DB.

Consists of one line per document, in the following format:

{{{
<doc id><TAB>u:<url of document><TAB>t:<title of document><TAB>H:<full text of document without linebreaks>
}}}

Note: The format comes from the [http://www.htdig.org ht``Dig search engine], which we used for crawling and parsing in the initial stages of the CompleteSearch project.
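
To make the line format concrete, here is a minimal Python sketch that splits one line of the docs file into its four fields. The function name `parse_docs_line` and the returned dictionary keys are hypothetical, and the sketch assumes that the URL and title never contain a TAB.

```python
def parse_docs_line(line):
    """Split one docs line into doc id, URL, title and full text.

    Expected layout:
    <doc id><TAB>u:<url><TAB>t:<title><TAB>H:<full text>
    """
    # Split on the first three TABs only, so that any further TAB
    # stays inside the text field (an assumption about the data).
    fields = line.rstrip("\n").split("\t", 3)
    if len(fields) != 4:
        raise ValueError("expected 4 TAB-separated fields")
    doc_id, url, title, text = fields
    for value, prefix in ((url, "u:"), (title, "t:"), (text, "H:")):
        if not value.startswith(prefix):
            raise ValueError(f"field does not start with {prefix!r}")
    # Strip the two-character prefixes; the doc id is kept as a string
    # because its type is not specified in the format description.
    return {"id": doc_id, "url": url[2:], "title": title[2:], "text": text[2:]}
```

A line that has too few fields or a wrong prefix raises `ValueError`, so malformed parser output is caught early rather than silently indexed.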

== <collection_name>.docs_binary (NEW: needed for the new excerpt generator to be written by Joachim) ==

Like <collection_name>.docs_ascii above, but the line format is

{{{
<doc id><TAB>u:<url of document><TAB>t:<title of document><TAB>H:<text of documents
}}}
----
CategoryCategory

CompleteSearch: completesearch/DocumentFormats (last edited 2012-01-25 20:32:40 by Hannah Bast)