Differences between revisions 1 and 21 (spanning 20 versions)
Revision 1 as of 2007-08-24 18:41:14
Size: 1671
Editor: infno1613
Comment:
Revision 21 as of 2011-06-11 08:52:00
Size: 85
Editor: ec2-50-18-63-234
== raw data ==

The raw data may be anything: it may reside in a bunch of files, it may be packed into a single (possibly huge) XML file, etc.

It is the job of the '''parser''' to process the raw data and produce the following files, required for index building.

== <collection_name>.vocabulary ==

List of all words which occur at least once in a document.

Consists of one line per distinct word, in the following format:

{{{
<word written in ASCII><TAB><frequency rank>
}}}

Note: The lines are sorted by word, that is, by the first column.

Note: A frequency rank of ''i'' means that this word is the ''i''th most frequent. The most frequent word has rank 1.

Note: In the <collection_name>.vocabulary files we have so far, the frequency ranks are missing. But we need them for the new excerpt generator; see <collection_name>.docs_binary below.
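
The following is a minimal sketch of how a parser could produce the vocabulary lines described above. It is not the actual CompleteSearch parser: the function name `vocabulary_lines`, the whitespace tokenization, and the alphabetical tie-breaking between equally frequent words are all assumptions made for illustration.

```python
from collections import Counter


def vocabulary_lines(documents):
    """Return lines of the form '<word><TAB><frequency rank>', sorted by word.

    `documents` is an iterable of plain-text strings. Splitting on
    whitespace is an assumption; a real parser may tokenize differently.
    """
    counts = Counter(word for text in documents for word in text.split())
    # Rank 1 is the most frequent word. Ties are broken alphabetically,
    # which is an assumption; the format description does not cover ties.
    by_frequency = sorted(counts, key=lambda w: (-counts[w], w))
    rank = {word: i for i, word in enumerate(by_frequency, start=1)}
    # The output is sorted by the first column (the word), as required.
    return [f"{word}\t{rank[word]}" for word in sorted(rank)]


if __name__ == "__main__":
    for line in vocabulary_lines(["b a b", "c b a"]):
        print(line)
```

Note: The sketch sorts words by Python's default string order, which matches byte order for pure ASCII words.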

== <collection_name>.docs (TODO: rename to <collection_name>.docs_ascii) ==

Intermediate list of documents, produced by the parser and used for building <collection_name>.docs.DB.

Consists of one line per document, in the following format:

{{{
<doc id><TAB>u:<url of document><TAB>t:<title of document><TAB>H:<full text of document without linebreaks>
}}}

Note: The format comes from the [http://www.htdig.org ht``Dig search engine], which we used for crawling and parsing in the initial stages of the CompleteSearch project.
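
To illustrate the line format above, here is a small sketch that writes and reads one such line. The names `DocRecord`, `format_docs_line`, and `parse_docs_line` are hypothetical and not part of CompleteSearch. The doc id is kept as a string because the format description does not say whether it is numeric, and replacing linebreaks by single spaces is an assumption about what "without linebreaks" means.

```python
from typing import NamedTuple


class DocRecord(NamedTuple):
    doc_id: str  # kept as a string; the format does not say it is numeric
    url: str
    title: str
    text: str


def format_docs_line(record):
    """Build '<doc id><TAB>u:<url><TAB>t:<title><TAB>H:<text>'."""
    # Assumption: linebreaks in the full text are replaced by single spaces.
    text = " ".join(record.text.splitlines())
    return f"{record.doc_id}\tu:{record.url}\tt:{record.title}\tH:{text}"


def parse_docs_line(line):
    """Split one line into its four fields and strip the u:, t:, H: prefixes."""
    # Split at most three times so that tabs inside the text survive.
    fields = line.rstrip("\n").split("\t", 3)
    if len(fields) != 4:
        raise ValueError("expected 4 tab-separated fields")
    doc_id, url, title, text = fields
    for value, prefix in ((url, "u:"), (title, "t:"), (text, "H:")):
        if not value.startswith(prefix):
            raise ValueError(f"field does not start with {prefix!r}")
    return DocRecord(doc_id, url[2:], title[2:], text[2:])


if __name__ == "__main__":
    rec = DocRecord("17", "http://example.org/a", "A title", "first\nsecond")
    print(parse_docs_line(format_docs_line(rec)))
```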

== <collection_name>.docs_binary (NEW: needed for the new excerpt generator to be written by Joachim) ==

Like <collection_name>.docs_ascii above, but the line format is

{{{
<doc id><TAB>u:<url of document><TAB>t:<title of document><TAB>H:<text of documents
}}}
----
CategoryCategory

CompleteSearch: completesearch/DocumentFormats (last edited 2012-01-25 20:32:40 by Hannah Bast)