== raw data ==

The raw data may come in any form: it may be spread over many files, packed into a single (possibly huge) XML file, etc.

It is the job of the '''parser''' to process the raw data and produce the following files, required for index building.

== <collection_name>.vocabulary ==

List of all words which occur at least once in a document.

Consists of one line per distinct word, in the following format:

{{{
<word written in ASCII><TAB><frequency rank>
}}}

Note: The lines are sorted by word, that is, by the first column.

Note: A frequency rank of ''i'' means that this word is the ''i''th most frequent. The most frequent word has rank 1.

Note: The <collection_name>.vocabulary files we have so far do not contain the frequency ranks. We need them for the new excerpt generator; see <collection_name>.docs_binary below.
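
As an illustration of the format above, here is a minimal sketch of how a parser might produce the vocabulary lines. It is not the project's parser. The function name `vocabulary_lines` is hypothetical, and several choices are assumptions not stated on this page: words are split on whitespace, frequency means the total number of occurrences over all documents, and ties in frequency are broken alphabetically.

```python
from collections import Counter


def vocabulary_lines(documents):
    """Return lines of the form '<word><TAB><frequency rank>', sorted by word.

    `documents` is an iterable of document texts (hypothetical input shape).
    """
    # Assumption: whitespace tokenization, frequency = total occurrence count.
    counts = Counter(word for text in documents for word in text.split())

    # Rank 1 is the most frequent word; ties are broken alphabetically (assumption).
    by_frequency = sorted(counts, key=lambda word: (-counts[word], word))
    rank = {word: i for i, word in enumerate(by_frequency, start=1)}

    # The file itself is sorted by word, that is, by the first column.
    return [f"{word}\t{rank[word]}" for word in sorted(rank)]
```

Writing the result with `"\n".join(vocabulary_lines(texts))` would give one line per distinct word.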

== <collection_name>.docs (TODO: rename to <collection_name>.docs_ascii) ==

Intermediate list of documents, produced by the parser and used for building <collection_name>.docs.DB.

Consists of one line per document, in the following format:

{{{
<doc id><TAB>u:<url of document><TAB>t:<title of document><TAB>H:<full text of document without linebreaks>
}}}

Note: The format comes from the [http://www.htdig.org ht``Dig search engine], which we used for crawling and parsing in the initial stages of the CompleteSearch project.
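
The line format above can be sketched as a pair of helper functions, one that writes a line and one that reads it back. Both names (`format_docs_line`, `parse_docs_line`) are hypothetical and are not part of CompleteSearch. Collapsing every run of whitespace into a single space (including tabs, and in the title as well as the text) is an assumption made here so that the TAB separators stay unambiguous.

```python
def format_docs_line(doc_id, url, title, text):
    """Build one '<doc id>\\tu:<url>\\tt:<title>\\tH:<text>' line."""
    # Assumption: collapse linebreaks, tabs and repeated spaces into single
    # spaces, so the only TABs in the line are the three field separators.
    flat_title = " ".join(title.split())
    flat_text = " ".join(text.split())
    return f"{doc_id}\tu:{url}\tt:{flat_title}\tH:{flat_text}"


def parse_docs_line(line):
    """Split one line back into its four fields, checking the field prefixes."""
    fields = line.rstrip("\n").split("\t", 3)
    if len(fields) != 4:
        raise ValueError("expected four TAB-separated fields")
    doc_id, url, title, text = fields
    if not (url.startswith("u:") and title.startswith("t:") and text.startswith("H:")):
        raise ValueError("missing u:, t: or H: field prefix")
    return {"id": doc_id, "url": url[2:], "title": title[2:], "text": text[2:]}
```

Note that the document id comes back as a string; converting it to an integer is left to the caller, since this page does not say what form the ids take.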

== <collection_name>.docs_binary (NEW: needed for the new excerpt generator to be written by Joachim) ==

Like <collection_name>.docs_ascii above, but the line format is

{{{
<doc id><TAB>u:<url of document><TAB>t:<title of document><TAB>H:<text of documents
}}}
----
CategoryCategory



CompleteSearch: completesearch/DocumentFormats (last edited 2012-01-25 20:32:40 by Hannah Bast)