Differences between revisions 5 and 35 (spanning 30 versions)
Revision 5 as of 2007-08-27 09:17:49
Size: 1032
Editor: mpiat1403
Comment: Comments about section ".docs"
Revision 35 as of 2011-07-28 21:59:06
Size: 213
Editor: p57B0BAF6
Comment:
== The raw data ==

"The raw data may be anything: [...] It is the job of the parser to process the raw data and produce the following files". If the raw data really may be '''anything''', then the parser must be able to parse '''anything'''. This is quite impossible.


== <collection_name>.vocabulary ==

<word written in ASCII>: What should happen to words that contain non-ASCII characters in the input document?
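
To make the question concrete, here is a minimal sketch of one possible policy, not something the format description prescribes: fold each word to ASCII by Unicode decomposition and drop whatever cannot be represented. The function name `to_ascii` and the policy itself are assumptions for illustration only.

```python
import unicodedata


def to_ascii(word: str) -> str:
    """Fold a word to ASCII (hypothetical policy, not taken from the spec).

    NFKD splits accented characters into base letter + combining mark;
    encoding with errors="ignore" then drops everything non-ASCII.
    """
    decomposed = unicodedata.normalize("NFKD", word)
    return decomposed.encode("ascii", "ignore").decode("ascii")
```

Note that this policy is lossy: characters without an ASCII base letter (for example `ß` or CJK characters) simply disappear, so distinct input words can collapse into the same vocabulary entry, or into the empty string. Whether that is acceptable is exactly what the spec would need to say.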


Terms "word" and "non-word": Please give an exact definition.
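
As a starting point for such a definition, the following sketch assumes that a "word" is a maximal run of ASCII letters or digits and that a "non-word" is any maximal run of other characters. Both the definition and the name `split_tokens` are hypothetical; the actual spec may intend something different.

```python
import re

# Assumed definition: word = maximal run of ASCII letters/digits,
# non-word = maximal run of anything else.
_TOKEN = re.compile(r"(?P<word>[A-Za-z0-9]+)|(?P<nonword>[^A-Za-z0-9]+)")


def split_tokens(text: str) -> list[tuple[str, bool]]:
    """Split text into (token, is_word) pairs covering the whole input."""
    return [(m.group(), m.lastgroup == "word") for m in _TOKEN.finditer(text)]
```

Under this definition, `split_tokens("Hello, world!")` yields the words `Hello` and `world` separated by the non-words `, ` and `!`. A word such as `café` is split into the word `caf` and the non-word `é`, which shows that the definition interacts with the non-ASCII question above.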


== <collection_name>.docs (TODO: rename to <collection_name>.docs_ascii) ==

I've put the TODO on the TODO page of the wiki. OK? Shall we eliminate it here?

H: Is this the textual content of a page, that is, with all (HTML) markup removed? Or is it guaranteed that `<page>` elements in the input have no further subelements? I think it should be specified in the section on the parser what exactly it has to do to produce H. Put differently: what is the format of H?
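
For illustration, if H is meant to be the textual content with all markup removed, the parser step could look roughly like the sketch below. It uses Python's standard `html.parser` module; the names `strip_markup` and `_TextExtractor`, and the choice to collapse whitespace, are assumptions and not part of the format description.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects character data and ignores all tags (hypothetical helper)."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.chunks: list[str] = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)


def strip_markup(fragment: str) -> str:
    """Return the text of an HTML/XML fragment with tags removed.

    Whitespace runs are collapsed to single spaces (an assumed choice).
    """
    parser = _TextExtractor()
    parser.feed(fragment)
    parser.close()  # flush any text still buffered by the parser
    return " ".join("".join(parser.chunks).split())
```

With this sketch, `strip_markup("<page><b>Hello</b> &amp; <i>world</i></page>")` gives `Hello & world`. If instead `<page>` elements are guaranteed to have no subelements, no such step is needed, which is why the spec should say which case applies.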

How do we compute the doc id? How does it differ from the line number?
----
CategoryCategory

CompleteSearch: completesearch/DocumentFormats/Discussion (last edited 2011-07-28 21:59:06 by p57B0BAF6)