completesearch/DocumentFormats/Discussion

The raw data

"The raw data may be anything: [...] It is the job of the parser to process the raw data and produce the following files". If the raw data really may be anything, then the parser must be able to parse anything. This is quite impossible.

<collection_name>.vocabulary

<word written in ASCII>: What to do with words that have non-ASCII chars in the input document?
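
To make the question concrete, one possible answer is to fold such words to ASCII before they enter the vocabulary. The following is only a sketch of that option, not what the parser currently does; the function name fold_to_ascii is made up for illustration.

```python
import unicodedata


def fold_to_ascii(word: str) -> str:
    """Hypothetical helper: map a word to an ASCII-only form."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", word)
    # Drop everything that is still not ASCII (combining marks and
    # characters without an ASCII decomposition).
    return decomposed.encode("ascii", "ignore").decode("ascii")
```

Note that this silently drops characters that have no decomposition (for example the German sharp s), so two different input words can collapse to the same vocabulary entry. The spec should say whether that is acceptable.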

Terms "word" and "non-word": Please give an exact definition.

<collection_name>.docs (TODO: rename to <collection_name>.docs_ascii)

I've added this TODO to the wiki's TODO page. OK? Shall we eliminate it here?

How do we compute the doc id? How does it differ from the line number?

How do we compute the title? What chars are allowed?

H: Is this the textual content of a page, that is, with all (HTML) markup removed? Or is it guaranteed that <page> elements in the input have no further subelements? I think it should be specified in the section on the parser what exactly it has to do to produce H. Put differently: what is the format of H?

<collection_name>.docs.DB

"The last part is derived as follows from the id-encoded text from <collection_name>.docs: Each id is converted to the corresponding frequency rank..." You mean: the last part is derived from <collection_name>.docs.binary (where each ID is already the frequency rank)?
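
For reference, the conversion that the quoted sentence describes could be sketched as follows. This assumes that rank 1 is the most frequent id and that ties are broken by the smaller id; neither is stated in the spec, and the function names are illustrative only.

```python
from collections import Counter


def frequency_ranks(docs):
    """Map each word id to its frequency rank (1 = most frequent).

    Assumption: ties are broken by the smaller word id.
    """
    counts = Counter(word_id for doc in docs for word_id in doc)
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {w: rank for rank, w in enumerate(ordered, start=1)}


def to_ranks(docs):
    """Replace every word id in the id-encoded documents by its rank."""
    ranks = frequency_ranks(docs)
    return [[ranks[w] for w in doc] for doc in docs]
```

Whichever file performs this step, it should be done in exactly one place, and the spec should name that place.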

The second part: How can a parser know where the second part begins? By reading back from the end of the file, using <number of documents> etc.?
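
One common way to answer this is a fixed-size footer that stores the byte offset of the later part. The sketch below shows that idea on a simplified file with only two parts. The footer layout (a single 8-byte little-endian offset) and both function names are purely hypothetical and are not taken from the current format.

```python
import io
import struct

# Hypothetical footer: one unsigned 64-bit little-endian integer holding
# the byte offset at which the second part starts.
FOOTER = struct.Struct("<Q")


def write_db(part1: bytes, part2: bytes) -> bytes:
    """Concatenate both parts and append the footer."""
    return part1 + part2 + FOOTER.pack(len(part1))


def read_second_part(f) -> bytes:
    """Locate the second part by reading the footer from the end of the file."""
    f.seek(-FOOTER.size, io.SEEK_END)
    footer_pos = f.tell()
    (offset,) = FOOTER.unpack(f.read(FOOTER.size))
    f.seek(offset)
    # Everything between the recorded offset and the footer is part two.
    return f.read(footer_pos - offset)
```

If the format instead relies on <number of documents> and fixed record sizes to compute the offset, the spec should spell out that arithmetic.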

"The third part consists of an array of the n document ids.": Again, I do not understand what you mean by document ID. Doesn't the first document have ID 1, the second ID 2, and so on?

<collection_name>.hybrid

"It's great": This should suffice as a description of a data structure.