Diff for "completesearch/DocumentFormats/Discussion"

Differences between revisions 12 and 13

General remarks

I suppose that the documents are created in exactly the same order as their headings appear on CompleteSearch/DocumentFormats. Correct?

Holger: Yes and no. Everything that concerns the excerpt generator (everything with docs in its name) is in the order you describe. But the .words file, for example, is already produced by the parser right away, even though it belongs to the index.

The raw data

"The raw data may be anything: [...] It is the job of the parser to process the raw data and produce the following files". If the raw data really may be anything, then the parser must be able to parse anything. This is quite impossible.

Holger: What I wrote there was meant more as a placeholder.

<collection_name>.vocabulary

<word written in ASCII>: What to do with words that have non-ASCII chars in the input document?

Holger: Actually, a word may be any sequence of non-whitespace characters. In particular, this allows arbitrary UTF-8.
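A minimal sketch of this definition, assuming that "whitespace" means whatever Python's str.split() treats as whitespace. The function name tokenize is made up for illustration and is not part of CompleteSearch.

```python
def tokenize(text: str) -> list[str]:
    """Split text into words: maximal runs of non-whitespace characters."""
    # str.split() without an argument splits on runs of whitespace
    # and never returns empty strings.
    return text.split()


# Non-ASCII characters need no special treatment under this definition.
print(tokenize("Größe  naïve\tcafé x"))
```

Under this reading, punctuation attached to a word stays part of the word; whether the real parser separates it is not stated on this page.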

Terms "word" and "non-word": Please give an exact definition.

Holger: Ah, you have a good point here. My first reaction was to write: just like in the CTS paper. But then I realized that we also have a third category, and that is: things which one would call words but which for whatever reason are not indexed. Two examples of this are stopwords (very frequent words which would only blow up the index, like "the") and very short words (one of the input parameters to the parser is usually a minimum word length, typically taken to be 1). So we first need to discuss what to do with the non-indexed words. Give them an id, just like the indexed words? Give them their own run of ids (starting from one)? Do something special, e.g., an escape sequence?

Joachim: If words such as "the" are not indexed, how can you reconstruct a sentence from frequency IDs? Not being indexed means not appearing in the vocabulary!
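The concern can be made concrete with a small sketch. It assumes the simplest reading, in which non-indexed words get no id at all; the stopword list, the function names, and the id assignment are all hypothetical and only serve the illustration.

```python
STOPWORDS = {"the", "a", "of"}  # hypothetical stopword list
MIN_WORD_LENGTH = 1             # the parser parameter mentioned above


def is_indexed(word: str) -> bool:
    """A word is indexed if it is long enough and not a stopword."""
    return len(word) >= MIN_WORD_LENGTH and word.lower() not in STOPWORDS


def encode(words, vocabulary):
    """Replace each indexed word by its id; non-indexed words are dropped."""
    return [vocabulary[w] for w in words if is_indexed(w)]


def decode(ids, vocabulary):
    """Map ids back to words using the inverted vocabulary."""
    id_to_word = {i: w for w, i in vocabulary.items()}
    return [id_to_word[i] for i in ids]


words = "the cat sat on the mat".split()
# Assumed id assignment: indexed words in sorted order, ids starting from one.
vocabulary = {w: i for i, w in enumerate(sorted(set(filter(is_indexed, words))), start=1)}

restored = decode(encode(words, vocabulary), vocabulary)
print(restored)  # both occurrences of "the" are missing
```

Any of the three options Holger lists (shared ids, a separate run of ids, an escape sequence) would make the round trip lossless; the sketch only shows why one of them is needed.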

<collection_name>.docs (TODO: rename to <collection_name>.docs_ascii)

I've put the TODO onto the TODO page of the wiki. OK? Shall we eliminate it here?

Holger: No, don't remove it there, it's a reminder for me.

How do we compute the doc id? How does it differ from the line number?

Holger: It's up to the parser what doc id it assigns to the documents. For Wikipedia, for example, we take it to be the original Wikipedia id (what is between the first <id>...</id> pair). The doc ids also have implications for the ranking: if two documents get the same score by the internal scoring mechanism (which happens quite frequently, because scores are discrete small numbers; for example, for a 1-word query, two documents which both contain the query word once in the title get the same score), the document with the smaller id gets ranked higher. We make use of this for a number of collections. For example, for one collection of emails, the document ids are given in the order of date (the lower the id, the more recent the email).
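The tie-breaking rule described here can be sketched as a sort with a compound key. This is only an illustration of the rule as stated, not the actual CompleteSearch ranking code; the function name rank and the (doc_id, score) pair layout are assumptions.

```python
def rank(hits):
    """Order (doc_id, score) pairs: higher score first, smaller doc id on ties."""
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


# Documents 3 and 7 have the same score, so the smaller id (3) ranks higher.
hits = [(7, 2), (3, 2), (5, 3)]
print(rank(hits))
```

With ids assigned in date order as in the email example, equal-score hits would then come out most recent first.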

How do we compute the title? What chars are allowed?

Holger: Again, up to the parser. Whatever it puts there is displayed as title of the document. For Wikipedia, we of course take the document title. For an archive of email messages, we take the subject of the mail + information about the sender etc. So it really depends on the collection. Note that for each collection, we write its own parser (not from scratch of course, but derived from a generic one).

H: Is this the textual content of a page, that is, with all (HTML) markup removed? Or is it guaranteed that <page> elements in the input have no further subelements? I think it should be specified in the section on the parser what exactly it has to do to produce H. Put differently: what is the format of H?

Holger: Again, up to the parser. Whatever it produces will be taken as the text of the document. Just two things are important: <, >, and & should be escaped as for the text inside an XML element (&lt; etc.). And there should be no whitespace other than an ordinary space; in particular, no newline.

Joachim: The escaping of <, >, and & seems artificial to me: Is H meant to be interpreted as an XML document somewhere? If so, the escaping can also be computed on the fly.
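The two constraints on H (XML-style escaping, single spaces only) can be sketched in a few lines, whether they are applied by the parser or on the fly. The function name normalize_text is made up for this example, and collapsing every run of whitespace into one space is an assumption; the page only says that nothing but ordinary spaces may remain.

```python
from xml.sax.saxutils import escape


def normalize_text(raw: str) -> str:
    """Produce document text with single spaces only and &, <, > escaped."""
    # Collapse every run of whitespace (including newlines and tabs) to one space.
    single_spaced = " ".join(raw.split())
    # escape() replaces & with &amp;, < with &lt;, and > with &gt;.
    return escape(single_spaced)


print(normalize_text("a < b &\n c\t> d"))
```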

<collection_name>.docs.DB

"The last part is derived as follows from the id-encoded text from <collection_name>.docs: Each id is converted to the corresponding frequency rank..." You mean: The last part is derived from the <collection_name>.docs.binary (where each ID already is the frequency rank).

The second part: How can a parser know where the second part begins? Reading back from the end of file, using <number of documents> etc.?

Holger: Yes, exactly!

"The third part consists of an array of the n document ids.": Again, I do not understand what you mean by document ID. Doesn't the first document have the ID 1, the second the ID 2, and so on?

Holger: Not necessarily, see above.

<collection_name>.hybrid

"It's great": This should suffice as a description of a data structure.

Holger: Yes, it says everything.

CompleteSearch: completesearch/DocumentFormats/Discussion (last edited 2011-07-28 21:59:06 by p57B0BAF6)

Revision 12 as of 2007-08-27 12:47:31 (size: 4989; editor: mpiat1403; comment: Question about not indexing some words)
Revision 13 as of 2007-08-27 12:58:08 (size: 5174; editor: mpiat1403; comment: Question about escaping of XML meta chars)