completesearch/ExcerptGenerator

The following is a list of requirements to the excerpt generator. This list is still under construction.

Currently, everything is listed which came to my (Holger's) mind. Not all of these might be necessary.

The material given on this page led to the design and implementation of the excerpt generator's central function, which is described on the subpage [wiki:Self/CompleteSearch/ExcerptGenerator/CentralFunction CentralFunction].

Example document

Many of the requirements in the following will be explained by example. They all refer to the following example document, which is in fact the first few lines from the English Wikipedia article on The Beatles.

The Beatles

The Beatles were an English musical group from Liverpool whose members were John Lennon, Paul McCartney, George Harrison, and Ringo Starr. They are one of the most commercially successful and critically acclaimed bands in the history of popular music. The Beatles are the best-selling musical act of all time in the United States of America, according to the Recording Industry Association of America, which certified them as the highest selling band of all time based on American sales of singles and albums.

One query word

Query: beatles

Excerpt: The <hl1>Beatles</hl1> were an English musical group from Liverpool ... The <hl1>Beatles</hl1> are the best-selling musicial act of all time in the United States of America ...

The name of the highlight tag, hl1 in this case, should, of course be a parameter.

The ... which separate different snippets should be a parameter, too. Note that there are no ... in the beginning, because these are the very first words of the document.

The unit of what is shown is a sentence. That is, we need some mechanism to recognize sentence boundaries. A simple heuristic would be to take every full stop or every line break as the end of a sentence. However this is done, it should be in a separate module and easily exchangeable.

Q: What if a sentence consists of very many words, such as the above "The Beatles are the best-selling musical act of all..."? How do we know when we should stop with three "..."?

A: Just have as one of the parameters the maximal number of words to be displayed for one part. And maybe also a maximal number of characters to avoid the (pathological) case of few veeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeery long words.

C: Your example suggests that you stop at the comma of "in the United States of America,". Therefore my question.

Alternatively, one could just display r words to the left and right of each occurrence of the query word, for some parameter r.

C: Then we should definitely use a compression scheme such as the CTS, which basically searches through lists of integers. Parsing back plain text for r words efficiently is very tedious to program.

C: The paper says that the sentence should be the minimal unit for extraction and presentation to the user and cites a corresp. paper.

If a query word occurs many times in a document, we only want to see the first few occurrences. That is, there should be a parameter for the maximal number of parts (separated by ...) to be shown in the excerpt. In the example above, two parts are shown. If the parameter were 1, only one of the sentences should be shown.

C: We should distinguish between the marker "..." (which indicates that we have abbreviated a longer sentence) and a marker "XXX" that distinguishes between snippets. We should consistently use a common vocabulary in the requirements and all other documents we produce, that is, we should clearly define the terms "sentence", "snippet", "word", "document", "token", "part" etc. (Concerning this, the paper is a good example of a Babylonian confusion.)

When there are more occurrences than can be show, we need a rule to determine which parts are to be preferred.

Q: What about sentence reordering as they do it in the paper?

A: Yes, the heuristics summarized in their Figure 2 make a lot of sense.

C: But Figure 2 gives an algorithm w.r.t. one or more given query terms. Sentence reordering is done without respect to any given query term - at least as I (Joachim) understand the paper.)

More than one query words

Query: beatles music*

Q: Is this an implicit AND or an OR? Does it refer to a sentence where the terms have to appear or to the document as a whole? This also leads to the question of what the input to the generator is: a single document (which is a priori known to match at least one of the search terms) or a set of documents?

Q: Shouldn't we first consider the easier case of mutliple words without wildcards?

A: I did not find it necessary to explain this in two steps.

Excerpt: The <hl1>Beatles</hl1> were an English <hl2>musical</hl2> group from Liverpool ... The <hl1>Beatles</hl1> are the best-selling <hl2>musicial</hl2> act of all time in the United States of America ...

Note the use of the wild-card in music* (with the obvious semantics that it should match every word starting with music).

Q: How do you do regular expression matching when a word is represented by an integer?

A: Not clear yet, just as for the semantic match (musician matching John Lennon) below.

Q: Do you allow only simple wildcards (prefix*) or do you allow full regular expressions with arbitrary complexity? Something in between? Define the term query term!

Note that the tag names are different for the two query words, hl1 for the first query word, and hl2 for the second.

Note that the first part is only shown once, although more than one query words match. It would be a mistake to produce: The <hl1>Beatles</hl1> were an English musical group from Liverpool ... The Beatles were an English <hl2>musical</hl2> group from Liverpool ... (same part twice, with a different query word highlighted earch time)

Same for more than two query words (no new phenomenons arise).

Non-syntactic matches

Query: musician

Excerpt: The Beatles were an English musical group from Liverpool whose members were <h1l>John Lennon</h1l>, <h1l>Paul McCartney</h1l>, <h1l>George Harrison</h1l>, and <h1l>Ringo Starr</h1l>.

Note that what is to be found and highlighted does not match the query literally, but only via some indirection. It is not yet clear and up to debate how the information about this indirection (e.g., that the occurrence of John Lennon in the sentence above should match musician) is to be stored.

C: This could be included into the code for the regular pattern match. It is just a special case of testing whether a word (an integer) belongs to a final set of words (of integers). It is a special case of a regular expression (one that defines a finite language).

THIS IS ONE OF THE ESSENTIAL REQUIREMENTS

Q: Really??? I wonder what the overall goal is...

Avoid parsing the document at query time

For efficiency reasons, parsing of large parts of the document at query time should be avoided. (Even simple parsing of a 10-page PDF takes considerable time, and would be orders of magnitude slower than the actual query processing, which determined which document to display as hits.)
Q: By "parsing" you mean parsing the character stream of the original document? Then this again cries for something like CTS. Alternatively, a dictionary could be precomputed that gives for every possible query term its position(s) in the document(s), but porobalby this dictionary would become too huge.
It is not yet clear and up to debate how we can avoid this.

Take advantage of pre-filtering

Our system could be easily modified in a way that it could pass the excerpt generator a relatively small list of candidate positions for matches. For example, the Excerpt Generator could be called with the additional information that it only needs to consider the sentences 15, 34, and 173 of the document.
Like this maybe parsing at excerpt generation time (now for a small portion of the document) becomes an option again.
Q: Please define for me (Joachim) the terms "system" and, again, "parsing".
As a rule of thumb, THE EXCERPT GENERATION OF EVEN A LARGE DOCUMENT SHOULD TAKE NO LONGER THAN 1 MILLISECOND

CompleteSearch: completesearch/ExcerptGenerator (last edited 2007-10-23 15:58:11 by mpiat1403)