Diff for "completesearch/ExcerptGenerator"

Differences between revisions 10 and 11

The following is a list of requirements to the excerpt generator. This list is still under construction.

Currently, everything is listed which came to my (Holger's) mind. Not all of these might be necessary.

Example document

Many of the requirements in the following will be explained by example. They all refer to the following example document, which is in fact the first few lines from the English Wikipedia article on The Beatles.

The Beatles

The Beatles were an English musical group from Liverpool whose members were John Lennon, Paul McCartney, George Harrison, and Ringo Starr. They are one of the most commercially successful and critically acclaimed bands in the history of popular music. The Beatles are the best-selling musical act of all time in the United States of America, according to the Recording Industry Association of America, which certified them as the highest selling band of all time based on American sales of singles and albums.

One query word

Query: beatles

Excerpt: The <hl1>Beatles</hl1> were an English musical group from Liverpool ... The <hl1>Beatles</hl1> are the best-selling musicial act of all time in the United States of America ...

The name of the highlight tag, hl1 in this case, should, of course be a parameter.

The ... which separate different snippets should be a parameter, too. Note that there are no ... in the beginning, because these are the very first words of the document.

The unit of what is shown is a sentence. That is, we need some mechanism to recognize sentence boundaries. A simple heuristic would be to take every full stop or every line break as the end of a sentence. However this is done, it should be in a separate module and easily exchangeable.

Q: What if a sentence consists of very many words, such as the above "The Beatles are the best-selling musical act of all..."? How do we know when we should stop with three "..."?

Alternatively, one could just display r words to the left and right of each occurrence of the query word, for some parameter r.

C: Then we should defnitely use a compression scheme such as the CTS, which basically searches through lists of integers. Parsing back plain text for r words efficiently is very tedious to program.

C: The paper says that the sentence should be the minimal unit for extraction and presentation to the user and cites a corresp. paper.

If a query word occurs many times in a document, we only want to see the first few occurrences. That is, there should be a parameter for the maximal number of parts (separated by ...) to be shown in the excerpt. In the example above, two parts are shown. If the parameter were 1, only one of the sentences should be shown.

When there are more occurrences than can be show, we need a rule to determine which parts are to be preferred.

More than one query words

Query: beatles music*

Excerpt: The <hl1>Beatles</hl1> were an English <hl2>musical</hl2> group from Liverpool ... The <hl1>Beatles</hl1> are the best-selling <hl2>musicial</hl2> act of all time in the United States of America ...

Note the use of the wild-card in music* (with the obvious semantics that it should match every word starting with music).

Note that the tag names are different for the two query words, hl1 for the first query word, and hl2 for the second.

Note that the first part is only shown once, although more than one query words match. It would be a mistake to produce: The <hl1>Beatles</hl1> were an English musical group from Liverpool ... The Beatles were an English <hl2>musical</hl2> group from Liverpool ... (same part twice, with a different query word highlighted earch time)

Same for more than two query words (no new phenomenons arise).

Non-syntactic matches

Query: musician

Excerpt: The Beatles were an English musical group from Liverpool whose members were <h1l>John Lennon</h1l>, <h1l>Paul McCartney</h1l>, <h1l>George Harrison</h1l>, and <h1l>Ringo Starr</h1l>.

Note that what is to be found and highlighted does not match the query literally, but only via some indirection. It is not yet clear and up to debate how the information about this indirection (e.g., that the occurrence of John Lennon in the sentence above should match musician) is to be stored.

THIS IS ONE OF THE ESSENTIAL REQUIREMENTS

Avoid parsing the document at query time

For efficiency reasons, parsing of large parts of the document at query time should be avoided. (Even simple parsing of a 10-page PDF takes considerable time, and would be orders of magnitude slower than the actual query processing, which determined which document to display as hits.)
It is not yet clear and up to debate how we can avoid this.

Take advantage of pre-filtering

Our system could be easily modified in a way that it could pass the excerpt generator a relatively small list of candidate positions for matches. For example, the Excerpt Generator could be called with the additional information that it only needs to consider the sentences 15, 34, and 173 of the document.
Like this maybe parsing at excerpt generation time (now for a small portion of the document) becomes an option again.
As a rule of thumb, THE EXCERPT GENERATION OF EVEN A LARGE DOCUMENT SHOULD TAKE NO LONGER THAN 1 MILLISECOND

CompleteSearch: completesearch/ExcerptGenerator (last edited 2007-10-23 15:58:11 by mpiat1403)

-  ⇤ ← Revision 10 as of 2007-08-21 09:57:35 → 
  Size: 4926
  Editor: infno1613
  Comment:
+   ← Revision 11 as of 2007-08-21 13:03:49 → ⇥
  Size: 5469
  Editor: mpiat1403
  Comment: Added question and comments
-Deletions are marked like this.
+Additions are marked like this.
 Line 25:
+'''Q: What if a sentence consists of very many words, such as the above "The Beatles are the best-selling musical act of all..."? How do we know when we should stop with three "..."?'''
-Line 26:
+Line 28:
+'''C: Then we should defnitely use a compression scheme such as the CTS, which basically searches through lists of integers. Parsing back plain text for r words efficiently is very tedious to program. '''

'''C: The paper says that the sentence should be the minimal unit for extraction and presentation to the user and cites a corresp. paper.'''