This page describes the central function of the [:../:ExcerptGenerator]. It first gives a detailed requirement specification, together with a running example. Then it describes the implementation.

Requirements

Anchor(Terminology)

Terminology and basic task

A document D trivially consists of positions

D = (0,1,2,3,4,5,6,7,8,9,10,...).

Each position holds the code of a word or a non-word. By a lookup in a dictionary, each position can be mapped to its corresponding word or non-word, so that the document can easily be (re-)constructed from its positions. The following algorithmic description mostly deals with positions, that is, with integers, not with words (strings). It suffices to bear in mind that the positions can easily be translated into the corresponding words if needed. Conversely, each word or non-word gives rise to a list of positions: the positions at which this word occurs in the document.

Each document is (conceptually) divided into segments, that is, into intervals of positions. For example, the segments may be the sentences of the document. The two extreme cases of segmentation are:

The whole document is one single segment.
Each word is a segment of its own.

For example, D could be segmented at the segment bounds

SB = (0,5,10,15,20,25,...),

Each segment bound (which is a position) starts a new segment, that is, the document D is segmented into the segments

S = ([0,4],[5,9],[10,14],[15,19],[20,24],...).

A list L of positions is called a position list. In general, a position list describes a set of words, each being located in a (not necessarily different) segment of the document. Such a segment is said to match the position (i.e., the word) and conversely the position (the word) is said to match the segment.

For example, with the position list

L₀ = (6,7,12)

the matching segments are

([5,9],[10,14]).

The basic task of the central function is the following: Given the segmentation S of the document and some position lists L₀, L₁,..., the function computes all segments that match at least one of the positions in one of the L_i.

For example, with the additional position lists

L₁ = (8,11,21)

L₂ = (5,10,22)

the matching segments are:

([5,9],[10,14],[20,24]).

For each segment matching a position in one of the L_i, the function returns all words the segment consists of. If desired, the matching positions (i.e., the words) are highlighted. If more than one matching position is contained in a segment, this segment is not returned twice. Rather, with highlighting enabled, the matching positions in this segment coming from a different position list are given a different highlighting.

In the running example, with highlighting enabled and * being the highlighting for L₀, + for L₁, and $ for L₂, the function would return the following parts (of course, words are returned, not integers):

($5$,*6*,*7*,+8+,9), ($10$,+11+,*12*,13,14), (20,+21+,$22$,23,24).

So a part is defined to be the positions (words) of a matching segment, maybe augmented by highlighting markup.

The final output of the central function is a (string) concatenation of all parts computed, divided by a separator from one another (e.g., "..."). This output is called the excerpt.

In the example, the excerpt returned would be

$5$ *6* *7* +8+ 9 ... $10$ +11+ *12* 13 14 ... 20 +21+ $22$ 23 24

Input

The current implementation splits the central function into two functions (see below): one that computes the positions of the matching segments together with highlighting information, and one (based on top of the first function) that builds the excerpt string from these positions and from the highlighting markup. Some of the following parameters are parameters of the first function, some of the second.

Remark: Some of the following parameters should not be passed as arguments to the function, but rather be members of the object representing an [:../:ExcerptGenerator].

A document D (type Document).
A list of position lists (type vector<vector<Position>>).
The radius (type unsigned int): specifies how many segments around a matching segment should additionally go into the corresponding part. The default is 0, that is, only the matching segment. With radius 1, the segments one to the left and one to the right become also part of the part etc.
maxMatchingSegmentsOutput (type int): the maximum number of output segments that contain at least one matching position. If set to -1, all matching segments are output.
maxPartSize (type unsigned int): the maximum size of a part, measured in characters; has priority over the radius; see below.
partsSeparator (type string): the separator string dividing the parts of the excerpt.
doHighlighting (type bool): specifies whether the words at the matching positions shall be highlighted.
highlighingTags (type vector<pair<string, string>>): a list of pairs of start and end tags for highlighting markup, such as the pair (<b style="color:red">,</b>); used to highlight words on matching positions; each position list is assigned a new pair from highlightingTags; if there are more position lists than the vector has elements, the highlighting markup is cyclically repeated.

Output

For the output, the excerpt is computed as explained in the section [#Terminology Terminology and basic task], with the following additional requirements:

Parts must be augmented to the left and to the right according to the value of the radius.
Adjacent parts should not be separated with the partsSeparator. Thus, in the example, the output should be $5$,*6*,*7*,+8+,9,$10$,+11+,*12*,13,14...20,+21+,$22$,23,24

The overall output then is the title of the document followed by the excerpt.

Here are some additional requirements that we do not know yet whether and how to fulfill:

The excerpt contains up to maxMatchingSegmentsOutput matching segments. The parts should contain positions from the position lists in shares as equal as possible. In particular, if maxMatchingSegmentsOutput is larger than the number of position lists, for each list at least one part with a position in this list should go into the excerpt. If the matching segments cannot be distributed in same shares to the position lists, lists with smaller index should have priority.
When constructing the excerpt, parts located near the start of the document have priority over parts located near the end.
If maxPartSize is less than the part computed according to the radius (this may occur even with radius=0, i.e., the matching segment itself is too long), a shorter string shall be output instead of the part, computed as follows: Compute the substring of the part that begins with the leftmost matching position (word) and ends with the rightmost matching position. If this substring is still longer than maxPartSize allows, output it anyway. If it is shorter, expand it to the left and to the right by as many characters as maxPartSize allows in addition.

Implementation

The implementation is split into several functions.

Computing the matching segment counters nms[ ]

To achieve the goal of outputting segments from the different lists "in shares as equal as possible", we first count for each position list i how many segments contain a position from this list. We denote this number of matching segments by nms[i]. The computation of nms[] is carried out by the following algorithm:

m = maxMatchingSegmentsOutput
k = number of position lists
[left,right) = current segment, initialized to [0,first segment bound)

for i = 0, ..., k-1:
  nms[i] = 0
  insert minimal position pos from list i into PQ: PQ.push(<pos,i>)

while(PQ not empty) {
  <pos,poslist> = PQ.pop();

  Shift [left,right) until it contains pos

  nms[poslist]++
  if(nms[poslist] == m)
     continue; // do not count more than m segments for each list

  Search in list poslist for the smallest position posn not in the current segment
  if(posn exists)
    PQ.push(<posn,poslist>)
}

Note that the number of PQ operations is <= 2 * k * m.

Computing the shares soc[ ]

To compute how many segments soc[i] (segment output counter) we must (at least) output for list i (i.e., how many segments we must output that contain at least one position from list i), we feed the nms[] values into the following portioning algorithm as the capacity values c[]. The algorithm will distribute m segments onto the k lists, yielding counters soc[i], for i=0, ..., k-1. Note that by using a stable sort subroutine, this algorithms also fulfills the above constraint "If the matching segments cannot be distributed in same shares to the position lists, lists with smaller index should have priority".

Portioning algorithm specification

The meaning of the above requirement "in shares as equal as possible" is as follows. We give an algorithmic description of how to achieve this well-balanced portioning. Consider the problem of filling n buckets, with bucket i having a capacity of c[i] atomic units, with m units of something in shares as equal as possible.

Input
-----
- n buckets (bucket 0,...,n-1)
- For each bucket: an integer capacity c[i] of (atomic) units (of CDs, stones,...) 
  that can be filled into the bucket
- m units to portion into the n buckets *in shares as equal as possible*

Output
------
- For each bucket, the amount a[i] of units filled (=portioned) into it

Algorithm
---------
Stable-sort the buckets in increasing order of their capacity values c[i]

while( m!=0 ) { // while there are units left to be portioned

 i = index of the leftmost bucket with a[i]<c[i] 
     // smallest non-saturated bucket

 n' = n - i // number of non-saturated buckets

 if(n' == 0) {problem unsolvable; stop;} /* original m exceeds sum of
all capacities; fill all buckets with highest possible value of c[i] units */

 min = c[i] - a[i] // units that can still be filled into bucket i

 if(n' * min <= m) {
   // fill min units into every non-saturated bucket
   a[j] += min for j = i,...,n-1
   m = m - n' * min
 }
 else {
  // fill int(m/n') into every non-saturated bucket
  a[j] += int(m/n') for j = i,...,n-1
  // fill remaining units into the first m%n' buckets, 1 unit per bucket
  a[j] = a[j] + 1 for j = i,..., i + (m%n') - 1
  m = 0 // stop iteration
 }
} // end of main loop

The central function

On first reading, skip the comments // if radius:. If the radius is to be output, the comments have to be turned into lines of code, at exactly the place where they appear. (Read the following subsection for an explanation of how to treat the radius.)

m = maxMatchingSegmentsOutput
k = number of position lists
[left,right) = [-1,-1) = the current segment
docpos = -1 // current position in the document, only advanced, never moved backwards

Compute the nms[] values as described above
Compute the soc[] values from nms[] and m by the portioning algorithm as described above

Create 2 priority queues: PQA and PQ. Each queue holds pairs <position, list that position comes from>

Create iterators pq_it[i] and pqa_it[i], for i =0, ..., k-1, to step through list i
Use these iterators when searching for positions to insert into PQ and PQA, resp.
These iterators are only advanced, never moved backwards

for i = 0, ..., k-1:
  if( soc[i] > 0 ) // Note: here soc[i] > 0 iff nms[i] > 0 iff list i is not empty
    pos = minimal (= first) position from list i (by pqa_it[i])
    PQA.push(<pos,i>)

// if radius: init windowLeft, windowRight, and counter

while(PQA not empty) {
  <pos,poslist> = PQA.pop()

  Shift [left,right) until it contains pos
  If [left,right) is actually shifted (and this is not the very first iteration):
    Before the first shift, output the remainder of the old current segment [left,right):
    Output every position from [docpos,right)
    // if radius: decrement counter or shift windowLeft, shift windowRight
    // if radius: output all positions if left < oldWindowRight

  soc[poslist]--
  if( soc[poslist] > 0 )
    Search (by pqa_it[poslist]) for the next position posn in list poslist that is not in [left,right)
    If posn exists: PQ.push(<posn,poslist>)
  
  If pos is the first matching position in [left,right) drawn from PQA
    // i.e., if [left,right) has actually been shifted
    // We know now that the whole segment must be output
    for i = 0, ..., k-1:
      Search (by pq_it[i]) for the minimal position p from list i that is contained in [left,right)
      If p exists: PQ.push(<p,i>) // at least pos exists (in list poslist)

    docpos = left // output everything from [left,right)
    // if radius: docpos = max(windowLeft, docpos)

    while(PQ not empty) {
      <p,pl> = PQ.pop()
      Search (by pq_it[pl]) for the next position pnext from list pl contained in [left,right)
      If pnext exists: PQ.push(<pnext,pl>)
      Output all positions from docpos to p
      Output p highlighted
      docpos = p + 1
    }
}

Output the remaining positions from [docpos,right)
// if radius: Output the remaining positions from [docpos,windowRight)

Taking the radius into account

Outputting radius segments around the current segment is interwoven into the shifting and outputting operations of the [left,right) segment. It works as follows.

At each point in time, [left,right) is conceptually contained in a window [windowLeft, windowRight). windowLeft and windowRight also are segment bounds.

Let us first ignore the window filling phase.

Then [left,right) always lies centered in [windowLeft, windowRight) such that exactly radius segments of the window lie to the left and to the right of the current segment.

If [left,right) is shifted segment for segment in the main loop, also the window must be shifted.

As long as during this shifting it holds that left < oldWindowRight (oldWindowRight beeing the value before the shifting operations), the positions of the just moved current segment must be output. (This treats the right boundary of the radius.)

If a matching position in the current segment is output in the main loop, everything from windowLeft must be output. This is done by initializing docpos to windowLeft each time a first matching position is drawn from PQA (instead of initializing it with the value of left as in the above pseudocode). (This treats the left boundary of the radius.)

The window filling phase is implemented as follows:

windowLeft is initialized to 0. A counter is initialized to radius. When [left,right) is shifted to the right, the counter is decremented. After radius shifts (when the counter is 0), windowLeft also is shifted to the right.

windowRight is initialized such that it is radius segment bounds to the right of right or, if there are not enough segment bounds, it is initialized to the very last segment bound.

Note: windowRight can also be implemented by means of an integer segmentRightCount that counts how many segments to the right of [left,right) must be output while the window is shifted to the right. On every shift, this counter is decremented and the just shifted [left,right) is output until the counter becomes 0. When a new matching segment is reached, the counter is reset to radius. This is a slightly easier and faster implementation.

Current implementation

The current trial implementation can be found in the file autocompletion/trials/try-CentralFunction.cpp.

The central function is split into the functions computeExcerptPositions() and computeExcerptText().

computeExcerptPositions() takes as arguments a vector of segment bounds, a vector of position lists, the radius, and maxMatchingSegmentsOutput. It computes the positions of the excerpt according to the specification above and outputs a list of pairs <pos, posList>. Here pos is a position in the document. If pos is a matching position, posList is the number of the position list that pos comes from (to point to the highlighting tag for this list later); pos is -1 else (to indicate "no highlighting" later).

Thus, in the example from above, the output of computeExcerptPositions() would be:

<5,2>, <6,0>, <7,0>, <8,1>, <9,-1>, <10,2>, <11,1>, <12,0>, <13,-1>, <14,-1>, <20,-1>, <21,1>, <22,2>, <23,-1>, <24,-1>

The main loop of computeExcerptPositions() is driven by extracting matching positions from a priority queue: The matching positions are drawn out of a PQ one by one, then the containing matching segment is computed based on the current matching position.

computeExcerptText() takes the list computed from computeExcerptPositions() and computes the excerpt string, taking into account the parameters partsSeparator, doHighlighting, highlightingTags, and maxPartSize.

TODO: The implementation of computeExcerptText() is only partial: Up to now, it merely outputs position numbers instead of words. Thus, it cannot consider maxPartSize yet.

Implementation study and discussion

There current implementation may serve us as a starting point for discussions about the final requirements and the final implementation.

The main architectural decision is: By what is the main loop driven? There are 3 basic approaches:

By matching positions (as in the example implementation): The matching positions are drawn out of a PQ one by one, the matching segments are computed based on the current matching position.
By segment bounds: Step through the document segment for segment, test whether the current segment contains a matching position. If so, output the segment.
By positions: Step through the document by enumerating all positions 1,2,3,4... For each position, test whether it is a matching position. If so, proceed as in step 1.

We think that approach 1, which is current, is the most efficient.

It is not fully clear whether to insist on and how to implement the following requirements, and what the best approach from 1,2,3 above is to implement them:

Considering maxPartSize

Original requirement

Quoting the original requirement from above: If maxPartSize is less than the part computed according to the radius (this may occur even with radius=0, i.e., the matching segment itself is too long), a shorter string shall be output instead of the part, computed as follows: Compute the substring of the part that begins with the leftmost matching position (word) and ends with the rightmost matching position. If this substring is still longer than maxPartSize allows, output it anyway. If it is shorter, expand it to the left and to the right by as many characters as maxPartSize allows in addition.

How can we do this?

Compute the part, taking into account the radius, as described above.
Turn the part into a string by looking up the words in the dictionary, yielding a string with a certain length l (number of chars).
The case of interest is that maxPartSize < l. If so:
During computing the part, keep an index to the leftmost and the rightmost matching position in a part.
Compute the substring of the part that begins with the leftmost matching position (word) and ends with the rightmost matching position by looking up words in a dictionary, yielding a string of length m.
If m >= maxPartSize, output the substring anyway. If not:
Get the word at the position 1 to the left of the leftmost matching position. Add it to the part at the left. If now the character length of part is still < maxPartSize:
Get the word at the position 1 to the right of the rightmost matching position...
And so on: Expand the part alternately at the left and at the right by the respective next word until the length of the part becomes >= maxPartSize.

Is this a good and efficient approach? I don't know. How does it go with one of the basic approaches from above? I don't know.

If m >= maxPartSize, output the substring anyway: Is this really desired? m could be very large (if the segments are very large; the extreme case being the document being one single segment).

The above definition of how to output parts with at most maxPartSize characters seems hard to be implemented efficiently. Moreover, it would imply that, by cutting at character bounds, only susbtrings of words could be output at each end of the part. Is this desirable?

Alternative requirement

Another approach is to define maxPartSize as the maximum size of a part, measured in words. I (Joachim) think that this is a better definition:

It is "more natural" to output only words, and no substrings of words.
The average size of a word (in non-degenerate cases) is surely < 12 chars.
Computing in terms of words is much faster: There is no lookup in a dictionary, no summing up of character lengths.

With this definition, let L₀=(11,13,16), maxPartSize=9. Then the first parts that are output are

7,8,9)(10,*11*,12,*13*,14)(15

How do we get this? Assume the implementation is matching-position-driven. Then

Pop 11 from the PQ
Go to the left 4 words (4 = int(9/2) )
Output everything from this bound to 11
Output *11* (highlighted)
Go to the right 4 words and output everything. While going to the right, pop 13 from the PQ; output it highlighted.
"Going to the left and right" means stepping by single positions, consuming all matching positions from the PQ that are located in segments that you step through.

In this approach, the concept of segment bounds (and radius) is completely irrelevant unless the rule is to always output whole segment. If so, we could, after the 4 steps to the left and to the right, go further to the left and to the right until we reach the next segment bound. This still would make the concept of the radius irrelevant.

CompleteSearch: completesearch/ExcerptGenerator/CentralFunction (last edited 2007-11-09 15:42:31 by mpiat1403)