This page describes the central function of the [:../:ExcerptGenerator]. It first gives a detailed requirement specification, together with a running example. Then it describes the implementation.

Requirements

Anchor(Terminology)

Terminology and basic task

A document D trivially consists of positions

D = (0,1,2,3,4,5,6,7,8,9,10,...).

Each position holds the code of a word or a non-word. By a lookup in a dictionary, each position can be mapped to its corresponding word or non-word, so that the document can easily be (re-)constructed from its positions. The following algorithmic description mostly deals with positions, that is, with integers, not with words (strings). It suffices to bear in mind that the positions can easily be translated into the corresponding words if needed. Conversely, each word or non-word gives rise to a list of positions: the positions at which this word occurs in the document.

Each document is (conceptually) divided into segments, that is, into intervals of positions. For example, the segments may be the sentences of the document. The two extreme cases of segmentation are:

  1. The whole document is one single segment.
  2. Each word is a segment of its own.

For example, D could be segmented at the segment bounds

SB = (0,5,10,15,20,25,...),

Each segment bound (which is a position) starts a new segment, that is, the document D is segmented into the segments

S = ([0,4],[5,9],[10,14],[15,19],[20,24],...).

A list L of positions is called a position list. In general, a position list describes a set of words, each being located in a (not necessarily different) segment of the document. Such a segment is said to match the position (i.e., the word) and conversely the position (the word) is said to match the segment.

For example, with the position list

L0 = (6,7,12)

the matching segments are

([5,9],[10,14]).

The basic task of the central function is the following: Given the segmentation S of the document and some position lists L0, L1,..., the function computes all segments that match at least one of the positions in one of the Li.

For example, with the additional position lists

L1 = (8,11,21)

L2 = (5,10,22)

the matching segments are:

([5,9],[10,14],[20,24]).

For each segment matching a position in one of the Li, the function returns all words the segment consists of. If desired, the matching positions (i.e., the words) are highlighted. If more than one matching position is contained in a segment, this segment is not returned twice. Rather, with highlighting enabled, the matching positions in this segment coming from a different position list are given a different highlighting.

In the running example, with highlighting enabled and * being the highlighting for L0, + for L1, and $ for L2, the function would return the following parts (of course, words are returned, not integers):

($5$,*6*,*7*,+8+,9), ($10$,+11+,*12*,13,14), (20,+21+,$22$,23,24).

So a part is defined to be the positions (words) of a matching segment, maybe augmented by highlighting markup.

The final output of the central function is a (string) concatenation of all parts computed, divided by a separator from one another (e.g., "..."). This output is called the excerpt.

In the example, the excerpt returned would be

$5$ *6* *7* +8+ 9 ... $10$ +11+ *12* 13 14 ... 20 +21+ $22$ 23 24

Input

The current implementation splits the central function into two functions (see below): one that computes the positions of the matching segments togehther with highlighting information, and one (based on top of the first function) that builds the excerpt string from these positions and from the highlighting markup. Some of the following parameters are parameters of the first function, some of the second.

Remark: Some of the following parameters should not be passed as arguments to the function, but rather be members of the object representing an [:../:ExcerptGenerator].

Output

For the output, the excerpt is computed as explained in the section [#Terminology Terminology and basic task], with the following additional requirements:

The overall output then is the title of the document followed by the excerpt.

Here are some additional requirements that we do not know yet whether and how to fulfill:

Implementation

The current trial implementation can be found in the file autocompletion/trials/try-CentralFunction.cpp.

The central function is split into the function computeExcerptPositions() and computeExcerptText().

computeExcerptPositions() takes as arguments a vector of segment bounds, a vector of position lists, and the radius. It computes the positions of the excerpt according to the specification above and outputs a list of pairs <pos, posList>. Here pos is a position in the document. If pos is a matching position, posList is the number of the position list that pos comes from (to point to the highlighting tag for this list later); pos is -1 else.

Thus, in the example from above, the output of computeExcerptPositions() would be:

<5,2>, <6,0>, <7,0>, <8,1>, <9,-1>, <10,2>, <11,1>, <12,0>, <13,-1>, <14,-1>, <20,-1>, <21,1>, <22,2>, <23,-1>, <24,-1>

computeExcerptText() takes the list computed from computeExcerptPositions() and computes the excerpt string, taking into account the parameters partsSeparator, doHighlighting, highlightingTags, and maxPartSize.

TODO: Implement computeExcerptText()

Implementation study and discussion

There is an example implementation at /KM/usr/ziegler/ExGen/CentralFunction/main.cpp, which may serve us as a starting point for dicussions about the final requirements and the final implementation. It computes the above example output.

The main architectural decision is: By what is the main loop driven? There are 3 basic approaches:

  1. By matching positions (as in the example implementation): The matching positions are drawn out of a PQ one by one, the matching segments are computed based on the current matching position.
  2. By segment bounds: Step through the document segment for segment, test whether the current segment contains a matching postion. If so, output the segment.
  3. By positions: Step through the document by enumerating all positions 1,2,3,4... For each position, test whether it is a matching position. If so, proceed as in step 1.

Which of these approaches is the most efficient is not fully clear.

Moreover, it is not fully clear whether to insist on and how to implement the following requirements, and what the best approach from 1,2,3 above is to implement them:

Considering the radius

Let SB = (0,5,10,15,20,25,...) and L0=(11,13,16). Additionally, let the radius be 1. Then the parts computed should read

(5,6,7,8,9)(10,*11*,12,*13*,14)(15,*16*,17,18,19)

How do we get this? Assume the implementation is matching-position-driven (as in the example implementation). Then

Now let L0=(11,13,16,21). When 21 is drawn from the PQ, with radius=1, the above processing would output the first part (containing position 16) again:

(15,*16*,17,18,19)(20,*21*,22,23,24)(25,26,27,28,29)

Is this to be avoided? If so, how? What shall be output in this case? Moreover, since we have already popped 16 from the PQ, we wouldn't recognize it as a matching position again, so that there would be no highlighting:

(15,16,17,18,19)(20,*21*,22,23,24)(25,26,27,28,29)

Basically, when going to the left and working with a PQ, we never recognize matching positions again because we have already popped them from the PQ. Is this a problem?

Is the overall approach to scan once from left to right (with or without a PQ) feasible to meet all additional requirements? Should we introduce a data structure that tells us "this is a matching position"? This could give rise to a completely different approach: "Slide a window" of maxPartSize positions over the document. Output the window when it contains enough matching positions, or when a matching position is falling off at the left end of the window (if it is the only matching position in the window, recenter the window around that position).

Considering maxPartSize

Original requirement

Quoting the original requirement from above: If maxPartSize is less than the part computed according to the radius (this may occur even with radius=0, i.e., the matching segment itself is too long), a shorter string shall be output instead of the part, computed as follows: Compute the substring of the part that begins with the leftmost matching position (word) and ends with the rightmost matching position. If this substring is still longer than maxPartSize allows, output it anyway. If it is shorter, expand it to the left and to the right by as many characters as maxPartSize allows in addition.

How can we do this?

Is this a good and efficient approach? I don't know. How does it go with one of the basic approaches from above? I don't know.

If m >= maxPartSize, output the substring anyway: Is this really desired? m could be very large (if the segments are very large; the extreme case being the document being one single segment).

The above definition of how to output parts with at most maxPartSize characters seems hard to be implemented efficiently. Moreover, it would imply that, by cutting at character bounds, only susbtrings of words could be output at each end of the part. Is this desirable?

Alternative requirement

Another approach is to define maxPartSize as the maximum size of a part, measured in words. I (Joachim) think that this is a better definition:

With this definition, let L0=(11,13,16), maxPartSize=9. Then the first parts that are output are

7,8,9)(10,*11*,12,*13*,14)(15

How do we get this? Assume the implementation is matching-position-driven (as in the example implementation). Then

In this approach, the concept of segment bounds (and radius) is completely irrelevant unless the rule is to always output whole segment. If so, we could, after the 4 steps to the left and to the right, go further to the left and to the right until we reach the next segment bound. This still would make the concept of the radius irrelevant.