Differences between revisions 12 and 13

This page describes the central function of the ExcerptGenerator. It first gives a detailed requirement specification, together with a running example. Then it describes the implementation.

Requirements

Anchor(Terminology)

Terminology and basic task

A document D trivially consists of positions

D = (0,1,2,3,4,5,6,7,8,9,10,...).

Each position holds the code of a word or a non-word. By a lookup in a dictionary, each position can be mapped to its corresponding word or non-word, so that the document can easily be (re-)constructed from its positions. The following algorithmic description mostly deals with positions, that is, with integers, not with words (strings). It suffices to bear in mind that the positions can easily be translated into the corresponding words if needed. Conversely, each word or non-word gives rise to a list of positions: the positions at which this word occurs in the document.

Each document is (conceptually) divided into segments, that is, into intervals of positions. For example, the segments may be the sentences of the document. The two extreme cases of segmentation are:

The whole document is one single segment.
Each word is a segment of its own.

For example, D could be segmented at the segment bounds

SB = (0,5,10,15,20,25,...),

Each segment bound (which is a position) starts a new segment, that is, the document D is segmented into the segments

S = ([0,4],[5,9],[10,14],[15,19],[20,24],...).

A list L of positions is called a position list. In general, a position list describes a set of words, each being located in a (not necessarily different) segment of the document. Such a segment is called to match the position (i.e., the word) and conversely the position (the word) is called to match the segment.

For example, with the position list

L₀ = (6,7,12)

the matching segments are

([5,9],[10,14]).

The basic task of the central function is the following: Given the segmentation S of the document and some position lists L₀, L₁,..., the function computes all segments that match at least one of the positions in one of the L_i.

For example, with the additional position lists

L₁ = (8,11,21)

L₂ = (5,10,22)

the matching segments are:

([5,9],[10,14],[20,24]).

For each segment matching a position in one of the L_i, the function returns all words the segment consists of. If desired, the matching positions (i.e., the words) are highlighted. If more than one matching position is contained in a segment, this segment is not returned twice. Rather, with highlighting enabled, the matching positions in this segment coming from a different position list are given a different highlighting.

In the running example, with highlighting enabled and * being the highlighting for L₀, + for L₁, and $ for L₂, the function would return the following parts (of course, words are returned, not integers):

($5$,*6*,*7*,+8+,9), ($10$,+11+,*12*,13,14), (20,+21+,$22$,23,24).

So a part is defined to be the positions (words) of a matching segment, maybe augmented by highlighting markup.

The final output of the central function is a concatenation of all parts computed, divided by a separator from one another (e.g., "..."). This output is called the excerpt.

In the example, the excerpt returned would be

$5$,*6*,*7*,+8+,9...$10$,+11+,*12*,13,14...20,+21+,$22$,23,24

Input

Remark: Some of the following parameters should not be passed as arguments to the function, but rather be members of the object representing an ExcerptGenerator.

A document D (type Document).
A list of position lists (type vector<vector<Position>>).
The radius (type unsigned int): specifies how many segments around a matching segment should additonally go into the corresponding part. The default is 0, that is, only the matching segment. With radius 1, the segments one to the left and one to the right become also part of the part etc.
maxPartSize (type unsigned int): the maximum size of a part, measured in characters; has priority over the radius; see below.
maxNumOfParts (type unsigned int): specifies the maximum number of parts of the excerpt.
partsSeparator (type string): the separator string dividing the parts of the excerpt.
doHighlighting (type bool): specifies whether the words at the matching positions shall be highlighted.
highlighingTags (type vector<pair<string, string>>): a list of pairs of start and end tags for highlighting markup, such as the pair (<b style="color:red">,</b>); used to highlight words on matching positions; each position list is assigned a new pair from highlightingTags; if there are more position lists than the vector has elements, the highlighting markup is cyclically repeated.

Output

For the output, the excerpt is computed as explained in the section [#Terminology Terminology and basic task], with the following additions:

The excerpt contains up to maxNumOfParts parts. The parts should contain positions from the position lists in shares as equal as possible. In particular, if maxNumOfParts > number of position list, for each list at least one part with a position in this list should go into the excerpt. Parts located near the start of the document have priority over parts located near the end.
Adjacent parts should not be separated with the partsSeparator. Thus, in the example, the output should be $5$,*6*,*7*,+8+,9,$10$,+11+,*12*,13,14...20,+21+,$22$,23,24
If maxPartSize is less than the part computed according to the radius (this may even occur with radius=0, i.e., the matching segment itself is too long), then compute the substring of the part that begins with the leftmost matching position (word) and ends with the rightmost matching position. If this substring is still longer than maxPartSize allows, output it anyway. If it is shorter, expand it to the left and to the right by as many characters as maxPartSize allows in addition.
If the same word is to be highlighted because it matches in different position lists, than the highlighting for position list i has priority over position list j if i < j.

The overall output then is the title of the document followed by the excerpt.

Implementation

TODO: This section is preliminary stuff.

The class Document has a method getSegmentBounds(), which returns a list of Positions. A Position is an unsigned int, see Globals.h. For testing purposes, it suffices a preliminary, trivial implementation that segments every document into segments of lenght 5, that is, into the segments ([0,4],[5,9],[10,14],[15,19],[20,24],...).

CompleteSearch: completesearch/ExcerptGenerator/CentralFunction (last edited 2007-11-09 15:42:31 by mpiat1403)

-  ⇤ ← Revision 12 as of 2007-08-29 11:40:40 → 
  Size: 6490
  Editor: mpiat1403
  Comment: Initial description
+   ← Revision 13 as of 2007-08-29 11:58:20 → ⇥
  Size: 6781
  Editor: mpiat1403
  Comment: Initial description
-Deletions are marked like this.
+Additions are marked like this.
 Line 13:
-Each position holds the code of a word or a non-word. By a lookup in a dictionary, each position can be mapped to its corresponding word or non-word, so that the document can easily be (re-)constructed from its positions. The following algorithmic description mostly deals with positions, that is, with integers, not with words (strings). It suffices to bear in mind that the positions can easily be translated into the corresponding words if needed.
+Each position holds the code of a word or a non-word. By a lookup in a dictionary, each position can be mapped to its corresponding word or non-word, so that the document can easily be (re-)constructed from its positions. The following algorithmic description mostly deals with positions, that is, with integers, not with words (strings). It suffices to bear in mind that the positions can easily be translated into the corresponding words if needed. Conversely, each word or non-word gives rise to a list of positions: the positions at which this word occurs in the document.
 Line 22:
-SP = (0,5,10,15,20,25,...),
+SB = (0,5,10,15,20,25,...),
 Line 24:
-Each segment bound starts a new segment, that is, the document is segmented into the segments
+Each segment bound (which is a position) starts a new segment, that is, the document D is segmented into the segments
 Line 28:
-A list L of positions is called a '''position list'''. In general, a position list describes a set of words, each being located in a segment of the document. Such a segment in called to '''match''' the position (word) and conversely the position is called to match the segment.
+A list L of positions is called a '''position list'''. In general, a position list describes a set of words, each being located in a (not necessarily different) segment of the document. Such a segment is called to '''match''' the position (i.e., the word) and conversely the position (the word) is called to match the segment.
 Line 50:
-For each segment matching a position in one of the L,,i,,, the function returns all words the segment consists of. If desired, the matching positions (words) are highlighted. If more than one matching position is contained in a segment, this segment is not returned twice. Rather, with highlighting enabled, the matching positions in this segment coming from a different position list are given a different highlighting.
+For each segment matching a position in one of the L,,i,,, the function returns all words the segment consists of. If desired, the matching positions (i.e., the words) are highlighted. If more than one matching position is contained in a segment, this segment is not returned twice. Rather, with highlighting enabled, the matching positions in this segment coming from a different position list are given a different highlighting.
 Line 52:
-In the running example, the function would return the following parts (with highlighting enabled and * being the highlighting for L,,0,,, + for L,,1,,, and $ for L,,2,,):
+In the running example, with highlighting enabled and * being the highlighting for L,,0,,, + for L,,1,,, and $ for L,,2,,, the function would return the following parts (of course, ''words'' are returned, not integers):
 Line 56:
-So a '''part''' is defined to be the positions (words) of a matching segment, maybe augmented by some highlighting markup.
+So a '''part''' is defined to be the positions (words) of a matching segment, maybe augmented by highlighting markup.
 Line 58:
-The final output of the central function is a concatenation of all parts computed, divided by a separator from one another (e.g., "..."). This output is called '''excerpt'''-
+The final output of the central function is a concatenation of all parts computed, divided by a separator from one another (e.g., "..."). This output is called the '''excerpt'''.
 Line 72:
- * m position lists (type `vector<vector<Position>>`).
 * The '''radius''' (type `unsigned int`): specifies how many segments around a matching segment should additonally go into the corresponding part. The default is 0, that is, only the matching segment. With radius 1, the segments one to the left and one to the right are also part of the part etc.
 * maxPartSize (type `unsigned int`): tha maximum size of a part, measured in characters; has priority over the radius; see below.
+ * A list of position lists (type `vector<vector<Position>>`).
 * The '''radius''' (type `unsigned int`): specifies how many segments around a matching segment should additonally go into the corresponding part. The default is 0, that is, only the matching segment. With radius 1, the segments one to the left and one to the right become also part of the part etc.
 * maxPartSize (type `unsigned int`): the maximum size of a part, measured in characters; has priority over the radius; see below.
 Line 78:
- * highlighingTags (type `vector<pair<string, string>>`): pairs of start and end tags of highlighting markup, such as (`<b style="color:red">,</b>`); used to highlight words on matching positions; each position list is assigned a new pair of highlightingTags; if there are more position lists than the size of the vector, the highlighting markup is cyclically repeated.
+ * highlighingTags (type `vector<pair<string, string>>`): a list of pairs of start and end tags for highlighting markup, such as the pair (`<b style="color:red">,</b>`); used to highlight words on matching positions; each position list is assigned a new pair from highlightingTags; if there are more position lists than the vector has elements, the highlighting markup is cyclically repeated.