Differences between revisions 2 and 3

This page describes the central function of the Excerpt Generator. It first gives a detailed requirement specification, together with a running example. Then it describes the implementation.

Requirements

Terminology

A document D is, trivially, a sequence of positions

D = (0,1,2,3,4,5,6,7,8,9,10,...)

Each position holds the code of a word or a non-word. By lookup in a dictionary, each position can be mapped to its corresponding word or non-word, so that the document can easily be (re-)constructed from the positions.

Each document is (conceptually) divided into segments. For example, the segments may be the sentences of the document. The two extreme cases are that the whole document is one single segment, or that each word is a segment of its own.

For example, D could be segmented at the positions

SP = (0,5,10,15,20,25,...)

that is, into the segments

S = ([0,4],[5,9],[10,14],[15,19],[20,24],...)
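The relation between SP and S can be sketched in code. The following is only an illustration under assumptions: the helper names segmentsFromStarts and segmentIndexOf, and the docLength parameter (the number of positions in D, assumed to be greater than 0), are hypothetical and not part of the Excerpt Generator. The segment start positions are assumed to be sorted in ascending order.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

typedef unsigned int Position;  // a Position is an unsigned int

// Hypothetical helper: turn the segment start positions SP into closed
// intervals [first, last]. docLength is the number of positions in D.
std::vector<std::pair<Position, Position>> segmentsFromStarts(
    const std::vector<Position>& starts, Position docLength) {
  std::vector<std::pair<Position, Position>> segments;
  for (std::size_t i = 0; i < starts.size(); ++i) {
    // A segment ends just before the next one starts; the last segment
    // ends at the last position of the document.
    Position last =
        (i + 1 < starts.size()) ? starts[i + 1] - 1 : docLength - 1;
    segments.push_back(std::make_pair(starts[i], last));
  }
  return segments;
}

// Hypothetical helper: index of the segment that contains pos.
// Assumes starts is sorted ascending and starts[0] <= pos.
std::size_t segmentIndexOf(const std::vector<Position>& starts,
                           Position pos) {
  // upper_bound finds the first start greater than pos; the segment
  // containing pos is the one just before it.
  return static_cast<std::size_t>(
             std::upper_bound(starts.begin(), starts.end(), pos) -
             starts.begin()) - 1;
}
```

With SP = (0,5,10,15,20) and a document of 25 positions, segmentsFromStarts yields the intervals [0,4], [5,9], and so on, and segmentIndexOf maps position 7 to segment 1.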

The output of the central function is called an excerpt. An excerpt consists of parts, which are divided from one another by a separator (e.g., "...").

The basic task of the central function is the following: Given the segmentation S of the document and some position lists L₀, L₁, ..., the function computes all segments that contain at least one position from one of the L_i. For each segment matching a position, the function returns all words the segment consists of. If desired, the matching positions (words) are highlighted. If a segment contains more than one matching position, the segment is not returned twice. Rather, the different matching positions in this segment are each given a different highlighting.
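A minimal sketch of this matching step, under stated assumptions, might look as follows. The struct Match and the function matchingSegments are hypothetical names chosen for illustration, the segmentation is assumed to be given as ascending segment start positions, and positions beyond the end of the document are not checked. Collecting matches per segment index ensures that a segment with several matching positions appears only once, while the list index kept with each match could drive the choice of highlighting.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <vector>

typedef unsigned int Position;

// One matching position together with the index i of the list L_i it
// came from (hypothetical type, usable to pick a highlighting style).
struct Match {
  Position position;
  std::size_t listIndex;
};

// Hypothetical sketch: for every segment that contains at least one
// position from one of the lists, collect the matches it contains.
// The map is keyed by segment index, so each segment occurs only once.
std::map<std::size_t, std::vector<Match>> matchingSegments(
    const std::vector<Position>& segmentStarts,
    const std::vector<std::vector<Position>>& positionLists) {
  std::map<std::size_t, std::vector<Match>> result;
  for (std::size_t i = 0; i < positionLists.size(); ++i) {
    for (Position pos : positionLists[i]) {
      // First segment start greater than pos; the segment containing
      // pos is the one just before it.
      std::vector<Position>::const_iterator it = std::upper_bound(
          segmentStarts.begin(), segmentStarts.end(), pos);
      if (it == segmentStarts.begin()) continue;  // before first segment
      std::size_t seg =
          static_cast<std::size_t>(it - segmentStarts.begin()) - 1;
      result[seg].push_back(Match{pos, i});
    }
  }
  return result;
}
```

For the segment starts (0,5,10,15,20) and the lists L₀ = (2,12) and L₁ = (3,22), the sketch reports segments 0, 2 and 4; segment 0 is reported once, with one match from each list.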

Input

Remark: Some of the following parameters should not be passed as arguments to the function, but rather be members of the object representing an Excerpt Generator.

A document D (type Document)
m position lists (type vector<vector<Position>>)
The radius (type unsigned int): specifies how many segments should be output around a position. The default is 0, that is, only the segment that contains the position is output. With radius 1, the segment immediately to the left and the segment immediately to the right are also output, and so on.
maxNumOfParts (type unsigned int): specifies the maximum number of parts to be output. The parts to be output should contain positions from the position lists in shares as equal as possible. In particular, if maxNumOfParts > number of position lists, at least one part with a position from each list should be output. Parts located near the start of the document have priority over parts located near the end.
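One possible reading of radius and maxNumOfParts is sketched below; it is not the specified behaviour. The function names selectSegments and expandByRadius are hypothetical, the input is assumed to be, for each position list, the ascending indices of the segments it matches, and numSegments is assumed to be greater than 0. Two choices are assumptions of this sketch: a segment already chosen for another list counts as this list's turn, and parts that overlap or touch after applying the radius are merged into one part.

```cpp
#include <algorithm>
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

// Hypothetical sketch: choose at most maxNumOfParts segments by taking
// turns over the lists, earliest segments first, so that the lists get
// shares as equal as possible.
std::vector<std::size_t> selectSegments(
    const std::vector<std::vector<std::size_t>>& segmentsPerList,
    unsigned int maxNumOfParts) {
  std::set<std::size_t> chosen;
  std::vector<std::size_t> next(segmentsPerList.size(), 0);
  bool progress = true;
  while (progress && chosen.size() < maxNumOfParts) {
    progress = false;
    for (std::size_t i = 0; i < segmentsPerList.size(); ++i) {
      if (chosen.size() >= maxNumOfParts) break;
      if (next[i] >= segmentsPerList[i].size()) continue;  // exhausted
      // Inserting an already chosen segment is a no-op; it still
      // counts as this list's turn (assumption of this sketch).
      chosen.insert(segmentsPerList[i][next[i]]);
      ++next[i];
      progress = true;
    }
  }
  return std::vector<std::size_t>(chosen.begin(), chosen.end());
}

// Hypothetical sketch: widen each chosen segment by radius segments on
// both sides, clamp to the document, and merge parts that overlap or
// touch. segments must be sorted ascending; numSegments must be > 0.
std::vector<std::pair<std::size_t, std::size_t>> expandByRadius(
    const std::vector<std::size_t>& segments, unsigned int radius,
    std::size_t numSegments) {
  std::vector<std::pair<std::size_t, std::size_t>> parts;
  for (std::size_t seg : segments) {
    std::size_t first = seg >= radius ? seg - radius : 0;
    std::size_t last = std::min(seg + static_cast<std::size_t>(radius),
                                numSegments - 1);
    if (!parts.empty() && first <= parts.back().second + 1) {
      parts.back().second = std::max(parts.back().second, last);
    } else {
      parts.push_back(std::make_pair(first, last));
    }
  }
  return parts;
}
```

Because of the merging, the number of parts finally output can be smaller than the number of segments selected; whether that is desired is left open here.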

Output

Implementation

The class Document has a method getSegmentBounds(), which returns a list of Positions. A Position is an unsigned int, see Globals.h. For testing purposes, a preliminary, trivial implementation suffices that divides every document into segments of length 5, that is, into the segments ([0,4],[5,9],[10,14],[15,19],[20,24],...).
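Such a trivial implementation could be sketched as follows. The class shown is only a hypothetical stand-in for the real Document class: the constructor, the member _numPositions, and the choice of std::vector as the returned list type are assumptions, as is the interpretation that the returned bounds are the segment start positions (0,5,10,...).

```cpp
#include <vector>

typedef unsigned int Position;  // as in Globals.h, per the text above

// Hypothetical stand-in for the real Document class, for testing only.
class Document {
 public:
  explicit Document(Position numPositions)
      : _numPositions(numPositions) {}

  // Trivial segmentation: every segment has length 5, so the segment
  // bounds are the start positions 0, 5, 10, ...
  std::vector<Position> getSegmentBounds() const {
    const Position segmentLength = 5;
    std::vector<Position> bounds;
    for (Position p = 0; p < _numPositions; p += segmentLength) {
      bounds.push_back(p);
    }
    return bounds;
  }

 private:
  Position _numPositions;  // number of positions in the document
};
```

For a document of 23 positions this yields the bounds (0,5,10,15,20), where the last segment is shorter than 5.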

CompleteSearch: completesearch/ExcerptGenerator/CentralFunction (last edited 2007-11-09 15:42:31 by mpiat1403)

Revision 2 as of 2007-08-28 16:58:22
  Size: 2465
  Editor: mpiat1403
  Comment:
Revision 3 as of 2007-08-29 08:30:27
  Size: 3067
  Editor: mpiat1403
  Comment: Initial description