Differences between revisions 5 and 6
Revision 5 as of 2007-08-29 09:17:45
Size: 4174
Editor: mpiat1403
Comment: Initial description
Revision 6 as of 2007-08-29 09:31:01
Size: 4424
Editor: mpiat1403
Comment: Initial description
Deletions are marked like this. Additions are marked like this.
Line 9: Line 9:
D = (0,1,2,3,4,5,6,7,8,9,10,...) D = (0,1,2,3,4,5,6,7,8,9,10,...).
Line 11: Line 11:
Each position holds the code of a word or a non-word. By lookup in a dictionary, each position can be mapped to its corresponding word or non-word, so that the document can easily be (re-)constructed from its positions. Each position holds the code of a word or a non-word. By a lookup in a dictionary, each position can be mapped to its corresponding word or non-word, so that the document can easily be (re-)constructed from its positions.
Line 13: Line 13:
Each document is (conceptually) divided into '''segments''', that is, into intervals of positions. For example, the segments may be the sentences of the document. The two extreme cases are: The whole document is one single segment or each word is a segment of its own. Each document is (conceptually) divided into '''segments''', that is, into intervals of positions. For example, the segments may be the sentences of the document. The two extreme cases are:

 1.
The whole document is one single segment.
 1. E
ach word is a segment of its own.
Line 45: Line 48:
For each segment matching a position, the function returns all words the segment consists of. If desired, the matching positions (words) are highlighted. If more than one matching position is contained in a segment, this segment is not returned twice. Rather, the matching positions in this segment coming from a different position list are given a different highlighting. For each segment matching a position in one of the L,,i,,, the function returns all words the segment consists of. If desired, the matching positions (words) are highlighted. If more than one matching position is contained in a segment, this segment is not returned twice. Rather, with highlighting enabled, the matching positions in this segment coming from a different position list are given a different highlighting.
Line 47: Line 50:
In the running example, the function would return the following parts (with highlighting enabled) In the running example, the function would return the following parts (with highlighting enabled and * being the highlighting for L,,0,,, + for L,,1,,, and $ for L,,2,,):
Line 51: Line 54:
So a '''part''' is more or less the positions (words) of a matching segment, maybe augmented by some highlighting markup. So a '''part''' is defined to be the positions (words) of a matching segment, maybe augmented by some highlighting markup.
Line 53: Line 56:
The final output of the central function is a concatenation of all parts computed, divided by a separator from one another (e.g., "...". This output is called '''excerpt'''- The final output of the central function is a concatenation of all parts computed, divided by a separator from one another (e.g., "..."). This output is called '''excerpt'''-

In the example, the excerpt returned would be

$5$,*6*,*7*,+8+,9...$10$,+11+,*12*,13,14...20,+21+,$22$,23,24

This page describes the central function of the Excerpt Generator. It first gives a detailed requirement specification, together with a running example. Then it describes the implementation.

Requirements

Terminology and basic task

A document D trivially consists of positions

D = (0,1,2,3,4,5,6,7,8,9,10,...).

Each position holds the code of a word or a non-word. By a lookup in a dictionary, each position can be mapped to its corresponding word or non-word, so that the document can easily be (re-)constructed from its positions.

Each document is (conceptually) divided into segments, that is, into intervals of positions. For example, the segments may be the sentences of the document. The two extreme cases are:

  1. The whole document is one single segment.
  2. Each word is a segment of its own.

For example, D could be segmented into the positions

SP = (0,5,10,15,20,25,...),

that is, into the segments

S = ([0,4],[5,9],[10,14],[15,19],[20,24],...).

A list L of positions is called a position list. In general, a position list describes a set of words, each being located in a segment of the document. Such a segment in called to match the position (word) and conversely the position is called to match the segment.

For example, with the position list

L0 = (6,7,12)

the matching segments are

([5,9],[10,14]).

The basic task of the central function is the following: Given the segmentation S of the document and some position lists L0, L1,..., the function computes all segments that match at least one of the positions in one of the Li.

For example, with the additional position lists

L1 = (8,11,21)

L2 = (5,10,22)

the matching segments are:

([5,9],[10,14],[20,24]).

For each segment matching a position in one of the Li, the function returns all words the segment consists of. If desired, the matching positions (words) are highlighted. If more than one matching position is contained in a segment, this segment is not returned twice. Rather, with highlighting enabled, the matching positions in this segment coming from a different position list are given a different highlighting.

In the running example, the function would return the following parts (with highlighting enabled and * being the highlighting for L0, + for L1, and $ for L2):

($5$,*6*,*7*,+8+,9), ($10$,+11+,*12*,13,14), (20,+21+,$22$,23,24).

So a part is defined to be the positions (words) of a matching segment, maybe augmented by some highlighting markup.

The final output of the central function is a concatenation of all parts computed, divided by a separator from one another (e.g., "..."). This output is called excerpt-

In the example, the excerpt returned would be

$5$,*6*,*7*,+8+,9...$10$,+11+,*12*,13,14...20,+21+,$22$,23,24

Input

Remark: Some of the following parameters should not be passed as arguments to the function, but rather be members of the object representing an Excerpt Generator.

  • A document D (type Document)
  • m position lists (type vector<vector<Position>>)

  • The radius (type unsigned int): specifies how many segments should be output around a position. The default is 0, that is, only the segment that contains the position is output. With radius 1, the segments one to the left and one to the right are also output etc.

  • maxNumOfParts (type unsigned int): specifies the maximum number of parts to be output. The parts to be output should contain , in shares as equal as possible, positions from the position lists. In particular, if maxNumOfParts > number of position list, for each list at least one part with a position in this list should be output. parts located near the start of the document have priority over parts located near the end.

Output

Open specification questions

What to do if the same word is to be highlighted because it matches in different position lists? What highlighting is to be preferred?

Implementation

The class Document has a mehtod getSegmentBounds(), which returns a list of Positions. A Position is an unsigned int, see Globals.h. For testing purposes, it suffices a preliminary, trivial implementation that segments every document into segments of lenght 5, that is, into the segments ([0,4],[5,9],[10,14],[15,19],[20,24],...).

CompleteSearch: completesearch/ExcerptGenerator/CentralFunction (last edited 2007-11-09 15:42:31 by mpiat1403)