This page describes the central function of the Excerpt Generator. It first gives a detailed requirement specification, together with a running example. Then it describes the implementation.

Requirements

Terminology

A document D trivially consists of positions

D = (0,1,2,3,4,5,6,7,8,9,10,...)

Each position holds the code of a word or a non-word. By lookup in a dictionary, each position can be mapped to its corresponding word or non-word, so that the document can easily be (re-)constructed from the positions.
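The lookup described above can be sketched as follows. This is only an illustration under assumptions: the names Code, Dictionary and reconstruct() are hypothetical and do not appear in the Excerpt Generator, and the dictionary is assumed to be a simple hash map from codes to strings.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical names for illustration; the page does not define them.
using Code = unsigned int;
using Dictionary = std::unordered_map<Code, std::string>;

// Rebuilds the document text by mapping the code stored at each position
// to its word or non-word and concatenating the results in order.
// Dictionary::at throws std::out_of_range if a code has no entry.
std::string reconstruct(const std::vector<Code>& codes, const Dictionary& dict) {
    std::string text;
    for (Code c : codes) {
        text += dict.at(c);
    }
    return text;
}
```

In this sketch, non-words such as blanks and punctuation are ordinary dictionary entries, so plain concatenation is enough to recover the original text.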

Each document is (conceptually) divided into segments. For example, the segments may be the sentences of the document. The two extreme cases are that the whole document is one single segment, or that each word is a segment of its own.

For example, D could be segmented at the positions

SP = (0,5,10,15,20,25,...)

that is, into the segments

S = ([0,4],[5,9],[10,14],[15,19],[20,24],...)
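The step from SP to S can be sketched as a small helper. The names Segment and toSegments() are hypothetical, the Position alias merely stands in for the definition in Globals.h, and the document length is passed in explicitly so that the last segment can be closed; none of this is prescribed by the requirements.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

using Position = unsigned int;                  // stand-in for Globals.h
using Segment = std::pair<Position, Position>;  // [first, last], both inclusive

// Converts segment start positions into closed intervals.
// Assumes starts is strictly increasing; docLength is the number of
// positions in the document. Starts at or beyond docLength are ignored.
std::vector<Segment> toSegments(const std::vector<Position>& starts,
                                Position docLength) {
    std::vector<Segment> segments;
    for (std::size_t i = 0; i < starts.size(); ++i) {
        if (starts[i] >= docLength) break;
        // A segment ends just before the next start, or at the document end.
        Position next = (i + 1 < starts.size()) ? starts[i + 1] : docLength;
        if (next > docLength) next = docLength;
        segments.emplace_back(starts[i], next - 1);
    }
    return segments;
}
```

For a document with 13 positions and the start positions (0,5,10), this sketch yields ([0,4],[5,9],[10,12]).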

The output of the central function is called an excerpt. An excerpt consists of parts, which are divided from one another by a separator (e.g., "...").
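How parts and separator combine into an excerpt can be sketched as below. The function name joinParts() and the representation of parts as plain strings are assumptions made for this illustration only.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Joins the parts of an excerpt, placing the separator between
// consecutive parts (not before the first or after the last one).
std::string joinParts(const std::vector<std::string>& parts,
                      const std::string& separator = "...") {
    std::string excerpt;
    for (std::size_t i = 0; i < parts.size(); ++i) {
        if (i > 0) excerpt += separator;
        excerpt += parts[i];
    }
    return excerpt;
}
```

With the parts "first part" and "second part" and the default separator, the sketch produces "first part...second part".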

Implementation

The class Document has a method getSegmentBounds(), which returns a list of Positions. A Position is an unsigned int; see Globals.h. For testing purposes, a preliminary, trivial implementation suffices that splits every document into segments of length 5, that is, into the segments ([0,4],[5,9],[10,14],[15,19],[20,24],...).
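A minimal sketch of such a preliminary implementation might look as follows. It makes several assumptions that the page does not confirm: the Document class is reduced to a stand-in that only knows its length, the returned "list" is a std::vector, the bounds are the start positions of the segments, and the Position alias replaces the include of Globals.h.

```cpp
#include <vector>

using Position = unsigned int;  // stand-in for the definition in Globals.h

// Reduced stand-in for the real Document class; only the number of
// positions is modeled here.
class Document {
public:
    explicit Document(Position length) : length_(length) {}

    // Preliminary implementation for testing: one bound every
    // kSegmentLength positions, i.e. the start position of each segment.
    std::vector<Position> getSegmentBounds() const {
        std::vector<Position> bounds;
        Position p = 0;
        while (p < length_) {
            bounds.push_back(p);
            // Stop before p could step past the end (also avoids wrap-around).
            if (length_ - p <= kSegmentLength) break;
            p += kSegmentLength;
        }
        return bounds;
    }

private:
    static constexpr Position kSegmentLength = 5;
    Position length_;
};
```

For a document with 12 positions, this sketch returns the bounds (0,5,10), which correspond to the segments ([0,4],[5,9],[10,11]).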