AD Teaching Wiki
  • Comments
  • Immutable Page
  • Menu
    • Navigation
    • RecentChanges
    • FindPage
    • Local Site Map
    • Help
    • HelpContents
    • HelpOnMoinWikiSyntax
    • Display
    • Attachments
    • Info
    • Raw Text
    • Print View
    • Edit
    • Load
    • Save
  • Login

FrontPage

Upload page content

You can upload content for the page named below. If you change the page name, you can also upload content for another page. If the page name is empty, we derive the page name from the file name.

File to load page content from
Page name
Comment

Revision 6 as of 2014-02-19 15:15:51
AD Teaching Wiki:
  • InformationRetrievalWS1314
  • ExamSolutions

TASK 1 (Compression + entropy)

1, x, x+1, 2x, 2x+1, 3x, 3x+1, 4x

1.1 For x = 10, gap-encode using Golomb with M = 8

Gaps are: +1, +9, +1, +9, +1, +9, +1, +9 (eight gaps, not seven)
Golomb code for +1 : 1001
Golomb code for +9 : 01001
Encoding of the whole sequence: 1001 01001 1001 01001 1001 01001 1001 01001

1.2 No larger M gives shorter sequence

For M > 8, writing the residual (r) part alone costs >= 4 Bits. Together with the 1 separating it from the q part, this is >= 5 Bits per code. That is >= 40 Bits for the sequence above. With M = 8 we need only 36 Bits.

1.3 Function that computes Golomb code for M = 8

ArrayList<Bool> golombCode(int x) {
  result = new ArrayList<Bool>; 
  int q = x / 8;
  int r = x % 8;
  for (int i = 0; i < q; i++) { result.add(0); }
  result.add(1);
  result.add(r & 4 > 0 ? 1 : 0);
  result.add(r & 2 > 0 ? 1 : 0);
  result.add(r & 1 > 0 ? 1 : 0);
  return result;

1.4 Entropy-optimal code for sequence above, for abitrary x

We have two symbols, each occuring the same number of times. The entropy of the corresponding random variable (X taking on either value with probability 1/2) is 1/2 * log_2 1/2 + 1/2 * log_2 1/2 = 1. And there is indeed a (prefix-free) encoding with just 1 Bit per value:

1 : 0
x-1 : 1

TASK 2 (Prefix edit distance + 2-grams)

x = angela, y = angelina

2.1 Compute the PED

PED(angela, angelina) = ED(angela, angeli) = 1
Justification: If PED = 0 than x is a perfect prefix of y, which is not the case. PED is 1 with 1 operation: replace a -> i

PED(angelina, angela) = ED(angelina, angela) = 2
Justification: for shorter prefixes of angela, the difference of the string length would be >= 3, and so certainly ED >= 3.

2.2 Compute comm(x, y)

2-grams of x = angela:  an, ng, ge, el, la
2-grams of y = angelina:  an, ng, ge, el, li, in, na
comm(x, y) = 4

2.3 If PED(x, y) = 1, then comm(x, y) >= |x| - 3

The number of 2-grams of x is |x| - 1.
If PED(x, y) = 1, then we have y' prefix of y with ED(x, y') = 1.
One operation can affect at most 2 2-grams, so comm(x, y') >= |x| - 3.
Hence also comm(x, y) >= |x| - 3.

2.4 Example for PED(x, y) = 1, |x| = 3, comm(x, y) = 0

x = aba
y = aca

TASK 3 (UTF-8)

10000000 11000000 11000000 11000000 10111111 11100000 10111111 10111111

3.1 Find three errors

Error 1: The first byte does not start a UTF-8 multi-byte character.
Error 2: the third byte should start with 10
Error 3: the code point of bytes 4+5 is <= 127

3.2 Write a function that detects errors (considering only UTF-8 codes of length <= 2)

Bool utf8Correct(bytes[] s) {
  i = 0;
  while (i < s.length) {
     // If byte starts with 0, fine -> skip to next byte.
    if (s[i] & 128 == 0) { i++; continue; }
    // If byte does not start with 110 -> mistake.
    if (s[i] & 128 + 64 + 32 != 128 + 64) { return false; }
    // Check that next byte starts with 10, and codepoint is > 127.
    if (i + 1 == s.length || s[i+1] & 128 + 64 != 128) { return false;}
    if (s[i] & 63 < 2) { return false; }
    i += 2;    
  }
  return true;
}   
  • MoinMoin Powered
  • Python Powered
  • GPL licensed
  • Valid HTML 4.01