Size: 3460
Comment:
|
Size: 3461
Comment:
|
Deletions are marked like this. | Additions are marked like this. |
Line 108: | Line 108: |
if (s[i] & 128 + 64 + 32 ! 128 + 64) { return false; } | if (s[i] & 128 + 64 + 32 != 128 + 64) { return false; } |
TASK 1 (Compression + entropy)
1, x, x+1, 2x, 2x+1, 3x, 3x+1, 4x
1.1 For x = 10, gap-encode using Golomb with M = 8
Gaps are: +1, +9, +1, +9, +1, +9, +1, +9 (eight gaps, not seven) Golomb code for +1 : 1001 Golomb code for +9 : 01001 Encoding of the whole sequence: 1001 01001 1001 01001 1001 01001 1001 01001
1.2 No larger M gives shorter sequence
For M > 8, writing the residual (r) part alone costs >= 4 Bits. Together with the 1 separating it from the q part, this is >= 5 Bits per code. That is >= 40 Bits for the sequence above. With M = 8 we need only 36 Bits.
1.3 Function that computes Golomb code for M = 8
ArrayList<Bool> golombCode(int x) { result = new ArrayList<Bool>; int q = x / 8; int r = x % 8; for (int i = 0; i < q; i++) { result.add(0); } result.add(1); result.add(r & 4 > 0 ? 1 : 0); result.add(r & 2 > 0 ? 1 : 0); result.add(r & 1 > 0 ? 1 : 0); return result;
1.4 Entropy-optimal code for sequence above, for abitrary x
We have two symbols, each occuring the same number of times. The entropy of the corresponding random variable (X taking on either value with probability 1/2) is 1/2 * log_2 1/2 + 1/2 * log_2 1/2 = 1. And there is indeed a (prefix-free) encoding with just 1 Bit per value:
1 : 0 x-1 : 1
TASK 2 (Prefix edit distance + 2-grams)
x = angela, y = angelina
2.1 Compute the PED
PED(angela, angelina) = ED(angela, angeli) = 1 Justification: If PED = 0 than x is a perfect prefix of y, which is not the case. PED is 1 with 1 operation: replace a -> i PED(angelina, angela) = ED(angelina, angela) = 2 Justification: for shorter prefixes of angela, the difference of the string length would be >= 3, and so certainly ED >= 3.
2.2 Compute comm(x, y)
2-grams of x = angela: an, ng, ge, el, la 2-grams of y = angelina: an, ng, ge, el, li, in, na comm(x, y) = 4
2.3 If PED(x, y) = 1, then comm(x, y) >= |x| - 3
The number of 2-grams of x is |x| - 1. If PED(x, y) = 1, then we have y' prefix of y with ED(x, y') = 1. One operation can affect at most 2 2-grams, so comm(x, y') >= |x| - 3. Hence also comm(x, y) >= |x| - 3.
2.4 Example for PED(x, y) = 1, |x| = 3, comm(x, y) = 0
x = aba y = aca
TASK 3 (UTF-8)
10000000 11000000 11000000 11000000 10111111 11100000 10111111 10111111
3.1 Find three errors
Error 1: The first byte does not start a UTF-8 multi-byte character. Error 2: the third byte should start with 10 Error 3: the code point of bytes 4+5 is <= 127
3.2 Write a function that detects errors (considering only UTF-8 codes of length <= 2)
Bool utf8Correct(bytes[] s) { i = 0; while (i < s.length) { // If byte starts with 0, fine -> skip to next byte. if (s[i] & 128 == 0) { i++; continue; } // If byte does not start with 110 -> mistake. if (s[i] & 128 + 64 + 32 != 128 + 64) { return false; } // Check that next byte starts with 10, and codepoint is > 127. if (i + 1 == s.length || s[i+1] & 128 + 64 != 128) { return false;} if (s[i] & 63 < 2) { return false; } i += 2; } return true; }