Size: 4764
Comment:
|
Size: 4766
Comment:
|
Deletions are marked like this. | Additions are marked like this. |
Line 122: | Line 122: |
=== 4.1 2-means with initial centroids C1 = (0.5, 0.5), C2 = (1, 1) == | === 4.1 2-means with initial centroids C1 = (0.5, 0.5), C2 = (1, 1) === |
Line 132: | Line 132: |
=== 4.2 Different centroids giving a different clustering == | === 4.2 Different centroids giving a different clustering === |
TASK 1 (Compression + entropy)
1, x, x+1, 2x, 2x+1, 3x, 3x+1, 4x
1.1 For x = 10, gap-encode using Golomb with M = 8
Gaps are: +1, +9, +1, +9, +1, +9, +1, +9 (eight gaps, not seven) Golomb code for +1 : 1001 Golomb code for +9 : 01001 Encoding of the whole sequence: 1001 01001 1001 01001 1001 01001 1001 01001
1.2 No larger M gives shorter sequence
For M > 8, writing the residual (r) part alone costs >= 4 Bits. Together with the 1 separating it from the q part, this is >= 5 Bits per code. That is >= 40 Bits for the sequence above. With M = 8 we need only 36 Bits.
1.3 Function that computes Golomb code for M = 8
ArrayList<Bool> golombCode(int x) { result = new ArrayList<Bool>; int q = x / 8; int r = x % 8; for (int i = 0; i < q; i++) { result.add(0); } result.add(1); result.add(r & 4 > 0 ? 1 : 0); result.add(r & 2 > 0 ? 1 : 0); result.add(r & 1 > 0 ? 1 : 0); return result;
1.4 Entropy-optimal code for sequence above, for abitrary x
We have two symbols, each occuring the same number of times. The entropy of the corresponding random variable (X taking on either value with probability 1/2) is 1/2 * log_2 1/2 + 1/2 * log_2 1/2 = 1. And there is indeed a (prefix-free) encoding with just 1 Bit per value:
1 : 0 x-1 : 1
TASK 2 (Prefix edit distance + 2-grams)
x = angela, y = angelina
2.1 Compute the PED
PED(angela, angelina) = ED(angela, angeli) = 1 Justification: If PED = 0 than x is a perfect prefix of y, which is not the case. PED is 1 with 1 operation: replace a -> i PED(angelina, angela) = ED(angelina, angela) = 2 Justification: for shorter prefixes of angela, the difference of the string length would be >= 3, and so certainly ED >= 3.
2.2 Compute comm(x, y)
2-grams of x = angela: an, ng, ge, el, la 2-grams of y = angelina: an, ng, ge, el, li, in, na comm(x, y) = 4
2.3 If PED(x, y) = 1, then comm(x, y) >= |x| - 3
The number of 2-grams of x is |x| - 1. If PED(x, y) = 1, then we have y' prefix of y with ED(x, y') = 1. One operation can affect at most 2 2-grams, so comm(x, y') >= |x| - 3. Hence also comm(x, y) >= |x| - 3.
2.4 Example for PED(x, y) = 1, |x| = 3, comm(x, y) = 0
x = aba y = aca
TASK 3 (UTF-8)
10000000 11000000 11000000 11000000 10111111 11100000 10111111 10111111
3.1 Find three errors
Error 1: The first byte does not start a UTF-8 multi-byte character. Error 2: the third byte should start with 10 Error 3: the code point of bytes 4+5 is <= 127
3.2 Write a function that detects errors (considering only UTF-8 codes of length <= 2)
Bool utf8Correct(bytes[] s) { i = 0; while (i < s.length) { // If byte starts with 0, fine -> skip to next byte. if (s[i] & 128 == 0) { i++; continue; } // If byte does not start with 110 -> mistake. if (s[i] & 128 + 64 + 32 != 128 + 64) { return false; } // Check that next byte starts with 10, and codepoint is > 127. if (i + 1 == s.length || s[i+1] & 128 + 64 != 128) { return false;} if (s[i] & 63 < 2) { return false; } i += 2; } return true; }
TASK 4 (Clustering + latent semantic indexing)
Three documents: D1 = (1,0), D2 = (0,1), D3 = (1,1)
4.1 2-means with initial centroids C1 = (0.5, 0.5), C2 = (1, 1)
dist(Di, C1) = sqrt(1/4 + 1/4) = sqrt(1/2) = 1/2 * sqrt(2) = 0.71.. dist(D1, C2) = 1, dist(D2, C2) = 1, dist(D3, C2) = 0 D1 and D2 in Cluster C1, D3 in Cluster C3 C1 centroid -> (0.5, 0.5) C2 centroid -> (1, 1) Same centroid, so we can stop
4.2 Different centroids giving a different clustering
C1 = (1.0, 0.5), C2 = (0, 1) dist(D1, C1) = 0.71, dist(D2, C1) > 0.71, dist(D3, C1) = 0.71 dist(D1, C2) = 1, dist(D2, C2) = 0, dist(D3, C2) = 1 D1 and D3 in Cluster C1, D3 in Cluster C3 C1 centroid -> (1.0, 0.5) C2 centroid -> (0, 1) Same centroid, so we can stop
4.3 Rank of the corresponding term-document matrix.
It has rank 2, because D3 = D1 + D2, and D1 and D2 are linearly independent.
4.4 Effect of repeating every document n times (3 -> 3n documents)
Clustering does not change (distances don't change + averages don't change, because all documents repeated the same number of times).
The rank also does not change. We can still express all documents as linear combinations of D1 and D2. And D1 and D2 are still linearly independent.