TASK 1 (Compression + entropy)
1, x, x+1, 2x, 2x+1, 3x, 3x+1, 4x
1.1 For x = 10, gap-encode using Golomb with M = 8
Gaps are: +1, +9, +1, +9, +1, +9, +1, +9 (eight gaps, not seven) Golomb code for +1 : 1001 Golomb code for +9 : 01001 Encoding of the whole sequence: 1001 01001 1001 01001 1001 01001 1001 01001
1.2 No larger M gives shorter sequence
For M > 8, writing the residual (r) part alone costs >= 4 Bits. Together with the 1 separating it from the q part, this is >= 5 Bits per code. That is >= 40 Bits for the sequence above. With M = 8 we need only 36 Bits.
1.3 Function that computes Golomb code for M = 8
ArrayList<Bool> golombCode(int x) { result = new ArrayList<Bool>; int q = x / 8; int r = x % 8; for (int i = 0; i < q; i++) { result.add(0); } result.add(1); result.add(r & 4 > 0 ? 1 : 0); result.add(r & 2 > 0 ? 1 : 0); result.add(r & 1 > 0 ? 1 : 0); return result;
1.4 Entropy-optimal code for sequence above, for abitrary x
We have two symbols, each occuring the same number of times. The entropy of the corresponding random variable (X taking on either value with probability 1/2) is 1/2 * log_2 1/2 + 1/2 * log_2 1/2 = 1. And there is indeed a (prefix-free) encoding with just 1 Bit per value:
1 : 0 x-1 : 1
TASK 2 (Prefix edit distance + 2-grams)
x = angela, y = angelina
2.1 Compute the PED
PED(angela, angelina) = ED(angela, angeli) = 1 Justification: If PED = 0 than x is a perfect prefix of y, which is not the case. PED is 1 with 1 operation: replace a -> i PED(angelina, angela) = ED(angelina, angela) = 2 Justification: for shorter prefixes of angela, the difference of the string length would be >= 3, and so certainly ED >= 3.
2.2 Compute comm(x, y)
2-grams of x = angela: an, ng, ge, el, la 2-grams of y = angelina: an, ng, ge, el, li, in, na comm(x, y) = 4
2.3 If PED(x, y) = 1, then comm(x, y) >= |x| - 3
The number of 2-grams of x is |x| - 1. If PED(x, y) = 1, then we have y' prefix of y with ED(x, y') = 1. One operation can affect at most 2 2-grams, so comm(x, y') >= |x| - 3. Hence also comm(x, y) >= |x| - 3.
2.4 Example for PED(x, y) = 1, |x| = 3, comm(x, y) = 0
x = aba y = aca
TASK 3 (UTF-8)
10000000 11000000 11000000 11000000 10111111 11100000 10111111 10111111
3.1 Find three errors
Error 1: The first byte does not start a UTF-8 multi-byte character. Error 2: the third byte should start with 10 Error 3: the code point of bytes 4+5 is <= 127
3.2 Write a function that detects errors (considering only UTF-8 codes of length <= 2)
Bool utf8Correct(bytes[] s) { i = 0; while (i < s.length) { // If byte starts with 0, fine -> skip to next byte. if (s[i] & 128 == 0) { i++; continue; } // If byte does not start with 110 -> mistake. if (s[i] & 128 + 64 + 32 != 128 + 64) { return false; } // Check that next byte starts with 10, and codepoint is > 127. if (i + 1 == s.length || s[i+1] & 128 + 64 != 128) { return false;} if (s[i] & 63 < 2) { return false; } i += 2; } return true; }
TASK 4 (Clustering + latent semantic indexing)
Three documents: D1 = (1,0), D2 = (0,1), D3 = (1,1)
4.1 2-means with initial centroids C1 = (0.5, 0.5), C2 = (1, 1)
dist(Di, C1) = sqrt(1/4 + 1/4) = sqrt(1/2) = 1/2 * sqrt(2) = 0.71.. dist(D1, C2) = 1, dist(D2, C2) = 1, dist(D3, C2) = 0 D1 and D2 in Cluster C1, D3 in Cluster C3 C1 centroid -> (0.5, 0.5) C2 centroid -> (1, 1) Same centroid, so we can stop
4.2 Different centroids giving a different clustering
C1 = (1.0, 0.5), C2 = (0, 1) dist(D1, C1) = 0.71, dist(D2, C1) > 0.71, dist(D3, C1) = 0.71 dist(D1, C2) = 1, dist(D2, C2) = 0, dist(D3, C2) = 1 D1 and D3 in Cluster C1, D3 in Cluster C3 C1 centroid -> (1.0, 0.5) C2 centroid -> (0, 1) Same centroid, so we can stop
4.3 Rank of the corresponding term-document matrix.
It has rank 2, because D3 = D1 + D2, and D1 and D2 are linearly independent.
4.4 Effect of repeating every document n times (3 -> 3n documents)
Clustering does not change (distances don't change + averages don't change, because all documents repeated the same number of times).
The rank also does not change. We can still express all documents as linear combinations of D1 and D2. And D1 and D2 are still linearly independent.
TASK 5 (Probablity + maximum likelihood estimation)
Four rolls of a four-sided die, outcome is: 2, 4, 4, 4.
5.1 Mean of the sequence above + expected mean of fair such die
Mean of the sequence above = (2 + 4 + 4 + 4) / 4 = 14 / 4 = 3.5 Expected mean for a fair die = (1 + 2 + 3 + 4) / 4 = 10 / 4 = 2.5
5.2 Probability of a fair die achieving that or larger mean
2, 4, 4, 4: 4 ways to place the 2 3, 3, 4, 4: (4 over 2) = 6 ways to place the two 3s 3, 4, 4, 4: 4 ways to place the 3 4, 4, 4, 4: one such outcome This is 4 + 6 + 4 + 1 = 15 outcomes with mean >= 3.5. Each has probability (1/4)^4 = 1/256. So probabilty in question is 15 / 256.
5.3 Most likely probability distribution for outcome above
n_i = number of rolls with number i ... in the outcome above n1 = 0, n2 = 1, n3 = 0, n4 = 3 We want to maximize product_i pi^ni, or equivalently the logarithm of it = sum_i ni * ln pi. Side constraint is sum_i pi = 1. Lagrangian optimization with L = sum_i ni * ln pi + lambda * (1 - sum_i pi) Derivate of L wrt pi = ni/pi - lambda = 0 => ni/pi = lamba. Sind sum_i pi = 1 we get pi = ni/sum_i ni. Specifically here: p1 = 0, p2 = 1/4, p3 = 0, p4 = 3/4.
TASK 6 (Ontologies + SPARQL + SQL)
6.1 SPARQL Query
SELECT ?actor WHERE { ?actor acted-in ?movie . "Steven Spielberg" directed ?movie
6.2 SQL Query
SELECT acted-in.actor FROM acted-in, directed WHERE acted-in.movie = directed.movie AND directed.director = "Steven Spielberg"
6.3 Optimization
Precomput a hash map (directors -> movies) for the table "directed". This gives you the k movies of a given director in O(k) time. Precompute a hash map (movies -> actors) for the table "acted-in". This gives you the m actors of a given movie in O(m) time. These two together give you all n actors from all k movies of a given director in O(k + n) time. The result may contain duplicates. Filtering these out requires sorting, but that was not asked for.