TASK 1 (Inverted Index)
1.1 Inverted lists with tf-scores, gap-encoded
bla: 1 tf2, 2 tf2, 4 tf2 -> +1 tf2, +1 tf2, +2 tf2
bli: 2 tf1, 5 tf1 -> +2 tf1, +3 tf1
blu: 3 tf2, 4 tf1, 5 tf2 -> +3 tf2, +1 tf1, +1 tf2
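A minimal sketch of the gap computation (the function name is made up for illustration); each doc id is replaced by the difference to its predecessor:

#include <iostream>
#include <vector>

// Replace each doc id by the difference to its predecessor (the first id keeps its value).
std::vector<int> gapEncode(const std::vector<int>& docIds) {
  std::vector<int> gaps;
  int previous = 0;
  for (int id : docIds) {
    gaps.push_back(id - previous);
    previous = id;
  }
  return gaps;
}

int main() {
  // List for "bla": documents 1, 2, 4 -> gaps +1, +1, +2.
  for (int gap : gapEncode({1, 2, 4})) std::cout << "+" << gap << " ";
  std::cout << std::endl;
}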
1.3 With entropy-optimal encoding
+1 -> 0, +2 -> 10, +3 -> 11
tf1 -> 0, tf2 -> 1
bla: 01 01 101
bli: 100 110
blu: 111 00 01
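A small sketch of the encoding step, hard-coding the code tables above (all variable names are made up for illustration):

#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

int main() {
  // Entropy-optimal codes from above.
  std::map<int, std::string> gapCode = {{1, "0"}, {2, "10"}, {3, "11"}};
  std::map<int, std::string> tfCode  = {{1, "0"}, {2, "1"}};
  // Gap-encoded list for "bla": (+1, tf2), (+1, tf2), (+2, tf2).
  std::vector<std::pair<int, int>> bla = {{1, 2}, {1, 2}, {2, 2}};
  for (auto [gap, tf] : bla) std::cout << gapCode[gap] << tfCode[tf] << " ";
  std::cout << std::endl;  // prints: 01 01 101
}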
TASK 2 (List Intersection)
2.1 Code for intersecting two lists represented as bit vectors:
#include <cassert>
#include <vector>

// Intersect two inverted lists represented as bit vectors of the same length.
std::vector<bool> intersect(const std::vector<bool>& list1, const std::vector<bool>& list2) {
  assert(list1.size() == list2.size());
  std::vector<bool> result;
  for (size_t i = 0; i < list1.size(); i++) {
    result.push_back(list1[i] && list2[i]);
  }
  return result;
}
CORRECTION NOTE: It can be assumed that they have the same length; if someone does special magic for different lengths that is also fine, but it's not necessary.
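A short usage sketch (together with the intersect function above), using the lists for bla (documents 1, 2, 4) and blu (documents 3, 4, 5) from Task 1 as bit vectors over documents 1..5:

#include <iostream>

int main() {
  std::vector<bool> bla = {1, 1, 0, 1, 0};  // bla occurs in documents 1, 2, 4
  std::vector<bool> blu = {0, 0, 1, 1, 1};  // blu occurs in documents 3, 4, 5
  std::vector<bool> both = intersect(bla, blu);
  for (size_t i = 0; i < both.size(); i++) {
    if (both[i]) std::cout << "document " << (i + 1) << std::endl;  // prints: document 4
  }
}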
2.2 Comparison with gap-encoded representation.
The entropy of the given probability distribution is H(X) = - sum_i Pr(X = i) * log2 Pr(X = i) = sum_i i / 2^i = 2. Hence, for an entropy-optimal code, we need about 2 bits per gap, that is, 2m bits for a list of length m. The bit vector needs n bits, so it is better if and only if n < 2m, that is, if and only if m > n/2.
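A small numerical check of both statements (the infinite sum is truncated; the value n = 1000 is just an example):

#include <cmath>
#include <iostream>

int main() {
  // H(X) = sum_i i / 2^i, truncated at i = 50.
  double entropy = 0.0;
  for (int i = 1; i <= 50; i++) entropy += i / std::pow(2.0, i);
  std::cout << "entropy ~ " << entropy << std::endl;  // ~ 2

  // Space comparison: the bit vector needs n bits, the gap-encoded list about 2 * m bits.
  int n = 1000;
  for (int m : {100, 500, 501, 900})
    std::cout << "m = " << m << ": bit vector better? "
              << (n < 2 * m ? "yes" : "no") << std::endl;
}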
TASK 3 (Fuzzy Search with a 2-Gram Index)
3.1 2-gram index and lower bound
W1: #bla#   W2: #bli#   W3: #blu#

2-gram index:
#b: W1, W2, W3
bl: W1, W2, W3
la: W1
li: W2
lu: W3
a#: W1
i#: W2
u#: W3

Lower bound: ED(x, y) <= 1 => the number of common 2-grams of x' and y' must be >= comm := max(|x|, |y|) - 1
3.2 Find all words with ED <= 1 to blaa
x = blaa, x' = #blaa#
#b: W1, W2, W3
bl: W1, W2, W3
la: W1
aa: (empty)
a#: W1

2-grams in common with blaa: W1: 4, W2: 2, W3: 2
comm = max(4, 3) - 1 = 3, so only W1 passes the filter and we only need to compute the edit distance to W1, which is 1.
Hence bla is the only word with ED <= 1.
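A sketch of the whole pipeline for this task (function and variable names are made up for illustration): build the 2-gram index for W1..W3, apply the comm filter for the query blaa, and verify the surviving candidate with the edit distance.

#include <algorithm>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// 2-grams of the padded word #w#.
std::vector<std::string> twoGrams(const std::string& word) {
  std::string padded = "#" + word + "#";
  std::vector<std::string> grams;
  for (size_t i = 0; i + 1 < padded.size(); i++) grams.push_back(padded.substr(i, 2));
  return grams;
}

// Standard edit distance via dynamic programming.
int editDistance(const std::string& a, const std::string& b) {
  std::vector<std::vector<int>> d(a.size() + 1, std::vector<int>(b.size() + 1));
  for (size_t i = 0; i <= a.size(); i++) d[i][0] = i;
  for (size_t j = 0; j <= b.size(); j++) d[0][j] = j;
  for (size_t i = 1; i <= a.size(); i++)
    for (size_t j = 1; j <= b.size(); j++)
      d[i][j] = std::min({d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1] ? 1 : 0)});
  return d[a.size()][b.size()];
}

int main() {
  std::vector<std::string> words = {"bla", "bli", "blu"};
  // Build the 2-gram index: 2-gram -> list of word ids.
  std::map<std::string, std::vector<int>> index;
  for (int id = 0; id < (int)words.size(); id++)
    for (const std::string& gram : twoGrams(words[id])) index[gram].push_back(id);

  // Query: count common 2-grams per word, keep only words with count >= max(|x|, |y|) - 1.
  std::string query = "blaa";
  std::map<int, int> common;
  for (const std::string& gram : twoGrams(query))
    for (int id : index[gram]) common[id]++;
  for (auto [id, count] : common) {
    int threshold = (int)std::max(query.size(), words[id].size()) - 1;
    if (count >= threshold && editDistance(query, words[id]) <= 1)
      std::cout << words[id] << std::endl;  // prints: bla
  }
}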
TASK 4 (Latent Semantic Indexing)
4.1 Rank + decomposition
For example, the first three column vectors are linearly independent -> rank 3.
CORRECTION NOTE: no need to prove the linear independence (but the selected vectors should indeed be linearly independent). Easy proof of linear independence: compute the determinant, e.g. of the matrix formed by the first three column vectors. If it is non-zero, they are linearly independent.
Obtain A' by changing the entry at row 2, column 2 from 0 to 1.
     [ 1  1  0  1  0.5 ]   [ 1  0 ]
A' = [ 1  1  1  2  1   ] = [ 1  1 ] * [ 1  1  0  1  0.5 ]
     [ 0  0  1  1  0.5 ]   [ 0  1 ]   [ 0  0  1  1  0.5 ]
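A short check (variable names made up) that multiplies the two factors and prints the result, which should equal A':

#include <iostream>
#include <vector>

int main() {
  // The two rank-2 factors of A' from above.
  std::vector<std::vector<double>> u = {{1, 0}, {1, 1}, {0, 1}};
  std::vector<std::vector<double>> v = {{1, 1, 0, 1, 0.5}, {0, 0, 1, 1, 0.5}};
  for (int i = 0; i < 3; i++) {
    for (int j = 0; j < 5; j++) {
      double entry = 0;
      for (int k = 0; k < 2; k++) entry += u[i][k] * v[k][j];
      std::cout << entry << " ";
    }
    std::cout << std::endl;  // each row should equal the corresponding row of A'
  }
}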
4.2 Ranking for query Q = (1, 1, 0)
|Q| = sqrt(2)
Scores with division by |Q|: 1, 1, 1/2, sqrt(3)/2, sqrt(3)/2
Scores without division by |Q|: sqrt(2), sqrt(2), sqrt(1/2), sqrt(3/2), sqrt(3/2)
The ranking is hence D#1, D#2, D#4, D#5, D#3 (D#1 and D#2 are tied, as are D#4 and D#5, so the order within each tie is arbitrary).
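A small sketch computing these scores as cosine similarities over the columns of A' (variable names are made up):

#include <cmath>
#include <iostream>
#include <vector>

int main() {
  // Columns of A' as documents D1..D5, query Q = (1, 1, 0).
  std::vector<std::vector<double>> docs = {
      {1, 1, 0}, {1, 1, 0}, {0, 1, 1}, {1, 2, 1}, {0.5, 1, 0.5}};
  std::vector<double> q = {1, 1, 0};
  double qNorm = std::sqrt(q[0] * q[0] + q[1] * q[1] + q[2] * q[2]);
  for (size_t i = 0; i < docs.size(); i++) {
    double dot = 0, norm = 0;
    for (int k = 0; k < 3; k++) {
      dot += docs[i][k] * q[k];
      norm += docs[i][k] * docs[i][k];
    }
    // Dividing by |Q| scales all scores by the same factor and does not change the ranking.
    std::cout << "D" << (i + 1) << ": " << dot / (std::sqrt(norm) * qNorm) << std::endl;
  }
}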
TASK 5 (k-means)
5.1 Clustering of 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 with 1 and 2 as initial centroids.
Step 1: 1 | 2 3 4 5 6 7 8 9 10, centroids 1 6
Step 2: 1 2 3 | 4 5 6 7 8 9 10, centroids 2 7
Step 3: 1 2 3 4 | 5 6 7 8 9 10, centroids 2.5 7.5
Step 4: 1 2 3 4 5 | 6 7 8 9 10, centroids 3 8
No more change!
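A minimal sketch of these iterations (1-dimensional k-means with k = 2; ties are assigned to the left cluster, which matches point 5 in step 4 above; variable names are made up):

#include <cmath>
#include <iostream>
#include <vector>

int main() {
  std::vector<double> points = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
  double c1 = 1, c2 = 2;  // initial centroids
  for (int step = 1; step <= 10; step++) {
    // Assignment step: each point goes to the closer centroid (ties to the left one).
    std::vector<double> left, right;
    for (double p : points)
      (std::abs(p - c1) <= std::abs(p - c2) ? left : right).push_back(p);
    // Update step: the new centroids are the means of the two clusters.
    double s1 = 0, s2 = 0;
    for (double p : left) s1 += p;
    for (double p : right) s2 += p;
    double n1 = s1 / left.size(), n2 = s2 / right.size();
    std::cout << "step " << step << ": centroids " << n1 << " " << n2 << std::endl;
    if (n1 == c1 && n2 == c2) break;  // no more change
    c1 = n1;
    c2 = n2;
  }
}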
5.2 Same clustering for any choice of initial centroids.
1. The clustering always consists of two contiguous parts.
2. Assume the right part is bigger: then its leftmost element will be closer to the centroid of the left part.
3. Assume the left part is bigger: then its rightmost element will be closer to the centroid of the right part.
4. Thus, as long as one part is bigger, the bigger part will shrink and the smaller part will grow.
5. Eventually, the two parts are of equal size and there is no more change.
TASK 6 (Naive Bayes)
6.1 Computing the priors and classifying xy and xxxy
Pr(x|X) = 2/3, Pr(y|X) = 1/3, Pr(x|Y) = 2/3, Pr(y|Y) = 1/3, Pr(X) = 2/3, Pr(Y) = 1/3
Since the word probabilities are the same for both classes, they cancel in the ratio:
Pr(X|xy) / Pr(Y|xy) = Pr(X) / Pr(Y) = 2, and the same for Pr(X|xxxy) / Pr(Y|xxxy).
Hence both xy and xxxy are classified as X.
6.2 Generalization + draw the points and the band of the corresponding SVM classifier.
For any w = (nx, ny):
Pr(X|w) = Pr(X) * Pr(x|X)^nx * Pr(y|X)^ny = 2/3 * (2/3)^nx * (1/3)^ny
Pr(Y|w) = Pr(Y) * Pr(x|Y)^nx * Pr(y|Y)^ny = 1/3 * (2/3)^nx * (1/3)^ny
Hence always Pr(X|w) / Pr(Y|w) = 2 => any w will be classified as X.
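A small sketch evaluating this ratio for a few documents w = (nx, ny) (the example documents are arbitrary; all probabilities are the ones from 6.1):

#include <cmath>
#include <iostream>
#include <utility>
#include <vector>

int main() {
  // Priors and word probabilities from 6.1.
  double prX = 2.0 / 3, prY = 1.0 / 3;
  double pxGivenX = 2.0 / 3, pyGivenX = 1.0 / 3;
  double pxGivenY = 2.0 / 3, pyGivenY = 1.0 / 3;
  std::vector<std::pair<int, int>> documents = {{1, 1}, {3, 1}, {0, 5}};
  for (auto [nx, ny] : documents) {
    double ratio = (prX * std::pow(pxGivenX, nx) * std::pow(pyGivenX, ny)) /
                   (prY * std::pow(pxGivenY, nx) * std::pow(pyGivenY, ny));
    std::cout << "(" << nx << ", " << ny << "): " << ratio << std::endl;  // always 2
  }
}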
2  ---X--------X-------X-------X---
1  ---Y--------Y-------------------
   0    1    2    3    4    5    6    7
The SVM band is simply the strip between the two horizontal lines of points.