== TASK 1 (Inverted Index) ==

=== 1.1 Inverted lists with tf-scores, gap-encoded ===
{{{
bla: 1 tf2, 2 tf2, 4 tf2  ->  +1 tf2, +1 tf2, +2 tf2
bli: 2 tf1, 5 tf1         ->  +2 tf1, +3 tf1
blu: 3 tf2, 4 tf1, 5 tf2  ->  +3 tf2, +1 tf1, +1 tf2
}}}
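As a quick sanity check of the gap encoding, here is a minimal C++ sketch (not part of the original solution; the function name gapEncode is my own):

{{{
#include <cassert>
#include <vector>

// Turn a sorted list of doc ids into gaps; the first gap is relative to 0.
std::vector<int> gapEncode(const std::vector<int>& docIds) {
  std::vector<int> gaps;
  int previous = 0;
  for (int docId : docIds) {
    gaps.push_back(docId - previous);
    previous = docId;
  }
  return gaps;
}

int main() {
  // The list for "bla" from above: 1, 2, 4 -> +1, +1, +2.
  assert((gapEncode({1, 2, 4}) == std::vector<int>{1, 1, 2}));
}
}}}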
=== 1.3 With entropy-optimal encoding ===

{{{
+1 -> 0    +2 -> 10    +3 -> 11
tf1 -> 0   tf2 -> 1

bla: 01 01 101
bli: 100 110
blu: 111 00 01
}}}
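To see that this code is prefix-free and hence decodable, here is a small decoder sketch of my own (not from the exam), which alternately reads one gap codeword and one tf bit:

{{{
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Decode alternating gap codewords (0 -> +1, 10 -> +2, 11 -> +3)
// and tf bits (0 -> tf1, 1 -> tf2).
std::vector<std::pair<int, int>> decode(const std::string& bits) {
  std::vector<std::pair<int, int>> result;
  size_t i = 0;
  while (i < bits.size()) {
    int gap;
    if (bits[i] == '0') { gap = 1; i += 1; }           // codeword 0
    else if (bits[i + 1] == '0') { gap = 2; i += 2; }  // codeword 10
    else { gap = 3; i += 2; }                          // codeword 11
    int tf = (bits[i] == '0') ? 1 : 2;                 // one tf bit
    i += 1;
    result.push_back({gap, tf});
  }
  return result;
}

int main() {
  // "bla" from above: 01 01 101 -> (+1, tf2), (+1, tf2), (+2, tf2).
  std::vector<std::pair<int, int>> expected = {{1, 2}, {1, 2}, {2, 2}};
  assert(decode("0101101") == expected);
}
}}}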
== TASK 2 (List Intersection) ==

=== 2.1 Code for intersecting two lists represented as bit vectors ===
{{{
Array<bool> intersect(Array<bool> list1, Array<bool> list2) {
  assert(list1.size() == list2.size());
  Array<bool> result;
  for (int i = 0; i < list1.size(); i++) {
    result.push(list1[i] & list2[i]);
  }
  return result;
}
}}}
''CORRECTION NOTE: It can be assumed that they have the same length; if someone does special magic for different lengths that is also fine, but it's not necessary.''
=== 2.2 Comparison with gap-encoded representation ===
The entropy of the given probability distribution is H(X) = sum_i i / 2^i = 2. Hence, for an entropy-optimal code, we need 2m bits for a list of length m, whereas the bit vector always needs n bits. The bit vector is hence better if and only if n < 2m, that is, if and only if m > n/2.
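A quick numerical check of the two claims (my own sketch, not part of the solution): the series sum_i i / 2^i indeed converges to 2, and comparing 2m against n gives the stated break-even point.

{{{
#include <cassert>
#include <cmath>

int main() {
  // H(X) = sum_i i / 2^i -> 2 (truncated after 50 terms).
  double entropy = 0.0;
  for (int i = 1; i <= 50; i++) {
    entropy += i / std::pow(2.0, i);
  }
  assert(std::abs(entropy - 2.0) < 1e-9);

  // Example with m > n / 2: the bit vector (n bits) beats the
  // gap-encoded list (2m bits).
  int n = 1000, m = 600;
  assert(n < 2 * m);
}
}}}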
== TASK 3 (List Intersection) ==

=== 3.1 2-gram index and lower bound ===
{{{
W1: #bla#   W2: #bli#   W3: #blu#

#b: W1, W2, W3
bl: W1, W2, W3
la: W1
li: W2
lu: W3
a#: W1
i#: W2
u#: W3

ED(x, y) <= 1  =>  #common 2-grams of x' and y' must be >= comm := max(|x|, |y|) - 1
}}}
=== 3.2 Find all words with ED <= 1 to blaa ===
{{{
x = blaa
x' = #blaa#

#b: W1, W2, W3
bl: W1, W2, W3
la: W1
aa:
a#: W1

2-grams in common with blaa:
W1: 4
W2: 2
W3: 2

comm >= 3, so only need to compute edit distance to W1, which is 1.
Hence bla is the only word with ED <= 1.
}}}
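The filter and the final edit distance computation can be checked mechanically; below is a small sketch of my own (the helper names twoGrams, commonGrams, and editDistance are not from the exam):

{{{
#include <algorithm>
#include <cassert>
#include <iterator>
#include <string>
#include <vector>

// Pad a word with # on both sides and return its 2-grams.
std::vector<std::string> twoGrams(const std::string& word) {
  std::string padded = "#" + word + "#";
  std::vector<std::string> grams;
  for (size_t i = 0; i + 1 < padded.size(); i++) {
    grams.push_back(padded.substr(i, 2));
  }
  return grams;
}

// Number of common 2-grams, with multiplicities.
int commonGrams(const std::string& x, const std::string& y) {
  std::vector<std::string> gx = twoGrams(x), gy = twoGrams(y);
  std::sort(gx.begin(), gx.end());
  std::sort(gy.begin(), gy.end());
  std::vector<std::string> common;
  std::set_intersection(gx.begin(), gx.end(), gy.begin(), gy.end(),
                        std::back_inserter(common));
  return common.size();
}

// Standard dynamic-programming edit distance.
int editDistance(const std::string& x, const std::string& y) {
  std::vector<std::vector<int>> d(
      x.size() + 1, std::vector<int>(y.size() + 1, 0));
  for (size_t i = 0; i <= x.size(); i++) d[i][0] = i;
  for (size_t j = 0; j <= y.size(); j++) d[0][j] = j;
  for (size_t i = 1; i <= x.size(); i++) {
    for (size_t j = 1; j <= y.size(); j++) {
      int sub = d[i - 1][j - 1] + (x[i - 1] != y[j - 1] ? 1 : 0);
      d[i][j] = std::min({d[i - 1][j] + 1, d[i][j - 1] + 1, sub});
    }
  }
  return d[x.size()][y.size()];
}

int main() {
  assert(commonGrams("blaa", "bla") == 4);   // W1 passes the filter ...
  assert(commonGrams("blaa", "bli") == 2);   // ... W2 and W3 do not,
  assert(commonGrams("blaa", "blu") == 2);   // since comm >= 3 is required.
  assert(editDistance("blaa", "bla") == 1);  // And indeed ED(blaa, bla) = 1.
}
}}}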
== TASK 4 (Latent Semantic Indexing) ==

=== 4.1 Rank + decomposition ===
For example, the first three column vectors are linearly independent -> rank 3.
''CORRECTION NOTE: no need to prove the linear independence (but the vectors selected should indeed be linearly independent). Easy proof for linear independence: compute the determinant, e.g. of the first three column vectors. If it is non-zero, they are linearly independent.''
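For the concrete matrix here, that determinant check looks as follows (my own sketch; the first three columns of A are hard-coded):

{{{
#include <cassert>

// Determinant of a 3x3 matrix via the rule of Sarrus.
double det3(double a, double b, double c,
            double d, double e, double f,
            double g, double h, double i) {
  return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g);
}

int main() {
  // First three columns of A: (1, 1, 0), (1, 0, 0), (0, 1, 1).
  assert(det3(1, 1, 0,
              1, 0, 1,
              0, 0, 1) != 0.0);  // det = -1, hence linearly independent.
}
}}}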
Obtain A' by changing the entry at row 2, column 2 from 0 to 1.
{{{
     [1 1 0 1 0.5]   [1 0]   [1 1 0 1 0.5]
A' = [1 1 1 2 1  ] = [1 1] * [0 0 1 1 0.5]
     [0 0 1 1 0.5]   [0 1]
}}}
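One can verify the rank-2 factorization by multiplying it out; a minimal sketch of my own with the matrices hard-coded:

{{{
#include <cassert>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Plain triple-loop matrix multiplication.
Matrix multiply(const Matrix& B, const Matrix& C) {
  Matrix R(B.size(), std::vector<double>(C[0].size(), 0.0));
  for (size_t i = 0; i < B.size(); i++)
    for (size_t k = 0; k < C.size(); k++)
      for (size_t j = 0; j < C[0].size(); j++)
        R[i][j] += B[i][k] * C[k][j];
  return R;
}

int main() {
  Matrix B = {{1, 0}, {1, 1}, {0, 1}};
  Matrix C = {{1, 1, 0, 1, 0.5}, {0, 0, 1, 1, 0.5}};
  Matrix A2 = {{1, 1, 0, 1, 0.5}, {1, 1, 1, 2, 1}, {0, 0, 1, 1, 0.5}};
  assert(multiply(B, C) == A2);  // A' = B * C with two rank-2 factors.
}
}}}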
=== 4.2 Ranking for query Q = (1, 1, 0) ===
{{{
|Q| = sqrt(2)

Scores with division by |Q|:    1, 1, 1/2, sqrt(3)/2, sqrt(3)/2
Scores without division by |Q|: sqrt(2), sqrt(2), sqrt(1/2), sqrt(3/2), sqrt(3/2)
}}}

The ranking is hence D#1, D#2, D#4, D#5, D#3.
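Assuming the documents are the columns of A' above (an assumption on my part, consistent with the scores), the scores without division by |Q| are dot(Q, D) / |D|; a short check:

{{{
#include <cassert>
#include <cmath>
#include <vector>

double dot(const std::vector<double>& a, const std::vector<double>& b) {
  double s = 0.0;
  for (size_t i = 0; i < a.size(); i++) s += a[i] * b[i];
  return s;
}

int main() {
  // Columns of A' as document vectors D#1, ..., D#5.
  std::vector<std::vector<double>> docs = {
      {1, 1, 0}, {1, 1, 0}, {0, 1, 1}, {1, 2, 1}, {0.5, 1, 0.5}};
  std::vector<double> q = {1, 1, 0};
  std::vector<double> expected = {std::sqrt(2.0), std::sqrt(2.0),
                                  std::sqrt(0.5), std::sqrt(1.5),
                                  std::sqrt(1.5)};
  for (size_t i = 0; i < docs.size(); i++) {
    double score = dot(q, docs[i]) / std::sqrt(dot(docs[i], docs[i]));
    assert(std::abs(score - expected[i]) < 1e-9);
  }
}
}}}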
== TASK 5 (k-means) ==

=== 5.1 Clustering of 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 with 1 and 2 as initial centroids ===
{{{
Step 1: 1 | 2 3 4 5 6 7 8 9 10,  centroids 1 6
Step 2: 1 2 3 | 4 5 6 7 8 9 10,  centroids 2 7
Step 3: 1 2 3 4 | 5 6 7 8 9 10,  centroids 2.5 7.5
Step 4: 1 2 3 4 5 | 6 7 8 9 10,  centroids 3 8

No more change!
}}}
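These steps can be reproduced with a few lines of 1-D k-means; this sketch is my own, with ties assigned to the left cluster as in the steps above (fine for this input; a general implementation would guard against empty clusters):

{{{
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
  std::vector<double> points = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
  double c1 = 1, c2 = 2;  // initial centroids
  for (int step = 1; ; step++) {
    // Assign each point to the closer centroid (ties go left).
    std::vector<double> left, right;
    for (double p : points) {
      (std::abs(p - c1) <= std::abs(p - c2) ? left : right).push_back(p);
    }
    // Recompute the centroids as the cluster means.
    double s1 = 0, s2 = 0;
    for (double p : left) s1 += p;
    for (double p : right) s2 += p;
    double n1 = s1 / left.size(), n2 = s2 / right.size();
    if (n1 == c1 && n2 == c2) break;  // no more change
    c1 = n1;
    c2 = n2;
    std::printf("Step %d: centroids %g %g\n", step, c1, c2);
  }
}
}}}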
=== 5.2 Same clustering for any choice of initial centroids ===
{{{
1. Clustering is always in two contiguous parts.
2. Assume the right part is bigger: then its leftmost element will be closer to the centroid of the left part.
3. Assume the left part is bigger: then its rightmost element will be closer to the centroid of the right part.
4. Thus, as long as one part is bigger, the bigger part will become smaller and the smaller part bigger.
5. Eventually, the two parts will be of equal size and there is no more change.
}}}
== TASK 6 (Naive Bayes) ==

=== 6.1 Computing the priors and classifying xy and xxxy ===
{{{
Pr(x|X) = 2/3, Pr(y|X) = 1/3
Pr(x|Y) = 2/3, Pr(y|Y) = 1/3
Pr(X)   = 2/3, Pr(Y)   = 1/3

Pr(X|xy) / Pr(Y|xy) = Pr(X) / Pr(Y) = 2, and the same for Pr(X|xxxy) / Pr(Y|xxxy).
}}}
Hence both xy and xxxy are classified as X.
=== 6.2 Generalization + draw points and band of the corresponding SVM classifier ===
{{{
For any w = (nx, ny):

Pr(X|w) = Pr(X) * Pr(x|X)^nx * Pr(y|X)^ny = 2/3 * (2/3)^nx * (1/3)^ny
Pr(Y|w) = Pr(Y) * Pr(x|Y)^nx * Pr(y|Y)^ny = 1/3 * (2/3)^nx * (1/3)^ny

Hence always Pr(X|w) / Pr(Y|w) = 2  =>  any w will be classified as X.
}}}
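A direct check of this ratio for a few word vectors (a sketch of my own; the function name score is not from the exam):

{{{
#include <cassert>
#include <cmath>
#include <utility>

// Unnormalized Naive Bayes score Pr(c) * Pr(x|c)^nx * Pr(y|c)^ny.
double score(double prior, double px, double py, int nx, int ny) {
  return prior * std::pow(px, nx) * std::pow(py, ny);
}

int main() {
  // xy = (1, 1), xxxy = (3, 1), and one arbitrary further w.
  for (auto [nx, ny] : {std::pair{1, 1}, std::pair{3, 1}, std::pair{7, 4}}) {
    double sx = score(2.0 / 3, 2.0 / 3, 1.0 / 3, nx, ny);
    double sy = score(1.0 / 3, 2.0 / 3, 1.0 / 3, nx, ny);
    assert(std::abs(sx / sy - 2.0) < 1e-9);  // always exactly 2
  }
}
}}}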
{{{
2 ---X--------X-------X-------X---
1 ---Y--------Y-------------------
0    1    2    3    4    5    6    7
}}}
The SVM band is just the two horizontal lines.