== TASK 1 (Inverted Index) ==

=== 1.1 Inverted lists with tf-scores, gap-encoded ===

{{{
bla: 1 tf2, 2 tf2, 4 tf2  ->  +1 tf2, +1 tf2, +2 tf2
bli: 2 tf1, 5 tf1         ->  +2 tf1, +3 tf1
blu: 3 tf2, 4 tf1, 5 tf2  ->  +3 tf2, +1 tf1, +1 tf2
}}}
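For illustration (not part of the original solution), here is a minimal C++ sketch of the gap encoding; the function name gapEncode is our own:

{{{
#include <iostream>
#include <vector>

// Replace each doc id in a sorted list by its difference to the previous id.
std::vector<int> gapEncode(const std::vector<int>& docIds) {
  std::vector<int> gaps;
  int previous = 0;
  for (int id : docIds) {
    gaps.push_back(id - previous);
    previous = id;
  }
  return gaps;
}

int main() {
  // Doc ids of the list for "bla" above: 1, 2, 4 -> gaps +1, +1, +2.
  for (int gap : gapEncode({1, 2, 4})) std::cout << "+" << gap << " ";
  std::cout << std::endl;
}
}}}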
=== 1.3 With entropy-optimal encoding ===

{{{
+1 -> 0    +2 -> 10    +3 -> 11
tf1 -> 0   tf2 -> 1

bla: 01 01 101    bli: 100 110    blu: 111 00 01
}}}
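Again purely as an illustration (ours), a sketch that encodes the gap-encoded lists with the two code tables above and reproduces the three bit strings:

{{{
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Encode a list of (gap, tf) pairs with the prefix codes from above.
std::string encode(const std::vector<std::pair<int, int>>& list) {
  std::map<int, std::string> gapCode = {{1, "0"}, {2, "10"}, {3, "11"}};
  std::map<int, std::string> tfCode = {{1, "0"}, {2, "1"}};
  std::string bits;
  for (auto [gap, tf] : list) bits += gapCode[gap] + tfCode[tf] + " ";
  return bits;
}

int main() {
  std::cout << encode({{1, 2}, {1, 2}, {2, 2}}) << std::endl;  // bla: 01 01 101
  std::cout << encode({{2, 1}, {3, 1}}) << std::endl;          // bli: 100 110
  std::cout << encode({{3, 2}, {1, 1}, {1, 2}}) << std::endl;  // blu: 111 00 01
}
}}}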
== TASK 2 (List Intersection) ==

=== 2.1 Code for intersecting two lists represented as bit vectors: ===
{{{
#include <cassert>
#include <vector>

std::vector<bool> intersect(const std::vector<bool>& list1,
                            const std::vector<bool>& list2) {
  assert(list1.size() == list2.size());
  std::vector<bool> result;
  for (size_t i = 0; i < list1.size(); i++) {
    result.push_back(list1[i] && list2[i]);
  }
  return result;
}
}}}
CORRECTION NOTE: It can be assumed that the two lists have the same length; special handling for lists of different lengths is also fine, but not necessary.
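A quick usage example (the two example lists are made up, not from the exam):

{{{
#include <iostream>
#include <vector>
// ... intersect() from above ...

int main() {
  std::vector<bool> a = {1, 0, 1, 1, 0};
  std::vector<bool> b = {1, 1, 0, 1, 0};
  for (bool bit : intersect(a, b)) std::cout << bit;  // prints 10010
  std::cout << std::endl;
}
}}}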
=== 2.2 Comparison with gap-encoded representation. ===
The entropy of the given probability distribution is H(X) = - sum_i 2^-i * log2(2^-i) = sum_i i / 2^i = 2 (the sum evaluates to 2 by the identity sum_{i>=1} i * q^i = q / (1 - q)^2 with q = 1/2). Hence, with an entropy-optimal code, a gap-encoded list of length m needs 2m bits on average, whereas the bit vector always needs n bits, where n is the total number of documents. The bit vector is hence better if and only if m > n/2.
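As a numerical sanity check (our own sketch, not part of the solution):

{{{
#include <iostream>

int main() {
  // Expected bits per list entry when Pr(gap = i) = 2^-i and the codeword
  // for gap i has i bits: sum_{i>=1} i / 2^i, which converges to 2.
  double expectedBits = 0, p = 1;
  for (int i = 1; i <= 50; i++) {
    p /= 2;
    expectedBits += i * p;
  }
  std::cout << expectedBits << std::endl;  // prints 2
  // So a gap-encoded list of length m needs about 2m bits on average,
  // while a bit vector over n documents always needs n bits.
}
}}}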
== TASK 3 (Fuzzy Search) ==
=== 3.1 2-gram index and lower bound ===
{{{
W1: #bla#   W2: #bli#   W3: #blu#

#b: W1, W2, W3
bl: W1, W2, W3
la: W1
li: W2
lu: W3
a#: W1
i#: W2
u#: W3
}}}
ED(x, y) <= 1 => the number of 2-grams that the padded words x' and y' have in common must be >= comm := max(|x|, |y|) - 1.
=== 3.2 Find all words with ED <= 1 to blaa ===
x = blaa, x' = #blaa#, with 2-grams #b, bl, la, aa, a#. The relevant inverted lists are:

{{{
#b: W1, W2, W3
bl: W1, W2, W3
la: W1
aa: (empty)
a#: W1
}}}

Number of 2-grams in common with blaa: W1: 4, W2: 2, W3: 2.
The bound requires comm >= max(4, 3) - 1 = 3, so we only need to compute the edit distance to W1: ED(blaa, bla) = 1 <= 1, hence W1 is a match.
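For illustration (ours), a sketch that recomputes the common-2-gram counts for blaa and the lower bound; it uses sets of 2-grams, which suffices here because none of the words has a repeated 2-gram:

{{{
#include <algorithm>
#include <iostream>
#include <iterator>
#include <set>
#include <string>
#include <vector>

// All 2-grams of the padded word #word#.
std::set<std::string> twoGrams(const std::string& word) {
  std::string padded = "#" + word + "#";
  std::set<std::string> grams;
  for (size_t i = 0; i + 1 < padded.size(); i++) grams.insert(padded.substr(i, 2));
  return grams;
}

int main() {
  std::string x = "blaa";
  for (const std::string& y : {"bla", "bli", "blu"}) {
    std::set<std::string> gx = twoGrams(x), gy = twoGrams(y);
    std::vector<std::string> common;
    std::set_intersection(gx.begin(), gx.end(), gy.begin(), gy.end(),
                          std::back_inserter(common));
    // Lower bound for ED(x, y) <= 1: comm = max(|x|, |y|) - 1.
    int comm = static_cast<int>(std::max(x.size(), y.size())) - 1;
    std::cout << y << ": " << common.size() << " common 2-grams, need >= "
              << comm << std::endl;
  }
}
}}}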
== TASK 4 (Latent Semantic Indexing) ==

=== 4.1 Rank + decomposition ===
The first three column vectors are linearly independent -> rank 3.
CORRECTION NOTE: No need to prove the linear independence (but the vectors selected should indeed be linearly independent). Easy proof of linear independence: compute the determinant, e.g. of the matrix formed by the first three column vectors. If it is non-zero, they are linearly independent.
Obtain A' by changing the entry at row 2, column 2 from 0 to 1.
{{{
     [1 0]
A' = [1 1] * [1 1 0 1 0.5]
     [0 1]   [0 0 1 1 0.5]
}}}
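The decomposition can be checked mechanically; this small sketch (ours, assuming the matrix layout reconstructed above) multiplies the two factors and prints A':

{{{
#include <iostream>
#include <vector>

int main() {
  // The two rank-2 factors from above.
  std::vector<std::vector<double>> U = {{1, 0}, {1, 1}, {0, 1}};
  std::vector<std::vector<double>> V = {{1, 1, 0, 1, 0.5},
                                        {0, 0, 1, 1, 0.5}};
  // Print A' = U * V, a 3 x 5 matrix of rank 2.
  for (const auto& uRow : U) {
    for (size_t j = 0; j < V[0].size(); j++) {
      std::cout << uRow[0] * V[0][j] + uRow[1] * V[1][j] << " ";
    }
    std::cout << std::endl;
  }
}
}}}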
=== 4.2 Ranking for query Q = (1, 1, 0), |Q| = sqrt(2). ===

Scores with division by |Q|: 1, 1, 1/2, sqrt(3)/2, sqrt(3)/2.
Scores without division by |Q|: sqrt(2), sqrt(2), sqrt(1/2), sqrt(3/2), sqrt(3/2).
The ranking is hence D#1, D#2, D#4, D#5, D#3.
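As a check (our own sketch): assuming the documents are the columns of A' from 4.1 (an assumption on our part, but it reproduces exactly the scores stated above), the cosine scores can be recomputed as follows:

{{{
#include <cmath>
#include <iostream>
#include <vector>

int main() {
  // Columns of A' from 4.1 as document vectors (our assumption).
  std::vector<std::vector<double>> docs = {
      {1, 1, 0}, {1, 1, 0}, {0, 1, 1}, {1, 2, 1}, {0.5, 1, 0.5}};
  std::vector<double> q = {1, 1, 0};  // the query Q, |Q| = sqrt(2)
  for (size_t d = 0; d < docs.size(); d++) {
    double dot = 0, norm = 0;
    for (size_t i = 0; i < q.size(); i++) {
      dot += q[i] * docs[d][i];
      norm += docs[d][i] * docs[d][i];
    }
    // Cosine similarity: dot product divided by both vector lengths.
    std::cout << "D#" << d + 1 << ": "
              << dot / (std::sqrt(norm) * std::sqrt(2.0))
              << std::endl;  // 1, 1, 0.5, 0.866, 0.866
  }
}
}}}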
== TASK 5 (k-means) ==

=== 5.1 Clustering of 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 with 1 and 2 as initial centroids. ===

{{{
Step 1: 1 | 2 3 4 5 6 7 8 9 10, centroids 1 and 6
Step 2: 1 2 3 | 4 5 6 7 8 9 10, centroids 2 and 7
Step 3: 1 2 3 4 | 5 6 7 8 9 10, centroids 2.5 and 7.5
Step 4: 1 2 3 4 5 | 6 7 8 9 10, centroids 3 and 8
}}}

No more change!
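For illustration (ours), a minimal 1-D k-means sketch that reproduces the steps above; note that it assigns a point with equal distance to both centroids to the left cluster, as in the trace:

{{{
#include <cmath>
#include <iostream>
#include <vector>

int main() {
  std::vector<double> points = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
  double c1 = 1, c2 = 2;  // initial centroids
  for (int step = 1; step <= 100; step++) {
    // Assign each point to the nearer centroid (ties go left, as above).
    double sum1 = 0, sum2 = 0;
    int n1 = 0, n2 = 0;
    for (double p : points) {
      if (std::abs(p - c1) <= std::abs(p - c2)) { sum1 += p; n1++; }
      else { sum2 += p; n2++; }
    }
    double new1 = sum1 / n1, new2 = sum2 / n2;
    if (new1 == c1 && new2 == c2) break;  // no more change
    c1 = new1; c2 = new2;
    std::cout << "Step " << step << ": centroids " << c1 << " " << c2
              << std::endl;
  }
}
}}}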
=== 5.2 Same clustering for any choice of initial centroids. ===
Observation 1: The clustering always consists of two contiguous parts, since each point is assigned to the nearer centroid, so a single threshold (the midpoint between the two centroids) separates the clusters. Now assume the split is not yet half-half. Then in the next assignment step, the left-most element of the right part will be assigned to the left cluster (or vice versa), so the split keeps moving towards the balanced clustering 1 2 3 4 5 | 6 7 8 9 10 from 5.1.
== TASK 6 (Naive Bayes) ==

=== 6.1 Computing the priors and classifying xy and xxxy ===
Pr(x|X) = 2/3, Pr(y|X) = 1/3, Pr(x|Y) = 2/3, Pr(y|Y) = 1/3, Pr(X) = 2/3, Pr(Y) = 1/3
Since Pr(x|X) = Pr(x|Y) and Pr(y|X) = Pr(y|Y), the likelihoods cancel and Pr(X|xy) / Pr(Y|xy) = Pr(X) / Pr(Y) = 2, and the same for Pr(X|xxxy) / Pr(Y|xxxy). Hence both xy and xxxy are classified as X.
=== 6.2 Generalization + draw points and band of the corresponding SVM classifier. ===
For any w = (nx, ny):

{{{
Pr(X|w) = Pr(X) * Pr(x|X)^nx * Pr(y|X)^ny = 2/3 * (2/3)^nx * (1/3)^ny
Pr(Y|w) = Pr(Y) * Pr(x|Y)^nx * Pr(y|Y)^ny = 1/3 * (2/3)^nx * (1/3)^ny
}}}

Hence always Pr(X|w) / Pr(Y|w) = 2 => any w will be classified as X.
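As a quick check (a minimal sketch of ours), the posterior ratio for arbitrary counts (nx, ny) can be computed directly; it always comes out as the prior ratio 2:

{{{
#include <iostream>

// Unnormalized Naive Bayes posterior ratio Pr(X|w) / Pr(Y|w) for an input w
// with nx occurrences of x and ny occurrences of y.
double ratio(int nx, int ny) {
  double likeX = 2.0 / 3;  // Pr(X)
  double likeY = 1.0 / 3;  // Pr(Y)
  for (int i = 0; i < nx; i++) { likeX *= 2.0 / 3; likeY *= 2.0 / 3; }  // Pr(x|.)
  for (int i = 0; i < ny; i++) { likeX *= 1.0 / 3; likeY *= 1.0 / 3; }  // Pr(y|.)
  return likeX / likeY;  // > 1 means: classify as X
}

int main() {
  std::cout << ratio(1, 1) << " " << ratio(3, 1) << std::endl;  // 2 2 (xy, xxxy)
}
}}}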
{{{
2 ---X--------X-------X-------X---
1 ---Y--------Y-------------------
  0 1 2 3 4 5 6 7
}}}
The SVM margin is just two horizontal lines (marked `---` in the drawing above).