AD Teaching Wiki: InformationRetrievalWS1213/ExamSolutions

TASK 1 (Inverted Index)

1.1 Inverted lists with tf-scores, gap-encoded

bla: 1 tf2, 2 tf2, 4 tf2   ->  +1 tf2, +1 tf2, +2 tf2
bli: 2 tf1, 5 tf1          ->  +2 tf1, +3 tf1
blu: 3 tf2, 4 tf1, 5 tf2   ->  +3 tf2, +1 tf1, +1 tf2
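
For illustration (not part of the exam solution), a minimal C++ sketch of how such a gap encoding can be computed from a sorted list of (doc id, tf) pairs; the function name gapEncode is made up:

#include <iostream>
#include <utility>
#include <vector>

// Replace each doc id in a sorted list by the gap to its predecessor.
std::vector<std::pair<int, int>> gapEncode(
    const std::vector<std::pair<int, int>>& list) {
  std::vector<std::pair<int, int>> result;
  int previousId = 0;
  for (const auto& [docId, tf] : list) {
    result.push_back({docId - previousId, tf});
    previousId = docId;
  }
  return result;
}

int main() {
  // The list for "bla" from above: 1 tf2, 2 tf2, 4 tf2.
  for (const auto& [gap, tf] : gapEncode({{1, 2}, {2, 2}, {4, 2}}))
    std::cout << "+" << gap << " tf" << tf << " ";  // +1 tf2, +1 tf2, +2 tf2
  std::cout << std::endl;
}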

1.3 With entropy-optimal encoding

The gap +1 occurs 4 of 8 times, the gaps +2 and +3 each occur 2 of 8 times, hence an entropy-optimal code uses 1, 2, and 2 bits:

+1 -> 0
+2 -> 10
+3 -> 11

The scores tf1 and tf2 occur 3 and 5 of 8 times, i.e. about equally often, hence 1 bit each:

tf1 -> 0
tf2 -> 1

bla: 01 01 101
bli: 100 110
blu: 111 00 01

TASK 2 (List Intersection)

2.1 Code for intersecting two lists represented as bit vectors:

#include <cassert>
#include <vector>

// A document is in the intersection iff its bit is set in both lists.
std::vector<bool> intersect(const std::vector<bool>& list1,
                            const std::vector<bool>& list2) {
  assert(list1.size() == list2.size());
  std::vector<bool> result;
  for (size_t i = 0; i < list1.size(); i++)
    result.push_back(list1[i] && list2[i]);
  return result;
}

CORRECTION NOTE: It can be assumed that they have the same length; if someone does special magic for different lengths that is also fine, but it's not necessary.

2.2 Comparison with gap-encoded representation.

The entropy of the given probability distribution is H(X) = sum_i i / 2^i = 2. Hence an entropy-optimal code needs 2 bits per gap on average, that is, 2m bits for a list of length m. The bit vector always needs n bits, one per document. The bit vector is hence better if and only if n < 2m, that is, if and only if m > n/2.
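
For completeness (not required in the exam), a short derivation of the value of the sum, assuming the gap distribution Pr(gap = i) = 2^{-i}, which is what the entropy formula above implies:

H(X) = -\sum_{i \ge 1} 2^{-i} \log_2 2^{-i} = \sum_{i \ge 1} \frac{i}{2^i}
% derivative of the geometric series: \sum_{i \ge 1} i x^i = \frac{x}{(1-x)^2} for |x| < 1
\sum_{i \ge 1} \frac{i}{2^i} = \frac{1/2}{(1 - 1/2)^2} = 2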

TASK 3 (Edit Distance)

3.1 2-gram index and lower bound

W1: #bla#
W2: #bli#
W3: #blu#

#b: W1, W2, W3
bl: W1, W2, W3
la: W1
li: W2
lu: W3
a#: W1
i#: W2
u#: W3

ED(x, y) <= 1  =>  #common 2-grams of x' and y' must be >= comm := max(|x|, |y|) - 1
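
As an illustration (function names made up), a C++ sketch that computes the 2-grams of the padded word and counts common 2-grams by intersecting the sorted lists:

#include <algorithm>
#include <iostream>
#include <iterator>
#include <string>
#include <vector>

// Compute the sorted list of 2-grams of the padded word x' = #x#.
std::vector<std::string> twoGrams(const std::string& word) {
  std::string padded = "#" + word + "#";
  std::vector<std::string> grams;
  for (size_t i = 0; i + 1 < padded.size(); i++)
    grams.push_back(padded.substr(i, 2));
  std::sort(grams.begin(), grams.end());
  return grams;
}

// Number of common 2-grams (counting multiplicities).
size_t numCommonTwoGrams(const std::string& x, const std::string& y) {
  std::vector<std::string> gx = twoGrams(x), gy = twoGrams(y), common;
  std::set_intersection(gx.begin(), gx.end(), gy.begin(), gy.end(),
                        std::back_inserter(common));
  return common.size();
}

int main() {
  // The filter from 3.2 below: blaa and bla have 4 common 2-grams.
  std::cout << numCommonTwoGrams("blaa", "bla") << std::endl;  // 4
}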

3.2 Find all words with ED <= 1 to blaa

x = blaa
x' = #blaa#
#b: W1, W2, W3
bl: W1, W2, W3
la: W1
aa:
a#: W1

2-grams in common with blaa:
W1: 4
W2: 2
W3: 2

The bound requires comm = max(4, 3) - 1 = 3 common 2-grams, so we only need to compute the edit distance to W1, which is 1. Hence bla is the only word with ED <= 1.
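
The edit distance computation itself can be done with the standard dynamic program; a sketch:

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

// Standard dynamic program for the edit distance (insert, delete, replace).
int editDistance(const std::string& x, const std::string& y) {
  std::vector<std::vector<int>> d(x.size() + 1, std::vector<int>(y.size() + 1));
  for (size_t i = 0; i <= x.size(); i++) d[i][0] = i;
  for (size_t j = 0; j <= y.size(); j++) d[0][j] = j;
  for (size_t i = 1; i <= x.size(); i++)
    for (size_t j = 1; j <= y.size(); j++)
      d[i][j] = std::min({d[i - 1][j] + 1,                         // delete
                          d[i][j - 1] + 1,                         // insert
                          d[i - 1][j - 1] + (x[i-1] != y[j-1])});  // replace / match
  return d[x.size()][y.size()];
}

int main() {
  std::cout << editDistance("blaa", "bla") << std::endl;  // 1
}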

TASK 4 (Latent Semantic Indexing)

4.1 Rank + decomposition

For example, the first three column vectors are linearly independent -> rank 3.

CORRECTION NOTE: no need to prove the linear independence (but the selected vectors should indeed be linearly independent). An easy proof of linear independence: compute the determinant, e.g. of the matrix formed by the first three column vectors. If it is non-zero, they are linearly independent.

Obtain A' by changing the entry at row 2, column 2 from 0 to 1.

[1 1 0 1 0.5]     [1 0]     [1 1 0 1 0.5]
[1 1 1 2  1 ]  =  [1 1]  *  [0 0 1 1 0.5]
[0 0 1 1 0.5]     [0 1]
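
To verify the decomposition, one can multiply the two factors; a small sketch (type alias and function name made up):

#include <iostream>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Naive matrix multiplication, sufficient for these small matrices.
Matrix multiply(const Matrix& a, const Matrix& b) {
  Matrix c(a.size(), std::vector<double>(b[0].size(), 0));
  for (size_t i = 0; i < a.size(); i++)
    for (size_t k = 0; k < b.size(); k++)
      for (size_t j = 0; j < b[0].size(); j++)
        c[i][j] += a[i][k] * b[k][j];
  return c;
}

int main() {
  Matrix b = {{1, 0}, {1, 1}, {0, 1}};
  Matrix c = {{1, 1, 0, 1, 0.5}, {0, 0, 1, 1, 0.5}};
  for (const auto& row : multiply(b, c)) {  // prints the three rows of A'
    for (double entry : row) std::cout << entry << " ";
    std::cout << std::endl;
  }
}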

4.2 Ranking for query Q = (1, 1, 0), |Q| = sqrt(2).

Scores for D1, ..., D5 with division by |Q|: 1, 1, 1/2, sqrt(3)/2, sqrt(3)/2. Scores without division by |Q|: sqrt(2), sqrt(2), sqrt(1/2), sqrt(3/2), sqrt(3/2).

The ranking is hence D#1, D#2, D#4, D#5, D#3.
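
A small sketch that recomputes the scores with division by |Q| as cosine similarities between Q and the columns of A' (helper name dot is made up):

#include <cmath>
#include <iostream>
#include <vector>

// Dot product of two small dense vectors.
double dot(const std::vector<double>& a, const std::vector<double>& b) {
  double sum = 0;
  for (size_t i = 0; i < a.size(); i++) sum += a[i] * b[i];
  return sum;
}

int main() {
  std::vector<double> q = {1, 1, 0};
  // The five columns of A' from 4.1 as document vectors.
  std::vector<std::vector<double>> docs = {
      {1, 1, 0}, {1, 1, 0}, {0, 1, 1}, {1, 2, 1}, {0.5, 1, 0.5}};
  for (const auto& d : docs)  // 1, 1, 0.5, 0.866..., 0.866...
    std::cout << dot(q, d) / (std::sqrt(dot(q, q)) * std::sqrt(dot(d, d)))
              << std::endl;
}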

TASK 5 (k-means)

5.1 Clustering of 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 with 1 and 2 as initial centroids.

Step 1:  1 | 2 3 4 5 6 7 8 9 10,  centroids 1  6
Step 2:  1 2 3 | 4 5 6 7 8 9 10,  centroids 2  7
Step 3:  1 2 3 4 | 5 6 7 8 9 10,  centroids 2.5  7.5
Step 4:  1 2 3 4 5 | 6 7 8 9 10,  centroids 3  8
No more change!
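
A compact sketch of this computation: 1-dimensional k-means with k = 2, with ties assigned to the left cluster as in the trace above (function name made up):

#include <cmath>
#include <iostream>
#include <numeric>
#include <vector>

// One run of k-means with k = 2 on 1-dimensional points.
void kMeans2(const std::vector<double>& points, double c1, double c2) {
  while (true) {
    std::vector<double> left, right;
    // Assignment step: each point goes to the closer centroid (ties go left).
    for (double p : points)
      (std::abs(p - c1) <= std::abs(p - c2) ? left : right).push_back(p);
    // Update step: each centroid becomes the mean of its cluster.
    double n1 = std::accumulate(left.begin(), left.end(), 0.0) / left.size();
    double n2 = std::accumulate(right.begin(), right.end(), 0.0) / right.size();
    if (n1 == c1 && n2 == c2) break;  // no more change -> done
    c1 = n1; c2 = n2;
  }
  std::cout << "final centroids: " << c1 << " " << c2 << std::endl;  // 3 8
}

int main() {
  kMeans2({1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, 1, 2);
}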

5.2 Same clustering for any choice of initial centroids.

Observation 1: Each point is assigned to the closer centroid, so the two clusters are always two contiguous parts, split at the midpoint between the two centroids. Observation 2: Assume the split is not yet half-half, say the left part is smaller. Then the left centroid lies closer to the split than the right one, so in the next assignment step the left-most element of the right part is assigned to the left cluster (or vice versa if the right part is smaller). The split hence moves towards the middle until the clustering is 1 2 3 4 5 | 6 7 8 9 10.

TASK 6 (Naive Bayes)

6.1 Computing the priors and classifying xy and xxxy

Pr(x|X) = 2/3, Pr(y|X) = 1/3, Pr(x|Y) = 2/3, Pr(y|Y) = 1/3, Pr(X) = 2/3, Pr(Y) = 1/3
Pr(X|xy) / Pr(Y|xy) = Pr(X) / Pr(Y) = 2, since the class-conditional probabilities are the same for X and Y and hence cancel. The same holds for Pr(X|xxxy) / Pr(Y|xxxy).

Hence both xy and xxxy are classified as X.

6.2 Generalization + drawing of the points and the band of the corresponding SVM classifier.

For any w = (nx, ny), up to the normalization factor 1 / Pr(w), which is the same for both classes:
Pr(X|w) ~ Pr(X) * Pr(x|X)^nx * Pr(y|X)^ny = 2/3 * (2/3)^nx * (1/3)^ny
Pr(Y|w) ~ Pr(Y) * Pr(x|Y)^nx * Pr(y|Y)^ny = 1/3 * (2/3)^nx * (1/3)^ny
Hence always Pr(X|w) / Pr(Y|w) = 2 => any w will be classified as X.
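
A tiny sketch evaluating this ratio numerically with the probabilities from 6.1 (function name posteriorRatio is made up):

#include <cmath>
#include <iostream>

// Posterior ratio Pr(X|w) / Pr(Y|w) for a word w with nx x's and ny y's.
double posteriorRatio(int nx, int ny) {
  double priorX = 2.0 / 3, priorY = 1.0 / 3;
  double pxX = 2.0 / 3, pyX = 1.0 / 3;  // Pr(x|X), Pr(y|X)
  double pxY = 2.0 / 3, pyY = 1.0 / 3;  // Pr(x|Y), Pr(y|Y)
  return (priorX * std::pow(pxX, nx) * std::pow(pyX, ny)) /
         (priorY * std::pow(pxY, nx) * std::pow(pyY, ny));
}

int main() {
  std::cout << posteriorRatio(1, 1) << std::endl;  // xy   -> 2
  std::cout << posteriorRatio(3, 1) << std::endl;  // xxxy -> 2
}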

2   ---X--------X-------X-------X--- 

1   ---Y--------Y-------------------

    0  1    2   3   4   5   6   7

The band of the SVM classifier is bounded by the two horizontal lines through the points; the separating line runs in the middle between them.
