InformationRetrievalWS1718/ExamSolution

TASK 1 (Ranking and evaluation)

1.1 P@2, P@R, AP

P@2 = 1/2
P@R = 1/3
P@R1 = 1/2, P@R2 = 1/2, P@R3 = 0, AP = (1/2 + 1/2 + 0) / 3 = 1/3

1.2 DCG@2, iDCG@2, nDCG@2

DCG@2 = 0 + 3 / 1 = 3
iDCG@2 = 3 + 2 / 1 = 5
nDCG@2 = 3/5 = 60%

1.3 Function dcg that computes DCG@k

1. def dcg(rels, k):
2.     sum = rels[0]
3.     for i in range(1, k): 
4.         sum += rels[i] / log2(i+1)
5.     return sum

1.4 Function ndcg that computes nDCG@k

1. def ndcg(rels, k):
2.    dcg_val = dcg(rels, k)
3.    sorted_rels = sort(rels, DESC)
4.    idcg_val = dcg(sorted_rels, k)
5.    return dcg_val / idcg_val

TASK 2 (Encodings)

2.1 Function unary_encode that encodes given sequence of integers

1. def unary_encode(input):
2.     result = []
3.     for x in input:
4.         for i in range(0, x-1):
5.             result.add(0)
6.         result.add(1)
7.     return result

2.2 Expected number of bits in sequence of n integers i.i.d from {1, ..., n}

1. The code length for integer i is exactly i
2. For a random integer from {1, ..., n}, the expected code length is sum_i (i * 1/n) = 1/n * sum_i i = 1/n * 1/2 * n * (n+1) = (n+1) / 2
3. The expected number of bits in the whole sequence is hence n * (n+1) / 2

2.3 Is this encoding entropy-optimal for the distribution from task 2.2

1. The entropy H of the distribution from 2.2 is -sum_i 1/n * log_2 (1/n) = 1/n * log_2 n * sum_i 1 =  log_2 n
2. The expected code length is (n+1)/2
3. For entropy-optimality, we should have expected code length = entropy + O(1)
4. So the answer is NO.

2.4 Name a prefix-free entropy-optimal encoding for the distribution from task 2.2

1. Just encode each integer with a fixed number of ceil(log_2 n) bits
2. This is prefix-free, because for two different codes of the same length, none can be the prefix of the other
3. Since ceil(log_2 n) <= log_2 + 1, this encoding is entropy-optimal

TASK 3 (Fuzzy search and UTF-8)

3.1 (5 points) ED(x, y), ED(y, x), PED(x, y), PED(y, x) for x = cute and y = computer

ED(cute, computer) = 4 ... insert omp and r
ED(computer, cute) = 4 ... ED is symmetric
PED(cute, computer) = ED(cute, compute) = 3
PED(computer, cute) = ED(computer, cute) = 4

3.2 (5 points) Find x and y with ED(x, y) = 2 and comm($$x$$, $$y$$) = max{|x|, |y|} - 4

1. max{|x|, |y|} - 4 is the lower bound from the lecture for q = 3 and ED(x, y) = 2
2. The lower bound is achieved when destroying the maximal number of q-grams with each edit distance operation, in this case 3
3. For example, x = AXXA, y = BXXB

3.3 (10 points) Function valid_utf8_char8 that checks if a given array of three bytes is valid UTF-8 for a single character

1. def valid_utf8_char(bytes):
       # Check that first byte start with 1110 and the other two with 10
       # 10000000b = 128, 11000000b = 192, 11100000b = 224, 11110000b = 240
3.     if (bytes[0] & 240 != 224 OR bytes[1] & 192 != 128 OR bytes[2] & 192 != 128):
4.         return False 
       # Check that the code point is >= 2048
       # 11100000b = 224, 10100000b = 160 
5.     if (bytes[0] > 224 OR bytes[1] >= 160):
6.         return True
7.     return False

TASK 4 (Naive Bayes)

4.1 (10 points) Function naive_bayes_predict that predicts the class for a given document

1. def naive_bayes_predict(pc, pwc, words):
2.    best_c, bestp = -1, -1
3.    for c in range(0, len(pc)):
4.        p = pc[c]
5.        for w in words:
6.            p *= pwc[w][c]
7.        if p > best_p:
8.            best_p, best_c = p, c
9.    return best_c

4.2 (5 points) Same prediction for d_1 and d_2 = d_1.d_1

1. If p(c | d1) = P_c * p_c, then p(c | d2) = P_c² * p_c.
2. If the p_c are all equal, then p(c1 | d1) < p(c2 | d1) <=> P_c1 < P_c2 <=> P_c1² < P_c2² <=> p(c1 | d2) < p(c1 | d2)

4.3 (5 points) Is the statement from 4.2 also true if the pc are not all equal?

1. Assume two classes c1 and c2 with p(c1) = 1/4 and p(c2) = 3/4
2. Let d1 consist of one word w with p(c1, w) = 2/3 and p(c2, w) = 1/3
3. Then p(c1 | d1) = 2/3 * 1/4 = 1/6 and p(c2 | d1) = 1/3 * 3/4 = 1/4  => class c2 is more likely
4. But p(c1 | d2) = 2/3 * 2/3 * 1/4 = 1/9 and p(c2 | d2) = 1/3 * 1/3 * 3/4 = 1/12  => class c1 is more likely
5. So the statement from 4.2 is not true in general if the pc are not all equal

TASK 5 (Latent Semantic Indexing)

5.1 (10 points) Determine the matrices U and S of the SVD of A

1. The product A * AT is [[15, 9], [9, 15]]
2. One eigenvector is easily seen to be (1, 1) with eigenvalue 9 * 15 = 24
3. The other eigenvector is then (1, -1) with eigenvalue 6
4. The matrix U is hence 1/2 * sqrt(2) * [[1, 1], [1, -1]]
5. The matrix S is diag(sqrt(24), sqrt(6)) = sqtr(6) * diag(2, 1)

5.2 Function compute_V that computes V from A, U, S

1. def compute_V(A, U, S, V, m, n):
2.     r = len(S)
       # A has shape m x n, U has shape m x r, S has r entries 
       # We can assume that V has shape r x n and is filled with zeroes
       # According to ES10, V = S^-1 * UT * A
3.     for i in range(0, r):
4.         for j in range(0, n):
5.             for k in range(0, m):
6.                 V[i, j] += U[k, i] * A[k, j] / S[i]

TASK 6 (Hidden Markov Model)

6.1 (5 points) Probability that second state is plus if sequence is good, good, good

1. Pr(h1 = plus) * Pr(o1 = good | h1 = plus) * Pr(h2 = plus | h1 = plus) * Pr(o2 = good | h2 = plus) = 1/2 * 1/3 * 1/3 * 1/3 = 1/54
2. Pr(h1 = minus) * Pr(o1 = good | h1 = minus) * Pr(h2 = plus | h1 = minus) * Pr(o2 = good | h2 = plus) = 1/2 * 2/3 * 2/3 * 1/3 = 4/54
3. Pr(h2 = plus) = 1/54 + 4/54 = 5/54

6.2 (5 points) Most likely sequence of hidden states for great, bad, great

1. Pr(great | minus) = 0 and Pr(bad | plus) = 0
2. Hence the only sequence of hidden states with non-zero probability is plus, minus, plus
3. This is hence the most likely sequence of hidden states

6.3 (5 points) Function compute_likelihood that computes likelihood of a given sequence of observations and hidden states

1. def compute_likelihood(o, h):
2.     result = 1
3.     for i in range(0, len(o)):
4.         result *= iprob[i] if i == 0 else tprob[h[i], h[i-1]]
5.         result *= eprob[h[i], o[i]]
6.     return result

6.4 (5 points) Function brute_force_hmm that computes the most likely sequence of hidden states for given sequence of three observations

1. def brute_force_hmm(o1, o2, o3):
2.     k = len(iprob)  # Number of hidden states
3.     best_l, best_h = 0, []
4.     for h1 in range(0, k):
5.         for h2 in range(0, k):
6.             for h3 in range(0, k):
7.                 l = compute_likelihood([o1, o2, o3], [h1, h2, h3])
8.                 if l > best_l:
9.                     best_l, best_h = l, [h1, h2, h3]
10.    return best_h

Links

Previous Courses

TASK 1 (Ranking and evaluation)

1.1 P@2, P@R, AP

1.2 DCG@2, iDCG@2, nDCG@2

1.3 Function dcg that computes DCG@k

1.4 Function ndcg that computes nDCG@k

TASK 2 (Encodings)

2.1 Function unary_encode that encodes given sequence of integers

2.2 Expected number of bits in sequence of n integers i.i.d from {1, ..., n}

2.3 Is this encoding entropy-optimal for the distribution from task 2.2

2.4 Name a prefix-free entropy-optimal encoding for the distribution from task 2.2

TASK 3 (Fuzzy search and UTF-8)

3.1 (5 points) ED(x, y), ED(y, x), PED(x, y), PED(y, x) for x = cute and y = computer

3.2 (5 points) Find x and y with ED(x, y) = 2 and comm($$x$$, $$y$$) = max{|x|, |y|} - 4

3.3 (10 points) Function valid_utf8_char8 that checks if a given array of three bytes is valid UTF-8 for a single character

TASK 4 (Naive Bayes)

4.1 (10 points) Function naive_bayes_predict that predicts the class for a given document

4.2 (5 points) Same prediction for d_1 and d_2 = d_1.d_1

4.3 (5 points) Is the statement from 4.2 also true if the pc are not all equal?

TASK 5 (Latent Semantic Indexing)

5.1 (10 points) Determine the matrices U and S of the SVD of A

5.2 Function compute_V that computes V from A, U, S

TASK 6 (Hidden Markov Model)

6.1 (5 points) Probability that second state is plus if sequence is good, good, good

6.2 (5 points) Most likely sequence of hidden states for great, bad, great

6.3 (5 points) Function compute_likelihood that computes likelihood of a given sequence of observations and hidden states

6.4 (5 points) Function brute_force_hmm that computes the most likely sequence of hidden states for given sequence of three observations