Diff for "InformationRetrievalWS1718/ExamSolution"

Differences between revisions 2 and 4 (spanning 2 versions)

TASK 1 (Ranking and evaluation)

1.1 P@2, P@R, AP

P@2 = 1/2
P@R = 1/3
P@R1 = 1/2, P@R2 = 1/2, P@R3 = 0, AP = (1/2 + 1/2 + 0) / 3 = 1/3

1.2 DCG@2, iDCG@2, nDCG@2

DCG@2 = 0 + 3 / 1 = 3
iDCG@2 = 3 + 2 / 1 = 5
nDCG@2 = 3/5 = 60%

1.3 Function dcg that computes DCG@k

1. def dcg(rels, k):
2.     sum = rels[0]
3.     for in in range(1, len(rels)): 
4.         sum += rels[i] / log2(i)
5.     return sum

1.4 Function ndcg that computes nDCG@k

1. def ndcg(rels, k):
2.    dcg = dcg(rels, k)
3.    sorted_rels = sort(rels, DESC)
4.    idcg = dcg(rels, k)
5.    return dcg / idcg

TASK 2 (Encodings, Niklas)

2.1 Function unary_encode that encodes given sequence of integers

1. def unary_encode(input):
2.     result = []
3.     for x in input:
4.         for i in range(0, x-1):
5.             result.add(0)
6.         result.add(1)
7.     return result

2.2 Expected number of bits in sequence of n integers i.i.d from {1, ..., n}

1. The code length for integer i is exactly i
2. For a random integer from {1, ..., n}, the expected code length is sum_i (i * 1/n) = 1/n * sum_i i = 1/n * 1/2 * n * (n+1) = (n+1) / 2
3. The expected number of bits in the whole sequence is hence n * (n+1) / 2

2.3 Is this encoding entropy-optimal for the distribution from task 2.2

1. The entropy H of the distribution from 2.2 is -sum_i 1/n * log_2 (1/n) = 1/n * log_2 n * sum_i 1 =  log_2 n
2. The expected code length is (n+1)/2
3. For entropy-optimality, we should have expected code length = entropy + O(1)
4. So the answer is NO.

2.4 Name a prefix-free entropy-optimal encoding for the distribution from task 2.2

1. Just encode each integer with a fixed number of ceil(log_2 n) bits
2. This is prefix-free, because for two different codes of the same length, none can be the prefix of the other
3. Since ceil(log_2 n) <= log_2 + 1, this encoding is entropy-optimal

TASK 3 (Fuzzy search and UTF-8, Axel)

3.1 (5 points) ED(x, y), ED(y, x), PED(x, y), PED(y, x) for x = cute and y = computer

ED(cute, computer) = 4 ... insert omp and r
ED(computer, cute) = 4 ... ED is symmetric
PED(cute, computer) = ED(cute, compute) = 3
PED(computer, cute) = ED(computer, cute) = 4

3.2 (5 points) Find x and y with ED(x, y) = 2 and comm($$x$$, $$y$$) = max{|x|, |y|} - 4

1. max{|x|, |y|} - 4 is the lower bound from the lecture for q = 3 and ED(x, y) = 2
2. The lower bound is achieved when destroying the maximal number of q-grams with each edit distance operation, in this case 3
3. For example, x = ABC, y = DBE

3.3 (10 points) Function valid_utf8_char8 that checks if a given array of three bytes is valid UTF-8 for a single character

1. def valid_utf8_char(bytes):
       # Check that first byte start with 1110 and the other two with 10
       # 10000000b = 128, 11000000b = 192, 11100000b = 224, 11110000b = 240
3.     if (bytes[0] & 240 != 224 OR bytes[1] & 192 != 128 OR bytes[2] & 192 != 192):
4.         return False 
       # Check that the code point is >= 2048
       # 11100000b = 224, , 10100000b = 160 
5.     if (bytes[0] > 224 OR bytes[1] >= 160):
6.         return True
7.     return False

TASK 4 (Naive Bayes and k-means)

4.1 Steps of k-means

Points: P1 = (3, 1), P2 = (1, 2), P3 = (1, 4)
Initial centroids: C1 = (3, 0), C2 = (0, 3)
Step 1a: C1 <- P1, C2 <- P2, P3
Step 1b: C1 = (3, 1), C2 = (1, 3)
Step 2a: C1 <- P1, C2 <- P2, P3 
Step 2b: C1 = (3, 1), C2 = (1, 3)
Centroid identical to those from previous round -> STOP

4.2 Compute centroids from A and P

1. def centroids(A, P):
2.    return A.normalizecolumnsL1().transpose() * P

4.3 Determine w and b of Naive Bayes

Learned probabilities: p_1X = 3/4, p_2X = 1/4, p_1Y = 1/4, p_2Y = 3/4, p_X = 1/3, p_Y = 2/3
Linear classifier: w = (log_2 (p_1X/p_1Y), log_2 (p_2X/p_2Y)) = (log_2 3, -log_2 3)
Linear classifier: b = - log_2 (p_X/p_Y)

4.4 Example such that Naive Bayes decides 2x > y

Solution for x > y:

Choose P1 = (2, 1) --> X and P2 = (1, 2) --> Y
Then p_1X = 2/3, p_2X = 1/3, p_1Y = 1/3, p_2Y = 2/3
Then w = (1, -1) and b = 0
Then (w1, w2) * (x, y) - b > 0  <=>  x > y.

TASK 5 (Latent Semantic Indexing)

5.1 Show that V is row-orthonormal

Product without factor 1/4 * sqrt(2) is: [[8, 0], [0, 8]]
1/4 * sqrt(2) * 1/4 * sqrt(2) = 1/16 * 2 = 1/8
This gives indeed the identity matrix

5.2 Compute missing S

A * At = [[40, 24], [24, 64]]
A * At * [1, 1]t = [64, 64]t -> Eigenvalue 64
A * At * [1, -11]t = [16, -16]t -> Eigenvalue 16
The singular values are hence 8 and 4
S is hence diag(8, 4)

5.3 Rank of A and making it rank 3

A has two independent columns -> rank >= 2
A has two rows -> rank <= 2
Since the number of rows is an upper bound, adding another column cannot make the rank 3

5.4 Function for L2-normalization of a vector

1. def normalize(vector):
2.     sum = 0
3.     for x in vector:
4.         sum += x * x
5.     for x in vector:
6.         x = x / sqrt(sum)

TASK 6 (Miscellaneous)

6.1 Total number of items in a 3-gram index

Each 3-gram of each word contributed one item. A word v has |v| + 2 3-grams.
Hence the total number of items is sum_i |v_i| + 2 * n.

6.2 SQL query for persons who founded company in city they were born in

SELECT founded_by.person
FROM founded_by, based_in, born_in
WHERE founded_by.person = born_in.person
AND   founded_by.company = based_in.company
AND   based_in.city = born_in.city

6.3 Maximum entropy of distribution given by p1, ..., pn

Entropy is sum_i -p_i * log2 p_i
Optimize sum_i -p_i * ln p_i (the constant factor log / ln does not change the position of the optimum)
First derivative: -1 - ln p_i + lambda = 0  =>  all p_i equal and hence 1/n
Second derivative:  -1/p_i < 0  =>  MAX
Resulting entropy: sum_i -1/n * log2 (1/n) = log2 n

-  ⇤ ← Revision 2 as of 2018-02-19 13:01:28 → 
  Size: 5653
  Editor: Hannah Bast
  Comment:
+   ← Revision 4 as of 2018-02-19 13:35:40 → ⇥
  Size: 6412
  Editor: Hannah Bast
  Comment:
-Deletions are marked like this.
+Additions are marked like this.
 Line 31:
-=== 1.3 Function ndcg that computes nDCG@k ===
+=== 1.4 Function ndcg that computes nDCG@k ===
 Line 42:
-== TASK 2 (Encodings) ==

=== 2.1 Encoding for x = 1, ...,10 ===

{{{
  1 = 100.
  2 = 101.0
  3 = 101.1
  4 = 110.00
  5 = 110.01
  6 = 110.10
  7 = 110.11
  8 = 111.000
  9 = 111.001
 10 = 111.010
}}}

=== 2.2 Function for code for x < 16 ===

{{{
1. def code(x):
2.     result = [1]
3.     length = floor(log2(x))
4.     result.extend([length // 2, length % 2])
5.     if x >= 2 and x < 4:  result.append(x % 2)
7.     if x >= 4 and x < 8:  result.extend([(x & 2) >> 1, x % 2])
8.     if x >= 8 and x < 16: result.extend([(x & 4) >> 2, (x & 2) >> 1, x % 2])
9.     return result
}}}

=== 2.3 Formula for code length for arbitrary x ===

{{{
Length of Golomb part: floor(floor(log_2 x) / 4) + 3
Length of binary part: floor(log_2 x)
Sum of the two: floor(floor(log_2 x) / 4) + 3 + floor(log_2 x)
}}}


== TASK 3 (Web applications and UTF-8) ==

=== 3.1 Write HTML ===

{{{
<html>
<head><script src="convert.js"></script></head>
<body>Centimetres: <input id="cm"/>Inches: <input id="in"/></body>
</html>
}}}

=== 3.2 Write JavaScript ===

{{{
$(document).ready(function(){
  $("#cm").keyup(function(){ $("#in").val($("#cm).val() / 2.54)); })
  $("#in").keyup(function(){ $("#cm").val($("#in).val() * 2.54)); })
})
}}}

=== 3.3 UTF-8 code for Euro sign ===

{{{
32 in binary is 0010.0000, 172 in binary is 1010.1100 (128 + 32 + 8 + 4)
The code point in binary is hence: 0010.0000.1010.1100
We hence need a 3-byte code of the form 1110 xxxx 10 yyyyyy 10 zzzzzz (with a 16-bit code point xxxxyyyyyyzzzzzz)
The 3-byte UTF-8 code is hence: 1110 0010  10 000010  10 101100
}}}

=== 3.4 Function for counting #characters in UTF-8 sequence ===

{{{
1. def count_utf8_chars(bytes):
2.    count = 0
3.    for byte in bytes:
4.        # Count all except the "follow bytes" of the form "10......". 
5.        if (byte & (128 + 64) != 128):
6.            count += 1
7.    return count
}}}
+== TASK 2 (Encodings, Niklas) ==

=== 2.1 Function unary_encode that encodes given sequence of integers ===

{{{
1. def unary_encode(input):
2.     result = []
3.     for x in input:
4.         for i in range(0, x-1):
5.             result.add(0)
6.         result.add(1)
7.     return result
}}}

=== 2.2 Expected number of bits in sequence of n integers i.i.d from {1, ..., n} ===

{{{
1. The code length for integer i is exactly i
2. For a random integer from {1, ..., n}, the expected code length is sum_i (i * 1/n) = 1/n * sum_i i = 1/n * 1/2 * n * (n+1) = (n+1) / 2
3. The expected number of bits in the whole sequence is hence n * (n+1) / 2 
}}}

=== 2.3 Is this encoding entropy-optimal for the distribution from task 2.2 ===

{{{
1. The entropy H of the distribution from 2.2 is -sum_i 1/n * log_2 (1/n) = 1/n * log_2 n * sum_i 1 =  log_2 n
2. The expected code length is (n+1)/2
3. For entropy-optimality, we should have expected code length = entropy + O(1)
4. So the answer is NO.
}}}

=== 2.4 Name a prefix-free entropy-optimal encoding for the distribution from task 2.2 ===

{{{
1. Just encode each integer with a fixed number of ceil(log_2 n) bits
2. This is prefix-free, because for two different codes of the same length, none can be the prefix of the other
3. Since ceil(log_2 n) <= log_2 + 1, this encoding is entropy-optimal
}}}


== TASK 3 (Fuzzy search and UTF-8, Axel) ==

=== 3.1 (5 points) ED(x, y), ED(y, x), PED(x, y), PED(y, x) for x = cute and y = computer ===

{{{
ED(cute, computer) = 4 ... insert omp and r
ED(computer, cute) = 4 ... ED is symmetric
PED(cute, computer) = ED(cute, compute) = 3
PED(computer, cute) = ED(computer, cute) = 4
}}}

=== 3.2 (5 points) Find x and y with ED(x, y) = 2 and comm($$x$$, $$y$$) = max{|x|, |y|} - 4 ===

{{{
1. max{|x|, |y|} - 4 is the lower bound from the lecture for q = 3 and ED(x, y) = 2
2. The lower bound is achieved when destroying the maximal number of q-grams with each edit distance operation, in this case 3
3. For example, x = ABC, y = DBE
}}}

=== 3.3 (10 points) Function valid_utf8_char8 that checks if a given array of three bytes is valid UTF-8 for a single character ===

{{{
1. def valid_utf8_char(bytes):
       # Check that first byte start with 1110 and the other two with 10
       # 10000000b = 128, 11000000b = 192, 11100000b = 224, 11110000b = 240
3.     if (bytes[0] & 240 != 224 OR bytes[1] & 192 != 128 OR bytes[2] & 192 != 192):
4.         return False 
       # Check that the code point is >= 2048
       # 11100000b = 224, , 10100000b = 160 
5.     if (bytes[0] > 224 OR bytes[1] >= 160):
6.         return True
7.     return False
}}}

Links

Previous Courses

TASK 1 (Ranking and evaluation)

1.1 P@2, P@R, AP

1.2 DCG@2, iDCG@2, nDCG@2

1.3 Function dcg that computes DCG@k

1.4 Function ndcg that computes nDCG@k

TASK 2 (Encodings, Niklas)

2.1 Function unary_encode that encodes given sequence of integers

2.2 Expected number of bits in sequence of n integers i.i.d from {1, ..., n}

2.3 Is this encoding entropy-optimal for the distribution from task 2.2

2.4 Name a prefix-free entropy-optimal encoding for the distribution from task 2.2

TASK 3 (Fuzzy search and UTF-8, Axel)

3.1 (5 points) ED(x, y), ED(y, x), PED(x, y), PED(y, x) for x = cute and y = computer

3.2 (5 points) Find x and y with ED(x, y) = 2 and comm($$x$$, $$y$$) = max{|x|, |y|} - 4

3.3 (10 points) Function valid_utf8_char8 that checks if a given array of three bytes is valid UTF-8 for a single character

TASK 4 (Naive Bayes and k-means)

4.1 Steps of k-means

4.2 Compute centroids from A and P

4.3 Determine w and b of Naive Bayes

4.4 Example such that Naive Bayes decides 2x > y

TASK 5 (Latent Semantic Indexing)

5.1 Show that V is row-orthonormal

5.2 Compute missing S

5.3 Rank of A and making it rank 3

5.4 Function for L2-normalization of a vector

TASK 6 (Miscellaneous)

6.1 Total number of items in a 3-gram index

6.2 SQL query for persons who founded company in city they were born in

6.3 Maximum entropy of distribution given by p1, ..., pn