TASK 1 (Ranking and evaluation)
1.1 Term-document matrix
The idf for first and second is log_2(4/1) = 2; the idf for the other words is log_2(4/2) = 1. Hence the tf.idf matrix (one row per term, one column per document D1 to D4) is:
            D1  D2  D3  D4
first        2   0   0   0
second       0   2   0   0
nice         1   0   1   0
document     1   1   0   0
text         0   0   1   1
more         0   0   1   1
1.2 Function for conversion from tf to tf.idf
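from math import log2

def tf_to_tf_idf(A, m, n):
    # A is an m x n matrix of tf values: m terms (rows), n documents (columns).
    for i in range(m):
        # Document frequency: number of documents that contain term i.
        df = 0
        for j in range(n):
            if A[i, j] > 0:
                df += 1
        if df == 0:
            continue  # term occurs in no document, nothing to scale
        for j in range(n):
            A[i, j] *= log2(n / df)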
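As a sanity check, the conversion can be run on the tf matrix implied by 1.1 (dividing each tf.idf entry by its idf gives tf values of 0 or 1). The sketch below is a list-of-rows variant, so it indexes with A[i][j] instead of A[i, j]; the name tf_to_tf_idf_lists is hypothetical and not part of the original solution.

```python
from math import log2

def tf_to_tf_idf_lists(A):
    """Scale each row (term) of a list-of-rows tf matrix by its idf, in place."""
    n = len(A[0])  # number of documents (columns)
    for row in A:
        # Document frequency: number of documents containing this term.
        df = sum(1 for tf in row if tf > 0)
        if df == 0:
            continue  # term occurs in no document, leave the zero row alone
        idf = log2(n / df)
        for j in range(n):
            row[j] *= idf
    return A

# tf matrix implied by 1.1; rows: first, second, nice, document, text, more.
tf = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
tf_idf = tf_to_tf_idf_lists(tf)
for row in tf_idf:
    print(row)
```

The printed rows should agree with the tf.idf matrix given in 1.1.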
1.3 Find query with P@2 and AP = 50%
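Query: second nice document
Scores: D1 = 2, D2 = 3, D3 = 1, D4 = 0
Order: D2, D1, D3, D4
Relevant: D1 and D4. Then P@2 = 1/2 and P@4 = 2/4 are both 50%. AP is the average of the precisions at the ranks of the relevant documents (ranks 2 and 4), hence also 50%.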
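The numbers above can be reproduced with a few lines of Python. This is only a sketch: the dictionary TFIDF restates the matrix from 1.1, documents are referred to by index (0 for D1 up to 3 for D4), and the helper names are hypothetical.

```python
# tf.idf matrix from 1.1: one row per term, one column per document D1..D4.
TFIDF = {
    "first":    [2, 0, 0, 0],
    "second":   [0, 2, 0, 0],
    "nice":     [1, 0, 1, 0],
    "document": [1, 1, 0, 0],
    "text":     [0, 0, 1, 1],
    "more":     [0, 0, 1, 1],
}

def scores(query):
    """Sum the tf.idf rows of the query terms: one score per document."""
    return [sum(TFIDF[term][j] for term in query) for j in range(4)]

def precision_at(ranking, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

def average_precision(ranking, relevant):
    """Average of P@k over the ranks k at which a relevant document appears."""
    ranks = [k for k, doc in enumerate(ranking, start=1) if doc in relevant]
    return sum(precision_at(ranking, relevant, k) for k in ranks) / len(relevant)

s = scores(["second", "nice", "document"])
ranking = sorted(range(4), key=lambda j: -s[j])  # document indices, best first
relevant = {0, 3}                                # D1 and D4

print(s, ranking)
print(precision_at(ranking, relevant, 2), average_precision(ranking, relevant))
```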
TASK 2 (Encodings)
2.1 Encoding for x = 1, ...,10
1 = 100.
2 = 101.0
3 = 101.1
4 = 110.00
5 = 110.01
6 = 110.10
7 = 110.11
8 = 111.000
9 = 111.001
10 = 111.010
2.2 Function for code for x < 16
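from math import floor, log2

def code(x):
    # Golomb part for x < 16: a leading 1, then floor(log2 x) written as two bits.
    result = [1]
    length = floor(log2(x))
    result.extend([length // 2, length % 2])
    # Binary part: the bits of x below its leading 1 (none for x = 1).
    if x >= 2 and x < 4:
        result.append(x % 2)
    if x >= 4 and x < 8:
        result.extend([(x & 2) >> 1, x % 2])
    if x >= 8 and x < 16:
        result.extend([(x & 4) >> 2, (x & 2) >> 1, x % 2])
    return result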
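The three range checks in code(x) all append the bits of x below its leading 1, so for x < 16 the same code can be produced with string formatting. The following is a sketch only; the name code_bits and the dotted string output (mirroring the notation of 2.1) are choices made for this illustration.

```python
def code_bits(x):
    """Return the 2.1 code of x (1 <= x < 16) as a string such as '110.01'."""
    if not 1 <= x < 16:
        raise ValueError("this sketch only covers 1 <= x < 16")
    length = x.bit_length() - 1          # equals floor(log2 x) for x >= 1
    prefix = "1" + format(length, "02b") # leading 1, then length as two bits
    suffix = format(x, "b")[1:]          # binary digits of x without the leading 1
    return prefix + "." + suffix

for x in range(1, 11):
    print(x, "=", code_bits(x))
```

Using x.bit_length() instead of floor(log2(x)) avoids floating-point arithmetic; both give the same value for the integers considered here.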
2.3 Formula for code length for arbitrary x
Length of Golomb part: floor(floor(log_2 x) / 4) + 3
Length of binary part: floor(log_2 x)
Sum of the two: floor(floor(log_2 x) / 4) + 3 + floor(log_2 x)
Check against 2.1: x = 10 gives 0 + 3 + 3 = 6 bits, the length of 111.010.
TASK 3 (Web applications and UTF-8)
3.1 Write HTML
<html>
<head><script src="convert.js"></script></head>
<body>Centimetres: <input id="cm"/>Inches: <input id="in"/></body>
</html>
3.2 Write JavaScript
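$(document).ready(function(){
  // Typing in the centimetre field updates the inch field.
  $("#cm").keyup(function(){
    $("#in").val($("#cm").val() / 2.54);
  })
  // Typing in the inch field updates the centimetre field.
  $("#in").keyup(function(){
    $("#cm").val($("#in").val() * 2.54);
  })
})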
3.3 UTF-8 code for Euro sign
32 in binary is 0010.0000, 172 in binary is 1010.1100 (128 + 32 + 8 + 4).
The code point in binary is hence: 0010.0000.1010.1100
We hence need a 3-byte code of the form 1110 xxxx 10 yyyyyy 10 zzzzzz (with a 16-bit code point xxxxyyyyyyzzzzzz).
The 3-byte UTF-8 code is hence: 1110 0010 10 000010 10 101100
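The bit shuffling above can be checked mechanically. The sketch below packs a 16-bit code point into the 3-byte pattern and compares the result with Python's built-in encoder; the function name utf8_encode_3byte is hypothetical, and surrogate code points are not treated specially.

```python
def utf8_encode_3byte(code_point):
    """Encode a code point in 0x0800..0xFFFF as three UTF-8 bytes."""
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("not in the 3-byte range")
    x = code_point >> 12           # top 4 bits    (xxxx)
    y = (code_point >> 6) & 0x3F   # middle 6 bits (yyyyyy)
    z = code_point & 0x3F          # low 6 bits    (zzzzzz)
    return bytes([0b11100000 | x, 0b10000000 | y, 0b10000000 | z])

euro = 32 * 256 + 172              # high byte 32, low byte 172
encoded = utf8_encode_3byte(euro)
print(" ".join(format(b, "08b") for b in encoded))
print(encoded == chr(euro).encode("utf-8"))
```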
3.4 Function for counting #characters in UTF-8 sequence
def count_utf8_chars(data):
    # data is a sequence of byte values (integers 0..255), e.g. a bytes object.
    count = 0
    for byte in data:
        # Count all except the "follow bytes" of the form "10......".
        if (byte & (128 + 64)) != 128:
            count += 1
    return count
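A compact variant of the same idea, with a small check against Python's own character count, might look as follows. The sample text is made up for this illustration, and the function is renamed count_utf8_chars_compact to mark it as a sketch rather than the original solution.

```python
def count_utf8_chars_compact(data):
    """Count bytes that are not continuation bytes of the form 10xxxxxx."""
    return sum((byte & 0b11000000) != 0b10000000 for byte in data)

# Sample text (hypothetical) with 1-, 2- and 3-byte characters.
text = "Gr\u00f6\u00dfe 5 \u20ac"
data = text.encode("utf-8")
print(len(data), count_utf8_chars_compact(data), len(text))
```

Each character contributes exactly one byte that is not of the form 10xxxxxx, which is why counting those bytes yields the number of characters in valid UTF-8.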
TASK 4 (Naive Bayes and k-means)
4.1 Steps of k-means
4.2 Compute centroids from A and P
4.3 Determine w and b of Naive Bayes
4.4 Example such that Naive Bayes decides 2x > y
TASK 5 (Latent Semantic Indexing)
5.1 Show that V is row-orthonormal
5.2 Compute missing S
5.3 Rank of A and making it rank 3
5.4 Function for L2-normalization of a vector
TASK 6 (Miscellaneous)
6.1 Total number of items in a 3-gram index
6.2 SQL query for persons who founded company in city they were born in