Diff for "SearchEnginesWS0910"

Differences between revisions 318 and 793 (spanning 475 versions)

Welcome to the Wiki page of the course Search Engines, WS 2009 / 2010. Lecturer: Hannah Bast. Tutorials: Marjan Celikik. Course web page: click here.

Here are PDFs of the slides of the lectures: Lecture 1, Lecture 2, Lecture 3, Lecture 4, Lecture 5, Lecture 6, Lecture 7, Lecture 8, Lecture 9, Lecture 10, Lecture 11, Lecture 12, Lecture 13, Lecture 14, Projects.

Here are PDFs of the exercise sheets so far: Exercise Sheet 1, Exercise Sheet 2, Exercise Sheet 3, Exercise Sheet 4, Exercise Sheet 5, Exercise Sheet 6, Exercise Sheet 7, Exercise Sheet 8, Exercise Sheet 9, Exercise Sheet 10, Exercise Sheet 11, Exercise Sheet 12, Exercise Sheet 13, Exercise Sheet 14.

Here are your solutions and comments on the previous exercise sheets: Solutions and Comments 1, Solutions and Comments 2, Solutions and Comments 3, Solutions and Comments 4, Solutions and Comments 5, Solutions and Comments 6, Solutions and Comments 7, Solutions and Comments 8, Solutions and Comments 9, Solutions and Comments 10, Solutions and Comments 11, Solutions and Comments 12, Solutions and Comments 13.

Here are our master solutions: Master solution for Exercise Sheet 6 (only Exercise 4), Master solution for Mid-Term Exam, Master solution for Exercise Sheet 9, Master solution for Exercise Sheet 10, Master solution for Exercise Sheet 11, Master solution for Exercise Sheet 12.

Here are the rules for the exercises as explained in Lecture 2.

Here is everything about the mid-term exam (December 18, 2009, a pure trial exam which did not count for anything) and the final exam (March 12, 2010, the real thing which accounted for most of the mark).

Here is the table with the links to your uploaded solutions for Exercise Sheet 14. The deadline is Thursday 18Feb10 16:00.

More general questions and comments

Yes, it's on Wednesday, March 31, 2 - 3 pm in my office (building 51, second floor, room 28). Hannah 20Mar10 21:09

Hi, have you decided when and where the exam review will be, yet? Marius Mar30th2010 08:50 p.m.

Your marks for the final exam and your overall mark for the course are now available on your personal page. You will have the opportunity to look at your exams on a certain day, which we haven't fixed yet. We will tell you (via this page) when we have fixed that date. Hannah 15Mar10 00:39

Yes, HS026 in building 101. Hannah 12Mar10 8:45am

Does HS mean it is in the same building 101? I am still new in Freiburg ;). Paresh 12 Mar 10 07:45

I am sorry that we have not managed to produce the master solutions for exercise sheets 13 and 14 yet. Lecture 13 (clustering, part 2) is still relevant for the exam, but I can tell you that there will be no task about Lecture 14 (statistical hypothesis testing). I think that's fair, because there was no tutorial for the (last) exercise sheet 14. Exercise sheet 14 will count as a normal exercise sheet, however. Hannah 11Mar10 15:21

Hi, since there are still no master solutions for sheets 13-14, I assume the contents of the lecture concerning these sheets are not relevant for the exam. Marius Mar11th 2:23 p.m.

To Johannes + all: you are not allowed to bring any computing devices whatsoever and you won't need them. If there is a task which requires a calculation that is unreasonable to do by hand (like the log_2(10/7) from the mid-term exam), we will tell you what it is or an approximation to work with (for example that you can take log_2(10/7) as 0.5). Hannah 10Mar10 20:42

Exam and portable calculators: "2. You are not allowed to use any computing devices, mobile phones, etc." I had some problems with pen, paper and sqrt(1080). May we bring calculators? Johannes 2010-03-10T20:27

Thanks, the solution for sheet6 ex4 helped us a lot! björn

@Jonas: thanks for the comment, I have corrected it in the master solution. @Björn: I added a master solution for Exercise Sheet 6 (only Exercise 4), linked above, with what I think is a very short and simple proof. Tell me if you find anything wrong with it. Hannah 10Mar10 16:40

Jonas: Yes, that was already mentioned in the tutorials. Marjan 10Mar10 15:58

Hi. Concerning exercise sheet 10, exercise 1: shouldn't you take the square roots of 108 and 10 (in the matrix EPSILON)? Otherwise the equation is not right. Jonas 10.03.10

Hi, we have a question concerning ex sheet 6, exercise 4. In the tutorial Marjan presented a solid but complicated solution using Taylor expansion. In the lecture you mentioned that this wasn't necessary for any exercise. Unfortunately we fail at finding a simpler but still mathematically rigorous solution. Would you please give a brief idea of how to prove such inequalities, as this might be useful for similar, yet easier exercises in the exam? Björn Mi 15:12

Hi Johannes + all. Here is a very simple example: let the query word be algorithm and one candidate similar word computed by the permuted lexicon be algXXXthm (the common prefix is thmalg [from the permutations thmalgori and thmalgXXX] which is long enough) and let the edit distance threshold be 2. Obviously this candidate word will be filtered out because the edit distance is 3. Marjan 07Mar10 18:57
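
Marjan's example can be checked with a few lines of code. The following is only a sketch, not the implementation from the lecture: the helper names `rotations`, `common_prefix_len` and `edit_distance` are made up for illustration, the rotations are taken without an end-of-word marker (to match the permutations quoted above), and the edit distance is the standard Levenshtein dynamic program.

```python
def rotations(word):
    """All cyclic rotations of word (no end marker, as in the example above)."""
    return [word[i:] + word[:i] for i in range(len(word))]


def common_prefix_len(a, b):
    """Length of the longest common prefix of a and b."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n


def edit_distance(a, b):
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


query, candidate, threshold = "algorithm", "algXXXthm", 2

# The candidate is found via the two rotations that share the prefix "thmalg".
prefix_len = common_prefix_len("thmalgori", "thmalgXXX")

# Post-filtering step: keep the candidate only if the real distance is small enough.
distance = edit_distance(query, candidate)
keep = distance <= threshold
```

Here `prefix_len` is 6 and `distance` is 3, so `keep` is false and the candidate is dropped in the post-filtering step.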

Filtering with a Permuterm Index: The slide states: "for all matches thus found, compute the actual edit distance". Is there a simple strawman example for a word that gets removed in the post-filtering step? (Today is silly question day.) Johannes 2010-03-07T18:26

Hi Johannes + all. Concerning your inverted index question: it really depends on the application, if you have lists of only doc ids and want to intersect them fast, you would sort the lists by doc id, if you want to do top-k you would sort them by score. Duplicates only make sense when you also store positional information, which we didn't do in the lecture. Concerning your Elias-Gamma question: there is an upper bound, which I think we also derived in the lecture, and that is log n + O(log^(k) n) + O(1), but I couldn't tell you what are the constants hidden in the two Big-Ohs. Hannah 7Mar10 18:19
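
To illustrate why sorting by doc id helps for intersection, here is a minimal sketch of the usual two-pointer intersection. It assumes both lists are sorted in ascending order and contain no duplicates; the function name `intersect_sorted` is chosen here for illustration and is not from the lecture.

```python
def intersect_sorted(a, b):
    """Intersect two ascending, duplicate-free lists of doc ids in O(len(a) + len(b))."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1            # a[i] cannot occur in b any more
        elif a[i] > b[j]:
            j += 1            # b[j] cannot occur in a any more
        else:
            result.append(a[i])
            i += 1
            j += 1
    return result


common = intersect_sorted([1, 3, 5, 8, 13], [2, 3, 8, 9, 13, 21])
```

If the lists were sorted by score instead, this linear scan would no longer be correct, which is why the sort order depends on the application.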

Inverted indexes and the like: Suppose an inverted index maps a word w (perhaps a string) to a subset W(w) of the set of all documents (perhaps only the IDs as numbers). Is W(w) always sorted? Does it contain duplicates? For some applications (and the algorithms for them) this seems to matter. I'm just asking in case of an exam task involving coding (especially k-way merge). Johannes 2010-03-07T13:54

Elias-Gamma Encoding: Is there a closed form for the length of the code for an integer x when Elias-Gamma is iterated k times? Johannes 2010-03-07T15:14
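
For small cases the code lengths can simply be computed. The sketch below assumes the common textbook definitions, which may differ in detail from the ones used in the lecture: Elias-Gamma writes floor(log2 x) zeros followed by the binary representation of x, and "iterating once" is taken to mean Elias-Delta, where the length of the binary representation is itself Gamma-encoded.

```python
def elias_gamma(x):
    """Elias-Gamma code of x >= 1 as a bit string."""
    if x < 1:
        raise ValueError("Elias codes are defined for integers >= 1")
    binary = bin(x)[2:]                       # binary representation, leading bit is 1
    return "0" * (len(binary) - 1) + binary   # unary length prefix, then the number


def elias_delta(x):
    """Elias-Delta code of x >= 1: Gamma-encode the bit length, then the remaining bits."""
    if x < 1:
        raise ValueError("Elias codes are defined for integers >= 1")
    binary = bin(x)[2:]
    return elias_gamma(len(binary)) + binary[1:]   # drop the implicit leading 1


gamma_10 = elias_gamma(10)   # "0001010"
delta_10 = elias_delta(10)   # "00100010"
```

Under these assumptions the Gamma code of x has 2 * floor(log2 x) + 1 bits, and the Delta code has floor(log2 x) + 2 * floor(log2(floor(log2 x) + 1)) + 1 bits.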

Questions and comments about the master solution of the mid-term exam

Johannes 2010-03-07T12:40 :

1.3: CLAIM: If an encoding is prefix-free, then there is no code that is a prefix of a different code. Does this claim hold? If so, then 001 mustn't be a code, since 0 is a code and a prefix of 001. Is this right?

There was an obvious mistake which I now corrected (00 should be mapped to 1, not 0). Hannah 7Mar10 12:56

1.4: It states: "For a sequence of length n, we need to generate n/2 such codes [...]." Does not each symbol of the n from the sequence get encoded?

Each code stands for two bits at a time, so for a sequence of n bits, you have to generate n/2 codes. I replaced sequence of length n by sequence of n bits to make this clearer. Hannah 7Mar10 12:58

3.4: The function returns the number of common k-grams (as far as I see). Can the return-line be completed with a call to the function from 3.2 to return the Jaccard-distance?

Yes, indeed, I replaced return l by return jaccardDistance(x, y, k, l). Hannah 7Mar10 13:01
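
The master solution itself is not reproduced on this page, so the following is only a guess at what such a function could look like. It assumes that k-grams are taken without padding, so a string s has len(s) - k + 1 of them, that l is the number of common k-grams counted with multiplicity, and that the Jaccard distance is 1 - l / (number of k-grams of x + number of k-grams of y - l). The helper names are made up for this sketch.

```python
from collections import Counter


def kgrams(s, k):
    """All k-grams of s, without padding."""
    return [s[i:i + k] for i in range(len(s) - k + 1)]


def common_kgrams(x, y, k):
    """Number of k-grams that x and y share, counted with multiplicity."""
    shared = Counter(kgrams(x, k)) & Counter(kgrams(y, k))
    return sum(shared.values())


def jaccard_distance(x, y, k, l):
    """Jaccard distance of x and y, given the number l of common k-grams."""
    union = max(len(x) - k + 1, 0) + max(len(y) - k + 1, 0) - l
    if union <= 0:
        return 0.0            # neither string has any k-gram
    return 1.0 - l / union


l = common_kgrams("algorithm", "algXXXthm", 2)    # al, lg, th, hm
d = jaccard_distance("algorithm", "algXXXthm", 2, l)
```

For these two words with k = 2 there are 4 common 2-grams out of 8 + 8 - 4 = 12, which gives a distance of 2/3.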

5.4: Does the top-k algorithm return the top k documents? If so, which k had to be used in this task? What exactly is the condition for stopping? What exactly is the update rule for the ranges? My idea is that (for a fixed document) the minimum is always the known minimum from any of the lists and the maximum is always the (already known) minimum plus the lowest score seen in any list other than the one the minimum is from. In the case of only two lists there may be some simplifications.

The task asked for the top-ranked document, so k = 1. We can stop when the upper bound for all documents not yet seen is strictly below the k-th largest lower bound so far, and when the score ranges for the documents already seen are such that it is clear which are the top-k documents and in which order. If there are ties, and we don't care how they are broken, and we don't care to know the order of the top-k documents, we can sometimes stop earlier. Does this answer all your questions? Hannah 7Mar10 13:06
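
The stopping condition described above can be written down as a small check. This is a sketch under assumptions, not the algorithm from the master solution: `ranges` is a hypothetical dictionary mapping each document seen so far to its current score range (lower bound, upper bound), `unseen_upper` is the upper bound for any document not yet seen, and ties are treated conservatively, so the check never stops early on equal scores.

```python
def can_stop(ranges, unseen_upper, k):
    """True if the top-k documents and their order are already determined."""
    if len(ranges) < k:
        return False
    # Rank the seen documents by their lower bound, best first.
    ranked = sorted(ranges.values(), key=lambda r: r[0], reverse=True)
    kth_lower = ranked[k - 1][0]
    # Condition 1: no unseen document can still reach the top k.
    if not unseen_upper < kth_lower:
        return False
    # Condition 2: each of the top k beats everything ranked below it for sure.
    for i in range(k):
        lower_i = ranked[i][0]
        if any(lower_i <= upper_j for (_, upper_j) in ranked[i + 1:]):
            return False
    return True


ranges = {"d1": (5, 5), "d2": (2, 4), "d3": (1, 3)}
stop_for_top1 = can_stop(ranges, unseen_upper=3, k=1)
stop_for_top2 = can_stop(ranges, unseen_upper=3, k=2)
```

With these ranges the top-ranked document is already certain, while for k = 2 an unseen document could still overtake d2, so the scan has to continue.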

Thanks a lot for your comments! Please go on if you have more. Hannah 7Mar10 13:07

Thanks a lot for your answers! Johannes 2010-03-07T13:44

Questions and comments about Exercise Sheet 14 below this line (most recent on top)

Hi Johannes: why don't you start with the first few questions, and then let's see whether it makes sense to continue this via the Wiki, or via private email, or via a meeting in person. Hannah 6Mar10 17:36

Yes, the final exam is like the mid-term exam in this respect. Hannah 6Mar10 17:36

Alex: http://vulcano.informatik.uni-freiburg.de/wiki/teaching/SearchEnginesWS0910/MidTermExam, so it seems to be allowed. Mirko, 6Mar10 16:10

Hi, I was wondering, will the exam next week also be an open book exam like the mid-term? Perhaps I overlooked it, but I don't think this is stated anywhere yet. Alex 6Mar10 13:49

I have lots of questions and don't know where to put them. I suppose this wiki-page will get chaotic pretty fast if I post 20 questions. Johannes VI Mar MMX 12:00

I'm sorry for the delay with the master solutions. I am at a conference right now but will try to make progress with this over the weekend. Hannah 4Mar10 23:59

Do we get master solutions for ex. 11, 12, 13 and 14? Johannes 04Mar2010 23:32 ZULU

Now they're there again. Marjan 01Mar18:09

ARGH! I'm very sorry. My Down-Them-All Plugin for Firefox seems to have deleted all the lecture PDFs! Sorry for that. Rollback to previous versions does not seem to work. I hope, someone has already downloaded them all and is able to restore them! SORRY! Interesting, I've got the rights to delete something from the main page, though. Marius Mar 1st 2010 2:38 p.m.

(Reminder:) Hello, the master solutions are not online, yet. alex n 1Mar10 11:08

Yes, we are working on it. Please remind us again if they aren't online by the end of this week. Hannah 23Feb10 14:30

Do we get master solutions for ex. 11, 12, 13 and 14? Johannes 23Feb10 14:05

Hi Matthias, yes, Pr(A) = 1 - Pr(not A), for any event A, and so for any random variable X, Pr(X <= x) = 1 - Pr(X > x), because X <= x and X > x are complementary events. For continuous random variables (like variables with a normal distribution), the difference between <= and < and between >= and > is immaterial, because Pr(X = x) = 0 for each fixed x. But anyway, to compute the probability, you first have to transform it a bit, like I did in the lecture, and then obtain Pr(N(0,1) >= sqrt(n1) * (µ1 - µ) / σ) and Pr(N(0,1) <= sqrt(n2) * (µ2 - µ) / σ). To evaluate the latter you can also simply use the symmetry of the normal distribution, due to which one has Pr(N(0,1) <= -x) = Pr(N(0,1) >= x). Hannah 18Feb10 12:58
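
The transformation mentioned here can be written out step by step. This is a sketch of the standard standardization argument, assuming (as in the probabilities discussed in this thread) that the sum of n independent samples is distributed as N(nµ, nσ²); subtracting the mean and dividing by the standard deviation sqrt(n) σ gives a standard normal variable.

```latex
\begin{align*}
\Pr\bigl(N(n_1\mu,\, n_1\sigma^2) \ge n_1\mu_1\bigr)
  &= \Pr\!\left(\frac{N(n_1\mu,\, n_1\sigma^2) - n_1\mu}{\sqrt{n_1}\,\sigma}
        \ge \frac{n_1\mu_1 - n_1\mu}{\sqrt{n_1}\,\sigma}\right) \\
  &= \Pr\!\left(N(0,1) \ge \sqrt{n_1}\,\frac{\mu_1 - \mu}{\sigma}\right), \\[4pt]
\Pr\bigl(N(n_2\mu,\, n_2\sigma^2) \le n_2\mu_2\bigr)
  &= \Pr\!\left(N(0,1) \le \sqrt{n_2}\,\frac{\mu_2 - \mu}{\sigma}\right) \\
  &= \Pr\!\left(N(0,1) \ge \sqrt{n_2}\,\frac{\mu - \mu_2}{\sigma}\right).
\end{align*}
```

The last step uses the symmetry Pr(N(0,1) <= -x) = Pr(N(0,1) >= x) with x = sqrt(n2) * (µ - µ2) / σ.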

Hi, how can we compute Pr(N(n2 * µ, n2 * σ²) <= n2 * µ2)? Can we use 1 - Pr(N(n2 * µ, n2 * σ²) >= n2 * µ2) for that? Matthias 18Feb10 12:01

Hi Florian + all, one of µ1 and µ2 is larger than µ and one is smaller. Let's assume µ1 is larger and µ2 is smaller. Then for µ1 you have to look at Pr(N(n1 * µ, n1 * σ²) >= n1 * µ1). But for µ2 you have to look at Pr(N(n2 * µ, n2 * σ²) <= n2 * µ2). Note the <= instead of the >= for the second probability. Recall the meaning of these probabilities. Just as an example, let µ be 100 and µ1 be 150 and µ2 be 50. Then the first probability means: what is the probability that I see a mean of 150 or more in my first sample, although the mean of my distribution is 100. The second probability means: what is the probability that I see a mean of 50 or less in my second sample, although the mean of my distribution is 100. If you take <= for both or >= for both probabilities, it is to be expected that you get two completely different probabilities, one very low and one very high (except when they are both close to 50%). Please ask again if this is still unclear. Hannah 17Feb10 21:51
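
The effect of the wrong inequality direction can be seen numerically. The sketch below uses the example values µ = 100, µ1 = 150, µ2 = 50 from above; the standard deviation σ = 100 and the sample size n = 4 are made-up values for illustration only, and `upper_tail` is a hypothetical helper that evaluates Pr(N(0,1) >= z) via the complementary error function.

```python
import math


def upper_tail(z):
    """Pr(N(0,1) >= z)."""
    return 0.5 * math.erfc(z / math.sqrt(2))


def lower_tail(z):
    """Pr(N(0,1) <= z), using the symmetry of the normal distribution."""
    return upper_tail(-z)


mu, mu1, mu2 = 100.0, 150.0, 50.0   # example values from the text
sigma, n = 100.0, 4                 # assumed for this illustration

z1 = math.sqrt(n) * (mu1 - mu) / sigma   # standardized value for the first sample
z2 = math.sqrt(n) * (mu2 - mu) / sigma   # standardized value for the second sample

p1 = upper_tail(z1)         # mean of 150 or more, although the true mean is 100
p2 = lower_tail(z2)         # mean of 50 or less, although the true mean is 100
p2_wrong = upper_tail(z2)   # what you get when using >= for the second sample too
```

With these assumed values p1 and p2 are both about 0.16, while p2_wrong is about 0.84, that is 1 - p2. This is the "one very low, one very high" pattern that arises when the same inequality direction is used for both samples.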

Sorry, with probability for µ1 I meant Pr(N(n1 * µ, n1 * σ²) >= n1 * µ1) and accordingly with probability for µ2 I meant Pr(N(n2 * µ, n2 * σ²) >= n2 * µ2) where n1=n2 for the exercise sheet. Florian 17Feb10 21:18

Hi Florian, what exactly do you mean by probability for µ1 and probability for µ2? Hannah 17Feb10 21:02

Hi, what values are we expected to get for exercise 4? I always get a probability of about 99.9% for μ1 and a value of about 0.07% for μ2, can that be? Florian 17Feb10 18:25

Hi Florian, yes, the averages in Exercise 3 should be average running times. I uploaded a new version of the sheet, where I corrected this. Hannah 14Feb10 17:48

Hi, I guess we should measure the running times to determine the efficiency of the programs for exercise 3? Florian 15Feb10 17:42

Hi Claudius, you should compute Pr(D|H0), exactly as done in the lecture for Example 2, where we computed this probability as Pr(X > x), where X is a random variable with distribution N(0,1), that is, normal with mean 0 and variance 1, and x depends on the mean and variance of your data. Hannah 14Feb10 16:44

Hi. If I have understood correctly, we have to compute Pr(H|D) in Exercise 4. From statistical hypothesis testing, we get Pr(D|H). Now, Pr(H|D) = Pr(D|H) * (Pr(H) / Pr(D)). We know Pr(D|H) and we can compute Pr(D), but what value do we have to use for Pr(H)? Claudius 14Feb10 14:41

Hi Eric, I don't care whether you use integers or doubles, but I am curious why the one should be any harder than the other? Hannah 12Feb10 19:02

May we use integers for sorting? Or do we have to use doubles? This is important for generating my sorted array. Eric 12Feb10 18:56

If you're asking about the merging you can of course use a priority queue if you want, but you don't really need it when merging 2 lists. Marjan 18:28
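
A minimal sketch of merging two sorted sequences without a priority queue follows. It assumes both inputs are already sorted in ascending order; the function name `merge_sorted` is chosen here for illustration. (For merging many lists at once, a heap-based approach such as Python's `heapq.merge` is the usual choice.)

```python
def merge_sorted(a, b):
    """Merge two ascending lists into one ascending list in O(len(a) + len(b))."""
    merged = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:          # take the smaller head element
            merged.append(a[i])
            i += 1
        else:
            merged.append(b[j])
            j += 1
    merged.extend(a[i:])          # at most one of these two is non-empty
    merged.extend(b[j:])
    return merged


merged = merge_sorted([1, 4, 4, 9], [2, 4, 10, 11])
```

Two indices are enough because at every step the next output element is the smaller of the two current heads; a priority queue only pays off once more than two lists are merged.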

Why would you use a priority queue? It's simple sorting, the exercise is not about implementing your own sorting algorithm or something like that. About exercise 3, it should be clear from the exercise itself that the sequences should be sorted (otherwise how can the merging work?) Marjan 18:23

Does that mean we only have to use a priority queue or something like that and don't have to implement the sorting ourselves? And in Exercise 3, should the random set be an ordered one or not? Alex 12Feb10 18:19

We prefer randomized sorting using bitonic networks, alternatively combined with LSD radix sort or simple pancake sort. That's of course a joke; it should be clear that you can use the built-in sorting functions (your own implementation will certainly be slower). Marjan 12Feb10 18:12

What does "do a standard sort" in exercise 2 mean? Shall I implement one on my own, or may I use the Java built-in sorting mechanisms? Also, which sorting algorithm do you prefer for this? Eric 12Feb10 18:04

- Revision 318 as of 2009-11-22 12:59:15
  Size: 5930
  Editor: HSI-KBW-078-042-029-035
  Comment:
+ Revision 793 as of 2010-07-15 12:04:05
  Size: 20493
  Editor: rumo
  Comment: Spam
-Deletions are marked like this.
+Additions are marked like this.
 Line 3:
-Here are PDFs of the slides of the lectures so far: [[attachment:SearchEnginesWS0910/lecture-1.pdf|Lecture 1]], [[attachment:SearchEnginesWS0910/lecture-2.pdf|Lecture 2]], [[attachment:SearchEnginesWS0910/lecture-3.pdf|Lecture 3]], [[attachment:SearchEnginesWS0910/lecture-4.pdf|Lecture 4]], [[attachment:SearchEnginesWS0910/lecture-5.pdf|Lecture 5]].
+Here are PDFs of the slides of the lectures: [[attachment:SearchEnginesWS0910/lecture-1.pdf|Lecture 1]], [[attachment:SearchEnginesWS0910/lecture-2.pdf|Lecture 2]], [[attachment:SearchEnginesWS0910/lecture-3.pdf|Lecture 3]], [[attachment:SearchEnginesWS0910/lecture-4.pdf|Lecture 4]], [[attachment:SearchEnginesWS0910/lecture-5.pdf|Lecture 5]], [[attachment:SearchEnginesWS0910/lecture-6.pdf|Lecture 6]], [[attachment:SearchEnginesWS0910/lecture-7.pdf|Lecture 7]], [[attachment:SearchEnginesWS0910/lecture-8.pdf|Lecture 8]], [[attachment:SearchEnginesWS0910/lecture-9.pdf|Lecture 9]], [[attachment:SearchEnginesWS0910/lecture-10.pdf|Lecture 10]], [[attachment:SearchEnginesWS0910/lecture-11.pdf|Lecture 11]], [[attachment:SearchEnginesWS0910/lecture-12.pdf|Lecture 12]], [[attachment:SearchEnginesWS0910/lecture-13.pdf|Lecture 13]], [[attachment:SearchEnginesWS0910/lecture-14.pdf|Lecture 14]], [[attachment:SearchEnginesWS0910/lecture-projects.pdf|Projects]].
 Line 5:
-Here are .lpd files of the recordings of the lectures so far (except Lecture 2, where we had problems with the microphone): [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-1.lpd|Recording Lecture 1]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-3.lpd|Recording Lecture 3]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-4.lpd|Recording Lecture 4]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-5.lpd|Recording Lecture 5 (no audio)]].
+Here are the recordings of the lectures (except Lecture 2, where we had problems with the microphone), LPD = Lecturnity recording: [[http://vulcano.informatik.uni-freiburg.de/recordings/searchengines-ws0910/lecture-1.lpd|Recording Lecture 1 (LPD)]], [[http://vulcano.informatik.uni-freiburg.de/recordings/searchengines-ws0910/lecture-3.lpd|Recording Lecture 3 (LPD)]], [[http://vulcano.informatik.uni-freiburg.de/recordings/searchengines-ws0910/lecture-4.lpd|Recording Lecture 4 (LPD)]], [[http://vulcano.informatik.uni-freiburg.de/recordings/searchengines-ws0910/lecture-5.lpd|Recording Lecture 5 (LPD without audio)]], [[http://vulcano.informatik.uni-freiburg.de/recordings/searchengines-ws0910/lecture-6.lpd|Recording Lecture 6 (LPD)]], [[http://vulcano.informatik.uni-freiburg.de/recordings/searchengines-ws0910/lecture-7.avi|Recording Lecture 7 (AVI)]], [[http://vulcano.informatik.uni-freiburg.de/recordings/searchengines-ws0910/lecture-8.avi|Recording Lecture 8 (AVI)]], [[http://vulcano.informatik.uni-freiburg.de/recordings/searchengines-ws0910/lecture-9.avi|Recording Lecture 9 (AVI)]], [[http://vulcano.informatik.uni-freiburg.de/recordings/searchengines-ws0910/lecture-10.avi|Recording Lecture 10 (AVI)]], [[http://vulcano.informatik.uni-freiburg.de/recordings/searchengines-ws0910/lecture-11.avi|Recording Lecture 11 (AVI)]], [[http://vulcano.informatik.uni-freiburg.de/recordings/searchengines-ws0910/lecture-12.avi|Recording Lecture 12 (AVI)]], [[http://vulcano.informatik.uni-freiburg.de/recordings/searchengines-ws0910/lecture-13.avi|Recording Lecture 13 (AVI)]], [[http://vulcano.informatik.uni-freiburg.de/recordings/searchengines-ws0910/lecture-14.avi|Recording Lecture 14 (AVI)]]. To play the Lecturnity recordings (.lpd files) you need the [[http://www.lecturnity.de/de/download/lecturnity-player|Lecturnity Player, which you can download here]]. 
I put the Camtasia recordings as .avi files, which you can play with any ordinary video player; I would recommend [[http://www.videolan.org/vlc|VLC]].
 Line 7:
-Here are PDFs of the exercise sheets so far: [[attachment:SearchEnginesWS0910/exercise-1.pdf|Exercise Sheet 1]], [[attachment:SearchEnginesWS0910/exercise-2.pdf|Exercise Sheet 2]], [[attachment:SearchEnginesWS0910/exercise-3.pdf|Exercise Sheet 3]], [[attachment:SearchEnginesWS0910/exercise-4.pdf|Exercise Sheet 4]], [[attachment:SearchEnginesWS0910/exercise-5.pdf|Exercise Sheet 5]].
+Here are PDFs of the exercise sheets so far: [[attachment:SearchEnginesWS0910/exercise-1.pdf|Exercise Sheet 1]], [[attachment:SearchEnginesWS0910/exercise-2.pdf|Exercise Sheet 2]], [[attachment:SearchEnginesWS0910/exercise-3.pdf|Exercise Sheet 3]], [[attachment:SearchEnginesWS0910/exercise-4.pdf|Exercise Sheet 4]], [[attachment:SearchEnginesWS0910/exercise-5.pdf|Exercise Sheet 5]], [[attachment:SearchEnginesWS0910/exercise-6.pdf|Exercise Sheet 6]], [[attachment:SearchEnginesWS0910/exercise-7.pdf|Exercise Sheet 7]], [[attachment:SearchEnginesWS0910/exercise-8.pdf|Exercise Sheet 8]], [[attachment:SearchEnginesWS0910/exercise-9.pdf|Exercise Sheet 9]], [[attachment:SearchEnginesWS0910/exercise-10.pdf|Exercise Sheet 10]], [[attachment:SearchEnginesWS0910/exercise-11.pdf|Exercise Sheet 11]], [[attachment:SearchEnginesWS0910/exercise-12.pdf|Exercise Sheet 12]], [[attachment:SearchEnginesWS0910/exercise-13.pdf|Exercise Sheet 13]], [[attachment:SearchEnginesWS0910/exercise-14.pdf|Exercise Sheet 14]].
 Line 9:
-Here are your solutions and comments on the previous exercise sheets: [[SearchEnginesWS0910/ExerciseSheet1|Solutions and Comments 1]], [[SearchEnginesWS0910/ExerciseSheet2|Solutions and Comments 2]], [[SearchEnginesWS0910/ExerciseSheet3|Solutions and Comments 3]], [[SearchEnginesWS0910/ExerciseSheet4|Solutions and Comments 4]].
+Here are your solutions and comments on the previous exercise sheets: [[SearchEnginesWS0910/ExerciseSheet1|Solutions and Comments 1]], [[SearchEnginesWS0910/ExerciseSheet2|Solutions and Comments 2]], [[SearchEnginesWS0910/ExerciseSheet3|Solutions and Comments 3]], [[SearchEnginesWS0910/ExerciseSheet4|Solutions and Comments 4]], [[SearchEnginesWS0910/ExerciseSheet5|Solutions and Comments 5]], [[SearchEnginesWS0910/ExerciseSheet6|Solutions and Comments 6]], [[SearchEnginesWS0910/ExerciseSheet7|Solutions and Comments 7]], [[SearchEnginesWS0910/ExerciseSheet8|Solutions and Comments 8]], [[SearchEnginesWS0910/ExerciseSheet9|Solutions and Comments 9]], [[SearchEnginesWS0910/ExerciseSheet10|Solutions and Comments 10]], [[SearchEnginesWS0910/ExerciseSheet11|Solutions and Comments 11]], [[SearchEnginesWS0910/ExerciseSheet12|Solutions and Comments 12]], [[SearchEnginesWS0910/ExerciseSheet13|Solutions and Comments 13]].
 Line 11:
-= Exercise Sheet 5 =

The recordings of all lectures are now available, see above. Lecture 2 is missing because we had technical problems there. To play the recordings (it's .lpd files) you need the Lecturnity Player. [[http://www.lecturnity.de/de/download/lecturnity-player|You can download the player for free here]].
+Here are our master solutions: [[attachment:SearchEnginesWS0910/solution-6.pdf|Master solution for Exercise Sheet 6 (only Exercise 4)]], [[attachment:SearchEnginesWS0910/solution-midterm.pdf|Master solution for Mid-Term Exam]], [[attachment:SearchEnginesWS0910/solution-9.pdf|Master solution for Exercise Sheet 9]], [[attachment:SearchEnginesWS0910/solution-10.pdf|Master solution for Exercise Sheet 10]], [[attachment:SearchEnginesWS0910/solution-11.pdf|Master solution for Exercise Sheet 11]], [[attachment:SearchEnginesWS0910/solution-12.pdf|Master solution for Exercise Sheet 12]].
-Line 17:
+Line 15:
-[[SearchEnginesWS0910/ExerciseSheet5|Here you can upload your solutions for Exercise Sheet 5]].
+Here is everything about the [[SearchEnginesWS0910/MidTermExam|mid-term exam (December 18, 2009, a pure trial exam which did not count for anything)]] and the [[SearchEnginesWS0910/FinalExam|final exam (March 12, 2010, the real thing which accounted for most of the mark)]].
-Line 19:
+Line 17:
-== Questions or comments below this line, most recent on top please ==

For the exp/bin-search intersection algorithm it clearly matters that it searches for the elements of the smaller list in the larger one. A good implementation will certainly take care of that. Should our implementation also do that or ignore it in order to get 16 measurements that are really different? '''Björn 22Nov09 1:00pm'''
+[[SearchEnginesWS0910/ExerciseSheet14|Here is the table with the links to your uploaded solutions for Exercise Sheet 14]]. The deadline is Thursday 18Feb10 16:00.
-Line 24:
+Line 20:
-Ok, no problem, I'm happy when it's clear now. '''Hannah 22Nov09 0:24am'''
+== More general questions and comments ==
-Line 26:
+Line 22:
-You're right, I misread your comment, sorry. I was thinking of 10MB per list processed in 1 second, resulting in 20MB/s and was wondering where the 100MB/s are coming from. '''Thomas 22Nov09 00:20am'''
+Yes, it's on Wednesday, March 31, 2 - 3 pm in my office (building 51, second floor, room 28). '''Hannah 20Mar10 21:09'''
-Line 28:
+Line 24:
-Hi Thomas, I am at a loss for words here. I am saying a car is driving 20 kilometers and it needs 10 minutes for that, so its average speed was 120 km / hour. And you are saying how can the speed of a car be 120 km / hour, when it only drives 20 kilometers. Well, what should I say. Besides, in my example I clearly said that the two lists ''together'' occupy 10 MB, not 10 MB per list. Please read again what I wrote. '''Hannah 22Nov09 0:16am'''
+Hi, have you decided when and where the exam review will be, yet? '''Marius Mar30th2010 08:50 p.m.'''
-Line 30:
+Line 26:
-Why should two lists of 10MB size result in 100MB processed, if each list is only iterated over once to do the intersection (O(m+n) complexity)? The data processed after all is just 20MB, no matter how the algorithm is implemented (even if it iterates a thousand times over every list, it still just processed 20MB of data). '''Thomas 21Nov09 12:00am'''
+Your marks for the final exam and your overall mark for the course are now available on your personal page. You will have the opportunity to look at your exams on a certain day, which we haven't fixed yet. We will tell you (via this page) when we have fixed that date. '''Hannah 15Mar10 00:39'''
-Line 32:
+Line 28:
-By the way, whenever I talk about "lists" here or on the exercise sheets or in the lecture, I am not referring to a particular data structure (in particular I am NOT talking about a linked list), but "list of elements"  is just "series of elements". And well, "inverted list" is just common terminology. To implement a "list of doc ids" or anything like that you should of course always use an array or a vector or a data structure like that. '''Hannah 21Nov09 8:30pm'''
+Yes, HS026 in building 101. '''Hannah 12Mar10 8:45am'''
-Line 34:
+Line 30:
-Hi Marius + all, let me explain it by an example. Your two input lists occupy a certain amount of memory. Every programming language has built-in functions for this. For example, if your list entries are ints, then for C++ you can use sizeof(int) to get the number of bytes occupied by one entry.  Multiply by the number of list elements to get the number of bytes occupied by one list. One Megabyte (MB) is 1024 * 1024 bytes. Now assume your two lists together occupy 10 MB. Assume your code takes 0.1 seconds to intersect these two lists. Then the "MB processed per second" is 100 MB / second. '''Hannah 21Nov09 8:26pm'''
+Does HS mean it is in the same building 101? I am still new in Freiburg ;). '''Paresh 12 Mar 10 07:45'''
-Line 36:
+Line 32:
-Hi, in exercise 3, what do you mean by "MB processed per second"? Is a MB the equivalent to 4096 processed integers? And when is a MB to be considered as processed? When it's written to the intersected list or in the comparisons, already? '''Marius 21Nov09 7:33pm'''
+I am sorry that we have not managed to produce the master solutions for exercise sheets 13 and 14 yet. Lecture 13 (clustering, part 2) is still relevant for the exam, but I can tell you that there will be ''no'' task about Lecture 14 (statistical hypothesis testing). I think that's fair, because there was no tutorial for the (last) exercise sheet 14. Exercise sheet 14 will count as a normal exercise sheet, however. '''Hannah 11Mar10 15:21'''
-Line 38:
+Line 34:
-The slides + all my hand-writing on it are now online, see the link ''Recording Lecture 5 (no audio)'' above. '''Hannah 20Nov09 3:24am'''
+Hi, since there are still no master solutions for sheets 13-14, I assume the contents of the lecture concerning these sheets are not relevant for the exam. '''Marius Mar11th 2:23 p.m.'''
-Line 40:
+Line 36:
-The recording of today's lecture again did not work. I am very sorry for that (and very angry that there are so many problems with this software). Anyway, the end result of the lecture, that is, the slides with all the writing on them, is available and I will put it online as soon as possible. '''Hannah 19Nov09 11:23pm'''
+To Johannes + all: you are not allowed to bring any computing devices whatsoever and you won't need them. If there is a task which requires a calculation that is unreasonable to do by hand (like the log_2(10/7) from the mid-term exam), we will tell you what it is or an approximation to work with (for example that you can take log_2(10/7) as 0.5). '''Hannah 10Mar10 20:42'''
-Line 42:
+Line 38:
-There is a typo in Exercise 5 of the new sheet. The two occurrences of ''n'' should be ''m''. '''Hannah 19Nov09 11:22pm'''
+'''Exam and portable calculators''': "2. You are not allowed to use any computing devices, mobile phones, etc." I had some problems with pen, paper and sqrt(1080). May we bring calculators? '''Johannes 2010-03-10T20:27''' 

Thanks, the solution for sheet6 ex4 helped us a lot! '''björn'''

@Jonas: thanks for the comment, I have corrected it in the master solution. @Björn: I added a master solution for Exercise Sheet 6 (only Exercise 4), linked above, with what I think is a very short and simple proof. Tell me if you find anything wrong with it. '''Hannah 10Mar10 16:40'''

Jonas: Yes, that was already mentioned in the tutorials. '''Marjan 10Mar10 15:58'''

Hi. Concerning exercise sheet 10, exercise 1: shouldn't you take the square roots of 108 and 10 (in the matrix EPSILON)? Otherwise the equation is not right. '''Jonas 10.03.10'''

Hi, we have a question concerning ex sheet 6, exercise 4. In the tutorial Marjan presented a solid but complicated solution using Taylor expansion. In the lecture you mentioned that this wasn't necessary for any exercise. Unfortunately we fail at finding a simpler but still mathematically rigorous solution. Would you please give a brief idea of how to prove such inequalities, as this might be useful for similar, yet easier exercises in the exam? '''Björn Mi 15:12'''

Hi Johannes + all. Here is a very simple example: let the query word be ''algorithm'' and one candidate similar word computed by the permuted lexicon be ''algXXXthm'' (the common prefix is ''thmalg'' [from the permutations ''thmalgori'' and ''thmalgXXX''] which is long enough) and let the edit distance threshold be 2. Obviously this candidate word will be filtered out because the edit distance is 3. '''Marjan 07Mar10 18:57'''  

'''Filtering with a Permuterm Index''': The slide states: "for all matches thus found, compute the actual edit distance". Is there a simple strawman example for a word that gets removed in the post-filtering step? (Today is silly question day.) '''Johannes 2010-03-07T18:26'''

Hi Johannes + all. Concerning your inverted index question: it really depends on the application, if you have lists of only doc ids and want to intersect them fast, you would sort the lists by doc id, if you want to do top-k you would sort them by score. Duplicates only make sense when you also store positional information, which we didn't do in the lecture. Concerning your Elias-Gamma question: there is an upper bound, which I think we also derived in the lecture, and that is log n + O(log^(k) n) + O(1), but I couldn't tell you what are the constants hidden in the two Big-Ohs. '''Hannah 7Mar10 18:19'''

'''Inverted indexes and the like''': Suppose an inverted index maps a word w (perhaps a string) to a subset W(w) of the set of all documents (perhaps only the IDs as numbers). Is W(w) always sorted? Does it contain duplicates? For some applications (and the algorithms for them) this seems to matter. I'm just asking in case of an exam task involving coding (especially k-way merge). '''Johannes 2010-03-07T13:54'''

'''Elias-Gamma Encoding''': Is there a closed form for the length of the code for an integer x when Elias-Gamma is iterated k times? '''Johannes 2010-03-07T15:14'''

== Questions and comments about the master solution of the mid-term exam ==

'''Johannes 2010-03-07T12:40''' :

'''1.3''': CLAIM: If an encoding is prefix-free, then there is no code that is a prefix of a different code. Does this claim hold? If so, then 001 mustn't be a code, since 0 is a code and a prefix of 001. Is this right?

There was an obvious mistake which I now corrected (00 should be mapped to 1, not 0). '''Hannah 7Mar10 12:56'''
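The claim in 1.3 can be checked mechanically for any finite set of code words. The following is a small sketch; the function name `is_prefix_free` and the second code set are hypothetical and not taken from the master solution.

```python
def is_prefix_free(codes):
    """Return True if no code word is a proper prefix of a different code word."""
    codes = set(codes)
    return not any(a != b and b.startswith(a) for a in codes for b in codes)


# 0 is a prefix of 001, so a code containing both is not prefix-free.
print(is_prefix_free(["0", "001"]))
# A hypothetical code set in which no word is a prefix of another.
print(is_prefix_free(["1", "01", "001", "000"]))
```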

'''1.4''': It states: "For a sequence of length n, we need to generate n/2 such codes [...]." Doesn't each of the n symbols of the sequence get encoded?

Each code stands for two bits at a time, so for a sequence of n bits, you have to generate n/2 codes. I replaced ''sequence of length n'' by ''sequence of n bits'' to make this clearer. '''Hannah 7Mar10 12:58'''

'''3.4''': The function returns the number of common k-grams (as far as I see). Can the return-line be completed with a call to the function from 3.2 to return the Jaccard-distance?
 
Yes, indeed, I replaced ''return l'' by ''return jaccardDistance(x, y, k, l)''. '''Hannah 7Mar10 13:01'''
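As a rough illustration of what such a return value looks like, here is a self-contained sketch of the Jaccard distance on k-grams. It does not reproduce the `jaccardDistance(x, y, k, l)` function from the master solution, whose body is not shown here; it assumes set semantics (each distinct k-gram counts once) and no padding, and the names `kgrams` and `jaccard_distance` are chosen for this example.

```python
def kgrams(s, k):
    """Set of all substrings of length k of s (no padding)."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def jaccard_distance(x, y, k):
    """1 - |common k-grams| / |all k-grams of x or y|."""
    a, b = kgrams(x, k), kgrams(y, k)
    if not a and not b:
        return 0.0  # both strings shorter than k: treat as identical
    return 1 - len(a & b) / len(a | b)


# "abcd" and "abce" share the 2-grams ab and bc out of four distinct 2-grams.
print(jaccard_distance("abcd", "abce", 2))
```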

'''5.4''': Does the top-k algorithm return the top k documents? If so, which k had to be used in this task? What exactly is the condition for stopping? What exactly is the update rule for the ranges? My idea is that (for a fixed document) the minimum is always the known minimum from any of the lists and the maximum is always the (already known) minimum plus the lowest score seen in any list different from the one the minimum is from. In case of only two lists there may be some simplifications.

The task asked for the ''top-ranked document'', so k = 1. We can stop when the upper bound for all documents not yet seen is ''strictly'' below the k-th largest lower bound so far, and when the score ranges for the documents already seen are such that it is clear which are the top-k documents and in which order. If there are ties, and we don't care how they are broken, and we don't care to know the order of the top-k documents, we can sometimes stop earlier. Does this answer all your questions? '''Hannah 7Mar10 13:06'''
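The stopping condition described above can be sketched for k = 1 as follows. This is not the algorithm from the exam or the lecture slides, only an illustration under stated assumptions: each list holds (doc id, score) pairs sorted by score in descending order, scores are non-negative, the total score of a document is the sum over all lists, and a document missing from a list scores 0 there. The name `top1` and the example data are hypothetical.

```python
def top1(lists):
    """Read all lists in parallel, one entry per round, until the top document is certain.

    Returns (doc_id, number_of_rounds_read).
    """
    m = len(lists)
    seen = {}  # doc id -> {list index: score seen in that list}
    last = [lst[0][1] if lst else 0.0 for lst in lists]  # last score read per list
    depth = 0
    while depth < max(len(lst) for lst in lists):
        for i, lst in enumerate(lists):
            if depth < len(lst):
                doc, score = lst[depth]
                seen.setdefault(doc, {})[i] = score
                last[i] = score
            else:
                last[i] = 0.0  # list exhausted: nothing more to gain from it
        depth += 1
        # Lower bound: sum of the scores seen so far for the document.
        lower = {d: sum(s.values()) for d, s in seen.items()}
        # Upper bound: lower bound plus the last score read in each list
        # where the document has not been seen yet.
        upper = {d: lower[d] + sum(last[i] for i in range(m) if i not in seen[d])
                 for d in seen}
        best = max(lower, key=lower.get)
        unseen_upper = sum(last)  # best possible score of a document not yet seen
        others_upper = max((upper[d] for d in seen if d != best),
                           default=float("-inf"))
        # Stop when neither an unseen nor another seen document can reach the best.
        if unseen_upper < lower[best] and others_upper < lower[best]:
            return best, depth
    # All lists fully read: lower bounds are now exact scores.
    return max(lower, key=lower.get), depth


list_a = [("d1", 0.9), ("d2", 0.8), ("d3", 0.1)]
list_b = [("d2", 0.7), ("d1", 0.2), ("d3", 0.1)]
print(top1([list_a, list_b]))
```

In this example the top document is known after two rounds, so the third entry of each list never has to be read. With ties this sketch simply reads on, which matches the strict inequality in the stopping condition.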

Thanks a lot for your comments! Please go on if you have more. '''Hannah 7Mar10 13:07'''

Thanks a lot for your answers! '''Johannes 2010-03-07T13:44'''

== Questions and comments about Exercise Sheet 14 below this line (most recent on top) ==

Hi Johannes: why don't you start with the first few questions, and then let's see whether it makes sense to continue this via the Wiki, or via private email, or via a meeting in person. '''Hannah 6Mar10 17:36'''

Yes, the final exam is like the mid-term exam in this respect. '''Hannah 6Mar10 17:36'''

Alex: http://vulcano.informatik.uni-freiburg.de/wiki/teaching/SearchEnginesWS0910/MidTermExam, so it seems to be allowed. '''Mirko, 6Mar10 16:10'''

Hi, I was wondering, will the exam next week also be an open book exam like the mid-term? Perhaps I overlooked it, but I don't think this is stated anywhere yet. '''Alex 6Mar10 13:49'''

I have lots of questions and don't know where to put them. I suppose this wiki-page will get chaotic pretty fast if I post 20 questions. '''Johannes VI Mar MMX 12:00'''

I'm sorry for the delay with the master solutions. I am at a conference right now but will try to make progress with this over the weekend. '''Hannah 4Mar10 23:59'''

Do we get master solutions for ex. 11, 12, 13 and 14? '''Johannes 04Mar2010 23:32 ZULU'''

Now they're there again. '''Marjan 01Mar18:09'''

ARGH! I'm very sorry. My Down-Them-All Plugin for Firefox seems to have deleted all the lecture PDFs! Sorry for that. Rollback to previous versions does not seem to work. I hope, someone has already downloaded them all and is able to restore them! SORRY! Interesting, I've got the rights to delete something from the main page, though. '''Marius Mar 1st 2010 2:38 p.m.'''

(Reminder:) Hello, the master solutions are not online, yet. '''alex n 1Mar10 11:08'''

Yes, we are working on it. Please remind us again if they aren't online by the end of this week. '''Hannah 23Feb10 14:30'''

Do we get master solutions for ex. 11, 12, 13 and 14? '''Johannes 23Feb10 14:05'''

Hi Matthias, yes, Pr(A) = 1 - Pr(not A), for any event A, and so for any random variable X, Pr(X <= x) = 1 - Pr(X > x), because X <= x and X > x are complementary events. For continuous random variables (like variables with a normal distribution), the difference between <= and < and >= and > is immaterial, because Pr(X = x) = 0 for each fixed x. But anyway, to compute the probability, you first have to transform it a bit, like I did in the lecture, and then obtain Pr(N(0,1) >= sqrt(n1) * (µ1 - µ) / σ) and Pr(N(0,1) <= sqrt(n2) * (µ2 - µ) / σ). To evaluate the latter you can also simply use the symmetry of the normal distribution, due to which one has Pr(N(0,1) <= -x) = Pr(N(0,1) >= x). '''Hannah 18Feb10 12:58'''
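For anyone who wants to evaluate these tail probabilities numerically, here is a minimal sketch using the complementary error function from the Python standard library. The function names and all numbers in the example (µ = 100, σ = 200, n = 16, observed means 150 and 50) are illustrative assumptions, not values from the exercise sheet.

```python
import math


def upper_tail(x):
    """Pr(N(0,1) >= x), computed via the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2))


def lower_tail(x):
    """Pr(N(0,1) <= x), using the symmetry Pr(N(0,1) <= x) = Pr(N(0,1) >= -x)."""
    return upper_tail(-x)


def prob_sample_mean_at_least(mu, sigma, n, observed_mean):
    """Pr(sample mean >= observed_mean) for n samples with mean mu and std. dev. sigma."""
    return upper_tail(math.sqrt(n) * (observed_mean - mu) / sigma)


def prob_sample_mean_at_most(mu, sigma, n, observed_mean):
    """Pr(sample mean <= observed_mean) for n samples with mean mu and std. dev. sigma."""
    return lower_tail(math.sqrt(n) * (observed_mean - mu) / sigma)


# Hypothetical values: both observed means are one standard error away from mu.
print(prob_sample_mean_at_least(100, 200, 16, 150))
print(prob_sample_mean_at_most(100, 200, 16, 50))
```

Both calls give the same value here, because the two observed means lie symmetrically around µ and the sample sizes are equal.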

Hi, how can we compute Pr(N(n2 * µ, n2 * σ^2^) <= n2 * µ2)? Can we use 1 - Pr(N(n2 * µ, n2 * σ^2^) >= n2 * µ2) for that? '''Matthias 18Feb10 12:01'''

Hi Florian + all, one of µ1 and µ2 is larger than µ and one is smaller. Let's assume µ1 is larger and µ2 is smaller. Then for µ1 you have to look at Pr(N(n1 * µ, n1 * σ^2^) >= n1 * µ1). But for µ2 you have to look at Pr(N(n2 * µ, n2 * σ^2^) <= n2 * µ2). Note the <= instead of the >= for the second probability. Recall the meaning of these probabilities. Just as an example, let µ be 100 and µ1 be 150 and µ2 be 50. Then the first probability means: what is the probability that I see a mean of ''150 or more'' in my first sample, although the mean of my distribution is 100. The second probability means: what is the probability that I see a mean of ''50 or less'' in my second sample, although the mean of my distribution is 100. If you take <= for both probabilities or >= for both, it is to be expected that you get two completely different probabilities, one very low and one very high (except when they are both close to 50%). Please ask again if this is still unclear. '''Hannah 17Feb10 21:51'''

Sorry, with probability for µ1 I meant Pr(N(n1 * µ, n1 * σ^2^) >= n1 * µ1) and accordingly with probability for µ2 I meant Pr(N(n2 * µ, n2 * σ^2^) >= n2 * µ2) where n1=n2 for the exercise sheet.  '''Florian 17Feb10 21:18'''

Hi Florian, what exactly do you mean by ''probability for µ1'' and ''probability for µ2''? '''Hannah 17Feb10 21:02'''

Hi, what values are we expected to get for exercise 4? I always get a probability of about 99.9% for μ1 and a value of about 0.07% for μ2, can that be? '''Florian 17Feb10 18:25'''

Hi Florian, yes, the ''averages'' in Exercise 3 should be ''average running times''. I uploaded a new version of the sheet, where I corrected this. '''Hannah 14Feb10 17:48'''

Hi, I guess we should measure the running times to determine the efficiency of the programs for exercise 3? '''Florian 15Feb10 17:42'''

Hi Claudius, you should compute Pr(D|H0), exactly as done in the lecture for Example 2, where we computed this probability as Pr(X > x), where X is a random variable with distribution N(0,1), that is, normal with mean 0 and variance 1, and x depends on the mean and variance of your data. '''Hannah 14Feb10 16:44'''

Hi. If I have understood correctly, we have to compute Pr(H|D) in Exercise 4. From statistical hypothesis testing, we get Pr(D|H). Now, Pr(H|D) = Pr(D|H) * (Pr(H) / Pr(D)). We know Pr(D|H) and we can compute Pr(D), but what value do we have to use for Pr(H)? '''Claudius 14Feb10 14:41'''

Hi Eric, I don't care whether you use integers or doubles, but I am curious why the one should be any harder than the other? '''Hannah 12Feb10 19:02'''

May we use integers for sorting? Or do we have to use doubles? This is important for generating my sorted array '''Eric 12Feb10 18:56'''

If you're asking about the merging you can of course use a priority queue if you want, but you don't really need it when merging 2 lists. '''Marjan 18:28'''
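To show why no priority queue is needed for two lists, here is a minimal sketch of a two-way merge. It assumes both inputs are already sorted in ascending order; the name `merge` is chosen for this example and nothing here is prescribed by the exercise sheet.

```python
def merge(list1, list2):
    """Merge two ascending sorted lists into one ascending sorted list."""
    result = []
    i = j = 0
    while i < len(list1) and j < len(list2):
        if list1[i] <= list2[j]:
            result.append(list1[i])
            i += 1
        else:
            result.append(list2[j])
            j += 1
    # At most one of the two lists still has elements left.
    result.extend(list1[i:])
    result.extend(list2[j:])
    return result


print(merge([1, 4, 9], [2, 3, 10]))
```

With only two lists, the smallest remaining element is found with a single comparison. A priority queue only starts to pay off when merging many lists at once.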

Why would you use a priority queue? It's simple sorting, the exercise is not about implementing your own sorting algorithm or something like that. About exercise 3, it should be clear from the exercise itself that the sequences should be sorted (otherwise how can the merging work?) '''Marjan 18:23'''

Does that mean we only have to use a priority queue or something like that and don't have to implement the sorting ourselves? And in Exercise 3, should the random set be an ordered one or not? '''Alex 12Feb10 18:19'''

We prefer randomized sorting using bitonic networks, alternatively combined with LSD radix sort or simple pancake sort. That's of course a joke; it should be clear that you can use the built-in sorting functions (your own implementation will certainly be slower). '''Marjan 12Feb10 18:12'''

What does "do a standard sort" in exercise 2 mean? Shall I implement one on my own, or may I use the Java built-in sorting mechanisms? Also, which sorting algorithm do you prefer for this? '''Eric 12Feb10 18:04'''