Size: 23462
Comment:
|
← Revision 795 as of 2011-07-22 19:29:57 ⇥
Size: 20802
Comment:
|
Line 1: | Line 1: |
#acl All:read |
|
Line 3: | Line 5: |
Here are PDFs of the slides of the lectures so far: [[attachment:SearchEnginesWS0910/lecture-1.pdf|Lecture 1]], [[attachment:SearchEnginesWS0910/lecture-2.pdf|Lecture 2]], [[attachment:SearchEnginesWS0910/lecture-3.pdf|Lecture 3]], [[attachment:SearchEnginesWS0910/lecture-4.pdf|Lecture 4]], [[attachment:SearchEnginesWS0910/lecture-5.pdf|Lecture 5]], [[attachment:SearchEnginesWS0910/lecture-6.pdf|Lecture 6]], [[attachment:SearchEnginesWS0910/lecture-7.pdf|Lecture 7]], [[attachment:SearchEnginesWS0910/lecture-8.pdf|Lecture 8]], [[attachment:SearchEnginesWS0910/lecture-9.pdf|Lecture 9]], [[attachment:SearchEnginesWS0910/lecture-10.pdf|Lecture 10]], [[attachment:SearchEnginesWS0910/lecture-11.pdf|Lecture 11]], [[attachment:SearchEnginesWS0910/lecture-12.pdf|Lecture 12]]. | Here are PDFs of the slides of the lectures: [[attachment:SearchEnginesWS0910/lecture-1.pdf|Lecture 1]] (Intro), [[attachment:SearchEnginesWS0910/lecture-2.pdf|Lecture 2]] (socket communication), [[attachment:SearchEnginesWS0910/lecture-3.pdf|Lecture 3]] (ranking), [[attachment:SearchEnginesWS0910/lecture-4.pdf|Lecture 4]] (IO-efficiency & compression), [[attachment:SearchEnginesWS0910/lecture-5.pdf|Lecture 5]] (list intersection), [[attachment:SearchEnginesWS0910/lecture-6.pdf|Lecture 6]] (prefix search), [[attachment:SearchEnginesWS0910/lecture-7.pdf|Lecture 7]] (javascript), [[attachment:SearchEnginesWS0910/lecture-8.pdf|Lecture 8]] (error-tolerant search), [[attachment:SearchEnginesWS0910/lecture-9.pdf|Lecture 9]] (programming languages & UTF-8), [[attachment:SearchEnginesWS0910/lecture-10.pdf|Lecture 10]] (latent semantic indexing), [[attachment:SearchEnginesWS0910/lecture-11.pdf|Lecture 11]] (naive bayes classification), [[attachment:SearchEnginesWS0910/lecture-12.pdf|Lecture 12]] (clustering), [[attachment:SearchEnginesWS0910/lecture-13.pdf|Lecture 13]] (hierarchical clustering), [[attachment:SearchEnginesWS0910/lecture-14.pdf|Lecture 14]] (hypothesis testing), 
[[attachment:SearchEnginesWS0910/lecture-projects.pdf|Projects]]. |
Line 5: | Line 7: |
Here are the recordings of the lectures so far (except Lecture 2, where we had problems with the microphone), LPD = Lecturnity recording: [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-1.lpd|Recording Lecture 1 (LPD)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-3.lpd|Recording Lecture 3 (LPD)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-4.lpd|Recording Lecture 4 (LPD)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-5.lpd|Recording Lecture 5 (LPD without audio)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-6.lpd|Recording Lecture 6 (LPD)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-7.avi|Recording Lecture 7 (AVI)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-8.avi|Recording Lecture 8 (AVI)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-9.avi|Recording Lecture 9 (AVI)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-10.avi|Recording Lecture 10 (AVI)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-11.avi|Recording Lecture 11 (AVI)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-12.avi|Recording Lecture 12 (AVI)]]. 
| Here are the recordings of the lectures (except Lecture 2, where we had problems with the microphone), LPD = Lecturnity recording: [[http://vulcano.informatik.uni-freiburg.de/recordings/searchengines-ws0910/lecture-1.lpd|Recording Lecture 1 (LPD)]], [[http://vulcano.informatik.uni-freiburg.de/recordings/searchengines-ws0910/lecture-3.lpd|Recording Lecture 3 (LPD)]], [[http://vulcano.informatik.uni-freiburg.de/recordings/searchengines-ws0910/lecture-4.lpd|Recording Lecture 4 (LPD)]], [[http://vulcano.informatik.uni-freiburg.de/recordings/searchengines-ws0910/lecture-5.lpd|Recording Lecture 5 (LPD without audio)]], [[http://vulcano.informatik.uni-freiburg.de/recordings/searchengines-ws0910/lecture-6.lpd|Recording Lecture 6 (LPD)]], [[http://vulcano.informatik.uni-freiburg.de/recordings/searchengines-ws0910/lecture-7.avi|Recording Lecture 7 (AVI)]], [[http://vulcano.informatik.uni-freiburg.de/recordings/searchengines-ws0910/lecture-8.avi|Recording Lecture 8 (AVI)]], [[http://vulcano.informatik.uni-freiburg.de/recordings/searchengines-ws0910/lecture-9.avi|Recording Lecture 9 (AVI)]], [[http://vulcano.informatik.uni-freiburg.de/recordings/searchengines-ws0910/lecture-10.avi|Recording Lecture 10 (AVI)]], [[http://vulcano.informatik.uni-freiburg.de/recordings/searchengines-ws0910/lecture-11.avi|Recording Lecture 11 (AVI)]], [[http://vulcano.informatik.uni-freiburg.de/recordings/searchengines-ws0910/lecture-12.avi|Recording Lecture 12 (AVI)]], [[http://vulcano.informatik.uni-freiburg.de/recordings/searchengines-ws0910/lecture-13.avi|Recording Lecture 13 (AVI)]], [[http://vulcano.informatik.uni-freiburg.de/recordings/searchengines-ws0910/lecture-14.avi|Recording Lecture 14 (AVI)]]. To play the Lecturnity recordings (.lpd files) you need the [[http://www.lecturnity.de/de/download/lecturnity-player|Lecturnity Player, which you can download here]]. 
I put the Camtasia recordings as .avi files, which you can play with any ordinary video player; I would recommend [[http://www.videolan.org/vlc|VLC]]. |
Line 7: | Line 9: |
Here are PDFs of the exercise sheets so far: [[attachment:SearchEnginesWS0910/exercise-1.pdf|Exercise Sheet 1]], [[attachment:SearchEnginesWS0910/exercise-2.pdf|Exercise Sheet 2]], [[attachment:SearchEnginesWS0910/exercise-3.pdf|Exercise Sheet 3]], [[attachment:SearchEnginesWS0910/exercise-4.pdf|Exercise Sheet 4]], [[attachment:SearchEnginesWS0910/exercise-5.pdf|Exercise Sheet 5]], [[attachment:SearchEnginesWS0910/exercise-6.pdf|Exercise Sheet 6]], [[attachment:SearchEnginesWS0910/exercise-7.pdf|Exercise Sheet 7]], [[attachment:SearchEnginesWS0910/exercise-8.pdf|Exercise Sheet 8]], [[attachment:SearchEnginesWS0910/exercise-9.pdf|Exercise Sheet 9]], [[attachment:SearchEnginesWS0910/exercise-10.pdf|Exercise Sheet 10]], [[attachment:SearchEnginesWS0910/exercise-11.pdf|Exercise Sheet 11]], [[attachment:SearchEnginesWS0910/exercise-12.pdf|Exercise Sheet 12]]. | Here are PDFs of the exercise sheets so far: [[attachment:SearchEnginesWS0910/exercise-1.pdf|Exercise Sheet 1]], [[attachment:SearchEnginesWS0910/exercise-2.pdf|Exercise Sheet 2]], [[attachment:SearchEnginesWS0910/exercise-3.pdf|Exercise Sheet 3]], [[attachment:SearchEnginesWS0910/exercise-4.pdf|Exercise Sheet 4]], [[attachment:SearchEnginesWS0910/exercise-5.pdf|Exercise Sheet 5]], [[attachment:SearchEnginesWS0910/exercise-6.pdf|Exercise Sheet 6]], [[attachment:SearchEnginesWS0910/exercise-7.pdf|Exercise Sheet 7]], [[attachment:SearchEnginesWS0910/exercise-8.pdf|Exercise Sheet 8]], [[attachment:SearchEnginesWS0910/exercise-9.pdf|Exercise Sheet 9]], [[attachment:SearchEnginesWS0910/exercise-10.pdf|Exercise Sheet 10]], [[attachment:SearchEnginesWS0910/exercise-11.pdf|Exercise Sheet 11]], [[attachment:SearchEnginesWS0910/exercise-12.pdf|Exercise Sheet 12]], [[attachment:SearchEnginesWS0910/exercise-13.pdf|Exercise Sheet 13]], [[attachment:SearchEnginesWS0910/exercise-14.pdf|Exercise Sheet 14]]. |
Line 9: | Line 11: |
Here are your solutions and comments on the previous exercise sheets: [[SearchEnginesWS0910/ExerciseSheet1|Solutions and Comments 1]], [[SearchEnginesWS0910/ExerciseSheet2|Solutions and Comments 2]], [[SearchEnginesWS0910/ExerciseSheet3|Solutions and Comments 3]], [[SearchEnginesWS0910/ExerciseSheet4|Solutions and Comments 4]], [[SearchEnginesWS0910/ExerciseSheet5|Solutions and Comments 5]], [[SearchEnginesWS0910/ExerciseSheet6|Solutions and Comments 6]], [[SearchEnginesWS0910/ExerciseSheet7|Solutions and Comments 7]], [[SearchEnginesWS0910/ExerciseSheet8|Solutions and Comments 8]], [[SearchEnginesWS0910/ExerciseSheet9|Solutions and Comments 9]], [[SearchEnginesWS0910/ExerciseSheet10|Solutions and Comments 10]]. | Here are your solutions and comments on the previous exercise sheets: [[SearchEnginesWS0910/ExerciseSheet1|Solutions and Comments 1]], [[SearchEnginesWS0910/ExerciseSheet2|Solutions and Comments 2]], [[SearchEnginesWS0910/ExerciseSheet3|Solutions and Comments 3]], [[SearchEnginesWS0910/ExerciseSheet4|Solutions and Comments 4]], [[SearchEnginesWS0910/ExerciseSheet5|Solutions and Comments 5]], [[SearchEnginesWS0910/ExerciseSheet6|Solutions and Comments 6]], [[SearchEnginesWS0910/ExerciseSheet7|Solutions and Comments 7]], [[SearchEnginesWS0910/ExerciseSheet8|Solutions and Comments 8]], [[SearchEnginesWS0910/ExerciseSheet9|Solutions and Comments 9]], [[SearchEnginesWS0910/ExerciseSheet10|Solutions and Comments 10]], [[SearchEnginesWS0910/ExerciseSheet11|Solutions and Comments 11]], [[SearchEnginesWS0910/ExerciseSheet12|Solutions and Comments 12]], [[SearchEnginesWS0910/ExerciseSheet13|Solutions and Comments 13]]. |
Line 11: | Line 13: |
Here are our master solutions: [[attachment:SearchEnginesWS0910/solution-midterm.pdf|Master solution for Mid-Term Exam]],[[attachment:SearchEnginesWS0910/solution-9.pdf|Master solution for Exercise Sheet 9]], [[attachment:SearchEnginesWS0910/solution-10.pdf|Master solution for Exercise Sheet 10]]. The recordings of all lectures are now available, see above. Lecture 2 is missing because we had technical problems there. To play the Lecturnity recordings (.lpd files) you need the [[http://www.lecturnity.de/de/download/lecturnity-player|Lecturnity Player, which you can download here]]. I put the Camtasia recordings as .avi files, which you can play with any ordinary video player; I would recommend [[http://www.videolan.org/vlc|VLC]]. |
Here are our master solutions: [[attachment:SearchEnginesWS0910/solution-6.pdf|Master solution for Exercise Sheet 6 (only Exercise 4)]], [[attachment:SearchEnginesWS0910/solution-midterm.pdf|Master solution for Mid-Term Exam]],[[attachment:SearchEnginesWS0910/solution-9.pdf|Master solution for Exercise Sheet 9]], [[attachment:SearchEnginesWS0910/solution-10.pdf|Master solution for Exercise Sheet 10]], [[attachment:SearchEnginesWS0910/solution-11.pdf|Master solution for Exercise Sheet 11]], [[attachment:SearchEnginesWS0910/solution-12.pdf|Master solution for Exercise Sheet 12]]. |
Line 17: | Line 17: |
[[SearchEnginesWS0910/MidTermExam|Here is everything about the mid-term exam]]. | Here is everything about the [[SearchEnginesWS0910/MidTermExam|mid-term exam (December 18, 2009, a pure trial exam which did not count for anything)]] and the [[SearchEnginesWS0910/FinalExam|final exam (March 12, 2010, the real thing which accounted for most of the mark)]]. |
Line 19: | Line 19: |
[[attachment:dblp.txt|Here is the file for the Exercise Sheet 11]]. It's a text file, where each line contains the name of the conference (in capital letters), followed by a TAB (ASCII code 9), followed by the title. There are three different conferences: STOC (2423 titles), SIGIR (2372 titles), and SIGGRAPH (1835 titles). The total number of titles / lines is 6630. The exact file size is 454365 bytes. | [[SearchEnginesWS0910/ExerciseSheet14|Here is the table with the links to your uploaded solutions for Exercise Sheet 14]]. The deadline is Thursday 18Feb10 16:00. |
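The file format described above (conference name, TAB, title, one record per line) can be read with a few lines of Python. This is only a minimal sketch: the function name `parse_records` and the sample titles are made up for illustration and do not come from dblp.txt.

```python
from collections import Counter


def parse_records(lines):
    """Split each 'CONFERENCE<TAB>title' line into a (label, title) pair."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        # Split only at the first TAB, in case a title itself contains one.
        label, title = line.split("\t", 1)
        records.append((label, title))
    return records


# Hypothetical sample lines in the described format (not taken from dblp.txt).
sample = ["STOC\tA Title\n", "SIGIR\tAnother Title\n", "STOC\tThird\n"]
records = parse_records(sample)
label_counts = Counter(label for label, _ in records)
```

For the real file one would pass an open file handle instead of `sample`; the per-conference counts in `label_counts` can then be compared with the numbers given above (2423, 2372, 1835).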
Line 21: | Line 21: |
[[SearchEnginesWS0910/ExerciseSheet11|Here is the table with the links to your uploaded solutions for Exercise Sheet 11]]. The deadline is Thursday 28Jan10 16:00. | |
Line 23: | Line 22: |
== Questions and comments about Exercise Sheet 12 below this line (most recent on top) == | == More general questions and comments == |
Line 25: | Line 24: |
Oh yes, what Marjan wrote is actually very important, and I didn't have it on my mind when I wrote the exercise sheet. K-means won't do something particularly meaningful if in each assignment step you assign each point / record to the *farthest* cluster. And it makes sense that it alternates between two clusterings at some point: assuming that the centroids don't change anymore, after you assign a point to its *farthest* centroid, the centroid to which it was assigned before will be its farthest centroid in the next step. Cool stuff. '''Hannah 2Feb10 20:19''' | Yes, it's on Wednesday, March 31, 2 - 3 pm in my office (building 51, second floor, room 28). '''Hannah 20Mar10 21:09''' |
Line 27: | Line 26: |
To Hannah: I tried 10000 for M and the RSS value is still sometimes increasing and sometimes decreasing. I'll change my program according to Marjan's observation now and report again... EDIT: Ok after changing this everything seems to make sense again, the RSS value is now constantly increasing until it does not change anymore! '''Florian 02Feb10 20:10''' | Hi, have you decided when and where the exam review will be, yet? '''Marius Mar30th2010 08:50 p.m.''' |
Line 29: | Line 28: |
To all: Just a small but important observation. The "Jaccard distance" from the exercise sheet is in fact a similarity (between two texts), meaning that your RSS should in fact grow. This also means that you should pick the cluster centroids for each text with the maximum value of the Jaccard similarity. '''Marjan 02Feb10 19:46''' | Your marks for the final exam and your overall mark for the course are now available on your personal page. You will have the opportunity to look at your exams on a certain day, which we haven't fixed yet. We will tell you (via this page) when we have fixed that date. '''Hannah 15Mar10 00:39''' |
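To illustrate Marjan's point, here is a minimal sketch of the Jaccard similarity between two texts and of the resulting assignment step, which picks the centroid with the *maximum* similarity. It assumes that texts and centroids are compared as sets of lowercased, whitespace-separated words; the function names are hypothetical and not from the exercise sheet.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union of the word sets."""
    words_a = set(a.lower().split())
    words_b = set(b.lower().split())
    if not words_a and not words_b:
        return 0.0  # convention chosen here for two empty texts
    return len(words_a & words_b) / len(words_a | words_b)


def assign_to_centroid(text, centroids):
    """Return the index of the centroid that is most similar to the text."""
    similarities = [jaccard_similarity(text, c) for c in centroids]
    return max(range(len(centroids)), key=lambda i: similarities[i])
```

Since this is a similarity and not a distance, the sum over all texts should grow from one iteration to the next, as observed above.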
Line 31: | Line 30: |
Sure, this is a meaningful way to pick the initial centroids, too. However, there is no guarantee at all, that most SIGIR papers will be assigned to the cluster for which the initial centroid was a SIGIR paper, simply because the title of that paper might have been very untypical for SIGIR. So you might end up with very low precision and recall, although for another assignment of the clusters to the labels / classes, precision and recall might actually be very good. '''Hannah 2Feb10 19:32''' | Yes, HS026 in building 101. '''Hannah 12Mar10 8:45am''' |
Line 33: | Line 32: |
Hi, is it also allowed to choose the initial centroids as follows: first centroid = random SIGGRAPH text, second centroid = random SIGIR text, third centroid = random STOC text? Then we would also know which centroid belongs to which class. '''Mirko 2Feb 19:27''' | Does the HS mean same 101 building? I am still new in Freiburg ;). '''Paresh 12 Mar 10 07:45''' |
Line 35: | Line 34: |
Concerning the RSS values: I haven't written the program myself, but Marjan promised to do so soon. But here is some intuition. First, I would also output the RSS value divided by the number of titles (6630), and the square root of that. Let's call this number J. This gives you the average Jaccard distance of a text to its centroid. What is this J on average? Well, the number of words in your centroids quickly becomes M. In the perfect case, all the words of each text are contained in its nearest centroid. Then the Jaccard distance is text length / M, and so J should be average text length / M. '''Hannah 2Feb10 19:13''' | I am sorry that we have not managed to produce the master solutions for exercise sheets 13 and 14 yet. However, Lecture 13 is still relevant for the exam (clustering, part 2), but I can tell you that there will be ''no'' task about Lecture 14 (statistical hypothesis testing). I think that's fair, because there was no tutorial for the (last) exercise sheet 14. Exercise sheet 14 will count as a normal exercise sheet, however. '''Hannah 11Mar10 15:21''' |
Line 37: | Line 36: |
About computing precision and recall: yes, you need an assignment of the clusters to the three classes SIGGRAPH, SIGIR, STOC. I would do that assignment as follows. Pick the first cluster in your numbering of clusters. Find the most frequent label in the cluster (you have the labels of all titles from dblp.txt). Assign the cluster to that label. Then pick the second cluster, and find the most frequent of the remaining labels, two in this case. And so on, where for three clusters there is no "and so on", because you have only one choice of label left for the third cluster. Once you have a 1-1 assignment from clusters to labels, it should be clear how to compute precision and recall (same definitions as in the case of classification). '''Hannah 2Feb10 18:57''' | Hi, since there are still no master solutions for sheets 13-14, I assume the contents of the lecture concerning these sheets are not relevant for the exam. '''Marius Mar11th 2:23 p.m.''' |
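The greedy assignment of clusters to labels described above can be sketched as follows. This is only an illustration under assumptions: each cluster is given as the list of true labels of its members, and the function names are made up.

```python
from collections import Counter


def assign_clusters_to_labels(clusters):
    """Go through the clusters in order; give each the most frequent remaining label."""
    remaining = {label for members in clusters for label in members}
    assignment = []
    for members in clusters:
        counts = Counter(label for label in members if label in remaining)
        if counts:
            label = counts.most_common(1)[0][0]
        else:
            # No remaining label occurs in this cluster: take any label still left.
            label = min(remaining, default=None)
        assignment.append(label)
        remaining.discard(label)
    return assignment


def precision_recall(clusters, assignment):
    """Per label: (precision, recall) of the cluster assigned to that label."""
    totals = Counter(label for members in clusters for label in members)
    result = {}
    for members, label in zip(clusters, assignment):
        if label is None or not members:
            continue
        hits = sum(1 for m in members if m == label)
        result[label] = (hits / len(members), hits / totals[label])
    return result


# Toy example with made-up cluster contents.
clusters = [
    ["STOC", "STOC", "SIGIR"],
    ["STOC", "SIGIR", "SIGIR", "SIGGRAPH"],
    ["SIGGRAPH", "SIGGRAPH", "STOC"],
]
assignment = assign_clusters_to_labels(clusters)
scores = precision_recall(clusters, assignment)
```

Note that the result depends on the order in which the clusters are numbered, since earlier clusters get the first pick of the labels.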
Line 39: | Line 38: |
I had negative steps (as asked below), until I found that I had an empty cluster. Fixing that removed the problem. '''Johannes 2763-02-02T1834''' | To Johannes + all: you are not allowed to bring any computing devices whatsoever and you won't need them. If there is a task which requires a calculation that is unreasonable to do by hand (like the log_2(10/7) from the mid-term exam), we will tell you what it is or an approximation to work with (for example that you can take log_2(10/7) as 0.5). '''Hannah 10Mar10 20:42''' |
Line 41: | Line 40: |
To exercise 3: Can you give an intuition, in what range the RSS-value should be (approx.)? My RSS-values seem very high to me... Additionally, I have the same question as Matthias a few lines ago: How can we calculate recall and precision without knowing which cluster belongs to which conference? '''Claudius 2Feb10 17:23''' | '''Exam and portable calculators''': "2. You are not allowed to use any computing devices, mobile phones, etc." I had some problems with pen, paper and sqrt(1080). May we bring calculators? '''Johannes 2010-03-10T20:27''' |
Line 43: | Line 42: |
Thanks for the answer, Matthias, please also try with M = 10000 (if that is feasible time-wise, otherwise take M a little smaller) and report again, thanks. '''Hannah 2Feb10 16:43''' | Thanks, the solution for sheet6 ex4 helped us a lot! '''björn''' |
Line 45: | Line 44: |
To Hannah: Yes, for me it does happen for large Ms as well. I tried 10, 100, 1000 with the same quantitative effects. However, for larger M, the RSS value decreases in size. The RSS values are rising&falling in a very small range of about 5%. Do we have an error here or is the distance function + recalculation step given simply not convergent? '''Matthias''' | @Jonas: thanks for the comment, I have corrected it in the master solution. @Björn: I added a master solution for Exercise Sheet 6 (only Exercise 4), linked above, with what I think is a very short and simple proof. Tell me if you find anything wrong with it. '''Hannah 10Mar10 16:40''' |
Line 47: | Line 46: |
To Matthias + Florian + all: Which values of M have you tried? Does it still happen when you choose M large enough, say 1000, or even larger? Note that the cluster centroids cannot be meaningful if M is not large enough. '''Hannah 2Feb10 16:21''' | Jonas: Yes, that was already mentioned in the tutorials. '''Marjan 10Mar10 15:58''' |
Line 49: | Line 48: |
I'm noticing exactly the same behavior as Florian. After a few iterations, the RSS value is alternating, esp. when using small M. Also, the RSS values do not fall, they are more or less jumping around the same base value all the time. | Hi. Concerning exercise sheet 10, exercise 1: shouldn't you take the square roots of 108 and 10 (in the Matrix EPSILON)? Otherwise the equation is not right. '''Jonas 10.03.10''' |
Line 51: | Line 50: |
Also, in order to calculate Recall + Precision, we have to define which of the clusters is which conference. How should we do that? Just by counting the most frequent conference in a cluster based on the known ground truth information? '''Matthias''' | Hi, we got a question concerning ex sheet 6, exercise 4. In the tutorial Marjan presented a solid, but complicated solution using Taylor Expansion. In the lecture you mentioned that this wasn't necessary for any exercise. Unfortunately we fail at finding a simpler, but still mathematically rigorous solution. Would you please give a brief idea of how to prove such inequalities, as this might be useful for similar, yet easier exercises in the exam. '''Björn Mi 15:12''' |
Line 53: | Line 52: |
To Alex: You should compute the k-means clustering on the whole set, that's the point of clustering anyway (no train sets). '''Marjan 02Feb10 14:39''' | Hi Johannes + all. Here is a very simple example: let the query word be ''algorithm'' and one candidate similar word computed by the permuted lexicon be ''algXXXthm'' (the common prefix is ''thmalg'' [from the permutations ''thmalgori'' and ''thmalgXXX''] which is long enough) and let the edit distance threshold be 2. Obviously this candidate word will be filtered out because the edit distance is 3. '''Marjan 07Mar10 18:57''' |
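Marjan's example for the post-filtering step can be checked with a standard dynamic-programming edit distance (insertions, deletions and substitutions each cost 1). This is a minimal sketch; the function names are hypothetical and the candidate list is assumed to come from the permuted-lexicon lookup.

```python
def edit_distance(x, y):
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(y) + 1))  # distances from the empty prefix of x
    for i, cx in enumerate(x, start=1):
        curr = [i]
        for j, cy in enumerate(y, start=1):
            cost = 0 if cx == cy else 1
            curr.append(min(prev[j] + 1,          # delete cx
                            curr[j - 1] + 1,      # insert cy
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def post_filter(query, candidates, threshold):
    """Keep only the candidates whose actual edit distance is within the threshold."""
    return [c for c in candidates if edit_distance(query, c) <= threshold]
```

With the threshold 2 from the example, `algXXXthm` is removed because its edit distance to `algorithm` is 3, whereas a candidate like `algorithms` (distance 1) would be kept.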
Line 55: | Line 54: |
Should we compute k-means on the complete set or should we divide the set into a training and a test set, as in the previous exercise? '''Alex 02Feb10 14:10''' | '''Filtering with a Permuterm Index''': The slide states: "for all matches thus found, compute the actual edit distance". Is there a simple strawman-example for a word that gets removed in the postfiltering-step? (Today is silly question day.) '''Johannes 2010-03-07T18:26''' |
Line 57: | Line 56: |
Hi Florian + all. That's a very valid question, although the behaviour is counterintuitive compared to what we expected. To give a precise answer I'll have to write the program myself, but since I am in the middle of a deadline I can only promise that I will come back to it asap. My advice is to try to understand why the RSS value increases - either by debugging your own program or by writing a toy example by hand and then reporting here again. '''Marjan 01Feb10 22:41''' | Hi Johannes + all. Concerning your inverted index question: it really depends on the application, if you have lists of only doc ids and want to intersect them fast, you would sort the lists by doc id, if you want to do top-k you would sort them by score. Duplicates only make sense when you also store positional information, which we didn't do in the lecture. Concerning your Elias-Gamma question: there is an upper bound, which I think we also derived in the lecture, and that is log n + O(log^(k) n) + O(1), but I couldn't tell you what are the constants hidden in the two Big-Ohs. '''Hannah 7Mar10 18:19''' |
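For the non-iterated case, the code length is easy to state exactly. The sketch below uses the common textbook convention for Elias-Gamma (floor(log_2 x) zeros followed by the binary representation of x); whether the lecture used exactly this convention is an assumption, and the iterated variant from the question is not covered here.

```python
def elias_gamma(x):
    """Elias-Gamma code of a positive integer, returned as a string of bits."""
    if x < 1:
        raise ValueError("Elias-Gamma is defined for positive integers only")
    binary = bin(x)[2:]  # binary representation without the '0b' prefix
    return "0" * (len(binary) - 1) + binary


def elias_gamma_length(x):
    """Code length in bits: 2 * floor(log_2 x) + 1."""
    return 2 * (x.bit_length() - 1) + 1
```

Under this convention the code for x has 2 * floor(log_2 x) + 1 bits, so the plain code costs about twice the log n of the bound quoted above; encoding the length part again with the same scheme is what brings the cost down towards log n plus lower-order terms.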
Line 59: | Line 58: |
The RSS values of my clustering often get higher after an iteration step, is that normal for this kind of data? Furthermore my RSS value never stops changing, it always switches between two different values after a few iterations (I now just choose three random titles as the initial centroids). I also do not understand how I can compute the precision and recall values of the clustering with respect to the "perfect" clustering. '''Florian 01Feb10 22:41''' | '''Inverted indexes and the like''': If an inverted index maps a word, w, (perhaps a string) to a subset, W(w), of the set of all documents (perhaps only the IDs as numbers), is W(w) always sorted? Does it contain duplicates? For some applications (and the algorithms for them) this seems to matter. I'm just asking in case of an exam task involving coding (especially k-way-merge). '''Johannes 2010-03-07T13:54''' |
Line 61: | Line 60: |
To Florian, Eric + all: yes, sure, the centroids must be of the same type as the things you cluster. And since the things you should cluster are strings / texts (the titles from the dblp.txt from the last lecture), the centroids also have to be strings / texts. There are many ways how you could pick the initial centroids. One way would be to just pick k random titles from the dblp.txt as centroids. '''Hannah 1Feb10 20:55''' | '''Elias-Gamma Encoding''': Is there a closed form for the length of the code for an integer x when elias is iterated k times? '''Johannes 2010-03-07T15:14''' |
Line 63: | Line 62: |
As I understand it the centroids have to be strings as well and the distance of a text to the centroid is just the Jaccard distance of their distinct words. The new centroids can then just be created by using the m most frequently occurring words in all texts which are assigned to this centroid, as stated in exercise 1. The only problem then would be to get the initial centroids... '''Florian 01Feb10 20:50''' | == Questions and comments about the master solution of the mid-term exam == |
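The recomputation step Florian describes can be sketched like this. It is an illustration under assumptions: words are obtained by lowercasing and splitting at whitespace, each text contributes every distinct word once, and the name `new_centroid` is made up.

```python
from collections import Counter


def new_centroid(texts, m):
    """Return the set of the m most frequent words among the texts of one cluster."""
    counts = Counter()
    for text in texts:
        # Count each distinct word once per text.
        counts.update(set(text.lower().split()))
    return {word for word, _ in counts.most_common(m)}


# Toy cluster with made-up titles.
cluster_texts = ["search engines", "search index", "web search index"]
centroid = new_centroid(cluster_texts, 2)
```

If a cluster has fewer than m distinct words, the centroid simply contains all of them; an empty cluster yields an empty centroid, which is a case the program has to handle separately.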
Line 65: | Line 64: |
As I understood we may choose random numbers between 0 and 1, since this is what the Jaccard distance will lead us to. But my next question is: how should we determine the distance of ONE text to its centroid? (while initially assigning them to clusters) '''Eric 01Feb10 20:38''' | '''Johannes 2010-03-07T12:40''' : |
Line 67: | Line 66: |
Hi, how should we choose our initial centroids for the k-means algorithm with the strings? '''Florian 01Feb10 17:56''' | '''1.3''': CLAIM: If an encoding is prefix-free, then there is no code that is a prefix of a different code. Does this claim hold? If so, then 001 mustn't be a code, since 0 is a code and a prefix of 001. Is this right? |
Line 69: | Line 68: |
== Questions and comments about Exercise Sheet 11 below this line (most recent on top) == Hi Johannes, nice way of telling me, and yes, you are of course right, it should be p = h / lambda and q = t / lambda and then p + q = 1 => lambda = h + t. And not p = lambda * h and q = lambda * t and lambda = 1 / (h + t). But believe me, it's very easy to make such stupid mistakes when doing calculations at the (virtual) blackboard. That is why I always ask you guys to pay attention when I am doing calculations, and to correct me if I am doing something wrong. Anyway, also your late feedback is of course appreciated, and I will see how I can correct that thingy on the slides. '''Hannah 27Jan10 17:34''' |
There was an obvious mistake which I now corrected (00 should be mapped to 1, not 0). '''Hannah 7Mar10 12:56''' |
Line 72: | Line 70: |
I have cognitive dissonances from slide 12 from the last lecture. The implications give me an uncomfortable feeling. '''Johannes 2763-01-27T1704''' | '''1.4''': It states: "For a sequence of length n, we need to generate n/2 such codes [...]." Does not each symbol of the n from the sequence get encoded? |
Line 74: | Line 72: |
Hi Matthias, well, you are learning the Pr(W = w|C = c) from the 10% training data (every tenth record), so that's where they should come from. '''Hannah 27Jan10 16:07''' | Each code stands for two bits at a time, so for a sequence of n bits, you have to generate n/2 codes. I replaced ''sequence of length n'' by ''sequence of n bits'' to make this clearer. '''Hannah 7Mar10 12:58''' |
Line 76: | Line 74: |
Hi, could you please explain again what data should be used for calculating the highest Pr(W = w|C = c) in exercise 2? The whole data, the remaining 90%, the learning 10%? '''Matthias 27Jan10 11:52''' | '''3.4''': The function returns the number of common k-grams (as far as I see). Can the return-line be completed with a call to the function from 3.2 to return the Jaccard-distance? Yes, indeed, I replaced ''return l'' by ''return jaccardDistance(x, y, k, l)''. '''Hannah 7Mar10 13:01''' |
Line 78: | Line 78: |
A comment for all who haven't submitted their solutions yet (= most). To compute the argmax_c Pr(C = c) * Prod_w Pr(W = w|C = c), better compute the argmax_c of the logarithms, that is, argmax_c log(Pr(C = c)) + sum_w log Pr(W = w|C = c). The result will be the same, because log is a monotone function. However, computing the sum of the logs is numerically more stable, while computing the product of many small probabilities can lead to numerical problems which can distort results. I don't think it's a big issue for the relatively small data set I gave you, but I would still do it. Anyway, it's not more work computing the sums of logs of things than the product of the things. '''Hannah 27Jan10 5:19am''' | '''5.4''': Does the top-k-algorithm return the top k documents? If so, which k had to be used in this task? What exactly is the condition for stopping? What exactly is the update rule for the ranges? My idea is that (for a fixed document) the minimum is always the known minimum from any of the lists and the maximum is always the (already known) minimum plus the lowest score, seen in any list different than the one the minimum is from. In case of only two lists there may be some simplifications. |
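The log-space argmax from the comment above can be sketched as follows. The data layout is an assumption made for illustration: `priors` maps each class c to Pr(C = c), and `cond_probs[c]` maps each word w to an already smoothed, non-zero Pr(W = w|C = c); words without an entry are skipped.

```python
import math


def classify(words, priors, cond_probs):
    """Return argmax_c of log Pr(C = c) + sum_w log Pr(W = w | C = c)."""
    best_class, best_score = None, float("-inf")
    for c, prior in priors.items():
        score = math.log(prior)
        for w in words:
            if w in cond_probs[c]:
                score += math.log(cond_probs[c][w])
        if score > best_score:
            best_class, best_score = c, score
    return best_class


# Toy probabilities, made up for illustration.
priors = {"STOC": 0.5, "SIGIR": 0.5}
cond_probs = {
    "STOC": {"graph": 0.6, "query": 0.1},
    "SIGIR": {"graph": 0.1, "query": 0.6},
}
predicted = classify(["query", "query", "graph"], priors, cond_probs)
```

The benefit shows with long inputs: multiplying 200 factors of 0.001 underflows to 0.0 in double precision, whereas the corresponding sum of logarithms is an ordinary finite number.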
Line 80: | Line 80: |
Hi Alex + all: I didn't have a program so far, but have just written one, and my overall precision is 83.61%. So classification seems to work pretty well on this dataset. I didn't do anything fancy, used the +1 smoothing to avoid zero probabilities, and used every word. Note that the titles also contain commas, parentheses and stuff. I am saying this because I have seen that some people have words like "(extended" or "abstract)" or "title.". So please do pay attention to that and do not tokenize merely by whitespace. Also, whenever you write a program, test it!!! That is, have a small procedure that outputs the learned probabilities (or better, the counts), and then check them for a small example. I did that as well for my program, otherwise I would never be convinced that it does the correct thing. '''Hannah 27Jan10 1:10am''' | The task asked for the ''top-ranked document'', so k = 1. We can stop when the upper bound for all documents not yet seen is ''strictly'' below the k-th largest lower bound so far, and when the score ranges for the documents already seen are such that it is clear which are the top-k documents and in which order. If there are ties, and we don't care how they are broken, and we don't care to know the order of the top-k documents, we can sometimes stop earlier. Does this answer all your questions? '''Hannah 7Mar10 13:06''' |
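One simple way to avoid tokens like "(extended" or "abstract)" is to extract maximal runs of letters and digits instead of splitting at whitespace. This is only one possible choice, assumed here for illustration; the function name `tokenize` and the sample title are made up.

```python
import re


def tokenize(title):
    """Lowercase the title and return its maximal runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", title.lower())


tokens = tokenize("Fast Search (Extended Abstract).")
```

A side effect of this pattern is that hyphenated words are split into their parts and non-ASCII letters act as separators, which may or may not be what you want for this data.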
Line 82: | Line 82: |
Can you give a short hint how high the match rate should be? In my program the detection rate is around 1/3. I think this is a little bit low, but I also found no hint what rate is a good rate for the given set of documents. Edit: I found a bug, now the detection rate is around 70%, but the question is still the same: is this a possible result? '''Alex 26Jan10 18:20''' | Thanks a lot for your comments! Please go on if you have more. '''Hannah 7Mar10 13:07''' |
Line 84: | Line 84: |
Hi Claudius + all: to get the points, you only have to compute the w with the highest P(W=w|C=c), even if that is a word like "for" or "of". Would be nice and more interesting though, and not really more work, to compute those w with the k highest P(W=w|C=c). For some not too large k, some interesting words should crop up. You can also choose to ignore stopwords altogether as Marjan suggested. Here is a [[http://armandbrahaj.blog.al/2009/04/14/list-of-english-stop-words|list of English stopwords]]. '''Hannah 25Jan10 18:25''' | Thanks a lot for your answers! '''Johannes 2010-03-07T13:44''' |
Line 86: | Line 86: |
Hi Claudius. My recommendation is to ignore stop-words (e.g. the, a, of, is, are etc., for reasons already explained in the lecture) but please wait for a reply from Hannah to be sure. '''Marjan 25Jan10 14:30''' | == Questions and comments about Exercise Sheet 14 below this line (most recent on top) == |
Line 88: | Line 88: |
Hi. In Exercise 2, we have to identify the most predictive word for each conference. But when I take the highest Pr(W=w|C=c), I get not very predictive words like "for" and "of". Is this sufficient, or should we make an effort to find words which are more predictive? '''Claudius 25 Jan 14:26''' | Hi Johannes: why don't you start with the first few questions, and then let's see whether it makes sense to continue this via the Wiki, or via private email, or via a meeting in person. '''Hannah 6Mar10 17:36''' |
Line 90: | Line 90: |
Yes, very good question (the second one), I had it on my agenda for the lecture, but somehow forgot to tell you about it. There is a very simple and effective solution to that problem, which you should also use in the exercise. On slide #10, I told you to take Pr(W = w | C = c) = n_wc / sum_w n_wc, where n_wc is the total number of occurrences of word w in class c. Well, just take Pr(W = w | C = c) = (n_wc + 1) / sum_w (n_wc + 1), which can never be zero. Intuitively, this is like saying that every word occurs at least once for each class. Which is also reasonable, because if your amount of data is big enough, that will indeed happen. It's just an artefact of small data that some words don't occur at all for certain classes. Please ask again in case that was not crystal clear. '''Hannah 24Jan10 21:49''' | Yes, the final exam is like the mid-term exam in this respect. '''Hannah 6Mar10 17:36''' |
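The +1 smoothing from the formula above, Pr(W = w | C = c) = (n_wc + 1) / sum_w (n_wc + 1), can be sketched for a single class like this. It assumes that the sum runs over a fixed vocabulary of all words seen in the training data; the function name and the toy counts are made up.

```python
def smoothed_probabilities(word_counts, vocabulary):
    """Return Pr(W = w | C = c) with +1 smoothing for one class c."""
    # Denominator: sum over all vocabulary words of (n_wc + 1).
    total = sum(word_counts.get(w, 0) + 1 for w in vocabulary)
    return {w: (word_counts.get(w, 0) + 1) / total for w in vocabulary}


# Toy counts for one class; "mesh" never occurs in this class.
vocabulary = ["graph", "query", "mesh"]
probs = smoothed_probabilities({"graph": 3, "query": 0}, vocabulary)
```

In the toy example the denominator is 4 + 1 + 1 = 6, so the unseen words get probability 1/6 each instead of zero, and the probabilities still sum to 1.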
Line 92: | Line 92: |
To Florian + all: Of course you should use the Bayes formula to predict the most probable conference (class). The second question is a good one. I think the natural way is to take that probability as zero. Another way (actually the opposite) is to ignore the words that have not appeared in the original training set i.e. assume that they're not relevant for the prediction. '''Marjan 24Jan10 21:32''' | Alex: http://vulcano.informatik.uni-freiburg.de/wiki/teaching/SearchEnginesWS0910/MidTermExam, so it seems to be allowed. '''Mirko, 6Mar10 16:10''' |
Line 94: | Line 94: |
I have a question to Exercise 2: I do not quite understand how we should predict the conferences for the remaining records. Should we just decide by looking at the most predictive word and decide with that or should we use the Naive Bayes formula of the slides ( argmax_c Pr(C = c) · Π_i=1,...,m Pr(W_i = w_i | C = c) ). And using the Bayes fomula, how should we handle occuring words that did not occur in the training data? Using zero for their probability makes the whole probability for the conference zero as well which is not very reasonable. '''Florian 24Jan10 21:20''' | Hi, I was wondering, will the exam next week also be an open book exam like the mid-term? Perhaps I overlooked it, but I don't think this is stated anywhere yet. '''Alex 6Mar10 13:49''' |
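The add-one estimate and the argmax formula discussed in this thread can be sketched as follows. This is only a minimal illustration, not the master solution: the function names and the toy conference labels are hypothetical, the sum over w is assumed to run over the vocabulary of the training set, and words that never occurred in training are simply skipped (the alternative Marjan mentions). Log-probabilities are summed instead of multiplying probabilities, which is a common way to avoid numerical underflow.

```python
import math
from collections import Counter, defaultdict


def train(records):
    """records: list of (class_label, list_of_title_words)."""
    class_counts = Counter()            # number of records per class
    word_counts = defaultdict(Counter)  # n_wc: occurrences of word w in class c
    vocabulary = set()
    for label, words in records:
        class_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary


def log_prob_word(word, label, word_counts, vocabulary):
    """log of (n_wc + 1) / sum_w (n_wc + 1), the sum running over the vocabulary."""
    denominator = sum(word_counts[label].values()) + len(vocabulary)
    return math.log((word_counts[label][word] + 1) / denominator)


def predict(words, class_counts, word_counts, vocabulary):
    """argmax_c of log Pr(C = c) + sum_i log Pr(W_i = w_i | C = c)."""
    num_records = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / num_records)
        for w in words:
            if w in vocabulary:  # assumption: skip words never seen in training
                score += log_prob_word(w, label, word_counts, vocabulary)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data, only for illustration.
toy_records = [
    ("STOC", ["approximation", "algorithms"]),
    ("STOC", ["lower", "bounds"]),
    ("SIGIR", ["query", "ranking"]),
    ("SIGIR", ["ranking", "retrieval"]),
]
model = train(toy_records)
print(predict(["ranking", "bounds"], *model))
```

With the smoothed estimate, a word such as "bounds" that never occurred for one class only lowers that class's score instead of forcing it to zero.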
I have also uploaded the master solutions for exercise sheet 10 now, see the link above. Note that it's just two pages. Above you also find links to the previous master solutions now (that is, for the mid-term exam and for exercise sheet 9). If you find any mistakes in any of the master solutions, please let us know immediately, thanks. Also, if you have any questions / comments regarding the master solutions, don't hesitate to ask. '''Hannah 24Jan10 16:05'''
Ok, the file is now there, see the link and short description above. Have fun, and let us know if you are having any problems. '''NOTE:''' I said it in the lectures, but let me repeat it here, just in case: you must, of course, use ''only the words from the title as features''. The conference name in the first column is only so that you know the ground truth, which you need for the learning in Exercise 1, as well as for the quality assessment in Exercise 4. '''Hannah 24Jan10 15:48'''
I will do it right now, sorry, it was just procrastination from my side. '''Hannah 24Jan10 15:06'''
Hi, can you please upload the text-file with the publication records? '''Claudius 24 Jan 12:05'''
Hi Manuela + all: I understand your point. I think that when one is familiar with basic linear algebra, then all the exercises (including Exercise 2, given my fairly strong and concrete hints) are something which you just sit down and do, no deep thinking required. But when one is not familiar, then yes, I can see that most of the time will be spent on understanding the meaning of basic things (which, I agree, is very important), like why one can write something like u * v', where u and v are vectors, and obtain a matrix. I guess I am constantly underestimating the mathematical background and exercise you received in your first semesters here in Freiburg. Anyway, I will take this into account when computing the marks from your points for the exercise sheets 9, 10, 11, etc. Note that also for the first 8 exercise sheets you could get a 1.0 without getting all the points, even after taking the worst sheet out of the counting. We will have something similar for the second half, too. So don't worry, it will be fair, and please continue to make an effort with the exercises, and continue to give me feedback when an exercise consumed way too much time, for whatever reason. '''Hannah 21Jan 17:48'''
Maybe it's only a problem for me that I can't sit down and start to prove e.g. exercise 2 or 3 immediately. I'm not familiar with linear algebra and it's difficult to understand the meaning of what we do. So before I can start I have to search for information and read up on what matrix norms, Frobenius norms and so on are. That's why it took me a lot of time to do exercises 2 and 3. Proving the hints (at the bottom of this page) is also nothing that I can do in five minutes. And for exercise 1 it was my own fault that I needed much more time for it. I was confused and made some silly mistakes. Of course it would be nice to have the bonus points for the exam, but it will be hard (and time consuming) to solve all tasks of all exercise sheets without gaps. Thanks for the hints and I think that the new bonus point system is much better than the old one. The only thing is that I'm not sure if the "time calculation" is better than before. Maybe I'm just too slow. '''Manuela'''
To Björn + all: Yes, I see, I think the solution to an exercise like Exercise 1 is much faster to write on paper and then scan in. Typesetting lots of matrices etc. in Latex is no fun and takes lots of time and shouldn't really be part of an exercise. '''Hannah 21Jan10 14:32'''
Yes, your last hint was very helpful. Thanks a lot. Sorry for the late response but I had to work for other courses first and it took me like 3 hours to put the other solutions into Latex (maybe this is also one reason why this sheet takes lots of time again. Especially Ex1 is okay to solve using applets/programs + copy&paste for all intermediate steps, but writing everything down still takes ages). Now that I looked at exercise 2 again, your hint really helped. '''Björn 21Jan 13:03'''
Welcome to the Wiki page of the course Search Engines, WS 2009 / 2010. Lecturer: Hannah Bast. Tutorials: Marjan Celikik. Course web page: click here.
Here are PDFs of the slides of the lectures: Lecture 1 (Intro), Lecture 2 (socket communication), Lecture 3 (ranking), Lecture 4 (IO-efficiency & compression), Lecture 5 (list intersection), Lecture 6 (prefix search), Lecture 7 (javascript), Lecture 8 (error-tolerant search), Lecture 9 (programming languages & UTF-8), Lecture 10 (latent semantic indexing), Lecture 11 (naive bayes classification), Lecture 12 (clustering), Lecture 13 (hierarchical clustering), Lecture 14 (hypothesis testing), Projects.
Here are the recordings of the lectures (except Lecture 2, where we had problems with the microphone), LPD = Lecturnity recording: Recording Lecture 1 (LPD), Recording Lecture 3 (LPD), Recording Lecture 4 (LPD), Recording Lecture 5 (LPD without audio), Recording Lecture 6 (LPD), Recording Lecture 7 (AVI), Recording Lecture 8 (AVI), Recording Lecture 9 (AVI), Recording Lecture 10 (AVI), Recording Lecture 11 (AVI), Recording Lecture 12 (AVI), Recording Lecture 13 (AVI), Recording Lecture 14 (AVI). To play the Lecturnity recordings (.lpd files) you need the Lecturnity Player, which you can download here. I put the Camtasia recordings as .avi files, which you can play with any ordinary video player; I would recommend VLC.
Here are PDFs of the exercise sheets so far: Exercise Sheet 1, Exercise Sheet 2, Exercise Sheet 3, Exercise Sheet 4, Exercise Sheet 5, Exercise Sheet 6, Exercise Sheet 7, Exercise Sheet 8, Exercise Sheet 9, Exercise Sheet 10, Exercise Sheet 11, Exercise Sheet 12, Exercise Sheet 13, Exercise Sheet 14.
Here are your solutions and comments on the previous exercise sheets: Solutions and Comments 1, Solutions and Comments 2, Solutions and Comments 3, Solutions and Comments 4, Solutions and Comments 5, Solutions and Comments 6, Solutions and Comments 7, Solutions and Comments 8, Solutions and Comments 9, Solutions and Comments 10, Solutions and Comments 11, Solutions and Comments 12, Solutions and Comments 13.
Here are our master solutions: Master solution for Exercise Sheet 6 (only Exercise 4), Master solution for Mid-Term Exam, Master solution for Exercise Sheet 9, Master solution for Exercise Sheet 10, Master solution for Exercise Sheet 11, Master solution for Exercise Sheet 12.
Here are the rules for the exercises as explained in Lecture 2.
Here is everything about the mid-term exam (December 18, 2009, a pure trial exam which did not count for anything) and the final exam (March 12, 2010, the real thing which accounted for most of the mark).
Here is the table with the links to your uploaded solutions for Exercise Sheet 14. The deadline is Thursday 18Feb10 16:00.
More general questions and comments
Yes, it's on Wednesday, March 31, 2 - 3 pm in my office (building 51, second floor, room 28). Hannah 20Mar10 21:09
Hi, have you decided when and where the exam review will be, yet? Marius Mar30th2010 08:50 p.m.
Your marks for the final exam and your overall mark for the course are now available on your personal page. You will have the opportunity to look at your exams on a certain day, which we haven't fixed yet. We will tell you (via this page) when we have fixed that date. Hannah 15Mar10 00:39
Yes, HS026 in building 101. Hannah 12Mar10 8:45am
Does the HS mean same 101 building? I am still new in Freiburg ;). Paresh 12 Mar 10 07:45
I am sorry that we have not managed to produce the master solutions for exercise sheets 13 and 14 yet. However, Lecture 13 is still relevant for the exam (clustering, part 2), but I can tell you that there will be no task about Lecture 14 (statistical hypothesis testing). I think that's fair, because there was no tutorial for the (last) exercise sheet 14. Exercise sheet 14 will count as a normal exercise sheet, however. Hannah 11Mar10 15:21
Hi, since there are still no master solutions for sheets 13-14, I assume the contents of the lecture concerning these sheets are not relevant for the exam. Marius Mar11th 2:23 p.m.
To Johannes + all: you are not allowed to bring any computing devices whatsoever and you won't need them. If there is a task which requires a calculation that is unreasonable to do by hand (like the log_2(10/7) from the mid-term exam), we will tell you what it is or an approximation to work with (for example that you can take log_2(10/7) as 0.5). Hannah 10Mar10 20:42
Exam and portable calculators: "2. You are not allowed to use any computing devices, mobile phones, etc." I had some problems with pen, paper and sqrt(1080). May we bring calculators? Johannes 2010-03-10T20:27
Thanks, the solution for sheet6 ex4 helped us a lot! björn
@Jonas: thanks for the comment, I have corrected it in the master solution. @Björn: I added a master solution for Exercise Sheet 6 (only Exercise 4), linked above, with what I think is a very short and simple proof. Tell me if you find anything wrong with it. Hannah 10Mar10 16:40
Jonas: Yes, that was already mentioned in the tutorials. Marjan 10Mar10 15:58
Hi. Concerning exercise sheet 10, exercise 1: shouldn't you take the square roots of 108 and 10 (in the matrix EPSILON)? Otherwise the equation is not right. Jonas 10.03.10
Hi, we have a question concerning exercise sheet 6, exercise 4. In the tutorial Marjan presented a solid but complicated solution using Taylor expansion. In the lecture you mentioned that this wasn't necessary for any exercise. Unfortunately we have failed to find a simpler but still mathematically rigorous solution. Would you please give a brief idea of how to prove inequalities such as this one, as this might be useful for similar, yet easier exercises in the exam. Björn Wed 15:12
Hi Johannes + all. Here is a very simple example: let the query word be algorithm and one candidate similar word computed by the permuted lexicon be algXXXthm (the common prefix is thmalg [from the permutations thmalgori and thmalgXXX] which is long enough) and let the edit distance threshold be 2. Obviously this candidate word will be filtered out because the edit distance is 3. Marjan 07Mar10 18:57
Filtering with a Permuterm Index: The slide states: "for all matches thus found, compute the actual edit distance". Is there a simple strawman example of a word that gets removed in the post-filtering step? (Today is silly question day.) Johannes 2010-03-07T18:26
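The post-filtering step from Marjan's example can be sketched as below. This is a minimal illustration, assuming the candidates have already been produced by the permuterm lookup; the function names are hypothetical, and the edit distance is the standard Levenshtein distance computed row by row with dynamic programming.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (insert, delete, replace)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # replace or match
        prev = cur
    return prev[-1]


def post_filter(query, candidates, max_distance):
    """Keep only the candidates whose actual edit distance is within the threshold."""
    return [c for c in candidates if edit_distance(query, c) <= max_distance]


print(edit_distance("algorithm", "algXXXthm"))
print(post_filter("algorithm", ["algXXXthm", "algorithms", "algorthm"], 2))
```

With threshold 2, the candidate algXXXthm is dropped because it needs three replacements, while the other two candidates are each one edit away and are kept.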
Hi Johannes + all. Concerning your inverted index question: it really depends on the application, if you have lists of only doc ids and want to intersect them fast, you would sort the lists by doc id, if you want to do top-k you would sort them by score. Duplicates only make sense when you also store positional information, which we didn't do in the lecture. Concerning your Elias-Gamma question: there is an upper bound, which I think we also derived in the lecture, and that is log n + O(log^(k) n) + O(1), but I couldn't tell you what are the constants hidden in the two Big-Ohs. Hannah 7Mar10 18:19
Inverted indexes and the like: Suppose an inverted index maps a word w (perhaps a string) to a subset W(w) of the set of all documents (perhaps only the IDs as numbers). Is W(w) always sorted? Does it contain duplicates? For some applications (and the algorithms for them) this seems to matter. I'm just asking in case of an exam task involving coding (especially k-way merge). Johannes 2010-03-07T13:54
Elias-Gamma Encoding: Is there a closed form for the length of the code for an integer x when Elias coding is iterated k times? Johannes 2010-03-07T15:14
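To illustrate why the sort order matters for the inverted-list question above: if both lists are sorted by doc id and contain no duplicates, they can be intersected in a single linear pass. The following is a minimal sketch under exactly these assumptions; the function name is hypothetical.

```python
def intersect(a, b):
    """Intersect two lists of doc ids, each sorted ascending without duplicates."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] cannot occur in the rest of b
        else:
            j += 1  # b[j] cannot occur in the rest of a
    return result


print(intersect([1, 3, 5, 8, 13], [2, 3, 8, 9, 13, 21]))
```

If the lists were sorted by score instead (as for top-k processing), this pass would no longer be correct, since the comparisons rely on the doc-id order.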
Questions and comments about the master solution of the mid-term exam
Johannes 2010-03-07T12:40 :
1.3: CLAIM: If an encoding is prefix-free, then there is no code that is a prefix of a different code. Does this claim hold? If so, then 001 mustn't be a code, since 0 is a code and a prefix of 001. Is this right?
There was an obvious mistake which I now corrected (00 should be mapped to 1, not 0). Hannah 7Mar10 12:56
1.4: It states: "For a sequence of length n, we need to generate n/2 such codes [...]." Does not each symbol of the n from the sequence get encoded?
Each code stands for two bits at a time, so for a sequence of n bits, you have to generate n/2 codes. I replaced sequence of length n by sequence of n bits to make this clearer. Hannah 7Mar10 12:58
3.4: The function returns the number of common k-grams (as far as I see). Can the return-line be completed with a call to the function from 3.2 to return the Jaccard-distance?
Yes, indeed, I replaced return l by return jaccardDistance(x, y, k, l). Hannah 7Mar10 13:01
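For readers without the master solution at hand, here is a rough sketch of a Jaccard distance on k-grams. It is not the code from the master solution: the function names are hypothetical, the number of common k-grams is computed inside the function rather than passed in as l, and k-grams are treated as sets without padding, which is an assumption.

```python
def k_grams(s, k):
    """Set of all substrings of length k of s (no padding; empty if len(s) < k)."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def jaccard_distance(x, y, k):
    """1 - |common k-grams| / |union of k-grams|."""
    gx, gy = k_grams(x, k), k_grams(y, k)
    common = len(gx & gy)
    union = len(gx) + len(gy) - common
    if union == 0:
        return 0.0  # both strings are shorter than k; treated as identical here
    return 1.0 - common / union


print(jaccard_distance("abcd", "bcde", 2))
```

In the example, the two strings share the 2-grams bc and cd out of four distinct 2-grams overall, so the distance is 1 - 2/4.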
5.4: Does the top-k algorithm return the top k documents? If so, which k had to be used in this task? What exactly is the condition for stopping? What exactly is the update rule for the ranges? My idea is that (for a fixed document) the minimum is always the known minimum from any of the lists and the maximum is always the (already known) minimum plus the lowest score seen in any list other than the one the minimum is from. In the case of only two lists there may be some simplifications.
The task asked for the top-ranked document, so k = 1. We can stop when the upper bound for all documents not yet seen is strictly below the k-th largest lower bound so far, and when the score ranges for the documents already seen are such that it is clear which are the top-k documents and in which order. If there are ties, and we don't care how they are broken, and we don't care to know the order of the top-k documents, we can sometimes stop earlier. Does this answer all your questions? Hannah 7Mar10 13:06
Thanks a lot for your comments! Please go on if you have more. Hannah 7Mar10 13:07
Thanks a lot for your answers! Johannes 2010-03-07T13:44
Questions and comments about Exercise Sheet 14 below this line (most recent on top)
Hi Johannes: why don't you start with the first few questions, and then let's see whether it makes sense to continue this via the Wiki, or via private email, or via a meeting in person. Hannah 6Mar10 17:36
Yes, the final exam is like the mid-term exam in this respect. Hannah 6Mar10 17:36
Alex: http://vulcano.informatik.uni-freiburg.de/wiki/teaching/SearchEnginesWS0910/MidTermExam, so it seems to be allowed. Mirko, 6Mar10 16:10
Hi, I was wondering, will the exam next week also be an open book exam like the mid-term? Perhaps I overlooked it, but I don't think this is stated anywhere yet. Alex 6Mar10 13:49
I have lots of questions and don't know where to put them. I suppose this wiki-page will get chaotic pretty fast if I post 20 questions. Johannes VI Mar MMX 12:00
I'm sorry for the delay with the master solutions. I am at a conference right now but will try to make progress with this over the weekend. Hannah 4Mar10 23:59
Do we get master solutions for ex. 11, 12, 13 and 14? Johannes 04Mar2010 23:32 ZULU
Now they're there again. Marjan 01Mar18:09
ARGH! I'm very sorry. My Down-Them-All Plugin for Firefox seems to have deleted all the lecture PDFs! Sorry for that. Rollback to previous versions does not seem to work. I hope, someone has already downloaded them all and is able to restore them! SORRY! Interesting, I've got the rights to delete something from the main page, though. Marius Mar 1st 2010 2:38 p.m.
(Reminder:) Hello, the master solutions are not online, yet. alex n 1Mar10 11:08
Yes, we are working on it. Please remind us again if they aren't online by the end of this week. Hannah 23Feb10 14:30
Do we get master solutions for ex. 11, 12, 13 and 14? Johannes 23Feb10 14:05
Hi Matthias, yes, Pr(A) = 1 - Pr(not A), for any event A, and so for any random variable X, Pr(X <= x) = 1 - Pr(X > x), because X <= x and X > x are complementary events. For continuous random variables (like variables with a normal distribution), the difference between <= and <, and between >= and >, is immaterial, because Pr(X = x) = 0 for each fixed x. But anyway, to compute the probability, you first have to transform it a bit, like I did in the lecture, and then obtain Pr(N(0,1) >= sqrt(n1) * (µ1 - µ) / σ) and Pr(N(0,1) <= sqrt(n2) * (µ2 - µ) / σ). To evaluate the latter you can also simply use the symmetry of the normal distribution, due to which one has Pr(N(0,1) <= -x) = Pr(N(0,1) >= x). Hannah 18Feb10 12:58
Hi, how can we compute Pr(N(n2 * µ, n2 * σ^2) <= n2 * µ2)? Can we use 1 - Pr(N(n2 * µ, n2 * σ^2) >= n2 * µ2) for that? Matthias 18Feb10 12:01
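The transformation mentioned in the answer above can be written out as follows. This is a sketch of the standard normalization step, assuming (as in the thread) that under the null hypothesis the sum of the n_i sample values is distributed as N(n_i µ, n_i σ^2), with µ1 > µ > µ2.

```latex
\begin{align*}
X_1 \sim N(n_1\mu,\; n_1\sigma^2)
  &\;\Longrightarrow\; Z = \frac{X_1 - n_1\mu}{\sqrt{n_1}\,\sigma} \sim N(0,1), \\
\Pr(X_1 \ge n_1\mu_1)
  &= \Pr\!\left(Z \ge \frac{n_1\mu_1 - n_1\mu}{\sqrt{n_1}\,\sigma}\right)
   = \Pr\!\left(Z \ge \frac{\sqrt{n_1}\,(\mu_1 - \mu)}{\sigma}\right), \\
\Pr(X_2 \le n_2\mu_2)
  &= \Pr\!\left(Z \le \frac{\sqrt{n_2}\,(\mu_2 - \mu)}{\sigma}\right)
   = \Pr\!\left(Z \ge \frac{\sqrt{n_2}\,(\mu - \mu_2)}{\sigma}\right).
\end{align*}
```

The last equality is the symmetry Pr(N(0,1) <= -x) = Pr(N(0,1) >= x), applied with x = sqrt(n2) * (µ - µ2) / σ, which is positive because µ2 < µ.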
Hi Florian + all, one of µ1 and µ2 is larger than µ and one is smaller. Let's assume µ1 is larger and µ2 is smaller. Then for µ1 you have to look at Pr(N(n1 * µ, n1 * σ^2) >= n1 * µ1). But for µ2 you have to look at Pr(N(n2 * µ, n2 * σ^2) <= n2 * µ2). Note the <= instead of the >= for the second probability. Recall the meaning of these probabilities. Just as an example, let µ be 100 and µ1 be 150 and µ2 be 50. Then the first probability means: what is the probability that I see a mean of 150 or more in my first sample, although the mean of my distribution is 100. The second probability means: what is the probability that I see a mean of 50 or less in my second sample, although the mean of my distribution is 100. If you take <= for both or >= for both probabilities, it is to be expected that you get two completely different probabilities, one very low and one very high (except when they are both close to 50%). Please ask again if this is still unclear. Hannah 17Feb10 21:51
Sorry, with probability for µ1 I meant Pr(N(n1 * µ, n1 * σ^2) >= n1 * µ1) and accordingly with probability for µ2 I meant Pr(N(n2 * µ, n2 * σ^2) >= n2 * µ2) where n1=n2 for the exercise sheet. Florian 17Feb10 21:18
Hi Florian, what exactly do you mean by probability for µ1 and probability for µ2? Hannah 17Feb10 21:02
Hi, what values are we expected to get for exercise 4? I always get a probability of about 99.9% for μ1 and a value of about 0.07% for μ2, can that be? Florian 17Feb10 18:25
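The two probabilities from this thread can be evaluated numerically as sketched below. The means µ = 100, µ1 = 150, µ2 = 50 are taken from Hannah's example; the standard deviation σ = 100 and the sample size n = 4 are made-up values for illustration only, and the function names are hypothetical. The standard normal tail is computed with the complementary error function from Python's math module.

```python
import math


def upper_tail(x):
    """Pr(N(0,1) >= x), via the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2))


def prob_mean_at_least(sample_mean, n, mu, sigma):
    """Pr of a sample mean >= sample_mean if the true mean is mu (for sample_mean > mu)."""
    return upper_tail(math.sqrt(n) * (sample_mean - mu) / sigma)


def prob_mean_at_most(sample_mean, n, mu, sigma):
    """Pr of a sample mean <= sample_mean if the true mean is mu, using symmetry."""
    return upper_tail(math.sqrt(n) * (mu - sample_mean) / sigma)


# mu, mu1, mu2 from the example in the thread; sigma and n are assumptions.
mu, mu1, mu2, sigma, n = 100.0, 150.0, 50.0, 100.0, 4
print(prob_mean_at_least(mu1, n, mu, sigma))  # "150 or more"
print(prob_mean_at_most(mu2, n, mu, sigma))   # "50 or less"
```

Because 150 and 50 are equally far from 100, both calls return the same small probability here. Using >= for the second sample instead would give one minus that value, which is the kind of very-high versus very-low mismatch discussed above.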
Hi Florian, yes, the averages in Exercise 3 should be average running times. I uploaded a new version of the sheet, where I corrected this. Hannah 14Feb10 17:48
Hi, I guess we should measure the running times to determine the efficiency of the programs for exercise 3? Florian 15Feb10 17:42
Hi Claudius, you should compute Pr(D|H0), exactly as done in the lecture for Example 2, where we computed this probability as Pr(X > x), where X is a random variable with distribution N(0,1), that is, normal with mean 0 and variance 1, and x depends on the mean and variance of your data. Hannah 14Feb10 16:44
Hi. If I have understood correctly, we have to compute Pr(H|D) in Exercise 4. From statistical hypothesis testing, we get Pr(D|H). Now, Pr(H|D) = Pr(D|H) * (Pr(H) / Pr(D)). We know Pr(D|H) and we can compute Pr(D), but what value do we have to use for Pr(H)? Claudius 14Feb10 14:41
Hi Eric, I don't care whether you use integers or doubles, but I am curious why the one should be any harder than the other? Hannah 12Feb10 19:02
May we use integers for sorting? Or do we have to use doubles? This is important for generating my sorted array Eric 12Feb10 18:56
If you're asking about the merging you can of course use a priority queue if you want, but you don't really need it when merging 2 lists. Marjan 18:28
Why would you use a priority queue? It's simple sorting, the exercise is not about implementing your own sorting algorithm or something like that. About exercise 3, it should be clear from the exercise itself that the sequences should be sorted (otherwise how can the merging work?) Marjan 18:23
Does that mean we only have to use a priority queue or something like that and don't have to implement the sorting ourselves? And in Exercise 3, should the random set be an ordered one or not? Alex 12Feb10 18:19
We prefer randomized sorting using bitonic networks, alternatively combined with LSD radix sort or simple pancake sort. That's of course a joke, it should be clear that you can use the built-in sorting functions (your own implementation will be certainly slower). Marjan 12Feb10 18:12
What does "do a standard sort" in exercise 2 mean? Shall I implement one on my own, or may I use the Java built-in sorting mechanisms? Also, which sorting algorithm do you prefer for this? Eric 12Feb10 18:04
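Marjan's remark in this thread that merging two sorted lists does not need a priority queue can be illustrated with a two-pointer merge. This is a minimal sketch, not the reference solution for the exercise: the function name is hypothetical, and the inputs are assumed to be already sorted (for example with the built-in sort).

```python
def merge(a, b):
    """Merge two ascending-sorted lists into one ascending-sorted list."""
    merged = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            merged.append(a[i])
            i += 1
        else:
            merged.append(b[j])
            j += 1
    merged.extend(a[i:])  # at most one of these two tails is non-empty
    merged.extend(b[j:])
    return merged


print(merge(sorted([5, 1, 4]), sorted([3, 2])))
```

The same code works for integers and doubles alike, since it only relies on the comparison operator.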