6965
Comment:
|
9609
|
Deletions are marked like this. | Additions are marked like this. |
Line 3: | Line 3: |
Here are PDFs of the slides of the lectures so far: [[attachment:SearchEnginesWS0910/lecture-1.pdf|Lecture 1]], [[attachment:SearchEnginesWS0910/lecture-2.pdf|Lecture 2]]. | Here are PDFs of the slides of the lectures so far: [[attachment:SearchEnginesWS0910/lecture-1.pdf|Lecture 1]], [[attachment:SearchEnginesWS0910/lecture-2.pdf|Lecture 2]], [[attachment:SearchEnginesWS0910/lecture-3.pdf|Lecture 3]], [[attachment:SearchEnginesWS0910/lecture-4.pdf|Lecture 4]], [[attachment:SearchEnginesWS0910/lecture-5.pdf|Lecture 5]], [[attachment:SearchEnginesWS0910/lecture-6.pdf|Lecture 6]], [[attachment:SearchEnginesWS0910/lecture-7.pdf|Lecture 7]], [[attachment:SearchEnginesWS0910/lecture-8.pdf|Lecture 8]], [[attachment:SearchEnginesWS0910/lecture-9.pdf|Lecture 9]], [[attachment:SearchEnginesWS0910/lecture-10.pdf|Lecture 10]], [[attachment:SearchEnginesWS0910/lecture-11.pdf|Lecture 11]], [[attachment:SearchEnginesWS0910/lecture-12.pdf|Lecture 12]], [[attachment:SearchEnginesWS0910/lecture-13.pdf|Lecture 13]]. |
Line 5: | Line 5: |
Here are PDFs of the exercise sheets so far: [[attachment:SearchEnginesWS0910/exercise-1.pdf|Exercise Sheet 1]], [[attachment:SearchEnginesWS0910/exercise-2.pdf|Exercise Sheet 2]]. | Here are the recordings of the lectures so far (except Lecture 2, where we had problems with the microphone), LPD = Lecturnity recording: [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-1.lpd|Recording Lecture 1 (LPD)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-3.lpd|Recording Lecture 3 (LPD)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-4.lpd|Recording Lecture 4 (LPD)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-5.lpd|Recording Lecture 5 (LPD without audio)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-6.lpd|Recording Lecture 6 (LPD)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-7.avi|Recording Lecture 7 (AVI)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-8.avi|Recording Lecture 8 (AVI)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-9.avi|Recording Lecture 9 (AVI)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-10.avi|Recording Lecture 10 (AVI)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-11.avi|Recording Lecture 11 (AVI)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-12.avi|Recording Lecture 12 (AVI)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-13.avi|Recording Lecture 13 (AVI)]]. To play the Lecturnity recordings (.lpd files) you need the [[http://www.lecturnity.de/de/download/lecturnity-player|Lecturnity Player, which you can download here]]. I put the Camtasia recordings as .avi files, which you can play with any ordinary video player; I would recommend [[http://www.videolan.org/vlc|VLC]]. |
Line 7: | Line 7: |
Here are your solutions and comments on the previous exercise sheets: [[SearchEnginesWS0910/ExerciseSheet1|Exercise Sheet 1]]. | Here are PDFs of the exercise sheets so far: [[attachment:SearchEnginesWS0910/exercise-1.pdf|Exercise Sheet 1]], [[attachment:SearchEnginesWS0910/exercise-2.pdf|Exercise Sheet 2]], [[attachment:SearchEnginesWS0910/exercise-3.pdf|Exercise Sheet 3]], [[attachment:SearchEnginesWS0910/exercise-4.pdf|Exercise Sheet 4]], [[attachment:SearchEnginesWS0910/exercise-5.pdf|Exercise Sheet 5]], [[attachment:SearchEnginesWS0910/exercise-6.pdf|Exercise Sheet 6]], [[attachment:SearchEnginesWS0910/exercise-7.pdf|Exercise Sheet 7]], [[attachment:SearchEnginesWS0910/exercise-8.pdf|Exercise Sheet 8]], [[attachment:SearchEnginesWS0910/exercise-9.pdf|Exercise Sheet 9]], [[attachment:SearchEnginesWS0910/exercise-10.pdf|Exercise Sheet 10]], [[attachment:SearchEnginesWS0910/exercise-11.pdf|Exercise Sheet 11]], [[attachment:SearchEnginesWS0910/exercise-12.pdf|Exercise Sheet 12]], [[attachment:SearchEnginesWS0910/exercise-13.pdf|Exercise Sheet 13]]. Here are your solutions and comments on the previous exercise sheets: [[SearchEnginesWS0910/ExerciseSheet1|Solutions and Comments 1]], [[SearchEnginesWS0910/ExerciseSheet2|Solutions and Comments 2]], [[SearchEnginesWS0910/ExerciseSheet3|Solutions and Comments 3]], [[SearchEnginesWS0910/ExerciseSheet4|Solutions and Comments 4]], [[SearchEnginesWS0910/ExerciseSheet5|Solutions and Comments 5]], [[SearchEnginesWS0910/ExerciseSheet6|Solutions and Comments 6]], [[SearchEnginesWS0910/ExerciseSheet7|Solutions and Comments 7]], [[SearchEnginesWS0910/ExerciseSheet8|Solutions and Comments 8]], [[SearchEnginesWS0910/ExerciseSheet9|Solutions and Comments 9]], [[SearchEnginesWS0910/ExerciseSheet10|Solutions and Comments 10]], [[SearchEnginesWS0910/ExerciseSheet11|Solutions and Comments 11]], [[SearchEnginesWS0910/ExerciseSheet12|Solutions and Comments 12]]. Here are our master solutions: [[attachment:SearchEnginesWS0910/solution-midterm.pdf|Master solution for Mid-Term Exam]],[[attachment:SearchEnginesWS0910/solution-9.pdf|Master solution for Exercise Sheet 9]], [[attachment:SearchEnginesWS0910/solution-10.pdf|Master solution for Exercise Sheet 10]]. |
Line 11: | Line 15: |
= Exercise Sheet 2 = | [[SearchEnginesWS0910/MidTermExam|Here is everything about the mid-term exam]]. The final exam is on Friday March 12, 2010. The written exam begins at 2.00 pm in HS 026. The oral exams are scheduled on the same day. |
Line 13: | Line 17: |
Here are the details about the three servers (UDP, TCP, HTTP) for Exercise 4: | [[SearchEnginesWS0910/ExerciseSheet13|Here is the table with the links to your uploaded solutions for Exercise Sheet 13]]. The deadline is Thursday 11Feb10 16:00. |
Line 15: | Line 19: |
All three servers are running on our machine vulcano.informatik.uni-freiburg.de (IP address is 132.230.152.135). | == Questions and comments about Exercise Sheet 13 below this line (most recent on top) == |
Line 17: | Line 21: |
The UDP server is running on port 8888 of that machine. You can send it a number and it will then send you back that number of bytes, in packets of 1000 bytes each. (That means you also have to read packets of 1000 bytes each.) The first ten bytes of each packet contain the packet id. That is interesting for checking which packets get lost and in which order packets arrive. | '''FYI: on Thursday, we will have 45 minutes of regular lecture (16.15 - 17.00 h), and then the intro lecture for the bachelor / master's project (17.00 - 17.45 h). Those who are interested in doing a project, can just stay, the others may go then.''' '''Hannah 8Feb10 14:08''' |
Line 19: | Line 23: |
The TCP server is running on port 9999 of that machine. You can send it a request of the form GET /<number of bytes> HTTP/1.1, or you can just use a downloading program like wget or curl and time it. | Hi Florian + all: yes, it's possible for all the four heuristics we have discussed in the lecture, including complete-link. Conceptually, you have to maintain the n x n similarity matrix SIM, where SIM[i][j] gives you the similarity of the current cluster i with the current cluster j. In each iteration, we consider only those entries SIM[i][j] where i and j are cluster represantatives (that is, they are the smallest index of an element in their cluster) and where i != j (we never want to merge a cluster with itself). Initially each point is a cluster on its own and the SIM[i][j] are just the pairwise point similarities, as in the code we wrote together in the lecture. Now in each iteration, we need to find that pair (i,j) of cluster representative for which SIM[i][j] is maximal. In the lecture we did it the trivial way, by a double loop. That requires quadratic time per iteration, giving cubic time overall. For the exercise, you shoudl instead use one priority queue per cluster representative i to store all SIM[i][j] for all valid cluster representatives j. Then for each i you get the argmax_j SIM[i][j] in logarithmic time instead of in linear time, and doing this for each i you get the argmax_(i,j) SIM[i][j] in n * log n time instead of in n^2 time per iteration. Once you have found the pair of clusters to merge, you have to update both SIM[i][j] and the priority queues appropriately. The former can be done in linear time per iteration and the latter can be done with a linear number of priority queue operations. BTW, you find pseudo-code of the algorithm in the information retrieval book: http://nlp.stanford.edu/IR-book/html/htmledition/time-complexity-of-hac-1.html maybe that helps you, too. '''Hannah 7Feb10 20:49''' |
Line 21: | Line 25: |
The HTTP server is running on port 80, as web servers normally do. Just download the file http://vulcano.informatik.uni-freiburg.de/file_100M and measure the time. | Hi, in exercise 1, is it really possible to implement the complete-link heuristic with a priority queue so that the finding of the next two clusters which have to be merged only needs logarithmic time? If I did it right, then we have to fill the priority queue with the possible pairs of clusters and use their similarity with the complete-link heuristic to sort the queue. But when two clusters are merged now, then the distances of the combined cluster (i.e. the one with the smaller index) to the other clusters can change, so after each merge the priority queue has to be rebuilt to ensure it's heap property. But a complete rebuild of the queue is quite time consuming, or am I doing anything wrong? '''Florian 7Feb10 20:09''' |
Line 23: | Line 27: |
For measuring your transfer and error rates, as requested by the exercise, repeat your experiments several times and also at different times, and form the average of these measurements (or report several numbers if you get very different results). You should ask for large amounts of data, like 10 MB or 100 MB. | Hi Mirko, you are free to modify the existing program (linked to in my first comment below), or write the program from scratch (or transcribe my C++ program) in a programming language of your choice. '''Hannah 7Feb10 16:40''' |
Line 25: | Line 29: |
[[SearchEnginesWS0910/ExerciseSheet2|Here you can upload your solutions for Exercise Sheet 2]]. | Hi, one question about Exercise 1: the text says "Modify the code", does this mean we have to extend the C++-program, or are we also allowed to write the complete program in an other language like Java? '''Mirko 7Feb10 16:33''' |
Line 27: | Line 31: |
== Questions or comments below this line, most recent on top please == | I can't say now with certainty which courses exactly I will offer in a year. It is likely though, that we will offer bachelor / master projects on a regular basis. '''Hannah 5Feb10 21:17''' |
Line 29: | Line 33: |
I am sorry for the downtime, I totally forgot about these maintenance works. The downtime also killed our servers, but now they are running again. About the corrections: of course you should get comments on what you did wrong and why you got less points for what. Sorry, if that didn't happen for the first exercise sheet. I will talk with Marjan. '''Hannah 1Nov09 00:38am''' | Yes, Marjan is right. '''Johannes 2010-02-05:1931''' |
Line 31: | Line 35: |
Yes, Eric. Today afternoon and evening was nearly the whole Uni-Net Offline because of intended maintenance. I have a question: Will there be any correction or comments or a sample solution or something like that for every exercise because I now only know how many points I have for every Exercise in Exercise Sheet 1 but I don't know why I have a lack of 1/2 point in one exercise and in another. I think it would be good if anybody will know what he has done wrong or/and what he could have done better. '''Waldemar 31Oct09 21:08''' | Probably Johannes wants/has to do his project next year. '''Marjan 5Feb10 18:32''' |
Line 33: | Line 37: |
I cannot connect to vulcano.informatik.uni-freiburg.de:9999 and :8888, neither from outside, nor from inside (logged into pool account). Also the whole informatik.uni... was down a few minutes ago. It's hard to solve exercise 4 without the servers running. '''Eric 31Oct09 19:00''' | Hi Johannes, I don't quite understand your comment, what do you mean? '''Hannah 5Feb10 17:33''' |
Line 35: | Line 39: |
Hi, I wonder whether those three servers for exercise 4 of sheet 2 online or not? '''Zhongjie 31Oct09 11:49''' | Even if it was in one year? '''Johannes 2010-02-05:1719''' |
Line 37: | Line 41: |
To all: Please note that (almost) everybody will get +2 points for Exercise Sheet 1 (previously I did not assign any points to the first and the last problem). '''Marjan 31Oct09 10:03''' | If you are interested in doing the Bachelor / Master's project (following this course as a block course), please send me an email asap: bast@informatik.uni-freiburg.de . '''Hannah 4Deb10 19:30''' |
Line 39: | Line 43: |
To Triatmoko and Ahmed + all: You obviously haven't created your wiki page. Please go to your link and click "Create a new page" and then "Save changes". You should then upload your solutions there and put the link on the wiki like everybody else did. If it is still not clear please ask some of your fellow students. I don't remember if it is mentioned, but you can skip one Exercise Sheet without losing any points. '''Marjan 31Oct09 09:38''' Hi, I have some question about my exercise Sheet 1, I saw in my exercise page that my name, my upload solution and code in gray color and other persons in blue color. I try to click my attachment file in my exercise page, and I have some message that there are no attachment. any body know about this issue?because my exercise uploaded on Monday 26oct '''Triatmoko 30Oct09 22:18''' Hi Björn + all: For Exercise 4 from Exercise Sheet 1 you had to write code that is at least able to process 2-word queries. If your code can indeed only handle 2-word queries and not an arbitrary number of query words, that is also fine for this exercise, you won't get less points because of that. Your second question is also very valid. You should put the various functionalities into modules / classes of their own, so that you can easily combine them for the three different binaries required for Exercises 1 - 3. Each of your three programs will then be quite short, just putting together the right things. I should have added it to my list of evil coding NoNos: never ever duplicate code, but instead put it in a class / module of its own. I hope this answers your questions, if not please ask again. Sorry for the late answer, but I was super busy until now, hardly had time to breathe. '''Hannah 30Oct09 19:08''' I have a question concerning exercise 2. There was no concrete task to produce "query processing code" on ex sheet 1. Are there any requirements that have to be fulfilled? Should it be able to handle two word queries? n-word queries? Additionally there is something else I want to ask: I think it surely isn't bad practice to write a more generic webserver and use it for exercises 1-3. Apart from that it says "change your code" some times in the exercises. How should your submission behave w.r.t. the exercises? Different src files / executables for each exercise? One program that solve each exercises depending on startup parameters? Anything else? '''Björn 30ct09 2:25pm''' I now reorganized the page. Old stuff went to separate pages (links above). The idea is that the front page is always for the current lecture / exercises. The problem with your exercise page should be solved now, Ivo. '''Hannah 30ct09 00:05am''' Having problems to access my exercise page after loging in - IvoChichkovExercises, '''Ivo 29Oct 22:56pm''' Sorry to bother you. I added the Link to exercise sheet 2 with the linked pdf. Needed this to find the sheet as fast as possible. '''Marius 29Oct 10:04 p.m.''' Is there a webpage for exercise sheet 2 somewhere? '''Johannes 29Oct 07:45 pm''' |
Here is the [[attachment:HierarchicalClustering.cpp|hierarchical clustering code]] which we have written in the lecture. '''Hannah 2Feb10 18:47''' |
Welcome to the Wiki page of the course Search Engines, WS 2009 / 2010. Lecturer: Hannah Bast. Tutorials: Marjan Celikik. Course web page: click here.
Here are PDFs of the slides of the lectures so far: Lecture 1, Lecture 2, Lecture 3, Lecture 4, Lecture 5, Lecture 6, Lecture 7, Lecture 8, Lecture 9, Lecture 10, Lecture 11, Lecture 12, Lecture 13.
Here are the recordings of the lectures so far (except Lecture 2, where we had problems with the microphone), LPD = Lecturnity recording: Recording Lecture 1 (LPD), Recording Lecture 3 (LPD), Recording Lecture 4 (LPD), Recording Lecture 5 (LPD without audio), Recording Lecture 6 (LPD), Recording Lecture 7 (AVI), Recording Lecture 8 (AVI), Recording Lecture 9 (AVI), Recording Lecture 10 (AVI), Recording Lecture 11 (AVI), Recording Lecture 12 (AVI), Recording Lecture 13 (AVI). To play the Lecturnity recordings (.lpd files) you need the Lecturnity Player, which you can download here. I put the Camtasia recordings as .avi files, which you can play with any ordinary video player; I would recommend VLC.
Here are PDFs of the exercise sheets so far: Exercise Sheet 1, Exercise Sheet 2, Exercise Sheet 3, Exercise Sheet 4, Exercise Sheet 5, Exercise Sheet 6, Exercise Sheet 7, Exercise Sheet 8, Exercise Sheet 9, Exercise Sheet 10, Exercise Sheet 11, Exercise Sheet 12, Exercise Sheet 13.
Here are your solutions and comments on the previous exercise sheets: Solutions and Comments 1, Solutions and Comments 2, Solutions and Comments 3, Solutions and Comments 4, Solutions and Comments 5, Solutions and Comments 6, Solutions and Comments 7, Solutions and Comments 8, Solutions and Comments 9, Solutions and Comments 10, Solutions and Comments 11, Solutions and Comments 12.
Here are our master solutions: Master solution for Mid-Term Exam,Master solution for Exercise Sheet 9, Master solution for Exercise Sheet 10.
Here are the rules for the exercises as explained in Lecture 2.
Here is everything about the mid-term exam. The final exam is on Friday March 12, 2010. The written exam begins at 2.00 pm in HS 026. The oral exams are scheduled on the same day.
Here is the table with the links to your uploaded solutions for Exercise Sheet 13. The deadline is Thursday 11Feb10 16:00.
Questions and comments about Exercise Sheet 13 below this line (most recent on top)
FYI: on Thursday, we will have 45 minutes of regular lecture (16.15 - 17.00 h), and then the intro lecture for the bachelor / master's project (17.00 - 17.45 h). Those who are interested in doing a project, can just stay, the others may go then. Hannah 8Feb10 14:08
Hi Florian + all: yes, it's possible for all the four heuristics we have discussed in the lecture, including complete-link. Conceptually, you have to maintain the n x n similarity matrix SIM, where SIM[i][j] gives you the similarity of the current cluster i with the current cluster j. In each iteration, we consider only those entries SIM[i][j] where i and j are cluster represantatives (that is, they are the smallest index of an element in their cluster) and where i != j (we never want to merge a cluster with itself). Initially each point is a cluster on its own and the SIM[i][j] are just the pairwise point similarities, as in the code we wrote together in the lecture. Now in each iteration, we need to find that pair (i,j) of cluster representative for which SIM[i][j] is maximal. In the lecture we did it the trivial way, by a double loop. That requires quadratic time per iteration, giving cubic time overall. For the exercise, you shoudl instead use one priority queue per cluster representative i to store all SIM[i][j] for all valid cluster representatives j. Then for each i you get the argmax_j SIM[i][j] in logarithmic time instead of in linear time, and doing this for each i you get the argmax_(i,j) SIM[i][j] in n * log n time instead of in n^2 time per iteration. Once you have found the pair of clusters to merge, you have to update both SIM[i][j] and the priority queues appropriately. The former can be done in linear time per iteration and the latter can be done with a linear number of priority queue operations. BTW, you find pseudo-code of the algorithm in the information retrieval book: http://nlp.stanford.edu/IR-book/html/htmledition/time-complexity-of-hac-1.html maybe that helps you, too. Hannah 7Feb10 20:49
Hi, in exercise 1, is it really possible to implement the complete-link heuristic with a priority queue so that the finding of the next two clusters which have to be merged only needs logarithmic time? If I did it right, then we have to fill the priority queue with the possible pairs of clusters and use their similarity with the complete-link heuristic to sort the queue. But when two clusters are merged now, then the distances of the combined cluster (i.e. the one with the smaller index) to the other clusters can change, so after each merge the priority queue has to be rebuilt to ensure it's heap property. But a complete rebuild of the queue is quite time consuming, or am I doing anything wrong? Florian 7Feb10 20:09
Hi Mirko, you are free to modify the existing program (linked to in my first comment below), or write the program from scratch (or transcribe my C++ program) in a programming language of your choice. Hannah 7Feb10 16:40
Hi, one question about Exercise 1: the text says "Modify the code", does this mean we have to extend the C++-program, or are we also allowed to write the complete program in an other language like Java? Mirko 7Feb10 16:33
I can't say now with certainty which courses exactly I will offer in a year. It is likely though, that we will offer bachelor / master projects on a regular basis. Hannah 5Feb10 21:17
Yes, Marjan is right. Johannes 2010-02-05:1931
Probably Johannes wants/has to do his project next year. Marjan 5Feb10 18:32
Hi Johannes, I don't quite understand your comment, what do you mean? Hannah 5Feb10 17:33
Even if it was in one year? Johannes 2010-02-05:1719
If you are interested in doing the Bachelor / Master's project (following this course as a block course), please send me an email asap: bast@informatik.uni-freiburg.de . Hannah 4Deb10 19:30
Here is the hierarchical clustering code which we have written in the lecture. Hannah 2Feb10 18:47