12634
Comment:
|
11426
|
Deletions are marked like this. | Additions are marked like this. |
Line 3: | Line 3: |
Here are PDFs of the slides of the lectures so far: [[attachment:SearchEnginesWS0910/lecture-1.pdf|Lecture 1]], [[attachment:SearchEnginesWS0910/lecture-2.pdf|Lecture 2]], [[attachment:SearchEnginesWS0910/lecture-3.pdf|Lecture 3]], [[attachment:SearchEnginesWS0910/lecture-4.pdf|Lecture 4]], [[attachment:SearchEnginesWS0910/lecture-5.pdf|Lecture 5]], [[attachment:SearchEnginesWS0910/lecture-6.pdf|Lecture 6]], [[attachment:SearchEnginesWS0910/lecture-7.pdf|Lecture 7]]. | Here are PDFs of the slides of the lectures so far: [[attachment:SearchEnginesWS0910/lecture-1.pdf|Lecture 1]], [[attachment:SearchEnginesWS0910/lecture-2.pdf|Lecture 2]], [[attachment:SearchEnginesWS0910/lecture-3.pdf|Lecture 3]], [[attachment:SearchEnginesWS0910/lecture-4.pdf|Lecture 4]], [[attachment:SearchEnginesWS0910/lecture-5.pdf|Lecture 5]], [[attachment:SearchEnginesWS0910/lecture-6.pdf|Lecture 6]], [[attachment:SearchEnginesWS0910/lecture-7.pdf|Lecture 7]], [[attachment:SearchEnginesWS0910/lecture-8.pdf|Lecture 8]], [[attachment:SearchEnginesWS0910/lecture-9.pdf|Lecture 9]], [[attachment:SearchEnginesWS0910/lecture-10.pdf|Lecture 10]], [[attachment:SearchEnginesWS0910/lecture-11.pdf|Lecture 11]], [[attachment:SearchEnginesWS0910/lecture-12.pdf|Lecture 12]], [[attachment:SearchEnginesWS0910/lecture-13.pdf|Lecture 13]]. |
Line 5: | Line 5: |
Here are .lpd files of the recordings of the lectures so far (except Lecture 2, where we had problems with the microphone): [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-1.lpd|Recording Lecture 1]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-3.lpd|Recording Lecture 3]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-4.lpd|Recording Lecture 4]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-5.lpd|Recording Lecture 5 (no audio)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-6.lpd|Recording Lecture 6 (with audio for a change)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-7.avi|Recording Lecture 7 (AVI)]]. | Here are the recordings of the lectures so far (except Lecture 2, where we had problems with the microphone), LPD = Lecturnity recording: [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-1.lpd|Recording Lecture 1 (LPD)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-3.lpd|Recording Lecture 3 (LPD)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-4.lpd|Recording Lecture 4 (LPD)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-5.lpd|Recording Lecture 5 (LPD without audio)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-6.lpd|Recording Lecture 6 (LPD)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-7.avi|Recording Lecture 7 (AVI)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-8.avi|Recording Lecture 8 (AVI)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-9.avi|Recording Lecture 9 (AVI)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-10.avi|Recording Lecture 10 (AVI)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-11.avi|Recording Lecture 11 (AVI)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-12.avi|Recording Lecture 12 (AVI)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-13.avi|Recording Lecture 13 (AVI)]]. To play the Lecturnity recordings (.lpd files) you need the [[http://www.lecturnity.de/de/download/lecturnity-player|Lecturnity Player, which you can download here]]. I put the Camtasia recordings as .avi files, which you can play with any ordinary video player; I would recommend [[http://www.videolan.org/vlc|VLC]]. |
Line 7: | Line 7: |
Here are PDFs of the exercise sheets so far: [[attachment:SearchEnginesWS0910/exercise-1.pdf|Exercise Sheet 1]], [[attachment:SearchEnginesWS0910/exercise-2.pdf|Exercise Sheet 2]], [[attachment:SearchEnginesWS0910/exercise-3.pdf|Exercise Sheet 3]], [[attachment:SearchEnginesWS0910/exercise-4.pdf|Exercise Sheet 4]], [[attachment:SearchEnginesWS0910/exercise-5.pdf|Exercise Sheet 5]], [[attachment:SearchEnginesWS0910/exercise-6.pdf|Exercise Sheet 6]], [[attachment:SearchEnginesWS0910/exercise-7.pdf|Exercise Sheet 7]]. | Here are PDFs of the exercise sheets so far: [[attachment:SearchEnginesWS0910/exercise-1.pdf|Exercise Sheet 1]], [[attachment:SearchEnginesWS0910/exercise-2.pdf|Exercise Sheet 2]], [[attachment:SearchEnginesWS0910/exercise-3.pdf|Exercise Sheet 3]], [[attachment:SearchEnginesWS0910/exercise-4.pdf|Exercise Sheet 4]], [[attachment:SearchEnginesWS0910/exercise-5.pdf|Exercise Sheet 5]], [[attachment:SearchEnginesWS0910/exercise-6.pdf|Exercise Sheet 6]], [[attachment:SearchEnginesWS0910/exercise-7.pdf|Exercise Sheet 7]], [[attachment:SearchEnginesWS0910/exercise-8.pdf|Exercise Sheet 8]], [[attachment:SearchEnginesWS0910/exercise-9.pdf|Exercise Sheet 9]], [[attachment:SearchEnginesWS0910/exercise-10.pdf|Exercise Sheet 10]], [[attachment:SearchEnginesWS0910/exercise-11.pdf|Exercise Sheet 11]], [[attachment:SearchEnginesWS0910/exercise-12.pdf|Exercise Sheet 12]], [[attachment:SearchEnginesWS0910/exercise-13.pdf|Exercise Sheet 13]]. |
Line 9: | Line 9: |
Here are your solutions and comments on the previous exercise sheets: [[SearchEnginesWS0910/ExerciseSheet1|Solutions and Comments 1]], [[SearchEnginesWS0910/ExerciseSheet2|Solutions and Comments 2]], [[SearchEnginesWS0910/ExerciseSheet3|Solutions and Comments 3]], [[SearchEnginesWS0910/ExerciseSheet4|Solutions and Comments 4]], [[SearchEnginesWS0910/ExerciseSheet5|Solutions and Comments 5]], [[SearchEnginesWS0910/ExerciseSheet6|Solutions and Comments 6]]. | Here are your solutions and comments on the previous exercise sheets: [[SearchEnginesWS0910/ExerciseSheet1|Solutions and Comments 1]], [[SearchEnginesWS0910/ExerciseSheet2|Solutions and Comments 2]], [[SearchEnginesWS0910/ExerciseSheet3|Solutions and Comments 3]], [[SearchEnginesWS0910/ExerciseSheet4|Solutions and Comments 4]], [[SearchEnginesWS0910/ExerciseSheet5|Solutions and Comments 5]], [[SearchEnginesWS0910/ExerciseSheet6|Solutions and Comments 6]], [[SearchEnginesWS0910/ExerciseSheet7|Solutions and Comments 7]], [[SearchEnginesWS0910/ExerciseSheet8|Solutions and Comments 8]], [[SearchEnginesWS0910/ExerciseSheet9|Solutions and Comments 9]], [[SearchEnginesWS0910/ExerciseSheet10|Solutions and Comments 10]], [[SearchEnginesWS0910/ExerciseSheet11|Solutions and Comments 11]], [[SearchEnginesWS0910/ExerciseSheet12|Solutions and Comments 12]]. |
Line 11: | Line 11: |
= Exercise Sheet 7 = The recordings of all lectures are now available, see above. Lecture 2 is missing because we had technical problems there. To play the Lecturnity recordings (.lpd files) you need the [[http://www.lecturnity.de/de/download/lecturnity-player|Lecturnity Player, which you can download here]]. I put the Camtasia recordings as .avi files, which you can play with any ordinary video player; I would recommend [[http://www.videolan.org/vlc|VLC]]. |
Here are our master solutions: [[attachment:SearchEnginesWS0910/solution-midterm.pdf|Master solution for Mid-Term Exam]],[[attachment:SearchEnginesWS0910/solution-9.pdf|Master solution for Exercise Sheet 9]], [[attachment:SearchEnginesWS0910/solution-10.pdf|Master solution for Exercise Sheet 10]]. |
Line 17: | Line 15: |
[[SearchEnginesWS0910/ExerciseSheet7|Here you can upload your solutions for Exercise Sheet 7]]. | [[SearchEnginesWS0910/MidTermExam|Here is everything about the mid-term exam]]. The final exam is on Friday March 12, 2010. The written exam begins at 2.00 pm in HS 026. The oral exams are scheduled on the same day. |
Line 19: | Line 17: |
== Questions or comments below this line, most recent on top please == | [[SearchEnginesWS0910/ExerciseSheet13|Here is the table with the links to your uploaded solutions for Exercise Sheet 13]]. The deadline is Thursday 11Feb10 16:00. |
Line 21: | Line 19: |
Hi Dragos + all, it is ok to treat them as one. '''Hannah 8Dec09 14:07''' | == Questions and comments about Exercise Sheet 13 below this line (most recent on top) == Hello. I was looking at the pseudo-code from the Information Retrieval Book (for the first exercise) and it says that the complexity of the algorithm is O(n^2*log n) if the insert and delete operations for the priority queue are done in O(log n)... which makes sense. The problem is that they use a delete(element) operation, which as I have looked in the implementations for priority queues in Java or Perl have a complexity greater then log(n) (commonly .. O(n)). I don't really see how to do it in log(n) time. '''Dragos 9th of February, 23.33''' |
Line 23: | Line 22: |
Is it OK to have treated exercises 2-4 as one?(provide the same html, css, js file for all) or should we split it in 3 ? '''Dragos 8Dec09 10:53''' | @Marjan: The implementation of the lecture would work for n=10^4^ (I have 2GB RAM) but it takes a very long time. The memory exception is thrown when I create the matrix C of the pseudo code from the information retrieval book, but since the priority queue contains the same number of elements as the matrix I think there is not much difference. '''Florian 8Feb 21:29''' |
Line 25: | Line 24: |
Hi Claudius + all, I mean the total size of all the inverted lists for the completions of the given prefix. '''Hannah 7Dec09 17:58''' | Another small observation: clearly, if n = 10^5^, then you have n^2^ = 10^10^ entries in your SIM matrix. Say each entry has size of 4 bytes (float), this makes it 37.2GB in total! Obviously you can't go above n=10^4^. '''Marjan 8Feb 21:09''' |
Line 27: | Line 26: |
In Exercise 1, what do you mean with "total size of the inverted lists"? The size of all terms in the whole index or only the size of all terms, matching the prefix? '''Claudius 7Dec09 17:20''' | Hi Florian. I have two questions. First, how far in terms of n can you go with the implementation from the lecture? Second, if the difference is significant, could you give a short but precise statement where does most of the memory in your algorithm get consumed? Is it the similarity matrix or the priority queue? '''Marjan 8Feb10 21:02''' |
Line 29: | Line 28: |
Hi Björn, any simple solution for this case is ok. That is, either ignore it, or introduce a cut-off value of, say, 100 or 500, and only return that many completions at most. '''Hannah 7Dec09 13:29''' | After implementing the algorithm from the information retrieval book for exercise 1 I already get an out of memory exception for n=10^4^. Just having the values for n=10^1^ ... 10^3^ is not much to see if T/n^2^ is indeed logarithmic, but is it enough for the exercise? '''Florian 8Feb10 20:40''' |
Line 31: | Line 30: |
To Björn + all: At least I don't see a big problem here. Short prefixes like "a" do not make much sense anyway and will return a large fraction of all documents in the collection. Still, there is a chance that a higher authority does not agree with me :) '''Marjan 7Dec 11:59''' | '''FYI: on Thursday, we will have 45 minutes of regular lecture (16.15 - 17.00 h), and then the intro lecture for the bachelor / master's project (17.00 - 17.45 h). Those who are interested in doing a project, can just stay, the others may go then.''' '''Hannah 8Feb10 14:08''' |
Line 33: | Line 32: |
There are prefixes (1 char) that return far over 50K results / completions. Although the query is answered within milliseconds (if i look at what firebug says), firefox just does not seem to execute the javascript (or refuse to render the site... i can't tell at the moment). Anyways, while some prefixes, like "q", work. Others, like "a", don't. A result will only apear once I enter a second char (and make it "an", "ab" or something) in those cases. Do I have to worry or try top find a fix, or can ignore that to get points? For the other steps (esp. ex 4), I don't think it makes sense to restrict the number of returned completions. If I had to solve it for a "real" application I might want to make my server provide many services (show x hits, and show the x hits relevant for the correct sorting of the table, etc). If we don't have to do something like this, I would like to avoid the effort for the exercise. '''Björn 7.12. 11:07 ''' |
Hi Florian + all: yes, it's possible for all the four heuristics we have discussed in the lecture, including complete-link. Conceptually, you have to maintain the n x n similarity matrix SIM, where SIM[i][j] gives you the similarity of the current cluster i with the current cluster j. In each iteration, we consider only those entries SIM[i][j] where i and j are cluster represantatives (that is, they are the smallest index of an element in their cluster) and where i != j (we never want to merge a cluster with itself). Initially each point is a cluster on its own and the SIM[i][j] are just the pairwise point similarities, as in the code we wrote together in the lecture. Now in each iteration, we need to find that pair (i,j) of cluster representative for which SIM[i][j] is maximal. In the lecture we did it the trivial way, by a double loop. That requires quadratic time per iteration, giving cubic time overall. For the exercise, you shoudl instead use one priority queue per cluster representative i to store all SIM[i][j] for all valid cluster representatives j. Then for each i you get the argmax_j SIM[i][j] in logarithmic time instead of in linear time, and doing this for each i you get the argmax_(i,j) SIM[i][j] in n * log n time instead of in n^2 time per iteration. Once you have found the pair of clusters to merge, you have to update both SIM[i][j] and the priority queues appropriately. The former can be done in linear time per iteration and the latter can be done with a linear number of priority queue operations. BTW, you find pseudo-code of the algorithm in the information retrieval book: http://nlp.stanford.edu/IR-book/html/htmledition/time-complexity-of-hac-1.html maybe that helps you, too. '''Hannah 7Feb10 20:49''' |
Line 36: | Line 34: |
Spplamental: Suddenly, it works with IE as well... Even the find-statements. Oh glorious java-god, who art in heaven, why have you brought us this freakin' language...? ;) '''Marius 12/06/2009 10:19 p.m.''' | Hi, in exercise 1, is it really possible to implement the complete-link heuristic with a priority queue so that the finding of the next two clusters which have to be merged only needs logarithmic time? If I did it right, then we have to fill the priority queue with the possible pairs of clusters and use their similarity with the complete-link heuristic to sort the queue. But when two clusters are merged now, then the distances of the combined cluster (i.e. the one with the smaller index) to the other clusters can change, so after each merge the priority queue has to be rebuilt to ensure it's heap property. But a complete rebuild of the queue is quite time consuming, or am I doing anything wrong? '''Florian 7Feb10 20:09''' |
Line 38: | Line 36: |
Hi, it is remarkable, too, that IE seems to have some problems with even performing the $.GET command. After I spent hours to get my Webserver to return the HTML and script-files as well as to handle the prefix request (because of the OPTIONS-HTTP-request that was sent by firefox but worked very well with IE), IE now seems to be unwilling to perform the request (JScript-error: "Schnittstelle nicht unterstützt."). I hope, FF won't have any problems from now on... '''Marius 12/06/2009 09:52 p.m.''' | Hi Mirko, you are free to modify the existing program (linked to in my first comment below), or write the program from scratch (or transcribe my C++ program) in a programming language of your choice. '''Hannah 7Feb10 16:40''' |
Line 40: | Line 38: |
Hi Florian, sure, creative solutions are always welcome, but if you ask me an xsl stylesheet is significantly more work then parsing the xml with jQuery. But as you like. '''Hannah 6Dec 21:21''' | Hi, one question about Exercise 1: the text says "Modify the code", does this mean we have to extend the C++-program, or are we also allowed to write the complete program in an other language like Java? '''Mirko 7Feb10 16:33''' |
Line 42: | Line 40: |
Can we use a xsl tranformation to get the html table out of our xml result? '''Florian 6Dec09 21:08''' | I can't say now with certainty which courses exactly I will offer in a year. It is likely though, that we will offer bachelor / master projects on a regular basis. '''Hannah 5Feb10 21:17''' |
Line 44: | Line 42: |
Yes the backend is also a/ the http server. '''Johannes 2009-12-06T2047L''' | Yes, Marjan is right. '''Johannes 2010-02-05:1931''' |
Line 46: | Line 44: |
Hi Johannes, how do you communicate with the backend then (the program that provides the contents for the table)? Is that also part of your java server, that is, does it play the role of a web server and a backend simultaneously? That would be one way of doing it, too. '''Hannah 6Dec09 16:49''' | Probably Johannes wants/has to do his project next year. '''Marjan 5Feb10 18:32''' |
Line 48: | Line 46: |
Just let your java server provide the css, js and html too. Works for me and is easy to do in java. '''Johannes 2009-12-06T1114L''' | Hi Johannes, I don't quite understand your comment, what do you mean? '''Hannah 5Feb10 17:33''' |
Line 50: | Line 48: |
I didn't know that jQuery's ''find'' does not work on Internet Explorer, and I am actually surprised to hear that. It somewhat shatters my previous belief that jQuery just works on any of the major browsers (all of which implement JavaScript a little differently, which makes the use of raw JavaScript so cumbersome). I will try to find(!) out why that is so. Sorry, if you had trouble because of this, but well, that's (web application writing) life. '''Hannah 6Dec09 0:26''' | Even if it was in one year? '''Johannes 2010-02-05:1719''' |
Line 52: | Line 50: |
In the lecture, all the files prefix-search.html, prefix-search.js, prefix-search.css, and prefix-search.php were served by an Apache web sever running on one and the same machine ''stromboli.informatik.uni-freiburg.de''. The $.(get) in the prefix-search.js was sending the query to the prefix-search.php. As Björn pointed out, Firefox asks that the html (which is what the user loads by typing the URL or clicking a link, and which in turn loads the js) be served via port 80 by a machine on the same domain as the prefix-search.php. For our machine ''domain'' refers to ''uni-freiburg.de'', that is, the php could have been located on any other machine with a URL ending in ''uni-freiburg.de'', too. Otherwise, you get a so-called ''cross-scripting'' error. This is *not* part of the JavaScript standard, however, and different browsers implement it differently. This is also what Manuela found. I leave it to you how you get around the cross-scripting problem. The preferred solution is to have all files served by web servers on machine on the same domain, as just explained. If you find other solutions that work, that is also fine, but please explain what led you to this solution, just like Manuela did below. '''Hannah 06Dec09 0:22''' | If you are interested in doing the Bachelor / Master's project (following this course as a block course), please send me an email asap: bast@informatik.uni-freiburg.de . '''Hannah 4Deb10 19:30''' |
Line 54: | Line 52: |
From what a fellow student told me in the lecture (thanks, alex) the problem with GETing the javascript comes from the fact that (for security issues) the HttpXmlRequest is only allowed be send to a ressource on the same domain that you got the HTML from. Firefox turns it into an OPTIONS request. This might also be the reason why it worked in the lecture where the html and the php were both served by the same apache, but does not work if your html is not on the apache, too (Also explains the observations posted below). Personally, I'm planning on letting my webserver provide all, the html, css, js (by letting it return files from a subfolder depending on the path in the GET request) and the xml if the GET request does not start with a prefix for that folder. Otherwise it should work if you do it just as we did in the lecture and have HTML (+ css + js) and PHP in your apache's folder. I haven't started yet but I can let you know if this works for me. Anyway, IF it does, credit goes to Alexander Gutjahr who told about this javascript issue, of course. '''Björn 05Dec 22:12''' I'm a bit confused about the exercise. For exercise 1 I extended the Java webserver from exercise sheet 2 with the prefix search of the last exercise sheet. The webserver returns the results of the prefix search as a XML document. Should I have used an webserver like apache? But I also had some problems with sending the JQuery request to the server. The webserver runs on port 80. I started with Firefox. Firefox sends an OPTIONS request to the server and so the JQuery get-function doesn't work. The same happend as I used Google Chrome. Because the Java webserver can't handle PHP I can't do it like in the lecture. So I tried Internet Explorer and this browser sends a GET request by using the JQuery get-function. I assumed I can follow with the exercise, but though I did it like in the lecture, nothing happend. I used the alert-function to check that I really get the XML document from the server (and I got it). Now I know, that the find-function doesn't work with Internet Explorer. After this I tried Safari. Safari sends a GET-request and also the find-function works. Now I can follow and build the tables like described on the exercise sheet. Is it OK to go on like that? '''Manuela 05Dec09 19:24''' Hi Alex, can you be more specific about what exactly did not work for you and what you had to do to make it work? In particular, what do you mean by "the server directory"? Do you mean apache's document root? Then where have your files been before? In a subdirectory of the root? And what do you mean by a GET request being turned into an OPTIONS request, and how did you arrive at the conclusion that this is what happens? It should not matter if the .php file is in a different directory than the .js file. My feeling is that your problem lies elsewhere, but it's hard to tell from the information you gave so far. '''Hannah 05Dec09 18:02''' @whom it may concern: for me the access-rights stuff did not work exactly as in the lecture - i had to move the whole site (.html .js ...) into the server directory. Maybe it's new to firefox 3.5 but i could not access any file on the server from a .js not being in the server directory - it always turned my GET-Requests into OPTIONS-Requests and nothing was returned - so the php-solution does not seem to work, even if my server was able to execute php. Were we supposed to do it like this anyway or is it completely wrong this way?.. '''alex 5Dec09 17:56''' Ok, the recording of Lecture 7 is now available as AVI. But beware, it's quite big: around 300 MB. '''Hannah 3Dec09 22:46''' To play the .camrec recording you need the full Camtasia Studio (you can download a 30-day test version if you want). I will soon upload an .avi version instead. '''Hannah 3Dec09 21:56''' For your reference and convenience, here is a [[attachment:prefix-search.tar|tar archive of the files which we wrote together in Lecture 7 (prefix-search.html, prefix-search.css, prefix-search.js, prefix-search.php)]]. '''Hannah 3Dec09 21:35''' |
Here is the [[attachment:HierarchicalClustering.cpp|hierarchical clustering code]] which we have written in the lecture. '''Hannah 2Feb10 18:47''' |
Welcome to the Wiki page of the course Search Engines, WS 2009 / 2010. Lecturer: Hannah Bast. Tutorials: Marjan Celikik. Course web page: click here.
Here are PDFs of the slides of the lectures so far: Lecture 1, Lecture 2, Lecture 3, Lecture 4, Lecture 5, Lecture 6, Lecture 7, Lecture 8, Lecture 9, Lecture 10, Lecture 11, Lecture 12, Lecture 13.
Here are the recordings of the lectures so far (except Lecture 2, where we had problems with the microphone), LPD = Lecturnity recording: Recording Lecture 1 (LPD), Recording Lecture 3 (LPD), Recording Lecture 4 (LPD), Recording Lecture 5 (LPD without audio), Recording Lecture 6 (LPD), Recording Lecture 7 (AVI), Recording Lecture 8 (AVI), Recording Lecture 9 (AVI), Recording Lecture 10 (AVI), Recording Lecture 11 (AVI), Recording Lecture 12 (AVI), Recording Lecture 13 (AVI). To play the Lecturnity recordings (.lpd files) you need the Lecturnity Player, which you can download here. I put the Camtasia recordings as .avi files, which you can play with any ordinary video player; I would recommend VLC.
Here are PDFs of the exercise sheets so far: Exercise Sheet 1, Exercise Sheet 2, Exercise Sheet 3, Exercise Sheet 4, Exercise Sheet 5, Exercise Sheet 6, Exercise Sheet 7, Exercise Sheet 8, Exercise Sheet 9, Exercise Sheet 10, Exercise Sheet 11, Exercise Sheet 12, Exercise Sheet 13.
Here are your solutions and comments on the previous exercise sheets: Solutions and Comments 1, Solutions and Comments 2, Solutions and Comments 3, Solutions and Comments 4, Solutions and Comments 5, Solutions and Comments 6, Solutions and Comments 7, Solutions and Comments 8, Solutions and Comments 9, Solutions and Comments 10, Solutions and Comments 11, Solutions and Comments 12.
Here are our master solutions: Master solution for Mid-Term Exam,Master solution for Exercise Sheet 9, Master solution for Exercise Sheet 10.
Here are the rules for the exercises as explained in Lecture 2.
Here is everything about the mid-term exam. The final exam is on Friday March 12, 2010. The written exam begins at 2.00 pm in HS 026. The oral exams are scheduled on the same day.
Here is the table with the links to your uploaded solutions for Exercise Sheet 13. The deadline is Thursday 11Feb10 16:00.
Questions and comments about Exercise Sheet 13 below this line (most recent on top)
Hello. I was looking at the pseudo-code from the Information Retrieval Book (for the first exercise) and it says that the complexity of the algorithm is O(n^2*log n) if the insert and delete operations for the priority queue are done in O(log n)... which makes sense. The problem is that they use a delete(element) operation, which as I have looked in the implementations for priority queues in Java or Perl have a complexity greater then log(n) (commonly .. O(n)). I don't really see how to do it in log(n) time. Dragos 9th of February, 23.33
@Marjan: The implementation of the lecture would work for n=104 (I have 2GB RAM) but it takes a very long time. The memory exception is thrown when I create the matrix C of the pseudo code from the information retrieval book, but since the priority queue contains the same number of elements as the matrix I think there is not much difference. Florian 8Feb 21:29
Another small observation: clearly, if n = 105, then you have n2 = 1010 entries in your SIM matrix. Say each entry has size of 4 bytes (float), this makes it 37.2GB in total! Obviously you can't go above n=104. Marjan 8Feb 21:09
Hi Florian. I have two questions. First, how far in terms of n can you go with the implementation from the lecture? Second, if the difference is significant, could you give a short but precise statement where does most of the memory in your algorithm get consumed? Is it the similarity matrix or the priority queue? Marjan 8Feb10 21:02
After implementing the algorithm from the information retrieval book for exercise 1 I already get an out of memory exception for n=104. Just having the values for n=101 ... 103 is not much to see if T/n2 is indeed logarithmic, but is it enough for the exercise? Florian 8Feb10 20:40
FYI: on Thursday, we will have 45 minutes of regular lecture (16.15 - 17.00 h), and then the intro lecture for the bachelor / master's project (17.00 - 17.45 h). Those who are interested in doing a project, can just stay, the others may go then. Hannah 8Feb10 14:08
Hi Florian + all: yes, it's possible for all the four heuristics we have discussed in the lecture, including complete-link. Conceptually, you have to maintain the n x n similarity matrix SIM, where SIM[i][j] gives you the similarity of the current cluster i with the current cluster j. In each iteration, we consider only those entries SIM[i][j] where i and j are cluster represantatives (that is, they are the smallest index of an element in their cluster) and where i != j (we never want to merge a cluster with itself). Initially each point is a cluster on its own and the SIM[i][j] are just the pairwise point similarities, as in the code we wrote together in the lecture. Now in each iteration, we need to find that pair (i,j) of cluster representative for which SIM[i][j] is maximal. In the lecture we did it the trivial way, by a double loop. That requires quadratic time per iteration, giving cubic time overall. For the exercise, you shoudl instead use one priority queue per cluster representative i to store all SIM[i][j] for all valid cluster representatives j. Then for each i you get the argmax_j SIM[i][j] in logarithmic time instead of in linear time, and doing this for each i you get the argmax_(i,j) SIM[i][j] in n * log n time instead of in n^2 time per iteration. Once you have found the pair of clusters to merge, you have to update both SIM[i][j] and the priority queues appropriately. The former can be done in linear time per iteration and the latter can be done with a linear number of priority queue operations. BTW, you find pseudo-code of the algorithm in the information retrieval book: http://nlp.stanford.edu/IR-book/html/htmledition/time-complexity-of-hac-1.html maybe that helps you, too. Hannah 7Feb10 20:49
Hi, in exercise 1, is it really possible to implement the complete-link heuristic with a priority queue so that the finding of the next two clusters which have to be merged only needs logarithmic time? If I did it right, then we have to fill the priority queue with the possible pairs of clusters and use their similarity with the complete-link heuristic to sort the queue. But when two clusters are merged now, then the distances of the combined cluster (i.e. the one with the smaller index) to the other clusters can change, so after each merge the priority queue has to be rebuilt to ensure it's heap property. But a complete rebuild of the queue is quite time consuming, or am I doing anything wrong? Florian 7Feb10 20:09
Hi Mirko, you are free to modify the existing program (linked to in my first comment below), or write the program from scratch (or transcribe my C++ program) in a programming language of your choice. Hannah 7Feb10 16:40
Hi, one question about Exercise 1: the text says "Modify the code", does this mean we have to extend the C++-program, or are we also allowed to write the complete program in an other language like Java? Mirko 7Feb10 16:33
I can't say now with certainty which courses exactly I will offer in a year. It is likely though, that we will offer bachelor / master projects on a regular basis. Hannah 5Feb10 21:17
Yes, Marjan is right. Johannes 2010-02-05:1931
Probably Johannes wants/has to do his project next year. Marjan 5Feb10 18:32
Hi Johannes, I don't quite understand your comment, what do you mean? Hannah 5Feb10 17:33
Even if it was in one year? Johannes 2010-02-05:1719
If you are interested in doing the Bachelor / Master's project (following this course as a block course), please send me an email asap: bast@informatik.uni-freiburg.de . Hannah 4Deb10 19:30
Here is the hierarchical clustering code which we have written in the lecture. Hannah 2Feb10 18:47