Exercise Sheet 13
The rules for uploading are the same as always. If you forgot them, you can read them again here.
Your solutions (files can only be read by the uploader and by us)
Name |
Solution (PDF) |
Code (ZIP or TGZ) |
These were the comments and questions for Exercise Sheet 13
Hi Dragos. Regarding exercise 4 - I am not sure what exactly you mean. The proof should be as formal as any other proof. Regarding exercise 5 - you're right and I have the same problem. One has to do the deletion in O(log n) but this is not provided by the standard priority queue implementation in C++, Java etc. Ideally one should do his own implementation of priority queue but then this would be more work, so I am not sure myself (the professor is still at Google). Marjan 10Feb10 12:06
My second question regards Exercise 4. Should we adopt a formal way to prove the two statements (like induction) or is it sufficient to explain in words what happens at each iteration and why the final result will be the one mentioned in the exercise. Dragos 10th of February, 10.53
Hello. I was looking at the pseudo-code from the Information Retrieval Book (for the first exercise) and it says that the complexity of the algorithm is O(n^2*log n) if the insert and delete operations for the priority queue are done in O(log n)... which makes sense. The problem is that they use a delete(element) operation, which as I have looked in the implementations for priority queues in Java or Perl have a complexity greater then log(n) (commonly .. O(n)). I don't really see how to do it in log(n) time. Dragos 9th of February, 23.33
@Marjan: The implementation of the lecture would work for n=104 (I have 2GB RAM) but it takes a very long time. The memory exception is thrown when I create the matrix C of the pseudo code from the information retrieval book, but since the priority queue contains the same number of elements as the matrix I think there is not much difference. Florian 8Feb 21:29
Another small observation: clearly, if n = 105, then you have n2 = 1010 entries in your SIM matrix. Say each entry has size of 4 bytes (float), this makes it 37.2GB in total! Obviously you can't go above n=104. Marjan 8Feb 21:09
Hi Florian. I have two questions. First, how far in terms of n can you go with the implementation from the lecture? Second, if the difference is significant, could you give a short but precise statement where does most of the memory in your algorithm get consumed? Is it the similarity matrix or the priority queue? Marjan 8Feb10 21:02
After implementing the algorithm from the information retrieval book for exercise 1 I already get an out of memory exception for n=104. Just having the values for n=101 ... 103 is not much to see if T/n2 is indeed logarithmic, but is it enough for the exercise? Florian 8Feb10 20:40
FYI: on Thursday, we will have 45 minutes of regular lecture (16.15 - 17.00 h), and then the intro lecture for the bachelor / master's project (17.00 - 17.45 h). Those who are interested in doing a project, can just stay, the others may go then. Hannah 8Feb10 14:08
Hi Florian + all: yes, it's possible for all the four heuristics we have discussed in the lecture, including complete-link. Conceptually, you have to maintain the n x n similarity matrix SIM, where SIM[i][j] gives you the similarity of the current cluster i with the current cluster j. In each iteration, we consider only those entries SIM[i][j] where i and j are cluster represantatives (that is, they are the smallest index of an element in their cluster) and where i != j (we never want to merge a cluster with itself). Initially each point is a cluster on its own and the SIM[i][j] are just the pairwise point similarities, as in the code we wrote together in the lecture. Now in each iteration, we need to find that pair (i,j) of cluster representative for which SIM[i][j] is maximal. In the lecture we did it the trivial way, by a double loop. That requires quadratic time per iteration, giving cubic time overall. For the exercise, you shoudl instead use one priority queue per cluster representative i to store all SIM[i][j] for all valid cluster representatives j. Then for each i you get the argmax_j SIM[i][j] in logarithmic time instead of in linear time, and doing this for each i you get the argmax_(i,j) SIM[i][j] in n * log n time instead of in n^2 time per iteration. Once you have found the pair of clusters to merge, you have to update both SIM[i][j] and the priority queues appropriately. The former can be done in linear time per iteration and the latter can be done with a linear number of priority queue operations. BTW, you find pseudo-code of the algorithm in the information retrieval book: http://nlp.stanford.edu/IR-book/html/htmledition/time-complexity-of-hac-1.html maybe that helps you, too. Hannah 7Feb10 20:49
Hi, in exercise 1, is it really possible to implement the complete-link heuristic with a priority queue so that the finding of the next two clusters which have to be merged only needs logarithmic time? If I did it right, then we have to fill the priority queue with the possible pairs of clusters and use their similarity with the complete-link heuristic to sort the queue. But when two clusters are merged now, then the distances of the combined cluster (i.e. the one with the smaller index) to the other clusters can change, so after each merge the priority queue has to be rebuilt to ensure it's heap property. But a complete rebuild of the queue is quite time consuming, or am I doing anything wrong? Florian 7Feb10 20:09
Hi Mirko, you are free to modify the existing program (linked to in my first comment below), or write the program from scratch (or transcribe my C++ program) in a programming language of your choice. Hannah 7Feb10 16:40
Hi, one question about Exercise 1: the text says "Modify the code", does this mean we have to extend the C++-program, or are we also allowed to write the complete program in an other language like Java? Mirko 7Feb10 16:33
I can't say now with certainty which courses exactly I will offer in a year. It is likely though, that we will offer bachelor / master projects on a regular basis. Hannah 5Feb10 21:17
Yes, Marjan is right. Johannes 2010-02-05:1931
Probably Johannes wants/has to do his project next year. Marjan 5Feb10 18:32
Hi Johannes, I don't quite understand your comment, what do you mean? Hannah 5Feb10 17:33
Even if it was in one year? Johannes 2010-02-05:1719
If you are interested in doing the Bachelor / Master's project (following this course as a block course), please send me an email asap: bast@informatik.uni-freiburg.de . Hannah 4Deb10 19:30
Here is the hierarchical clustering code which we have written in the lecture. Hannah 2Feb10 18:47