Exercise Sheet 12
The rules for uploading are the same as always. If you forgot them, you can read them again here.
Here is the file from Exercise Sheet 11 which you also needed for Exercise Sheet 12. It's a text file, where each line contains the name of the conference (in capital letters), followed by a TAB (ASCII code 9), followed by the title. There are three different conferences: STOC (2423 titles), SIGIR (2372 titles), and SIGGRAPH (1835 titles). The total number of titles / lines is 6630. The exact file size is 454365 bytes.
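For a quick sanity check of the file format, here is a minimal Python sketch (not part of the exercise sheet; the file name dblp.txt is taken from the discussion below) that reads the tab-separated lines and counts the titles per conference:

    from collections import Counter

    counts = Counter()
    with open("dblp.txt") as f:              # one line = CONFERENCE \t title
        for line in f:
            conference, title = line.rstrip("\n").split("\t", 1)
            counts[conference] += 1

    print(counts)                            # expected: STOC 2423, SIGIR 2372, SIGGRAPH 1835
    print(sum(counts.values()))              # expected: 6630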
Your solutions (files can only be read by the uploader and by us)
Name | Solution (PDF) | Code (ZIP or TGZ)
These were your questions and comments about Exercise Sheet 12
Hi Dragos, my answer comes a bit late, but maybe it helps you to read the proof sketch of Theorem 4.1 of the NIPS paper by Jon Kleinberg: http://www.cs.cornell.edu/home/kleinber/nips15.pdf . There the same thing is said in different words. Hannah 4Feb10 14:24
Hello. I have been thinking over exercise 4 for a couple of hours now, and I still can't fully understand it. This is probably because I can't see where the trick is. Is it the fact that we shrink distances in X1 and X2, but keep all other distances the same? Another thing is that I can't picture how it could be that, by doing so, the centroid that was in Y is now in X1, while the distances between the two are preserved!? I'm probably confused right now ... but maybe you could somehow explain what happens in a clearer way. But I'm not sure that it is possible :). Dragos 10.45 04 February
Here is a more concrete hint for Exercise 4: For X = X1 u X2 pick m elements with distance <= r to each other. For Y pick gamma * m points with distance <= epsilon < r to each other, for some small gamma > 0. The distance from any point in X to any point in Y is at least r + delta, for some small delta > 0. Now choose gamma, r, epsilon, and delta appropriately so that you can show what was formulated in the other hint. Hannah 3Feb10 23:10
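To make this construction more tangible, here is a small numerical sketch (my own illustration, not part of the official hint). It builds a toy distance matrix with exactly this structure and brute-forces the best 2-clustering, using a medoid-style RSS in which the two centroids are chosen from the point set; the concrete numbers (m = 10 per subset, gamma * m = 4, r = 1, epsilon = delta = 0.1) are just one possible choice, and the exercise sheet may define the objective slightly differently:

    from itertools import combinations

    def build_distances(shrunk):
        """Distance matrix for X1 (10 points), X2 (10 points), Y (4 points).
        Within X1 and within X2: r = 1 before shrinking, 0.01 afterwards;
        between X1 and X2: 1 (unchanged); within Y: epsilon = 0.1;
        between any X point and any Y point: r + delta = 1.1 (unchanged)."""
        n = 24
        X1, X2, Y = range(0, 10), range(10, 20), range(20, 24)
        within_x = 0.01 if shrunk else 1.0
        d = [[0.0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                if (i in X1 and j in X1) or (i in X2 and j in X2):
                    d[i][j] = within_x
                elif i in Y and j in Y:
                    d[i][j] = 0.1
                elif i in Y or j in Y:
                    d[i][j] = 1.1
                else:                          # one point in X1, the other in X2
                    d[i][j] = 1.0
        return d

    def best_two_clustering(d):
        """Try every pair of centroids (chosen from the point set), assign each
        point to the nearer one, and return the minimum sum of squared distances
        together with the two centroid indices (0-19 = X, 20-23 = Y)."""
        n = len(d)
        best = None
        for c1, c2 in combinations(range(n), 2):
            rss = sum(min(d[i][c1], d[i][c2]) ** 2 for i in range(n))
            if best is None or rss < best[0]:
                best = (rss, c1, c2)
        return best

    print("before shrinking:", best_two_clustering(build_distances(False)))
    print("after shrinking: ", best_two_clustering(build_distances(True)))

With these numbers, the best pair of centroids before the shrinking separates X from Y; after shrinking the distances within X1 and within X2, the best pair has one centroid in X1 and one in X2, with Y absorbed into one of the two clusters - which is exactly the behaviour the hint describes.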
Hi Björn + all: you are right, that is not well formulated. What I meant is that these are the optimal centroids, that is, the ones which (globally) minimize the RSS. With a suitable choice of initial centroids, 2-means will then converge towards these centroids and the corresponding clustering. So to reformulate the hint: Construct a point set with three subsets X1, X2 and Y. Then carefully define distances between all these points such that the optimal centroids are one in X1 u X2 and one in Y. But when the points within X1 and within X2 are moved closer together (with all other distances remaining the same), the optimal centroids are one in X1 and one in X2. Hannah 3Feb10 22:52
Hmm, regarding exercise 4: How does 2-means "pick a centroid" somewhere? Doesn't that always depend on the current situation / last centroid? I'm a bit confused. Are we supposed to find a movement (bringing points from those subsets X1, X2 and Y closer together) where points will be clustered differently according to the old centroids? Or a different clustering according to centroids that yield optimal RSS values (aren't they hard to find for non-trivial cases?)? Sorry for the confusion, but I don't seem to be able to figure it out by myself at the moment. Björn 3Feb10 21:56
Thanks for the response. Yes, I meant values of M, of course. I don't know why I wrote k - just a typo. Björn 3Feb10 20:58
BTW, I am really happy that you guys do not just mechanically do the exercises and write "my RSS value is X", but that you actually wonder what the results mean and whether things make sense. That is great research spirit, and that's how one learns the most! Hannah 3Feb10 20:23
Hi Björn + all: I think you made some good observations here. Indeed, the RSS becomes smaller as you make M larger, for exactly the reasons you explained. Which means that it doesn't really make sense to compare RSS values for different values of M. Note that the same would have happened in the vector space model, with each text represented as a normalized M-dimensional vector, where M = the dimension = number of features considered, and the similarity measure being the dot product = the cosine similarity. The components of M-dimensional normalized vectors also become smaller as M increases, and so does the cosine similarity (assuming that the number of non-zero components in a vector is more or less constant). But this does not matter; what is important is that for a *fixed* value of M, the RSS value reflects the quality of the clustering, and that it does. BTW, in what you wrote, you mean M = 6 or 7 and not k = 6 or 7, right? Your observation about a too large M having a negative influence on the clusters is also very reasonable. Hannah 3Feb10 20:19
I just made the following observation: Using what Marjan explained (RSS in fact gets higher, because JD is a similarity), I noticed that my RSS gets quite high for small/medium M (like 10), while it does not for larger M. Looking at the Jaccard distance / similarity, I observe that when one really large set (my centroid) is compared to a small one (a title), I necessarily get a huge union while the intersection can never exceed the size of the smaller set. Hence, I end up with low JD values and therefore a low RSS. I achieved the highest RSS results for k=6 or 7, which also seems to be a reasonable guess of the average title length. On the other hand, I am aware that the decrease in the RSS value affects all distances and the clustering might in fact get better. But how exactly is the RSS still meaningful like this? Thanks for any response, and I hope I didn't get anything completely wrong. PS: Too large M (1000+) also has a negative influence on the clustering. I assume that's because adding a word that occurs once in one title does not really make the text a better "middle". Björn 3.2. 20:02
Sorry Florian, it's a messy week for us. Now it's there. Marjan 03Feb10 18:39
The link to upload the solutions is once again missing. Florian 03Feb10 18:29
Hi Eric + all: Taking the average is fine. Marjan 3Feb10 18:15
I am not sure about this, but precision and recall are (contrary to RSS) values available for each cluster; that means every cluster has its own precision and recall. How shall we compute the classification's precision & recall? Taking the average? Or did I miss something? Eric 3Feb10 16:50
Hi Mirko, good question. The approach you suggest looks fine to me. An alternative, very simple, yet also reasonable approach would be to simply assign the text to a random centroid in this case. Hannah 2Feb10 20:42
Hi Johannes, I wouldn't call it a problem, since you have solved the exercise and made an interesting observation. Uncharacteristic clusters are indeed something that can happen. But it's of course an interesting question, and the one you are asking, I guess: how one can get a better clustering, and whether that is possible with k-means at all. Well, I don't know myself, since I haven't implemented it for this data set. But one thing I would certainly try in this situation is to start with centroids which are just the "average" (in the sense defined on the exercise sheet) of all titles from that class. This is of course not unsupervised anymore, but it would be interesting to see whether at least with such a strong hint k-means is able to find a good clustering. If yes, then what you observed would just be an unfavorable local maximum. If you try the choice of centroids I just explained, please report on the results, I am curious. Hannah 2Feb10 20:40
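For anyone who wants to try this suggestion, here is a short Python sketch of such a "supervised" initialization (my own naming; it assumes that the titles are represented as word sets, that the true labels are known from dblp.txt, and that M is the parameter from exercise 1):

    from collections import Counter, defaultdict

    def per_class_centroids(labels, title_word_sets, M):
        """One centroid per conference: the M most frequent words over all titles
        of that conference (the "average" in the sense of the exercise sheet)."""
        words_by_class = defaultdict(Counter)
        for label, words in zip(labels, title_word_sets):
            words_by_class[label].update(words)
        return {label: set(w for w, _ in counts.most_common(M))
                for label, counts in words_by_class.items()}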
To which cluster should we assign the texts which have JaccardDistance=0 for every centroid? In my implementation they just get assigned to the first centroid, but the thing is that, judging from the JaccardDistance, they belong to none of the centroids. This 'manipulates' the clusters. My idea is to just ignore these texts until there is a centroid with JaccardDistance > 0; is this okay? Mirko 2Feb10, 20:35
I have the problem that the clusters become too "average" in the first rounds, and no meaningful clustering happens from that point on. In the first round, the distance is 1.0 or 0.99 for almost every title-centroid pair. This leads to very uncharacteristic centroids. Johannes 2Feb10 20:30
Oh yes, what Marjan wrote is actually very important, and I didn't have it on my mind when I wrote the exercise sheet. K-means won't do anything particularly meaningful if in each assignment step you assign each point / record to the *farthest* centroid. And it makes sense that it alternates between two clusterings at some point: assuming that the centroids don't change anymore, after you assign a point to its *farthest* centroid, the centroid to which it was assigned before will be its farthest centroid in the next step. Cool stuff. Hannah 2Feb10 20:19
To Hannah: I tried 10000 for M and the RSS value is still sometimes increasing and sometimes decreasing. I'll change my program according to Marjan's observation now and report again... EDIT: OK, after changing this everything seems to make sense again, the RSS value is now constantly increasing until it does not change anymore! Florian 02Feb10 20:10
To all: Just a small but important observation. The "Jaccard distance" from the exercise sheet is in fact a similarity (between two texts), meaning that your RSS should in fact grow. This also means that, for each text, you should pick the cluster centroid with the maximum value of the Jaccard similarity. Marjan 02Feb10 19:46
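To illustrate this point, here is a minimal Python sketch of the assignment step (function and variable names are my own; centroids and titles are represented as sets of words, as discussed further down in this thread). Note the max, not min:

    def jaccard_similarity(words_a, words_b):
        """|intersection| / |union| of two word sets (0.0 if both are empty)."""
        union = words_a | words_b
        if not union:
            return 0.0
        return len(words_a & words_b) / len(union)

    def assign_to_centroid(title_words, centroids):
        """Return the index of the centroid with MAXIMUM Jaccard similarity."""
        return max(range(len(centroids)),
                   key=lambda k: jaccard_similarity(title_words, centroids[k]))

With this convention the "RSS" (now a sum of similarities) should increase from iteration to iteration, as Florian reports above after his fix.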
Sure, this is a meaningful way to pick the initial centroids, too. However, there is no guarantee at all that most SIGIR papers will be assigned to the cluster for which the initial centroid was a SIGIR paper, simply because the title of that paper might have been very untypical for SIGIR. So you might end up with very low precision and recall, although for another assignment of the clusters to the labels / classes, precision and recall might actually be very good. Hannah 2Feb10 19:32
Hi, is it also allowed to choose the initial centroids as follows: first centroid = random SIGGRAPH text, second centroid = random SIGIR text, third centroid = random STOC text? Then we would also know which centroid belongs to which class. Mirko 2Feb 19:27
Concerning the RSS values: I haven't written the program myself, but Marjan promised to do so soon. But here is some intuition. First, I would also output the RSS value divided by the number of titles (6630), and the square root of that. Let's call this number J. This gives you the average Jaccard distance of a text to its centroid. What is this J on average? Well, the number of words in your centroids quickly becomes M. In the perfect case, all the words of each text are contained in its nearest centroid. Then the Jaccard distance is text length / M, and so J should be average text length / M. Hannah 2Feb10 19:13
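As a rough plug-in example for the formula above (the average of about 8 distinct words per title is only an assumed ballpark figure): with M = 1000 the ideal case gives J ≈ 8 / 1000 = 0.008, hence RSS = 6630 * J^2 ≈ 0.4; with M = 100 the same estimate gives J ≈ 0.08 and RSS ≈ 42. So the absolute size of the RSS depends strongly on M, which is another way of seeing why RSS values for different values of M should not be compared directly (see Hannah's comment further up).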
About computing precision and recall: yes, you need an assignment of the clusters to the three classes SIGGRAPH, SIGIR, STOC. I would do that assignment as follows. Pick the first cluster in your numbering of clusters. Find the most frequent label in the cluster (you have the labels of all titles from dblp.txt). Assign the cluster to that label. Then pick the second cluster, and find the most frequent of the remaining labels, two in this case. And so on, where for three clusters there is no "and so on", because you have only one choice of label left for the third cluster. Once you have a 1-1 assignment from clusters to labels, it should be clear how to compute precision and recall (same definitions as in the case of classification). Hannah 2Feb10 18:57
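Here is a minimal Python sketch of this greedy assignment and of the precision/recall computation (my own illustration; `clusters` is assumed to be a list with one inner list of true labels per cluster, and the fallback for a cluster that contains none of the unused labels is my own addition):

    from collections import Counter

    def assign_labels(clusters, all_labels):
        """Greedy 1-1 assignment as described above: cluster i gets the most
        frequent label among the labels that have not been used yet.
        `clusters` is a list of lists of true labels, one inner list per cluster."""
        mapping, used = {}, set()
        for i, labels_in_cluster in enumerate(clusters):
            remaining = [l for l in labels_in_cluster if l not in used]
            if remaining:
                label = Counter(remaining).most_common(1)[0][0]
            else:  # cluster contains none of the unused labels: take any unused one
                label = sorted(set(all_labels) - used)[0]
            mapping[i] = label
            used.add(label)
        return mapping

    def precision_recall(clusters, mapping):
        """Precision and recall per assigned label (assumes non-empty clusters)."""
        class_size = Counter(l for labels in clusters for l in labels)
        result = {}
        for i, labels_in_cluster in enumerate(clusters):
            label = mapping[i]
            correct = sum(1 for l in labels_in_cluster if l == label)
            result[label] = (correct / len(labels_in_cluster),   # precision
                             correct / class_size[label])        # recall
        return result

For example, assign_labels([["STOC", "STOC", "SIGIR"], ["SIGIR"], ["SIGGRAPH"]], ["STOC", "SIGIR", "SIGGRAPH"]) maps cluster 0 to STOC, cluster 1 to SIGIR and cluster 2 to SIGGRAPH.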
I had negative steps (as asked below), until I found that I had an empty cluster. Fixing that removed the problem. Johannes 2010-02-02T18:34
To exercise 3: Can you give an intuition in what range the RSS value should be (approximately)? My RSS values seem very high to me... Additionally, I have the same question as Matthias a few lines ago: How can we calculate recall and precision without knowing which cluster belongs to which conference? Claudius 2Feb10 17:23
Thanks for the answer, Matthias, please also try with M = 10000 (if that is feasible time-wise, otherwise take M a little smaller) and report again, thanks. Hannah 2Feb10 16:43
To Hannah: Yes, for me it does happen for large Ms as well. I tried 10, 100, 1000 with the same qualitative effect. However, for larger M, the RSS value decreases in size. The RSS values are rising & falling in a very small range of about 5%. Do we have an error here, or is the given distance function + recalculation step simply not convergent? Matthias
To Matthias + Florian + all: Which values of M have you tried? Does it still happen when you choose M large enough, say 1000, or even larger? Note that the cluster centroids cannot be meaningful if M is not large enough. Hannah 2Feb10 16:21
I'm noticing exactly the same behavior as Florian. After a few iterations, the RSS value is alternating, especially when using small M. Also, the RSS values do not fall; they are more or less jumping around the same base value all the time.
Also, in order to calculate Recall + Precision, we have to define which of the clusters is which conference. How should we do that? Just by counting the most frequent conference in a cluster based on the known ground truth information? Matthias
To Alex: You should compute the k-means clustering on the whole set, that's the point of clustering anyway (no train sets). Marjan 02Feb10 14:39
Should we compute k-means on the complete set, or should we divide the set into a training and a test set, like in the exercise before? Alex 02Feb10 14:10
Hi Florian + all. That's a very valid question, although the behavior is counterintuitive to what we expected. To give a precise answer I'll have to write the program myself, but since I am in the middle of a deadline I can only promise that I will come back to it asap. My advice is to try to understand why the RSS value increases - either by debugging your own program or by working through a toy example by hand - and then to report here again. Marjan 01Feb10 22:41
The RSS values of my clustering often get higher after an iteration step; is that normal for this kind of data? Furthermore, my RSS value never stops changing; it always switches between two different values after a few iterations (I now just choose three random titles as the initial centroids). I also do not understand how I can compute the precision and recall values of the clustering with respect to the "perfect" clustering. Florian 01Feb10 22:41
To Florian, Eric + all: yes, sure, the centroids must be of the same type as the things you cluster. And since the things you should cluster are strings / texts (the titles from the dblp.txt from the last lecture), the centroids also have to be strings / texts. There are many ways you could pick the initial centroids. One way would be to just pick k random titles from dblp.txt as centroids. Hannah 1Feb10 20:55
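A minimal Python sketch of this way of initializing (the file name dblp.txt and the tab-separated format are from the sheet; representing a centroid as a set of lower-cased words is an assumption, following Florian's post below):

    import random

    def load_titles(path="dblp.txt"):
        """Read the tab-separated file: one (conference, title) pair per line."""
        labels, titles = [], []
        with open(path) as f:
            for line in f:
                conference, title = line.rstrip("\n").split("\t", 1)
                labels.append(conference)
                titles.append(title)
        return labels, titles

    def initial_centroids(titles, k=3, seed=None):
        """Pick k random titles and use their (lower-cased) word sets as centroids."""
        rng = random.Random(seed)
        return [set(title.lower().split()) for title in rng.sample(titles, k)]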
As I understand it, the centroids have to be strings as well, and the distance of a text to the centroid is just the Jaccard distance of their distinct words. The new centroids can then just be created by using the M most frequently occurring words in all texts which are assigned to this centroid, as stated in exercise 1. The only problem then would be to get the initial centroids... Florian 01Feb10 20:50
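A small Python sketch of that recomputation step (my own illustration; the titles assigned to a cluster are assumed to be given as word sets, and M is the parameter from exercise 1):

    from collections import Counter

    def recompute_centroid(assigned_titles, M):
        """New centroid = the M most frequently occurring words over all titles
        (given as word sets) that are currently assigned to this cluster."""
        counts = Counter()
        for words in assigned_titles:
            counts.update(words)
        return set(word for word, _ in counts.most_common(M))

Whether a word is counted once per title (as here, since titles are word sets) or with its full multiplicity is something the exercise sheet has to settle; the sketch follows the "distinct words" reading from the post above.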
As I understood it, we may choose random numbers between 0 and 1, since this is what the Jaccard distance will lead us to. But my next question is: how should we determine the distance of ONE text to its centroid? (while initially assigning them to clusters) Eric 01Feb10 20:38
Hi, how should we choose our initial centroids for the k-means algorithm with the strings? Florian 01Feb10 17:56