Welcome to the Wiki page of the course Search Engines, WS 2009 / 2010. Lecturer: Hannah Bast. Tutorials: Marjan Celikik. Course web page: click here.
Here are PDFs of the slides of the lectures so far: Lecture 1, Lecture 2, Lecture 3, Lecture 4, Lecture 5, Lecture 6, Lecture 7, Lecture 8, Lecture 9, Lecture 10, Lecture 11, Lecture 12.
Here are the recordings of the lectures so far (except Lecture 2, where we had problems with the microphone), LPD = Lecturnity recording: Recording Lecture 1 (LPD), Recording Lecture 3 (LPD), Recording Lecture 4 (LPD), Recording Lecture 5 (LPD without audio), Recording Lecture 6 (LPD), Recording Lecture 7 (AVI), Recording Lecture 8 (AVI), Recording Lecture 9 (AVI), Recording Lecture 10 (AVI), Recording Lecture 11 (AVI), Recording Lecture 12 (AVI).
Here are PDFs of the exercise sheets so far: Exercise Sheet 1, Exercise Sheet 2, Exercise Sheet 3, Exercise Sheet 4, Exercise Sheet 5, Exercise Sheet 6, Exercise Sheet 7, Exercise Sheet 8, Exercise Sheet 9, Exercise Sheet 10, Exercise Sheet 11, Exercise Sheet 12, Exercise Sheet 13.
Here are your solutions and comments on the previous exercise sheets: Solutions and Comments 1, Solutions and Comments 2, Solutions and Comments 3, Solutions and Comments 4, Solutions and Comments 5, Solutions and Comments 6, Solutions and Comments 7, Solutions and Comments 8, Solutions and Comments 9, Solutions and Comments 10, Solutions and Comments 11.
Here are our master solutions: Master solution for Mid-Term Exam, Master solution for Exercise Sheet 9, Master solution for Exercise Sheet 10.
The recordings of all lectures are now available, see above. Lecture 2 is missing because we had technical problems there. To play the Lecturnity recordings (.lpd files) you need the Lecturnity Player, which you can download here. The Camtasia recordings are provided as .avi files, which you can play with any ordinary video player; I would recommend VLC.
Here are the rules for the exercises as explained in Lecture 2.
Here is everything about the mid-term exam.
Here is the file for Exercise Sheet 11. It's a text file where each line contains the name of the conference (in capital letters), followed by a TAB (ASCII code 9), followed by the title. There are three different conferences: STOC (2423 titles), SIGIR (2372 titles), and SIGGRAPH (1835 titles). The total number of titles / lines is 6630. The exact file size is 454365 bytes.
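For reference, here is a minimal sketch (in Python, purely for illustration; the function name and the utf-8 encoding are assumptions, not part of the exercise) of reading the file into (conference, title) pairs:
{{{
# Minimal sketch: read dblp.txt into (conference, title) pairs.
# Assumes the format described above: conference name, a TAB (ASCII 9), then the title.
def read_dblp(path="dblp.txt"):
    records = []
    with open(path, encoding="utf-8") as f:   # encoding is an assumption
        for line in f:
            conference, title = line.rstrip("\n").split("\t", 1)
            records.append((conference, title))
    return records
}}}
For the file described above, len(read_dblp()) should then come out as 6630.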
Here is the table with the links to your uploaded solutions for Exercise Sheet 12. The deadline is Thursday 04Feb10 16:00.
Questions and comments about Exercise Sheet 12 below this line (most recent on top)
Hi Dragos, my answer comes a bit late, but maybe it helps you to read the proof sketch of Theorem 4.1 of the NIPS paper by Jon Kleinberg: http://www.cs.cornell.edu/home/kleinber/nips15.pdf . There the same thing is said in different words. Hannah 4Feb10 14:24
Hello. I have been thinking over exercise 4 for a couple of hours now, and I still can't fully understand it. This is probably because I can't see where the trick is. Is it the fact that we shrink distances in X1 and X2, but keep all other distances the same? Another thing is that I can't picture how it could be that, by doing so, the centroid that was in Y is now in X1 while the distances between the two are preserved!? I'm probably confused right now ... but maybe you could somehow explain what happens in a clearer way.. But I'm not sure that it is possible:). Dragos 10.45 04 February
Here is a more concrete hint for Exercise 4: For X = X1 u X2 pick m elements with distance <= r to each other. For Y pick gamma * m points with distance <= epsilon < r to each other, for some small gamma > 0. The distance from any point in X to any point in Y is at least r + delta, for some small delta > 0. Now choose gamma, r, epsilon, and delta appropriately so that you can show what was formulated in the other hint. Hannah 3Feb10 23:10
Hi Björn + all: you are right, that is not well formulated. What I meant is that these are the optimal centroids, that is, the ones which (globally) minimize the RSS. With a suitable choice of initial centroids, 2-means will then converge towards these centroids and the corresponding clustering. So to reformulate the hint: Construct a point set with three subsets X1, X2 and Y. Then carefully define distances between all these points such that the optimal centroids are one in X1 u X2 and one in Y. But when the points within X1 and within X2 are moved closer together (with all other distances remaining the same), then the optimal centroids are one in X1 and one in X2. Hannah 3Feb10 22:52
Hmm, regarding exercise 4: how does 2-means "pick a centroid" somewhere? Doesn't that always depend on the current situation / last centroid? I'm a bit confused. Are we supposed to find a movement (bringing points from those subsets X1, X2 and Y closer together) where points will be clustered differently according to the old centroids? Or a different clustering according to centroids that yield optimal RSS values (aren't they hard to find for non-trivial cases?)? Sorry for the confusion, but I don't seem to be able to figure it out by myself at the moment. Björn 3Feb10 21:56
Thanks for the response. Yes, I meant values of M, of course. I don't know why I wrote k - just a typo. Björn 3Feb10 20:58
BTW, I am really happy that you guys do not just mechanically do the exercises and write "my RSS value is X", but that you actually wonder what the results mean and whether things make sense. That is great research spirit, and that's how one learns the most! Hannah 3Feb10 20:23
Hi Björn + all: I think you made some good observations here. Indeed, the RSS becomes smaller as you make M larger, for exactly the reasons you explained. Which means that it doesn't really make sense to compare RSS values for different values of M. Note that the same would have happened in the vector space model, with each text represented as a normalized M-dimensional vector, where M = the dimension = number of features considered, and the similarity measure being the dot product = the cosine similarity. The components of M-dimensional normalized vectors also become smaller as M increases, and so does the cosine similarity (assuming that the number of non-zero components in a vector is more or less a constant). But this does not matter; what is important is that for a *fixed* value of M, the RSS value reflects the quality of the clustering, and that it does. BTW, in what you wrote, you mean M = 6 or 7 and not k = 6 or 7, right? Your observation about a too large M having a negative influence on the clusters is also very reasonable. Hannah 3Feb10 20:19
I just made the following observation: Using what Marjan explained (the RSS in fact gets higher, because JD is a similarity), I recognized that my RSS does get quite high for small/medium M (like 10), while it does not for larger M. Looking at the Jaccard distance / similarity, I observe that having one really large set (my centroid) that is compared to a small one (a title), I necessarily get a huge union while my intersection can never exceed the size of the smaller set. Hence, I end up with low JD values and therefore a low RSS. I achieved the highest RSS results for k=6 or 7, which also seems to be a reasonable guess of the avg title length. On the other hand, I am aware that the decrease in the RSS value affects all distances and the clustering might in fact get better. But how exactly is the RSS still meaningful like this? Thanks for any response, and I hope I didn't get anything completely wrong. PS: Too large M (1000+) also has a negative influence on the clustering. I assume that's because adding a word that occurs once in one title does not really make the text a better "middle". Björn 3.2. 20:02
Sorry Florian, it's a messy week for us. Now it's there. Marjan 03Feb10 18:39
The link to upload the solutions is again still missing :) Florian 03Feb10 18:29
Hi Eric + all: Taking the average is fine. Marjan 3Feb10 18:15
I am not sure about this, but precision and recall are (contrary to RSS) values available for each cluster, that means every cluster has its own precision and recall. How shall we compute the classification's prec&rec? Taking the average? Or did I miss something? Eric 3Feb10 16:50
Hi Mirko, good question. The approach you suggest looks fine to me. An alternative, very simple, yet also reasonable approach would be to simply assign the text to a random centroid in this case. Hannah 2Feb10 20:42
Hi Johannes, I wouldn't call it a problem, since you have solved the exercise and made an interesting observation. Uncharacteristic clusters is indeed something that can happen. But it's of course an interesting question, and the one you are asking I guess, how one can get a better clustering and whether that is possible with k-means at all. Well, I don't know myself, since I haven't implemented it for this data set. But one thing I would certainly try in this situation is to start with centroids which are just the "average" (in the sense defined on the exercise sheet) of all titles from that class. This is of course not unsupervised anymore, but it would be interesting to see whether at least with such a strong hint k-means is able to find a good clustering. If yes, then what you observed would just be an unfavorable local maximum. If you try the choice of centroids I just explained, please report on the results, I am curious. Hannah 2Feb10 20:40
To which cluster should we assign the texts which have JaccardDistance=0 for every centroid? In my implementation they just get assigned to the first centroid, but the thing is that, going by the JaccardDistance, they belong to none of the centroids. This 'manipulates' the clusters. My idea is to just ignore these texts until there is a centroid with JaccardDistance > 0, is this okay? Mirko 2Feb10, 20:35
I have the problem that the clusters become too average in the first rounds and no meaningful clustering happens from this point on. In the first round, the distance is 1.0 or 0.99 for almost each title-centroid pair. This leads to very uncharacteristic centroids. Johannes 2Feb10 20:30
Oh yes, what Marjan wrote is actually very important, and I didn't have it on my mind when I wrote the exercise sheet. K-means won't do something particularly meaningful if in each assignment step you assign each point / record to the *farthest* cluster. And it makes sense that it alternates between two clusterings at some point: assuming that the centroids don't change anymore, after you assign a point to its *farthest* centroid, the centroid to which it had been assigned before will be its farthest centroid in the next step. Cool stuff. Hannah 2Feb10 20:19
To Hannah: I tried 10000 for M and the RSS value is still sometimes increasing and sometimes decreasing. I'll change my program according to Marjan's observation now and report again... EDIT: Ok, after changing this everything seems to make sense again, the RSS value is now constantly increasing until it does not change anymore! Florian 02Feb10 20:10
To all: Just a small but important observation. The "Jaccard distance" from the exercise sheet is in fact a similarity (between two texts), meaning that your RSS should in fact grow. This also means that you should pick, for each text, the cluster centroid with the maximum value of the Jaccard similarity. Marjan 02Feb10 19:46
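To make this point concrete, here is a minimal sketch (Python, purely illustrative; the function names are my own) of the Jaccard value between two word sets and of the assignment step that picks the centroid with maximum similarity:
{{{
# Jaccard value between two word sets: |intersection| / |union|.
def jaccard_similarity(words_a, words_b):
    a, b = set(words_a), set(words_b)
    if not (a or b):
        return 0.0
    return len(a & b) / len(a | b)

# Assignment step as described above: pick the centroid with MAXIMUM similarity.
def assign_to_centroid(text_words, centroids):
    sims = [jaccard_similarity(text_words, c) for c in centroids]
    return max(range(len(sims)), key=lambda i: sims[i])
}}}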
Sure, this is a meaningful way to pick the initial centroids, too. However, there is no guarantee at all that most SIGIR papers will be assigned to the cluster for which the initial centroid was a SIGIR paper, simply because the title of that paper might have been very untypical for SIGIR. So you might end up with very low precision and recall, although for another assignment of the clusters to the labels / classes, precision and recall might actually be very good. Hannah 2Feb10 19:32
Hi, is it also allowed to choose the initial centroids as follows: first centroid = random SIGGRAPH text, second centroid = random SIGIR text, third centroid = random STOC text? Then we would also know which centroid belongs to which class. Mirko 2Feb 19:27
Concerning the RSS values: I haven't written the program myself, but Marjan promised to do so soon. But here is some intuition. First, I would also output the RSS value divided by the number of titles (6630), and the square root of that. Let's call this number J. This gives you the average Jaccard distance of a text to its centroid. What is this J on average? Well, the number of words in your centroids quickly becomes M. In the perfect case, all the words of each text are contained in its nearest centroid. Then the Jaccard distance is text length / M, and so J should be average text length / M. Hannah 2Feb10 19:13
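As a rough numeric instance of this estimate (illustration only; M = 1000 and an average title length of about 7 words are assumed values, not given on the sheet):
{{{
N = 6630               # number of titles in dblp.txt
avg_title_len = 7      # assumed average number of words per title
M = 1000               # words kept per centroid
J = avg_title_len / M  # ideal-case Jaccard value of a title vs. its centroid, ~0.007
RSS = N * J ** 2       # since J = sqrt(RSS / N), this gives RSS ~ 0.32
}}}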
About computing precision and recall: yes, you need an assignment of the clusters to the three classes SIGGRAPH, SIGIR, STOC. I would do that assignment as follows. Pick the first cluster in your numbering of clusters. Find the most frequent label in the cluster (you have the labels of all titles from dblp.txt). Assign the cluster to that label. Then pick the second cluster, and find the most frequent of the remaining labels, two in this case. And so on, where for three clusters there is no "and so on", because you have only one choice of label left for the third cluster. Once you have a 1-1 assignment from clusters to labels, it should be clear how to compute precision and recall (same definitions as in the case of classification). Hannah 2Feb10 18:57
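A minimal sketch of this greedy cluster-to-label assignment and of per-cluster precision / recall (Python, illustrative only; the names and the handling of a degenerate empty cluster are my own choices):
{{{
from collections import Counter

# Greedy assignment as described above: each cluster gets the most frequent
# of the still-unassigned labels (SIGGRAPH, SIGIR, STOC).
def assign_labels(clusters, true_labels):
    assignment, remaining = {}, set(true_labels)
    for c, items in enumerate(clusters):
        counts = Counter(true_labels[i] for i in items if true_labels[i] in remaining)
        label = counts.most_common(1)[0][0] if counts else next(iter(remaining))
        assignment[c] = label
        remaining.discard(label)
    return assignment

# Precision and recall of one cluster with respect to its assigned label,
# with the same definitions as in the classification exercise.
def precision_recall(cluster_items, label, true_labels):
    tp = sum(1 for i in cluster_items if true_labels[i] == label)
    total_true = sum(1 for l in true_labels if l == label)
    precision = tp / len(cluster_items) if cluster_items else 0.0
    recall = tp / total_true if total_true else 0.0
    return precision, recall
}}}
The per-cluster values can then be averaged over the three clusters, as confirmed further up.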
I had negative steps (as asked below), until I found that I had an empty cluster. Fixing that removed the problem. Johannes 2Feb10 18:34
To exercise 3: Can you give an intuition, in what range the RSS value should be (approx.)? My RSS values seem very high to me... Additionally, I have the same question as Matthias a few lines ago: How can we calculate recall and precision without knowing which cluster belongs to which conference? Claudius 2Feb10 17:23
Thanks for the answer, Matthias, please also try with M = 10000 (if that is feasible time-wise, otherwise take M a little smaller) and report again, thanks. Hannah 2Feb10 16:43
To Hannah: Yes, for me it does happen for large Ms as well. I tried 10, 100, 1000 with the same quantitative effects. However, for larger M, the RSS value decreases in size. The RSS values are rising & falling in a very small range of about 5%. Do we have an error here, or is the distance function + recalculation step as given simply not convergent? Matthias
To Matthias + Florian + all: Which values of M have you tried? Does it still happen when you choose M large enough, say 1000, or even larger? Note that the cluster centroids cannot be meaningful if M is not large enough. Hannah 2Feb10 16:21
I'm noticing exactly the same behavior as Florian. After a few iterations, the RSS value is alternating, esp. when using small M. Also, the RSS values do not fall, they are more or less jumping around the same base value all the time.
Also, in order to calculate Recall + Precision, we have to define which of the clusters is which conference. How should we do that? Just by counting the most frequent conference in a cluster based on the known ground truth information? Matthias
To Alex: You should compute the k-means clustering on the whole set, that's the point of clustering anyway (no train sets). Marjan 02Feb10 14:39
Should we compute k-means on the complete set, or should we divide the set into a training and a test set, like in the exercise before? Alex 02Feb10 14:10
Hi Florian + all. That's a very valid question, although counterintuitive from what we expected. To give a precise answer I'll have to write the program myself, but since I am in the middle of a deadline I can only promise that I will come back to it asap. My advice is to try to understand why the RSS value increases - either by debugging your own program or by writing a toy example by hand, and then report here again. Marjan 01Feb10 22:41
The RSS values of my clustering often get higher after an iteration step, is that normal for this kind of data? Furthermore my RSS value never stops changing, it always switches between two different values after a few iterations (I now just choose three random titles as the initial centroids). I also do not understand how I can compute the precision and recall values of the clustering with respect to the "perfect" clustering. Florian 01Feb10 22:41
To Florian, Eric + all: yes, sure, the centroids must be of the same type as the things you cluster. And since the things you should cluster are strings / texts (the titles from the dblp.txt from the last lecture), the centroids also have to be strings / texts. There are many ways how you could pick the initial centroids. One way would be to just pick k random titles from the dblp.txt as centroids. Hannah 1Feb10 20:55
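Putting this answer together with Florian's description below, here is a minimal sketch of the whole loop (Python, illustrative only; the parameter names, fixed iteration count, and handling of empty clusters are my own choices, and the RSS computation and convergence check are omitted):
{{{
import random
from collections import Counter

def kmeans_titles(titles, k=3, M=1000, iterations=20):
    def jaccard(a, b):
        return len(a & b) / len(a | b) if (a | b) else 0.0
    title_words = [set(t.lower().split()) for t in titles]
    # Initial centroids: k random titles, as suggested above.
    centroids = [set(ws) for ws in random.sample(title_words, k)]
    for _ in range(iterations):
        # Assignment step: each title goes to the centroid with MAXIMUM
        # Jaccard value (cf. Marjan's note further up).
        clusters = [[] for _ in range(k)]
        for i, words in enumerate(title_words):
            best = max(range(k), key=lambda c: jaccard(words, centroids[c]))
            clusters[best].append(i)
        # Update step: keep the M most frequent words of each cluster.
        for c, items in enumerate(clusters):
            counts = Counter(w for i in items for w in title_words[i])
            if counts:  # keep the old centroid if a cluster happens to be empty
                centroids[c] = set(w for w, _ in counts.most_common(M))
    return clusters, centroids
}}}
Random initial titles are only one of the possible initializations discussed in this thread; a real solution would also track the RSS per iteration and stop when it no longer changes.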
As I understand it, the centroids have to be strings as well, and the distance of a text to the centroid is just the Jaccard distance of their distinct words. The new centroids can then just be created by using the M most frequently occurring words in all texts which are assigned to this centroid, as stated in exercise 1. The only problem then would be how to get the initial centroids... Florian 01Feb10 20:50
As I understood it, we may choose random numbers between 0 and 1, since this is what the Jaccard distance will lead us to. But my next question is: how should we determine the distance of ONE text to its centroid? (while initially assigning them to clusters) Eric 01Feb10 20:38
Hi, how should we choose our initial centroids for the k-means algorithm with the strings? Florian 01Feb10 17:56