Welcome to the Wiki page of the course Search Engines, WS 2009 / 2010. Lecturer: Hannah Bast. Tutorials: Marjan Celikik. Course web page: click here.
Here are PDFs of the slides of the lectures so far: Lecture 1, Lecture 2, Lecture 3, Lecture 4, Lecture 5, Lecture 6, Lecture 7, Lecture 8, Lecture 9, Lecture 10, Lecture 11, Lecture 12.
Here are the recordings of the lectures so far (except Lecture 2, where we had problems with the microphone), LPD = Lecturnity recording: Recording Lecture 1 (LPD), Recording Lecture 3 (LPD), Recording Lecture 4 (LPD), Recording Lecture 5 (LPD without audio), Recording Lecture 6 (LPD), Recording Lecture 7 (AVI), Recording Lecture 8 (AVI), Recording Lecture 9 (AVI), Recording Lecture 10 (AVI), Recording Lecture 11 (AVI), Recording Lecture 12 (AVI).
Here are PDFs of the exercise sheets so far: Exercise Sheet 1, Exercise Sheet 2, Exercise Sheet 3, Exercise Sheet 4, Exercise Sheet 5, Exercise Sheet 6, Exercise Sheet 7, Exercise Sheet 8, Exercise Sheet 9, Exercise Sheet 10, Exercise Sheet 11, Exercise Sheet 12.
Here are your solutions and comments on the previous exercise sheets: Solutions and Comments 1, Solutions and Comments 2, Solutions and Comments 3, Solutions and Comments 4, Solutions and Comments 5, Solutions and Comments 6, Solutions and Comments 7, Solutions and Comments 8, Solutions and Comments 9, Solutions and Comments 10.
Here are our master solutions: Master solution for Mid-Term Exam, Master solution for Exercise Sheet 9, Master solution for Exercise Sheet 10.
The recordings of all lectures are now available, see above. Lecture 2 is missing because we had technical problems there. To play the Lecturnity recordings (.lpd files) you need the Lecturnity Player, which you can download here. I have put up the Camtasia recordings as .avi files, which you can play with any ordinary video player; I would recommend VLC.
Here are the rules for the exercises as explained in Lecture 2.
Here is everything about the mid-term exam.
Here is the file for Exercise Sheet 11. It's a text file where each line contains the name of the conference (in capital letters), followed by a TAB (ASCII code 9), followed by the title. There are three different conferences: STOC (2423 titles), SIGIR (2372 titles), and SIGGRAPH (1835 titles). The total number of titles / lines is 6630. The exact file size is 454365 bytes.
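As a sanity check for your parsing, here is a minimal sketch in Python (the file name and the counts are from the description above; all variable names are just assumptions of this sketch):

{{{#!python
# Minimal sketch: read dblp.txt, where each line is
# CONFERENCE<TAB>title, and count the titles per conference.
from collections import Counter

records = []  # list of (conference, title) pairs
counts = Counter()
with open("dblp.txt") as f:
    for line in f:
        conference, title = line.rstrip("\n").split("\t", 1)
        records.append((conference, title))
        counts[conference] += 1

# Expected from the description above:
# STOC 2423, SIGIR 2372, SIGGRAPH 1835, total 6630.
print(counts, sum(counts.values()))
}}}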
Here is the table with the links to your uploaded solutions for Exercise Sheet 12. The deadline is Thursday 04Feb10 16:00.
Questions and comments about Exercise Sheet 12 below this line (most recent on top)
Hmm, regarding exercise 4: how does 2-means "pick a centroid" somewhere? Doesn't that always depend on the current situation / the last centroid? I'm a bit confused. Are we supposed to find a movement (bringing points from those subsets x1, x2 and y closer together) where points will be clustered differently according to the old centroids? Or a different clustering according to centroids that yield optimal RSS values (aren't those hard to find for non-trivial cases)? Sorry for the confusion, but I don't seem to be able to figure it out by myself at the moment. Björn 3Feb10 21:56
Thanks for the response. Yes, I meant values of M, of course. I don't know why I wrote k - just a typo. Björn 3Feb10 20:58
BTW, I am really happy that you guys do not just mechanically do the exercises and write "my RSS value is X", but that you actually wonder what the results mean and whether things make sense. That is great research spirit, and that's how one learns the most! Hannah 3Feb10 20:23
Hi Björn + all: I think you made some good observations here. Indeed, the RSS becomes smaller as you make M larger, for exactly the reasons you explained. Which means that it doesn't really make sense to compare RSS values for different values of M. Note that the same would have happened in the vector space model, with each text represented as a normalized M-dimensional vector, where M = the dimension = the number of features considered, and the similarity measure being the dot product = the cosine similarity. The components of M-dimensional normalized vectors also become smaller as M increases, and so does the cosine similarity (assuming that the number of non-zero components in a vector is more or less a constant). But this does not matter: what is important is that for a *fixed* value of M, the RSS value reflects the quality of the clustering, and that it does. BTW, in what you wrote, you mean M = 6 or 7 and not k = 6 or 7, right? Your observation about a too large M having a negative influence on the clusters is also very reasonable. Hannah 3Feb10 20:19
I just made the following observation: Using what Marjan explained (the RSS in fact gets higher, because JD is a similarity), I recognized that my RSS does get quite high for small/medium M (like 10), while it does not for larger M. Looking at the Jaccard distance / similarity, I observe that having one really large set (my centroid) that is compared to a small one (a title), I necessarily get a huge union while my intersection can never exceed the size of the smaller set. Hence, I end up with low JD values and therefore a low RSS. I achieved the highest RSS results for k=6 or 7, which also seems to be a reasonable guess of the average title length. On the other hand, I am aware that the decrease in the RSS value affects all distances, and the clustering might in fact get better. But how exactly is the RSS still meaningful like this? Thanks for any response, and I hope I didn't get anything completely wrong. PS: Too large M (1000+) also has a negative influence on the clustering. I assume that's because adding a word that occurs once in one title does not really make the text a better "middle". Björn 3Feb10 20:02
Sorry Florian, it's a messy week for us. Now it's there. Marjan 03Feb10 18:39
The link to upload the solutions is once again missing :) Florian 03Feb10 18:29
Hi Eric + all: Taking the average is fine. Marjan 3Feb10 18:15
I am not sure about this, but precision and recall are (contrary to RSS) values available for each cluster, that means every cluster has its own precision and recall. How shall we compute the classification's prec & rec? Taking the average? Or did I miss something? Eric 3Feb10 16:50
Hi Mirko, good question. The approach you suggest looks fine to me. An alternative, very simple, yet also reasonable approach would be to simply assign the text to a random centroid in this case. Hannah 2Feb10 20:42
Hi Johannes, I wouldn't call it a problem, since you have solved the exercise and made an interesting observation. Uncharacteristic clusters is indeed something that can happen. But it's of course an interesting question, and the one you are asking, I guess: how can one get a better clustering, and is that possible with k-means at all? Well, I don't know myself, since I haven't implemented it for this data set. But one thing I would certainly try in this situation is to start with centroids which are just the "average" (in the sense defined on the exercise sheet) of all titles from that class. This is of course not unsupervised anymore, but it would be interesting to see whether at least with such a strong hint k-means is able to find a good clustering. If yes, then what you observed would just be an unfavorable local maximum. If you try the choice of centroids I just explained, please report on the results, I am curious. Hannah 2Feb10 20:40
To which cluster should we assign the texts which have JaccardDistance = 0 for every centroid? In my implementation they just get assigned to the first centroid, but the thing is that, by the JaccardDistance, they belong to none of the centroids. This 'manipulates' the clusters. My idea is to just ignore these texts until there is a centroid with JaccardDistance > 0, is this okay? Mirko 2Feb10, 20:35
I have the problem that the clusters become too average in the first rounds and no meaningful clustering happens from this point on. In the first round, the distance is 1.0 or 0.99 for almost every title-centroid pair. This leads to very uncharacteristic centroids. Johannes 2Feb10 20:30
Oh yes, what Marjan wrote is actually very important, and I didn't have it on my mind when I wrote the exercise sheet. K-means won't do something particularly meaningful if in each assignment step you assign each point / record to the *farthest* cluster. And it makes sense that it alternates between two clusterings at some point: assuming that the centroids don't change anymore, once you assign a point to its *farthest* centroid, the centroid to which it was assigned before will be its farthest centroid in the next step. Cool stuff. Hannah 2Feb10 20:19
To Hannah: I tried 10000 for M and the RSS value is still sometimes increasing and sometimes decreasing. I'll change my program according to Marjan's observation now and report again... EDIT: Ok, after changing this everything seems to make sense again; the RSS value is now constantly increasing until it does not change anymore! Florian 02Feb10 20:10
To all: Just a small but important observation. The "Jaccard distance" from the exercise sheet is in fact a similarity (between two texts), meaning that your RSS should in fact grow. This also means that you should pick the cluster centroids for each text with the maximum value of the Jaccard similarity. Marjan 02Feb10 19:46
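To make Marjan's observation concrete, here is a minimal sketch in Python of the assignment step under this reading (the function names and the representation of texts as whitespace-separated strings are assumptions of this sketch, not prescribed by the exercise sheet):

{{{#!python
# The "Jaccard distance" from the sheet, computed on the distinct
# word sets of two texts. Since |A & B| / |A | B| is 1 for identical
# sets and 0 for disjoint ones, it is really a similarity, so each
# text goes to the centroid with the MAXIMUM value.
def jaccard(text_a, text_b):
    a, b = set(text_a.split()), set(text_b.split())
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def assign_to_centroid(text, centroids):
    # index of the centroid with the maximal Jaccard similarity
    return max(range(len(centroids)), key=lambda i: jaccard(text, centroids[i]))
}}}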
Sure, this is a meaningful way to pick the initial centroids, too. However, there is no guarantee at all that most SIGIR papers will be assigned to the cluster for which the initial centroid was a SIGIR paper, simply because the title of that paper might have been very untypical for SIGIR. So you might end up with very low precision and recall, although for another assignment of the clusters to the labels / classes, precision and recall might actually be very good. Hannah 2Feb10 19:32
Hi, is it also allowed to choose the initial centroids as follows: first centroid = random SIGGRAPH text, second centroid = random SIGIR text, third centroid = random STOC text? Then we would also know which centroid belongs to which class. Mirko 2Feb10 19:27
Concerning the RSS values: I haven't written the program myself, but Marjan promised to do so soon. But here is some intuition. First, I would also output the RSS value divided by the number of titles (6630), and the square root of that. Let's call this number J. This gives you the average Jaccard distance of a text to its centroid. What is this J on average? Well, the number of words in your centroids quickly becomes M. In the perfect case, all the words of each text are contained in its nearest centroid. Then the Jaccard distance is text length / M, and so J should be average text length / M. Hannah 2Feb10 19:13
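(A quick illustration of this intuition, assuming an average title length of about 7 words, as guessed elsewhere in this thread: for M = 1000, the perfect case gives J ≈ 7 / 1000 = 0.007, and hence RSS ≈ 6630 * 0.007^2 ≈ 0.32.)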
About computing precision and recall: yes, you need an assignment of the clusters to the three classes SIGGRAPH, SIGIR, STOC. I would do that assignment as follows. Pick the first cluster in your numbering of clusters. Find the most frequent label in the cluster (you have the labels of all titles from dblp.txt). Assign the cluster to that label. Then pick the second cluster, and find the most frequent of the remaining labels, two in this case. And so on, where for three clusters there is no "and so on", because you have only one choice of label left for the third cluster. Once you have a 1-1 assignment from clusters to labels, it should be clear how to compute precision and recall (same definitions as in the case of classification). Hannah 2Feb10 18:57
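Here is a minimal sketch in Python of this greedy 1-1 assignment and the resulting per-class precision and recall (the data layout, with each cluster given as the list of the true labels of its titles, is an assumption of this sketch):

{{{#!python
# Greedy 1-1 assignment of clusters to labels, as described above.
# Assumed input: clusters = list of clusters, each given as the list
# of the true labels (e.g. "SIGIR") of the titles assigned to it.
from collections import Counter

def assign_labels(clusters, labels=("STOC", "SIGIR", "SIGGRAPH")):
    remaining, assignment = set(labels), {}
    for i, cluster in enumerate(clusters):
        counts = Counter(cluster)
        assignment[i] = max(remaining, key=lambda l: counts[l])
        remaining.remove(assignment[i])
    return assignment

def precision_recall(clusters, assignment, totals):
    # totals, e.g. {"STOC": 2423, "SIGIR": 2372, "SIGGRAPH": 1835}
    result = {}
    for i, cluster in enumerate(clusters):
        label = assignment[i]
        true_pos = sum(1 for l in cluster if l == label)
        prec = true_pos / len(cluster) if cluster else 0.0
        result[label] = (prec, true_pos / totals[label])
    return result
}}}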
I had negative steps (as asked below), until I found that I had an empty cluster. Fixing that removed the problem. Johannes 2Feb10 18:34
Regarding exercise 3: Can you give an intuition in what range the RSS value should be (approximately)? My RSS values seem very high to me... Additionally, I have the same question as Matthias a few lines ago: how can we calculate recall and precision without knowing which cluster belongs to which conference? Claudius 2Feb10 17:23
Thanks for the answer, Matthias, please also try with M = 10000 (if that is feasible time-wise, otherwise take M a little smaller) and report again, thanks. Hannah 2Feb10 16:43
To Hannah: Yes, for me it does happen for large Ms as well. I tried 10, 100, 1000 with the same qualitative effects. However, for larger M, the RSS value decreases in size. The RSS values are rising & falling in a very small range of about 5%. Do we have an error here, or is the given distance function + recalculation step simply not convergent? Matthias
To Matthias + Florian + all: Which values of M have you tried? Does it still happen when you choose M large enough, say 1000, or even larger? Note that the cluster centroids cannot be meaningful if M is not large enough. Hannah 2Feb10 16:21
I'm noticing exactly the same behavior as Florian. After a few iterations, the RSS value alternates, especially when using small M. Also, the RSS values do not fall; they are more or less jumping around the same base value all the time.
Also, in order to calculate Recall + Precision, we have to define which of the clusters is which conference. How should we do that? Just by counting the most frequent conference in a cluster based on the known ground truth information? Matthias
To Alex: You should compute the k-means clustering on the whole set; that's the point of clustering anyway (no train sets). Marjan 02Feb10 14:39
Should we compute k-means on the complete set, or should we divide the set into a training and a test set, like in the exercise before? Alex 02Feb10 14:10
Hi Florian + all. That's a very valid question, although counterintuitive from what we expected. To give a precise answer I'll have to write the program myself, but since I am in the middle of a deadline, I can only promise that I will come back to it asap. My advice is to try to understand why the RSS value increases - either by debugging your own program or by writing a toy example by hand - and then report here again. Marjan 01Feb10 22:41
The RSS values of my clustering often get higher after an iteration step; is that normal for this kind of data? Furthermore, my RSS value never stops changing; it always switches between two different values after a few iterations (I now just choose three random titles as the initial centroids). I also do not understand how I can compute the precision and recall values of the clustering with respect to the "perfect" clustering. Florian 01Feb10 22:41
To Florian, Eric + all: yes, sure, the centroids must be of the same type as the things you cluster. And since the things you should cluster are strings / texts (the titles from the dblp.txt from the last lecture), the centroids also have to be strings / texts. There are many ways you could pick the initial centroids. One way would be to just pick k random titles from the dblp.txt as centroids. Hannah 1Feb10 20:55
As I understand it, the centroids have to be strings as well, and the distance of a text to a centroid is just the Jaccard distance of their distinct words. The new centroids can then just be created by using the M most frequently occurring words in all texts which are assigned to this centroid, as stated in exercise 1. The only problem then would be to get the initial centroids... Florian 01Feb10 20:50
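Putting the pieces from this thread together, here is a minimal sketch in Python of the two centroid steps (representing a centroid as a plain string of words, as discussed above; the names are assumptions of this sketch):

{{{#!python
import random
from collections import Counter

def initial_centroids(titles, k):
    # one option discussed above: just pick k random titles
    return random.sample(titles, k)

def recompute_centroid(assigned_texts, M):
    # new centroid = the M most frequently occurring words over all
    # texts assigned to this cluster, joined back into one "text"
    counts = Counter(w for text in assigned_texts for w in text.split())
    return " ".join(word for word, _ in counts.most_common(M))
}}}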
As I understood it, we may choose random numbers between 0 and 1, since this is what the Jaccard distance will lead us to. But my next question is: how should we determine the distance of ONE text to its centroid (while initially assigning them to clusters)? Eric 01Feb10 20:38
Hi, how should we choose our initial centroids for the k-means algorithm with the strings? Florian 01Feb10 17:56