Welcome to the Wiki page of the course Search Engines, WS 2009 / 2010. Lecturer: Hannah Bast. Tutorials: Marjan Celikik. Course web page: click here.
Here are PDFs of the slides of the lectures so far: [[attachment:SearchEnginesWS0910/lecture-1.pdf|Lecture 1]], [[attachment:SearchEnginesWS0910/lecture-2.pdf|Lecture 2]], [[attachment:SearchEnginesWS0910/lecture-3.pdf|Lecture 3]], [[attachment:SearchEnginesWS0910/lecture-4.pdf|Lecture 4]], [[attachment:SearchEnginesWS0910/lecture-5.pdf|Lecture 5]], [[attachment:SearchEnginesWS0910/lecture-6.pdf|Lecture 6]], [[attachment:SearchEnginesWS0910/lecture-7.pdf|Lecture 7]], [[attachment:SearchEnginesWS0910/lecture-8.pdf|Lecture 8]], [[attachment:SearchEnginesWS0910/lecture-9.pdf|Lecture 9]], [[attachment:SearchEnginesWS0910/lecture-10.pdf|Lecture 10]], [[attachment:SearchEnginesWS0910/lecture-11.pdf|Lecture 11]], [[attachment:SearchEnginesWS0910/lecture-12.pdf|Lecture 12]].
Here are the recordings of the lectures so far (except Lecture 2, where we had problems with the microphone), LPD = Lecturnity recording: [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-1.lpd|Recording Lecture 1 (LPD)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-3.lpd|Recording Lecture 3 (LPD)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-4.lpd|Recording Lecture 4 (LPD)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-5.lpd|Recording Lecture 5 (LPD without audio)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-6.lpd|Recording Lecture 6 (LPD)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-7.avi|Recording Lecture 7 (AVI)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-8.avi|Recording Lecture 8 (AVI)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-9.avi|Recording Lecture 9 (AVI)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-10.avi|Recording Lecture 10 (AVI)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-11.avi|Recording Lecture 11 (AVI)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-12.avi|Recording Lecture 12 (AVI)]].
Here are PDFs of the exercise sheets so far: [[attachment:SearchEnginesWS0910/exercise-1.pdf|Exercise Sheet 1]], [[attachment:SearchEnginesWS0910/exercise-2.pdf|Exercise Sheet 2]], [[attachment:SearchEnginesWS0910/exercise-3.pdf|Exercise Sheet 3]], [[attachment:SearchEnginesWS0910/exercise-4.pdf|Exercise Sheet 4]], [[attachment:SearchEnginesWS0910/exercise-5.pdf|Exercise Sheet 5]], [[attachment:SearchEnginesWS0910/exercise-6.pdf|Exercise Sheet 6]], [[attachment:SearchEnginesWS0910/exercise-7.pdf|Exercise Sheet 7]], [[attachment:SearchEnginesWS0910/exercise-8.pdf|Exercise Sheet 8]], [[attachment:SearchEnginesWS0910/exercise-9.pdf|Exercise Sheet 9]], [[attachment:SearchEnginesWS0910/exercise-10.pdf|Exercise Sheet 10]], [[attachment:SearchEnginesWS0910/exercise-11.pdf|Exercise Sheet 11]], [[attachment:SearchEnginesWS0910/exercise-12.pdf|Exercise Sheet 12]].
Here are your solutions and comments on the previous exercise sheets: [[SearchEnginesWS0910/ExerciseSheet1|Solutions and Comments 1]], [[SearchEnginesWS0910/ExerciseSheet2|Solutions and Comments 2]], [[SearchEnginesWS0910/ExerciseSheet3|Solutions and Comments 3]], [[SearchEnginesWS0910/ExerciseSheet4|Solutions and Comments 4]], [[SearchEnginesWS0910/ExerciseSheet5|Solutions and Comments 5]], [[SearchEnginesWS0910/ExerciseSheet6|Solutions and Comments 6]], [[SearchEnginesWS0910/ExerciseSheet7|Solutions and Comments 7]], [[SearchEnginesWS0910/ExerciseSheet8|Solutions and Comments 8]], [[SearchEnginesWS0910/ExerciseSheet9|Solutions and Comments 9]], [[SearchEnginesWS0910/ExerciseSheet10|Solutions and Comments 10]].
Here are our master solutions: [[attachment:SearchEnginesWS0910/solution-midterm.pdf|Master solution for Mid-Term Exam]], [[attachment:SearchEnginesWS0910/solution-9.pdf|Master solution for Exercise Sheet 9]], [[attachment:SearchEnginesWS0910/solution-10.pdf|Master solution for Exercise Sheet 10]].
The recordings of all lectures are now available, see above. Lecture 2 is missing because we had technical problems there. To play the Lecturnity recordings (.lpd files) you need the [[http://www.lecturnity.de/de/download/lecturnity-player|Lecturnity Player, which you can download here]]. I put the Camtasia recordings as .avi files, which you can play with any ordinary video player; I would recommend [[http://www.videolan.org/vlc|VLC]].

Here are the rules for the exercises as explained in Lecture 2.
[[SearchEnginesWS0910/MidTermExam|Here is everything about the mid-term exam]].
[[attachment:dblp.txt|Here is the file for Exercise Sheet 11]]. It's a text file, where each line contains the name of the conference (in capital letters), followed by a TAB (ASCII code 9), followed by the title. There are three different conferences: STOC (2423 titles), SIGIR (2372 titles), and SIGGRAPH (1835 titles). The total number of titles / lines is 6630. The exact file size is 454365 bytes.
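If you want a starting point for reading this file, here is a minimal sketch in Python (the function name ''read_dblp'' is mine, not part of the exercise): it reads the file line by line and splits on the first TAB, so TABs inside a title would be preserved. As a sanity check, the resulting list should have 6630 entries for the file above.

```python
def read_dblp(filename="dblp.txt"):
    """Return a list of (conference, title) pairs from the file format above:
    one record per line, conference name in capital letters, then a TAB
    (ASCII code 9), then the title."""
    records = []
    with open(filename, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if line:
                # split only on the first TAB, in case a title contains one
                conference, title = line.split("\t", 1)
                records.append((conference, title))
    return records
```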
[[SearchEnginesWS0910/ExerciseSheet12|Here is the table with the links to your uploaded solutions for Exercise Sheet 12]]. The deadline is Thursday 04Feb10 16:00.
== Questions and comments about Exercise Sheet 12 below this line (most recent on top) ==

Thanks for the response. Yes, I meant values of M, of course. I don't know why I wrote k - just a typo. '''Björn 3Feb10 20:58'''
BTW, I am really happy that you guys do not just mechanically do the exercises and write "my RSS value is X", but that you actually wonder what the results mean and whether things make sense. That is great research spirit, and that's how one learns the most! '''Hannah 3Feb10 20:23'''
Hi Björn + all: I think you made some good observations here. Indeed, the RSS becomes smaller as you make M larger, for exactly the reasons you explained. Which means that it doesn't really make sense to compare RSS values for different values of M. Note that the same would have happened in the vector space model, with each text represented as a normalized M-dimensional vector, where M = the dimension = the number of features considered, and the similarity measure being the dot product = the cosine similarity. The components of M-dimensional normalized vectors also become smaller as M increases, and so does the cosine similarity (assuming that the number of non-zero components in a vector is more or less constant). But this does not matter; what is important is that for a *fixed* value of M, the RSS value reflects the quality of the clustering, and that it does. BTW, in what you wrote, you mean M = 6 or 7 and not k = 6 or 7, right? Your observation about a too large M having a negative influence on the clusters is also very reasonable. '''Hannah 3Feb10 20:19'''
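One way to see the shrinking similarity concretely (a toy construction of my own, not from the lecture): compare a "title" with a constant number of distinct words against a "centroid" whose number of non-zero components grows with M. After normalization, the centroid's components shrink like 1/sqrt(M), and so does the cosine similarity.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def toy_similarity(M, title_words=5):
    """Cosine similarity between a 'title' with title_words distinct words
    (all contained in the centroid) and a 'centroid' using all M features.
    Analytically this equals sqrt(title_words / M), so it shrinks with M."""
    title = [1.0] * title_words + [0.0] * (M - title_words)
    centroid = [1.0] * M
    return cosine(title, centroid)

# toy_similarity(10) ~ 0.71, toy_similarity(100) ~ 0.22, toy_similarity(1000) ~ 0.07
```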
I just made the following observation: using what Marjan explained (the RSS in fact gets higher, because the JD is a similarity), I noticed that my RSS does get quite high for small/medium M (like 10), while it does not for larger M. Looking at the Jaccard distance / similarity, I observe that when one really large set (my centroid) is compared to a small one (a title), I necessarily get a huge union, while my intersection can never exceed the size of the smaller set. Hence, I end up with low JD values and therefore a low RSS. I achieved the highest RSS results for k=6 or 7, which also seems to be a reasonable guess of the average title length. On the other hand, I am aware that the decrease in the RSS value affects all distances, and the clustering might in fact get better. But how exactly is the RSS still meaningful like this? Thanks for any response, and I hope I didn't get anything completely wrong. PS: Too large M (1000+) also has a negative influence on the clustering. I assume that's because adding a word that occurs once in one title does not really make the text a better "middle". '''Björn 3Feb10 20:02'''
Sorry Florian, it's a messy week for us. Now it's there. '''Marjan 03Feb10 18:39'''
The link to upload the solutions is again still missing :) '''Florian 03Feb10 18:29'''
Hi Eric + all: Taking the average is fine. '''Marjan 3Feb10 18:15'''
I am not sure about this, but precision and recall are (contrary to the RSS) values available for each cluster; that means every cluster has its own precision and recall. How shall we compute the classification's precision & recall? By taking the average? Or did I miss something? '''Eric 3Feb10 16:50'''
Hi Mirko, good question. The approach you suggest looks fine to me. An alternative, very simple, yet also reasonable approach would be to simply assign the text to a random centroid in this case. '''Hannah 2Feb10 20:42'''
Hi Johannes, I wouldn't call it a problem, since you have solved the exercise and made an interesting observation. Uncharacteristic clusters are indeed something that can happen. But it's of course an interesting question, and the one you are asking, I guess: how can one get a better clustering, and is that possible with k-means at all? Well, I don't know myself, since I haven't implemented it for this data set. But one thing I would certainly try in this situation is to start with centroids which are just the "average" (in the sense defined on the exercise sheet) of all titles from that class. This is of course not unsupervised anymore, but it would be interesting to see whether at least with such a strong hint k-means is able to find a good clustering. If yes, then what you observed would just be an unfavorable local maximum. If you try the choice of centroids I just explained, please report on the results, I am curious. '''Hannah 2Feb10 20:40'''
To which cluster should we assign the texts which have JaccardDistance = 0 for every centroid? In my implementation they just get assigned to the first centroid, but the thing is that, judging from the JaccardDistance, they belong to none of the centroids. This 'manipulates' the clusters. My idea is to just ignore these texts until there is a centroid with JaccardDistance > 0; is this okay? '''Mirko 2Feb10, 20:35'''
I have the problem that the clusters become too average in the first rounds, and no meaningful clustering happens from this point on. In the first round, the distance is 1.0 or 0.99 for almost each title-centroid pair. This leads to very uncharacteristic centroids. '''Johannes 2Feb10 20:30'''
Oh yes, what Marjan wrote is actually very important, and I didn't have it on my mind when I wrote the exercise sheet. K-means won't do something particularly meaningful if in each assignment step you assign each point / record to the *farthest* cluster. And it makes sense that it alternates between two clusterings at some point: assuming that the centroids don't change anymore, after you assign a point to its *farthest* centroid, the centroid to which it was assigned before will be its farthest centroid in the next step. Cool stuff. '''Hannah 2Feb10 20:19'''
To Hannah: I tried 10000 for M and the RSS value is still sometimes increasing and sometimes decreasing. I'll change my program according to Marjan's observation now and report again... EDIT: Ok, after changing this everything seems to make sense again; the RSS value is now constantly increasing until it does not change anymore! '''Florian 02Feb10 20:10'''
To all: Just a small but important observation. The "Jaccard distance" from the exercise sheet is in fact a similarity (between two texts), meaning that your RSS should in fact grow. This also means that you should pick the cluster centroid for each text with the maximum value of the Jaccard similarity. '''Marjan 02Feb10 19:46'''
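For concreteness, here is a minimal sketch of this measure and assignment rule (in Python; the function names are mine, and texts are compared via their sets of distinct lowercased words):

```python
def jaccard_similarity(text_a, text_b):
    """Jaccard coefficient of the distinct word sets of two texts:
    |A intersect B| / |A union B|."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def nearest_centroid(text, centroids):
    """Index of the centroid with MAXIMUM Jaccard similarity
    (a similarity, so we maximize rather than minimize)."""
    return max(range(len(centroids)),
               key=lambda i: jaccard_similarity(text, centroids[i]))
```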
Sure, this is a meaningful way to pick the initial centroids, too. However, there is no guarantee at all that most SIGIR papers will be assigned to the cluster for which the initial centroid was a SIGIR paper, simply because the title of that paper might have been very untypical for SIGIR. So you might end up with very low precision and recall, although for another assignment of the clusters to the labels / classes, precision and recall might actually be very good. '''Hannah 2Feb10 19:32'''
Hi, is it also allowed to choose the initial centroids as follows: first centroid = random SIGGRAPH text, second centroid = random SIGIR text, third centroid = random STOC text? Then we would also know which centroid belongs to which class. '''Mirko 2Feb10 19:27'''
Concerning the RSS values: I haven't written the program myself, but Marjan promised to do so soon. But here is some intuition. First, I would also output the RSS value divided by the number of titles (6630), and the square root of that. Let's call this number J. This gives you the average Jaccard distance of a text to its centroid. What is this J on average? Well, the number of words in your centroids quickly becomes M. In the perfect case, all the words of each text are contained in its nearest centroid. Then the Jaccard distance is text length / M, and so J should be average text length / M. '''Hannah 2Feb10 19:13'''
About computing precision and recall: yes, you need an assignment of the clusters to the three classes SIGGRAPH, SIGIR, STOC. I would do that assignment as follows. Pick the first cluster in your numbering of clusters. Find the most frequent label in the cluster (you have the labels of all titles from dblp.txt). Assign the cluster to that label. Then pick the second cluster, and find the most frequent of the remaining labels, two in this case. And so on, where for three clusters there is no "and so on", because you have only one choice of label left for the third cluster. Once you have a 1-1 assignment from clusters to labels, it should be clear how to compute precision and recall (same definitions as in the case of classification). '''Hannah 2Feb10 18:57'''
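Here is a small sketch of this greedy assignment and the per-cluster precision / recall (in Python; the function names are mine, and clusters are represented as lists of title indices into the labels list):

```python
from collections import Counter

def assign_labels(clusters, labels):
    """Greedy 1-1 assignment of clusters to labels as described above:
    for each cluster in turn, pick the most frequent label among the
    labels not assigned to an earlier cluster."""
    remaining = set(labels)
    assignment = {}
    for c, members in enumerate(clusters):
        counts = Counter(labels[i] for i in members if labels[i] in remaining)
        best = counts.most_common(1)[0][0] if counts else next(iter(remaining))
        assignment[c] = best
        remaining.discard(best)
    return assignment

def precision_recall(cluster, label, labels):
    """Precision and recall of one cluster w.r.t. its assigned label."""
    true_in_cluster = sum(1 for i in cluster if labels[i] == label)
    total_true = sum(1 for l in labels if l == label)
    precision = true_in_cluster / len(cluster) if cluster else 0.0
    recall = true_in_cluster / total_true if total_true else 0.0
    return precision, recall
```

Per Marjan's answer above, averaging the per-cluster values gives one overall number.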
I had negative steps (as asked below), until I found that I had an empty cluster. Fixing that removed the problem. '''Johannes 2Feb10 18:34'''
To exercise 3: Can you give an intuition in what range the RSS value should be (approx.)? My RSS values seem very high to me... Additionally, I have the same question as Matthias a few lines ago: how can we calculate recall and precision without knowing which cluster belongs to which conference? '''Claudius 2Feb10 17:23'''

Thanks for the answer, Matthias; please also try with M = 10000 (if that is feasible time-wise, otherwise take M a little smaller) and report again, thanks. '''Hannah 2Feb10 16:43'''

To Hannah: Yes, for me it does happen for large Ms as well. I tried 10, 100, 1000 with the same quantitative effects. However, for larger M, the RSS value decreases in size. The RSS values are rising & falling in a very small range of about 5%. Do we have an error here, or is the given distance function + recalculation step simply not convergent? '''Matthias'''

To Matthias + Florian + all: Which values of M have you tried? Does it still happen when you choose M large enough, say 1000, or even larger? Note that the cluster centroids cannot be meaningful if M is not large enough. '''Hannah 2Feb10 16:21'''

I'm noticing exactly the same behavior as Florian. After a few iterations, the RSS value is alternating, especially when using small M. Also, the RSS values do not fall; they are more or less jumping around the same base value all the time. Also, in order to calculate recall + precision, we have to define which of the clusters is which conference. How should we do that? Just by counting the most frequent conference in a cluster, based on the known ground-truth information? '''Matthias'''

To Alex: You should compute the k-means clustering on the whole set; that's the point of clustering anyway (no train sets). '''Marjan 02Feb10 14:39'''

Should we compute k-means on the complete set, or should we divide the set into a training and a test set, like in the exercise before? '''Alex 02Feb10 14:10'''

Hi Florian + all. That's a very valid question, although counterintuitive from what we expected. To give a precise answer I'd have to write the program myself, but since I am in the middle of a deadline I can only promise that I will come back to it asap. My advice is to try to understand why the RSS value increases - either by debugging your own program or by working out a toy example by hand - and then to report here again. '''Marjan 01Feb10 22:41'''

The RSS values of my clustering often get higher after an iteration step; is that normal for this kind of data? Furthermore, my RSS value never stops changing: it always switches between two different values after a few iterations (I currently just choose three random titles as the initial centroids). I also do not understand how I can compute the precision and recall values of the clustering with respect to the "perfect" clustering. '''Florian 01Feb10 22:41'''

To Florian, Eric + all: yes, sure, the centroids must be of the same type as the things you cluster. And since the things you should cluster are strings / texts (the titles from the dblp.txt from the last lecture), the centroids also have to be strings / texts. There are many ways you could pick the initial centroids. One way would be to just pick k random titles from dblp.txt as centroids. '''Hannah 1Feb10 20:55'''

As I understand it, the centroids have to be strings as well, and the distance of a text to a centroid is just the Jaccard distance of their distinct words. The new centroids can then be created by using the m most frequently occurring words in all texts which are assigned to this centroid, as stated in Exercise 1. The only problem then would be to get the initial centroids... '''Florian 01Feb10 20:50'''

As I understood it, we may choose random numbers between 0 and 1, since this is what the Jaccard distance will lead us to. But my next question is: how should we determine the distance of ONE text to its centroid (while initially assigning the texts to clusters)? '''Eric 01Feb10 20:38'''

Hi, how should we choose our initial centroids for the k-means algorithm with the strings? '''Florian 01Feb10 17:56'''
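Putting the various answers in this thread together (random titles as initial centroids, assignment by maximum Jaccard similarity, new centroid = the M most frequent words of the cluster, and a guard against empty clusters), here is one possible sketch in Python. It is an illustration of the scheme discussed above, not a master solution; the ''init'' parameter (fixed initial centroid indices) is my own addition for reproducible experiments.

```python
import random
from collections import Counter

def jaccard(a, b):
    """Jaccard coefficient of two word sets (a similarity, see Marjan's note)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def kmeans_titles(titles, k=3, M=1000, iterations=20, seed=0, init=None):
    """k-means on titles: centroids start as k random titles (or the titles
    with indices in init); each title goes to the centroid of MAXIMUM Jaccard
    similarity; a new centroid is the set of the M most frequent words of its
    cluster. Returns the cluster index of each title."""
    rng = random.Random(seed)
    texts = [set(t.lower().split()) for t in titles]
    if init is not None:
        centroids = [set(texts[i]) for i in init]
    else:
        centroids = [set(t) for t in rng.sample(texts, k)]
    assignment = [0] * len(texts)
    for _ in range(iterations):
        # assignment step: maximize Jaccard similarity
        for i, words in enumerate(texts):
            assignment[i] = max(range(k), key=lambda c: jaccard(words, centroids[c]))
        # re-computation step: M most frequent words per cluster
        for c in range(k):
            counts = Counter()
            for i, words in enumerate(texts):
                if assignment[i] == c:
                    counts.update(words)
            if counts:  # guard against empty clusters (see Johannes' post)
                centroids[c] = {w for w, _ in counts.most_common(M)}
    return assignment
```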
Welcome to the Wiki page of the course Search Engines, WS 2009 / 2010. Lecturer: Hannah Bast. Tutorials: Marjan Celikik. Course web page: click here.
Here are PDFs of the slides of the lectures so far: Lecture 1, Lecture 2, Lecture 3, Lecture 4, Lecture 5, Lecture 6, Lecture 7, Lecture 8, Lecture 9, Lecture 10, Lecture 11, Lecture 12.
Here are the recordings of the lectures so far (except Lecture 2, where we had problems with the microphone), LPD = Lecturnity recording: Recording Lecture 1 (LPD), Recording Lecture 3 (LPD), Recording Lecture 4 (LPD), Recording Lecture 5 (LPD without audio), Recording Lecture 6 (LPD), Recording Lecture 7 (AVI), Recording Lecture 8 (AVI), Recording Lecture 9 (AVI), Recording Lecture 10 (AVI), Recording Lecture 11 (AVI), Recording Lecture 12 (AVI).
Here are PDFs of the exercise sheets so far: Exercise Sheet 1, Exercise Sheet 2, Exercise Sheet 3, Exercise Sheet 4, Exercise Sheet 5, Exercise Sheet 6, Exercise Sheet 7, Exercise Sheet 8, Exercise Sheet 9, Exercise Sheet 10, Exercise Sheet 11, Exercise Sheet 12.
Here are your solutions and comments on the previous exercise sheets: Solutions and Comments 1, Solutions and Comments 2, Solutions and Comments 3, Solutions and Comments 4, Solutions and Comments 5, Solutions and Comments 6, Solutions and Comments 7, Solutions and Comments 8, Solutions and Comments 9, Solutions and Comments 10.
Here are our master solutions: Master solution for Mid-Term Exam,Master solution for Exercise Sheet 9, Master solution for Exercise Sheet 10.
The recordings of all lectures are now available, see above. Lecture 2 is missing because we had technical problems there. To play the Lecturnity recordings (.lpd files) you need the Lecturnity Player, which you can download here. I put the Camtasia recordings as .avi files, which you can play with any ordinary video player; I would recommend VLC.
Here are the rules for the exercises as explained in Lecture 2.
Here is everything about the mid-term exam.
Here is the file for the Exercise Sheet 11. It's a text file, where each line contains the name of the conference (in capital letters), followed by a TAB (ASCII code 9), followed by the title. There are three different conferences: STOC (2423 titles), SIGIR (2372 titles), and SIGGRAPH (1835 titles). The total number of titles / lines is 6630. The exact file size is 454365 bytes.
Here is the table with the links to your uploaded solutions for Exercise Sheet 12. The deadline is Thursday 04Feb10 16:00.
Questions and comments about Exercise Sheet 12 below this line (most recent on top)
Thanks for the response. Yes, I meant values of M, of course. I don't know why I wrote k - just a typo. Björn 3Feb10 20:58
BTW, I am really happy that you guys do not just mechanically do the exercises and write "my RSS value is X", but that you actually wonder what the results mean and whether things make sense. That is great research spirit, and that's how one learns the most! Hannah 3Feb10 20:23
Hi Björn + all: I think you made some good observations here. Indeed, the RSS becomes smaller as you make M larger, for exactly the reasons you explained. Which means that is doesn't really make sense to compare RSS values for different values of M. Note that the same would have happened in the vector space model, with each text represented as a normalized M-dimensional vector, where M = the dimension = number of features considered, and the similarity measure being the dot produce = the cosine similarity. The components of M-dimensional normalized vectors also become smaller, as M increases, and so does the cosine similarity (assuming that the number of non-zero components in a vector is more or less a constant). But this does not matter, what is important is that for a *fixed* value of M, the RSS value reflects the quality of the clustering, and that it does. BTW, in what you wrote, you mean M = 6 or 7 and not k = 6 or 7, right? Your observation about a too large M having a negative influence on the clusters is also very reasonable. Hannah 3Feb10 20:19
I just made the following observation: Using what Marjan explained (RSS gets in fact higher, because JD is similarity), I recognized that my RSS does gets quite high for small/medium M (like 10), while it does not for larger M. Looking at the jaccard distance / similarity, I observe that having one really large set (my centroid) that is compared to a small one (title), I necessarily get a huge union while my intersection can never exceed the size of the smaller set. Hence, I end up with low JD values and therefore low RSS. I achieved highest rss results for k=6 or 7 which also seems to be a reasonable guess of the avg title length. On the other hand, I am aware that the decrease in the RSS value affects all distances and the clustering might in fact get better. But how exactly is the RSS still meaningful like this? Thanks for any response and I hope I didn't get anything completely wrong. PS: Too large M (1000+) also have a negative influence on the clustering. I assume that's because adding a word that occurs once in one title does not really make the text a better "middle" Björn 3.2. 20:02
Sorry Florian, it's a messy week for us. Now it's there. Marjan 03Feb10 18:39
The link to upload the solutions is once again missing. Florian 03Feb10 18:29
Hi Eric + all: Taking the average is fine. Marjan 3Feb10 18:15
I am not sure about this, but precision and recall are (contrary to RSS) values available for each cluster, that means every cluster has its own precision and recall. How shall we compute the classification's precision & recall? Taking the average? Or did I miss something? Eric 3Feb10 16:50
Hi Mirko, good question. The approach you suggest looks fine to me. An alternative, very simple, yet also reasonable approach would be to simply assign the text to a random centroid in this case. Hannah 2Feb10 20:42
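A tiny Python sketch of this tie-breaking rule (illustrative only; the helper name and the list-of-similarities representation are made up):

```python
import random

# Sketch of the tie-breaking Hannah suggests: if a text has Jaccard
# similarity 0 to every centroid, assign it to a random centroid;
# otherwise pick the centroid with maximum similarity.
def assign_with_tiebreak(sims):
    # sims: Jaccard similarities of one text to each centroid
    if max(sims) == 0:
        return random.randrange(len(sims))
    return max(range(len(sims)), key=sims.__getitem__)

print(assign_with_tiebreak([0.0, 0.4, 0.1]))  # 1
```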
Hi Johannes, I wouldn't call it a problem, since you have solved the exercise and made an interesting observation. Uncharacteristic clusters are indeed something that can happen. But it's of course an interesting question, and the one you are asking, I guess: how one can get a better clustering, and whether that is possible with k-means at all. Well, I don't know myself, since I haven't implemented it for this data set. But one thing I would certainly try in this situation is to start with centroids which are just the "average" (in the sense defined on the exercise sheet) of all titles from that class. This is of course not unsupervised anymore, but it would be interesting to see whether at least with such a strong hint k-means is able to find a good clustering. If yes, then what you observed would just be an unfavorable local maximum. If you try the choice of centroids I just explained, please report on the results, I am curious. Hannah 2Feb10 20:40
To which cluster should we assign the texts which have JaccardDistance = 0 for every centroid? In my implementation they just get assigned to the first centroid, but the thing is that, judging from the Jaccard distance, they belong to none of the centroids. This 'manipulates' the clusters. My idea is to just ignore these texts until there is a centroid with JaccardDistance > 0; is this okay? Mirko 2Feb10, 20:35
I have the problem, that the clusters become too average in the first rounds and no meaningful clustering happens from this point on. In the first round, the distance is 1.0 or 0.99 for almost each title-centroid pair. This leads to very uncharacteristic centroids. Johannes 2Feb10 20:30
Oh yes, what Marjan wrote is actually very important, and I didn't have it on my mind when I wrote the exercise sheet. K-means won't do anything particularly meaningful if in each assignment step you assign each point / record to the *farthest* cluster. And it makes sense that it alternates between two clusterings at some point: assuming that the centroids don't change anymore, after you assign a point to its *farthest* centroid, the centroid to which it was assigned before will be its farthest centroid in the next step. Cool stuff. Hannah 2Feb10 20:19
To Hannah: I tried 10000 for M and the RSS value is still sometimes increasing and sometimes decreasing. I'll change my program according to Marjan's observation now and report again... EDIT: Ok, after changing this everything seems to make sense again, the RSS value is now constantly increasing until it does not change anymore! Florian 02Feb10 20:10
To all: Just a small but important observation. The "Jaccard distance" from the exercise sheet is in fact a similarity (between two texts), meaning that your RSS should in fact grow. This also means that you should pick the cluster centroids for each text with the maximum value of the Jaccard similarity. Marjan 02Feb10 19:46
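To make this concrete, a minimal Python sketch (not the official solution; titles are assumed to be represented as sets of distinct words):

```python
# The "Jaccard distance" from the sheet is |A ∩ B| / |A ∪ B|, i.e. a
# similarity: 1 for identical word sets, 0 for disjoint ones. So each
# text is assigned to the centroid with the MAXIMUM value.

def jaccard_similarity(a, b):
    # a, b: sets of distinct words
    return len(a & b) / len(a | b)

def best_centroid(text_words, centroids):
    # index of the centroid most similar to the text
    return max(range(len(centroids)),
               key=lambda i: jaccard_similarity(text_words, centroids[i]))

centroids = [{"search", "engine", "ranking"}, {"graph", "algorithm", "complexity"}]
print(best_centroid({"ranking", "search", "results"}, centroids))  # 0
```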
Sure, this is a meaningful way to pick the initial centroids, too. However, there is no guarantee at all that most SIGIR papers will be assigned to the cluster for which the initial centroid was a SIGIR paper, simply because the title of that paper might have been very untypical for SIGIR. So you might end up with very low precision and recall, although for another assignment of the clusters to the labels / classes, precision and recall might actually be very good. Hannah 2Feb10 19:32
Hi, is it also allowed to choose the initial centroids as follows: first centroid = random SIGGRAPH text, second centroid = random SIGIR text, third centroid = random STOC text? Then we would also know which centroid belongs to which class. Mirko 2Feb 19:27
Concerning the RSS values: I haven't written the program myself, but Marjan promised to do so soon. But here is some intuition. First, I would also output the RSS value divided by the number of titles (6630), and the square root of that. Let's call this number J. This gives you the average Jaccard distance of a text to its centroid. What is this J on average? Well, the number of words in your centroids quickly becomes M. In the perfect case, all the words of each text are contained in its nearest centroid. Then the Jaccard distance is text length / M, and so J should be average text length / M. Hannah 2Feb10 19:13
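The normalization from this post, sketched in Python (the function name is made up; 6630 is the number of titles mentioned in the post):

```python
import math

# J = square root of (RSS / number of titles): roughly the average
# Jaccard value of a text to its centroid, easier to interpret than
# the raw RSS.
def normalized_rss(rss, num_titles=6630):
    return math.sqrt(rss / num_titles)

print(round(normalized_rss(1000.0), 3))  # 0.388
```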
About computing precision and recall: yes, you need an assignment of the clusters to the three classes SIGGRAPH, SIGIR, STOC. I would do that assignment as follows. Pick the first cluster in your numbering of clusters. Find the most frequent label in the cluster (you have the labels of all titles from dblp.txt). Assign the cluster to that label. Then pick the second cluster, and find the most frequent of the remaining labels, two in this case. And so on, where for three clusters there is no "and so on", because you have only one choice of label left for the third cluster. Once you have a 1-1 assignment from clusters to labels, it should be clear how to compute precision and recall (same definitions as in the case of classification). Hannah 2Feb10 18:57
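The greedy cluster-to-label assignment described here could look as follows in Python (toy data and made-up names; it assumes each cluster still contains at least one of the remaining labels):

```python
from collections import Counter

def assign_labels(clusters, all_labels):
    # clusters: list of lists with the true label of each title in the cluster
    remaining = set(all_labels)
    mapping = {}
    for i, cluster in enumerate(clusters):
        # most frequent among the labels not yet taken
        best = Counter(l for l in cluster if l in remaining).most_common(1)[0][0]
        mapping[i] = best
        remaining.remove(best)
    return mapping

clusters = [["SIGIR", "SIGIR", "STOC"], ["STOC", "STOC"], ["SIGGRAPH"]]
mapping = assign_labels(clusters, ["SIGIR", "STOC", "SIGGRAPH"])
print(mapping)  # {0: 'SIGIR', 1: 'STOC', 2: 'SIGGRAPH'}

# With the 1-1 mapping, precision / recall as in the classification setting:
precision_0 = clusters[0].count(mapping[0]) / len(clusters[0])  # 2/3
recall_sigir = clusters[0].count("SIGIR") / sum(c.count("SIGIR") for c in clusters)  # 1.0
```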
I had negative steps (as asked below), until I found that I had an empty cluster. Fixing that removed the problem. Johannes 2Feb10 18:34
To exercise 3: Can you give an intuition in what range the RSS value should be (approx.)? My RSS values seem very high to me... Additionally, I have the same question as Matthias a few lines ago: how can we calculate recall and precision without knowing which cluster belongs to which conference? Claudius 2Feb10 17:23
Thanks for the answer, Matthias, please also try with M = 10000 (if that is feasible time-wise, otherwise take M a little smaller) and report again, thanks. Hannah 2Feb10 16:43
To Hannah: Yes, for me it does happen for large Ms as well. I tried 10, 100, 1000 with the same qualitative effects. However, for larger M, the RSS value decreases in size. The RSS values are rising & falling in a very small range of about 5%. Do we have an error here, or is the given distance function + recalculation step simply not convergent? Matthias
To Matthias + Florian + all: Which values of M have you tried? Does it still happen when you choose M large enough, say 1000, or even larger? Note that the cluster centroids cannot be meaningful if M is not large enough. Hannah 2Feb10 16:21
I'm noticing exactly the same behavior as Florian. After a few iterations, the RSS value is alternating, esp. when using small M. Also, the RSS values do not fall; they are more or less jumping around the same base value all the time.
Also, in order to calculate Recall + Precision, we have to define which of the clusters is which conference. How should we do that? Just by counting the most frequent conference in a cluster based on the known ground truth information? Matthias
To Alex: You should compute the k-means clustering on the whole set, that's the point of clustering anyway (no train sets). Marjan 02Feb10 14:39
Should we compute k-means on the complete set, or should we divide the set into a training and a test set, like in the exercise before? Alex 02Feb10 14:10
Hi Florian + all. That's a very valid question, although counterintuitive from what we expected. To give a precise answer I'll have to write the program myself, but since I am in the middle of a deadline, I can only promise that I will come back to it asap. My advice is to try to understand why the RSS value increases - either by debugging your own program or by working through a toy example by hand - and then to report here again. Marjan 01Feb10 22:41
The RSS values of my clustering often get higher after an iteration step, is that normal for this kind of data? Furthermore my RSS value never stops changing, it always switches between two different values after a few iterations (I now just choose three random titles as the initial centroids). I also do not understand how I can compute the precision and recall values of the clustering with respect to the "perfect" clustering. Florian 01Feb10 22:41
To Florian, Eric + all: yes, sure, the centroids must be of the same type as the things you cluster. And since the things you should cluster are strings / texts (the titles from the dblp.txt from the last lecture), the centroids also have to be strings / texts. There are many ways how you could pick the initial centroids. One way would be to just pick k random titles from the dblp.txt as centroids. Hannah 1Feb10 20:55
As I understand it, the centroids have to be strings as well, and the distance of a text to the centroid is just the Jaccard distance of their distinct words. The new centroids can then just be created by using the M most frequently occurring words in all texts which are assigned to this centroid, as stated in exercise 1. The only problem then would be to get the initial centroids... Florian 01Feb10 20:50
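A small Python sketch of this centroid update (illustrative only; titles as sets of distinct words, M as on the exercise sheet):

```python
from collections import Counter

# new centroid = the M most frequent words over all titles in the cluster
def recompute_centroid(titles, M):
    counts = Counter(word for title in titles for word in title)
    return {word for word, _ in counts.most_common(M)}

cluster = [{"web", "search"}, {"search", "ranking"}, {"web", "search", "index"}]
print(sorted(recompute_centroid(cluster, 2)))  # ['search', 'web']
```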
As I understood it, we may choose random numbers between 0 and 1, since this is what the Jaccard distance will lead us to. But my next question is: how should we determine the distance of ONE text to its centroid (while initially assigning them to clusters)? Eric 01Feb10 20:38
Hi, how should we choose our initial centroids for the k-means algorithm with the strings? Florian 01Feb10 17:56