
Exercise Sheet 11

The rules for uploading are the same as always. If you forgot them, you can read them again here.

Here is the file for Exercise Sheet 11. It's a text file where each line contains the name of the conference (in capital letters), followed by a TAB (ASCII code 9), followed by the title. There are three different conferences: STOC (2423 titles), SIGIR (2372 titles), and SIGGRAPH (1835 titles). The total number of titles / lines is 6630. The exact file size is 454365 bytes.
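For illustration, here is a minimal sketch of how the file can be read, assuming Python; the file name titles.txt and the encoding are my assumptions, not part of the sheet. It also double-checks the counts given above.

{{{
# Minimal sketch: read the file, split each line at the first TAB into
# conference name and title, and count the titles per conference.
from collections import Counter

counts = Counter()
with open("titles.txt", encoding="utf-8") as f:   # file name / encoding assumed
    for line in f:
        conference, title = line.rstrip("\n").split("\t", 1)
        counts[conference] += 1

print(counts)                  # expected: STOC 2423, SIGIR 2372, SIGGRAPH 1835
print(sum(counts.values()))    # expected: 6630
}}}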

Your solutions (files can only be read by the uploader and by us)

Name | Solution (PDF) | Code (ZIP or TGZ)
Achille Nana | PDF | ZIP
Alexander Gutjahr | PDF | ZIP
Alexander Nutz | PDF | ZIP
Alexander Schneider | PDF | ZIP
Björn Buchhold | PDF | ZIP
Claudius Korzen | PDF | ZIP
Daniel Schauenberg | PDF | tar.gz
Dragos Sorescu | PDF | ZIP
Eric Lacher | PDF | ZIP
Florian Bäurle | PDF | ZIP
Jens Silva Santisteban | PDF | ZIP
Johannes A. Stork | PDF | tar.gz
Jonas Krisch | PDF | ZIP
Manuela Ortlieb | PDF | ZIP
Marius Greitschus | PDF | TGZ
Markus Gruetzner | PDF | ZIP
Matthias Frorath | PDF | ZIP
Matthias Sauer | PDF | tar.gz
Mirko Brodesser | PDF | ZIP
Paresh Paradkar | PDF | ZIP
Waleed Butt | PDF | ZIP
Zhongjie Cai | PDF | ZIP
Richard Zahoransky | PDF | ZIP

These were your questions and comments about Exercise Sheet 11

Hi Johannes, nice way of telling me, and yes, you are of course right, it should be p = h / lambda and q = t / lambda and then p + q = 1 => lambda = h + t. And not p = lambda * h and q = lambda * t and lambda = 1 / (h + t). But believe me, it's very easy to make such stupid mistakes when doing calculations at the (virtual) blackboard. That is why I always ask you guys to pay attention when I am doing calculations, and to correct me if I am doing something wrong. Anyway, also your late feedback is of course appreciated, and I will see how I can correct that thingy on the slides. Hannah 27Jan10 17:34
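For reference, here is the corrected calculation written out as a small LaTeX snippet (h and t are the two counts from the blackboard example):

{{{
p = \frac{h}{\lambda}, \quad q = \frac{t}{\lambda}, \quad p + q = 1
  \;\Rightarrow\; \frac{h + t}{\lambda} = 1
  \;\Rightarrow\; \lambda = h + t,
  \quad\text{so}\quad p = \frac{h}{h+t}, \; q = \frac{t}{h+t}.
}}}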

I have cognitive dissonance about slide 12 from the last lecture. The implications give me an uncomfortable feeling. Johannes 2763-01-27T1704

Hi Matthias, well, you are learning the Pr(W = w|C = c) from the 10% training data (every tenth record), so that's where they should come from. Hannah 27Jan10 16:07
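To illustrate that split, a minimal sketch assuming Python and a list records of (conference, title) pairs read from the file (the names, and starting the "every tenth record" count at the first line, are my assumptions):

{{{
# Every tenth record is the 10% training data; the remaining 90% is what
# you classify and then evaluate against the ground truth.
training   = [r for i, r in enumerate(records) if i % 10 == 0]
evaluation = [r for i, r in enumerate(records) if i % 10 != 0]
}}}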

Hi, could you please explain again what data should be used for calculating the highest Pr(W = w|C = c) in Exercise 2? The whole data, the remaining 90%, or the 10% used for learning? Matthias 27Jan10 11:52

A comment for all who haven't submitted their solutions yet (= most). To compute the argmax_c Pr(C = c) * Prod_w Pr(W = w|C = c), better compute the argmax_c of the logarithms, that is, argmax_c log(Pr(C = c)) + sum_w log Pr(W = w|C = c). The result will be the same, because log is a monotone function. However, computing the sum of the logs is numerically more stable, while computing the product of many small probabilities can lead to numerical problems which can distort results. I don't think it's a big issue for the relatively small data set I gave you, but I would still do it. Anyway, computing the sums of logs of things is no more work than computing the product of the things. Hannah 27Jan10 5:19am
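A minimal sketch of that computation, assuming Python and assuming the learned statistics are available as dictionaries log_prior[c] = log Pr(C = c) and log_cond[c][w] = log Pr(W = w | C = c) (these names are my choice, not part of the sheet):

{{{
import math

def classify(words, log_prior, log_cond):
    """Return the class c maximizing log Pr(C=c) + sum_w log Pr(W=w|C=c)."""
    best_class, best_score = None, float("-inf")
    for c, log_p in log_prior.items():
        # With +1 smoothing over a fixed vocabulary, every vocabulary word has
        # a log-probability; words outside the vocabulary are simply skipped.
        score = log_p + sum(log_cond[c][w] for w in words if w in log_cond[c])
        if score > best_score:
            best_class, best_score = c, score
    return best_class
}}}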

Hi Alex + all: I didn't have a program so far, but have just written one, and my overall precision is 83.61%. So classification seems to work pretty well on this dataset. I didn't do anything fancy, used the +1 smoothing to avoid zero probabilities, and used every word. Note that the titles also contain commas, parentheses and stuff. I am saying this because I have seen that some people have words like "(extended" or "abstract)" or "title.". So please do pay attention to that and do not tokenize merely by whitespace. Also, whenever you write a program, test it!!! That is, have a small procedure that outputs the learned probabilities (or better, the counts), and then check them for a small example. I did that as well for my program, otherwise I would never be convinced that it does the correct thing. Hannah 27Jan10 1:10am
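On the tokenization point, a minimal sketch assuming Python; the particular regular expression is my choice, not a prescription from the sheet:

{{{
import re

def tokenize(title):
    """Lowercase the title and keep only alphanumeric tokens, so that
    "(Extended Abstract)." becomes ["extended", "abstract"] rather than
    ["(extended", "abstract)."]."""
    return re.findall(r"[a-z0-9]+", title.lower())

# A tiny test, in the spirit of the advice above.
assert tokenize("Information Retrieval (Extended Abstract).") == \
       ["information", "retrieval", "extended", "abstract"]
}}}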

Can you give a short hint on how high the match rate should be? With my program the detection rate is around 1/3. I think this is a little bit low, but I also found no hint about what rate is a good rate for the given set of documents. Edit: I found a bug; now the detection rate is around 70%, but the question is still the same: is this a plausible result? Alex 26Jan10 18:20

Hi Claudius + all: to get the points, you only have to compute the w with the highest P(W=w|C=c), even if that is a word like "for" or "of". Would be nice and more interesting though, and not really more work, to compute those w with the k highest P(W=w|C=c). For some not too large k, some interesting words should crop up. You can also choose to ignore stopwords altogether as Marjan suggested. Here is a list of English stopwords. Hannah 25Jan10 18:25
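A minimal sketch of the top-k variant, assuming Python and a dictionary cond_prob_c mapping each word w to Pr(W=w|C=c) for one fixed class c (the names and the optional stopword set are my assumptions):

{{{
import heapq

def top_k_words(cond_prob_c, k, stopwords=frozenset()):
    """Return the k words with the highest Pr(W=w|C=c) for one class c,
    optionally skipping stopwords such as "for" and "of"."""
    candidates = ((p, w) for w, p in cond_prob_c.items() if w not in stopwords)
    return [w for p, w in heapq.nlargest(k, candidates)]
}}}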

Hi Claudius. My recommendation is to ignore stop-words (e.g. the, a, of, is, are etc., for reasons already explained in the lecture) but please wait for a reply from Hannah to be sure. Marjan 25Jan10 14:30

Hi. In Exercise 2, we have to identify the most predictive word for each conference. But when I take the highest Pr(W=w|C=c), I get words that are not very predictive, like "for" and "of". Is this sufficient, or should we make an effort to find words which are more predictive? Claudius 25 Jan 14:26

Yes, very good question (the second one), I had it on my agenda for the lecture, but somehow forgot to tell you about it. There is a very simple and effective solution to that problem, which you should also use in the exercise. On slide #10, I told you to take Pr(W = w | C = c) = n_wc / sum_w n_wc, where n_wc is the total number of occurrences of word w in class c. Well, just take Pr(W = w | C = c) = (n_wc + 1) / sum_w (n_wc + 1), which can never be zero. Intuitively, this is like saying that every word occurs at least once for each class, which is also reasonable, because if your amount of data is big enough, that will indeed happen. It's just an artefact of small data that some words don't occur at all for certain classes. Please ask again in case that was not crystal clear. Hannah 24Jan10 21:49
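A minimal sketch of that smoothing, assuming Python, a count dictionary counts_c with counts_c[w] = n_wc for one class c, and a fixed vocabulary vocab collected from the training data (the names are my assumptions); the denominator is exactly the sum in the formula above, i.e. the total count for class c plus one for every vocabulary word:

{{{
import math

def smoothed_log_probs(counts_c, vocab):
    """Compute log Pr(W=w|C=c) = log((n_wc + 1) / sum_w (n_wc + 1)) with
    add-one smoothing over the whole vocabulary, so no probability is zero."""
    denom = sum(counts_c.get(w, 0) + 1 for w in vocab)
    return {w: math.log((counts_c.get(w, 0) + 1) / denom) for w in vocab}
}}}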

To Florian + all: Of course you should use the Bayes formula to predict the most probable conference (class). The second question is a good one. I think the natural way is to take that probability as zero. Another way (actually the opposite) is to ignore the words that have not appeared in the original training set i.e. assume that they're not relevant for the prediction. Marjan 24Jan10 21:32

I have a question about Exercise 2: I do not quite understand how we should predict the conferences for the remaining records. Should we just decide by looking at the most predictive word, or should we use the Naive Bayes formula from the slides ( argmax_c Pr(C = c) · Π_i=1,...,m Pr(W_i = w_i | C = c) )? And using the Bayes formula, how should we handle words that occur in a title but did not occur in the training data? Using zero for their probability makes the whole probability for the conference zero as well, which is not very reasonable. Florian 24Jan10 21:20

I have also uploaded the master solutions for exercise sheet 10 now, see the link above. Note that it's just two pages. Above you also find links to the previous master solutions now (that is, for the mid-term exam and for exercise sheet 9). If you find any mistakes in any of the master solutions, please let us know immediately, thanks. Also, if you have any questions / comments regarding the master solutions, don't hesitate to ask. Hannah 24Jan10 16:05

Ok, the file is now there, see the link and short description above. Have fun, and let us know if you are having any problems. NOTE: I said it in the lectures, but let me repeat it here, just in case: you must, of course, use only the words from the title as features. The conference name in the first column is only there so that you know the ground truth, which you need for the learning in Exercise 1, as well as for the quality assessment in Exercise 4. Hannah 24Jan10 15:48

I will do it right now, sorry, it was just procrastination from my side. Hannah 24Jan10 15:06

Hi, can you please upload the text-file with the publication records? Claudius 24 Jan 12:05

Hi Manuela + all: I understand your point. I think that when one is familiar with basic linear algebra, then all the exercises (including Exercise 2, given my fairly strong and concrete hints) are something which you just sit down and do, no deep thinking required. But when one is not familiar, then yes, I can see that most of the time will be spent on understanding the meaning of basic things (which, I agree, is very important), like why one can write something like u * v', where u and v are vectors, and obtain a matrix. I guess I am constantly underestimating the mathematical background and practice you received in your first semesters here in Freiburg. Anyway, I will take this into account when computing the marks from your points for the exercise sheets 9, 10, 11, etc. Note that also for the first 8 exercise sheets you could get a 1.0 without getting all the points, even after taking the worst sheet out of the counting. We will have something similar for the second half, too. So don't worry, it will be fair, and please continue to make an effort with the exercises, and continue to give me feedback when an exercise consumed way too much time, for whatever reason. Hannah 21Jan 17:48

Maybe it's only a problem for me that I can't just sit down and immediately start to prove, for example, Exercise 2 or 3. I'm not familiar with linear algebra, and it's difficult to understand the meaning of what we do. So before I can start, I have to search for information and read up on what matrix norms, Frobenius norms, and so on are. That's why Exercises 2 and 3 took me so much time. Proving the hints (at the bottom of this page) is also not something I can do in five minutes. And for Exercise 1 it was my own fault that I needed much more time; I was confused and did some silly things. Of course it would be nice to have the bonus points for the exam, but it will be hard (and time consuming) to solve all tasks of all exercise sheets without gaps. Thanks for the hints, and I think that the new bonus point system is much better than the old one. The only thing is that I'm not sure if the "time calculation" is better than before. Maybe I'm just too slow. Manuela

To Björn + all: Yes, I see, I think the solution to an exercise like Exercise 1 is much faster to write on paper and then scan in. Typesetting lots of matrices etc. in LaTeX is no fun, takes lots of time, and shouldn't really be part of an exercise. Hannah 21Jan10 14:32

Yes, your last hint was very helpful. Thanks a lot. Sorry for the late response, but I had to work for other courses first, and it took me like 3 hours to put the other solutions into LaTeX (maybe this is also one reason why this sheet takes a lot of time again; Ex1 in particular is okay to solve using applets/programs + copy&paste for all intermediate steps, but writing everything down still takes ages). Now that I looked at Exercise 2 again, your hint really helped. Björn 21Jan 13:03
