8544
Comment:
|
12722
|
Deletions are marked like this. | Additions are marked like this. |
Line 3: | Line 3: |
Here are PDFs of the slides of the lectures so far: [[attachment:SearchEnginesWS0910/lecture-1.pdf|Lecture 1]], [[attachment:SearchEnginesWS0910/lecture-2.pdf|Lecture 2]], [[attachment:SearchEnginesWS0910/lecture-3.pdf|Lecture 3]], [[attachment:SearchEnginesWS0910/lecture-4.pdf|Lecture 4]]. | Here are PDFs of the slides of the lectures so far: [[attachment:SearchEnginesWS0910/lecture-1.pdf|Lecture 1]], [[attachment:SearchEnginesWS0910/lecture-2.pdf|Lecture 2]], [[attachment:SearchEnginesWS0910/lecture-3.pdf|Lecture 3]], [[attachment:SearchEnginesWS0910/lecture-4.pdf|Lecture 4]], [[attachment:SearchEnginesWS0910/lecture-5.pdf|Lecture 5]], [[attachment:SearchEnginesWS0910/lecture-6.pdf|Lecture 6]], [[attachment:SearchEnginesWS0910/lecture-7.pdf|Lecture 7]], [[attachment:SearchEnginesWS0910/lecture-8.pdf|Lecture 8]], [[attachment:SearchEnginesWS0910/lecture-9.pdf|Lecture 9]], [[attachment:SearchEnginesWS0910/lecture-10.pdf|Lecture 10]],[[attachment:SearchEnginesWS0910/lecture-11.pdf|Lecture 11]]. |
Line 5: | Line 5: |
Here are .lpd files of the recordings of the lectures so far (except Lecture 2, where we had problems with the microphone): [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-1.lpd|Lecture 1]] [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-3.lpd|Lecture 3]] [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-4.lpd|Lecture 4]]. | |
Line 7: | Line 6: |
Here are PDFs of the exercise sheets so far: [[attachment:SearchEnginesWS0910/exercise-1.pdf|Exercise Sheet 1]], [[attachment:SearchEnginesWS0910/exercise-2.pdf|Exercise Sheet 2]], [[attachment:SearchEnginesWS0910/exercise-3.pdf|Exercise Sheet 3]], [[attachment:SearchEnginesWS0910/exercise-4.pdf|Exercise Sheet 4]]. | Here are the recordings of the lectures so far (except Lecture 2, where we had problems with the microphone), LPD = Lecturnity recording: [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-1.lpd|Recording Lecture 1 (LPD)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-3.lpd|Recording Lecture 3 (LPD)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-4.lpd|Recording Lecture 4 (LPD)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-5.lpd|Recording Lecture 5 (LPD without audio)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-6.lpd|Recording Lecture 6 (LPD)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-7.avi|Recording Lecture 7 (AVI)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-8.avi|Recording Lecture 8 (AVI)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-9.avi|Recording Lecture 9 (AVI)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-10.avi|Recording Lecture 10 (AVI)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-11.avi|Recording Lecture 11 (AVI)]]. |
Line 9: | Line 8: |
Here are your solutions and comments on the previous exercise sheets: [[SearchEnginesWS0910/ExerciseSheet1|Solutions and Comments 1]], [[SearchEnginesWS0910/ExerciseSheet2|Solutions and Comments 2]], [[SearchEnginesWS0910/ExerciseSheet3|Solutions and Comments 3]] | Here are PDFs of the exercise sheets so far: [[attachment:SearchEnginesWS0910/exercise-1.pdf|Exercise Sheet 1]], [[attachment:SearchEnginesWS0910/exercise-2.pdf|Exercise Sheet 2]], [[attachment:SearchEnginesWS0910/exercise-3.pdf|Exercise Sheet 3]], [[attachment:SearchEnginesWS0910/exercise-4.pdf|Exercise Sheet 4]], [[attachment:SearchEnginesWS0910/exercise-5.pdf|Exercise Sheet 5]], [[attachment:SearchEnginesWS0910/exercise-6.pdf|Exercise Sheet 6]], [[attachment:SearchEnginesWS0910/exercise-7.pdf|Exercise Sheet 7]], [[attachment:SearchEnginesWS0910/exercise-8.pdf|Exercise Sheet 8]], [[attachment:SearchEnginesWS0910/exercise-9.pdf|Exercise Sheet 9]], [[attachment:SearchEnginesWS0910/exercise-10.pdf|Exercise Sheet 10]], [[attachment:SearchEnginesWS0910/exercise-11.pdf|Exercise Sheet 11]]. |
Line 11: | Line 10: |
= Exercise Sheet 3 = The recordings of all lectures are now available, see above. Lecture 2 is missing because we had technical problems there. To play the recordings (it's .lpd files) you need the Lecturnity Player. [[http://www.lecturnity.de/de/download/lecturnity-player|You can download the player for free here]]. |
Here are your solutions and comments on the previous exercise sheets: [[SearchEnginesWS0910/ExerciseSheet1|Solutions and Comments 1]], [[SearchEnginesWS0910/ExerciseSheet2|Solutions and Comments 2]], [[SearchEnginesWS0910/ExerciseSheet3|Solutions and Comments 3]], [[SearchEnginesWS0910/ExerciseSheet4|Solutions and Comments 4]], [[SearchEnginesWS0910/ExerciseSheet5|Solutions and Comments 5]], [[SearchEnginesWS0910/ExerciseSheet6|Solutions and Comments 6]], [[SearchEnginesWS0910/ExerciseSheet7|Solutions and Comments 7]], [[SearchEnginesWS0910/ExerciseSheet8|Solutions and Comments 8]], [[SearchEnginesWS0910/ExerciseSheet9|Solutions and Comments 9]], [[SearchEnginesWS0910/ExerciseSheet10|Solutions and Comments 10]]. Here are our master solutions: [[attachment:SearchEnginesWS0910/solution-midterm.pdf|Master solution for Mid-Term Exam]],[[attachment:SearchEnginesWS0910/solution-9.pdf|Master solution for Exercise Sheet 9]], [[attachment:SearchEnginesWS0910/solution-10.pdf|Master solution for Exercise Sheet 10]]. The recordings of all lectures are now available, see above. Lecture 2 is missing because we had technical problems there. To play the Lecturnity recordings (.lpd files) you need the [[http://www.lecturnity.de/de/download/lecturnity-player|Lecturnity Player, which you can download here]]. I put the Camtasia recordings as .avi files, which you can play with any ordinary video player; I would recommend [[http://www.videolan.org/vlc|VLC]]. |
Line 16: | Line 18: |
[[SearchEnginesWS0910/ExerciseSheet4|Here you can upload your solutions for Exercise Sheet 4]]. | [[SearchEnginesWS0910/MidTermExam|Here is everything about the mid-term exam]]. |
Line 18: | Line 20: |
== Questions or comments below this line, most recent on top please == | [[attachment:dblp.txt|Here is the file for the Exercise Sheet 11]]. It's a text file, where each line contains the name of the conference (in capital letters), followed by a TAB (ASCII code 9), followed by the title. There are three different conferences: STOC (2423 titles), SIGIR (2372 titles), and SIGGRAPH (1835 titles). The total number of titles / lines is 6630. The exact file size is 454365 bytes. |
Line 20: | Line 22: |
I'd like to suggest that everyone grades the exercise sheet from 1 (for "way to easy") to 10 ("way to hard"). This might provide the professor with the feedback she asks for in the lecture. How about that idea? '''Johannes 2009-11-15T20:40L''' | [[SearchEnginesWS0910/ExerciseSheet11|Here is the table with the links to your uploaded solutions for Exercise Sheet 11]]. The deadline is Thursday 28Jan10 16:00. |
Line 22: | Line 24: |
To Florian + all: yes, sorry, I forgot to mention this in the lecture. Marjan already explained how to clear the disk cache. Let me add to this an explanation what the disk cache actually is. Whenever you read a (part of a) file from disk, the operating system of your computed will use whatever memory is currently unused to store that (part of the) file there. When you read it again and the (part of the) file hasn't changed and the memory used to store it has not been used otherwise in the meantime, than that data is read right from memory, which is much faster than reading it from disk. Usually that effect is desirable, because it speeds up things, but when you do experiments, it is undesirable, because it leads to unrealistically good running times, especially when carrying out an experiment many times in a row. '''Hannah 15Nov09 8:10pm''' | == Questions and comments about Exercise Sheet 11 below this line (most recent on top) == |
Line 24: | Line 26: |
To Florian: Indeed, we were running out of time and there was no room for this in the lecture. I can suggest to you few ways how to clear the disk cache: before carrying out your final experiment, read a large amount of data (let's say close to the amount of RAM you have) from disk - this will ensure that your data (the inverted list) is cleared from the disk cache and replaced by something else (thus an actual reading from disk get's timed, and not reading from RAM). Another way is to restart your computer before doing the timing. '''Marjan 15Nov09 7:27pm''' | Can you give an short hint how high the match rate should be. At my program the detection rate is around 1/3. I think this is a little bit low, but I also found no hint what rate is a good rate for the given set of documents. Edit: I found a bug now the detection rate is around 70%, but the question is still the same, is this a possible result? '''Alex 26Jan10 18:20''' |
Line 26: | Line 28: |
In exercise 4 it says: "Important note: Whenver you measure running times for reading data from disk, you have to clear the disk cache before, as discussed in the lecture". I think that this was not discussed in the lecture? What do we have to do here? '''Florian 15Nov09 7:15pm''' | Hi Claudius + all: to get the points, you only have to compute the w with the highest P(W=w|C=c), even if that is a word like "for" or "of". Would be nice and more interesting though, and not really more work, to compute those w with the k highest P(W=w|C=c). For some not too large k, some interesting words should crop up. You can also choose to ignore stopwords altogether as Marjan suggested. Here is a [[http://armandbrahaj.blog.al/2009/04/14/list-of-english-stop-words|list of English stopwords]]. '''Hannah 25Jan10 18:25''' |
Line 28: | Line 30: |
@Bit shifting: The syntax for that is actually the same, irrespectively of whether you use Java, C++, perl, python, or whatever. The >> operator shifts to the right, the << operator shifts to the left, the & operator ands the bits of the two operands and the | operator ors the bits of the two operands. Very simple. You will also find zillions of example programs on the web by typing something like ''java bit shifting'' into Google or whatever your favorite search engine is. '''Hannah 15Nov09 1:16''' | Hi Claudius. My recommendation is to ignore stop-words (e.g. the, a, of, is, are etc., for reasons already explained in the lecture) but please wait for a reply from Hannah to be sure. '''Marjan 25Jan10 14:30''' |
Line 30: | Line 32: |
Hi Marius + all: For Exercise 4, an inverted list of size m with doc ids from the range [1..n] is simply a sorted list of m numbers from the range [1..n]. I leave it to you, whether your lists potentially contain duplicates (as in 3, 5, 5, 8, 12, ...) or whether you generate them in a way that they don't contain duplicates (as in 3, 5, 8, 17, ...). It doesn't really matter for the exercise whether your list has duplicated or not. In any case, consider simple flat lists like in the two examples I gave (and like all the examples I gave in this and past lectures), not lists of lists or anything. '''Hannah 15Nov09 1:12am''' | Hi. In Exercise 2, we have to identify the most predictive word for each conference. But, when I take the heighest Pr(W=w|C=c), I get not very predictive words like "for" and "of". Is this sufficient, or should we make an effort, to find words, which are more predictive? '''Claudius 25 Jan 14:26''' |
Line 32: | Line 34: |
@Mirko: Sure, but an inverted list is a list of words where the Doc-IDs are attached to each words in which the words occur. So for Example: If word no. 5 occurs in Doc1, Doc2 and Doc3 and word no. 2 occurs in Doc5, the list would look like: 5 -> Doc1, Doc2, Doc3; 2 -> Doc5. Or am I mistaken? My question then is, how long should these attached lists be in average case? I mean, one could imagine that we got 1mil. documents over 3 words, so these lists could get very large... | Yes, very good question (the second one), I had it on my agenda for the lecture, but somehow forgot to tell you about it. There is a very simple and effective solution to that problem, which you should also use in the exercise. On slide #10, I told you to take Pr(W = w | C = c) = n_wc / sum_w n_wc, where n_wc is the total number of occurrences of word w in class c. Well, just take Pr(W = w | C = c) = (n_wc + 1) / sum_w (n_wc + 1), which can never be zero. Intuitively, this is like saying that every word occurs at least once for each class. Which is also reasonable, because if your amount of data is big enough, that will indeed happen. It's just an artefact of small data that some words don't occur at all for certain classes. Please ask again in case that was not crystal clear. '''Hannah 24Jan10 21:49''' |
Line 34: | Line 36: |
EDIT: Oh ok. Now, I see your point. It's not an index, it's a list. Okay. So, what is an inverted list with Doc-IDs, then? | To Florian + all: Of course you should use the Bayes formula to predict the most probable conference (class). The second question is a good one. I think the natural way is to take that probability as zero. Another way (actually the opposite) is to ignore the words that have not appeared in the original training set i.e. assume that they're not relevant for the prediction. '''Marjan 24Jan10 21:32''' |
Line 36: | Line 38: |
EDIT EDIT: And to your question, Mirko, take a look at [[http://snippets.dzone.com/posts/show/93]]. Especially at Comment no. 2. Maybe this helps... I think, Java supports StreamWriters/Readers that are able to write/read bytes. '''Marius 11/14/2009 08:46pm''' | I have a question to Exercise 2: I do not quite understand how we should predict the conferences for the remaining records. Should we just decide by looking at the most predictive word and decide with that or should we use the Naive Bayes formula of the slides ( argmax_c Pr(C = c) · Π_i=1,...,m Pr(W_i = w_i | C = c) ). And using the Bayes fomula, how should we handle occuring words that did not occur in the training data? Using zero for their probability makes the whole probability for the conference zero as well which is not very reasonable. '''Florian 24Jan10 21:20''' |
Line 38: | Line 40: |
EDIT EDIT EDIT: Sorry, me again. Well, I bothered Wikipedia which redirects from [[http://en.wikipedia.org/wiki/Inverted_list]] to Inverted Index. So it seems to me, this is being used as a synonym. Actually, I think I'm confused enough, now. I'll better wait for any responses... ;-) '''Marius 11/14/2009 9:08 pm''' | I have also uploaded the master solutions for exercise sheet 10 now, see the link above. Note that it's just two pages. Above you also find links to the previous master solutions now (that is, for the mid-term exam and for exercise sheet 9). If you find any mistakes in any of the master solutions, please let us know immediately, thanks. Also, if you have any questions / comments regarding the master solutions, don't hesitate to ask. '''Hannah 24Jan10 16:05''' |
Line 40: | Line 42: |
@ Marius: i think we are supposed to generate one inverted __list__ of size m, with doc ids from 1..n (therefore n>=m, because no duplicates?). | Ok, the file is now there, see the link and short description above. Have fun, and let us know if you are having any problems. '''NOTE:''' I said it in the lectures, but let me repeat it here, just in case, you must, of course, only use ''only the words from the title as features''. The conference name in the first column is only so that you know the ground truth, which you need for the learning in Exercise 1, as well as for the quality assessment in Exercise 4. '''Hannah 24Jan10 15:48''' |
Line 42: | Line 44: |
Now a question from my side: ex.4, programming the compression in __java__, is there any __good__ tutorial about how to handle the bit-stuff? (otherwise, i think, it would cost me too much time..) '''Mirko 14Nov09, 19:18''' | I will do it right now, sorry, it was just procrastination from my side. '''Hannah 24Jan10 15:06''' |
Line 44: | Line 46: |
Hi, do you have any suggestions what the best numbers for m and n in exercise 4 should look like? Or are we supposed to mess around a bit with ints and longs? And: How long should the list of documents in the inverted index be? '''Marius 14Nov09 6:40pm''' | Hi, can you please upload the text-file with the publication records? '''Claudius 24 Jan 12:05''' |
Line 46: | Line 48: |
And just to clarify what a single-cycle permutation is. Here is an example for an array of size 5 with a permutation that is a single cycle: 5 4 1 3 2. Why single cycle? Well, A[1] = 5, A[5] = 2, A[2] = 4, A[4] = 3, A[3] = 1. (My indices in this example are 1,...,5 and not 0,...,4.) Here is an example of a permutation with three cycles: 2 1 4 3 5. The first cycle is A[1] = 2, A[2] =1. The second cycle is A[3] = 4, A[4] = 3. The third cycle is A[5] = 5. '''Hannah 12Nov09 8:04pm''' | Hi Manuela + all: I understand your point. I think that when one is familiar with basic linear algebra, then all the exercises (including Exercise 2, given my fairly strong and concrete hints) are something which you just sit down and do, no deep thinking required. But when one is not familiar, then yes, I can see that most of the time will be spend on understanding the meaning of basic things (which, I agree, is very important) like why can one write something like u * v', where u and v are vectors, and obtain a matrix. I guess I am constantly underestimating the mathematical background and exercise you received in you first semesters here in Freiburg. Anyway, I will take this into account when computing the marks from your points for the exercise sheets 9, 10, 11, etc. Note that also for the first 8 exercise sheets you could get a 1.0 without getting all the points, even after taking the worst sheet out of the counting. We will have something similar for the second half, too. So don't worry, it will be fair, and please continue to make an effort with the exercises, and continue to give me feedback when an exercise consumed way too much time, for whatever reason. '''Hannah 21Jan 17:48''' |
Line 48: | Line 50: |
Hi Daniel + all, I don't quite understand your question and your example (if your array is 1 5 3 4 2, why is A[1] = 3?). In case you refer to the requirement of the exercise that the permutation consists only of a single cycle. That is because your code should go over each element exactly once (it should, of course, stop after n iterations, where n is the size of the array). If your permutation has more than one cycle, it is hard to achieve that. Also note that for both (1) and (2), the sum of the array values should be sum_i=1,...,n i = n * (n+1) / 2. '''Hannah 12Nov09 7:54pm''' | Maybe it's only a problem for me that I can't sit down and start to prove f.e. exercise 2 or 3 immediately. I'm not familiar with linear algebra and it's difficult to understand the meaning of what we do. So before I can start I have to search for information and have to read what matrix norms and Frobenius norms and so on is. That's why it took much time for me to do exercise 2 and 3. Proving the hints (at the bottom of this page) is also nothing what I can do in five minutes. And for exercise 1 it was my own fault that I need much more time for it. I was confused and made some silly stuff. Of course it would be nice to have the bonus points for the exam, but it will be hard (and time consuming) to solve all tasks of all exercise sheets without gaps. Thanks for the hints and I think that the new bonus point system is much better than the old one. The only thing is that I'm not sure, if the "time calculation" is better than before. Maybe I'm just too slow. '''Manuela''' |
Line 50: | Line 52: |
Hi, I just looked at the new exercise sheet 4, in exercise 1 we should generate a permutation and sum the resulting array up, am I wrong or doesn't iterating method two iterate throw the whole array in every situation. for ex.: n= 5 permutation: 1 5 3 4 2, then A[1] = 3, A[A[1]]= A[3] = 1, A[1] = 3 ... '''Daniel 12Nov09 19:44pm''' | To Björn at all: Yes, I see, I think the solution to an exercise like Exercise 1 is much faster to write on paper and then scan it in. Typesetting lots of matrices etc. in Latex is no fun and takes lots of time and shouldn't really be part of an exercise. '''Hannah 21Jan10 14:32''' Yes, your last hint was very helpful. Thanks a lot. Sorry for the late response but I had to work for other courses first and it took me like 3 hours to put the other solutions into Latex (maybe this is also one reason why this sheet takes lots of time again. Especially Ex1 is okay to solve using applets/programs + copy&paste for all intermediate steps, but writing everything down, still takes ages). Now that I looked at exercise 2 again, your hint really helped. '''Björn 21Jan 13:03''' |
Welcome to the Wiki page of the course Search Engines, WS 2009 / 2010. Lecturer: Hannah Bast. Tutorials: Marjan Celikik. Course web page: click here.
Here are PDFs of the slides of the lectures so far: Lecture 1, Lecture 2, Lecture 3, Lecture 4, Lecture 5, Lecture 6, Lecture 7, Lecture 8, Lecture 9, Lecture 10,Lecture 11.
Here are the recordings of the lectures so far (except Lecture 2, where we had problems with the microphone), LPD = Lecturnity recording: Recording Lecture 1 (LPD), Recording Lecture 3 (LPD), Recording Lecture 4 (LPD), Recording Lecture 5 (LPD without audio), Recording Lecture 6 (LPD), Recording Lecture 7 (AVI), Recording Lecture 8 (AVI), Recording Lecture 9 (AVI), Recording Lecture 10 (AVI), Recording Lecture 11 (AVI).
Here are PDFs of the exercise sheets so far: Exercise Sheet 1, Exercise Sheet 2, Exercise Sheet 3, Exercise Sheet 4, Exercise Sheet 5, Exercise Sheet 6, Exercise Sheet 7, Exercise Sheet 8, Exercise Sheet 9, Exercise Sheet 10, Exercise Sheet 11.
Here are your solutions and comments on the previous exercise sheets: Solutions and Comments 1, Solutions and Comments 2, Solutions and Comments 3, Solutions and Comments 4, Solutions and Comments 5, Solutions and Comments 6, Solutions and Comments 7, Solutions and Comments 8, Solutions and Comments 9, Solutions and Comments 10.
Here are our master solutions: Master solution for Mid-Term Exam,Master solution for Exercise Sheet 9, Master solution for Exercise Sheet 10.
The recordings of all lectures are now available, see above. Lecture 2 is missing because we had technical problems there. To play the Lecturnity recordings (.lpd files) you need the Lecturnity Player, which you can download here. I put the Camtasia recordings as .avi files, which you can play with any ordinary video player; I would recommend VLC.
Here are the rules for the exercises as explained in Lecture 2.
Here is everything about the mid-term exam.
Here is the file for the Exercise Sheet 11. It's a text file, where each line contains the name of the conference (in capital letters), followed by a TAB (ASCII code 9), followed by the title. There are three different conferences: STOC (2423 titles), SIGIR (2372 titles), and SIGGRAPH (1835 titles). The total number of titles / lines is 6630. The exact file size is 454365 bytes.
Here is the table with the links to your uploaded solutions for Exercise Sheet 11. The deadline is Thursday 28Jan10 16:00.
Questions and comments about Exercise Sheet 11 below this line (most recent on top)
Can you give an short hint how high the match rate should be. At my program the detection rate is around 1/3. I think this is a little bit low, but I also found no hint what rate is a good rate for the given set of documents. Edit: I found a bug now the detection rate is around 70%, but the question is still the same, is this a possible result? Alex 26Jan10 18:20
Hi Claudius + all: to get the points, you only have to compute the w with the highest P(W=w|C=c), even if that is a word like "for" or "of". Would be nice and more interesting though, and not really more work, to compute those w with the k highest P(W=w|C=c). For some not too large k, some interesting words should crop up. You can also choose to ignore stopwords altogether as Marjan suggested. Here is a list of English stopwords. Hannah 25Jan10 18:25
Hi Claudius. My recommendation is to ignore stop-words (e.g. the, a, of, is, are etc., for reasons already explained in the lecture) but please wait for a reply from Hannah to be sure. Marjan 25Jan10 14:30
Hi. In Exercise 2, we have to identify the most predictive word for each conference. But, when I take the heighest Pr(W=w|C=c), I get not very predictive words like "for" and "of". Is this sufficient, or should we make an effort, to find words, which are more predictive? Claudius 25 Jan 14:26
Yes, very good question (the second one), I had it on my agenda for the lecture, but somehow forgot to tell you about it. There is a very simple and effective solution to that problem, which you should also use in the exercise. On slide #10, I told you to take Pr(W = w | C = c) = n_wc / sum_w n_wc, where n_wc is the total number of occurrences of word w in class c. Well, just take Pr(W = w | C = c) = (n_wc + 1) / sum_w (n_wc + 1), which can never be zero. Intuitively, this is like saying that every word occurs at least once for each class. Which is also reasonable, because if your amount of data is big enough, that will indeed happen. It's just an artefact of small data that some words don't occur at all for certain classes. Please ask again in case that was not crystal clear. Hannah 24Jan10 21:49
To Florian + all: Of course you should use the Bayes formula to predict the most probable conference (class). The second question is a good one. I think the natural way is to take that probability as zero. Another way (actually the opposite) is to ignore the words that have not appeared in the original training set i.e. assume that they're not relevant for the prediction. Marjan 24Jan10 21:32
I have a question to Exercise 2: I do not quite understand how we should predict the conferences for the remaining records. Should we just decide by looking at the most predictive word and decide with that or should we use the Naive Bayes formula of the slides ( argmax_c Pr(C = c) · Π_i=1,...,m Pr(W_i = w_i | C = c) ). And using the Bayes fomula, how should we handle occuring words that did not occur in the training data? Using zero for their probability makes the whole probability for the conference zero as well which is not very reasonable. Florian 24Jan10 21:20
I have also uploaded the master solutions for exercise sheet 10 now, see the link above. Note that it's just two pages. Above you also find links to the previous master solutions now (that is, for the mid-term exam and for exercise sheet 9). If you find any mistakes in any of the master solutions, please let us know immediately, thanks. Also, if you have any questions / comments regarding the master solutions, don't hesitate to ask. Hannah 24Jan10 16:05
Ok, the file is now there, see the link and short description above. Have fun, and let us know if you are having any problems. NOTE: I said it in the lectures, but let me repeat it here, just in case, you must, of course, only use only the words from the title as features. The conference name in the first column is only so that you know the ground truth, which you need for the learning in Exercise 1, as well as for the quality assessment in Exercise 4. Hannah 24Jan10 15:48
I will do it right now, sorry, it was just procrastination from my side. Hannah 24Jan10 15:06
Hi, can you please upload the text-file with the publication records? Claudius 24 Jan 12:05
Hi Manuela + all: I understand your point. I think that when one is familiar with basic linear algebra, then all the exercises (including Exercise 2, given my fairly strong and concrete hints) are something which you just sit down and do, no deep thinking required. But when one is not familiar, then yes, I can see that most of the time will be spend on understanding the meaning of basic things (which, I agree, is very important) like why can one write something like u * v', where u and v are vectors, and obtain a matrix. I guess I am constantly underestimating the mathematical background and exercise you received in you first semesters here in Freiburg. Anyway, I will take this into account when computing the marks from your points for the exercise sheets 9, 10, 11, etc. Note that also for the first 8 exercise sheets you could get a 1.0 without getting all the points, even after taking the worst sheet out of the counting. We will have something similar for the second half, too. So don't worry, it will be fair, and please continue to make an effort with the exercises, and continue to give me feedback when an exercise consumed way too much time, for whatever reason. Hannah 21Jan 17:48
Maybe it's only a problem for me that I can't sit down and start to prove f.e. exercise 2 or 3 immediately. I'm not familiar with linear algebra and it's difficult to understand the meaning of what we do. So before I can start I have to search for information and have to read what matrix norms and Frobenius norms and so on is. That's why it took much time for me to do exercise 2 and 3. Proving the hints (at the bottom of this page) is also nothing what I can do in five minutes. And for exercise 1 it was my own fault that I need much more time for it. I was confused and made some silly stuff. Of course it would be nice to have the bonus points for the exam, but it will be hard (and time consuming) to solve all tasks of all exercise sheets without gaps. Thanks for the hints and I think that the new bonus point system is much better than the old one. The only thing is that I'm not sure, if the "time calculation" is better than before. Maybe I'm just too slow. Manuela
To Björn at all: Yes, I see, I think the solution to an exercise like Exercise 1 is much faster to write on paper and then scan it in. Typesetting lots of matrices etc. in Latex is no fun and takes lots of time and shouldn't really be part of an exercise. Hannah 21Jan10 14:32
Yes, your last hint was very helpful. Thanks a lot. Sorry for the late response but I had to work for other courses first and it took me like 3 hours to put the other solutions into Latex (maybe this is also one reason why this sheet takes lots of time again. Especially Ex1 is okay to solve using applets/programs + copy&paste for all intermediate steps, but writing everything down, still takes ages). Now that I looked at exercise 2 again, your hint really helped. Björn 21Jan 13:03