Diff for "SearchEnginesWS0910"

Differences between revisions 288 and 750 (spanning 462 versions)

Welcome to the Wiki page of the course Search Engines, WS 2009 / 2010. Lecturer: Hannah Bast. Tutorials: Marjan Celikik. Course web page: click here.

Here are PDFs of the slides of the lectures: Lecture 1, Lecture 2, Lecture 3, Lecture 4, Lecture 5, Lecture 6, Lecture 7, Lecture 8, Lecture 9, Lecture 10, Lecture 11, Lecture 12, Lecture 13, Lecture 14, Projects.

Here are PDFs of the exercise sheets so far: Exercise Sheet 1, Exercise Sheet 2, Exercise Sheet 3, Exercise Sheet 4, Exercise Sheet 5, Exercise Sheet 6, Exercise Sheet 7, Exercise Sheet 8, Exercise Sheet 9, Exercise Sheet 10, Exercise Sheet 11, Exercise Sheet 12, Exercise Sheet 13, Exercise Sheet 14.

Here are your solutions and comments on the previous exercise sheets: Solutions and Comments 1, Solutions and Comments 2, Solutions and Comments 3, Solutions and Comments 4, Solutions and Comments 5, Solutions and Comments 6, Solutions and Comments 7, Solutions and Comments 8, Solutions and Comments 9, Solutions and Comments 10, Solutions and Comments 11, Solutions and Comments 12, Solutions and Comments 13.

Here are our master solutions: Master solution for Mid-Term Exam,Master solution for Exercise Sheet 9, Master solution for Exercise Sheet 10, Master solution for Exercise Sheet 11, Master solution for Exercise Sheet 12.

Here are the rules for the exercises as explained in Lecture 2.

Here is everything about the mid-term exam. The final exam is on Friday March 12, 2010. The written exam begins at 2.00 pm in HS 026. The oral exams are scheduled on the same day.

Here is the table with the links to your uploaded solutions for Exercise Sheet 14. The deadline is Thursday 18Feb10 16:00.

Questions and comments about Exercise Sheet 14 below this line (most recent on top)

Hi Johannes: why don't you start with the first few questions, and then let's see whether it makes sense to continue this via the Wiki, or via private email, or via a meeting in person. Hannah 6Mar10 17:36

Yes, the final exam is like the mid-term exam in this respect. Hannah 6Mar10 17:36

Alex: http://vulcano.informatik.uni-freiburg.de/wiki/teaching/SearchEnginesWS0910/MidTermExam, so it seems to be allowed. Mirko, 6Mar10 16:10

Hi, I was wondering, will the exam next week also be an open book exam like the mid-term? Perhaps I overlooked it, but I don't think this is stated anywhere yet. Alex 6Mar10 13:49

I have lots of questions and don't know where to put them. I suppose this wiki-page will get chaotic pretty fast if I post 20 questions. Johannes VI Mar MMX 12:00

I'm sorry for the delay with the master solutions. I am at a conference right now but will try to make progress with this over the weekend. Hannah 4Mar10 23:59

Do we get master solutions for ex. 11, 12, 13 and 14? Johannes 04Mar2010 23:32 ZULU

Now they're there again. Marjan 01Mar18:09

ARGH! I'm very sorry. My Down-Them-All Plugin for Firefox seems to have deleted all the lecture PDFs! Sorry for that. Rollback to previous versions does not seem to work. I hope, someone has already downloaded them all and is able to restore them! SORRY! Interesting, I've got the rights to delete something from the main page, though. Marius Mar 1st 2010 2:38 p.m.

(Reminder:) Hello, the master solutions are not online, yet. alex n 1Mar10 11:08

Yes, we are working on it. Please remind us again if they aren't online by the end of this week. Hannah 23Feb10 14:30

Do we get master solutions for ex. 11, 12, 13 and 14? Johannes 23Feb10 14:05

Hi Matthias, yes, Pr(A) = 1 - Pr(not A), for any event A, and so for any random variable X, Pr(X <= x) = 1 - Pr(X > x), because X <= x and X > x are complementary events. For continuous random variables (like variables with a normal distribution), the difference between <= and < and >= and > is immaterial, because Pr(X = x) for each fixed x. But anyway, to compute the probability, you first have to transform it a bit, like I did in the lecture, and then obtain Pr(N(0,1) >= sqrt(n1) * (µ1 - µ) / σ) and Pr(N(0,1) <= sqrt(n2) * (µ - µ2) / σ). To evaluate the latter you can also simply use the symmetry of the normal distribution, due to which one has Pr(N(0,1) <= -x) = Pr(N(0,1) >= x). Hannah 18Feb10 12:58

Hi, how can we compute Pr(N(n2 * µ2, n2 * σ²) <= n2 * µ2 ? Can we use 1- (Pr(N(n2 * µ2, n2 * σ²) >= n2 * µ2) for that ? Matthias 18Feb10 12:01

Hi Florian + all, one of µ1 and µ2 is larger than µ and one is smaller. Let's assume µ1 is larger and µ2 is smaller. Then for µ1 you have to look at Pr(N(n1 * µ, n1 * σ²) >= n1 * µ1). But for µ2 you have to look at Pr(N(n2 * µ2, n2 * σ²) <= n2 * µ2). Note the <= instead of the >= for the second probability. Recall the meaning of these probabilities. Just as an example, let µ be 100 and µ1 be 150 and µ2 be 50. Then the first probability means: what is the probability that I see a mean of 150 or more in my first sample, although the mean of my distribution is 100. The second probability means: what is the probability that I see a mean of 50 or less in my second sample, although the mean of my distribution is 100. If you take both <= or both >= for both probabilities, it is to be expected that you get two completely different probabilities, one very low and one very high (except when they are both close to 50%). Please ask again if this is still unclear. Hannah 17Feb10 21:51

Sorry, with probability for µ1 I meant Pr(N(n1 * µ, n1 * σ²) >= n1 * µ1) and accordingly with probability for µ2 I meant Pr(N(n2 * µ, n2 * σ²) >= n2 * µ2) where n1=n2 for the exercise sheet. Florian 17Feb10 21:18

Hi Florian, what exactly do you mean by probability for µ1 and probability for µ2? Hannah 17Feb10 21:02

Hi, what values are we expected to get for exercise 4? I always get a probability of about 99.9% for μ1 and a value of about 0.07% for μ2, can that be? Florian 17Feb10 18:25

Hi Florian, yes, the averages in Exercise 3 should be average running times. I uploaded a new version of the sheet, where I corrected this. Hannah 14Feb10 17:48

Hi, I guess we should measure the running times to determine the efficiency of the programs for exercise 3? Florian 15Feb10 17:42

Hi Claudius, you should compute Pr(D|H0), exactly as done in the lecture for Example 2, where we computed this probability as Pr(X > x), where X is a random variable with distribution N(0,1), that is, normal with mean 0 and variance 1, and x depends on the mean and variance of your data. Hannah 14Feb10 16:44

Hi. If I have understood correctly, we have to compute Pr(H|D) in Exercise 4. From statistical hypothesis testing, we get Pr(D|H). Now, Pr(H|D) = Pr(D|H) * (Pr(H) / Pr(D)). We know Pr(D|H) and we can compute Pr(D), but what value do we have to use for Pr(H)? Claudius 14Feb10 14:41

Hi Eric, I don't care whether you use integers or doubles, but I am curious why the one should be any harder than the other? Hannah 12Feb10 19:02

May we use integers for sorting? Or do we have to use doubles? This is important for generating my sorted array Eric 12Feb10 18:56

If you're asking about the merging you can of course use a priority queue if you want, but you don't really need it when merging 2 lists. Marjan 18:28

Why would you use a priority queue? It's simple sorting, the exercise is not about implementing your own sorting algorithm or something like that. About exercise 3, it should be clear from the exercise itself that the sequences should be sorted (otherwise how can the merging work?) Marjan 18:23

Means that we have nothing to do than use a priority queue or something like that and don't have to implement the sorting? And at Exercise 3 the random set should be an ordered one or not? Alex 12Feb10 18:19

We prefer randomized sorting using bitonic networks, alternatively combined with LSD radix sort or simple pancake sort. That's of course a joke, it should be clear that you can use the built-in sorting functions (your own implementation will be certainly slower). Marjan 12Feb10 18:12

What does "do a standard sort" in exercise 2 mean? Shall I implement one on my own, or may I use the Java built-in sorting mechanisms? Also, which sorting algorithm do you prefer for this? Eric 12Feb10 18:04

-  ⇤ ← Revision 288 as of 2009-11-16 20:39:56 → 
  Size: 11033
  Editor: aragon
  Comment:
+   ← Revision 750 as of 2010-03-07 00:40:54 → ⇥
  Size: 12749
  Editor: Hannah Bast
  Comment:
-Deletions are marked like this.
+Additions are marked like this.
 Line 3:
-Here are PDFs of the slides of the lectures so far: [[attachment:SearchEnginesWS0910/lecture-1.pdf|Lecture 1]], [[attachment:SearchEnginesWS0910/lecture-2.pdf|Lecture 2]], [[attachment:SearchEnginesWS0910/lecture-3.pdf|Lecture 3]], [[attachment:SearchEnginesWS0910/lecture-4.pdf|Lecture 4]].
+Here are PDFs of the slides of the lectures: [[attachment:SearchEnginesWS0910/lecture-1.pdf|Lecture 1]], [[attachment:SearchEnginesWS0910/lecture-2.pdf|Lecture 2]], [[attachment:SearchEnginesWS0910/lecture-3.pdf|Lecture 3]], [[attachment:SearchEnginesWS0910/lecture-4.pdf|Lecture 4]], [[attachment:SearchEnginesWS0910/lecture-5.pdf|Lecture 5]], [[attachment:SearchEnginesWS0910/lecture-6.pdf|Lecture 6]], [[attachment:SearchEnginesWS0910/lecture-7.pdf|Lecture 7]], [[attachment:SearchEnginesWS0910/lecture-8.pdf|Lecture 8]], [[attachment:SearchEnginesWS0910/lecture-9.pdf|Lecture 9]], [[attachment:SearchEnginesWS0910/lecture-10.pdf|Lecture 10]], [[attachment:SearchEnginesWS0910/lecture-11.pdf|Lecture 11]], [[attachment:SearchEnginesWS0910/lecture-12.pdf|Lecture 12]], [[attachment:SearchEnginesWS0910/lecture-13.pdf|Lecture 13]], [[attachment:SearchEnginesWS0910/lecture-14.pdf|Lecture 14]], [[attachment:SearchEnginesWS0910/lecture-projects.pdf|Projects]].
 Line 5:
-Here are .lpd files of the recordings of the lectures so far (except Lecture 2, where we had problems with the microphone): [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-1.lpd|Lecture 1]] [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-3.lpd|Lecture 3]] [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-4.lpd|Lecture 4]].
+Here are the recordings of the lectures (except Lecture 2, where we had problems with the microphone), LPD = Lecturnity recording: [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-1.lpd|Recording Lecture 1 (LPD)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-3.lpd|Recording Lecture 3 (LPD)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-4.lpd|Recording Lecture 4 (LPD)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-5.lpd|Recording Lecture 5 (LPD without audio)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-6.lpd|Recording Lecture 6 (LPD)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-7.avi|Recording Lecture 7 (AVI)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-8.avi|Recording Lecture 8 (AVI)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-9.avi|Recording Lecture 9 (AVI)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-10.avi|Recording Lecture 10 (AVI)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-11.avi|Recording Lecture 11 (AVI)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-12.avi|Recording Lecture 12 (AVI)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-13.avi|Recording Lecture 13 (AVI)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-14.avi|Recording Lecture 14 (AVI)]]. To play the Lecturnity recordings (.lpd files) you need the [[http://www.lecturnity.de/de/download/lecturnity-player|Lecturnity Player, which you can download here]]. I put the Camtasia recordings as .avi files, which you can play with any ordinary video player; I would recommend [[http://www.videolan.org/vlc|VLC]].
 Line 7:
-Here are PDFs of the exercise sheets so far: [[attachment:SearchEnginesWS0910/exercise-1.pdf|Exercise Sheet 1]], [[attachment:SearchEnginesWS0910/exercise-2.pdf|Exercise Sheet 2]], [[attachment:SearchEnginesWS0910/exercise-3.pdf|Exercise Sheet 3]], [[attachment:SearchEnginesWS0910/exercise-4.pdf|Exercise Sheet 4]].
+Here are PDFs of the exercise sheets so far: [[attachment:SearchEnginesWS0910/exercise-1.pdf|Exercise Sheet 1]], [[attachment:SearchEnginesWS0910/exercise-2.pdf|Exercise Sheet 2]], [[attachment:SearchEnginesWS0910/exercise-3.pdf|Exercise Sheet 3]], [[attachment:SearchEnginesWS0910/exercise-4.pdf|Exercise Sheet 4]], [[attachment:SearchEnginesWS0910/exercise-5.pdf|Exercise Sheet 5]], [[attachment:SearchEnginesWS0910/exercise-6.pdf|Exercise Sheet 6]], [[attachment:SearchEnginesWS0910/exercise-7.pdf|Exercise Sheet 7]], [[attachment:SearchEnginesWS0910/exercise-8.pdf|Exercise Sheet 8]], [[attachment:SearchEnginesWS0910/exercise-9.pdf|Exercise Sheet 9]], [[attachment:SearchEnginesWS0910/exercise-10.pdf|Exercise Sheet 10]], [[attachment:SearchEnginesWS0910/exercise-11.pdf|Exercise Sheet 11]], [[attachment:SearchEnginesWS0910/exercise-12.pdf|Exercise Sheet 12]], [[attachment:SearchEnginesWS0910/exercise-13.pdf|Exercise Sheet 13]], [[attachment:SearchEnginesWS0910/exercise-14.pdf|Exercise Sheet 14]].
 Line 9:
-Here are your solutions and comments on the previous exercise sheets: [[SearchEnginesWS0910/ExerciseSheet1|Solutions and Comments 1]], [[SearchEnginesWS0910/ExerciseSheet2|Solutions and Comments 2]], [[SearchEnginesWS0910/ExerciseSheet3|Solutions and Comments 3]]
+Here are your solutions and comments on the previous exercise sheets: [[SearchEnginesWS0910/ExerciseSheet1|Solutions and Comments 1]], [[SearchEnginesWS0910/ExerciseSheet2|Solutions and Comments 2]], [[SearchEnginesWS0910/ExerciseSheet3|Solutions and Comments 3]], [[SearchEnginesWS0910/ExerciseSheet4|Solutions and Comments 4]], [[SearchEnginesWS0910/ExerciseSheet5|Solutions and Comments 5]], [[SearchEnginesWS0910/ExerciseSheet6|Solutions and Comments 6]], [[SearchEnginesWS0910/ExerciseSheet7|Solutions and Comments 7]], [[SearchEnginesWS0910/ExerciseSheet8|Solutions and Comments 8]], [[SearchEnginesWS0910/ExerciseSheet9|Solutions and Comments 9]], [[SearchEnginesWS0910/ExerciseSheet10|Solutions and Comments 10]], [[SearchEnginesWS0910/ExerciseSheet11|Solutions and Comments 11]], [[SearchEnginesWS0910/ExerciseSheet12|Solutions and Comments 12]], [[SearchEnginesWS0910/ExerciseSheet13|Solutions and Comments 13]].
 Line 11:
-= Exercise Sheet 3 =

The recordings of all lectures are now available, see above. Lecture 2 is missing because we had technical problems there. To play the recordings (it's .lpd files) you need the Lecturnity Player. [[http://www.lecturnity.de/de/download/lecturnity-player|You can download the player for free here]].
+Here are our master solutions: [[attachment:SearchEnginesWS0910/solution-midterm.pdf|Master solution for Mid-Term Exam]],[[attachment:SearchEnginesWS0910/solution-9.pdf|Master solution for Exercise Sheet 9]], [[attachment:SearchEnginesWS0910/solution-10.pdf|Master solution for Exercise Sheet 10]], [[attachment:SearchEnginesWS0910/solution-11.pdf|Master solution for Exercise Sheet 11]], [[attachment:SearchEnginesWS0910/solution-12.pdf|Master solution for Exercise Sheet 12]].
-Line 17:
+Line 15:
-[[SearchEnginesWS0910/ExerciseSheet4|Here you can upload your solutions for Exercise Sheet 4]].
+[[SearchEnginesWS0910/MidTermExam|Here is everything about the mid-term exam]]. The final exam is on Friday March 12, 2010. The written exam begins at 2.00 pm in HS 026. The oral exams are scheduled on the same day.
-Line 19:
+Line 17:
-== Questions or comments below this line, most recent on top please ==
+[[SearchEnginesWS0910/ExerciseSheet14|Here is the table with the links to your uploaded solutions for Exercise Sheet 14]]. The deadline is Thursday 18Feb10 16:00.
-Line 21:
+Line 19:
-To Florian: This was very well explained in the lecture (it is still there on the slides). It means the speed-up achieved by reading and decompressing your compressed list compared to when the list is read in uncompressed format. As your inverted list is randomly generated, you might have different speed-ups for different inverted lists. In order to have any speed-up, of course, your compression scheme should really work and reduce the size of the inverted list + should not be too inefficient. One extreme example for inefficient code would be using strings instead of bit-shifting for your coding. '''Marjan 8:38pm'''
+== Questions and comments about Exercise Sheet 14 below this line (most recent on top) ==
-Line 23:
+Line 21:
-What is meant with the best speedup for Exercise 4 which we should add on the solution page? The best speedup for just reading the data from disk or the best speedup for reading and decompressing the list? '''Florian 16Nov09 8:12pm'''
+Hi Johannes: why don't you start with the first few questions, and then let's see whether it makes sense to continue this via the Wiki, or via private email, or via a meeting in person. '''Hannah 6Mar10 17:36'''
-Line 25:
+Line 23:
-To Dragos: Gap encoding + Elias code is not trivial at all and you can use it. Gap encoding + Byte code is also fine. '''Marjan 16Nov09 2:15pm'''
+Yes, the final exam is like the mid-term exam in this respect. '''Hannah 6Mar10 17:36'''
-Line 27:
+Line 25:
-Is it ok to use Elias code for the compressing from Exercise 4(2)? Or it's too 'trivial'?:)'''Dragos 16Nov09 14.14'''
+Alex: http://vulcano.informatik.uni-freiburg.de/wiki/teaching/SearchEnginesWS0910/MidTermExam, so it seems to be allowed. '''Mirko, 6Mar10 16:10'''
-Line 29:
+Line 27:
-To Florian + all: A single number doesn't really make sense, does it? For the discussion part of the exercise, think in terms of probability distributions, as we did in the lecture (when discussing which probability distribution a certain encoding is optimal for). For the example, give a sequence of numbers. '''Hannah 15Nov09 9:41pm'''
+Hi, I was wondering, will the exam next week also be an open book exam like the mid-term? Perhaps I overlooked it, but I don't think this is stated anywhere yet. '''Alex 6Mar10 13:49'''
-Line 31:
+Line 29:
-To Johannes + all: Yes, good idea. I will anyway at some point in the next weeks hand out a sheet where you have the opportunity to give feedback on the lectures and the exercises. But yes, why not give me that feedback on the current exercise sheet already now. Let me refine your proposal a bit. It would be useful for me if you would provide ''two'' grades: one for the hardness (pick one of: too hard for me, challenging but feasible, not very hard) and one for the amount of work (pick one of: too much for me, a lot but feasible, not more than for other lectures). It would also be helpful if you would not just give a grade but put your opinion into words. It's no problem if you are critical but please stay polite. I will take your comments seriously, don't worry. '''Hannah 15Nov09 9:33am'''
+I have lots of questions and don't know where to put them. I suppose this wiki-page will get chaotic pretty fast if I post 20 questions. '''Johannes VI Mar MMX 12:00'''
-Line 33:
+Line 31:
-In Exercise we should "give an example of data for which k = x is the best choice". What is meant by "an example of data" here? A single number or a set of numbers or anything else? '''Florian 15Nov09 8:52pm'''
+I'm sorry for the delay with the master solutions. I am at a conference right now but will try to make progress with this over the weekend. '''Hannah 4Mar10 23:59'''
-Line 35:
+Line 33:
-I'd like to suggest that everyone grades the exercise sheet from 1 (for "way to easy") to 10 ("way to hard"). This might provide the professor with the feedback she asks for in the lecture. How about that idea? '''Johannes 2009-11-15T20:40L'''
+Do we get master solutions for ex. 11, 12, 13 and 14? '''Johannes 04Mar2010 23:32 ZULU'''
-Line 37:
+Line 35:
-To Florian + all: yes, sorry, I forgot to mention this in the lecture. Marjan already explained how to clear the disk cache. Let me add to this an explanation what the disk cache actually is. Whenever you read a (part of a) file from disk, the operating system of your computed will use whatever memory is currently unused to store that (part of the) file there. When you read it again and the (part of the) file hasn't changed and the memory used to store it has not been used otherwise in the meantime, than that data is read right from memory, which is much faster than reading it from disk. Usually that effect is desirable, because it speeds up things, but when you do experiments, it is undesirable, because it leads to unrealistically good running times, especially when carrying out an experiment many times in a row. '''Hannah 15Nov09 8:10pm'''
+Now they're there again. '''Marjan 01Mar18:09'''
-Line 39:
+Line 37:
-To Florian: Indeed, we were running out of time and there was no room for this in the lecture. I can suggest to you few ways how to clear the disk cache: before carrying out your final experiment, read a large amount of data (let's say close to the amount of RAM you have) from disk - this will ensure that your data (the inverted list) is cleared from the disk cache and replaced by something else (thus an actual reading from disk get's timed, and not reading from RAM). Another way is to restart your computer before doing the timing. '''Marjan 15Nov09 7:27pm'''
+ARGH! I'm very sorry. My Down-Them-All Plugin for Firefox seems to have deleted all the lecture PDFs! Sorry for that. Rollback to previous versions does not seem to work. I hope, someone has already downloaded them all and is able to restore them! SORRY! Interesting, I've got the rights to delete something from the main page, though. '''Marius Mar 1st 2010 2:38 p.m.'''
-Line 41:
+Line 39:
-In exercise 4 it says: "Important note: Whenver you measure running times for reading data from disk, you have to clear the disk cache before, as discussed in the lecture". I think that this was not discussed in the lecture? What do we have to do here? '''Florian 15Nov09 7:15pm'''
+(Reminder:) Hello, the master solutions are not online, yet. '''alex n 1Mar10 11:08'''
-Line 43:
+Line 41:
-@Bit shifting: The syntax for that is actually the same, irrespectively of whether you use Java, C++, perl, python, or whatever. The >> operator shifts to the right, the << operator shifts to the left, the & operator ands the bits of the two operands and the | operator ors the bits of the two operands. Very simple. You will also find zillions of example programs on the web by typing something like ''java bit shifting'' into Google or whatever your favorite search engine is. '''Hannah 15Nov09 1:16'''
+Yes, we are working on it. Please remind us again if they aren't online by the end of this week. '''Hannah 23Feb10 14:30'''
-Line 45:
+Line 43:
-Hi Marius + all: For Exercise 4, an inverted list of size m with doc ids from the range [1..n] is simply a sorted list of m numbers from the range [1..n]. I leave it to you, whether your lists potentially contain duplicates (as in  3, 5, 5, 8, 12, ...) or whether you generate them in a way that they don't contain duplicates (as in 3, 5, 8, 17, ...). It doesn't really matter for the exercise whether your list has duplicated or not. In any case, consider simple flat lists like in the two examples I gave (and like all the examples I gave in this and past lectures), not lists of lists or anything. '''Hannah 15Nov09 1:12am'''
+Do we get master solutions for ex. 11, 12, 13 and 14? '''Johannes 23Feb10 14:05'''
-Line 47:
+Line 45:
-@Mirko: Sure, but an inverted list is a list of words where the Doc-IDs are attached to each words in which the words occur. So for Example: If word no. 5 occurs in Doc1, Doc2 and Doc3 and word no. 2 occurs in Doc5, the list would look like: 5 -> Doc1, Doc2, Doc3; 2 -> Doc5. Or am I mistaken? My question then is, how long should these attached lists be in average case? I mean, one could imagine that we got 1mil. documents over 3 words, so these lists could get very large...
+Hi Matthias, yes, Pr(A) = 1 - Pr(not A), for any event A, and so for any random variable X, Pr(X <= x) = 1 - Pr(X > x), because X <= x and X > x are complementary events. For continuous random variables (like variables with a normal distribution), the difference between <= and < and >= and > is immaterial, because Pr(X = x) for each fixed x. But anyway, to compute the probability, you first have to transform it a bit, like I did in the lecture, and then obtain Pr(N(0,1) >= sqrt(n1) * (µ1 - µ) / σ) and Pr(N(0,1) <= sqrt(n2) * (µ - µ2) / σ). To evaluate the latter you can also simply use the symmetry of the normal distribution, due to which one has Pr(N(0,1) <= -x) = Pr(N(0,1) >= x). '''Hannah 18Feb10 12:58'''
-Line 49:
+Line 47:
-EDIT: Oh ok. Now, I see your point. It's not an index, it's a list. Okay. So, what is an inverted list with Doc-IDs, then?
+Hi, how can we compute Pr(N(n2 * µ2, n2 * σ^2^) <= n2 * µ2 ? Can we use 1- (Pr(N(n2 * µ2, n2 * σ^2^) >= n2 * µ2) for that ? '''Matthias 18Feb10 12:01'''
-Line 51:
+Line 49:
-EDIT EDIT: And to your question, Mirko, take a look at http://snippets.dzone.com/posts/show/93. Especially at Comment no. 2. Maybe this helps... I think, Java supports StreamWriters/Readers that are able to write/read bytes. '''Marius 11/14/2009 08:46pm'''
+Hi Florian + all, one of µ1 and µ2 is larger than µ and one is smaller. Let's assume µ1 is larger and µ2 is smaller. Then for µ1 you have to look at Pr(N(n1 * µ, n1 * σ^2^) >= n1 * µ1). But for µ2 you have to look at Pr(N(n2 * µ2, n2 * σ^2^) <= n2 * µ2). Note the <= instead of the >= for the second probability. Recall the meaning of these probabilities. Just as an example, let µ be 100 and µ1 be 150 and µ2 be 50. Then the first probability means: what is the probability that I see a mean of ''150 or more'' in my first sample, although the mean of my distribution is 100. The second probability means: what is the probability that I see a mean of ''50 or less'' in my second sample, although the mean of my distribution is 100. If you take both <= or both >= for both probabilities, it is to be expected that you get two completely different probabilities, one very low and one very high (except when they are both close to 50%). Please ask again if this is still unclear. '''Hannah 17Feb10 21:51'''
-Line 53:
+Line 51:
-EDIT EDIT EDIT: Sorry, me again. Well, I bothered Wikipedia which redirects from http://en.wikipedia.org/wiki/Inverted_list to Inverted Index. So it seems to me, this is being used as a synonym. Actually, I think I'm confused enough, now. I'll better wait for any responses... ;-) '''Marius 11/14/2009 9:08 pm'''
+Sorry, with probability for µ1 I meant Pr(N(n1 * µ, n1 * σ^2^) >= n1 * µ1) and accordingly with probability for µ2 I meant Pr(N(n2 * µ, n2 * σ^2^) >= n2 * µ2) where n1=n2 for the exercise sheet.  '''Florian 17Feb10 21:18'''
-Line 55:
+Line 53:
-@ Marius: i think we are supposed to generate one inverted __list__ of size m, with doc ids from 1..n (therefore n>=m, because no duplicates?).
+Hi Florian, what exactly do you mean by ''probability for µ1'' and ''probability for µ2''? '''Hannah 17Feb10 21:02'''
-Line 57:
+Line 55:
-Now a question from my side: ex.4, programming the compression in __java__, is there any __good__ tutorial about how to handle the bit-stuff? (otherwise, i think, it would cost me too much time..) '''Mirko 14Nov09, 19:18'''
+Hi, what values are we expected to get for exercise 4? I always get a probability of about 99.9% for μ1 and a value of about 0.07% for μ2, can that be? '''Florian 17Feb10 18:25'''
-Line 59:
+Line 57:
-Hi, do you have any suggestions what the best numbers for m and n in exercise 4 should look like? Or are we supposed to mess around a bit with ints and longs? And: How long should the list of documents in the inverted index be? '''Marius 14Nov09 6:40pm'''
+Hi Florian, yes, the ''averages'' in Exercise 3 should be ''average running times''. I uploaded a new version of the sheet, where I corrected this. '''Hannah 14Feb10 17:48'''
-Line 61:
+Line 59:
-And just to clarify what a single-cycle permutation is. Here is an example for an array of size 5 with a permutation that is a single cycle: 5 4 1 3 2. Why single cycle? Well, A[1] = 5, A[5] = 2, A[2] = 4, A[4] = 3, A[3] = 1. (My indices in this example are 1,...,5 and not 0,...,4.) Here is an example of a permutation with three cycles: 2 1 4 3 5. The first cycle is A[1] = 2, A[2] =1. The second cycle is A[3] = 4, A[4] = 3. The third cycle is A[5] = 5. '''Hannah 12Nov09 8:04pm'''
+Hi, I guess we should measure the running times to determine the efficiency of the programs for exercise 3? '''Florian 15Feb10 17:42'''
-Line 63:
+Line 61:
-Hi Daniel + all, I don't quite understand your question and your example (if your array is 1 5 3 4 2, why is A[1] = 3?). In case you refer to the requirement of the exercise that the permutation consists only of a single cycle. That is because your code should go over each element exactly once (it should, of course, stop after n iterations, where n is the size of the array). If your permutation has more than one cycle, it is hard to achieve that. Also note that for both (1) and (2), the sum of the array values should be sum_i=1,...,n i = n * (n+1) / 2. '''Hannah 12Nov09 7:54pm'''
+Hi Claudius, you should compute Pr(D|H0), exactly as done in the lecture for Example 2, where we computed this probability as Pr(X > x), where X is a random variable with distribution N(0,1), that is, normal with mean 0 and variance 1, and x depends on the mean and variance of your data. '''Hannah 14Feb10 16:44'''
-Line 65:
+Line 63:
-Hi, I just looked at the new exercise sheet 4, in exercise 1 we should generate a permutation and sum the resulting array up, am I wrong or doesn't iterating method two iterate throw the whole array in every situation. for ex.: n= 5 permutation: 1 5 3 4 2, then A[1] = 3, A[A[1]]= A[3] = 1, A[1] = 3 ... '''Daniel 12Nov09 19:44pm'''
+Hi. If I have understood correctly, we have to compute Pr(H|D) in Exercise 4. From statistical hypothesis testing, we get Pr(D|H). Now, Pr(H|D) = Pr(D|H) * (Pr(H) / Pr(D)). We know Pr(D|H) and we can compute Pr(D), but what value do we have to use for Pr(H)? '''Claudius 14Feb10 14:41'''

Hi Eric, I don't care whether you use integers or doubles, but I am curious why the one should be any harder than the other? '''Hannah 12Feb10 19:02'''

May we use integers for sorting? Or do we have to use doubles? This is important for generating my sorted array '''Eric 12Feb10 18:56'''

If you're asking about the merging you can of course use a priority queue if you want, but you don't really need it when merging 2 lists. '''Marjan 18:28'''

Why would you use a priority queue? It's simple sorting, the exercise is not about implementing your own sorting algorithm or something like that. About exercise 3, it should be clear from the exercise itself that the sequences should be sorted (otherwise how can the merging work?) '''Marjan 18:23'''

Means that we have nothing to do than use a priority queue or something like that and don't have to implement the sorting? And at Exercise 3 the random set should be an ordered one or not? '''Alex 12Feb10 18:19'''

We prefer randomized sorting using bitonic networks, alternatively combined with LSD radix sort or simple pancake sort. That's of course a joke, it should be clear that you can use the built-in sorting functions (your own implementation will be certainly slower). '''Marjan 12Feb10 18:12'''

What does "do a standard sort" in exercise 2 mean? Shall I implement one on my own, or may I use the Java built-in sorting mechanisms? Also, which sorting algorithm do you prefer for this? '''Eric 12Feb10 18:04'''