Diff for "SearchEnginesWS0910"

Differences between revisions 708 and 755 (spanning 47 versions)

Welcome to the Wiki page of the course Search Engines, WS 2009 / 2010. Lecturer: Hannah Bast. Tutorials: Marjan Celikik. Course web page: click here.

Here are PDFs of the slides of the lectures: Lecture 1, Lecture 2, Lecture 3, Lecture 4, Lecture 5, Lecture 6, Lecture 7, Lecture 8, Lecture 9, Lecture 10, Lecture 11, Lecture 12, Lecture 13, Lecture 14, Projects.

Here are PDFs of the exercise sheets so far: Exercise Sheet 1, Exercise Sheet 2, Exercise Sheet 3, Exercise Sheet 4, Exercise Sheet 5, Exercise Sheet 6, Exercise Sheet 7, Exercise Sheet 8, Exercise Sheet 9, Exercise Sheet 10, Exercise Sheet 11, Exercise Sheet 12, Exercise Sheet 13, Exercise Sheet 14.

Here are your solutions and comments on the previous exercise sheets: Solutions and Comments 1, Solutions and Comments 2, Solutions and Comments 3, Solutions and Comments 4, Solutions and Comments 5, Solutions and Comments 6, Solutions and Comments 7, Solutions and Comments 8, Solutions and Comments 9, Solutions and Comments 10, Solutions and Comments 11, Solutions and Comments 12, Solutions and Comments 13.

Here are our master solutions: Master solution for Mid-Term Exam,Master solution for Exercise Sheet 9, Master solution for Exercise Sheet 10, Master solution for Exercise Sheet 11, Master solution for Exercise Sheet 12.

Here are the rules for the exercises as explained in Lecture 2.

Here is everything about the mid-term exam. The final exam is on Friday March 12, 2010. The written exam begins at 2.00 pm in HS 026. The oral exams are scheduled on the same day.

Here is the table with the links to your uploaded solutions for Exercise Sheet 14. The deadline is Thursday 18Feb10 16:00.

Questions and comments about the master solution of the mid-term exam

Johannes 2010-03-07T12:40 :

1.3: CLAIM: If an encoding is prefix-free, then there is no code that is a prefix of a different code. Does this claim hold? If so, then 001 mustn't be a code, since 0 is a code and a prefix of 001. Is this right?

There was an obvious mistake which I now corrected (00 should be mapped to 1, not 0). Hannah 7Mar10 12:56

1.4: It states: "For a sequence of length n, we need to generate n/2 such codes [...]." Does not each symbol of the n from the sequence get encoded?

Each code stands for two bits at a time, so for a sequence of n bits, you have to generate n/2 codes. I replaced sequence of length n by sequence of n bits to make this clearer. Hannah 7Mar10 12:58

3.4: The function returns the number of common k-grams (as far as I see). Can the return-line be completed with a call to the function from 3.2 to return the Jaccard-distance?

Yes, indeed, I replaced return l by return jaccardDistance(x, y, k, l). Hannah 7Mar10 13:01

5.4: Does the top-k-algorithm return the top k documents? If so, which k had to been used in this task? What exactly is the condition for stopping? What exactly is the update rule for the ranges? My idea is that (for a fixed document) the minimum is always the known minimum from any of the lists and the maximum is always the (already known) minimum plus the lowest score, seen in any list different than the one the minimum is from. In case of only two lists there may be some simplifications.

The task asked for the top-ranked document, so k = 1. We can stop when the upper bound for all documents not yet seen is strictly below the k-th largest lower bound so far, and when the score ranges for the documents already seen are such that it is clear which are the top-k documents and in which order. If there are ties, and we don't care how they are broken, and we don't care to know the order of the top-k documents, we can sometimes stop earlier. Does this answer all your questions? Hannah 7Mar10 13:06

Thanks a lot for your comments! Please go on if you have more. Hannah 7Mar10 13:07

Questions and comments about Exercise Sheet 14 below this line (most recent on top)

Hi Johannes: why don't you start with the first few questions, and then let's see whether it makes sense to continue this via the Wiki, or via private email, or via a meeting in person. Hannah 6Mar10 17:36

Yes, the final exam is like the mid-term exam in this respect. Hannah 6Mar10 17:36

Alex: http://vulcano.informatik.uni-freiburg.de/wiki/teaching/SearchEnginesWS0910/MidTermExam, so it seems to be allowed. Mirko, 6Mar10 16:10

Hi, I was wondering, will the exam next week also be an open book exam like the mid-term? Perhaps I overlooked it, but I don't think this is stated anywhere yet. Alex 6Mar10 13:49

I have lots of questions and don't know where to put them. I suppose this wiki-page will get chaotic pretty fast if I post 20 questions. Johannes VI Mar MMX 12:00

I'm sorry for the delay with the master solutions. I am at a conference right now but will try to make progress with this over the weekend. Hannah 4Mar10 23:59

Do we get master solutions for ex. 11, 12, 13 and 14? Johannes 04Mar2010 23:32 ZULU

Now they're there again. Marjan 01Mar18:09

ARGH! I'm very sorry. My Down-Them-All Plugin for Firefox seems to have deleted all the lecture PDFs! Sorry for that. Rollback to previous versions does not seem to work. I hope, someone has already downloaded them all and is able to restore them! SORRY! Interesting, I've got the rights to delete something from the main page, though. Marius Mar 1st 2010 2:38 p.m.

(Reminder:) Hello, the master solutions are not online, yet. alex n 1Mar10 11:08

Yes, we are working on it. Please remind us again if they aren't online by the end of this week. Hannah 23Feb10 14:30

Do we get master solutions for ex. 11, 12, 13 and 14? Johannes 23Feb10 14:05

Hi Matthias, yes, Pr(A) = 1 - Pr(not A), for any event A, and so for any random variable X, Pr(X <= x) = 1 - Pr(X > x), because X <= x and X > x are complementary events. For continuous random variables (like variables with a normal distribution), the difference between <= and < and >= and > is immaterial, because Pr(X = x) for each fixed x. But anyway, to compute the probability, you first have to transform it a bit, like I did in the lecture, and then obtain Pr(N(0,1) >= sqrt(n1) * (µ1 - µ) / σ) and Pr(N(0,1) <= sqrt(n2) * (µ - µ2) / σ). To evaluate the latter you can also simply use the symmetry of the normal distribution, due to which one has Pr(N(0,1) <= -x) = Pr(N(0,1) >= x). Hannah 18Feb10 12:58

Hi, how can we compute Pr(N(n2 * µ2, n2 * σ²) <= n2 * µ2 ? Can we use 1- (Pr(N(n2 * µ2, n2 * σ²) >= n2 * µ2) for that ? Matthias 18Feb10 12:01

Hi Florian + all, one of µ1 and µ2 is larger than µ and one is smaller. Let's assume µ1 is larger and µ2 is smaller. Then for µ1 you have to look at Pr(N(n1 * µ, n1 * σ²) >= n1 * µ1). But for µ2 you have to look at Pr(N(n2 * µ2, n2 * σ²) <= n2 * µ2). Note the <= instead of the >= for the second probability. Recall the meaning of these probabilities. Just as an example, let µ be 100 and µ1 be 150 and µ2 be 50. Then the first probability means: what is the probability that I see a mean of 150 or more in my first sample, although the mean of my distribution is 100. The second probability means: what is the probability that I see a mean of 50 or less in my second sample, although the mean of my distribution is 100. If you take both <= or both >= for both probabilities, it is to be expected that you get two completely different probabilities, one very low and one very high (except when they are both close to 50%). Please ask again if this is still unclear. Hannah 17Feb10 21:51

Sorry, with probability for µ1 I meant Pr(N(n1 * µ, n1 * σ²) >= n1 * µ1) and accordingly with probability for µ2 I meant Pr(N(n2 * µ, n2 * σ²) >= n2 * µ2) where n1=n2 for the exercise sheet. Florian 17Feb10 21:18

Hi Florian, what exactly do you mean by probability for µ1 and probability for µ2? Hannah 17Feb10 21:02

Hi, what values are we expected to get for exercise 4? I always get a probability of about 99.9% for μ1 and a value of about 0.07% for μ2, can that be? Florian 17Feb10 18:25

Hi Florian, yes, the averages in Exercise 3 should be average running times. I uploaded a new version of the sheet, where I corrected this. Hannah 14Feb10 17:48

Hi, I guess we should measure the running times to determine the efficiency of the programs for exercise 3? Florian 15Feb10 17:42

Hi Claudius, you should compute Pr(D|H0), exactly as done in the lecture for Example 2, where we computed this probability as Pr(X > x), where X is a random variable with distribution N(0,1), that is, normal with mean 0 and variance 1, and x depends on the mean and variance of your data. Hannah 14Feb10 16:44

Hi. If I have understood correctly, we have to compute Pr(H|D) in Exercise 4. From statistical hypothesis testing, we get Pr(D|H). Now, Pr(H|D) = Pr(D|H) * (Pr(H) / Pr(D)). We know Pr(D|H) and we can compute Pr(D), but what value do we have to use for Pr(H)? Claudius 14Feb10 14:41

Hi Eric, I don't care whether you use integers or doubles, but I am curious why the one should be any harder than the other? Hannah 12Feb10 19:02

May we use integers for sorting? Or do we have to use doubles? This is important for generating my sorted array Eric 12Feb10 18:56

If you're asking about the merging you can of course use a priority queue if you want, but you don't really need it when merging 2 lists. Marjan 18:28

Why would you use a priority queue? It's simple sorting, the exercise is not about implementing your own sorting algorithm or something like that. About exercise 3, it should be clear from the exercise itself that the sequences should be sorted (otherwise how can the merging work?) Marjan 18:23

Means that we have nothing to do than use a priority queue or something like that and don't have to implement the sorting? And at Exercise 3 the random set should be an ordered one or not? Alex 12Feb10 18:19

We prefer randomized sorting using bitonic networks, alternatively combined with LSD radix sort or simple pancake sort. That's of course a joke, it should be clear that you can use the built-in sorting functions (your own implementation will be certainly slower). Marjan 12Feb10 18:12

What does "do a standard sort" in exercise 2 mean? Shall I implement one on my own, or may I use the Java built-in sorting mechanisms? Also, which sorting algorithm do you prefer for this? Eric 12Feb10 18:04

-  ⇤ ← Revision 708 as of 2010-02-11 19:42:44 → 
  Size: 12506
  Editor: Hannah Bast
  Comment:
+   ← Revision 755 as of 2010-03-07 13:20:11 → ⇥
  Size: 15002
  Editor: Hannah Bast
  Comment:
-Deletions are marked like this.
+Additions are marked like this.
 Line 9:
-Here are your solutions and comments on the previous exercise sheets: [[SearchEnginesWS0910/ExerciseSheet1|Solutions and Comments 1]], [[SearchEnginesWS0910/ExerciseSheet2|Solutions and Comments 2]], [[SearchEnginesWS0910/ExerciseSheet3|Solutions and Comments 3]], [[SearchEnginesWS0910/ExerciseSheet4|Solutions and Comments 4]], [[SearchEnginesWS0910/ExerciseSheet5|Solutions and Comments 5]], [[SearchEnginesWS0910/ExerciseSheet6|Solutions and Comments 6]], [[SearchEnginesWS0910/ExerciseSheet7|Solutions and Comments 7]], [[SearchEnginesWS0910/ExerciseSheet8|Solutions and Comments 8]], [[SearchEnginesWS0910/ExerciseSheet9|Solutions and Comments 9]], [[SearchEnginesWS0910/ExerciseSheet10|Solutions and Comments 10]], [[SearchEnginesWS0910/ExerciseSheet11|Solutions and Comments 11]], [[SearchEnginesWS0910/ExerciseSheet12|Solutions and Comments 12]].
+Here are your solutions and comments on the previous exercise sheets: [[SearchEnginesWS0910/ExerciseSheet1|Solutions and Comments 1]], [[SearchEnginesWS0910/ExerciseSheet2|Solutions and Comments 2]], [[SearchEnginesWS0910/ExerciseSheet3|Solutions and Comments 3]], [[SearchEnginesWS0910/ExerciseSheet4|Solutions and Comments 4]], [[SearchEnginesWS0910/ExerciseSheet5|Solutions and Comments 5]], [[SearchEnginesWS0910/ExerciseSheet6|Solutions and Comments 6]], [[SearchEnginesWS0910/ExerciseSheet7|Solutions and Comments 7]], [[SearchEnginesWS0910/ExerciseSheet8|Solutions and Comments 8]], [[SearchEnginesWS0910/ExerciseSheet9|Solutions and Comments 9]], [[SearchEnginesWS0910/ExerciseSheet10|Solutions and Comments 10]], [[SearchEnginesWS0910/ExerciseSheet11|Solutions and Comments 11]], [[SearchEnginesWS0910/ExerciseSheet12|Solutions and Comments 12]], [[SearchEnginesWS0910/ExerciseSheet13|Solutions and Comments 13]].
 Line 11:
-Here are our master solutions: [[attachment:SearchEnginesWS0910/solution-midterm.pdf|Master solution for Mid-Term Exam]],[[attachment:SearchEnginesWS0910/solution-9.pdf|Master solution for Exercise Sheet 9]], [[attachment:SearchEnginesWS0910/solution-10.pdf|Master solution for Exercise Sheet 10]].
+Here are our master solutions: [[attachment:SearchEnginesWS0910/solution-midterm.pdf|Master solution for Mid-Term Exam]],[[attachment:SearchEnginesWS0910/solution-9.pdf|Master solution for Exercise Sheet 9]], [[attachment:SearchEnginesWS0910/solution-10.pdf|Master solution for Exercise Sheet 10]], [[attachment:SearchEnginesWS0910/solution-11.pdf|Master solution for Exercise Sheet 11]], [[attachment:SearchEnginesWS0910/solution-12.pdf|Master solution for Exercise Sheet 12]].
 Line 17:
-[[SearchEnginesWS0910/ExerciseSheet13|Here is the table with the links to your uploaded solutions for Exercise Sheet 13]]. The deadline is Thursday 11Feb10 16:00.
+[[SearchEnginesWS0910/ExerciseSheet14|Here is the table with the links to your uploaded solutions for Exercise Sheet 14]]. The deadline is Thursday 18Feb10 16:00.
 Line 19:
-== Questions and comments about Exercise Sheet 13 below this line (most recent on top) ==
-Line 21:
+Line 20:
-Hi Dragos. Regarding exercise 4 - I am not sure what exactly you mean. The proof should be as formal as any other proof. Regarding exercise 5 - you're right and I have the same problem. One has to do the deletion in O(log n) but this is not provided by the standard priority queue implementation in C++, Java etc. Ideally one should do his own implementation of priority queue but then this would be more work, so I am not sure myself (the professor is still at Google). '''Marjan 10Feb10 12:06'''
+== Questions and comments about the master solution of the mid-term exam ==
-Line 23:
+Line 22:
-My second question regards Exercise 4. Should we adopt a formal way to prove the two statements (like induction) or is it sufficient to explain in words what happens at each iteration and why the final result will be the one mentioned in the exercise. '''Dragos 10th of February, 10.53'''
+'''Johannes 2010-03-07T12:40''' :
-Line 25:
+Line 24:
-Hello. I was looking at the pseudo-code from the Information Retrieval Book (for the first exercise) and it says that the complexity of the algorithm is O(n^2*log n) if the insert and delete operations for the priority queue are done in O(log n)... which makes sense. The problem is that they use a delete(element) operation, which as I have looked in the implementations for priority queues in Java or Perl have a complexity greater then log(n) (commonly .. O(n)). I don't really see how to do it in log(n) time. '''Dragos 9th of February, 23.33'''
+'''1.3''': CLAIM: If an encoding is prefix-free, then there is no code that is a prefix of a different code. Does this claim hold? If so, then 001 mustn't be a code, since 0 is a code and a prefix of 001. Is this right?
-Line 27:
+Line 26:
-@Marjan: The implementation of the lecture would work for n=10^4^ (I have 2GB RAM) but it takes a very long time. The memory exception is thrown when I create the matrix C of the pseudo code from the information retrieval book, but since the priority queue contains the same number of elements as the matrix I think there is not much difference. '''Florian 8Feb 21:29'''
+There was an obvious mistake which I now corrected (00 should be mapped to 1, not 0). '''Hannah 7Mar10 12:56'''
-Line 29:
+Line 28:
-Another small observation: clearly, if n = 10^5^, then you have n^2^ = 10^10^ entries in your SIM matrix. Say each entry has size of 4 bytes (float), this makes it 37.2GB in total! Obviously you can't go above n=10^4^. '''Marjan 8Feb 21:09'''
+'''1.4''': It states: "For a sequence of length n, we need to generate n/2 such codes [...]." Does not each symbol of the n from the sequence get encoded?
-Line 31:
+Line 30:
-Hi Florian. I have two questions. First, how far in terms of n can you go with the implementation from the lecture? Second, if the difference is significant, could you give a short but precise statement where does most of the memory in your algorithm get consumed? Is it the similarity matrix or the priority queue? '''Marjan 8Feb10 21:02'''
+Each code stands for two bits at a time, so for a sequence of n bits, you have to generate n/2 codes. I replaced ''sequence of length n'' by ''sequence of n bits'' to make this clearer. '''Hannah 7Mar10 12:58'''
-Line 33:
+Line 32:
-After implementing the algorithm from the information retrieval book for exercise 1 I already get an out of memory exception for n=10^4^. Just having the values for n=10^1^ ... 10^3^ is not much to see if T/n^2^ is indeed logarithmic, but is it enough for the exercise? '''Florian 8Feb10 20:40'''
+'''3.4''': The function returns the number of common k-grams (as far as I see). Can the return-line be completed with a call to the function from 3.2 to return the Jaccard-distance?
 
Yes, indeed, I replaced ''return l'' by ''return jaccardDistance(x, y, k, l)''. '''Hannah 7Mar10 13:01'''
-Line 35:
+Line 36:
-'''FYI: on Thursday, we will have 45 minutes of regular lecture (16.15 - 17.00 h), and then the intro lecture for the bachelor / master's project (17.00 - 17.45 h). Those who are interested in doing a project, can just stay, the others may go then.''' '''Hannah 8Feb10 14:08'''
+'''5.4''': Does the top-k-algorithm return the top k documents? If so, which k had to been used in this task? What exactly is the condition for stopping? What exactly is the update rule for the ranges? My idea is that (for a fixed document) the minimum is always the known minimum from any of the lists and the maximum is always the (already known) minimum plus the lowest score, seen in any list different than the one the minimum is from. In case of only two lists there may be some simplifications.
-Line 37:
+Line 38:
-Hi Florian + all: yes, it's possible for all the four heuristics we have discussed in the lecture, including complete-link. Conceptually, you have to maintain the n x n similarity matrix SIM, where SIM[i][j] gives you the similarity of the current cluster i with the current cluster j. In each iteration, we consider only those entries SIM[i][j] where i and j are cluster represantatives (that is, they are the smallest index of an element in their cluster) and where i != j (we never want to merge a cluster with itself). Initially each point is a cluster on its own and the SIM[i][j] are just the pairwise point similarities, as in the code we wrote together in the lecture. Now in each iteration, we need to find that pair (i,j) of cluster representative for which SIM[i][j] is maximal. In the lecture we did it the trivial way, by a double loop. That requires quadratic time per iteration, giving cubic time overall. For the exercise, you shoudl instead use one priority queue per cluster representative i to store all SIM[i][j] for all valid cluster representatives j. Then for each i you get the argmax_j SIM[i][j] in logarithmic time instead of in linear time, and doing this for each i you get the argmax_(i,j) SIM[i][j] in n * log n time instead of in n^2 time per iteration. Once you have found the pair of clusters to merge, you have to update both SIM[i][j] and the priority queues appropriately. The former can be done in linear time per iteration and the latter can be done with a linear number of priority queue operations. BTW, you find pseudo-code of the algorithm in the information retrieval book: http://nlp.stanford.edu/IR-book/html/htmledition/time-complexity-of-hac-1.html maybe that helps you, too. '''Hannah 7Feb10 20:49'''
+The task asked for the ''top-ranked document'', so k = 1. We can stop when the upper bound for all documents not yet seen is ''strictly'' below the k-th largest lower bound so far, and when the score ranges for the documents already seen are such that it is clear which are the top-k documents and in which order. If there are ties, and we don't care how they are broken, and we don't care to know the order of the top-k documents, we can sometimes stop earlier. Does this answer all your questions? '''Hannah 7Mar10 13:06'''
-Line 39:
+Line 40:
-Hi, in exercise 1, is it really possible to implement the complete-link heuristic with a priority queue so that the finding of the next two clusters which have to be merged only needs logarithmic time? If I did it right, then we have to fill the priority queue with the possible pairs of clusters and use their similarity with the complete-link heuristic to sort the queue. But when two clusters are merged now, then the distances of the combined cluster (i.e. the one with the smaller index) to the other clusters can change, so after each merge the priority queue has to be rebuilt to ensure it's heap property. But a complete rebuild of the queue is quite time consuming, or am I doing anything wrong? '''Florian 7Feb10 20:09'''
+Thanks a lot for your comments! Please go on if you have more. '''Hannah 7Mar10 13:07'''
-Line 41:
+Line 42:
-Hi Mirko, you are free to modify the existing program (linked to in my first comment below), or write the program from scratch (or transcribe my C++ program) in a programming language of your choice. '''Hannah 7Feb10 16:40'''
 Line 43:
-Hi, one question about Exercise 1: the text says "Modify the code", does this mean we have to extend the C++-program, or are we also allowed to write the complete program in an other language like Java? '''Mirko 7Feb10 16:33'''
+== Questions and comments about Exercise Sheet 14 below this line (most recent on top) ==
 Line 45:
-I can't say now with certainty which courses exactly I will offer in a year. It is likely though, that we will offer bachelor / master projects on a regular basis. '''Hannah 5Feb10 21:17'''
+Hi Johannes: why don't you start with the first few questions, and then let's see whether it makes sense to continue this via the Wiki, or via private email, or via a meeting in person. '''Hannah 6Mar10 17:36'''
 Line 47:
-Yes, Marjan is right. '''Johannes 2010-02-05:1931'''
+Yes, the final exam is like the mid-term exam in this respect. '''Hannah 6Mar10 17:36'''
 Line 49:
-Probably Johannes wants/has to do his project next year. '''Marjan 5Feb10 18:32'''
+Alex: http://vulcano.informatik.uni-freiburg.de/wiki/teaching/SearchEnginesWS0910/MidTermExam, so it seems to be allowed. '''Mirko, 6Mar10 16:10'''
 Line 51:
-Hi Johannes, I don't quite understand your comment, what do you mean? '''Hannah 5Feb10 17:33'''
+Hi, I was wondering, will the exam next week also be an open book exam like the mid-term? Perhaps I overlooked it, but I don't think this is stated anywhere yet. '''Alex 6Mar10 13:49'''
 Line 53:
-Even if it was in one year? '''Johannes 2010-02-05:1719'''
+I have lots of questions and don't know where to put them. I suppose this wiki-page will get chaotic pretty fast if I post 20 questions. '''Johannes VI Mar MMX 12:00'''
 Line 55:
-If you are interested in doing the Bachelor / Master's project (following this course as a block course), please send me an email asap: bast@informatik.uni-freiburg.de . '''Hannah 4Deb10 19:30'''
+I'm sorry for the delay with the master solutions. I am at a conference right now but will try to make progress with this over the weekend. '''Hannah 4Mar10 23:59'''
 Line 57:
-Here is the [[attachment:HierarchicalClustering.cpp|hierarchical clustering code]] which we have written in the lecture. '''Hannah 2Feb10 18:47'''
+Do we get master solutions for ex. 11, 12, 13 and 14? '''Johannes 04Mar2010 23:32 ZULU'''

Now they're there again. '''Marjan 01Mar18:09'''

ARGH! I'm very sorry. My Down-Them-All Plugin for Firefox seems to have deleted all the lecture PDFs! Sorry for that. Rollback to previous versions does not seem to work. I hope, someone has already downloaded them all and is able to restore them! SORRY! Interesting, I've got the rights to delete something from the main page, though. '''Marius Mar 1st 2010 2:38 p.m.'''

(Reminder:) Hello, the master solutions are not online, yet. '''alex n 1Mar10 11:08'''

Yes, we are working on it. Please remind us again if they aren't online by the end of this week. '''Hannah 23Feb10 14:30'''

Do we get master solutions for ex. 11, 12, 13 and 14? '''Johannes 23Feb10 14:05'''

Hi Matthias, yes, Pr(A) = 1 - Pr(not A), for any event A, and so for any random variable X, Pr(X <= x) = 1 - Pr(X > x), because X <= x and X > x are complementary events. For continuous random variables (like variables with a normal distribution), the difference between <= and < and >= and > is immaterial, because Pr(X = x) for each fixed x. But anyway, to compute the probability, you first have to transform it a bit, like I did in the lecture, and then obtain Pr(N(0,1) >= sqrt(n1) * (µ1 - µ) / σ) and Pr(N(0,1) <= sqrt(n2) * (µ - µ2) / σ). To evaluate the latter you can also simply use the symmetry of the normal distribution, due to which one has Pr(N(0,1) <= -x) = Pr(N(0,1) >= x). '''Hannah 18Feb10 12:58'''

Hi, how can we compute Pr(N(n2 * µ2, n2 * σ^2^) <= n2 * µ2 ? Can we use 1- (Pr(N(n2 * µ2, n2 * σ^2^) >= n2 * µ2) for that ? '''Matthias 18Feb10 12:01'''

Hi Florian + all, one of µ1 and µ2 is larger than µ and one is smaller. Let's assume µ1 is larger and µ2 is smaller. Then for µ1 you have to look at Pr(N(n1 * µ, n1 * σ^2^) >= n1 * µ1). But for µ2 you have to look at Pr(N(n2 * µ2, n2 * σ^2^) <= n2 * µ2). Note the <= instead of the >= for the second probability. Recall the meaning of these probabilities. Just as an example, let µ be 100 and µ1 be 150 and µ2 be 50. Then the first probability means: what is the probability that I see a mean of ''150 or more'' in my first sample, although the mean of my distribution is 100. The second probability means: what is the probability that I see a mean of ''50 or less'' in my second sample, although the mean of my distribution is 100. If you take both <= or both >= for both probabilities, it is to be expected that you get two completely different probabilities, one very low and one very high (except when they are both close to 50%). Please ask again if this is still unclear. '''Hannah 17Feb10 21:51'''

Sorry, with probability for µ1 I meant Pr(N(n1 * µ, n1 * σ^2^) >= n1 * µ1) and accordingly with probability for µ2 I meant Pr(N(n2 * µ, n2 * σ^2^) >= n2 * µ2) where n1=n2 for the exercise sheet.  '''Florian 17Feb10 21:18'''

Hi Florian, what exactly do you mean by ''probability for µ1'' and ''probability for µ2''? '''Hannah 17Feb10 21:02'''

Hi, what values are we expected to get for exercise 4? I always get a probability of about 99.9% for μ1 and a value of about 0.07% for μ2, can that be? '''Florian 17Feb10 18:25'''

Hi Florian, yes, the ''averages'' in Exercise 3 should be ''average running times''. I uploaded a new version of the sheet, where I corrected this. '''Hannah 14Feb10 17:48'''

Hi, I guess we should measure the running times to determine the efficiency of the programs for exercise 3? '''Florian 15Feb10 17:42'''

Hi Claudius, you should compute Pr(D|H0), exactly as done in the lecture for Example 2, where we computed this probability as Pr(X > x), where X is a random variable with distribution N(0,1), that is, normal with mean 0 and variance 1, and x depends on the mean and variance of your data. '''Hannah 14Feb10 16:44'''

Hi. If I have understood correctly, we have to compute Pr(H|D) in Exercise 4. From statistical hypothesis testing, we get Pr(D|H). Now, Pr(H|D) = Pr(D|H) * (Pr(H) / Pr(D)). We know Pr(D|H) and we can compute Pr(D), but what value do we have to use for Pr(H)? '''Claudius 14Feb10 14:41'''

Hi Eric, I don't care whether you use integers or doubles, but I am curious why the one should be any harder than the other? '''Hannah 12Feb10 19:02'''

May we use integers for sorting? Or do we have to use doubles? This is important for generating my sorted array '''Eric 12Feb10 18:56'''

If you're asking about the merging you can of course use a priority queue if you want, but you don't really need it when merging 2 lists. '''Marjan 18:28'''

Why would you use a priority queue? It's simple sorting, the exercise is not about implementing your own sorting algorithm or something like that. About exercise 3, it should be clear from the exercise itself that the sequences should be sorted (otherwise how can the merging work?) '''Marjan 18:23'''

Means that we have nothing to do than use a priority queue or something like that and don't have to implement the sorting? And at Exercise 3 the random set should be an ordered one or not? '''Alex 12Feb10 18:19'''

We prefer randomized sorting using bitonic networks, alternatively combined with LSD radix sort or simple pancake sort. That's of course a joke, it should be clear that you can use the built-in sorting functions (your own implementation will be certainly slower). '''Marjan 12Feb10 18:12'''

What does "do a standard sort" in exercise 2 mean? Shall I implement one on my own, or may I use the Java built-in sorting mechanisms? Also, which sorting algorithm do you prefer for this? '''Eric 12Feb10 18:04'''