Differences between revisions 40 and 327 (spanning 287 versions)

Welcome to the Wiki page of the course Search Engines, WS 2009 / 2010. Lecturer: Hannah Bast. Tutorials: Marjan Celikik. Course web page: click here.

Here are PDFs of the slides of the lectures so far: Lecture 1, Lecture 2, Lecture 3, Lecture 4, Lecture 5.

Here are .lpd files of the recordings of the lectures so far (except Lecture 2, where we had problems with the microphone): Recording Lecture 1, Recording Lecture 3, Recording Lecture 4, Recording Lecture 5 (no audio).

Here are PDFs of the exercise sheets so far: Exercise Sheet 1, Exercise Sheet 2, Exercise Sheet 3, Exercise Sheet 4, Exercise Sheet 5.

Here are your solutions and comments on the previous exercise sheets: Solutions and Comments 1, Solutions and Comments 2, Solutions and Comments 3, Solutions and Comments 4.

Exercise Sheet 5

The recordings of all lectures are now available, see above. Lecture 2 is missing because we had technical problems there. To play the recordings (they are .lpd files) you need the Lecturnity Player. You can download the player for free here.

Here are the rules for the exercises as explained in Lecture 2.

Here you can upload your solutions for Exercise Sheet 5.

Questions or comments below this line, most recent on top please

Hi, just to tie in with the last question: By comparisons, do you mean only the comparisons of list elements (such as A[i] <= B[j]), or are guarding conditional comparisons (such as "if (i < list1.size())") also to be counted? Marius 11/22/2009 10:32pm

Hi Björn + all: good question. One simple way to deal with this would be to use a comma expression like (numComparisons++, A[i] <= B[j]). A list of expressions, separated by commas, gets evaluated from left to right, and the value of the whole expression is simply the value of the last expression. At least that is the case in C++, but most programming languages have the same or a very similar construct. An alternative that essentially does the same thing would be to do the comparison in a separate function, which, besides doing the comparison, also increases the comparison counter. If you do that, you should absolutely make sure that the function is inline though, since otherwise you pay the price of a function call for every comparison, which is likely to spoil your performance. Hannah 22Nov09 6:30pm
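Both suggestions from the comment above can be sketched as follows. This is only a minimal illustration, not the official solution: the function names intersectCounting and countedLess are made up here, the lists are assumed to be sorted vectors of ints, and the sketch counts only comparisons between list elements (whether guard conditions such as i < A.size() should also be counted is a separate question).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Global comparison counter (name taken from the comment above).
static std::size_t numComparisons = 0;

// Variant 2: an inline helper that increments the counter and compares.
// The name countedLess is hypothetical.
inline bool countedLess(int a, int b) {
  ++numComparisons;
  return a < b;
}

// Linear (merge-style) intersection of two sorted lists that counts every
// comparison between list elements. Guard comparisons are NOT counted here.
std::vector<int> intersectCounting(const std::vector<int>& A,
                                   const std::vector<int>& B) {
  assert(std::is_sorted(A.begin(), A.end()));
  assert(std::is_sorted(B.begin(), B.end()));
  std::vector<int> result;
  std::size_t i = 0, j = 0;
  while (i < A.size() && j < B.size()) {    // guards, not counted
    // Variant 1: comma expression, increments first, then compares.
    if ((numComparisons++, A[i] < B[j])) {
      ++i;
    } else if (countedLess(B[j], A[i])) {   // variant 2: helper function
      ++j;
    } else {                                // neither is smaller: equal
      result.push_back(A[i]);
      ++i;
      ++j;
    }
  }
  return result;
}
```

The comma expression also works inside && and ||: in (numComparisons++, a < b) && (numComparisons++, c < d) the second counter increment only happens if the second comparison is actually evaluated, so short-circuit evaluation is counted correctly without nesting if statements.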

Hi, is there a good way to count comparisons if logical connectives are used? As far as I know, && only evaluates until the first comparison is false, and || only until one is true. Unfortunately I cannot think of a way to count those comparisons without rewriting the code and nesting the if statements. I'm not too familiar with compilers, but I hope that these changes to the code won't affect the performance. Björn 22Nov09 6:20pm

Hi Claudius + all: you are right, it's not very precise, the intended meaning is something like "compare the tables for the two algorithms" or "for each of the four measurements, compare the tables for the two algorithms". Hannah 22Nov09 3:39pm

I am confused: In Exercise 4, you write: "Compare the two tables...", but when I take both algorithms and all 4 measurements (running time, ratio, ...), I get 8 4x4 tables (for all the combinations of 10³, 10⁴, 10⁵, 10⁶). What am I doing wrong? Claudius 22Nov09 3:30pm

Hi Björn + all: very good question and thanks for pointing that out. You should indeed always search the elements of the smaller list in the larger list, and the first thing your (advanced) list intersection algorithm should do is figure out which of the two lists is the smaller one. That is, your 4 x 4 table will be symmetric, and actually only contains 10 different values (the 6 below the diagonal, which are the same as the ones above the diagonal, and the 4 on the diagonal). Hannah 22Nov09 2:50pm
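The point about always searching the elements of the smaller list in the larger one can be sketched like this. This is a hedged illustration of one possible exponential + binary search intersection, not the reference implementation from the lecture: the function name intersectExpBin is made up, and both lists are assumed to be sorted vectors of ints without duplicates.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Intersect two sorted, duplicate-free lists by searching each element of the
// shorter list in the longer one (exponential search, then binary search).
std::vector<int> intersectExpBin(const std::vector<int>& A,
                                 const std::vector<int>& B) {
  assert(std::is_sorted(A.begin(), A.end()));
  assert(std::is_sorted(B.begin(), B.end()));
  // First step: figure out which of the two lists is the smaller one.
  const std::vector<int>& shorter = A.size() <= B.size() ? A : B;
  const std::vector<int>& longer = A.size() <= B.size() ? B : A;
  const std::size_t n = longer.size();

  std::vector<int> result;
  std::size_t pos = 0;  // all elements of longer before pos are < current x
  for (int x : shorter) {
    // Exponential search: skip blocks of size 1, 2, 4, ... as long as the
    // last element of the block is still smaller than x.
    std::size_t lo = pos, step = 1;
    while (lo + step <= n && longer[lo + step - 1] < x) {
      lo += step;
      step *= 2;
    }
    // Binary search in the remaining block [lo, min(lo + step, n)).
    const std::size_t hi = std::min(lo + step, n);
    pos = static_cast<std::size_t>(
        std::lower_bound(longer.begin() + lo, longer.begin() + hi, x) -
        longer.begin());
    if (pos == n) break;  // all remaining elements of shorter are too large
    if (longer[pos] == x) result.push_back(x);
  }
  return result;
}
```

Because the function swaps the roles of the two lists itself, calling it with (A, B) or (B, A) does the same work, which is why the 4 x 4 table of measurements comes out symmetric.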

For the exp/bin-search intersection algorithm it clearly matters that it searches for the elements of the smaller list in the larger one. A good implementation will certainly take care of that. Should our implementation also do that or ignore it in order to get 16 measurements that are really different? Björn 22Nov09 1:00pm

Ok, no problem, I'm happy when it's clear now. Hannah 22Nov09 0:24am

You're right, I misread your comment, sorry. I was thinking of 10 MB per list processed in 1 second, resulting in 20 MB/s, and was wondering where the 100 MB/s were coming from. Thomas 22Nov09 00:20am

Hi Thomas, I am at a loss for words here. I am saying a car drives 20 kilometers and needs 10 minutes for that, so its average speed was 120 km / hour. And you are asking how the speed of a car can be 120 km / hour when it only drives 20 kilometers. Well, what should I say. Besides, in my example I clearly said that the two lists together occupy 10 MB, not 10 MB per list. Please read again what I wrote. Hannah 22Nov09 0:16am

Why should two lists of 10MB size result in 100MB processed, if each list is only iterated over once to do the intersection (O(m+n) complexity)? The data processed after all is just 20MB, no matter how the algorithm is implemented (even if it iterates a thousand times over every list, it still just processed 20MB of data). Thomas 21Nov09 12:00am

By the way, whenever I talk about "lists" here or on the exercise sheets or in the lecture, I am not referring to a particular data structure (in particular I am NOT talking about a linked list), but "list of elements" is just "series of elements". And well, "inverted list" is just common terminology. To implement a "list of doc ids" or anything like that you should of course always use an array or a vector or a data structure like that. Hannah 21Nov09 8:30pm

Hi Marius + all, let me explain it by an example. Your two input lists occupy a certain amount of memory. Every programming language has built-in functions for this. For example, if your list entries are ints, then for C++ you can use sizeof(int) to get the number of bytes occupied by one entry. Multiply by the number of list elements to get the number of bytes occupied by one list. One Megabyte (MB) is 1024 * 1024 bytes. Now assume your two lists together occupy 10 MB. Assume your code takes 0.1 seconds to intersect these two lists. Then the "MB processed per second" is 100 MB / second. Hannah 21Nov09 8:26pm
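The arithmetic from the example above, written out as a small sketch. The helper names listBytes and mbPerSecond are made up for illustration, and the lists are assumed to be stored as vectors of ints.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Number of bytes occupied by the entries of one list of ints.
inline std::size_t listBytes(const std::vector<int>& list) {
  return list.size() * sizeof(int);
}

// "MB processed per second", with 1 MB = 1024 * 1024 bytes.
// totalBytes is the size of BOTH input lists together.
inline double mbPerSecond(std::size_t totalBytes, double seconds) {
  assert(seconds > 0.0);
  const double megabytes = static_cast<double>(totalBytes) / (1024.0 * 1024.0);
  return megabytes / seconds;
}
```

For two lists A and B and a measured running time of t seconds, the value to report would be mbPerSecond(listBytes(A) + listBytes(B), t). With 10 MB in total and t = 0.1 this gives the 100 MB / second from the example.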

Hi, in exercise 3, what do you mean by "MB processed per second"? Is a MB the equivalent to 4096 processed integers? And when is a MB to be considered as processed? When it's written to the intersected list or in the comparisons, already? Marius 21Nov09 7:33pm

The slides + all my handwriting on them are now online, see the link Recording Lecture 5 (no audio) above. Hannah 20Nov09 3:24am

The recording of today's lecture again did not work. I am very sorry for that (and very angry that there are so many problems with this software). Anyway, the end result of the lecture, that is, the slides with all the writing on them, is available and I will put it online as soon as possible. Hannah 19Nov09 11:23pm

There is a typo in Exercise 5 of the new sheet. The two occurrences of n should be m. Hannah 19Nov09 11:22pm

-  ⇤ ← Revision 40 as of 2009-10-24 22:51:25 → 
  Size: 2121
  Editor: p549FA6F2
  Comment:
+   ← Revision 327 as of 2009-11-22 22:32:31 → ⇥
  Size: 8515
  Editor: p4FF665C3
  Comment:
-Deletions are marked like this.
+Additions are marked like this.
 Line 3:
-= Exercise Sheet 1 =
[[attachment:SearchEnginesWS0910/ExerciseSheet1/lecture-1.pdf|Here is a PDF of the slides of Lecture 1]].
+Here are PDFs of the slides of the lectures so far: [[attachment:SearchEnginesWS0910/lecture-1.pdf|Lecture 1]], [[attachment:SearchEnginesWS0910/lecture-2.pdf|Lecture 2]], [[attachment:SearchEnginesWS0910/lecture-3.pdf|Lecture 3]], [[attachment:SearchEnginesWS0910/lecture-4.pdf|Lecture 4]], [[attachment:SearchEnginesWS0910/lecture-5.pdf|Lecture 5]].
-Line 6:
+Line 5:
-[[attachment:SearchEnginesWS0910/ExerciseSheet1/exercise-1.pdf|Here is a PDF of Exercise Sheet 1]].
+Here are .lpd files of the recordings of the lectures so far (except Lecture 2, where we had problems with the microphone): [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-1.lpd|Recording Lecture 1]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-3.lpd|Recording Lecture 3]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-4.lpd|Recording Lecture 4]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-5.lpd|Recording Lecture 5 (no audio)]].
-Line 8:
+Line 7:
-[[SearchEnginesWS0910/StudentIntros|Introduce yourself on this page please (Exercise 1)]].
+Here are PDFs of the exercise sheets so far: [[attachment:SearchEnginesWS0910/exercise-1.pdf|Exercise Sheet 1]], [[attachment:SearchEnginesWS0910/exercise-2.pdf|Exercise Sheet 2]], [[attachment:SearchEnginesWS0910/exercise-3.pdf|Exercise Sheet 3]], [[attachment:SearchEnginesWS0910/exercise-4.pdf|Exercise Sheet 4]], [[attachment:SearchEnginesWS0910/exercise-5.pdf|Exercise Sheet 5]].
-Line 10:
+Line 9:
-[[SearchEnginesWS0910/ExerciseSheet1|Upload your results to Exercise Sheet 1 on this page please]].
+Here are your solutions and comments on the previous exercise sheets: [[SearchEnginesWS0910/ExerciseSheet1|Solutions and Comments 1]], [[SearchEnginesWS0910/ExerciseSheet2|Solutions and Comments 2]], [[SearchEnginesWS0910/ExerciseSheet3|Solutions and Comments 3]], [[SearchEnginesWS0910/ExerciseSheet4|Solutions and Comments 4]].

= Exercise Sheet 5 =

The recordings of all lectures are now available, see above. Lecture 2 is missing because we had technical problems there. To play the recordings (they are .lpd files) you need the Lecturnity Player. [[http://www.lecturnity.de/de/download/lecturnity-player|You can download the player for free here]].

[[SearchEnginesWS0910/Rules|Here are the rules for the exercises as explained in Lecture 2]].

[[SearchEnginesWS0910/ExerciseSheet5|Here you can upload your solutions for Exercise Sheet 5]].
-Line 14:
+Line 21:
-I don't know if this is the right place to ask, but I can't access my exercise page. '''Johannes 24Oct09 11:50pm'''
+Hi, just to tie in with the last question: By comparisons, do you mean only the comparisons of list elements (such as A[i] <= B[j]), or are guarding conditional comparisons (such as "if (i < list1.size())") also to be counted? '''Marius 11/22/2009 10:32pm'''
-Line 16:
+Line 23:
-Good question, Johannes. Please upload the source code separately, either as a .zip or .tgz archive. I have modified the instructions on the upload page accordingly. Sorry if that means additional work for you, we weren't expecting anybody to submit as early as this :-) '''Hannah 24Oct09 11:43pm'''
+Hi Björn + all: good question. One simple way to deal with this would be to use a comma expression like ''(numComparisons++, A[i] <= B[j])''. A list of expressions, separated by commas, gets evaluated from left to right, and the value of the whole expression is simply the value of the last expression. At least that is the case in C++, but most programming languages have the same or a very similar construct. An alternative that essentially does the same thing would be to do the comparison in a separate function, which, besides doing the comparison, also increases the comparison counter. If you do that, you should absolutely make sure that the function is ''inline'' though, since otherwise you pay the price of a function call for every comparison, which is likely to spoil your performance. '''Hannah 22Nov09 6:30pm'''
-Line 18:
+Line 25:
-Shall we put the whole source into the PDF? What about tar.gz? '''Johannes 24Oct09 5:18pm'''
+Hi, is there a good way to count comparisons if logical connectives are used? As far as I know, && only evaluates until the first comparison is false, and || only until one is true. Unfortunately I cannot think of a way to count those comparisons without rewriting the code and nesting the if statements. I'm not too familiar with compilers, but I hope that these changes to the code won't affect the performance. '''Björn 22Nov09 6:20pm'''
-Line 20:
+Line 27:
-Hi Johannes + all, the slides are now availabe as PDF, see the link above. '''Hannah 23Oct09 17:04'''
+Hi Claudius + all: you are right, it's not very precise, the intended meaning is something like "compare the tables for the two algorithms" or "for each of the four measurements, compare the tables for the two algorithms". '''Hannah 22Nov09 3:39pm'''
-Line 22:
+Line 29:
-'''Note about Exercise 5:''' One can assume that a more general model of the word frequencies is given than that given in the lecture, i.e. eps * N * (1 / i^alpha). Now both parameters (eps and alpha) can be estimated simultaneously. '''Marjan 23Oct09 3:29pm'''
+I am confused: In Exercise 4, you write: "Compare the two tables...", but when I take both algorithms and all 4 measurements (running time, ratio, ...), I get 8 4x4 tables (for all the combinations of 10^3^, 10^4^, 10^5^, 10^6^). What am I doing wrong? '''Claudius 22Nov09 3:30pm'''
-Line 24:
+Line 31:
-Can you provide the slides as PDF? '''Johannes 23Oct09 10:05am'''
+Hi Björn + all: very good question and thanks for pointing that out. You should indeed always search the elements of the smaller list in the larger list, and the first thing your (advanced) list intersection algorithm should do is figure out which of the two lists is the smaller one. That is, your 4 x 4 table will be ''symmetric'', and actually only contains 10 different values (the 6 below the diagonal, which are the same as the ones above the diagonal, and the 4 on the diagonal). '''Hannah 22Nov09 2:50pm'''
-Line 26:
+Line 33:
-Please note that the deadline for uploading your solutions of the exercises is always Monday, 23:59 (sharp). '''Marjan 22Oct09 6:15pm'''
+For the exp/bin-search intersection algorithm it clearly matters that it searches for the elements of the smaller list in the larger one. A good implementation will certainly take care of that. Should our implementation also do that or ignore it in order to get 16 measurements that are really different? '''Björn 22Nov09 1:00pm'''
-Line 28:
+Line 35:
-When you add a question or comment here, please end it with your name and the date and time in bold face, just like I did now. '''Hannah 22Oct09 01:59am'''
+Ok, no problem, I'm happy when it's clear now. '''Hannah 22Nov09 0:24am'''

You're right, I misread your comment, sorry. I was thinking of 10 MB per list processed in 1 second, resulting in 20 MB/s, and was wondering where the 100 MB/s were coming from. '''Thomas 22Nov09 00:20am'''

Hi Thomas, I am at a loss for words here. I am saying a car drives 20 kilometers and needs 10 minutes for that, so its average speed was 120 km / hour. And you are asking how the speed of a car can be 120 km / hour when it only drives 20 kilometers. Well, what should I say. Besides, in my example I clearly said that the two lists ''together'' occupy 10 MB, not 10 MB per list. Please read again what I wrote. '''Hannah 22Nov09 0:16am'''

Why should two lists of 10MB size result in 100MB processed, if each list is only iterated over once to do the intersection (O(m+n) complexity)? The data processed after all is just 20MB, no matter how the algorithm is implemented (even if it iterates a thousand times over every list, it still just processed 20MB of data). '''Thomas 21Nov09 12:00am'''

By the way, whenever I talk about "lists" here or on the exercise sheets or in the lecture, I am not referring to a particular data structure (in particular I am NOT talking about a linked list), but "list of elements"  is just "series of elements". And well, "inverted list" is just common terminology. To implement a "list of doc ids" or anything like that you should of course always use an array or a vector or a data structure like that. '''Hannah 21Nov09 8:30pm'''

Hi Marius + all, let me explain it by an example. Your two input lists occupy a certain amount of memory. Every programming language has built-in functions for this. For example, if your list entries are ints, then for C++ you can use sizeof(int) to get the number of bytes occupied by one entry.  Multiply by the number of list elements to get the number of bytes occupied by one list. One Megabyte (MB) is 1024 * 1024 bytes. Now assume your two lists together occupy 10 MB. Assume your code takes 0.1 seconds to intersect these two lists. Then the "MB processed per second" is 100 MB / second. '''Hannah 21Nov09 8:26pm'''

Hi, in exercise 3, what do you mean by "MB processed per second"? Is a MB the equivalent to 4096 processed integers? And when is a MB to be considered as processed? When it's written to the intersected list or in the comparisons, already? '''Marius 21Nov09 7:33pm'''

The slides + all my handwriting on them are now online, see the link ''Recording Lecture 5 (no audio)'' above. '''Hannah 20Nov09 3:24am'''

The recording of today's lecture again did not work. I am very sorry for that (and very angry that there are so many problems with this software). Anyway, the end result of the lecture, that is, the slides with all the writing on them, is available and I will put it online as soon as possible. '''Hannah 19Nov09 11:23pm'''

There is a typo in Exercise 5 of the new sheet. The two occurrences of ''n'' should be ''m''. '''Hannah 19Nov09 11:22pm'''