Differences between revisions 128 and 334 (spanning 206 versions)

Welcome to the Wiki page of the course Search Engines, WS 2009 / 2010. Lecturer: Hannah Bast. Tutorials: Marjan Celikik. Course web page: click here.

Here are PDFs of the slides of the lectures so far: Lecture 1, Lecture 2, Lecture 3, Lecture 4, Lecture 5.

Here are .lpd files of the recordings of the lectures so far (except Lecture 2, where we had problems with the microphone): Recording Lecture 1, Recording Lecture 3, Recording Lecture 4, Recording Lecture 5 (no audio).

Here are PDFs of the exercise sheets so far: Exercise Sheet 1, Exercise Sheet 2, Exercise Sheet 3, Exercise Sheet 4, Exercise Sheet 5.

Here are your solutions and comments on the previous exercise sheets: Solutions and Comments 1, Solutions and Comments 2, Solutions and Comments 3, Solutions and Comments 4.

Exercise Sheet 5

The recordings of all lectures are now available, see above. Lecture 2 is missing because we had technical problems there. To play the recordings (it's .lpd files) you need the Lecturnity Player. You can download the player for free here.

Here are the rules for the exercises as explained in Lecture 2.

Here you can upload your solutions for Exercise Sheet 5.

Questions or comments below this line, most recent on top please

Hi Marius, in principle youo are right, but as a rule of thumb, I would say that speed is roughly proportional to the processor speed, so processor speed is a good single number. Hannah 24Nov09 3:06pm

Hi, I would appreciate the Bus-speed of the systems in the list of the exercises, too, because the higher the access to the RAM, the better the values would be. Marius 11/24/2009 10:40am

Hi Dragos, yes, just iterate over the doc ids and pick each with probability m/n. This will give you a sorted list of doc ids of average length m. It does not have to be of size exactly m. Hannah 24Nov09 0:37am

About generating the docID lists: I am guessing I'm not wrong when I say that we don't have to make sure we have exactly m values in each list, we should just generate each element with a m/n probability, right? Thank you ! Dragos 24Nov 12.30 AM

Hi Claudius, yes, the inequality actually holds for "<", too. Hannah 23Nov09 4:05pm

In Exercise 5, could it be, that the inequality holds actually for "<", too? Because k may not be 0 and k <= m. Or am I wrong? Claudius 23Nov09 3:58pm

Hi Marius + all: I mean only "real" comparisons, that is, comparisons of list elements. If you use sentinels, then sentinels count as list elements. Hannah 22Nov 10:38

Hi, just to tie in with the last question: Do you mean by comparisons only the comparisons of list elements (such as A[i] <= B[j]) or are guarding conditional comparisons also to be counted (such as "if (i < list1.size())"? Marius 11/22/2009 10:32pm

Hi Björn + all: good question. One simple way to deal with this would be to use a comma expression like (numComparisons++, A[i] <= B[j]). A list of expressions, separated by commas, gets evaluated from left to right, and the value of the whole expression is simply the value of the last expression. At least that is the case in C++, but most programming languages have the same or a very similar construct. An alternative that essentially does the same thing, would be to do the comparison in a separate function, which, besides doing the comparison, also increase the comparisons counter. If you do that, you should absolutely make that that function is inline though since otherwise you pay the price of a function call for every comparison, which is likely to spoil your performance. Hannah 22Nov09 6:30pm

Hi, is there a good way to count comparisons if locial connectives are used? As far as I know, && will only check until the first comparison is false and || if it until some is true. Unfortunately I cannot think of a way to count those comparisons without rewriting the code and nesting the if statements. I'm not too familiar with compilers but I hope that this changes to the code won't effect the performance.Björn 22Nov09 6:20pm

Hi Claudius + all: you are right, it's not very precise, the intended meaning is something like "compare the tables for the two algorithms" or "for each of the four measurements, compare the tables for the two algorithms". Hannah 22Nov09 3:39pm

I am confused: In Exercise 4, you write: "Compare the two tables...", but when I take both algorithms and all 4 measurements (running time, ratio, ...), I get 8 4x4 tables (for all the combinations of 10³, 10⁴, 10⁵, 10⁶). What I'm doing wrong? Claudius 22Nov09 3:30pm

Hi Björn + all: very good question and thanks for pointing that out. You should indeed always search the elements of the smaller list in the larger list, and the first thing your (advanced) list intersection algorithm should do is figure out which of the two lists is the smaller one. That is, your 4 x 4 table will be symmetric, and actually only contains 10 different values (the 6 below the diagonal, which are the same as the ones above the diagonal, and the 4 on the diagonal). Hannah 22Nov09 2:50pm

For the exp/bin-search intersection algorithm it clearly matters that it searches for the elements of the smaller list in the larger one. A good implementation will certainly take care of that. Should our implementation also do that or ignore it in order to get 16 measurements that are really different? Björn 22Nov09 1:00pm

Ok, no problem, I'm happy when it's clear now. Hannah 22Nov09 0:24am

You're right, I misread your comment, sorry. I was thinking of 10MB per lists processed in 1 second, resulting in 20MB/s and was wondering where the 100MB/s are coming from. Thomas 22Nov09 00:20am

Hi Thomas, I am at a loss of words here. I am saying a car is driving 20 kilometers and it needs 10 minutes for that, so its average speed was 120 km / hours. And you are saying how can the speed of a car be 120 km / hours, when it only drives 20 kilometers. Well, what should I say. Besides, in my example I clearly said that the two lists together occupy 10 MB, not 10 MB per list. Please read again what I wrote. Hannah 22Nov09 0:16am

Why should two lists of 10MB size result in 100MB processed, if each list is only iterated over once to do the intersection (O(m+n) complexity)? The data processed after all is just 20MB, no matter how the algorithm is implemented (even if it iterates a thousand times over every list, it still just processed 20MB of data). Thomas 21Nov09 12:00am

By the way, whenever I talk about "lists" here or on the exercise sheets or in the lecture, I am not referring to a particular data structure (in particular I am NOT talking about a linked list), but "list of elements" is just "series of elements". And well, "inverted list" is just common terminology. To implement a "list of doc ids" or anything like that you should of course always use an array or a vector or a data structure like that. Hannah 21Nov09 8:30pm

Hi Marius + all, let me explain it by an example. Your two input lists occupy a certain amount of memory. Every programming language has built-in functions for this. For example, if your list entries are ints, then for C++ you can use sizeof(int) to get the number of bytes occupied by one entry. Multiply by the number of list elements to get the number of bytes occupied by one list. One Megabyte (MB) is 1024 * 1024 bytes. Now assume your two lists together occupy 10 MB. Assume your code takes 0.1 seconds to intersect these two lists. Then the "MB processed per second" is 100 MB / second. Hannah 21Nov09 8:26pm

Hi, in exercise 3, what do you mean by "MB processed per second"? Is a MB the equivalent to 4096 processed integers? And when is a MB to be considered as processed? When it's written to the intersected list or in the comparisons, already? Marius 21Nov09 7:33pm

The slides + all my hand-writing on it are now online, see the link Recording Lecture 5 (no audio) above. Hannah 20Nov09 3:24am

The recording of todays lecture again did not work. I am very sorry for that (and very angry that there are so many problems with this software). Anyway, the end result of the lecture, that is the slides with all the writing on it are available and I will put them online as soon as possible. Hannah 19Nov09 11:23pm

There is a typo in Exercise 5 of the new sheet. The two occurrences of n should be m. Hannah 19Nov09 11:22pm

-  ⇤ ← Revision 128 as of 2009-11-01 08:44:20 → 
  Size: 7622
  Editor: p4FF646EF
  Comment:
+   ← Revision 334 as of 2009-11-24 15:05:48 → ⇥
  Size: 9827
  Editor: Hannah Bast
  Comment:
-Deletions are marked like this.
+Additions are marked like this.
 Line 3:
-Here are PDFs of the slides of the lectures so far: [[attachment:SearchEnginesWS0910/lecture-1.pdf|Lecture 1]], [[attachment:SearchEnginesWS0910/lecture-2.pdf|Lecture 2]].
+Here are PDFs of the slides of the lectures so far: [[attachment:SearchEnginesWS0910/lecture-1.pdf|Lecture 1]], [[attachment:SearchEnginesWS0910/lecture-2.pdf|Lecture 2]], [[attachment:SearchEnginesWS0910/lecture-3.pdf|Lecture 3]], [[attachment:SearchEnginesWS0910/lecture-4.pdf|Lecture 4]], [[attachment:SearchEnginesWS0910/lecture-5.pdf|Lecture 5]].
 Line 5:
-Here are PDFs of the exercise sheets so far: [[attachment:SearchEnginesWS0910/exercise-1.pdf|Exercise Sheet 1]], [[attachment:SearchEnginesWS0910/exercise-2.pdf|Exercise Sheet 2]].
+Here are .lpd files of the recordings of the lectures so far (except Lecture 2, where we had problems with the microphone): [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-1.lpd|Recording Lecture 1]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-3.lpd|Recording Lecture 3]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-4.lpd|Recording Lecture 4]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-5.lpd|Recording Lecture 5 (no audio)]].
 Line 7:
-Here are your solutions and comments on the previous exercise sheets: [[SearchEnginesWS0910/ExerciseSheet1|Exercise Sheet 1]].
+Here are PDFs of the exercise sheets so far: [[attachment:SearchEnginesWS0910/exercise-1.pdf|Exercise Sheet 1]], [[attachment:SearchEnginesWS0910/exercise-2.pdf|Exercise Sheet 2]], [[attachment:SearchEnginesWS0910/exercise-3.pdf|Exercise Sheet 3]], [[attachment:SearchEnginesWS0910/exercise-4.pdf|Exercise Sheet 4]], [[attachment:SearchEnginesWS0910/exercise-5.pdf|Exercise Sheet 5]].

Here are your solutions and comments on the previous exercise sheets: [[SearchEnginesWS0910/ExerciseSheet1|Solutions and Comments 1]], [[SearchEnginesWS0910/ExerciseSheet2|Solutions and Comments 2]], [[SearchEnginesWS0910/ExerciseSheet3|Solutions and Comments 3]], [[SearchEnginesWS0910/ExerciseSheet4|Solutions and Comments 4]].

= Exercise Sheet 5 =

The recordings of all lectures are now available, see above. Lecture 2 is missing because we had technical problems there. To play the recordings (it's .lpd files) you need the Lecturnity Player. [[http://www.lecturnity.de/de/download/lecturnity-player|You can download the player for free here]].
-Line 11:
+Line 17:
-= Exercise Sheet 2 =
Here are the details about the three servers (UDP, TCP, HTTP) for Exercise 4:

All three servers are running on our machine vulcano.informatik.uni-freiburg.de (IP address is 132.230.152.135).

The UDP server is running on port 8888 of that machine. You can send it a number and it will then send you back that number of bytes, in packets of 1000 bytes each. (That means you also have to read packets of 1000 bytes each.) The first ten bytes of each packet contain the packet id. That is interesting for checking which packets get lost and in which order packets arrive.

The TCP server is running on port 9999 of that machine. You can send it a request of the form GET /<number of bytes> HTTP/1.1, or you can just use a downloading program like wget or curl and time it.

The HTTP server is running on port 80, as web servers normally do. Just download the file http://vulcano.informatik.uni-freiburg.de/file_100M and measure the time.

For measuring your transfer and error rates, as requested by the exercise, repeat your experiments several times and also at different times, and form the average of these measurements (or report several numbers if you get very different results). You should ask for large amounts of data, like 10 MB or 100 MB.

[[SearchEnginesWS0910/ExerciseSheet2|Here you can upload your solutions for Exercise Sheet 2]].
+[[SearchEnginesWS0910/ExerciseSheet5|Here you can upload your solutions for Exercise Sheet 5]].
-Line 28:
+Line 21:
-Hello! I still could not get any response from both the UDP and the TCP server port by now, but HTTP server works fine. If anyone could get some result, please tell me that you can get response from servers, so that I will know it's my own problem... Thank you! '''Zhongjie 1Nov09 08:43'''
+Hi Marius, in principle youo are right, but as a rule of thumb, I would say that speed is roughly proportional to the processor speed, so processor speed is a good single number. '''Hannah 24Nov09 3:06pm'''
-Line 30:
+Line 23:
-Hi, I need some clarification on this term "HTTP result header" in Excercise 2 in sheet 2. Will HTTP header contains generic http information or something related to Results? Offcourse our Result will in HTML form. '''Waleed 1Nov09 5:52AM'''
+Hi, I would appreciate the Bus-speed of the systems in the list of the exercises, too, because the higher the access to the RAM, the better the values would be. '''Marius 11/24/2009 10:40am'''
-Line 32:
+Line 25:
-I am sorry for the downtime, these maintenance works were announced already several weeks ago, but then I forgot about them because they were scheduled on a Saturday which I thought would not affect me. The downtime also killed our servers, but now they are running again. About the corrections: of course you should get comments on what you did wrong and why you got less points for what. Sorry, if that didn't happen for the first exercise sheet. I will talk with Marjan. '''Hannah 1Nov09 00:38am'''
+Hi Dragos, yes, just iterate over the doc ids and pick each with probability m/n. This will give you a sorted list of doc ids of average length m. It does not have to be of size exactly m. '''Hannah 24Nov09 0:37am'''
-Line 34:
+Line 27:
-Yes, Eric. Today afternoon and evening was nearly the whole Uni-Net Offline because of intended maintenance. I have a question: Will there be any correction or comments or a sample solution or something like that for every exercise because I now only know how many points I have for every Exercise in Exercise Sheet 1 but I don't know why I have a lack of 1/2 point in one exercise and in another. I think it would be good if anybody will know what he has done wrong or/and what he could have done better. '''Waldemar 31Oct09 21:08'''
+About generating the docID lists: I am guessing I'm not wrong when I say that we  don't have to make sure we have exactly m values in each list, we should just generate each element with a m/n probability, right? Thank you ! '''Dragos 24Nov 12.30 AM'''
-Line 36:
+Line 29:
-I cannot connect to vulcano.informatik.uni-freiburg.de:9999 and :8888, neither from outside, nor from inside (logged into pool account). Also the whole informatik.uni... was down a few minutes ago. It's hard to solve exercise 4 without the servers running. '''Eric 31Oct09 19:00'''
+Hi Claudius, yes, the inequality actually holds for "<", too. '''Hannah 23Nov09 4:05pm'''
-Line 38:
+Line 31:
-Hi, I wonder whether those three servers for exercise 4 of sheet 2 online or not? '''Zhongjie 31Oct09 11:49'''
+In Exercise 5, could it be, that the inequality holds actually for "<", too? Because k may not be 0 and k <= m. Or am I wrong? '''Claudius 23Nov09 3:58pm'''
-Line 40:
+Line 33:
-To all: Please note that (almost) everybody will get +2 points for Exercise Sheet 1 (previously I did not assign any points to the first and the last problem). '''Marjan 31Oct09 10:03'''
+Hi Marius + all: I mean only "real" comparisons, that is, comparisons of list elements. If you use sentinels, then sentinels count as list elements. '''Hannah 22Nov 10:38'''
-Line 42:
+Line 35:
-To Triatmoko and Ahmed + all: You obviously haven't created your wiki page. Please go to your link and click "Create a new page" and then "Save changes". You should then upload your solutions there and put the link on the wiki like everybody else did. If it is still not clear please ask some of your fellow students. I don't remember if it is mentioned, but you can skip one Exercise Sheet without losing any points. '''Marjan 31Oct09 09:38'''
+Hi, just to tie in with the last question: Do you mean by comparisons only the comparisons of list elements (such as A[i] <= B[j]) or are guarding conditional comparisons also to be counted (such as "if (i < list1.size())"? '''Marius 11/22/2009 10:32pm'''
-Line 44:
+Line 37:
-Hi, I have some question about my exercise Sheet 1, I saw in my exercise page that my name, my upload solution and code in gray color and other persons in blue color. I try to click my attachment file in my exercise page, and I have some message that there are no attachment. any body know about this issue?because my exercise uploaded on Monday 26oct '''Triatmoko 30Oct09 22:18'''
+Hi Björn + all: good question. One simple way to deal with this would be to use a comma expression like ''(numComparisons++, A[i] <= B[j])''. A list of expressions, separated by commas, gets evaluated from left to right, and the value of the whole expression is simply the value of the last expression. At least that is the case in C++, but most programming languages have the same or a very similar construct. An alternative that essentially does the same thing, would be to do the comparison in a separate function, which, besides doing the comparison, also increase the comparisons counter. If you do that, you should absolutely make that that function is ''inline'' though since otherwise you pay the price of a function call for every comparison, which is likely to spoil your performance. '''Hannah 22Nov09 6:30pm'''
-Line 46:
+Line 39:
-Hi Björn + all: For Exercise 4 from Exercise Sheet 1 you had to write code that is at least able to process 2-word queries. If your code can indeed only handle 2-word queries and not an arbitrary number of query words, that is also fine for this exercise, you won't get less points because of that. Your second question is also very valid. You should put the various functionalities into modules / classes of their own, so that you can easily combine them for the three different binaries required for Exercises 1 - 3. Each of your three programs will then be quite short, just putting together the right things. I should have added it to my list of evil coding NoNos: never ever duplicate code, but instead put it in a class / module of its own. I hope this answers your questions, if not please ask again. Sorry for the late answer, but I was super busy until now, hardly had time to breathe. '''Hannah 30Oct09 19:08'''
+Hi, is there a good way to count comparisons if locial connectives are used? As far as I know, && will only check until the first comparison is false and || if it until some is true. Unfortunately I cannot think of a way to count those comparisons without rewriting the code and nesting the if statements. I'm not too familiar with compilers but I hope that this changes to the code won't effect the performance.'''Björn 22Nov09 6:20pm'''
-Line 48:
+Line 41:
-I have a question concerning exercise 2. There was no concrete task to produce "query processing code" on ex sheet 1. Are there any requirements that have to be fulfilled? Should it be able to handle two word queries? n-word queries? Additionally there is something else I want to ask: I think it surely isn't bad practice to write a more generic webserver and use it for exercises 1-3. Apart from that it says "change your code" some times in the exercises. How should your submission behave w.r.t. the exercises? Different src files / executables for each exercise? One program that solve each exercises depending on startup parameters? Anything else? '''Björn 30ct09 2:25pm'''
+Hi Claudius + all: you are right, it's not very precise, the intended meaning is something like "compare the tables for the two algorithms" or "for each of the four measurements, compare the tables for the two algorithms". '''Hannah 22Nov09 3:39pm'''
-Line 50:
+Line 43:
-I now reorganized the page. Old stuff went to separate pages (links above). The idea is that the front page is always for the current lecture / exercises. The problem with your exercise page should be solved now, Ivo. '''Hannah 30ct09 00:05am'''
+I am confused: In Exercise 4, you write: "Compare the two tables...", but when I take both algorithms and all 4 measurements (running time, ratio, ...), I get 8 4x4 tables (for all the combinations of 10^3^, 10^4^, 10^5^, 10^6^). What I'm doing wrong? '''Claudius 22Nov09 3:30pm'''
-Line 52:
+Line 45:
-Having problems to access my exercise page after loging in - IvoChichkovExercises, '''Ivo 29Oct 22:56pm'''
+Hi Björn + all: very good question and thanks for pointing that out. You should indeed always search the elements of the smaller list in the larger list, and the first thing your (advanced) list intersection algorithm should do is figure out which of the two lists is the smaller one. That is, your 4 x 4 table will be ''symmetric'', and actually only contains 10 different values (the 6 below the diagonal, which are the same as the ones above the diagonal, and the 4 on the diagonal). '''Hannah 22Nov09 2:50pm'''
-Line 54:
+Line 47:
-Sorry to bother you. I added the Link to exercise sheet 2 with the linked pdf. Needed this to find the sheet as fast as possible. '''Marius 29Oct 10:04 p.m.'''
+For the exp/bin-search intersection algorithm it clearly matters that it searches for the elements of the smaller list in the larger one. A good implementation will certainly take care of that. Should our implementation also do that or ignore it in order to get 16 measurements that are really different? '''Björn 22Nov09 1:00pm'''
-Line 56:
+Line 49:
-Is there a webpage for exercise sheet 2 somewhere? '''Johannes 29Oct 07:45 pm'''
+Ok, no problem, I'm happy when it's clear now. '''Hannah 22Nov09 0:24am'''

You're right, I misread your comment, sorry. I was thinking of 10MB per lists processed in 1 second, resulting in 20MB/s and was wondering where the 100MB/s are coming from. '''Thomas 22Nov09 00:20am'''

Hi Thomas, I am at a loss of words here. I am saying a car is driving 20 kilometers and it needs 10 minutes for that, so its average speed was 120 km / hours. And you are saying how can the speed of a car be 120 km / hours, when it only drives 20 kilometers. Well, what should I say. Besides, in my example I clearly said that the two lists ''together'' occupy 10 MB, not 10 MB per list. Please read again what I wrote. '''Hannah 22Nov09 0:16am'''

Why should two lists of 10MB size result in 100MB processed, if each list is only iterated over once to do the intersection (O(m+n) complexity)? The data processed after all is just 20MB, no matter how the algorithm is implemented (even if it iterates a thousand times over every list, it still just processed 20MB of data). '''Thomas 21Nov09 12:00am'''

By the way, whenever I talk about "lists" here or on the exercise sheets or in the lecture, I am not referring to a particular data structure (in particular I am NOT talking about a linked list), but "list of elements"  is just "series of elements". And well, "inverted list" is just common terminology. To implement a "list of doc ids" or anything like that you should of course always use an array or a vector or a data structure like that. '''Hannah 21Nov09 8:30pm'''

Hi Marius + all, let me explain it by an example. Your two input lists occupy a certain amount of memory. Every programming language has built-in functions for this. For example, if your list entries are ints, then for C++ you can use sizeof(int) to get the number of bytes occupied by one entry.  Multiply by the number of list elements to get the number of bytes occupied by one list. One Megabyte (MB) is 1024 * 1024 bytes. Now assume your two lists together occupy 10 MB. Assume your code takes 0.1 seconds to intersect these two lists. Then the "MB processed per second" is 100 MB / second. '''Hannah 21Nov09 8:26pm'''

Hi, in exercise 3, what do you mean by "MB processed per second"? Is a MB the equivalent to 4096 processed integers? And when is a MB to be considered as processed? When it's written to the intersected list or in the comparisons, already? '''Marius 21Nov09 7:33pm'''

The slides + all my hand-writing on it are now online, see the link ''Recording Lecture 5 (no audio)'' above. '''Hannah 20Nov09 3:24am'''

The recording of todays lecture again did not work. I am very sorry for that (and very angry that there are so many problems with this software). Anyway, the end result of the lecture, that is the slides with all the writing on it are available and I will put them online as soon as possible. '''Hannah 19Nov09 11:23pm'''

There is a typo in Exercise 5 of the new sheet. The two occurrences of ''n'' should be ''m''. '''Hannah 19Nov09 11:22pm'''