Differences between revisions 289 and 361 (spanning 72 versions)

Welcome to the Wiki page of the course Search Engines, WS 2009 / 2010. Lecturer: Hannah Bast. Tutorials: Marjan Celikik. Course web page: click here.

Here are PDFs of the slides of the lectures so far: Lecture 1, Lecture 2, Lecture 3, Lecture 4, Lecture 5, Lecture 6.

Here are .lpd files of the recordings of the lectures so far (except Lecture 2, where we had problems with the microphone): Recording Lecture 1, Recording Lecture 3, Recording Lecture 4, Recording Lecture 5 (no audio), Recording Lecture 6 (with audio for a change).

Here are PDFs of the exercise sheets so far: Exercise Sheet 1, Exercise Sheet 2, Exercise Sheet 3, Exercise Sheet 4, Exercise Sheet 5, Exercise Sheet 6.

Here are your solutions and comments on the previous exercise sheets: Solutions and Comments 1, Solutions and Comments 2, Solutions and Comments 3, Solutions and Comments 4, Solutions and Comments 5.

Exercise Sheet 6

The recordings of all lectures are now available, see above. Lecture 2 is missing because we had technical problems there. To play the recordings (it's .lpd files) you need the Lecturnity Player. You can download the player for free here.

Here are the rules for the exercises as explained in Lecture 2.

Here you can upload your solutions for Exercise Sheet 6.

Questions or comments below this line, most recent on top please

Hi Björn + all, it doesn't really matter, but I (and probably most humans) find ratios > 1 more intuitive. Just compare 8 and 0.125, which one is more catchy. Hannah 30Nov09 11:59pm

Does it matter which way round we express the ratios? Depending on how we build the quotient, we get different values (all smaller or all greater 1). Or is that up to us? Should be possible to compare our results anyway, I assume. Björn 30Nov 23:36

To Björn: You can assume you have gaps of arbitrary size. Marjan 30Nov 14:43

To Claudius: The whole collection with all words. Marjan 30Nov 14:43

Is there a limit on how large gaps may be in exercise 3? I'm not sure for which case the two entropies actually fulfill the equation. Gaps that "make sense" (ther sum is not larger than n-1), gaps that are at most n, or arbitrary gaps? Björn 30Nov09 14:31

In Exercise 2, you ask for the costs of scanning the inv. lists of all words in the "collection". Do you mean the collection of words, matching the prefix or the the whole collection with all words in the inv. index? Claudius 30Nov09 2:16pm

Hi Dragos, just three-letter prefixes are fine. I have no plans yet for future exercises with a "*" in the middle. Hannah 29Nov09 11:10pm

For exercise 1, should we allow the "*" to be in any place ? Or just three letter prefix is sufficient ? I am asking because it would be good to know if we might need on later Exercise Sheets searches that allow multiple "*" in different positions, so that we do it now. Dragos 29 Nov 22:55

Hi Björn, by ratio I simply mean the quotient, that is, how much bigger the one is then the other. For example, if, for a particular prefix, the total size from (1) is one million, and the size from (2) is ten thousand, then, for that prefix, the ratio between the two is one hundred. Hannah 29Nov09 7:48pm

Hello, I wonder what's meant with the ratio demanded in exercise 1. If i have n lists with a maximum length of "a" and a total length of "b". Isn't the ratio something like "a:b"? At least that is what I thought. But adding a colon does not seem to be sufficient for a part of the exercise. Sorry for the meaningless question but I don't want to miss points because I'm not sure how to understand the word ratio. Björn 11-29 19:39

To all: about the selection of the ten prefixes. The idea was that you pick a meaningful variety by hand, that is, such prefixes which one could imagine that one would really type them. The exact selection doesn't really matter, but do avoid extreme cases like a prefix yzq with one completion and an inverted list of three doc ids. Hannah 29Nov09 6:28pm

To Marius + all: yes, I am sorry, "cost" was very imprecise here, I actually simply meant the time your code takes. Hannah 29Nov09 6:19pm

So you say that you mean by "costs" the running time? Or do you understand something else when you say we have to calculate the costs? Marius 11/29/09 4:58pm

Notice about Problem 2: You should use precise timers when measuring the running times. If your collection is very small and you round up your times, it's easy to get 0 ms when merging or scanning the inverted lists. I recommend using microsecond scale. Marjan 29Nov09 16:47

To Florian: Yes, you can do anything you want to find those words (as long as you produce the required outputs). Marjan 29Nov09 16:44

For exercise 1, should we use one of the methods presented in the lecture to find all words in the collection with the prefixes or can we do just anything to get them (though it might not be as efficient)? Florian 29Nov09 03:51pm

When you scan, please make sure that you do something very simple with the elements, like summing up all doc ids, and then outputting that sum. Otherwise a clever compiler might figure out that it can remove the whole loop, because it is not producing a result that is used anywhere. Hannah 28Nov09 11:48pm

To Mirko: Yes, scanning means one pass over the elements. Marjan 28Nov09 19:19

Hi, about exercise2: is by "scanning" meant that one looks at every element exactly once? (=> costs of scanning a list are just the size of the list) Mirko 28Nov, 19:12

-  ⇤ ← Revision 289 as of 2009-11-16 23:43:08 → 
  Size: 11266
  Editor: dslb-188-098-089-088
  Comment:
+   ← Revision 361 as of 2009-11-30 23:59:57 → ⇥
  Size: 7114
  Editor: Hannah Bast
  Comment:
-Deletions are marked like this.
+Additions are marked like this.
 Line 3:
-Here are PDFs of the slides of the lectures so far: [[attachment:SearchEnginesWS0910/lecture-1.pdf|Lecture 1]], [[attachment:SearchEnginesWS0910/lecture-2.pdf|Lecture 2]], [[attachment:SearchEnginesWS0910/lecture-3.pdf|Lecture 3]], [[attachment:SearchEnginesWS0910/lecture-4.pdf|Lecture 4]].
+Here are PDFs of the slides of the lectures so far: [[attachment:SearchEnginesWS0910/lecture-1.pdf|Lecture 1]], [[attachment:SearchEnginesWS0910/lecture-2.pdf|Lecture 2]], [[attachment:SearchEnginesWS0910/lecture-3.pdf|Lecture 3]], [[attachment:SearchEnginesWS0910/lecture-4.pdf|Lecture 4]], [[attachment:SearchEnginesWS0910/lecture-5.pdf|Lecture 5]], [[attachment:SearchEnginesWS0910/lecture-6.pdf|Lecture 6]].
 Line 5:
-Here are .lpd files of the recordings of the lectures so far (except Lecture 2, where we had problems with the microphone): [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-1.lpd|Lecture 1]] [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-3.lpd|Lecture 3]] [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-4.lpd|Lecture 4]].
+Here are .lpd files of the recordings of the lectures so far (except Lecture 2, where we had problems with the microphone): [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-1.lpd|Recording Lecture 1]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-3.lpd|Recording Lecture 3]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-4.lpd|Recording Lecture 4]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-5.lpd|Recording Lecture 5 (no audio)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-6.lpd|Recording Lecture 6 (with audio for a change)]].
 Line 7:
-Here are PDFs of the exercise sheets so far: [[attachment:SearchEnginesWS0910/exercise-1.pdf|Exercise Sheet 1]], [[attachment:SearchEnginesWS0910/exercise-2.pdf|Exercise Sheet 2]], [[attachment:SearchEnginesWS0910/exercise-3.pdf|Exercise Sheet 3]], [[attachment:SearchEnginesWS0910/exercise-4.pdf|Exercise Sheet 4]].
+Here are PDFs of the exercise sheets so far: [[attachment:SearchEnginesWS0910/exercise-1.pdf|Exercise Sheet 1]], [[attachment:SearchEnginesWS0910/exercise-2.pdf|Exercise Sheet 2]], [[attachment:SearchEnginesWS0910/exercise-3.pdf|Exercise Sheet 3]], [[attachment:SearchEnginesWS0910/exercise-4.pdf|Exercise Sheet 4]], [[attachment:SearchEnginesWS0910/exercise-5.pdf|Exercise Sheet 5]], [[attachment:SearchEnginesWS0910/exercise-6.pdf|Exercise Sheet 6]].
 Line 9:
-Here are your solutions and comments on the previous exercise sheets: [[SearchEnginesWS0910/ExerciseSheet1|Solutions and Comments 1]], [[SearchEnginesWS0910/ExerciseSheet2|Solutions and Comments 2]], [[SearchEnginesWS0910/ExerciseSheet3|Solutions and Comments 3]]
+Here are your solutions and comments on the previous exercise sheets: [[SearchEnginesWS0910/ExerciseSheet1|Solutions and Comments 1]], [[SearchEnginesWS0910/ExerciseSheet2|Solutions and Comments 2]], [[SearchEnginesWS0910/ExerciseSheet3|Solutions and Comments 3]], [[SearchEnginesWS0910/ExerciseSheet4|Solutions and Comments 4]], [[SearchEnginesWS0910/ExerciseSheet5|Solutions and Comments 5]].
 Line 11:
-= Exercise Sheet 3 =
+= Exercise Sheet 6 =
-Line 17:
+Line 16:
-[[SearchEnginesWS0910/ExerciseSheet4|Here you can upload your solutions for Exercise Sheet 4]].
+[[SearchEnginesWS0910/ExerciseSheet6|Here you can upload your solutions for Exercise Sheet 6]].
-Line 20:
+Line 19:
-I accidentally loaded the ZIP file instead of the PDF(My eyes are heavy...). But i couldn't overwrite the pdf file on the wiki now. So I loaded it in my_name_ex1.pdf. I hope this won't occur any problems. '''Jonas 16Nov09 11.41pm''
-Line 22:
+Line 20:
-To Florian: This was very well explained in the lecture (it is still there on the slides). It means the speed-up achieved by reading and decompressing your compressed list compared to when the list is read in uncompressed format. As your inverted list is randomly generated, you might have different speed-ups for different inverted lists. In order to have any speed-up, of course, your compression scheme should really work and reduce the size of the inverted list + should not be too inefficient. One extreme example for inefficient code would be using strings instead of bit-shifting for your coding. '''Marjan 8:38pm'''
+Hi Björn + all, it doesn't really matter, but I (and probably most humans) find ratios > 1 more intuitive. Just compare 8 and 0.125, which one is more catchy. '''Hannah 30Nov09 11:59pm'''
-Line 24:
+Line 22:
-What is meant with the best speedup for Exercise 4 which we should add on the solution page? The best speedup for just reading the data from disk or the best speedup for reading and decompressing the list? '''Florian 16Nov09 8:12pm'''
+Does it matter which way round we express the ratios? Depending on how we build the quotient, we get different values (all smaller or all greater 1). Or is that up to us? Should be possible to compare our results anyway, I assume. '''Björn 30Nov 23:36'''
-Line 26:
+Line 24:
-To Dragos: Gap encoding + Elias code is not trivial at all and you can use it. Gap encoding + Byte code is also fine. '''Marjan 16Nov09 2:15pm'''
+To Björn: You can assume you have gaps of arbitrary size. '''Marjan 30Nov 14:43'''
-Line 28:
+Line 26:
-Is it ok to use Elias code for the compressing from Exercise 4(2)? Or it's too 'trivial'?:)'''Dragos 16Nov09 14.14'''
+To Claudius: The whole collection with all words. '''Marjan 30Nov 14:43'''
-Line 30:
+Line 28:
-To Florian + all: A single number doesn't really make sense, does it? For the discussion part of the exercise, think in terms of probability distributions, as we did in the lecture (when discussing which probability distribution a certain encoding is optimal for). For the example, give a sequence of numbers. '''Hannah 15Nov09 9:41pm'''
+Is there a limit on how large gaps may be in exercise 3? I'm not sure for which case the two entropies actually fulfill the equation. Gaps that "make sense" (ther sum is not larger than n-1), gaps that are at most n, or arbitrary gaps? '''Björn 30Nov09 14:31'''
-Line 32:
+Line 30:
-To Johannes + all: Yes, good idea. I will anyway at some point in the next weeks hand out a sheet where you have the opportunity to give feedback on the lectures and the exercises. But yes, why not give me that feedback on the current exercise sheet already now. Let me refine your proposal a bit. It would be useful for me if you would provide ''two'' grades: one for the hardness (pick one of: too hard for me, challenging but feasible, not very hard) and one for the amount of work (pick one of: too much for me, a lot but feasible, not more than for other lectures). It would also be helpful if you would not just give a grade but put your opinion into words. It's no problem if you are critical but please stay polite. I will take your comments seriously, don't worry. '''Hannah 15Nov09 9:33am'''
+In Exercise 2, you ask for the costs of scanning the inv. lists of all words in the "collection". Do you mean the collection of words, matching the prefix or the the whole collection with all words in the inv. index? '''Claudius 30Nov09 2:16pm'''
-Line 34:
+Line 32:
-In Exercise we should "give an example of data for which k = x is the best choice". What is meant by "an example of data" here? A single number or a set of numbers or anything else? '''Florian 15Nov09 8:52pm'''
+Hi Dragos, just three-letter prefixes are fine. I have no plans yet for future exercises with a "*" in the middle. '''Hannah 29Nov09 11:10pm'''
-Line 36:
+Line 34:
-I'd like to suggest that everyone grades the exercise sheet from 1 (for "way to easy") to 10 ("way to hard"). This might provide the professor with the feedback she asks for in the lecture. How about that idea? '''Johannes 2009-11-15T20:40L'''
+For exercise 1, should we allow the "*" to be in any place ? Or just three letter prefix is sufficient ? I am asking because it would be good to know if we might need on later Exercise Sheets searches that allow multiple "*" in different positions, so that we do it now. '''Dragos 29 Nov 22:55'''
-Line 38:
+Line 36:
-To Florian + all: yes, sorry, I forgot to mention this in the lecture. Marjan already explained how to clear the disk cache. Let me add to this an explanation what the disk cache actually is. Whenever you read a (part of a) file from disk, the operating system of your computed will use whatever memory is currently unused to store that (part of the) file there. When you read it again and the (part of the) file hasn't changed and the memory used to store it has not been used otherwise in the meantime, than that data is read right from memory, which is much faster than reading it from disk. Usually that effect is desirable, because it speeds up things, but when you do experiments, it is undesirable, because it leads to unrealistically good running times, especially when carrying out an experiment many times in a row. '''Hannah 15Nov09 8:10pm'''
+Hi Björn, by ratio I simply mean the quotient, that is, how much bigger the one is then the other. For example, if, for a particular prefix, the total size from (1) is one million, and the size from (2) is ten thousand, then, for that prefix, the ratio between the two is one hundred. '''Hannah 29Nov09 7:48pm'''
-Line 40:
+Line 38:
-To Florian: Indeed, we were running out of time and there was no room for this in the lecture. I can suggest to you few ways how to clear the disk cache: before carrying out your final experiment, read a large amount of data (let's say close to the amount of RAM you have) from disk - this will ensure that your data (the inverted list) is cleared from the disk cache and replaced by something else (thus an actual reading from disk get's timed, and not reading from RAM). Another way is to restart your computer before doing the timing. '''Marjan 15Nov09 7:27pm'''
+Hello, I wonder what's meant with the ratio demanded in exercise 1. If i have n lists with a maximum length of "a" and a total length of "b". Isn't the ratio something like "a:b"? At least that is what I thought. But adding a colon does not seem to be sufficient for a part of the exercise. Sorry for the meaningless question but I don't want to miss points because I'm not sure how to understand the word ratio. '''Björn 11-29 19:39'''
-Line 42:
+Line 40:
-In exercise 4 it says: "Important note: Whenver you measure running times for reading data from disk, you have to clear the disk cache before, as discussed in the lecture". I think that this was not discussed in the lecture? What do we have to do here? '''Florian 15Nov09 7:15pm'''
+To all: about the selection of the ten prefixes. The idea was that you pick a meaningful variety by hand, that is, such prefixes which one could imagine that one would really type them. The exact selection doesn't really matter, but do avoid extreme cases like a prefix ''yzq'' with one completion and an inverted list of three doc ids. '''Hannah 29Nov09 6:28pm'''
-Line 44:
+Line 42:
-@Bit shifting: The syntax for that is actually the same, irrespectively of whether you use Java, C++, perl, python, or whatever. The >> operator shifts to the right, the << operator shifts to the left, the & operator ands the bits of the two operands and the | operator ors the bits of the two operands. Very simple. You will also find zillions of example programs on the web by typing something like ''java bit shifting'' into Google or whatever your favorite search engine is. '''Hannah 15Nov09 1:16'''
+To Marius + all: yes, I am sorry, "cost" was very imprecise here, I actually simply meant the time your code takes. '''Hannah 29Nov09 6:19pm'''
-Line 46:
+Line 44:
-Hi Marius + all: For Exercise 4, an inverted list of size m with doc ids from the range [1..n] is simply a sorted list of m numbers from the range [1..n]. I leave it to you, whether your lists potentially contain duplicates (as in  3, 5, 5, 8, 12, ...) or whether you generate them in a way that they don't contain duplicates (as in 3, 5, 8, 17, ...). It doesn't really matter for the exercise whether your list has duplicated or not. In any case, consider simple flat lists like in the two examples I gave (and like all the examples I gave in this and past lectures), not lists of lists or anything. '''Hannah 15Nov09 1:12am'''
+So you say that you mean by "costs" the running time? Or do you understand something else when you say we have to calculate the costs? '''Marius 11/29/09 4:58pm'''
-Line 48:
+Line 46:
-@Mirko: Sure, but an inverted list is a list of words where the Doc-IDs are attached to each words in which the words occur. So for Example: If word no. 5 occurs in Doc1, Doc2 and Doc3 and word no. 2 occurs in Doc5, the list would look like: 5 -> Doc1, Doc2, Doc3; 2 -> Doc5. Or am I mistaken? My question then is, how long should these attached lists be in average case? I mean, one could imagine that we got 1mil. documents over 3 words, so these lists could get very large...
+Notice about Problem 2: You should use precise timers when measuring the running times. If your collection is very small and you round up your times, it's easy to get 0 ms when merging or scanning the inverted lists. I recommend using microsecond scale. '''Marjan 29Nov09 16:47'''
-Line 50:
+Line 48:
-EDIT: Oh ok. Now, I see your point. It's not an index, it's a list. Okay. So, what is an inverted list with Doc-IDs, then?
+To Florian: Yes, you can do anything you want to find those words (as long as you produce the required outputs). '''Marjan 29Nov09 16:44'''
-Line 52:
+Line 50:
-EDIT EDIT: And to your question, Mirko, take a look at http://snippets.dzone.com/posts/show/93. Especially at Comment no. 2. Maybe this helps... I think, Java supports StreamWriters/Readers that are able to write/read bytes. '''Marius 11/14/2009 08:46pm'''
+For exercise 1, should we use one of the methods presented in the lecture to find all words in the collection with the prefixes or can we do just anything to get them (though it might not be as efficient)? '''Florian 29Nov09 03:51pm'''
-Line 54:
+Line 52:
-EDIT EDIT EDIT: Sorry, me again. Well, I bothered Wikipedia which redirects from http://en.wikipedia.org/wiki/Inverted_list to Inverted Index. So it seems to me, this is being used as a synonym. Actually, I think I'm confused enough, now. I'll better wait for any responses... ;-) '''Marius 11/14/2009 9:08 pm'''
+When you scan, please make sure that you do something very simple with the elements, like summing up all doc ids, and then outputting that sum. Otherwise a clever compiler might figure out that it can remove the whole loop, because it is not producing a result that is used anywhere. '''Hannah 28Nov09 11:48pm'''
-Line 56:
+Line 54:
-@ Marius: i think we are supposed to generate one inverted __list__ of size m, with doc ids from 1..n (therefore n>=m, because no duplicates?).
+To Mirko: Yes, scanning means one pass over the elements. '''Marjan 28Nov09 19:19'''
-Line 58:
+Line 56:
-Now a question from my side: ex.4, programming the compression in __java__, is there any __good__ tutorial about how to handle the bit-stuff? (otherwise, i think, it would cost me too much time..) '''Mirko 14Nov09, 19:18'''

Hi, do you have any suggestions what the best numbers for m and n in exercise 4 should look like? Or are we supposed to mess around a bit with ints and longs? And: How long should the list of documents in the inverted index be? '''Marius 14Nov09 6:40pm'''

And just to clarify what a single-cycle permutation is. Here is an example for an array of size 5 with a permutation that is a single cycle: 5 4 1 3 2. Why single cycle? Well, A[1] = 5, A[5] = 2, A[2] = 4, A[4] = 3, A[3] = 1. (My indices in this example are 1,...,5 and not 0,...,4.) Here is an example of a permutation with three cycles: 2 1 4 3 5. The first cycle is A[1] = 2, A[2] =1. The second cycle is A[3] = 4, A[4] = 3. The third cycle is A[5] = 5. '''Hannah 12Nov09 8:04pm'''

Hi Daniel + all, I don't quite understand your question and your example (if your array is 1 5 3 4 2, why is A[1] = 3?). In case you refer to the requirement of the exercise that the permutation consists only of a single cycle. That is because your code should go over each element exactly once (it should, of course, stop after n iterations, where n is the size of the array). If your permutation has more than one cycle, it is hard to achieve that. Also note that for both (1) and (2), the sum of the array values should be sum_i=1,...,n i = n * (n+1) / 2. '''Hannah 12Nov09 7:54pm'''

Hi, I just looked at the new exercise sheet 4, in exercise 1 we should generate a permutation and sum the resulting array up, am I wrong or doesn't iterating method two iterate throw the whole array in every situation. for ex.: n= 5 permutation: 1 5 3 4 2, then A[1] = 3, A[A[1]]= A[3] = 1, A[1] = 3 ... '''Daniel 12Nov09 19:44pm'''
+Hi, about exercise2: is by "scanning" meant that one looks at every element exactly once? (=> costs of scanning a list are just the size of the list) '''Mirko 28Nov, 19:12'''