Differences between revisions 74 and 281 (spanning 207 versions)

Welcome to the Wiki page of the course Search Engines, WS 2009 / 2010. Lecturer: Hannah Bast. Tutorials: Marjan Celikik. Course web page: click here.

Here are PDFs of the slides of the lectures so far: Lecture 1, Lecture 2, Lecture 3, Lecture 4.

Here are .lpd files of the recordings of the lectures so far (except Lecture 2, where we had problems with the microphone): Lecture 1 Lecture 3 Lecture 4.

Here are PDFs of the exercise sheets so far: Exercise Sheet 1, Exercise Sheet 2, Exercise Sheet 3, Exercise Sheet 4.

Here are your solutions and comments on the previous exercise sheets: Solutions and Comments 1, Solutions and Comments 2, Solutions and Comments 3

Exercise Sheet 3

The recordings of all lectures are now available, see above. Lecture 2 is missing because we had technical problems there. To play the recordings (it's .lpd files) you need the Lecturnity Player. You can download the player for free here.

Here are the rules for the exercises as explained in Lecture 2.

Here you can upload your solutions for Exercise Sheet 4.

Questions or comments below this line, most recent on top please

To Florian + all: A single number doesn't really make sense, does it? For the discussion part of the exercise, think in terms of probability distributions, as we did in the lecture (when discussing which probability distribution a certain encoding is optimal for). For the example, give a sequence of numbers. Hannah 15Nov09 9:41pm

To Johannes + all: Yes, good idea. I will anyway at some point in the next weeks hand out a sheet where you have the opportunity to give feedback on the lectures and the exercises. But yes, why not give me that feedback on the current exercise sheet already now. Let me refine your proposal a bit. It would be useful for me if you would provide two grades: one for the hardness (pick one of: too hard for me, challenging but feasible, not very hard) and one for the amount of work (pick one of: too much for me, a lot but feasible, not more than for other lectures). It would also be helpful if you would not just give a grade but put your opinion into words. It's no problem if you are critical but please stay polite. I will take your comments seriously, don't worry. Hannah 15Nov09 9:33am

In Exercise we should "give an example of data for which k = x is the best choice". What is meant by "an example of data" here? A single number or a set of numbers or anything else? Florian 15Nov09 8:52pm

I'd like to suggest that everyone grades the exercise sheet from 1 (for "way to easy") to 10 ("way to hard"). This might provide the professor with the feedback she asks for in the lecture. How about that idea? Johannes 2009-11-15T20:40L

To Florian + all: yes, sorry, I forgot to mention this in the lecture. Marjan already explained how to clear the disk cache. Let me add to this an explanation what the disk cache actually is. Whenever you read a (part of a) file from disk, the operating system of your computed will use whatever memory is currently unused to store that (part of the) file there. When you read it again and the (part of the) file hasn't changed and the memory used to store it has not been used otherwise in the meantime, than that data is read right from memory, which is much faster than reading it from disk. Usually that effect is desirable, because it speeds up things, but when you do experiments, it is undesirable, because it leads to unrealistically good running times, especially when carrying out an experiment many times in a row. Hannah 15Nov09 8:10pm

To Florian: Indeed, we were running out of time and there was no room for this in the lecture. I can suggest to you few ways how to clear the disk cache: before carrying out your final experiment, read a large amount of data (let's say close to the amount of RAM you have) from disk - this will ensure that your data (the inverted list) is cleared from the disk cache and replaced by something else (thus an actual reading from disk get's timed, and not reading from RAM). Another way is to restart your computer before doing the timing. Marjan 15Nov09 7:27pm

In exercise 4 it says: "Important note: Whenver you measure running times for reading data from disk, you have to clear the disk cache before, as discussed in the lecture". I think that this was not discussed in the lecture? What do we have to do here? Florian 15Nov09 7:15pm

@Bit shifting: The syntax for that is actually the same, irrespectively of whether you use Java, C++, perl, python, or whatever. The >> operator shifts to the right, the << operator shifts to the left, the & operator ands the bits of the two operands and the | operator ors the bits of the two operands. Very simple. You will also find zillions of example programs on the web by typing something like java bit shifting into Google or whatever your favorite search engine is. Hannah 15Nov09 1:16

Hi Marius + all: For Exercise 4, an inverted list of size m with doc ids from the range [1..n] is simply a sorted list of m numbers from the range [1..n]. I leave it to you, whether your lists potentially contain duplicates (as in 3, 5, 5, 8, 12, ...) or whether you generate them in a way that they don't contain duplicates (as in 3, 5, 8, 17, ...). It doesn't really matter for the exercise whether your list has duplicated or not. In any case, consider simple flat lists like in the two examples I gave (and like all the examples I gave in this and past lectures), not lists of lists or anything. Hannah 15Nov09 1:12am

@Mirko: Sure, but an inverted list is a list of words where the Doc-IDs are attached to each words in which the words occur. So for Example: If word no. 5 occurs in Doc1, Doc2 and Doc3 and word no. 2 occurs in Doc5, the list would look like: 5 -> Doc1, Doc2, Doc3; 2 -> Doc5. Or am I mistaken? My question then is, how long should these attached lists be in average case? I mean, one could imagine that we got 1mil. documents over 3 words, so these lists could get very large...

EDIT: Oh ok. Now, I see your point. It's not an index, it's a list. Okay. So, what is an inverted list with Doc-IDs, then?

EDIT EDIT: And to your question, Mirko, take a look at http://snippets.dzone.com/posts/show/93. Especially at Comment no. 2. Maybe this helps... I think, Java supports StreamWriters/Readers that are able to write/read bytes. Marius 11/14/2009 08:46pm

EDIT EDIT EDIT: Sorry, me again. Well, I bothered Wikipedia which redirects from http://en.wikipedia.org/wiki/Inverted_list to Inverted Index. So it seems to me, this is being used as a synonym. Actually, I think I'm confused enough, now. I'll better wait for any responses... Marius 11/14/2009 9:08 pm

@ Marius: i think we are supposed to generate one inverted list of size m, with doc ids from 1..n (therefore n>=m, because no duplicates?).

Now a question from my side: ex.4, programming the compression in java, is there any good tutorial about how to handle the bit-stuff? (otherwise, i think, it would cost me too much time..) Mirko 14Nov09, 19:18

Hi, do you have any suggestions what the best numbers for m and n in exercise 4 should look like? Or are we supposed to mess around a bit with ints and longs? And: How long should the list of documents in the inverted index be? Marius 14Nov09 6:40pm

And just to clarify what a single-cycle permutation is. Here is an example for an array of size 5 with a permutation that is a single cycle: 5 4 1 3 2. Why single cycle? Well, A[1] = 5, A[5] = 2, A[2] = 4, A[4] = 3, A[3] = 1. (My indices in this example are 1,...,5 and not 0,...,4.) Here is an example of a permutation with three cycles: 2 1 4 3 5. The first cycle is A[1] = 2, A[2] =1. The second cycle is A[3] = 4, A[4] = 3. The third cycle is A[5] = 5. Hannah 12Nov09 8:04pm

Hi Daniel + all, I don't quite understand your question and your example (if your array is 1 5 3 4 2, why is A[1] = 3?). In case you refer to the requirement of the exercise that the permutation consists only of a single cycle. That is because your code should go over each element exactly once (it should, of course, stop after n iterations, where n is the size of the array). If your permutation has more than one cycle, it is hard to achieve that. Also note that for both (1) and (2), the sum of the array values should be sum_i=1,...,n i = n * (n+1) / 2. Hannah 12Nov09 7:54pm

Hi, I just looked at the new exercise sheet 4, in exercise 1 we should generate a permutation and sum the resulting array up, am I wrong or doesn't iterating method two iterate throw the whole array in every situation. for ex.: n= 5 permutation: 1 5 3 4 2, then A[1] = 3, A[A[1]]= A[3] = 1, A[1] = 3 ... Daniel 12Nov09 19:44pm

-  ⇤ ← Revision 74 as of 2009-10-26 22:46:25 → 
  Size: 6920
  Editor: dip-120-083
  Comment:
+   ← Revision 281 as of 2009-11-16 13:30:11 → ⇥
  Size: 9898
  Editor: HSI-KBW-078-042-029-035
  Comment:
-Deletions are marked like this.
+Additions are marked like this.
 Line 3:
-= Exercise Sheet 1 =
[[attachment:SearchEnginesWS0910/ExerciseSheet1/lecture-1.pdf|Here is a PDF of the slides of Lecture 1]].
+Here are PDFs of the slides of the lectures so far: [[attachment:SearchEnginesWS0910/lecture-1.pdf|Lecture 1]], [[attachment:SearchEnginesWS0910/lecture-2.pdf|Lecture 2]], [[attachment:SearchEnginesWS0910/lecture-3.pdf|Lecture 3]], [[attachment:SearchEnginesWS0910/lecture-4.pdf|Lecture 4]].
-Line 6:
+Line 5:
-[[attachment:SearchEnginesWS0910/ExerciseSheet1/exercise-1.pdf|Here is a PDF of Exercise Sheet 1]].
+Here are .lpd files of the recordings of the lectures so far (except Lecture 2, where we had problems with the microphone): [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-1.lpd|Lecture 1]] [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-3.lpd|Lecture 3]] [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-4.lpd|Lecture 4]].
-Line 8:
+Line 7:
-[[SearchEnginesWS0910/StudentIntros|Introduce yourself on this page please (Exercise 1)]].
+Here are PDFs of the exercise sheets so far: [[attachment:SearchEnginesWS0910/exercise-1.pdf|Exercise Sheet 1]], [[attachment:SearchEnginesWS0910/exercise-2.pdf|Exercise Sheet 2]], [[attachment:SearchEnginesWS0910/exercise-3.pdf|Exercise Sheet 3]], [[attachment:SearchEnginesWS0910/exercise-4.pdf|Exercise Sheet 4]].
-Line 10:
+Line 9:
-[[SearchEnginesWS0910/ExerciseSheet1|Upload your results to Exercise Sheet 1 on this page please]].
+Here are your solutions and comments on the previous exercise sheets: [[SearchEnginesWS0910/ExerciseSheet1|Solutions and Comments 1]], [[SearchEnginesWS0910/ExerciseSheet2|Solutions and Comments 2]], [[SearchEnginesWS0910/ExerciseSheet3|Solutions and Comments 3]]

= Exercise Sheet 3 =

The recordings of all lectures are now available, see above. Lecture 2 is missing because we had technical problems there. To play the recordings (it's .lpd files) you need the Lecturnity Player. [[http://www.lecturnity.de/de/download/lecturnity-player|You can download the player for free here]].

[[SearchEnginesWS0910/Rules|Here are the rules for the exercises as explained in Lecture 2]].

[[SearchEnginesWS0910/ExerciseSheet4|Here you can upload your solutions for Exercise Sheet 4]].
-Line 13:
+Line 20:
-I could not able to Create my page like donaldduckExcercise instruction shown in page. getting some error.
ConvertError
ExpatError: mismatched tag: line 376, column 881 (see dump in /usr/local/var/moin/farm/teaching/data/expaterror.log). '''Waleed Butt'''
-Line 17:
+Line 21:
+To Florian + all: A single number doesn't really make sense, does it? For the discussion part of the exercise, think in terms of probability distributions, as we did in the lecture (when discussing which probability distribution a certain encoding is optimal for). For the example, give a sequence of numbers. '''Hannah 15Nov09 9:41pm'''
-Line 18:
+Line 23:
-To Johann + all: I briefly commented on the rules for the exercises in the first lecture and will talk about it again in the next lecture. The exercises are graded (one or two points per exercise), you will get a mark for them in the end, and that mark will be 40% of your final mark for the course (the other 60% will be an exam at the end, we'll see whether that will be oral or written). Groups work is not allowed (the point of the exercises is missed otherwise), and if you copy or otherwise cheat, you are out. You don't have to come to the lectures if you think you don't need it. You don't have to come to the tutorials if you think you don't need it. But you absolutely must do the exercises, and you must do them yourself. '''Hannah 26Oct09 2:02pm'''
+To Johannes + all: Yes, good idea. I will anyway at some point in the next weeks hand out a sheet where you have the opportunity to give feedback on the lectures and the exercises. But yes, why not give me that feedback on the current exercise sheet already now. Let me refine your proposal a bit. It would be useful for me if you would provide ''two'' grades: one for the hardness (pick one of: too hard for me, challenging but feasible, not very hard) and one for the amount of work (pick one of: too much for me, a lot but feasible, not more than for other lectures). It would also be helpful if you would not just give a grade but put your opinion into words. It's no problem if you are critical but please stay polite. I will take your comments seriously, don't worry. '''Hannah 15Nov09 9:33am'''
-Line 20:
+Line 25:
-To Johann + all: I am sorry but solving the exercises is ''mandatory'', however attending the tutorials is not, i.e. if you understand everything there is no need for you to come to the tutorials. '''Marjan 26Oct 1:45pm'''
+In Exercise we should "give an example of data for which k = x is the best choice". What is meant by "an example of data" here? A single number or a set of numbers or anything else? '''Florian 15Nov09 8:52pm'''
-Line 22:
+Line 27:
-'''Notice:''' Please note that I recommend using English when producing your handouts. Please use German if you are absolutely not confident with English. In fact, either language is fine - it's just that there is a chance that I don't understand everything (if it's written in German). Thanks! '''Marjan 26Oct 1:41pm'''
+I'd like to suggest that everyone grades the exercise sheet from 1 (for "way to easy") to 10 ("way to hard"). This might provide the professor with the feedback she asks for in the lecture. How about that idea? '''Johannes 2009-11-15T20:40L'''
-Line 24:
+Line 29:
-Hi there, i don't know if i missed something but i also asked a fellow student who did not know! Are the Exercises mandatory and do we get some kind of points for them (where we need a specific amount to be allowed to participate the exam). I am asking because i don't know if i can finish all of the assigned tasks till tonight. '''Johann 26Oct09 1:39pm'''
+To Florian + all: yes, sorry, I forgot to mention this in the lecture. Marjan already explained how to clear the disk cache. Let me add to this an explanation what the disk cache actually is. Whenever you read a (part of a) file from disk, the operating system of your computed will use whatever memory is currently unused to store that (part of the) file there. When you read it again and the (part of the) file hasn't changed and the memory used to store it has not been used otherwise in the meantime, than that data is read right from memory, which is much faster than reading it from disk. Usually that effect is desirable, because it speeds up things, but when you do experiments, it is undesirable, because it leads to unrealistically good running times, especially when carrying out an experiment many times in a row. '''Hannah 15Nov09 8:10pm'''
-Line 26:
+Line 31:
-Hi Claudius + all: it's up to you how many you compute, the more you can find the better. The obvious algorithm is of course to try out all pairs of words. This will find all pairs with one hit, but it's obviously a quadratic algorithm and so will take a very long time even for a relatively small collection. See if you can find a smarter algorithm. Be assured, there is one. '''Hannah 26Oct09 1:26pm'''
+To Florian: Indeed, we were running out of time and there was no room for this in the lecture. I can suggest to you few ways how to clear the disk cache: before carrying out your final experiment, read a large amount of data (let's say close to the amount of RAM you have) from disk - this will ensure that your data (the inverted list) is cleared from the disk cache and replaced by something else (thus an actual reading from disk get's timed, and not reading from RAM). Another way is to restart your computer before doing the timing. '''Marjan 15Nov09 7:27pm'''
-Line 28:
+Line 33:
-Hi! In Exercise 4: Do we have to compute all possible pairs of query-words with one hit or only one pair of words?  '''Claudius 26Oct09 12:50am'''
+In exercise 4 it says: "Important note: Whenver you measure running times for reading data from disk, you have to clear the disk cache before, as discussed in the lecture". I think that this was not discussed in the lecture? What do we have to do here? '''Florian 15Nov09 7:15pm'''
-Line 30:
+Line 35:
-To Björn: No, it does not. You do not need to include positional information in your inverted index. But you're right: indexes are usually positional and the Zipf's law refers to the size of the inverted lists when positions are included. '''Marjan 25Oct 7:45pm'''
+@Bit shifting: The syntax for that is actually the same, irrespectively of whether you use Java, C++, perl, python, or whatever. The >> operator shifts to the right, the << operator shifts to the left, the & operator ands the bits of the two operands and the | operator ors the bits of the two operands. Very simple. You will also find zillions of example programs on the web by typing something like ''java bit shifting'' into Google or whatever your favorite search engine is. '''Hannah 15Nov09 1:16'''
-Line 32:
+Line 37:
-Thank you for the response. Does this also concern the inverted index? I'm wondering because the slides only contain examples where the list of documents does not contain duplicate entries. Personally, I could imagine a practical use for both variants. That's why i'm asking. Thanks again in advance. '''Björn 25Oct09 7:20pm'''
+Hi Marius + all: For Exercise 4, an inverted list of size m with doc ids from the range [1..n] is simply a sorted list of m numbers from the range [1..n]. I leave it to you, whether your lists potentially contain duplicates (as in  3, 5, 5, 8, 12, ...) or whether you generate them in a way that they don't contain duplicates (as in 3, 5, 8, 17, ...). It doesn't really matter for the exercise whether your list has duplicated or not. In any case, consider simple flat lists like in the two examples I gave (and like all the examples I gave in this and past lectures), not lists of lists or anything. '''Hannah 15Nov09 1:12am'''
-Line 34:
+Line 39:
-Hi Björn + all: occurrence always means individual occurrence, that is, if a collection has two documents and the word x occurs once in the first document and twice in the second document then there are three occurrences of this word overall. Ok? '''Hannah 25Oct09 2:12pm'''
+@Mirko: Sure, but an inverted list is a list of words where the Doc-IDs are attached to each words in which the words occur. So for Example: If word no. 5 occurs in Doc1, Doc2 and Doc3 and word no. 2 occurs in Doc5, the list would look like: 5 -> Doc1, Doc2, Doc3; 2 -> Doc5. Or am I mistaken? My question then is, how long should these attached lists be in average case? I mean, one could imagine that we got 1mil. documents over 3 words, so these lists could get very large...
-Line 36:
+Line 41:
-To Zhongjie + all: you are right, you can't add the #acl line to the document yourself, so I just did it for you. I will change the instructions on the upload page accordingly. Sorry for this initial confusion, but hey, good that we have a Wiki. '''Hannah 25Oct09 2:06pm'''
+EDIT: Oh ok. Now, I see your point. It's not an index, it's a list. Okay. So, what is an inverted list with Doc-IDs, then?
-Line 38:
+Line 43:
-Hey everyone! Whenever the exercise is talking about frequencies and occurences. Does it talk about occurences in different documents or should we considre multiple occurences in the same document. Thanks a lot. '''Björn 25Oct09 12:49pm'''
+EDIT EDIT: And to your question, Mirko, take a look at http://snippets.dzone.com/posts/show/93. Especially at Comment no. 2. Maybe this helps... I think, Java supports StreamWriters/Readers that are able to write/read bytes. '''Marius 11/14/2009 08:46pm'''
-Line 40:
+Line 45:
-To Johannes: How did you solve the problem? I have logged in my account as my name, and my email links only this account. But I still get the error message 'You can't change ACLs on this page since you have no admin rights on it! ' when I try to enter '#acl ZhongjieCai:read,write -All:read ' to the first line of the page... '''Zhongjie 25Oct09 01:15am'''
+EDIT EDIT EDIT: Sorry, me again. Well, I bothered Wikipedia which redirects from http://en.wikipedia.org/wiki/Inverted_list to Inverted Index. So it seems to me, this is being used as a synonym. Actually, I think I'm confused enough, now. I'll better wait for any responses... ;-) '''Marius 11/14/2009 9:08 pm'''
-Line 42:
+Line 47:
-Problem solved. To everybody: don't try to create multiple users with the same e-mail address. '''Johannes 25Oct09 00:21am'''
+@ Marius: i think we are supposed to generate one inverted __list__ of size m, with doc ids from 1..n (therefore n>=m, because no duplicates?).
-Line 44:
+Line 49:
-Oh, I see, well that *must* be your user name. Sorry for not making that clear earlier. Please create an account with that user name and try again. '''Hannah 25Oct09 00:02am'''
+Now a question from my side: ex.4, programming the compression in __java__, is there any __good__ tutorial about how to handle the bit-stuff? (otherwise, i think, it would cost me too much time..) '''Mirko 14Nov09, 19:18'''
-Line 46:
+Line 51:
-No that is not my user name. '''Johannes 24Oct09 11:58pm'''
+Hi, do you have any suggestions what the best numbers for m and n in exercise 4 should look like? Or are we supposed to mess around a bit with ints and longs? And: How long should the list of documents in the inverted index be? '''Marius 14Nov09 6:40pm'''
-Line 48:
+Line 53:
-Hi Johannes, if you are logged in as JohannesStork you should be able to see it, did you try that? '''Hannah 24Oct09 11:59pm'''
+And just to clarify what a single-cycle permutation is. Here is an example for an array of size 5 with a permutation that is a single cycle: 5 4 1 3 2. Why single cycle? Well, A[1] = 5, A[5] = 2, A[2] = 4, A[4] = 3, A[3] = 1. (My indices in this example are 1,...,5 and not 0,...,4.) Here is an example of a permutation with three cycles: 2 1 4 3 5. The first cycle is A[1] = 2, A[2] =1. The second cycle is A[3] = 4, A[4] = 3. The third cycle is A[5] = 5. '''Hannah 12Nov09 8:04pm'''
-Line 50:
+Line 55:
-I don't know if this is the right place to ask, but I can't access my exercise page. It says "Sie dürfen diese Seite nicht ansehen." '''Johannes 24Oct09 11:50pm'''
+Hi Daniel + all, I don't quite understand your question and your example (if your array is 1 5 3 4 2, why is A[1] = 3?). In case you refer to the requirement of the exercise that the permutation consists only of a single cycle. That is because your code should go over each element exactly once (it should, of course, stop after n iterations, where n is the size of the array). If your permutation has more than one cycle, it is hard to achieve that. Also note that for both (1) and (2), the sum of the array values should be sum_i=1,...,n i = n * (n+1) / 2. '''Hannah 12Nov09 7:54pm'''
-Line 52:
+Line 57:
-Good question, Johannes. Please upload the source code separately, either as a .zip or .tgz archive. I have modified the instructions on the upload page accordingly. Sorry if that means additional work for you, we weren't expecting anybody to submit so early ... '''Hannah 24Oct09 11:43pm'''

Shall we put the whole source into the PDF? What about tar.gz? '''Johannes 24Oct09 5:18pm'''

Hi Johannes + all, the slides are now availabe as PDF, see the link above. '''Hannah 23Oct09 17:04'''

'''Note about Exercise 5:''' One can assume that a more general model of the word frequencies is given than that given in the lecture, i.e. eps * N * (1 / i^alpha). Now both parameters (eps and alpha) can be estimated simultaneously. '''Marjan 23Oct09 3:29pm'''

Can you provide the slides as PDF? '''Johannes 23Oct09 10:05am'''

Please note that the deadline for uploading your solutions of the exercises is always Monday, 23:59 (sharp). '''Marjan 22Oct09 6:15pm'''

When you add a question or comment here, please end it with your name and the date and time in bold face, just like I did now. '''Hannah 22Oct09 01:59am'''
+Hi, I just looked at the new exercise sheet 4, in exercise 1 we should generate a permutation and sum the resulting array up, am I wrong or doesn't iterating method two iterate throw the whole array in every situation. for ex.: n= 5 permutation: 1 5 3 4 2, then A[1] = 3, A[A[1]]= A[3] = 1, A[1] = 3 ... '''Daniel 12Nov09 19:44pm'''