Differences between revisions 237 and 364 (spanning 127 versions)

Welcome to the Wiki page of the course Search Engines, WS 2009 / 2010. Lecturer: Hannah Bast. Tutorials: Marjan Celikik. Course web page: click here.

Here are PDFs of the slides of the lectures so far: Lecture 1, Lecture 2, Lecture 3, Lecture 4, Lecture 5, Lecture 6.

Here are .lpd files of the recordings of the lectures so far (except Lecture 2, where we had problems with the microphone): Recording Lecture 1, Recording Lecture 3, Recording Lecture 4, Recording Lecture 5 (no audio), Recording Lecture 6 (with audio for a change).

Here are PDFs of the exercise sheets so far: Exercise Sheet 1, Exercise Sheet 2, Exercise Sheet 3, Exercise Sheet 4, Exercise Sheet 5, Exercise Sheet 6.

Here are your solutions and comments on the previous exercise sheets: Solutions and Comments 1, Solutions and Comments 2, Solutions and Comments 3, Solutions and Comments 4, Solutions and Comments 5.

Exercise Sheet 6

The recordings of all lectures are now available, see above. Lecture 2 is missing because we had technical problems there. To play the recordings (it's .lpd files) you need the Lecturnity Player. You can download the player for free here.

Here are the rules for the exercises as explained in Lecture 2.

Here you can upload your solutions for Exercise Sheet 6.

Questions or comments below this line, most recent on top please

I computed the ratios as it was shown in the example here on the wiki, that was dividing the values in the same order as they are mentioned (first through second). For me that meant values > 1 for exercise 1 and values < 1 for exercise 2, but I wanted to compute them in the same way. Florian 1Dec09 01:22am

But if everybody computes the ratios as they want, the numbers in the table won't make sense! It also does not make sense if somebody computes the ration for Problem 1 in one way and then for Problem 2 to in another (I already don't know how the ratios are computed for the few uploaded solutions!) Marjan 30Nov09 00:25

Hi Björn + all, it doesn't really matter, but I (and probably most humans) find ratios > 1 more intuitive. Just compare 8 and 0.125, which one is easier to grasp? Hannah 30Nov09 11:59pm

Does it matter which way round we express the ratios? Depending on how we build the quotient, we get different values (all smaller or all greater 1). Or is that up to us? Should be possible to compare our results anyway, I assume. Björn 30Nov 23:36

To Björn: You can assume you have gaps of arbitrary size. Marjan 30Nov 14:43

To Claudius: The whole collection with all words. Marjan 30Nov 14:43

Is there a limit on how large gaps may be in exercise 3? I'm not sure for which case the two entropies actually fulfill the equation. Gaps that "make sense" (ther sum is not larger than n-1), gaps that are at most n, or arbitrary gaps? Björn 30Nov09 14:31

In Exercise 2, you ask for the costs of scanning the inv. lists of all words in the "collection". Do you mean the collection of words, matching the prefix or the the whole collection with all words in the inv. index? Claudius 30Nov09 2:16pm

Hi Dragos, just three-letter prefixes are fine. I have no plans yet for future exercises with a "*" in the middle. Hannah 29Nov09 11:10pm

For exercise 1, should we allow the "*" to be in any place ? Or just three letter prefix is sufficient ? I am asking because it would be good to know if we might need on later Exercise Sheets searches that allow multiple "*" in different positions, so that we do it now. Dragos 29 Nov 22:55

Hi Björn, by ratio I simply mean the quotient, that is, how much bigger the one is then the other. For example, if, for a particular prefix, the total size from (1) is one million, and the size from (2) is ten thousand, then, for that prefix, the ratio between the two is one hundred. Hannah 29Nov09 7:48pm

Hello, I wonder what's meant with the ratio demanded in exercise 1. If i have n lists with a maximum length of "a" and a total length of "b". Isn't the ratio something like "a:b"? At least that is what I thought. But adding a colon does not seem to be sufficient for a part of the exercise. Sorry for the meaningless question but I don't want to miss points because I'm not sure how to understand the word ratio. Björn 11-29 19:39

To all: about the selection of the ten prefixes. The idea was that you pick a meaningful variety by hand, that is, such prefixes which one could imagine that one would really type them. The exact selection doesn't really matter, but do avoid extreme cases like a prefix yzq with one completion and an inverted list of three doc ids. Hannah 29Nov09 6:28pm

To Marius + all: yes, I am sorry, "cost" was very imprecise here, I actually simply meant the time your code takes. Hannah 29Nov09 6:19pm

So you say that you mean by "costs" the running time? Or do you understand something else when you say we have to calculate the costs? Marius 11/29/09 4:58pm

Notice about Problem 2: You should use precise timers when measuring the running times. If your collection is very small and you round up your times, it's easy to get 0 ms when merging or scanning the inverted lists. I recommend using microsecond scale. Marjan 29Nov09 16:47

To Florian: Yes, you can do anything you want to find those words (as long as you produce the required outputs). Marjan 29Nov09 16:44

For exercise 1, should we use one of the methods presented in the lecture to find all words in the collection with the prefixes or can we do just anything to get them (though it might not be as efficient)? Florian 29Nov09 03:51pm

When you scan, please make sure that you do something very simple with the elements, like summing up all doc ids, and then outputting that sum. Otherwise a clever compiler might figure out that it can remove the whole loop, because it is not producing a result that is used anywhere. Hannah 28Nov09 11:48pm

To Mirko: Yes, scanning means one pass over the elements. Marjan 28Nov09 19:19

Hi, about exercise2: is by "scanning" meant that one looks at every element exactly once? (=> costs of scanning a list are just the size of the list) Mirko 28Nov, 19:12

-  ⇤ ← Revision 237 as of 2009-11-12 19:44:56 → 
  Size: 9427
  Editor: HSI-KBW-091-089-113-212
  Comment:
+   ← Revision 364 as of 2009-12-01 01:22:46 → ⇥
  Size: 7765
  Editor: dip-001
  Comment:
-Deletions are marked like this.
+Additions are marked like this.
 Line 3:
-Here are PDFs of the slides of the lectures so far: [[attachment:SearchEnginesWS0910/lecture-1.pdf|Lecture 1]], [[attachment:SearchEnginesWS0910/lecture-2.pdf|Lecture 2]], [[attachment:SearchEnginesWS0910/lecture-3.pdf|Lecture 3]], [[attachment:SearchEnginesWS0910/lecture-4.pdf|Lecture 4]].
+Here are PDFs of the slides of the lectures so far: [[attachment:SearchEnginesWS0910/lecture-1.pdf|Lecture 1]], [[attachment:SearchEnginesWS0910/lecture-2.pdf|Lecture 2]], [[attachment:SearchEnginesWS0910/lecture-3.pdf|Lecture 3]], [[attachment:SearchEnginesWS0910/lecture-4.pdf|Lecture 4]], [[attachment:SearchEnginesWS0910/lecture-5.pdf|Lecture 5]], [[attachment:SearchEnginesWS0910/lecture-6.pdf|Lecture 6]].
 Line 5:
-Here are the recordings of some of the lectures so far (Lecture 1 still missing, in Lecture 2 the microphone signal did not come through): [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture1/Search_Engines,_Lecture_3,_5Nov09_1_05_11_2009_16_16_20.html|Lecture 3]]
+Here are .lpd files of the recordings of the lectures so far (except Lecture 2, where we had problems with the microphone): [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-1.lpd|Recording Lecture 1]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-3.lpd|Recording Lecture 3]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-4.lpd|Recording Lecture 4]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-5.lpd|Recording Lecture 5 (no audio)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-6.lpd|Recording Lecture 6 (with audio for a change)]].
 Line 7:
-Here are PDFs of the exercise sheets so far: [[attachment:SearchEnginesWS0910/exercise-1.pdf|Exercise Sheet 1]], [[attachment:SearchEnginesWS0910/exercise-2.pdf|Exercise Sheet 2]], [[attachment:SearchEnginesWS0910/exercise-3.pdf|Exercise Sheet 3]], [[attachment:SearchEnginesWS0910/exercise-4.pdf|Exercise Sheet 4]].
+Here are PDFs of the exercise sheets so far: [[attachment:SearchEnginesWS0910/exercise-1.pdf|Exercise Sheet 1]], [[attachment:SearchEnginesWS0910/exercise-2.pdf|Exercise Sheet 2]], [[attachment:SearchEnginesWS0910/exercise-3.pdf|Exercise Sheet 3]], [[attachment:SearchEnginesWS0910/exercise-4.pdf|Exercise Sheet 4]], [[attachment:SearchEnginesWS0910/exercise-5.pdf|Exercise Sheet 5]], [[attachment:SearchEnginesWS0910/exercise-6.pdf|Exercise Sheet 6]].
 Line 9:
-Here are your solutions and comments on the previous exercise sheets: [[SearchEnginesWS0910/ExerciseSheet1|Exercise Sheet 1]], [[SearchEnginesWS0910/ExerciseSheet2|Exercise Sheet 2]].
+Here are your solutions and comments on the previous exercise sheets: [[SearchEnginesWS0910/ExerciseSheet1|Solutions and Comments 1]], [[SearchEnginesWS0910/ExerciseSheet2|Solutions and Comments 2]], [[SearchEnginesWS0910/ExerciseSheet3|Solutions and Comments 3]], [[SearchEnginesWS0910/ExerciseSheet4|Solutions and Comments 4]], [[SearchEnginesWS0910/ExerciseSheet5|Solutions and Comments 5]].

= Exercise Sheet 6 =
The recordings of all lectures are now available, see above. Lecture 2 is missing because we had technical problems there. To play the recordings (it's .lpd files) you need the Lecturnity Player. [[http://www.lecturnity.de/de/download/lecturnity-player|You can download the player for free here]].
-Line 13:
+Line 16:
-= Exercise Sheet 3 =
Above, you find a link to a published recording of Lecture 3. Please try if that works for you.

[[SearchEnginesWS0910/ExerciseSheet3|Here you can upload your solutions for Exercise Sheet 3]].
+[[SearchEnginesWS0910/ExerciseSheet6|Here you can upload your solutions for Exercise Sheet 6]].
 Line 20:
-Hi, I just looked at the new exercise sheet 4, in exercise 1 we should generate a permutation and sum the resulting array up, am I wrong or doesn't iterating method two iterate throw the whole array in every situation. for ex.: n= 5 permutation: 1 5 3 4 2, then A[1] = 3, A[A[1]]= A[3] = 1, A[1] = 3 ... '''Daniel 12Nov09 19:44pm'''
+I computed the ratios as it was shown in the example here on the wiki, that was dividing the values in the same order as they are mentioned (first through second). For me that meant values > 1 for exercise 1 and values < 1 for exercise 2, but I wanted to compute them in the same way. '''Florian 1Dec09 01:22am'''
 Line 22:
-Hi Paresh + all. If you do and-ish retrieval, all documents which contain at least one of the query words are relevant. If you do and retrieval, only the documents which contain all of the query words are relevant. Independent of which scoring scheme you use, you can do it either way. Just clearly state which way you are doing it, and things will be fine. '''Hannah 10Nov09 2:46am'''
+But if everybody computes the ratios as they want, the numbers in the table won't make sense! It also does not make sense if somebody computes the ration for Problem 1 in one way and then for Problem 2 to in another (I already don't know how the ratios are computed for the few uploaded solutions!) '''Marjan 30Nov09 00:25'''
 Line 24:
-Hi, If i use and-ish retrieval how is the relevance of hits is measured? The hits which have one of the keywords are relevant or only those who have all the keywords are relevant. I am confused!! '''Paresh 10 Nov 1:31AM '''
+Hi Björn + all, it doesn't really matter, but I (and probably most humans) find ratios > 1 more intuitive. Just compare 8 and 0.125, which one is easier to grasp? '''Hannah 30Nov09 11:59pm'''
 Line 26:
-Hi Dragos + all, pick three documents which got a score of zero. For any reasonable query and a not too small collection there should be such documents. '''Hannah 10Nov09 0:24am'''
+Does it matter which way round we express the ratios? Depending on how we build the quotient, we get different values (all smaller or all greater 1). Or is that up to us? Should be possible to compare our results anyway, I assume. '''Björn 30Nov 23:36'''
 Line 28:
-Hello :). For exercise 3, should we find three-relevant documents that were not returned at all, or that were low ranked ? I guess that the goal is to discuss why the low-rank, right ? :) '''Dragos 9Nov 11.56 PM'''
 
Hi Dragos, if you want you can use your implementation from Exercise 1 (extended to deal with scores, of course). If your implementation could only deal with two-word queries, you have to find a suitable two-word query (see what I wrote below), but that should be possible. I don't understand what this has to do with the vector-space model though? The vector space model is just a way to get a formula for ranking documents. Processing the queries is still done via inverted lists. Understand that inverted lists are just an efficient way to store the non-zero entries of a sparse matrix. You would never store the term-document matrix as a two-dimensional array, that would be huge ... and a big waste, because most entries are actually zero (that's what "sparse" means). '''Hannah 9Nov09 2:31am'''
+To Björn: You can assume you have gaps of arbitrary size. '''Marjan 30Nov 14:43'''
-Line 32:
+Line 30:
-For exercises 2 and 3, should the query be multi-word, or the exercises refer to the two-word query implemented in Ex Sheet 1 ? I am asking because it's clearly more simple to use the "easy" method(presented in the lecture) for two-word queries and the "vector space model" for a multi-word query. Thank you in advance ! '''Dragos 9Nov 00.27 '''
+To Claudius: The whole collection with all words. '''Marjan 30Nov 14:43'''
-Line 34:
+Line 32:
-To Björn + all: I said non-trivial so that you don't take a super-specific query, which has only one or two matching documents, which you can easily retrieve with 100% precision and 100% recall with the right query. An example would be an article with a very specific title like "On the influence of Blancmanges from Skyron on Scottish tennis playing skills". Then, if your collection is not super large, the query "blancmanges skyron scottish" will be perfect. Don't pick a query like that. Also do not just formulate a query, but also write down the search request that you had in mind, so that you have a yardstick to determine what is relevant for that query and what is not. [[attachment:trec_queries.txt|Here are some example of queries from TREC, the big IR benchmark conference]]. Each query there has a so-called "title" (what you would typically enter as query words), a "description" (a short description of what the query is about), and a "narrative" (a long description of what the query is about). '''Hannah 08Nov09 3:28pm'''
+Is there a limit on how large gaps may be in exercise 3? I'm not sure for which case the two entropies actually fulfill the equation. Gaps that "make sense" (ther sum is not larger than n-1), gaps that are at most n, or arbitrary gaps? '''Björn 30Nov09 14:31'''
-Line 36:
+Line 34:
-Concerning the exercises: What is a non-trivial query? A query that does contain multiple documents or does it have to consist of multiple words, too? Anything else? '''Björn 8Nov09 14:26'''
+In Exercise 2, you ask for the costs of scanning the inv. lists of all words in the "collection". Do you mean the collection of words, matching the prefix or the the whole collection with all words in the inv. index? '''Claudius 30Nov09 2:16pm'''
-Line 38:
+Line 36:
-The recording works for mac os with flip4mac: http://www.microsoft.com/windows/windowsmedia/player/wmcomponents.mspx  '''Jonas 8Nov09 14:06'''
+Hi Dragos, just three-letter prefixes are fine. I have no plans yet for future exercises with a "*" in the middle. '''Hannah 29Nov09 11:10pm'''
-Line 40:
+Line 38:
-Maybe you could consider putting it on electures - it seems to be the standard place for putting recordings and slides online as far as I know and there are already some solutions for putting lecturnity files online in a platform-independent manner. I don't know if it's practical but it's also nice having all lectures in one place i'd say...  http://electures.informatik.uni-freiburg.de/  '''Alex 7Nov09 18:05'''
+For exercise 1, should we allow the "*" to be in any place ? Or just three letter prefix is sufficient ? I am asking because it would be good to know if we might need on later Exercise Sheets searches that allow multiple "*" in different positions, so that we do it now. '''Dragos 29 Nov 22:55'''
-Line 42:
+Line 40:
-In Linux I see no suitable plugin. I would like to download the .lpd file too. We can test it with our old Lecturnity versions (i have 2.0) and if it doesn't work we can download 4.x at http://www.lecturnity.de/de/download/lecturnity-player/ '''Waldemar 7Nov09 12:49'''
+Hi Björn, by ratio I simply mean the quotient, that is, how much bigger the one is then the other. For example, if, for a particular prefix, the total size from (1) is one million, and the size from (2) is ten thousand, then, for that prefix, the ratio between the two is one hundred. '''Hannah 29Nov09 7:48pm'''
-Line 44:
+Line 42:
-Thanks, Paresh, yes I can do that. So do all students have access to the latest version (should be at least 4.x otherwise it will not work I think) of Lecturnity? '''Hannah 6Nov09 11:35pm'''
+Hello, I wonder what's meant with the ratio demanded in exercise 1. If i have n lists with a maximum length of "a" and a total length of "b". Isn't the ratio something like "a:b"? At least that is what I thought. But adding a colon does not seem to be sufficient for a part of the exercise. Sorry for the meaningless question but I don't want to miss points because I'm not sure how to understand the word ratio. '''Björn 11-29 19:39'''
-Line 46:
+Line 44:
-Hi, yes the recording is working properly after downloading the plug-in. kindly upload the rest files. Also it will be helpful if you could give links to .lpd files since it is easier to download and play them in lecturnity player than browser and one can play them at any time. '''Paresh 6 Nov09 11:25pm'''
+To all: about the selection of the ten prefixes. The idea was that you pick a meaningful variety by hand, that is, such prefixes which one could imagine that one would really type them. The exact selection doesn't really matter, but do avoid extreme cases like a prefix ''yzq'' with one completion and an inverted list of three doc ids. '''Hannah 29Nov09 6:28pm'''
-Line 48:
+Line 46:
-To Mirko + all: whenever we write "prove", we mean a proof in the mathematical sense. For the exercises, the challenge is often two-fold. You first have to turn the statement of the exercise into a formal statement. Then you have to prove that statement. For Exercise 4 you will first have to specify the order in which the inverted lists should be sorted. Then you have to prove that the document with the i-th largest score (formed by max aggregation), where i <= k, is indeed among one of the k first entries wrt to the specified order, in at least one of the inverted lists. '''Hannah 3Nov09 10:29pm'''
+To Marius + all: yes, I am sorry, "cost" was very imprecise here, I actually simply meant the time your code takes. '''Hannah 29Nov09 6:19pm'''
-Line 50:
+Line 48:
-About Exercise4: I actually dont know how to to write down (but i think i know how/why it works) the prove of top-k retrieval with the maximum-score. Is it okay to describe it in words or do we have to formalize it in a certain way?  '''Mirko 5Nov09 22:21pm'''
+So you say that you mean by "costs" the running time? Or do you understand something else when you say we have to calculate the costs? '''Marius 11/29/09 4:58pm'''
-Line 52:
+Line 50:
-Ok, I have played around a bit with lecturnity myself, and published Lecture 3, see the link above. For Marjan it worked, he only needed to install some Windows Media plugin for his Firefox. Please also try, and tell me if there are problems. Also tell me if everything goes fine. (It's enough if one or two people tell me.) If it does I will also publish Lecture 1. Lecture 2, as I said, is lost to the world forever (well, at least the audio), since audio recording did not work that day. '''Hannah 3Nov09 10:06pm'''
+Notice about Problem 2: You should use precise timers when measuring the running times. If your collection is very small and you round up your times, it's easy to get 0 ms when merging or scanning the inverted lists. I recommend using microsecond scale. '''Marjan 29Nov09 16:47'''
-Line 54:
+Line 52:
-Dear Marius + all: Yes, the lectures are recorded, except for Lecture 2, where there were technical problems (no signal from the microphone). I always copy the Lecturnity files to my machine after the lecture, but don't know yet how how to publish them on the web so that they are easily viewable by others. I will meet with our group's technician tomorrow, and ask him about this. Stay tuned! '''Hannah 5Nov09 8:36pm'''
+To Florian: Yes, you can do anything you want to find those words (as long as you produce the required outputs). '''Marjan 29Nov09 16:44'''
-Line 56:
+Line 54:
-Hi, I noticed that you record your lectures. Is it somehow possible to download these recordings or will they be released later? '''Marius Nov5th, 4:54 p.m.'''
+For exercise 1, should we use one of the methods presented in the lecture to find all words in the collection with the prefixes or can we do just anything to get them (though it might not be as efficient)? '''Florian 29Nov09 03:51pm'''
-Line 58:
+Line 56:
-Hi Waleed, when you create a conflict, it's your responsibility to remove it and not leave a mess behind. If the instructions given when the conflict occurs do not suffice, try to find more information on the Wiki help pages. '''Hannah 3Nov09 9:00pm'''
+When you scan, please make sure that you do something very simple with the elements, like summing up all doc ids, and then outputting that sum. Otherwise a clever compiler might figure out that it can remove the whole loop, because it is not producing a result that is used anywhere. '''Hannah 28Nov09 11:48pm'''
-Line 60:
+Line 58:
-I uploaded my Files and put a new row on table in the excercies sheet 2 page but when i pressed save button it shows me conflict. my version and other version of list. how can i remove conflict? does my assignment is submitted properly or not? '''Waleed''' 3Nov09
+To Mirko: Yes, scanning means one pass over the elements. '''Marjan 28Nov09 19:19'''

Hi, about exercise2: is by "scanning" meant that one looks at every element exactly once? (=> costs of scanning a list are just the size of the list) '''Mirko 28Nov, 19:12'''