Differences between revisions 273 and 385 (spanning 112 versions)

Welcome to the Wiki page of the course Search Engines, WS 2009 / 2010. Lecturer: Hannah Bast. Tutorials: Marjan Celikik. Course web page: click here.

Here are PDFs of the slides of the lectures so far: Lecture 1, Lecture 2, Lecture 3, Lecture 4, Lecture 5, Lecture 6, Lecture 7.

Here are .lpd files of the recordings of the lectures so far (except Lecture 2, where we had problems with the microphone): Recording Lecture 1, Recording Lecture 3, Recording Lecture 4, Recording Lecture 5 (no audio), Recording Lecture 6 (with audio for a change), Recording Lecture 7 (AVI).

Here are PDFs of the exercise sheets so far: Exercise Sheet 1, Exercise Sheet 2, Exercise Sheet 3, Exercise Sheet 4, Exercise Sheet 5, Exercise Sheet 6, Exercise Sheet 7.

Here are your solutions and comments on the previous exercise sheets: Solutions and Comments 1, Solutions and Comments 2, Solutions and Comments 3, Solutions and Comments 4, Solutions and Comments 5, Solutions and Comments 6.

Exercise Sheet 6

The recordings of all lectures are now available, see above. Lecture 2 is missing because we had technical problems there. To play the Lecturnity recordings (.lpd files) you need the Lecturnity Player, which you can download here. I have put up the Camtasia recordings as .avi files, which you can play with any ordinary video player; I would recommend VLC.

Here are the rules for the exercises as explained in Lecture 2.

Here you can upload your solutions for Exercise Sheet 7.

Questions or comments below this line, most recent on top please

Just let your java server provide the css and html too. Works for me and is easy to do in java. Johannes 2009-12-06T1114L

I didn't know that jQuery's find does not work on Internet Explorer, and I am actually surprised to hear that. It somewhat shatters my previous belief that jQuery just works on any of the major browsers (all of which implement JavaScript a little differently, which makes the use of raw JavaScript so cumbersome). I will try to find(!) out why that is so. Sorry, if you had trouble because of this, but well, that's (web application writing) life. Hannah 6Dec09 0:26

In the lecture, all the files prefix-search.html, prefix-search.js, prefix-search.css, and prefix-search.php were served by an Apache web server running on one and the same machine stromboli.informatik.uni-freiburg.de. The $.get in the prefix-search.js was sending the query to the prefix-search.php. As Björn pointed out, Firefox asks that the html (which is what the user loads by typing the URL or clicking a link, and which in turn loads the js) be served via port 80 by a machine on the same domain as the prefix-search.php. For our machine, domain refers to uni-freiburg.de, that is, the php could have been located on any other machine with a URL ending in uni-freiburg.de, too. Otherwise, you get a so-called cross-scripting error. This is *not* part of the JavaScript standard, however, and different browsers implement it differently. This is also what Manuela found. I leave it to you how you get around the cross-scripting problem. The preferred solution is to have all files served by web servers on machines on the same domain, as just explained. If you find other solutions that work, that is also fine, but please explain what led you to this solution, just like Manuela did below. Hannah 06Dec09 0:22

From what a fellow student told me in the lecture (thanks, alex) the problem with GETing the javascript comes from the fact that (for security reasons) the XMLHttpRequest is only allowed to be sent to a resource on the same domain that you got the HTML from. Firefox turns it into an OPTIONS request. This might also be the reason why it worked in the lecture where the html and the php were both served by the same apache, but does not work if your html is not on the apache, too (this also explains the observations posted below). Personally, I'm planning on letting my webserver provide everything: the html, css, js (by letting it return files from a subfolder depending on the path in the GET request) and the xml if the GET request does not start with a prefix for that folder. Otherwise it should work if you do it just as we did in the lecture and have HTML (+ css + js) and PHP in your apache's folder. I haven't started yet but I can let you know if this works for me. Anyway, IF it does, credit goes to Alexander Gutjahr who told me about this javascript issue, of course. Björn 05Dec 22:12
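
A minimal Java sketch of the routing Björn describes above: requests under a static prefix are mapped to files in a subfolder, and everything else goes to the prefix search. The names RequestRouter, /static/ and the folder www are assumptions for illustration, not taken from the lecture or the exercise sheet, and the request path is assumed to arrive without a query string.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Sketch: classify the path of a GET request (hypothetical names throughout). */
final class RequestRouter {
    enum Route { STATIC_FILE, QUERY, FORBIDDEN }

    // Assumed URL prefix and folder; pick whatever fits your own server.
    static final String STATIC_PREFIX = "/static/";
    static final Path STATIC_ROOT = Paths.get("www").toAbsolutePath().normalize();

    /** Maps a request path that starts with STATIC_PREFIX to a file path. */
    static Path toFile(String requestPath) {
        String relative = requestPath.substring(STATIC_PREFIX.length());
        return STATIC_ROOT.resolve(relative).normalize();
    }

    /** Decides how the server should answer the given request path. */
    static Route route(String requestPath) {
        if (!requestPath.startsWith(STATIC_PREFIX)) {
            return Route.QUERY; // anything else is treated as a prefix-search query
        }
        // Refuse paths like /static/../secret.txt that escape the folder.
        return toFile(requestPath).startsWith(STATIC_ROOT) ? Route.STATIC_FILE : Route.FORBIDDEN;
    }

    /** Picks a Content-Type header value from the file extension. */
    static String contentType(String fileName) {
        if (fileName.endsWith(".html")) return "text/html";
        if (fileName.endsWith(".css")) return "text/css";
        if (fileName.endsWith(".js")) return "application/javascript";
        if (fileName.endsWith(".xml")) return "text/xml";
        return "application/octet-stream";
    }
}
```

With this kind of routing the html, the js and the XML answers all come from the same server, so the request sent by the js is no longer a cross-domain request.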

I'm a bit confused about the exercise. For exercise 1 I extended the Java webserver from exercise sheet 2 with the prefix search of the last exercise sheet. The webserver returns the results of the prefix search as an XML document. Should I have used a webserver like apache? But I also had some problems with sending the jQuery request to the server. The webserver runs on port 80. I started with Firefox. Firefox sends an OPTIONS request to the server and so the jQuery get-function doesn't work. The same happened when I used Google Chrome. Because the Java webserver can't handle PHP I can't do it like in the lecture. So I tried Internet Explorer and this browser sends a GET request when using the jQuery get-function. I assumed I could continue with the exercise, but though I did it like in the lecture, nothing happened. I used the alert-function to check that I really get the XML document from the server (and I got it). Now I know that the find-function doesn't work with Internet Explorer. After this I tried Safari. Safari sends a GET request and the find-function works, too. Now I can continue and build the tables as described on the exercise sheet. Is it OK to go on like that? Manuela 05Dec09 19:24

Hi Alex, can you be more specific about what exactly did not work for you and what you had to do to make it work? In particular, what do you mean by "the server directory"? Do you mean apache's document root? Then where have your files been before? In a subdirectory of the root? And what do you mean by a GET request being turned into an OPTIONS request, and how did you arrive at the conclusion that this is what happens? It should not matter if the .php file is in a different directory than the .js file. My feeling is that your problem lies elsewhere, but it's hard to tell from the information you gave so far. Hannah 05Dec09 18:02

@whom it may concern: for me the access-rights stuff did not work exactly as in the lecture - I had to move the whole site (.html .js ...) into the server directory. Maybe it's new in Firefox 3.5 but I could not access any file on the server from a .js not being in the server directory - it always turned my GET requests into OPTIONS requests and nothing was returned - so the php-solution does not seem to work, even if my server was able to execute php. Were we supposed to do it like this anyway or is it completely wrong this way? alex 5Dec09 17:56

Ok, the recording of Lecture 7 is now available as AVI. But beware, it's quite big: around 300 MB. Hannah 3Dec09 22:46

To play the .camrec recording you need the full Camtasia Studio (you can download a 30-day test version if you want). I will soon upload an .avi version instead. Hannah 3Dec09 21:56

For your reference and convenience, here is a tar archive of the files which we wrote together in Lecture 7 (prefix-search.html, prefix-search.css, prefix-search.js, prefix-search.php). Hannah 3Dec09 21:35

-  ⇤ ← Revision 273 as of 2009-11-15 20:49:44 → 
  Size: 8761
  Editor: eth0-9
  Comment:
+   ← Revision 385 as of 2009-12-06 11:17:14 → ⇥
  Size: 9233
  Editor: p549FADA7
  Comment:
-Deletions are marked like this.
+Additions are marked like this.
 Line 3:
-Here are PDFs of the slides of the lectures so far: [[attachment:SearchEnginesWS0910/lecture-1.pdf|Lecture 1]], [[attachment:SearchEnginesWS0910/lecture-2.pdf|Lecture 2]], [[attachment:SearchEnginesWS0910/lecture-3.pdf|Lecture 3]], [[attachment:SearchEnginesWS0910/lecture-4.pdf|Lecture 4]].
+Here are PDFs of the slides of the lectures so far: [[attachment:SearchEnginesWS0910/lecture-1.pdf|Lecture 1]], [[attachment:SearchEnginesWS0910/lecture-2.pdf|Lecture 2]], [[attachment:SearchEnginesWS0910/lecture-3.pdf|Lecture 3]], [[attachment:SearchEnginesWS0910/lecture-4.pdf|Lecture 4]], [[attachment:SearchEnginesWS0910/lecture-5.pdf|Lecture 5]], [[attachment:SearchEnginesWS0910/lecture-6.pdf|Lecture 6]], [[attachment:SearchEnginesWS0910/lecture-7.pdf|Lecture 7]].
 Line 5:
-Here are .lpd files of the recordings of the lectures so far (except Lecture 2, where we had problems with the microphone): [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-1.lpd|Lecture 1]] [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-3.lpd|Lecture 3]] [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-4.lpd|Lecture 4]].
+Here are .lpd files of the recordings of the lectures so far (except Lecture 2, where we had problems with the microphone): [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-1.lpd|Recording Lecture 1]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-3.lpd|Recording Lecture 3]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-4.lpd|Recording Lecture 4]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-5.lpd|Recording Lecture 5 (no audio)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-6.lpd|Recording Lecture 6 (with audio for a change)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-7.avi|Recording Lecture 7 (AVI)]].
 Line 7:
-Here are PDFs of the exercise sheets so far: [[attachment:SearchEnginesWS0910/exercise-1.pdf|Exercise Sheet 1]], [[attachment:SearchEnginesWS0910/exercise-2.pdf|Exercise Sheet 2]], [[attachment:SearchEnginesWS0910/exercise-3.pdf|Exercise Sheet 3]], [[attachment:SearchEnginesWS0910/exercise-4.pdf|Exercise Sheet 4]].
+Here are PDFs of the exercise sheets so far: [[attachment:SearchEnginesWS0910/exercise-1.pdf|Exercise Sheet 1]], [[attachment:SearchEnginesWS0910/exercise-2.pdf|Exercise Sheet 2]], [[attachment:SearchEnginesWS0910/exercise-3.pdf|Exercise Sheet 3]], [[attachment:SearchEnginesWS0910/exercise-4.pdf|Exercise Sheet 4]], [[attachment:SearchEnginesWS0910/exercise-5.pdf|Exercise Sheet 5]], [[attachment:SearchEnginesWS0910/exercise-6.pdf|Exercise Sheet 6]], [[attachment:SearchEnginesWS0910/exercise-7.pdf|Exercise Sheet 7]].
 Line 9:
-Here are your solutions and comments on the previous exercise sheets: [[SearchEnginesWS0910/ExerciseSheet1|Solutions and Comments 1]], [[SearchEnginesWS0910/ExerciseSheet2|Solutions and Comments 2]], [[SearchEnginesWS0910/ExerciseSheet3|Solutions and Comments 3]]
+Here are your solutions and comments on the previous exercise sheets: [[SearchEnginesWS0910/ExerciseSheet1|Solutions and Comments 1]], [[SearchEnginesWS0910/ExerciseSheet2|Solutions and Comments 2]], [[SearchEnginesWS0910/ExerciseSheet3|Solutions and Comments 3]], [[SearchEnginesWS0910/ExerciseSheet4|Solutions and Comments 4]], [[SearchEnginesWS0910/ExerciseSheet5|Solutions and Comments 5]], [[SearchEnginesWS0910/ExerciseSheet6|Solutions and Comments 6]].
 Line 11:
-= Exercise Sheet 3 =
The recordings of all lectures are now available, see above. Lecture 2 is missing because we had technical problems there. To play the recordings (they are .lpd files) you need the Lecturnity Player. [[http://www.lecturnity.de/de/download/lecturnity-player|You can download the player for free here]].
+= Exercise Sheet 6 =

The recordings of all lectures are now available, see above. Lecture 2 is missing because we had technical problems there. To play the Lecturnity recordings (.lpd files) you need the [[http://www.lecturnity.de/de/download/lecturnity-player|Lecturnity Player, which you can download here]]. I have put up the Camtasia recordings as .avi files, which you can play with any ordinary video player; I would recommend [[http://www.videolan.org/vlc|VLC]].
-Line 16:
+Line 17:
-[[SearchEnginesWS0910/ExerciseSheet4|Here you can upload your solutions for Exercise Sheet 4]].
+[[SearchEnginesWS0910/ExerciseSheet7|Here you can upload your solutions for Exercise Sheet 7]].
-Line 19:
+Line 20:
-@"clearing the disc cache" on linux machines: As superuser run
$ sync; sh -c 'echo 3 > /proc/sys/vm/drop_caches'
More information at http://www.linuxinsight.com/proc_sys_vm_drop_caches.html '''Jonas 15Nov09 8:50pm'''
-Line 23:
+Line 21:
-I'd like to suggest that everyone grades the exercise sheet from 1 (for "way too easy") to 10 ("way too hard"). This might provide the professor with the feedback she asks for in the lecture. How about that idea? '''Johannes 2009-11-15T20:40L'''
+Just let your java server provide the css and html too. Works for me and is easy to do in java. '''Johannes 2009-12-06T1114L'''
-Line 25:
+Line 23:
-To Florian + all: yes, sorry, I forgot to mention this in the lecture. Marjan already explained how to clear the disk cache. Let me add to this an explanation of what the disk cache actually is. Whenever you read a (part of a) file from disk, the operating system of your computer will use whatever memory is currently unused to store that (part of the) file there. When you read it again and the (part of the) file hasn't changed and the memory used to store it has not been used otherwise in the meantime, then that data is read right from memory, which is much faster than reading it from disk. Usually that effect is desirable, because it speeds up things, but when you do experiments, it is undesirable, because it leads to unrealistically good running times, especially when carrying out an experiment many times in a row. '''Hannah 15Nov09 8:10pm'''
+I didn't know that jQuery's ''find'' does not work on Internet Explorer, and I am actually surprised to hear that. It somewhat shatters my previous belief that jQuery just works on any of the major browsers (all of which implement JavaScript a little differently, which makes the use of raw JavaScript so cumbersome). I will try to find(!) out why that is so. Sorry, if you had trouble because of this, but well, that's (web application writing) life. '''Hannah 6Dec09 0:26'''
-Line 27:
+Line 25:
-To Florian: Indeed, we were running out of time and there was no room for this in the lecture. I can suggest a few ways to clear the disk cache: before carrying out your final experiment, read a large amount of data (let's say close to the amount of RAM you have) from disk - this will ensure that your data (the inverted list) is cleared from the disk cache and replaced by something else (thus an actual read from disk gets timed, and not a read from RAM). Another way is to restart your computer before doing the timing. '''Marjan 15Nov09 7:27pm'''
+In the lecture, all the files prefix-search.html, prefix-search.js, prefix-search.css, and prefix-search.php were served by an Apache web server running on one and the same machine ''stromboli.informatik.uni-freiburg.de''. The $.get in the prefix-search.js was sending the query to the prefix-search.php. As Björn pointed out, Firefox asks that the html (which is what the user loads by typing the URL or clicking a link, and which in turn loads the js) be served via port 80 by a machine on the same domain as the prefix-search.php. For our machine, ''domain'' refers to ''uni-freiburg.de'', that is, the php could have been located on any other machine with a URL ending in ''uni-freiburg.de'', too. Otherwise, you get a so-called ''cross-scripting'' error. This is *not* part of the JavaScript standard, however, and different browsers implement it differently. This is also what Manuela found. I leave it to you how you get around the cross-scripting problem. The preferred solution is to have all files served by web servers on machines on the same domain, as just explained. If you find other solutions that work, that is also fine, but please explain what led you to this solution, just like Manuela did below. '''Hannah 06Dec09 0:22'''
-Line 29:
+Line 27:
-In exercise 4 it says: "Important note: Whenver you measure running times for reading data from disk, you have to clear the disk cache before, as discussed in the lecture". I think that this was not discussed in the lecture? What do we have to do here? '''Florian 15Nov09 7:15pm'''
+From what a fellow student told me in the lecture (thanks, alex) the problem with GETing the javascript comes from the fact that (for security reasons) the XMLHttpRequest is only allowed to be sent to a resource on the same domain that you got the HTML from. Firefox turns it into an OPTIONS request. This might also be the reason why it worked in the lecture where the html and the php were both served by the same apache, but does not work if your html is not on the apache, too (this also explains the observations posted below). Personally, I'm planning on letting my webserver provide everything: the html, css, js (by letting it return files from a subfolder depending on the path in the GET request) and the xml if the GET request does not start with a prefix for that folder. Otherwise it should work if you do it just as we did in the lecture and have HTML (+ css + js) and PHP in your apache's folder. I haven't started yet but I can let you know if this works for me. Anyway, IF it does, credit goes to Alexander Gutjahr who told me about this javascript issue, of course. '''Björn 05Dec 22:12'''
-Line 31:
+Line 29:
-@Bit shifting: The syntax for that is actually the same, irrespective of whether you use Java, C++, perl, python, or whatever. The >> operator shifts to the right, the << operator shifts to the left, the & operator ands the bits of the two operands and the | operator ors the bits of the two operands. Very simple. You will also find zillions of example programs on the web by typing something like ''java bit shifting'' into Google or whatever your favorite search engine is. '''Hannah 15Nov09 1:16'''
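
To make these operators concrete, here is a small Java sketch that packs two 16-bit numbers into one int and takes them apart again. The class and method names are made up for illustration. One Java-specific detail: >> copies the sign bit when shifting right, so the sketch uses >>> (the unsigned right shift) to read the upper half.

```java
/** Sketch: pack two values from [0, 65535] into a single 32-bit int. */
final class BitOps {
    /** Puts hi into the upper 16 bits and lo into the lower 16 bits. */
    static int pack(int hi, int lo) {
        return (hi << 16) | (lo & 0xFFFF);
    }

    /** Reads the upper 16 bits; >>> fills with zeros instead of the sign bit. */
    static int high(int packed) {
        return packed >>> 16;
    }

    /** Reads the lower 16 bits by masking everything else away. */
    static int low(int packed) {
        return packed & 0xFFFF;
    }
}
```

For example, BitOps.pack(3, 5) is 3 * 65536 + 5 = 196613, and high and low return 3 and 5 again.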
+I'm a bit confused about the exercise. For exercise 1 I extended the Java webserver from exercise sheet 2 with the prefix search of the last exercise sheet. The webserver returns the results of the prefix search as an XML document. Should I have used a webserver like apache? But I also had some problems with sending the jQuery request to the server. The webserver runs on port 80. I started with Firefox. Firefox sends an OPTIONS request to the server and so the jQuery get-function doesn't work. The same happened when I used Google Chrome. Because the Java webserver can't handle PHP I can't do it like in the lecture. So I tried Internet Explorer and this browser sends a GET request when using the jQuery get-function. I assumed I could continue with the exercise, but though I did it like in the lecture, nothing happened. I used the alert-function to check that I really get the XML document from the server (and I got it). Now I know that the find-function doesn't work with Internet Explorer. After this I tried Safari. Safari sends a GET request and the find-function works, too. Now I can continue and build the tables as described on the exercise sheet. Is it OK to go on like that? '''Manuela 05Dec09 19:24'''
-Line 33:
+Line 31:
-Hi Marius + all: For Exercise 4, an inverted list of size m with doc ids from the range [1..n] is simply a sorted list of m numbers from the range [1..n]. I leave it to you whether your lists potentially contain duplicates (as in 3, 5, 5, 8, 12, ...) or whether you generate them in a way that they don't contain duplicates (as in 3, 5, 8, 17, ...). It doesn't really matter for the exercise whether your list has duplicates or not. In any case, consider simple flat lists like in the two examples I gave (and like all the examples I gave in this and past lectures), not lists of lists or anything. '''Hannah 15Nov09 1:12am'''
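
A minimal Java sketch of such a list, assuming the variant where duplicates are allowed: draw m random doc ids from [1..n] and sort them. The class and method names are hypothetical, and the exercise sheet may intend a different way of generating the list.

```java
import java.util.Arrays;
import java.util.Random;

/** Sketch: generate a flat inverted list as a sorted array of doc ids. */
final class InvertedListGenerator {
    /** Returns m doc ids drawn uniformly from [1..n], sorted ascending; duplicates may occur. */
    static int[] randomInvertedList(int m, int n, Random random) {
        int[] docIds = new int[m];
        for (int i = 0; i < m; i++) {
            docIds[i] = 1 + random.nextInt(n); // nextInt(n) lies in [0, n-1]
        }
        Arrays.sort(docIds);
        return docIds;
    }
}
```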
+Hi Alex, can you be more specific about what exactly did not work for you and what you had to do to make it work? In particular, what do you mean by "the server directory"? Do you mean apache's document root? Then where have your files been before? In a subdirectory of the root? And what do you mean by a GET request being turned into an OPTIONS request, and how did you arrive at the conclusion that this is what happens? It should not matter if the .php file is in a different directory than the .js file. My feeling is that your problem lies elsewhere, but it's hard to tell from the information you gave so far. '''Hannah 05Dec09 18:02'''
-Line 35:
+Line 33:
-@Mirko: Sure, but an inverted list is a list of words where each word has attached to it the Doc-IDs of the documents in which it occurs. So for example: If word no. 5 occurs in Doc1, Doc2 and Doc3 and word no. 2 occurs in Doc5, the list would look like: 5 -> Doc1, Doc2, Doc3; 2 -> Doc5. Or am I mistaken? My question then is, how long should these attached lists be on average? I mean, one could imagine that we have 1 million documents over 3 words, so these lists could get very large...
+@whom it may concern: for me the access-rights stuff did not work exactly as in the lecture - I had to move the whole site (.html .js ...) into the server directory. Maybe it's new in Firefox 3.5 but I could not access any file on the server from a .js not being in the server directory - it always turned my GET requests into OPTIONS requests and nothing was returned - so the php-solution does not seem to work, even if my server was able to execute php. Were we supposed to do it like this anyway or is it completely wrong this way? '''alex 5Dec09 17:56'''
-Line 37:
+Line 35:
-EDIT: Oh ok. Now, I see your point. It's not an index, it's a list. Okay. So, what is an inverted list with Doc-IDs, then?
+Ok, the recording of Lecture 7 is now available as AVI. But beware, it's quite big: around 300 MB. '''Hannah 3Dec09 22:46'''
-Line 39:
+Line 37:
-EDIT EDIT: And to your question, Mirko, take a look at http://snippets.dzone.com/posts/show/93. Especially at Comment no. 2. Maybe this helps... I think, Java supports StreamWriters/Readers that are able to write/read bytes. '''Marius 11/14/2009 08:46pm'''
+To play the .camrec recording you need the full Camtasia Studio (you can download a 30-day test version if you want). I will soon upload an .avi version instead. '''Hannah 3Dec09 21:56'''
-Line 41:
+Line 39:
-EDIT EDIT EDIT: Sorry, me again. Well, I bothered Wikipedia, which redirects from http://en.wikipedia.org/wiki/Inverted_list to Inverted Index. So it seems to me that the two terms are being used as synonyms. Actually, I think I'm confused enough now. I'd better wait for any responses... ;-) '''Marius 11/14/2009 9:08 pm'''

@ Marius: i think we are supposed to generate one inverted __list__ of size m, with doc ids from 1..n (therefore n>=m, because no duplicates?).

Now a question from my side: ex.4, programming the compression in __java__, is there any __good__ tutorial about how to handle the bit-stuff? (otherwise, i think, it would cost me too much time..) '''Mirko 14Nov09, 19:18'''

Hi, do you have any suggestions what the best numbers for m and n in exercise 4 should look like? Or are we supposed to mess around a bit with ints and longs? And: How long should the list of documents in the inverted index be? '''Marius 14Nov09 6:40pm'''

And just to clarify what a single-cycle permutation is. Here is an example for an array of size 5 with a permutation that is a single cycle: 5 4 1 3 2. Why single cycle? Well, A[1] = 5, A[5] = 2, A[2] = 4, A[4] = 3, A[3] = 1. (My indices in this example are 1,...,5 and not 0,...,4.) Here is an example of a permutation with three cycles: 2 1 4 3 5. The first cycle is A[1] = 2, A[2] =1. The second cycle is A[3] = 4, A[4] = 3. The third cycle is A[5] = 5. '''Hannah 12Nov09 8:04pm'''

Hi Daniel + all, I don't quite understand your question and your example (if your array is 1 5 3 4 2, why is A[1] = 3?). In case you refer to the requirement of the exercise that the permutation consists only of a single cycle: that is because your code should go over each element exactly once (it should, of course, stop after n iterations, where n is the size of the array). If your permutation has more than one cycle, it is hard to achieve that. Also note that for both (1) and (2), the sum of the array values should be sum_i=1,...,n i = n * (n+1) / 2. '''Hannah 12Nov09 7:54pm'''
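
The two example permutations above can be checked with a small Java sketch. It counts cycles, sums the array by following i -> A[i] for n steps, and generates a random single-cycle permutation with Sattolo's algorithm. That algorithm is one known way to do this and is my assumption here, not necessarily what the exercise sheet intends; the class and method names are made up. Values are 1-based as in the examples, so perm[i - 1] holds A[i].

```java
import java.util.Random;

/** Sketch: cycles of a permutation stored as 1-based values in a 0-based array. */
final class PermutationCycles {
    /** Counts the cycles of the permutation. */
    static int countCycles(int[] perm) {
        boolean[] visited = new boolean[perm.length];
        int cycles = 0;
        for (int start = 0; start < perm.length; start++) {
            if (visited[start]) continue;
            cycles++;
            int i = start;
            while (!visited[i]) { // walk the cycle that contains start
                visited[i] = true;
                i = perm[i] - 1;
            }
        }
        return cycles;
    }

    /** Sums n values by starting at A[1] and always jumping to A[A[i]]. */
    static long sumByFollowing(int[] perm) {
        long sum = 0;
        int i = 0;
        for (int step = 0; step < perm.length; step++) {
            sum += perm[i];
            i = perm[i] - 1;
        }
        return sum;
    }

    /** Sattolo's algorithm: a uniformly random permutation with exactly one cycle. */
    static int[] randomSingleCycle(int n, Random random) {
        int[] perm = new int[n];
        for (int i = 0; i < n; i++) perm[i] = i + 1;
        for (int i = n - 1; i >= 1; i--) {
            int j = random.nextInt(i); // j lies in [0, i-1], never i itself
            int tmp = perm[i];
            perm[i] = perm[j];
            perm[j] = tmp;
        }
        return perm;
    }
}
```

For 5 4 1 3 2 the jumps visit 5, 2, 4, 3, 1 and the sum is 15 = 5 * 6 / 2. For 2 1 4 3 5 the jumps never leave the first cycle, so the sum after five steps is 2 + 1 + 2 + 1 + 2 = 8 instead of 15.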

Hi, I just looked at the new exercise sheet 4. In exercise 1 we should generate a permutation and sum up the resulting array. Am I wrong, or does iteration method two not iterate through the whole array in every situation? For example: n = 5, permutation: 1 5 3 4 2, then A[1] = 3, A[A[1]] = A[3] = 1, A[1] = 3 ... '''Daniel 12Nov09 19:44pm'''
+For your reference and convenience, here is a [[attachment:prefix-search.tar|tar archive of the files which we wrote together in Lecture 7 (prefix-search.html, prefix-search.css, prefix-search.js, prefix-search.php)]]. '''Hannah 3Dec09 21:35'''