Welcome to the Wiki page of the course '''Search Engines, WS 2009 / 2010'''. Lecturer: [[http://ad.informatik.uni-freiburg.de/staff/bast|Hannah Bast]]. Tutorials: [[http://ad.informatik.uni-freiburg.de/staff/celikik|Marjan Celikik]]. Course web page: [[http://ad.informatik.uni-freiburg.de/teaching/winter-term-2009-2010/suchmaschinen-vorlesung|click here]].

Here are PDFs of the slides of the lectures so far: [[attachment:SearchEnginesWS0910/lecture-1.pdf|Lecture 1]], [[attachment:SearchEnginesWS0910/lecture-2.pdf|Lecture 2]], [[attachment:SearchEnginesWS0910/lecture-3.pdf|Lecture 3]], [[attachment:SearchEnginesWS0910/lecture-4.pdf|Lecture 4]], [[attachment:SearchEnginesWS0910/lecture-5.pdf|Lecture 5]], [[attachment:SearchEnginesWS0910/lecture-6.pdf|Lecture 6]], [[attachment:SearchEnginesWS0910/lecture-7.pdf|Lecture 7]], [[attachment:SearchEnginesWS0910/lecture-8.pdf|Lecture 8]], [[attachment:SearchEnginesWS0910/lecture-9.pdf|Lecture 9]].

Here are the recordings of the lectures so far (except Lecture 2, where we had problems with the microphone), LPD = Lecturnity recording: [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-1.lpd|Recording Lecture 1 (LPD)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-3.lpd|Recording Lecture 3 (LPD)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-4.lpd|Recording Lecture 4 (LPD)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-5.lpd|Recording Lecture 5 (LPD without audio)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-6.lpd|Recording Lecture 6 (LPD)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-7.avi|Recording Lecture 7 (AVI)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-8.avi|Recording Lecture 8 (AVI)]], [[http://vulcano.informatik.uni-freiburg.de/lecturnity/lecture-9.avi|Recording Lecture 9 (AVI)]].

Here are PDFs of the exercise sheets so far: [[attachment:SearchEnginesWS0910/exercise-1.pdf|Exercise Sheet 1]], [[attachment:SearchEnginesWS0910/exercise-2.pdf|Exercise Sheet 2]], [[attachment:SearchEnginesWS0910/exercise-3.pdf|Exercise Sheet 3]], [[attachment:SearchEnginesWS0910/exercise-4.pdf|Exercise Sheet 4]], [[attachment:SearchEnginesWS0910/exercise-5.pdf|Exercise Sheet 5]], [[attachment:SearchEnginesWS0910/exercise-6.pdf|Exercise Sheet 6]], [[attachment:SearchEnginesWS0910/exercise-7.pdf|Exercise Sheet 7]], [[attachment:SearchEnginesWS0910/exercise-8.pdf|Exercise Sheet 8]], [[attachment:SearchEnginesWS0910/exercise-9.pdf|Exercise Sheet 9]].

Here are your solutions and comments on the previous exercise sheets: [[SearchEnginesWS0910/ExerciseSheet1|Solutions and Comments 1]], [[SearchEnginesWS0910/ExerciseSheet2|Solutions and Comments 2]], [[SearchEnginesWS0910/ExerciseSheet3|Solutions and Comments 3]], [[SearchEnginesWS0910/ExerciseSheet4|Solutions and Comments 4]], [[SearchEnginesWS0910/ExerciseSheet5|Solutions and Comments 5]], [[SearchEnginesWS0910/ExerciseSheet6|Solutions and Comments 6]], [[SearchEnginesWS0910/ExerciseSheet7|Solutions and Comments 7]], [[SearchEnginesWS0910/ExerciseSheet8|Solutions and Comments 8]].

The recordings of all lectures are now available, see above. Lecture 2 is missing because we had technical problems there. To play the Lecturnity recordings (.lpd files) you need the [[http://www.lecturnity.de/de/download/lecturnity-player|Lecturnity Player, which you can download here]]. I put the Camtasia recordings as .avi files, which you can play with any ordinary video player; I would recommend [[http://www.videolan.org/vlc|VLC]].

[[SearchEnginesWS0910/Rules|Here are the rules for the exercises as explained in Lecture 2]].

[[SearchEnginesWS0910/MidTermExam|Here is everything about the mid-term exam]].

[[SearchEnginesWS0910/ExerciseSheet9|Here you can upload your solutions for Exercise Sheet 9. Note that the deadline is Thursday 14Jan10 at 4 pm that is just before the lecture (where you will get the next exercise sheet, so the only effect of giving you more time would be procrastination).]]

== Questions and comments about Exercise Sheet 9 below this line (most recent on top) ==

To Marjan: No, I am using Windows and Microsoft Visual Studio 2008. Unfortunately I haven't done much with C++ for quite a long time now, so I am not that familiar with it anymore. '''Florian 13Jan10 20:27'''

Did anyone use PHP? I'm using a 4 GB RAM machine and I changed the maximum memory value in the php.ini file to "-1". But even so I can't use n = 10^7 and more, because I get an out of memory failure when the script allocates more than 1,8 GB. The same code works in Java with less memory usage. '''Manuela 13 Jan 10 20:13'''

To Björn: You should use {{{cout << smth. << endl << flush;}}} so that the buffer gets flushed to the console before executing the problematic line/segment of code. '''Marjan'''

Tp Florian: We are using the Timer.h all the time and it works well. Are you using linux and gcc? '''Marjan'''

The c++ timer class from Marjan does not work for me, after including it I get 10 errors. I tried to fix them but didn't manage to. Any other proposals for an easy time measuring in c++? '''Florian 13 Jan 10 20:03'''

Hi. I tried adding a printf stmt at the first line of my main method. The output in the console only appears if I outcomment the method call for fixing the utf8 string (for the 10 to the 9 case) which is far later in the main method. It also works if I lower the numbers to 10 to the 6 in that example (which leaves me with two tests using 10 to the 6). EDIT: just saw the other hint for gdb, trying this out now.'''Björn'''

Hi Björn, it would be helpful if you would tell us which line of the source code causes the segmentation fault. Since the code we are talking about here is quite small, the easiest way to do this is just place some cout / printf statements in the code, and then see which one is the last one that prints. A more sophisticated method is to run gdb <binary>, where binary has been compiled with the options -O 0 -g, that will also tell you at which line the code crashes. '''Hannah 13Jan10 19:27'''

I seem to get Segmentation Faults in C++ if I include the 10^9 case. I have no idea why, especially since normally I'm all into java and really don't know anything about C++ ... PS: my solution in Java semms to work on the same machine '''Björn 13.1. 19:14'''

Normally I use VLC too but unfortunately my graphics driver crashed when I tried to watch one of the lecture recordings with VLC, so I needed the codec for another player. '''Florian 13Jan10 18:45'''

Well, I would use what I use to watch all my other movies, too, and that is the [[http://www.videolan.org/vlc|VLC video player]]. It's also written above, in the paragraph starting ''"The recordings of all lectures ..."'' '''Hannah 13Jan10 18:32'''

What video codec do I need to watch the avi lecture recordings? I have only the sound here on my temporary notebook which I must use while my regular notebook is on repair. '''Florian 13 Jan 10 17:55'''
Ok, I found it myself, it is the "!TechSmith Screen-Capture Codec" -> http://www.techsmith.de/download/codecs.asp '''Florian 13 Jan 10 18:16'''

Thanks, Johannes. And does anyone know where the upload page for exercise 9 is? '''Zhongjie 13 Jan 10 17:21'''

Instead of just casting you should do: intValue = 0xFF & (int) byteValue; in Java. '''Johannes 2010-01-13T1509'''

Regarding "Java OutOfMemory Exception" : I'm not so familiar with Java, but I have found a possible way to set the JVM runtime heap size by command line arguments: java -Xms512m -Xmx1536m classname. This way you set your JVM to initially use 512m memory and to use maximum of 1536m memory. I think most of our pool's computers have more than that physical memory. And Now you should use byte array to store your random sequence and cast it to integer when doing comparisons. I hope this could do some help. '''Zhongjie 13Jan10 10:00'''

Oh, and BTW2, it doesn't matter which random generator you use, anything half-way random is fine. '''Hannah 13Jan10 00:18'''

BTW, it doesn't matter whether you measure elapsed time (actual real time used) or CPU time (only the CPU time used), both are fine. In case you measure elapsed time, make sure though that the machine is not under heavy load by some other job. '''Hannah 13Jan10 00:17'''

Looks like only PHP is missing. Here is how you get a random number in PHP: http://php.net/manual/en/function.rand.php, and here is how you measure the elapsed time in microseconds: http://de2.php.net/manual/en/function.microtime.php. '''Hannah 13Jan10 0:15'''

Yes, exactly, you shouldn't use ''time'' because then you time the whole program, and we don't want to time the random number generation, because that is probably slower than the rest and we don't want to compare random number generation times in different programming languages. As for the ''out of memory'' error. Please try it on a machine with >= 4 GB. If you also get out of memory errors there, or absolutely can't get access to such a machine, you can do as you suggested and reduce the input size to 10^7 or even less if necessary. '''Hannah 12Jan10 22:42'''

Using the time program results in counting the time of the random number generator too. The random number generator are to different so that this may be a bad idea.  '''Johannes 2010-01-12T2220'''

I have two questions: 1. Can I stop the time by using the time function at the bash shell, like presented at the lecture? 2. I've wrote the three programs but at n = 10^9^ all programs go out of memory. I use arrays to generate the byte sequences. Is it ok to reduce n to 10^8^ or 10^7^ for the time tests? '''Alex 12.Jan 21:29'''

Python:

(Command line arguments) import sys

length = int(sys.argv[1])

(Random numbers) import random

def getRandomByteString(string, length):

 . for i in range(length):
  . string.append(random.randint(0, 255))

'''Johannes 12Jan10 20:55'''

Here's some code for C++ and JAVA and PERL (most of you probably know it):

A precise [[attachment:Timer.h|Timer class in C++]] (use timer.start() to start the timer, timer.stop() to stop it and timer.usecs() for microseconds. For more documentation see the source)

Some standard C++ code for generating random bytes:

{{{srand(time(NULL));}}}  // you call it only once

{{{unsigned char x = rand() % 256;}}}  // generates a random number in 0...255, not perfectly uniformly at random though

{{{System.nanoTime();}}}  // time in nanoseconds

{{{Random randomGenerator = new Random();}}}

{{{char x = randomGenerator.nextInt(256);}}}  // generates a random number in 0...255

{{{$x = int(rand(256));}}}  // returns a random number in 0..255 in perl

{{{$start = [Time::HiRes::gettimeofday()];}}}

{{{$finish = Time::HiRes::tv_interval($start);}}}  // elapsed time in perl

'''Marjan 12Jan10 20:17'''

Yes, sorry I forgot, I was travelling over the last days and today just returned from a rather butchery removal of one of my wisdom teeth and I am completely groggy now. I'll try to add the code later in the evening when I have recovered a bit, or maybe Marjan can do it in the meantime. If anyone of you knows the code for one of C++, JAVA, Perl, Python, PHP, please feel free to post it, too. Remember, the idea of a Wiki is that everybody can contribute. '''Hannah 12Jan10 19:43'''

Do we get code for timing and random numbers? '''Johannes 12Jan10 18:46'''

Hi Marius, the exercise is simpler than that. Leave every valid UTF-8 sequence intact, and for every invalid UTF-8 sequence replace it by what you want, e.g. zeros. You don't have to guess what an invalid sequence means, or convert from one format to another, or anything like that. '''Hannah 12Jan10 18:42'''

The solutions of the mid-term are posted. '''Marjan 15:38 12.01.2010'''

So do I understand it correclty that by saying "repairing" a string you intend to say that only the encoding in the end has to be valid? Because by string repair I understand that if the encoded letter is not UTF-8 we have to reencode it into UTF-8. So for example, if you get a UTF-32 char ݮ it would have to be encoded into UTF-8. What I want to say is that you could possibly "damage" the character, when you just repair the bits of the encoding what would cause in changing the semantics of the whole string. I hope it's clear what I want to ask... ;) '''Marius 01/12/2010 2:40 p.m.'''

Please note that the new deadline for the exercise sheets is Thursday, 4 pm, that is, just before the lecture. '''Hannah 12Jan10 11:58'''

Thanks for the clarification. I understand it now entirely. I didn't think the exercise was stupid, but I thought my solution was. Simply changing the first bit to 0 for each and every byte would have been stupid in a way that it "works" but hasn't to do much with the real possiblities of utf-8. Now, I'll do it just like you described it. '''Björn 10.1. 19:46'''

Hi Björn + all: indeed, you don't have to do anything particularly fancy for repairing the string. When you have repaired the string up to some point, and the next character is an invalid start of a UTF-8 multi-byte sequence, you may just replace it by 0 or something like that. If your UTF-8 sequence started successfully and the first byte indicates that it's a k-byte sequence, and one of the next k-1 bytes is invalid, you can replace the whole k-byte sequence by anything valid you want. The one thing you are not allowed to do is change a valid sequence, those should stay as they are! I agree that this is not very difficult, but it's also not trivial. In particular, you do need the solution for Exercise 1. Note that the point of exercises 2 - 4 is not to write a complex or difficult or tricky program but to compare a relatively simple (but not too simple) program in three different programming languages. I do not understand though, why you think the programming task is stupid. Maybe there is still some other misunderstanding there. It's a very common application that you have a long UTF-8 string which is invalid at a few places, and you want to feed it to another application which will crash when given a string that contains invalid UTF-8 sequences. You then want to make the string valid leaving the parts that were already valid intact as much as possible. If this is not clear, please ask again. '''Hannah 10Jan10 14:11'''

I have a question concerning exercises 2-4: Is there any restriction saying that something indicates the length of a single UTF-8 sequence inside my large, random sequence? From how I read the description of the exercise, it seems as if it was valid to simply change each and every byte (when necessary) to be a 1-byte sequence. This way I have to change very few bits and end up with n UTF-8 sequences for ascii characters. However, this seems to be quite stupid and really easy. Maybe i missed / misread something. I would be thankful for comments. '''Björn 10.Jan 13:57'''

Note that the first lecture after the winter break will be on Thursday, January 7. Apart from a new regular topic, in that lecture I will also summarize your feedback from Exercise 4 of Sheet 8, and give my comments on it + explain various new rules which hopefully suit you better. '''Hannah 19Dec09 3:50am'''

We have already corrected your exams. For the summary of the results etc. follow the link above. There it will also say where you find your individual results. '''Hannah 19Dec09 3:58'''