SearchEnginesWS0910 - AD Teaching Wiki

Welcome to the Wiki page of the course Search Engines, WS 2009 / 2010. Lecturer: Hannah Bast. Tutorials: Marjan Celikik. Course web page: click here.

Here are PDFs of the slides of the lectures so far: Lecture 1, Lecture 2, Lecture 3, Lecture 4, Lecture 5, Lecture 6, Lecture 7, Lecture 8, Lecture 9.

Here are the recordings of the lectures so far (except Lecture 2, where we had problems with the microphone), LPD = Lecturnity recording: Recording Lecture 1 (LPD), Recording Lecture 3 (LPD), Recording Lecture 4 (LPD), Recording Lecture 5 (LPD without audio), Recording Lecture 6 (LPD), Recording Lecture 7 (AVI), Recording Lecture 8 (AVI), Recording Lecture 9 (AVI).

Here are PDFs of the exercise sheets so far: Exercise Sheet 1, Exercise Sheet 2, Exercise Sheet 3, Exercise Sheet 4, Exercise Sheet 5, Exercise Sheet 6, Exercise Sheet 7, Exercise Sheet 8, Exercise Sheet 9.

Here are your solutions and comments on the previous exercise sheets: Solutions and Comments 1, Solutions and Comments 2, Solutions and Comments 3, Solutions and Comments 4, Solutions and Comments 5, Solutions and Comments 6, Solutions and Comments 7, Solutions and Comments 8.

The recordings of all lectures are now available, see above. Lecture 2 is missing because we had technical problems there. To play the Lecturnity recordings (.lpd files) you need the Lecturnity Player, which you can download here. I put the Camtasia recordings as .avi files, which you can play with any ordinary video player; I would recommend VLC.

Here are the rules for the exercises as explained in Lecture 2.

Here is everything about the mid-term exam.

Questions and comments about Exercise Sheet 9 below this line (most recent on top)

Thanks, Johannes. And does anyone know where the upload page for exercise 9 is? Zhongjie 13 Jan 10 17:21

Instead of just casting you should do: intValue = 0xFF & (int) byteValue; in Java. Johannes 2010-01-13T1509

Regarding "Java OutOfMemory Exception" : I'm not so familiar with Java, but I have found a possible way to set the JVM runtime heap size by command line arguments: java -Xms512m -Xmx1536m classname. This way you set your JVM to initially use 512m memory and to use maximum of 1536m memory. I think most of our pool's computers have more than that physical memory. And Now you should use byte array to store your random sequence and cast it to integer when doing comparisons. I hope this could do some help. Zhongjie 13Jan10 10:00

Oh, and BTW2, it doesn't matter which random generator you use, anything half-way random is fine. Hannah 13Jan10 00:18

BTW, it doesn't matter whether you measure elapsed time (actual real time used) or CPU time (only the CPU time used), both are fine. In case you measure elapsed time, make sure though that the machine is not under heavy load by some other job. Hannah 13Jan10 00:17

Looks like only PHP is missing. Here is how you get a random number in PHP: http://php.net/manual/en/function.rand.php, and here is how you measure the elapsed time in microseconds: http://de2.php.net/manual/en/function.microtime.php. Hannah 13Jan10 0:15

Yes, exactly, you shouldn't use time because then you time the whole program, and we don't want to time the random number generation, because that is probably slower than the rest and we don't want to compare random number generation times in different programming languages. As for the out of memory error. Please try it on a machine with >= 4 GB. If you also get out of memory errors there, or absolutely can't get access to such a machine, you can do as you suggested and reduce the input size to 10^7 or even less if necessary. Hannah 12Jan10 22:42

Using the time program results in counting the time of the random number generator too. The random number generator are to different so that this may be a bad idea. Johannes 2010-01-12T2220

I have two questions: 1. Can I stop the time by using the time function at the bash shell, like presented at the lecture? 2. I've wrote the three programs but at n = 10⁹ all programs go out of memory. I use arrays to generate the byte sequences. Is it ok to reduce n to 10⁸ or 10⁷ for the time tests? Alex 12.Jan 21:29

Python:

(Command line arguments) import sys

length = int(sys.argv[1])

(Random numbers) import random

def getRandomByteString(string, length):

for i in range(length):
- string.append(random.randint(0, 255))

Johannes 12Jan10 20:55

Here's some code for C++ and JAVA and PERL (most of you probably know it):

A precise Timer class in C++ (use timer.start() to start the timer, timer.stop() to stop it and timer.usecs() for microseconds. For more documentation see the source)

Some standard C++ code for generating random bytes:

srand(time(NULL)); // you call it only once

unsigned char x = rand() % 256; // generates a random number in 0...255, not perfectly uniformly at random though

System.nanoTime(); // time in nanoseconds

Random randomGenerator = new Random();

char x = randomGenerator.nextInt(256); // generates a random number in 0...255

$x = int(rand(256)); // returns a random number in 0..255 in perl

$start = [Time::HiRes::gettimeofday()];

$finish = Time::HiRes::tv_interval($start); // elapsed time in perl

Marjan 12Jan10 20:17

Yes, sorry I forgot, I was travelling over the last days and today just returned from a rather butchery removal of one of my wisdom teeth and I am completely groggy now. I'll try to add the code later in the evening when I have recovered a bit, or maybe Marjan can do it in the meantime. If anyone of you knows the code for one of C++, JAVA, Perl, Python, PHP, please feel free to post it, too. Remember, the idea of a Wiki is that everybody can contribute. Hannah 12Jan10 19:43

Do we get code for timing and random numbers? Johannes 12Jan10 18:46

Hi Marius, the exercise is simpler than that. Leave every valid UTF-8 sequence intact, and for every invalid UTF-8 sequence replace it by what you want, e.g. zeros. You don't have to guess what an invalid sequence means, or convert from one format to another, or anything like that. Hannah 12Jan10 18:42

The solutions of the mid-term are posted. Marjan 15:38 12.01.2010

So do I understand it correclty that by saying "repairing" a string you intend to say that only the encoding in the end has to be valid? Because by string repair I understand that if the encoded letter is not UTF-8 we have to reencode it into UTF-8. So for example, if you get a UTF-32 char ݮ it would have to be encoded into UTF-8. What I want to say is that you could possibly "damage" the character, when you just repair the bits of the encoding what would cause in changing the semantics of the whole string. I hope it's clear what I want to ask... Marius 01/12/2010 2:40 p.m.

Please note that the new deadline for the exercise sheets is Thursday, 4 pm, that is, just before the lecture. Hannah 12Jan10 11:58

Thanks for the clarification. I understand it now entirely. I didn't think the exercise was stupid, but I thought my solution was. Simply changing the first bit to 0 for each and every byte would have been stupid in a way that it "works" but hasn't to do much with the real possiblities of utf-8. Now, I'll do it just like you described it. Björn 10.1. 19:46

Hi Björn + all: indeed, you don't have to do anything particularly fancy for repairing the string. When you have repaired the string up to some point, and the next character is an invalid start of a UTF-8 multi-byte sequence, you may just replace it by 0 or something like that. If your UTF-8 sequence started successfully and the first byte indicates that it's a k-byte sequence, and one of the next k-1 bytes is invalid, you can replace the whole k-byte sequence by anything valid you want. The one thing you are not allowed to do is change a valid sequence, those should stay as they are! I agree that this is not very difficult, but it's also not trivial. In particular, you do need the solution for Exercise 1. Note that the point of exercises 2 - 4 is not to write a complex or difficult or tricky program but to compare a relatively simple (but not too simple) program in three different programming languages. I do not understand though, why you think the programming task is stupid. Maybe there is still some other misunderstanding there. It's a very common application that you have a long UTF-8 string which is invalid at a few places, and you want to feed it to another application which will crash when given a string that contains invalid UTF-8 sequences. You then want to make the string valid leaving the parts that were already valid intact as much as possible. If this is not clear, please ask again. Hannah 10Jan10 14:11

I have a question concerning exercises 2-4: Is there any restriction saying that something indicates the length of a single UTF-8 sequence inside my large, random sequence? From how I read the description of the exercise, it seems as if it was valid to simply change each and every byte (when necessary) to be a 1-byte sequence. This way I have to change very few bits and end up with n UTF-8 sequences for ascii characters. However, this seems to be quite stupid and really easy. Maybe i missed / misread something. I would be thankful for comments. Björn 10.Jan 13:57

Note that the first lecture after the winter break will be on Thursday, January 7. Apart from a new regular topic, in that lecture I will also summarize your feedback from Exercise 4 of Sheet 8, and give my comments on it + explain various new rules which hopefully suit you better. Hannah 19Dec09 3:50am

We have already corrected your exams. For the summary of the results etc. follow the link above. There it will also say where you find your individual results. Hannah 19Dec09 3:58

AD Teaching Wiki: SearchEnginesWS0910 (last edited 2010-01-13 17:21:29 by remote218-142)