Exercise Sheet 9

The rules for uploading are the same as always. If you forgot them, you can read them again here.

Your solutions (files can only be read by the uploader and by us)

Name	Solution (PDF)	Code (ZIP or TGZ)
Achille Nana	PDF	ZIP
Alexander Gutjahr	PDF	ZIP
Alexander Nutz	PDF	ZIP
Alexander Schneider	PDF	ZIP
Björn Buchhold	PDF	ZIP
Claudius Korzen	PDF	ZIP
Daniel Schauenberg	PDF	tar.gz
Dragos Sorescu	PDF	ZIP
Eric Lacher	PDF	ZIP
Florian Bäurle	PDF	ZIP
Jens Silva Santisteban	PDF	ZIP
Johannes A. Stork	PDF	tar.gz
Jonas Krisch	PDF	ZIP
Manuela Ortlieb	PDF	ZIP
Marius Greitschus	PDF	TGZ
Markus Gruetzner	PDF	ZIP
Matthias Frorath	PDF	ZIP
MatthiasSauer	PDF	tar.gz
Mirko Brodesser	PDF	ZIP
Paresh Paradkar	PDF	ZIP
Waleed Butt	PDF	ZIP
Zhongjie Cai	PDF	ZIP
Richard Zahoransky	PDF	ZIP

These were the questions and comments for Exercise Sheet 9

Here is a nanoseconds-timer for C++:

#include <sys/time.h>

...

timespec ts;

clock_gettime(CLOCK_REALTIME, &ts);

The timespec struct has two members: tv_sec and tv_nsec. So you need to take the time at the beginning and at the end. Then you simply take the difference. To get the nanoseconds at a certain point in time you need to multiply the tv_sec (seconds) value by 10e9 and add the tv_nsec value (nanoseconds). By using this method you will need to link against librt (-lrt) when compiling. Jens 15Jan10 13:40

I had the same problem as you Björn, but my OS didn't tell me anything. It said only that my program doesn't work anymore and it wanted to find a solution for me. Of course it didn't. Using the heap was the solution. To Johannes, I'm sure that the PHP ini file is used, because with the standard values it is not possible to run the script. It would only allow 128 MB and a running time of 60 seconds. The error message is only: Fatal error: Out of memory (allocated 1970536448) (tried to allocate 24 bytes) Manuela 15Jan10 11:25

Right, an expert would have used an std::vector<unsigned char> anyway. Johannes 2010-01-14T1119

Maybe someone who is no expert in C++ (like me) may find this helpful: I used to get segmentation faults with array size of more than 10 to the 6. It was because I had them allocated on the stack. I google'd this link: http://stackoverflow.com/questions/1847789/segmentation-fault-on-large-array-sizes , changed it in my code and that 1 second of work did it for me. works for 10 to the 9 now as well. I'm sure c++ expert know this anyway, but it was new to me and maybe it helps someone else. GoognightBjörn 14.1. 1:56

Sure, characterization by enumeration is also fine, as long as you don't write down (or print) lists of hundreds of bytes. Hannah 14Jan10 00:18

Are the invalid members of a finite set also described by completely naming all valid members of the same set? If so, can I do that for Ex 1? Seems more easy to me. Johannes 2010-01-13T2352

Manuela, state the complete error message and ensure that the php ini file is used. Johannes 2010-01-13T2340

Yes, thanks for the correction. I am still under the influence of my wisdom tooth surgery from Tuesday, and my brain works only 50% or so. I hope, but cannot promise, that it will be better tomorrow in the lecture. BTW, tomorrows lecture will be about Latent Semantic Indexing (LSI), a fancy and surprising method to use Eigenvector decomposition from linear algebra to automatically find synonyms in large document collection, without having to understand what the words mean, just from their statistical co-occurrences. Fascinating stuff. Hannah 13Jan10 21:47

Thanks a lot for the alternative C++ time measure code. Just a little correction if someone else needs it, the difference is the wrong way round, so the code below will produce negative times and by using a floating point type one can also yield results < 1 second. Florian 13Jan10 21:09 :

double time_elapsed = (double)(clock() - time) / CLOCKS_PER_SEC;

Don't worry too much about the out of memory issues. If you don't get it to work for the largest array in reasonable time with a particular language, just leave out that measurement, you won't get points subtracted. Of course, you shouldn't leave all measurements blank, and you should try a little bit to make it work, say half an hour. Hannah 13Jan10 20:51

A simple alternative way to measure CPU time in C++ is via the clock function (man 3 clock) Hannah 13Jan10 20:49 :

#include <time.h>

clock_t time = clock();
... // code for which time is to be measured
int time_elapsed = (int)(time - clock())/CLOCKS_PER_SEC;

To Marjan: No, I am using Windows and Microsoft Visual Studio 2008. Unfortunately I haven't done much with C++ for quite a long time now, so I am not that familiar with it anymore. Florian 13Jan10 20:27

Did anyone use PHP? I'm using a 4 GB RAM machine and I changed the maximum memory value in the php.ini file to "-1". But even so I can't use n = 10^7 and more, because I get an out of memory failure when the script allocates more than 1,8 GB. The same code works in Java with less memory usage. Manuela 13 Jan 10 20:13

To Björn: You should use cout << smth. << endl << flush; so that the buffer gets flushed to the console before executing the problematic line/segment of code. Marjan

Tp Florian: We are using the Timer.h all the time and it works well. Are you using linux and gcc? Marjan

The c++ timer class from Marjan does not work for me, after including it I get 10 errors. I tried to fix them but didn't manage to. Any other proposals for an easy time measuring in c++? Florian 13 Jan 10 20:03

Hi. I tried adding a printf stmt at the first line of my main method. The output in the console only appears if I outcomment the method call for fixing the utf8 string (for the 10 to the 9 case) which is far later in the main method. It also works if I lower the numbers to 10 to the 6 in that example (which leaves me with two tests using 10 to the 6). EDIT: just saw the other hint for gdb, trying this out now.Björn

Hi Björn, it would be helpful if you would tell us which line of the source code causes the segmentation fault. Since the code we are talking about here is quite small, the easiest way to do this is just place some cout / printf statements in the code, and then see which one is the last one that prints. A more sophisticated method is to run gdb <binary>, where binary has been compiled with the options -O 0 -g, that will also tell you at which line the code crashes. Hannah 13Jan10 19:27

I seem to get Segmentation Faults in C++ if I include the 10^9 case. I have no idea why, especially since normally I'm all into java and really don't know anything about C++ ... PS: my solution in Java semms to work on the same machine Björn 13.1. 19:14

Normally I use VLC too but unfortunately my graphics driver crashed when I tried to watch one of the lecture recordings with VLC, so I needed the codec for another player. Florian 13Jan10 18:45

Well, I would use what I use to watch all my other movies, too, and that is the VLC video player. It's also written above, in the paragraph starting "The recordings of all lectures ..." Hannah 13Jan10 18:32

What video codec do I need to watch the avi lecture recordings? I have only the sound here on my temporary notebook which I must use while my regular notebook is on repair. Florian 13 Jan 10 17:55 Ok, I found it myself, it is the "TechSmith Screen-Capture Codec" -> http://www.techsmith.de/download/codecs.asp Florian 13 Jan 10 18:16

Thanks, Johannes. And does anyone know where the upload page for exercise 9 is? Zhongjie 13 Jan 10 17:21

Instead of just casting you should do: intValue = 0xFF & (int) byteValue; in Java. Johannes 2010-01-13T1509

Regarding "Java OutOfMemory Exception" : I'm not so familiar with Java, but I have found a possible way to set the JVM runtime heap size by command line arguments: java -Xms512m -Xmx1536m classname. This way you set your JVM to initially use 512m memory and to use maximum of 1536m memory. I think most of our pool's computers have more than that physical memory. And Now you should use byte array to store your random sequence and cast it to integer when doing comparisons. I hope this could do some help. Zhongjie 13Jan10 10:00

Oh, and BTW2, it doesn't matter which random generator you use, anything half-way random is fine. Hannah 13Jan10 00:18

BTW, it doesn't matter whether you measure elapsed time (actual real time used) or CPU time (only the CPU time used), both are fine. In case you measure elapsed time, make sure though that the machine is not under heavy load by some other job. Hannah 13Jan10 00:17

Looks like only PHP is missing. Here is how you get a random number in PHP: http://php.net/manual/en/function.rand.php, and here is how you measure the elapsed time in microseconds: http://de2.php.net/manual/en/function.microtime.php. Hannah 13Jan10 0:15

Yes, exactly, you shouldn't use time because then you time the whole program, and we don't want to time the random number generation, because that is probably slower than the rest and we don't want to compare random number generation times in different programming languages. As for the out of memory error. Please try it on a machine with >= 4 GB. If you also get out of memory errors there, or absolutely can't get access to such a machine, you can do as you suggested and reduce the input size to 10^7 or even less if necessary. Hannah 12Jan10 22:42

Using the time program results in counting the time of the random number generator too. The random number generator are to different so that this may be a bad idea. Johannes 2010-01-12T2220

I have two questions: 1. Can I stop the time by using the time function at the bash shell, like presented at the lecture? 2. I've wrote the three programs but at n = 10⁹ all programs go out of memory. I use arrays to generate the byte sequences. Is it ok to reduce n to 10⁸ or 10⁷ for the time tests? Alex 12.Jan 21:29

Python:

(Command line arguments) import sys

length = int(sys.argv[1])

(Random numbers) import random

def getRandomByteString(string, length):

for i in range(length):
- string.append(random.randint(0, 255))

Johannes 12Jan10 20:55

Here's some code for C++ and JAVA and PERL (most of you probably know it):

A precise Timer class in C++ (use timer.start() to start the timer, timer.stop() to stop it and timer.usecs() for microseconds. For more documentation see the source)

Some standard C++ code for generating random bytes:

srand(time(NULL)); // you call it only once

unsigned char x = rand() % 256; // generates a random number in 0...255, not perfectly uniformly at random though

System.nanoTime(); // time in nanoseconds

Random randomGenerator = new Random();

char x = randomGenerator.nextInt(256); // generates a random number in 0...255

$x = int(rand(256)); // returns a random number in 0..255 in perl

$start = [Time::HiRes::gettimeofday()];

$finish = Time::HiRes::tv_interval($start); // elapsed time in perl

Marjan 12Jan10 20:17

Yes, sorry I forgot, I was travelling over the last days and today just returned from a rather butchery removal of one of my wisdom teeth and I am completely groggy now. I'll try to add the code later in the evening when I have recovered a bit, or maybe Marjan can do it in the meantime. If anyone of you knows the code for one of C++, JAVA, Perl, Python, PHP, please feel free to post it, too. Remember, the idea of a Wiki is that everybody can contribute. Hannah 12Jan10 19:43

Do we get code for timing and random numbers? Johannes 12Jan10 18:46

Hi Marius, the exercise is simpler than that. Leave every valid UTF-8 sequence intact, and for every invalid UTF-8 sequence replace it by what you want, e.g. zeros. You don't have to guess what an invalid sequence means, or convert from one format to another, or anything like that. Hannah 12Jan10 18:42

The solutions of the mid-term are posted. Marjan 15:38 12.01.2010

So do I understand it correclty that by saying "repairing" a string you intend to say that only the encoding in the end has to be valid? Because by string repair I understand that if the encoded letter is not UTF-8 we have to reencode it into UTF-8. So for example, if you get a UTF-32 char ݮ it would have to be encoded into UTF-8. What I want to say is that you could possibly "damage" the character, when you just repair the bits of the encoding what would cause in changing the semantics of the whole string. I hope it's clear what I want to ask... Marius 01/12/2010 2:40 p.m.

Please note that the new deadline for the exercise sheets is Thursday, 4 pm, that is, just before the lecture. Hannah 12Jan10 11:58

Thanks for the clarification. I understand it now entirely. I didn't think the exercise was stupid, but I thought my solution was. Simply changing the first bit to 0 for each and every byte would have been stupid in a way that it "works" but hasn't to do much with the real possiblities of utf-8. Now, I'll do it just like you described it. Björn 10.1. 19:46

Hi Björn + all: indeed, you don't have to do anything particularly fancy for repairing the string. When you have repaired the string up to some point, and the next character is an invalid start of a UTF-8 multi-byte sequence, you may just replace it by 0 or something like that. If your UTF-8 sequence started successfully and the first byte indicates that it's a k-byte sequence, and one of the next k-1 bytes is invalid, you can replace the whole k-byte sequence by anything valid you want. The one thing you are not allowed to do is change a valid sequence, those should stay as they are! I agree that this is not very difficult, but it's also not trivial. In particular, you do need the solution for Exercise 1. Note that the point of exercises 2 - 4 is not to write a complex or difficult or tricky program but to compare a relatively simple (but not too simple) program in three different programming languages. I do not understand though, why you think the programming task is stupid. Maybe there is still some other misunderstanding there. It's a very common application that you have a long UTF-8 string which is invalid at a few places, and you want to feed it to another application which will crash when given a string that contains invalid UTF-8 sequences. You then want to make the string valid leaving the parts that were already valid intact as much as possible. If this is not clear, please ask again. Hannah 10Jan10 14:11

I have a question concerning exercises 2-4: Is there any restriction saying that something indicates the length of a single UTF-8 sequence inside my large, random sequence? From how I read the description of the exercise, it seems as if it was valid to simply change each and every byte (when necessary) to be a 1-byte sequence. This way I have to change very few bits and end up with n UTF-8 sequences for ascii characters. However, this seems to be quite stupid and really easy. Maybe i missed / misread something. I would be thankful for comments. Björn 10.Jan 13:57

Note that the first lecture after the winter break will be on Thursday, January 7. Apart from a new regular topic, in that lecture I will also summarize your feedback from Exercise 4 of Sheet 8, and give my comments on it + explain various new rules which hopefully suit you better. Hannah 19Dec09 3:50am

We have already corrected your exams. For the summary of the results etc. follow the link above. There it will also say where you find your individual results. Hannah 19Dec09 3:58

AD Teaching Wiki: SearchEnginesWS0910/ExerciseSheet9 (last edited 2010-01-14 23:12:07 by Hannah Bast)