Questions and comments about Exercise Sheet 9 below this line (most recent on top)

Yes, sorry, I forgot. I was travelling over the last few days and only returned today from a rather brutal removal of one of my wisdom teeth, so I am completely groggy now. I'll try to add the code later this evening when I have recovered a bit, or maybe Marjan can do it in the meantime. If any of you knows the code for one of C++, Java, Perl, Python, or PHP, please feel free to post it, too. Remember, the idea of a Wiki is that everybody can contribute. Hannah 12Jan10 19:43
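
For Python, timing and random test data could look roughly like the following. This is only a sketch, not the official course code: it assumes Python 3, and the helper names random_bytes and time_call are made up for illustration.

```python
import random
import time


def random_bytes(n, seed=None):
    """Return n pseudo-random bytes; pass a seed to get reproducible data."""
    rng = random.Random(seed)
    return bytes(rng.randrange(256) for _ in range(n))


def time_call(func, *args):
    """Call func(*args) once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed


if __name__ == "__main__":
    data = random_bytes(1000000, seed=42)
    # len() stands in for the function you actually want to measure.
    _, seconds = time_call(len, data)
    print("processed", len(data), "bytes in", seconds, "seconds")
```

Using a fixed seed makes it possible to feed the same random input to all implementations that are being compared.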

Do we get code for timing and random numbers? Johannes 12Jan10 18:46

Hi Marius, the exercise is simpler than that. Leave every valid UTF-8 sequence intact, and replace every invalid UTF-8 sequence by whatever you want, e.g. zeros. You don't have to guess what an invalid sequence means, or convert from one format to another, or anything like that. Hannah 12Jan10 18:42

The solutions of the mid-term are posted. Marjan 15:38 12.01.2010

So do I understand it correctly that by "repairing" a string you mean that only the encoding has to be valid in the end? By string repair I understood that if the encoded letter is not UTF-8, we have to re-encode it into UTF-8. So, for example, if you get a UTF-32 char ݮ, it would have to be encoded into UTF-8. What I want to say is that you could possibly "damage" the character when you just repair the bits of the encoding, which would change the semantics of the whole string. I hope it's clear what I want to ask... ;) Marius 01/12/2010 2:40 p.m.

Please note that the new deadline for the exercise sheets is Thursday, 4 pm, that is, just before the lecture. Hannah 12Jan10 11:58

Thanks for the clarification. I understand it entirely now. I didn't think the exercise was stupid, but I thought my solution was. Simply changing the first bit to 0 for each and every byte would have been stupid in the sense that it "works" but doesn't have much to do with the real possibilities of UTF-8. Now I'll do it just like you described it. Björn 10.1. 19:46
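
To illustrate why clearing the first bit of every byte is not a real repair, here is a small Python sketch (assuming Python 3; the function name clear_high_bits is made up for illustration). The output is always valid UTF-8, but valid multi-byte sequences get changed as well.

```python
def clear_high_bits(data):
    """Set the most significant bit of every byte to 0."""
    return bytes(b & 0x7F for b in data)


if __name__ == "__main__":
    original = "é".encode("utf-8")      # valid 2-byte sequence C3 A9
    masked = clear_high_bits(original)  # 43 29, i.e. the ASCII text "C)"
    print(original, "->", masked)
```

The result decodes without errors, but the character "é" has been turned into two unrelated ASCII characters.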

Hi Björn + all: indeed, you don't have to do anything particularly fancy for repairing the string. When you have repaired the string up to some point, and the next byte is an invalid start of a UTF-8 multi-byte sequence, you may just replace it by 0 or something like that. If your UTF-8 sequence started successfully and the first byte indicates that it's a k-byte sequence, and one of the next k-1 bytes is invalid, you can replace the whole k-byte sequence by anything valid you want. The one thing you are not allowed to do is change a valid sequence; those should stay as they are! I agree that this is not very difficult, but it's also not trivial. In particular, you do need the solution for Exercise 1. Note that the point of Exercises 2 - 4 is not to write a complex or difficult or tricky program but to compare a relatively simple (but not too simple) program in three different programming languages. I do not understand, though, why you think the programming task is stupid. Maybe there is still some other misunderstanding there. It's a very common situation that you have a long UTF-8 string which is invalid in a few places, and you want to feed it to another application which will crash when given a string that contains invalid UTF-8 sequences. You then want to make the string valid while leaving the parts that were already valid intact as much as possible. If this is not clear, please ask again. Hannah 10Jan10 14:11
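
The repair procedure described above could be sketched in Python roughly as follows. This is not a reference solution. It assumes Python 3, it checks validity by bit pattern only (lead byte 0xxxxxxx, 110xxxxx, 1110xxxx or 11110xxx, continuation bytes 10xxxxxx) and ignores finer points such as overlong encodings, and the names sequence_length and repair_utf8 are made up for illustration.

```python
def sequence_length(first_byte):
    """Return the length announced by a lead byte, or 0 if the byte
    cannot start a UTF-8 sequence."""
    if first_byte & 0x80 == 0x00:
        return 1
    if first_byte & 0xE0 == 0xC0:
        return 2
    if first_byte & 0xF0 == 0xE0:
        return 3
    if first_byte & 0xF8 == 0xF0:
        return 4
    return 0


def repair_utf8(data, replacement=0x00):
    """Copy valid sequences unchanged; overwrite invalid ones with
    the replacement byte (which must be an ASCII value)."""
    out = bytearray()
    i = 0
    while i < len(data):
        k = sequence_length(data[i])
        if k == 0:
            # Invalid start byte: replace just this byte.
            out.append(replacement)
            i += 1
            continue
        chunk = data[i:i + k]  # shorter than k if the input ends early
        if len(chunk) == k and all(b & 0xC0 == 0x80 for b in chunk[1:]):
            out += chunk  # valid sequence, keep as is
        else:
            # Bad or missing continuation byte: replace the whole sequence.
            out += bytes([replacement]) * len(chunk)
        i += len(chunk)
    return bytes(out)


if __name__ == "__main__":
    print(repair_utf8(b"a\x80b"))             # invalid start byte
    print(repair_utf8("é".encode("utf-8")))   # valid, left untouched
```

Replacing the whole k-byte window follows the description literally. A more careful variant could replace only the bytes up to the offending one, so that a valid sequence starting inside the window survives.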

I have a question concerning Exercises 2-4: Is there any restriction saying that something indicates the length of a single UTF-8 sequence inside my large, random sequence? From how I read the description of the exercise, it seems as if it were valid to simply change each and every byte (when necessary) into a 1-byte sequence. This way I have to change very few bits and end up with n UTF-8 sequences for ASCII characters. However, this seems to be quite stupid and really easy. Maybe I missed / misread something. I would be thankful for comments. Björn 10.Jan 13:57

