AD Teaching Wiki
  • Comments
  • Immutable Page
  • Menu
    • Navigation
    • RecentChanges
    • FindPage
    • Local Site Map
    • Help
    • HelpContents
    • HelpOnMoinWikiSyntax
    • Display
    • Attachments
    • Info
    • Raw Text
    • Print View
    • Edit
    • Load
    • Save
  • Login

FrontPage

Revision 73 as of 2009-11-18 18:17:09
AD Teaching Wiki:
  • SearchEnginesWS0910
  • ExerciseSheet4

Exercise Sheet 4

The rules for uploading are the same as always. If you forgot them, you can read them again here.

Your solutions (files can only be read by the uploader and by us)

Along with the PDF of your solutions and your source code, please also provide the following two figures, so that all can see them: the maximal time difference you measured between (1) and (2) for Exercise 1, the best speed-up you get from using compression for Exercise 4 (next to last column), and the programming language you used (last column).

No.

Name

Solution (PDF)

Code (ZIP or TGZ)

Factor for Ex. 1

Speed-up for Ex. 4

Progr. Language

1

Björn Buchhold

PDF(2nd upload, corrected version)

ZIP

sequential access better by roughly factor 1.7 see pdf for details

compression better by factor ~5.3 (comparing reading vs reading + decompression)

Java

2

Richard Zahoransky

PDF

ZIP

sequential access better by factor of 37

---

Java

3

Florian Bäurle

PDF

ZIP

max. time difference: ~189,14 ms -> seq. access better by factor ~1,8

compression better by factor ~9 (depends heavily on m and n)

C#

4

Mirko Brodesser

PDF

ZIP

seq. access better by factor ~14

compression better by factor ~2

Java

5

Eric Lacher

PDF

ZIP

seq. access better by factor ~2 to 70, depends on increment value (see pdf)

-

Java

6

Thomas Liebetraut

PDF

tgz

seq. access better by factor ~2

-

Python

7

Jonas Krisch

PDF

ZIP

-

-

Java

8

Triatmoko

Pdf

Rar

-

-

Borland Delphi 7

9

Matthias Sauer

PDF

ZIP

seq. access better by factor ~15

compression worse by factor ~2

C#

9

Michael Pereira

[[|PDF]]

ZIP

C++

10

Daniel Schauenberg

PDF

tar.gz

permutation access takes ~150%

--

python

11

Jonas Sternisko

PDF

tar.gz

permutation access about 10-times slower (geometric mean)

--

Python

12

Zhongjie Cai

PDF

ZIP

sequence access faster by factor 1.1~1.8

compression faster by 2~3

VB.Net

13

Matthias Frorath

PDF

ZIP

sequence access faster by factor 1.25~2.5

-

Python

14

Paresh Paradkar

PDF

ZIP

Sequential Access better by factor 1.9893

-

Java

15

Ivo M.

PDF

tar

-

-

16

MariusGreitschus

PDF

tar.gz

sequencial access better by factor of ~48

Encoding speeds up at a factor of ~1.04

C#

17

Claudius Korzen

PDF

ZIP

seq. access better by factor on average ~37

-

Java

16

Johannes Stork

PDF

tar.gz

1181 ms

1.44

Java

19

Daniel Frey

PDF

tar.gz

363 ms

-

C++

20

Achille Nana

PDF

ZIP

s. PDF

s.PDF

JAVA

21

Alexander Nutz

PDF

ZIP

for i=6: 7 ; for i=7: 43

-

Java

22

Jens Silva Santisteban

PDF

ZIP

109 ms

-

Java

23

Manuela Ortlieb

PDF

ZIP

s. PDF

s. PDF

Java

24

Dragos Sorescu

PDF

ZIP

approx. 2.184

approx. 1.6

Perl

25

Alexander Gutjahr

PDF

tar.gz

at 10^8 x100

-

Java

22

Jens Silva Santisteban

PDF

ZIP

109 ms

-

Java

23

Manuela Ortlieb

PDF

ZIP

s. PDF

s. PDF

Java

24

Waleed butt

PDF

ZIP

Seq access 2.01 time faster

Compression is better in twice in access and file size

C# .Net

25

Dragos Sorescu

PDF

ZIP

approx. 2.184

aaprox 1.6

Perl

26

Alexander Gutjahr

PDF

tar.gz

at 10^8 x100

-

Java

27

Alexander Schneider

PDF

ZIP

see pdf/zip

see pdf/zip

Java

28

Markus Gruetzner

PDF

ZIP

complex

-

Java

These were the questions and comments on Exercise Sheet 4

To all: Who made the conflict should please fix it! Marjan 17Nov 2:11pm

I accidentally loaded the ZIP file instead of the PDF(My eyes are heavy...). But i couldn't overwrite the pdf file on the wiki now. So I loaded it in my_name_ex1.pdf. I hope this won't occur any problems. The link to the file is correct, so it should be ok. Jonas 16Nov09 11.41pm

To Florian: This was very well explained in the lecture (it is still there on the slides). It means the speed-up achieved by reading and decompressing your compressed list compared to when the list is read in uncompressed format. As your inverted list is randomly generated, you might have different speed-ups for different inverted lists. In order to have any speed-up, of course, your compression scheme should really work and reduce the size of the inverted list + should not be too inefficient. One extreme example for inefficient code would be using strings instead of bit-shifting for your coding. Marjan 8:38pm

What is meant with the best speedup for Exercise 4 which we should add on the solution page? The best speedup for just reading the data from disk or the best speedup for reading and decompressing the list? Florian 16Nov09 8:12pm

To Dragos: Gap encoding + Elias code is not trivial at all and you can use it. Gap encoding + Byte code is also fine. Marjan 16Nov09 2:15pm

Is it ok to use Elias code for the compressing from Exercise 4(2)? Or it's too 'trivial'?:)Dragos 16Nov09 14.14

To Florian + all: A single number doesn't really make sense, does it? For the discussion part of the exercise, think in terms of probability distributions, as we did in the lecture (when discussing which probability distribution a certain encoding is optimal for). For the example, give a sequence of numbers. Hannah 15Nov09 9:41pm

To Johannes + all: Yes, good idea. I will anyway at some point in the next weeks hand out a sheet where you have the opportunity to give feedback on the lectures and the exercises. But yes, why not give me that feedback on the current exercise sheet already now. Let me refine your proposal a bit. It would be useful for me if you would provide two grades: one for the hardness (pick one of: too hard for me, challenging but feasible, not very hard) and one for the amount of work (pick one of: too much for me, a lot but feasible, not more than for other lectures). It would also be helpful if you would not just give a grade but put your opinion into words. It's no problem if you are critical but please stay polite. I will take your comments seriously, don't worry. Hannah 15Nov09 9:33am

In Exercise we should "give an example of data for which k = x is the best choice". What is meant by "an example of data" here? A single number or a set of numbers or anything else? Florian 15Nov09 8:52pm

I'd like to suggest that everyone grades the exercise sheet from 1 (for "way to easy") to 10 ("way to hard"). This might provide the professor with the feedback she asks for in the lecture. How about that idea? Johannes 2009-11-15T20:40L

To Florian + all: yes, sorry, I forgot to mention this in the lecture. Marjan already explained how to clear the disk cache. Let me add to this an explanation what the disk cache actually is. Whenever you read a (part of a) file from disk, the operating system of your computed will use whatever memory is currently unused to store that (part of the) file there. When you read it again and the (part of the) file hasn't changed and the memory used to store it has not been used otherwise in the meantime, than that data is read right from memory, which is much faster than reading it from disk. Usually that effect is desirable, because it speeds up things, but when you do experiments, it is undesirable, because it leads to unrealistically good running times, especially when carrying out an experiment many times in a row. Hannah 15Nov09 8:10pm

To Florian: Indeed, we were running out of time and there was no room for this in the lecture. I can suggest to you few ways how to clear the disk cache: before carrying out your final experiment, read a large amount of data (let's say close to the amount of RAM you have) from disk - this will ensure that your data (the inverted list) is cleared from the disk cache and replaced by something else (thus an actual reading from disk get's timed, and not reading from RAM). Another way is to restart your computer before doing the timing. Marjan 15Nov09 7:27pm

In exercise 4 it says: "Important note: Whenver you measure running times for reading data from disk, you have to clear the disk cache before, as discussed in the lecture". I think that this was not discussed in the lecture? What do we have to do here? Florian 15Nov09 7:15pm

@Bit shifting: The syntax for that is actually the same, irrespectively of whether you use Java, C++, perl, python, or whatever. The >> operator shifts to the right, the << operator shifts to the left, the & operator ands the bits of the two operands and the | operator ors the bits of the two operands. Very simple. You will also find zillions of example programs on the web by typing something like java bit shifting into Google or whatever your favorite search engine is. Hannah 15Nov09 1:16

Hi Marius + all: For Exercise 4, an inverted list of size m with doc ids from the range [1..n] is simply a sorted list of m numbers from the range [1..n]. I leave it to you, whether your lists potentially contain duplicates (as in 3, 5, 5, 8, 12, ...) or whether you generate them in a way that they don't contain duplicates (as in 3, 5, 8, 17, ...). It doesn't really matter for the exercise whether your list has duplicated or not. In any case, consider simple flat lists like in the two examples I gave (and like all the examples I gave in this and past lectures), not lists of lists or anything. Hannah 15Nov09 1:12am

@Mirko: Sure, but an inverted list is a list of words where the Doc-IDs are attached to each words in which the words occur. So for Example: If word no. 5 occurs in Doc1, Doc2 and Doc3 and word no. 2 occurs in Doc5, the list would look like: 5 -> Doc1, Doc2, Doc3; 2 -> Doc5. Or am I mistaken? My question then is, how long should these attached lists be in average case? I mean, one could imagine that we got 1mil. documents over 3 words, so these lists could get very large...

EDIT: Oh ok. Now, I see your point. It's not an index, it's a list. Okay. So, what is an inverted list with Doc-IDs, then?

EDIT EDIT: And to your question, Mirko, take a look at http://snippets.dzone.com/posts/show/93. Especially at Comment no. 2. Maybe this helps... I think, Java supports StreamWriters/Readers that are able to write/read bytes. Marius 11/14/2009 08:46pm

EDIT EDIT EDIT: Sorry, me again. Well, I bothered Wikipedia which redirects from http://en.wikipedia.org/wiki/Inverted_list to Inverted Index. So it seems to me, this is being used as a synonym. Actually, I think I'm confused enough, now. I'll better wait for any responses... ;-) Marius 11/14/2009 9:08 pm

@ Marius: i think we are supposed to generate one inverted list of size m, with doc ids from 1..n (therefore n>=m, because no duplicates?).

Now a question from my side: ex.4, programming the compression in java, is there any good tutorial about how to handle the bit-stuff? (otherwise, i think, it would cost me too much time..) Mirko 14Nov09, 19:18

Hi, do you have any suggestions what the best numbers for m and n in exercise 4 should look like? Or are we supposed to mess around a bit with ints and longs? And: How long should the list of documents in the inverted index be? Marius 14Nov09 6:40pm

And just to clarify what a single-cycle permutation is. Here is an example for an array of size 5 with a permutation that is a single cycle: 5 4 1 3 2. Why single cycle? Well, A[1] = 5, A[5] = 2, A[2] = 4, A[4] = 3, A[3] = 1. (My indices in this example are 1,...,5 and not 0,...,4.) Here is an example of a permutation with three cycles: 2 1 4 3 5. The first cycle is A[1] = 2, A[2] =1. The second cycle is A[3] = 4, A[4] = 3. The third cycle is A[5] = 5. Hannah 12Nov09 8:04pm

Hi Daniel + all, I don't quite understand your question and your example (if your array is 1 5 3 4 2, why is A[1] = 3?). In case you refer to the requirement of the exercise that the permutation consists only of a single cycle. That is because your code should go over each element exactly once (it should, of course, stop after n iterations, where n is the size of the array). If your permutation has more than one cycle, it is hard to achieve that. Also note that for both (1) and (2), the sum of the array values should be sum_i=1,...,n i = n * (n+1) / 2. Hannah 12Nov09 7:54pm

Hi, I just looked at the new exercise sheet 4, in exercise 1 we should generate a permutation and sum the resulting array up, am I wrong or doesn't iterating method two iterate throw the whole array in every situation. for ex.: n= 5 permutation: 1 5 3 4 2, then A[1] = 3, A[A[1]]= A[3] = 1, A[1] = 3 ... Daniel 12Nov09 19:44pm

  • MoinMoin Powered
  • Python Powered
  • GPL licensed
  • Valid HTML 4.01