Exercise Sheet 5

The rules for uploading are the same as always. If you forgot them, you can read them again here.

Your solutions (files can only be read by the uploader and by us)

Along with your files please also fill in the rate in elements per second of the two algorithms when intersecting two lists of size 10^6 = 1 million. Please give these two numbers with the same precision as in the rows already there, for example, 8.5 million, but not 8.48834 million. For comparison, I also implemented the two algorithms and put my two numbers. Can you beat them? I did not do anything particularly fancy, expect paying attention that the code in the inner loop is as simple as possible, as discussed in the lecture. Please also add information about the processor of your machine. On a linux system you get this information with cat /proc/cpuinfo. The machine I used is quite new. I also ran the same code on a machine that is a few years old (AMD Opteron 2220 SE) and there the rates are about half of the ones on the new machine.

No.	Name	Solution (PDF)	Code (ZIP or TGZ)	elems / second simple alg	elems / second search alg	Progr. Language	Processor
0.	Hannah			252 million	213 million	C++	Intel Xeon X5560 2.8GHz
1.	Claudius Korzen	PDF	ZIP	9.6 million	1.6 million	Java	Intel Core Duo T2300 1.66 GHz
2	Mirko Brodesser	PDF	ZIP	21.0 million	12.8 million	Java	Intel Core 2 Duo 1.8Ghz
3	Ivo Malenica	PDF	tar	[not done]	[not done]
4	Thomas Liebetraut	PDF	tgz	0.5 million	0.1 million	Python
5	Eric Lacher	PDF(1/2) PDF(2/2)	ZIP	49.1 million	3.8 million	Java
6	Marius Greitschus	PDF	.tar.gz	71.2 million	2.3 million	C#	Intel Core 2 Duo 2800 MHz
7	Zhongjie Cai	PDF	ZIP	19.0 million	8.0 million	VB.Net	AMD Turion x2 RM-72 2.10GHz
8	Matthias Sauer	PDF	ZIP	80 million	65.0 million	C#	Intel i5 750 @ 2.67GHz
9	Florian Bäurle	PDF	ZIP	51.1 million	31.7 million	C#	Intel Core2Duo T7200 2.00GHz
10	Paresh Paradkar	PDF	ZIP	8 million	0.1 million	JAVA	Intel dual core T4300 2.1 GHz
11	Dragos Sorescu	PDF	ZIP	0.4 mil	0.1 mil	Perl	Intel Core 2 Duo T7300 2.0 GHz - but broken battery and charger
12	Matthias Frorath	PDF	ZIP	0.6 mil	0.5 mil	Python	AMD Athon 64 X2 5400+, 2.8 GHz
13	Markus Gruetzner	PDF	ZIP	63.3 M	24.2 M	Java	Intel(R) Pentium(R) 4 CPU 3.20GHz
14	Björn Buchhold	PDF	ZIP	65 million	27.3 million	Java	Intel Core 2 Duo 2.4Ghz
15	Johannes Stork	PDF	TGZ	26.9 million	1.0 million	Java	Turing maschine compliant 1.73GHz
16	Alexander Nutz	PDF	ZIP	92 million	13.3 million	Java	Intel Core2 Duo T7300 2.0GHz
17	Jens Silva Santisteban	PDF	ZIP	1,9 mio	1,8 mio	Java	Intel Core2 Duo T5600 1.83GHz
18	Jonas Krisch	PDF	ZIP			Java	Intel Core2 Duo T7300 2.0GHz
19	Manuela Ortlieb	PDF	ZIP	55.1 million	4.8 million	Java	Intel Core2 Duo 2.13 GHz
20	Alexander Schneider	PDF	ZIP	see the pdf	see the pdf	Java	Intel Core2Duo T7500 2.20GHz
21	Waleed Butt	PDF	ZIP	8 Million	14 Million	C#	Intel Core 2Duo 2.1 GHZ
22	Alexander Gutjahr	PDF	tar.gz	186 Million	303 Million	Java	Intel Core 2Duo 2.5 GHZ
23	Daniel Schauenberg	PDF	tar.gz	196 Million	370 Million	python	Intel Core 2 Duo 2.2 GHZ
24	Achille Nana	PDF	ZIP	-	-	JAVA	-
25	Richard Zahoransky	PDF	---	-	-	JAVA	-

These were the questions and comments on Exercise Sheet 5

Hi Marius, in principle youo are right, but as a rule of thumb, I would say that speed is roughly proportional to the processor speed, so processor speed is a good single number. Hannah 24Nov09 3:06pm

Hi, I would appreciate the Bus-speed of the systems in the list of the exercises, too, because the higher the access to the RAM, the better the values would be. Marius 11/24/2009 10:40am

Hi Dragos, yes, just iterate over the doc ids and pick each with probability m/n. This will give you a sorted list of doc ids of average length m. It does not have to be of size exactly m. Hannah 24Nov09 0:37am

About generating the docID lists: I am guessing I'm not wrong when I say that we don't have to make sure we have exactly m values in each list, we should just generate each element with a m/n probability, right? Thank you ! Dragos 24Nov 12.30 AM

Hi Claudius, yes, the inequality actually holds for "<", too. Hannah 23Nov09 4:05pm

In Exercise 5, could it be, that the inequality holds actually for "<", too? Because k may not be 0 and k <= m. Or am I wrong? Claudius 23Nov09 3:58pm

Hi Marius + all: I mean only "real" comparisons, that is, comparisons of list elements. If you use sentinels, then sentinels count as list elements. Hannah 22Nov 10:38

Hi, just to tie in with the last question: Do you mean by comparisons only the comparisons of list elements (such as A[i] <= B[j]) or are guarding conditional comparisons also to be counted (such as "if (i < list1.size())"? Marius 11/22/2009 10:32pm

Hi Björn + all: good question. One simple way to deal with this would be to use a comma expression like (numComparisons++, A[i] <= B[j]). A list of expressions, separated by commas, gets evaluated from left to right, and the value of the whole expression is simply the value of the last expression. At least that is the case in C++, but most programming languages have the same or a very similar construct. An alternative that essentially does the same thing, would be to do the comparison in a separate function, which, besides doing the comparison, also increase the comparisons counter. If you do that, you should absolutely make that that function is inline though since otherwise you pay the price of a function call for every comparison, which is likely to spoil your performance. Hannah 22Nov09 6:30pm

Hi, is there a good way to count comparisons if locial connectives are used? As far as I know, && will only check until the first comparison is false and || if it until some is true. Unfortunately I cannot think of a way to count those comparisons without rewriting the code and nesting the if statements. I'm not too familiar with compilers but I hope that this changes to the code won't effect the performance.Björn 22Nov09 6:20pm

Hi Claudius + all: you are right, it's not very precise, the intended meaning is something like "compare the tables for the two algorithms" or "for each of the four measurements, compare the tables for the two algorithms". Hannah 22Nov09 3:39pm

I am confused: In Exercise 4, you write: "Compare the two tables...", but when I take both algorithms and all 4 measurements (running time, ratio, ...), I get 8 4x4 tables (for all the combinations of 10³, 10⁴, 10⁵, 10⁶). What I'm doing wrong? Claudius 22Nov09 3:30pm

Hi Björn + all: very good question and thanks for pointing that out. You should indeed always search the elements of the smaller list in the larger list, and the first thing your (advanced) list intersection algorithm should do is figure out which of the two lists is the smaller one. That is, your 4 x 4 table will be symmetric, and actually only contains 10 different values (the 6 below the diagonal, which are the same as the ones above the diagonal, and the 4 on the diagonal). Hannah 22Nov09 2:50pm

For the exp/bin-search intersection algorithm it clearly matters that it searches for the elements of the smaller list in the larger one. A good implementation will certainly take care of that. Should our implementation also do that or ignore it in order to get 16 measurements that are really different? Björn 22Nov09 1:00pm

Ok, no problem, I'm happy when it's clear now. Hannah 22Nov09 0:24am

You're right, I misread your comment, sorry. I was thinking of 10MB per lists processed in 1 second, resulting in 20MB/s and was wondering where the 100MB/s are coming from. Thomas 22Nov09 00:20am

Hi Thomas, I am at a loss of words here. I am saying a car is driving 20 kilometers and it needs 10 minutes for that, so its average speed was 120 km / hours. And you are saying how can the speed of a car be 120 km / hours, when it only drives 20 kilometers. Well, what should I say. Besides, in my example I clearly said that the two lists together occupy 10 MB, not 10 MB per list. Please read again what I wrote. Hannah 22Nov09 0:16am

Why should two lists of 10MB size result in 100MB processed, if each list is only iterated over once to do the intersection (O(m+n) complexity)? The data processed after all is just 20MB, no matter how the algorithm is implemented (even if it iterates a thousand times over every list, it still just processed 20MB of data). Thomas 21Nov09 12:00am

By the way, whenever I talk about "lists" here or on the exercise sheets or in the lecture, I am not referring to a particular data structure (in particular I am NOT talking about a linked list), but "list of elements" is just "series of elements". And well, "inverted list" is just common terminology. To implement a "list of doc ids" or anything like that you should of course always use an array or a vector or a data structure like that. Hannah 21Nov09 8:30pm

Hi Marius + all, let me explain it by an example. Your two input lists occupy a certain amount of memory. Every programming language has built-in functions for this. For example, if your list entries are ints, then for C++ you can use sizeof(int) to get the number of bytes occupied by one entry. Multiply by the number of list elements to get the number of bytes occupied by one list. One Megabyte (MB) is 1024 * 1024 bytes. Now assume your two lists together occupy 10 MB. Assume your code takes 0.1 seconds to intersect these two lists. Then the "MB processed per second" is 100 MB / second. Hannah 21Nov09 8:26pm

Hi, in exercise 3, what do you mean by "MB processed per second"? Is a MB the equivalent to 4096 processed integers? And when is a MB to be considered as processed? When it's written to the intersected list or in the comparisons, already? Marius 21Nov09 7:33pm

The slides + all my hand-writing on it are now online, see the link Recording Lecture 5 (no audio) above. Hannah 20Nov09 3:24am

The recording of todays lecture again did not work. I am very sorry for that (and very angry that there are so many problems with this software). Anyway, the end result of the lecture, that is the slides with all the writing on it are available and I will put them online as soon as possible. Hannah 19Nov09 11:23pm

There is a typo in Exercise 5 of the new sheet. The two occurrences of n should be m. Hannah 19Nov09 11:22pm

AD Teaching Wiki: SearchEnginesWS0910/ExerciseSheet5 (last edited 2009-11-24 18:48:35 by Hannah Bast)