AD Teaching Wiki:

Exercise Sheet 1

Here are your introductions from Exercise 1

Instructions for uploading

1. (only necessary once, before you upload something for the first time) Assume your name is Donald Duck. (0) If you haven't already done so, create a Wiki account with your name DonaldDuck (click on "Login" on the top left, then click on "you can create one now"). Always be logged in when you are about to change anything on the Wiki. (1) Type the following URL in your browser: http://ad-wiki.informatik.uni-freiburg.de/teaching/SearchEnginesWS0910/DonaldDuckExercises. (2) Click on "create new empty page" and save the empty page. (3) We will then add the following line to your page as soon as possible: #acl DonaldDuck:read,write -All:read. This ensures that only you and the organizers of the course can see your solutions to the exercises, the number of points you got, etc.

2. (assuming you have already created your page http://ad-wiki.informatik.uni-freiburg.de/teaching/SearchEnginesWS0910/DonaldDuckExercises as described above) (1) Recall that your name is not Donald Duck. (2) Go to your page DonaldDuckExercises. (3) Upload your solutions there as PDF (no other formats allowed), giving your file the name donald_duck_ex1.pdf. (4) Upload your code separately as ZIP or GZIPPED TAR archive, giving your file the name donald_duck_ex1.zip or donald_duck_ex1.tgz. (5) Put the corresponding links in the table below, as well as the other information requested. Follow the pattern of the lines already there.

Your solutions (files can only be read by the uploader and by us)

|| Name || Link to uploaded solution || Link to uploaded code || Name of collection || #Docs in collection || Zipf epsilon ||
|| Salek T || PDF || ZIP (docs), ZIP (code), ZIP (google lib), ZIP (Apache lib) || Joke related documents from www.textfiles.com || 2447 documents in ~36MB || - ||
|| Johannes Stork || PDF || TARGZ || RFCs and german news websites and www.textfiles.com || 5540 and 5415 and 48799 || 0.1052 and 0.0762 and 0.0762 ||
|| Christian Simon || PDF || ZIP || selected archives from www.textfiles.com || 2865 || 0.052 ||
|| Matthias Sauer || Included in Code zip || ZIP || non-selected archives from www.textfiles.com || 4328 || 0.788 ||
|| Zhongjie Cai || PDF || ZIP || RFC Documents and Text Stories || 5549 and 1255 || 0.6364 and 0.5137 ||
|| Waldemar Wittmann || PDF || TARGZ || RFC Documents || 1459 || 0.08396 ||
|| Florian Bäurle || PDF || ZIP || RFCs and selected files from www.textfiles.com || 44618 || 0.08243 ||
|| Marius Greitschus || PDF || .tar.gz || GNU Man-Pages || 5051 || 0.098 ||
|| Markus Gruetzner || PDF || ZIP || RFC || ~5500 || 0.01299 ||
|| Thomas Liebetraut || PDF || tgz || IRC logs || ~3800 || 0.122 ||
|| Claudius Korzen || PDF || ZIP || RFC's || 1460 || 0.031 ||
|| Daniel Schauenberg || PDF || tar.gz || Excerpt from RFCs || 2000 || 0.0164 ||
|| Alexander Gutjahr || PDF || tar.gz || RFCs 1- 2000 || ca. 2000 || 0.06095 ||
|| Björn Buchhold || PDF || ZIP || some RFCs || 3100 || 0.017163 ||
|| Ivo M. || PDF || tar || RFCs || 5520 || 0.94 ||
|| Mirko Brodesser || PDF || ZIP || some humor/fun files from textfiles.com || ~1000 || 0.1 ||
|| Triatmoko || PDF || RAR || Archives from www.textfiles.com || ca 2000 || 0.016 ||
|| AlexanderNutz || PDF || ZIP || HTML files from fünf-filmfreunde.de (a blog about films) || ~ 5000 || ~ 0.022 ||
|| Jonas Krisch || PDF || ZIP || textfiles || ~1500 || 0.154 ||
|| Andre Borgeat || PDF || ZIP || Reuters-21578 || ~20000 || ||
|| Jonas Koenemann || PDF || ZIP || wget from different pages || ~1500 || ||
|| Paresh Paradkar || PDF || ZIP || Selective archives from www.ibibo.org || ~1600 || 0.05966 ||
|| AlexanderSchneider || PDF || ZIP || selected archives from http://textfiles.com/ || ~ 1000 || ~ 0.023 ||
|| JensSilvaSantisteban || PDF || ZIP || RFCs and some other files from the web || ~ 1400 || ~ 0.084 ||
|| Daniel Frey || PDF || ZIP || archives from http://textfiles.com/ || ~ 50000 || n.a. ||
|| JohannBetz || PDF || ZIP || All textual RFCs || 5536 || n.a. ||
|| Matthias Frorath || PDF || ZIP || Some files from textfiles.com || ~ 1300 || ||
|| Ivo Chichkov || PDF || ZIP || text converted HTML files - eNews || ~1500 || 0.154 ||
|| Manuela Ortlieb || PDF || ZIP || text converted different eBooks || 2288 || 0.001 ||
|| Jonas Sternisko || PDF || .tgz || text mined with wget from different sources || 27k+ || 0.223 ||
|| Eric Lacher || PDF || ZIP || RFCs || about 6000 || 0.0912232 ||
|| JohannLatocha || n.a. || ZIP || RFCs || 1000 || n.a. ||
|| Waleed Butt || PDF || ZIP || RFCs & textfiles || 1500 || 0.04787 ||
|| Michael Pereira || PDF || ZIP || text from textfiles.com || 1k+ || 0.241002 ||
|| Dragos Sorescu || PDF || ZIP || text from textfiles.com || 1200 files, 20 MB || 0.10129 ||
|| Achille Nana || RAR || n.A. || RFC || 1300 || n.A. ||
|| Björn Geiger || PDF || ZIP || text from textfiles.com || 2628 files || 0.172003593... ||

These were the questions and comments on Exercise Sheet 1

Hi all, why do my name, file, and other content appear in a different color? I see my name in gray, but other people's names in blue. TRiatmoko 30oct 21.09

I am not able to view my page even after logging in. It says I am not allowed to view this page. My page name is PareshParadkarExcercises. Paresh 28Oct 11:30 am

Is there any result by now concerning an additional tutorial time schedule? The problem is, I've got a colliding lecture during the time of the exercise course, that I'd better not miss. E.g. Tuesdays at 4 p.m. would be fine, too. I don't know if there's anything planned, yet. Marius 27Oct 7:15pm

New deadline for the exercises: Tuesday, 14:00 Marjan 27Oct 2:56pm

I was not able to create my page as described in the DonaldDuckExercises instructions on this page. I get the following error: ConvertError ExpatError: mismatched tag: line 376, column 881 (see dump in /usr/local/var/moin/farm/teaching/data/expaterror.log). Waleed Butt

To Johann + all: I briefly commented on the rules for the exercises in the first lecture and will talk about it again in the next lecture. The exercises are graded (one or two points per exercise), you will get a mark for them in the end, and that mark will be 40% of your final mark for the course (the other 60% will be an exam at the end, we'll see whether that will be oral or written). Group work is not allowed (the point of the exercises is missed otherwise), and if you copy or otherwise cheat, you are out. You don't have to come to the lectures if you think you don't need it. You don't have to come to the tutorials if you think you don't need it. But you absolutely must do the exercises, and you must do them yourself. Hannah 26Oct09 2:02pm

To Johann + all: I am sorry, but solving the exercises is mandatory. Attending the tutorials, however, is not, i.e. if you understand everything, there is no need for you to come to the tutorials. Marjan 26Oct 1:45pm

Notice: Please note that I recommend using English when producing your handouts. Please use German if you are absolutely not confident with English. In fact, either language is fine - it's just that there is a chance that I don't understand everything (if it's written in German). Thanks! Marjan 26Oct 1:41pm

Hi there, I don't know if I missed something, but I also asked a fellow student who did not know either! Are the exercises mandatory, and do we get some kind of points for them (where we need a specific amount to be allowed to take the exam)? I am asking because I don't know if I can finish all of the assigned tasks by tonight. Johann 26Oct09 1:39pm

Hi Claudius + all: it's up to you how many you compute, the more you can find the better. The obvious algorithm is of course to try out all pairs of words. This will find all pairs with one hit, but it's obviously a quadratic algorithm and so will take a very long time even for a relatively small collection. See if you can find a smarter algorithm. Be assured, there is one. Hannah 26Oct09 1:26pm
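For reference, the obvious quadratic baseline mentioned above could look roughly like the following sketch. It assumes a non-positional inverted index given as a dictionary from word to a list of document ids; the function name `pairs_with_one_hit` is hypothetical and not taken from the exercise sheet. It is the slow baseline only, not the smarter algorithm.

```python
from itertools import combinations


def pairs_with_one_hit(index):
    """Return all word pairs whose inverted lists share exactly one document.

    index: dict mapping word -> iterable of document ids (assumed format).
    Tries every pair of words, so the running time is quadratic in the
    number of distinct words.
    """
    postings = {word: set(doc_ids) for word, doc_ids in index.items()}
    result = []
    for w1, w2 in combinations(sorted(postings), 2):
        # A two-word query "hits" the documents that contain both words.
        if len(postings[w1] & postings[w2]) == 1:
            result.append((w1, w2))
    return result


example_index = {"a": [0], "b": [0, 1], "c": [1]}
one_hit_pairs = pairs_with_one_hit(example_index)
```

Even for a few tens of thousands of distinct words this loop examines hundreds of millions of pairs, which is why it does not scale beyond small collections.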

Hi! In Exercise 4: Do we have to compute all possible pairs of query-words with one hit or only one pair of words? Claudius 26Oct09 12:50am

To Björn: No, it does not. You do not need to include positional information in your inverted index. But you're right: indexes are usually positional, and Zipf's law refers to the size of the inverted lists when positions are included. Marjan 25Oct 7:45pm
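A minimal sketch of such a non-positional inverted index, in which each document appears at most once per list. Lowercasing and whitespace tokenization are assumptions made for brevity, and `build_inverted_index` is a hypothetical name, not part of the exercise.

```python
def build_inverted_index(documents):
    """Map each word to the sorted list of ids of documents containing it.

    documents: list of strings; the list position serves as document id.
    No positions are stored, so a document is listed once per word even
    if the word occurs in it several times.
    """
    index = {}
    for doc_id, text in enumerate(documents):
        # set() removes repeated words within the same document.
        for word in set(text.lower().split()):
            index.setdefault(word, []).append(doc_id)
    return index


index = build_inverted_index(["a b a", "b c"])
```

Because documents are processed in increasing id order, every list comes out sorted without an extra sorting step.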

Thank you for the response. Does this also concern the inverted index? I'm wondering because the slides only contain examples where the list of documents does not contain duplicate entries. Personally, I could imagine a practical use for both variants. That's why I'm asking. Thanks again in advance. Björn 25Oct09 7:20pm

Hi Björn + all: occurrence always means individual occurrence, that is, if a collection has two documents and the word x occurs once in the first document and twice in the second document then there are three occurrences of this word overall. Ok? Hannah 25Oct09 2:12pm
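The two-document example above can be checked with a short sketch. Whitespace tokenization and lowercasing are assumptions here, and the names `count_occurrences` and `count_documents` are hypothetical. The second function is shown only for contrast: it counts documents rather than individual occurrences.

```python
from collections import Counter


def count_occurrences(documents):
    """Count every individual occurrence of each word across all documents."""
    counts = Counter()
    for text in documents:
        counts.update(text.lower().split())
    return counts


def count_documents(documents):
    """Count, for each word, the number of documents that contain it."""
    counts = Counter()
    for text in documents:
        counts.update(set(text.lower().split()))
    return counts


docs = ["x y", "x z x"]  # x occurs once in the first, twice in the second
occurrences = count_occurrences(docs)
doc_counts = count_documents(docs)
```

Here `occurrences["x"]` is 3, matching the example, while `doc_counts["x"]` is 2.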

To Zhongjie + all: you are right, you can't add the #acl line to the document yourself, so I just did it for you. I will change the instructions on the upload page accordingly. Sorry for this initial confusion, but hey, good that we have a Wiki. Hannah 25Oct09 2:06pm

Hey everyone! Whenever the exercise talks about frequencies and occurrences, does it mean occurrences in different documents, or should we also consider multiple occurrences in the same document? Thanks a lot. Björn 25Oct09 12:49pm

To Johannes: How did you solve the problem? I am logged in to my account under my name, and my email address is linked only to this account. But I still get the error message 'You can't change ACLs on this page since you have no admin rights on it! ' when I try to enter '#acl ZhongjieCai:read,write -All:read ' as the first line of the page... Zhongjie 25Oct09 01:15am

Problem solved. To everybody: don't try to create multiple users with the same e-mail address. Johannes 25Oct09 00:21am

Oh, I see, well that *must* be your user name. Sorry for not making that clear earlier. Please create an account with that user name and try again. Hannah 25Oct09 00:02am

No that is not my user name. Johannes 24Oct09 11:58pm

Hi Johannes, if you are logged in as JohannesStork you should be able to see it, did you try that? Hannah 24Oct09 11:59pm

I don't know if this is the right place to ask, but I can't access my exercise page. It says "Sie dürfen diese Seite nicht ansehen." ("You are not allowed to view this page.") Johannes 24Oct09 11:50pm

Good question, Johannes. Please upload the source code separately, either as a .zip or .tgz archive. I have modified the instructions on the upload page accordingly. Sorry if that means additional work for you, we weren't expecting anybody to submit so early ... Hannah 24Oct09 11:43pm

Shall we put the whole source into the PDF? What about tar.gz? Johannes 24Oct09 5:18pm

Hi Johannes + all, the slides are now available as PDF, see the link above. Hannah 23Oct09 17:04

Note about Exercise 5: You may assume a more general model of the word frequencies than the one given in the lecture, namely eps * N * (1 / i^alpha). With this model, both parameters (eps and alpha) can be estimated simultaneously. Marjan 23Oct09 3:29pm
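One possible way to estimate both parameters at once is a least-squares line fit in log-log space, since log f_i = log(eps * N) - alpha * log i. The sketch below makes several assumptions that the note does not state: i is the rank of a word when sorted by descending frequency, N is the total number of word occurrences in the collection, and least squares is the chosen estimation method. The name `fit_zipf` is hypothetical.

```python
import math


def fit_zipf(frequencies, total_occurrences):
    """Estimate (eps, alpha) for the model f_i = eps * N / i**alpha.

    frequencies: word frequencies in any order (at least two positive values).
    total_occurrences: N, assumed to be the total number of word occurrences.
    Fits a straight line to the points (log i, log f_i) by least squares.
    """
    ranked = sorted((f for f in frequencies if f > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(f) for f in ranked]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    alpha = -slope                                  # slope of the line is -alpha
    eps = math.exp(intercept) / total_occurrences   # intercept is log(eps * N)
    return eps, alpha


# Synthetic frequencies generated from the model with eps = 0.1, alpha = 1.
synthetic = [0.1 * 10000 / rank for rank in range(1, 51)]
eps, alpha = fit_zipf(synthetic, 10000)
```

On real collections the low-frequency tail tends to dominate such a fit, so restricting the ranks used for the regression may give more stable estimates.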

Can you provide the slides as PDF? Johannes 23Oct09 10:05am

Please note that the deadline for uploading your solutions of the exercises is always Monday, 23:59 (sharp). Marjan 22Oct09 6:15pm

When you add a question or comment here, please end it with your name and the date and time in bold face, just like I did now. Hannah 22Oct09 01:59am

AD Teaching Wiki: SearchEnginesWS0910/ExerciseSheet1 (last edited 2009-11-13 22:59:21 by port-92-193-100-154)