Type: Interesting and well-defined problem. Getting medium-quality results should be relatively straightforward; the challenge is to achieve high-quality results, for which machine learning will be crucial. Exploring suitable methods is part of the challenge.
Goal: Extract a large number of scientists' homepages from the CommonCrawl web crawl. Extract the central information from these pages, including name, profession, gender, and affiliation.
Step 1: Download a large chunk of university webpages (e.g. all universities in Europe or in the US or even in the whole world).
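One possible starting point for Step 1 is to filter the crawl's URLs down to university hosts before downloading anything. The following is a minimal sketch, not a prescribed method: the suffix list is illustrative only, and the names `ACADEMIC_SUFFIXES`, `KNOWN_UNIVERSITY_DOMAINS`, `is_university_url`, and the domain `uni-example.de` are assumptions made for this example. Many universities, especially in Europe, do not use a recognizable suffix, so a curated domain list would likely be needed in practice.

```python
from urllib.parse import urlsplit

# Assumption: illustrative academic suffixes; far from complete.
ACADEMIC_SUFFIXES = (".edu", ".ac.uk")

# Assumption: a curated list of university domains would go here.
# "uni-example.de" is a hypothetical placeholder, not a real university.
KNOWN_UNIVERSITY_DOMAINS = {"uni-example.de"}


def is_university_url(url: str) -> bool:
    """Return True if the URL's host looks like a university host."""
    host = (urlsplit(url).hostname or "").lower()
    if host.endswith(ACADEMIC_SUFFIXES):
        return True
    # Match the listed domain itself or any of its subdomains.
    return any(
        host == domain or host.endswith("." + domain)
        for domain in KNOWN_UNIVERSITY_DOMAINS
    )


def filter_university_urls(urls):
    """Keep only the URLs that point to university hosts, preserving order."""
    return [url for url in urls if is_university_url(url)]
```

Filtering on the host rather than on the full URL avoids false matches such as a path segment named "edu" on a commercial site.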
Step 2: Design, implement and run a classifier that identifies people's homepages. A first version can be rule-based. For a high-quality classifier, machine learning will eventually be needed.
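A rule-based first version for Step 2 could score each page on a few hand-picked cues and accept it above a threshold. The sketch below is only an illustration under stated assumptions: the URL patterns, the cue phrases, the weights, and the threshold of 3 are all guesses chosen for the example, and the function names `homepage_score` and `is_homepage` are hypothetical.

```python
import re

# Assumption: URL shapes that often indicate a personal page.
URL_CUES = re.compile(r"/~[\w.-]+|/(people|staff|faculty|members)/", re.IGNORECASE)

# Assumption: phrases that tend to appear on academic homepages.
TEXT_CUES = (
    "publications",
    "research interests",
    "curriculum vitae",
    "teaching",
    "office hours",
)


def homepage_score(url: str, text: str) -> int:
    """Score a page: 2 points for a URL cue, 1 point per text cue found."""
    score = 2 if URL_CUES.search(url) else 0
    lowered = text.lower()
    score += sum(1 for cue in TEXT_CUES if cue in lowered)
    return score


def is_homepage(url: str, text: str, threshold: int = 3) -> bool:
    """Classify a page as a personal homepage if its score reaches the threshold."""
    return homepage_score(url, text) >= threshold
```

The same cues could later serve as input features for a learned classifier, with the hand-set weights replaced by trained ones.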
Step 3: Extract basic information, like the name, the gender, the profession, and the affiliation. A first version can be rule-based. For results with high precision and recall (which is the goal), machine learning will eventually be needed.
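A rule-based first version for Step 3 might read the name and affiliation from the page title and the profession from a list of academic job titles, and then be scored with precision and recall against hand-labeled pages. The sketch below covers only name, affiliation, and profession. It assumes titles of the form "Name - Affiliation" or "Name | Affiliation", which many real pages will not follow, and the job-title list and all function names are hypothetical.

```python
import re

TITLE_PATTERN = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)

# Assumption: a small illustrative list of academic job titles.
PROFESSION_PATTERN = re.compile(
    r"\b(assistant professor|associate professor|professor|lecturer|"
    r"postdoctoral researcher|research scientist|phd student)\b",
    re.IGNORECASE,
)


def extract_fields(html: str) -> dict:
    """Extract name, affiliation and profession with simple rules."""
    fields = {"name": None, "affiliation": None, "profession": None}
    title_match = TITLE_PATTERN.search(html)
    if title_match:
        # Assumption: the title looks like "Name - Affiliation" or "Name | Affiliation".
        parts = re.split(r"\s+[-|]\s+", title_match.group(1).strip())
        fields["name"] = parts[0] or None
        if len(parts) > 1:
            fields["affiliation"] = parts[1]
    profession_match = PROFESSION_PATTERN.search(html)
    if profession_match:
        fields["profession"] = profession_match.group(1).lower()
    return fields


def precision_recall(predicted: set, gold: set) -> tuple:
    """Precision = correct / predicted, recall = correct / gold."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall
```

For evaluation, each extracted value can be represented as a tuple such as (URL, field, value), so that the predicted and gold sets can be compared directly.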