If you try to find the homepage of one of our staff members by clicking through from the University of Freiburg main page, you will find that this is close to impossible. Thus anyone with finite patience will resort to googling, ecosiaing or binging for the name.
Now, clearly we shouldn't have to rely on giant, general-purpose search engines just to get the contact information of a supervisor, some paper's author or even their secretary. So in this project you will build a domain-specific search index for all the people involved with universities, be they scientists, IT support staff or secretaries. The resulting index should be created in a format that can be used with QLever, a SPARQL+Text search engine developed at our chair.
Like all proper search engines, yours will need a crawler: a piece of software that reads web pages and extracts the terms to be indexed. In this case the web pages are restricted to those of universities (and possibly similar institutions) and the search terms are restricted to names. To achieve this, you will index (the URLs of) all the mentions of people's names your crawler can find on each university web page.
Possible Implementation Steps
Document what you have done and why you did it this way. Use this to backtrack to where you went wrong or made things too complicated when you get stuck. Also collect good examples of relevant AND irrelevant mentions (pages) of persons, which you can later use to evaluate your system (do this throughout the thesis/project).
- Collect a list of university web page URLs
You can make use of an existing knowledge base such as Wikidata, which lets you list universities and their web pages with a relatively simple SPARQL query (see the first sketch after this list)
You can use our local instances of the knowledge bases mentioned above
- Build a simple web crawler for extracting the visible text from university web pages
You can either use a higher-level crawling framework like Scrapy, just an HTML parser like Beautiful Soup (together with a lower-level HTTP library such as requests), or roll your own completely (which might be really difficult, time-consuming and frustrating; you have been warned).
- You probably want to restrict crawling to the (sub-)domain(s) of the university and to a configurable depth, e.g. by using a depth-limited breadth-first search (see the crawler sketch after this list)
- You will probably want to restrict your crawling to static HTML pages
- Evaluate the coverage of the crawler with respect to scientist homepages
- For each page of user-visible text from the above step, identify all mentions of names in the text
- In a first version you can make use of very simple rules, e.g. all capitalized words after "Dr\." or "Prof\. (Dr\.)?", together with a list of name parts (first, middle and last names); a regex-based sketch follows this list
- In later refinements you can make use of natural language processing techniques like POS tagging, parsing or machine learning
- For each mention your crawler finds, save the name (possibly normalized) as well as the URL of the current page in a format that can be turned into a searchable index by one of the tools used at our chair (see above); a sketch of this output step follows this list
- To improve ranking you may also want to save information such as whether the name was found in a heading
- You may optionally also store the context in which the name was found. If you do, it too should be in a format QLever can make use of
- Develop a scoring system that assigns a score to each mention. Ideally, when sorting by this score, the pages most relevant to a name, such as that person's homepage, should come first (see the scoring sketch after this list).
- This could include signals like the name being mentioned in a heading, BM25-like relevance scores, the name also appearing in the URL, and more
- Evaluate your index using the search system and the ranking evaluation techniques typical for Information Retrieval, such as precision@k and mean reciprocal rank (sketched below)
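Implementation Sketches
The sketches below illustrate some of the steps above. They are not part of the required implementation; all concrete names, patterns and parameters in them are placeholders that you will have to adapt.

For collecting the list of university homepages, Wikidata can be asked for all items that are an instance of (a subclass of) university (Q3918) together with their official website (P856). A minimal sketch against the public SPARQL endpoint follows; our local instance would use a different endpoint URL but the same query, and the full query may need a LIMIT to stay within the public endpoint's timeout.

```python
# Sketch: fetch universities and their official websites from Wikidata.
# Assumes the public endpoint; replace SPARQL_ENDPOINT for a local instance.
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

QUERY = """
SELECT ?university ?universityLabel ?website WHERE {
  ?university wdt:P31/wdt:P279* wd:Q3918 .   # instance of (a subclass of) university
  ?university wdt:P856 ?website .            # official website
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""

def fetch_university_urls():
    response = requests.get(
        SPARQL_ENDPOINT,
        params={"query": QUERY, "format": "json"},
        headers={"User-Agent": "university-people-crawler/0.1 (student project)"},
        timeout=60,
    )
    response.raise_for_status()
    bindings = response.json()["results"]["bindings"]
    return [(b["universityLabel"]["value"], b["website"]["value"]) for b in bindings]

if __name__ == "__main__":
    for name, url in fetch_university_urls()[:10]:
        print(name, url)
```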
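For the crawler step, here is a minimal sketch of a depth-limited breadth-first search using requests and Beautiful Soup, restricted to the university's (sub-)domains and to static HTML pages. All function and variable names are our own, and a real crawler should additionally respect robots.txt and rate-limit its requests.

```python
# Sketch: depth-limited breadth-first crawl restricted to a university's domain.
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def crawl(start_url, max_depth=2):
    """Yield (url, visible_text) for pages reachable within max_depth link hops."""
    # Stripping a leading "www." lets e.g. cs.uni-freiburg.de pass the check below.
    allowed_domain = urlparse(start_url).netloc.removeprefix("www.")
    seen = {start_url}
    queue = deque([(start_url, 0)])

    while queue:
        url, depth = queue.popleft()
        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        # Restrict crawling to static HTML pages (skip PDFs, images, APIs, ...).
        if "text/html" not in response.headers.get("Content-Type", ""):
            continue

        soup = BeautifulSoup(response.text, "html.parser")
        for tag in soup(["script", "style", "noscript"]):
            tag.decompose()  # drop content that is not visible to the user
        yield url, soup.get_text(separator=" ", strip=True)

        if depth >= max_depth:
            continue
        for link in soup.find_all("a", href=True):
            next_url = urljoin(url, link["href"]).split("#")[0]
            parsed = urlparse(next_url)
            # Stay on the university's (sub-)domains and visit every URL only once.
            if (parsed.scheme in ("http", "https")
                    and parsed.netloc.endswith(allowed_domain)
                    and next_url not in seen):
                seen.add(next_url)
                queue.append((next_url, depth + 1))
```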
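For finding name mentions, here is a sketch of the simple rule-based approach described above: capitalized words after "Dr." or "Prof. Dr.", plus first/last-name pairs checked against a list of known first names. Both the title pattern and the (tiny) name list are placeholders.

```python
# Sketch: rule-based extraction of name mentions from visible page text.
import re

# Capitalized words directly after an academic title such as "Dr." or "Prof. Dr.".
TITLE_PATTERN = re.compile(
    r"(?:Prof\.\s*(?:Dr\.\s*)?|Dr\.\s*)"              # the title itself
    r"((?:[A-ZÄÖÜ][\w\-]+\s+){0,3}[A-ZÄÖÜ][\w\-]+)"   # 1 to 4 capitalized words
)

# A tiny, purely illustrative list of first names; in practice use a large one.
KNOWN_FIRST_NAMES = {"Anna", "Hannah", "Peter", "Maria"}

NAME_PATTERN = re.compile(r"\b([A-ZÄÖÜ][\w\-]+)\s+([A-ZÄÖÜ][\w\-]+)\b")


def find_mentions(text):
    """Return the set of candidate person names found in the given text."""
    mentions = set()
    # Rule 1: whatever capitalized words follow "Dr." / "Prof. Dr.".
    for match in TITLE_PATTERN.finditer(text):
        mentions.add(match.group(1).strip())
    # Rule 2: "Firstname Lastname" pairs whose first word is a known first name.
    for first, last in NAME_PATTERN.findall(text):
        if first in KNOWN_FIRST_NAMES:
            mentions.add(f"{first} {last}")
    return mentions
```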
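For saving the mentions, the exact input format that QLever's text index expects should be taken from its documentation; the sketch below only writes a generic tab-separated file (name, URL, in-heading flag, context) that is easy to convert later.

```python
# Sketch: write one tab-separated record per mention as an intermediate format.
import csv


def normalize(name):
    """Very simple normalization: strip titles and collapse whitespace."""
    for title in ("Prof.", "Dr."):
        name = name.replace(title, "")
    return " ".join(name.split())


def write_mentions(mentions, path="mentions.tsv"):
    """mentions: iterable of (name, url, in_heading, context) tuples."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["name", "url", "in_heading", "context"])
        for name, url, in_heading, context in mentions:
            writer.writerow([normalize(name), url, int(in_heading), context])
```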
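For the scoring step, one simple option is a linear combination of the signals listed above. The weights in this sketch are arbitrary placeholders that should be tuned against your hand-collected examples.

```python
# Sketch: combine a few relevance signals into a single score per mention.
def score_mention(name, url, in_heading, mention_count, bm25_score=0.0):
    score = 0.0
    if in_heading:
        score += 2.0                      # name appears in a heading on the page
    # Part of the name also appears in the URL, e.g. .../~mueller/ .
    if any(part.lower() in url.lower() for part in name.split()):
        score += 3.0
    score += 0.5 * mention_count          # the name occurs several times on the page
    score += bm25_score                   # BM25-like text relevance, if computed
    return score
```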
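Finally, for the evaluation, two ranking metrics that are standard in Information Retrieval, precision@k and mean reciprocal rank, can be computed directly from your hand-collected relevance judgments (which pages are the "right" ones for a given name).

```python
# Sketch: standard ranking metrics. `ranking` is the list of URLs returned for a
# query (a name); `relevant` is the set of URLs judged relevant for that name.

def precision_at_k(ranking, relevant, k):
    top_k = ranking[:k]
    return sum(1 for url in top_k if url in relevant) / k


def reciprocal_rank(ranking, relevant):
    for rank, url in enumerate(ranking, start=1):
        if url in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(rankings_and_relevant):
    """rankings_and_relevant: list of (ranking, relevant_set) pairs, one per query."""
    values = [reciprocal_rank(r, rel) for r, rel in rankings_and_relevant]
    return sum(values) / len(values)
```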