Observation: the intersection for the extended lists takes about twice as long compared to the simple lists. We checked the correlation between intersection time and result list length; it is roughly linear but not very strong. TODO: this should be explored further at some point, to make sure that the factor of two is really inherent. (12Dec07)
Topic Description for a Master's Thesis
Note: the following could very well lead to a publication, too. In fact, it would make a very nice follow-up to the SIGIR'07 paper by Turpin et al., cited below.
A new method for snippet generation that improves over the method from the SIGIR'07 paper by Turpin et al. in three ways: (i) no search in the document text is required anymore, since all the necessary information is already computed during query processing; (ii) it also works for advanced search features, where the words to be highlighted do not appear verbatim in the text, e.g., substring search, synonym search, semantic search; (iii) the semantics of the search operators (e.g., proximity, or, join) do not have to be re-implemented for the excerpt generator, but only once for the query processor.
Possible titles: Efficient Excerpt Generation for Complex Queries, or Efficient Excerpt Generation for Advanced Search.
The method by Turpin et al
Stores each document in a compressed format, where each word is replaced by an id, and that id is encoded in such a way that more frequent words get smaller ids.
Given a query, transforms each query word to its id, and then finds all matches of these ids in the document, compressed as described above.
Note: as described this only works for literal matches; it is not clear how to make it work for e.g. substring matches.
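The scheme above can be sketched as follows. This is our own minimal illustration (all names and the toy data are ours, and the real method additionally byte-encodes the ids for compression); it only shows the id-assignment and literal-match steps:

```python
from collections import Counter

def build_vocabulary(documents):
    """Map each word to an id; more frequent words get smaller ids."""
    freq = Counter(w for doc in documents for w in doc.split())
    return {w: i for i, (w, _) in enumerate(freq.most_common())}

def compress(doc, vocab):
    """Replace each word by its id (the real scheme then encodes ids compactly)."""
    return [vocab[w] for w in doc.split()]

def find_matches(compressed_doc, query_words, vocab):
    """Return the positions of the query-word ids in the compressed document."""
    query_ids = {vocab[w] for w in query_words if w in vocab}
    return [pos for pos, wid in enumerate(compressed_doc) if wid in query_ids]

docs = ["the cat sat on the mat", "the dog sat"]
vocab = build_vocabulary(docs)
compressed = [compress(d, vocab) for d in docs]
print(find_matches(compressed[0], ["cat", "mat"], vocab))  # -> [1, 5]
```

Note how the matching step compares ids only for exact equality, which is exactly why the approach does not extend to substring or synonym matches.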
Our new method (variant one, still has some problems)
The basic idea is to compute, along with the set of matching documents, the positions of the individual query words in those documents.
Here is the straightforward way to do this, assuming we have precomputed inverted lists for each word, and assuming that each posting in such a list also specifies the position of the respective word occurrence.
Consider a query a b c. First intersect the precomputed inverted lists for a and b, and along with the resulting set of documents containing both a and b, also store the positions of both a and b in those documents.
We can continue by intersecting that list with the precomputed list for c, but now we have to store, along with each document that contains all of a, b, and c, the positions of all three query terms.
That is, we would have to implement a postings-list data type that can hold positions for a variable number of words per document. This is not practical, and in the paper we should have experiments demonstrating that it would slow down query processing considerably (by a factor of two, I would expect).
Our new method (variant two, the one to actually implement)
For this variant, assume that the basic operation is again the intersection (note for later: union would work just as well) of two postings lists (with positional information), but the result list is of the same type, that is, it contains positional information for a single word --- let us say, the positions from the second list.
That way, if for a query a b c we first intersect the precomputed list for a with the precomputed list for b, and the result with c --- let us agree to write this as (a * b) * c --- we get not only the list of matching documents, but also all the positions of c in those documents.
Now if we also wanted the positions of b, we could, by commutativity and associativity of intersection, instead compute (a * c) * b, and similarly we could get the positions of a by computing (b * c) * a.
However, this would triple the query processing time (making the rough assumption that the processing time does not depend on the order of evaluation, which is true only in certain situations, for example, when the lists a * b, a * c, and b * c have about the same size).
Assuming that the list a * b * c will be much shorter than the total size of the lists for a, b, and c, we can easily do better as follows: compute a * b * c once, in the order that gives the most favorable computation time. Then compute (a * b * c) * a, (a * b * c) * b, and (a * b * c) * c.
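The whole scheme can be sketched in a few lines. This is our own toy code under the assumptions above (postings with document-level position lists, result keeps the positions from the second list):

```python
def intersect(list1, list2):
    """Postings are (doc_id, positions); the result keeps the
    positions from the *second* list only."""
    result, i, j = [], 0, 0
    while i < len(list1) and j < len(list2):
        if list1[i][0] == list2[j][0]:
            result.append((list1[i][0], list2[j][1]))
            i += 1; j += 1
        elif list1[i][0] < list2[j][0]:
            i += 1
        else:
            j += 1
    return result

# Toy precomputed lists: (doc_id, [positions of the word in that doc])
a = [(1, [0, 7]), (2, [3]), (4, [1])]
b = [(1, [2]), (2, [5]), (3, [1])]
c = [(1, [5]), (2, [9])]

abc = intersect(intersect(a, b), c)  # matching docs, with positions of c
pos_a = intersect(abc, a)            # short list * a: positions of a
pos_b = intersect(abc, b)            # short list * b: positions of b
print(abc)    # -> [(1, [5]), (2, [9])]
print(pos_a)  # -> [(1, [0, 7]), (2, [3])]
print(pos_b)  # -> [(1, [2]), (2, [5])]
```

The three extra intersections each run over the short list a * b * c on one side, which is why their cost should be negligible compared to computing a * b * c in the first place.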
Our experiments should show that we can do this in essentially the same time as needed for ordinary processing of a * b * c.
We should explore how this goes together with top-k techniques!
The main experiments to be done were already outlined above (show that variant one is not practical + show that variant two is realizable with negligible overhead).
We should have at least one web-scale collection, for example GOV2 or WT100G, with a real query log, for example [check recent papers using wt100g].
Then we should also have an experiment with Ester's semantic searches.
Maybe another experiment with a less heavily enhanced index, e.g., only synonyms, subwords, and spelling variants (index size should not increase by more than 50%, better less).
Fast Generation of Result Snippets in Web Search
Andrew Turpin, Yohannes Tsegay, David Hawking, and Hugh Williams
In Proceedings of the 30th Conference on Research and Development in Information Retrieval (SIGIR'07), pages 127-134.
PDF: CompleteSearch/ExcerptGenerator/turpinetal07sigir.pdf
Slides: CompleteSearch/ExcerptGenerator/turpinetal07sigir.ppt