Stephen Robertson runs the Information Retrieval and Analysis group at Microsoft Research Cambridge (UK), which is concerned with core search processes such as term weighting, document scoring and ranking algorithms, and the combination of evidence from different sources. The group works closely with product groups to transfer ideas and techniques. His main research interests are in the design and evaluation of retrieval systems. He is the author, jointly with Karen Sparck Jones, of a probabilistic theory of information retrieval, which has been moderately influential. A further development of that model, with Stephen Walker, led to the term weighting and document ranking function known as Okapi BM25, which is used in many experimental text retrieval systems. Prior to joining Microsoft, he was at City University London, where he retains a part-time position as Professor of Information Systems in the Department of Information Science. He was awarded the Tony Kent STRIX award by the Institute of Information Scientists in 1998, and in 2000 he received the Salton Award from ACM SIGIR. He is a Fellow of Girton College, Cambridge.
Hugo Zaragoza is a researcher working on Information Retrieval at Yahoo! Research Barcelona. He is interested in applications of machine learning (ML) and natural language processing (NLP) to information retrieval (IR); more specifically, in developing measures of relevance (i.e. ranking functions) between linguistic objects such as search queries and web documents. From 2001 to 2006 he worked at Microsoft Research (Cambridge, UK) with Stephen Robertson, mostly on probabilistic ranking methods for corporate and web search, but also on document classification, expert finding, relevance feedback, and dialogue generation for games. He also collaborated extensively with Microsoft product groups such as MSN Search and SharePoint Portal Server. Prior to Microsoft Research, he taught computer science and completed a Ph.D. at LIP6 (U. Paris 6) on the application of dynamic probabilistic models to a wide range of Information Access problems.
The Probabilistic Relevance Model (PRM) is a formal framework for document retrieval developed in the 1970s and 1980s which gave rise to one of the most successful text-retrieval algorithms, BM25. In recent years, research on the PRM has led to new retrieval models capable of taking document meta-data into account (especially structure and link-graph information). Again, this has led to some of the most successful web-search and corporate-search algorithms, such as BM25F. However, this has come at a cost: the explosion in the number of free parameters has put optimisation and machine learning at the core of the development of ranking functions in the PRM.
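For reference, one standard formulation of the BM25 ranking function (exact presentations vary slightly across the literature) makes the two main free parameters explicit:

\[
\mathrm{BM25}(d,q) = \sum_{t \in q} \frac{tf_{t,d}\,(k_1+1)}{tf_{t,d} + k_1\left((1-b) + b\,\frac{|d|}{\mathrm{avgdl}}\right)} \cdot \log\frac{N - df_t + 0.5}{df_t + 0.5}
\]

Here $tf_{t,d}$ is the frequency of term $t$ in document $d$, $df_t$ its document frequency, $N$ the number of documents, $|d|$ the document length and $\mathrm{avgdl}$ the average document length; $k_1$ (term-frequency saturation) and $b$ (length normalisation) are the free parameters in question.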
One of the reasons for the success of the PRM, we believe, is the powerful combination of sound theoretical modelling and pragmatic parameter tuning that lies at its core. When it is well understood, the PRM can provide solid ground on which to analyse new IR problems and derive new solutions beyond BM25. However, there are very few in-depth descriptions of the PRM framework in the literature. This leads many researchers to treat BM25 as a black box and to develop their research on the \emph{outside} of this box.
This tutorial will present the Probabilistic Relevance Model from a conceptual point of view, describing in depth the probabilistic modelling assumptions behind the model and the ranking algorithms that result from its application. Furthermore, it will explain recent developments in the PRM leading to ranking functions that integrate document structure, document meta-data and link-graph information. Finally, we will discuss the different free parameters of the ranking functions and how to set or optimise them.
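As a concrete, if naive, illustration of the tuning problem, the simplest way to set $k_1$ and $b$ is an exhaustive grid search against held-out queries. The sketch below is our own illustration; \texttt{rank\_fn} and \texttt{eval\_fn} are hypothetical stand-ins for a retrieval run and an effectiveness measure such as MAP:

\begin{verbatim}
import itertools

def grid_search(rank_fn, eval_fn, queries, qrels):
    # Exhaustively sweep the usual ranges of BM25's two free parameters
    # and keep whichever pair maximises the effectiveness measure.
    # rank_fn and eval_fn are hypothetical helpers, not a real API.
    best_params, best_quality = None, float('-inf')
    for k1, b in itertools.product([0.5, 1.2, 2.0],
                                   [0.25, 0.5, 0.75, 1.0]):
        rankings = rank_fn(queries, k1=k1, b=b)  # run retrieval
        quality = eval_fn(rankings, qrels)       # e.g. MAP or NDCG
        if quality > best_quality:
            best_params, best_quality = (k1, b), quality
    return best_params, best_quality
\end{verbatim}

Grid search scales exponentially in the number of parameters, which is precisely why the extended models discussed in the second part call for more sophisticated machine-learning methods.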
The tutorial will be divided into two parts, each lasting roughly 90 minutes:
In the first part of this tutorial we will describe the intuitions behind the PRM and its full derivation. We will then discuss how different modelling assumptions can be incorporated and how they lead to new ranking functions. In particular, we will develop the series of document retrieval algorithms that can be derived from the PRM, from the binary independence model (BIM) to BM25. Free parameters will be introduced and discussed within the probabilistic inference and optimisation frameworks.
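To make the endpoint of that derivation concrete, the following is a minimal, self-contained sketch of a BM25 scorer (our own illustration under the standard formulation, not the tutorial's reference implementation):

\begin{verbatim}
import math
from collections import Counter

def bm25(query, doc, df, N, avgdl, k1=1.2, b=0.75):
    # Score a document (a list of tokens) against a query with plain BM25.
    # df maps a term to its document frequency, N is the collection size
    # and avgdl the average document length; k1 controls term-frequency
    # saturation and b controls document-length normalisation.
    tf = Counter(doc)
    score = 0.0
    for t in query:
        if t not in tf or t not in df:
            continue
        # Robertson/Sparck Jones weight in the absence of relevance
        # information, i.e. the BIM weight reduced to an IDF-like form.
        idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5))
        norm = k1 * ((1.0 - b) + b * len(doc) / avgdl)
        score += idf * tf[t] * (k1 + 1.0) / (tf[t] + norm)
    return score
\end{verbatim}

Setting $b = 0$ switches length normalisation off entirely, and as $k_1 \to 0$ the term-frequency component approaches 1 for any matching term, reducing the function to a BIM-style sum of term weights; this is the sense in which the simpler models are special cases.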
Topics:
In the second part of this tutorial, we will discuss models that take the PRM beyond its original scope. In recent years there have been a number of important extensions to the original PRM dealing with document and hyperlink structure. In particular, the BM25F model was developed to retrieve structured documents with heterogeneous fields (such as title and anchor text). Furthermore, it has been shown that hyperlink features, as well as other non-textual document features, can be incorporated naturally into the PRM. Finally, all these extensions come at the cost of new free parameters in the model, which need to be set; fortunately, recent developments in machine learning can be applied successfully to this tuning problem.
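For concreteness, one common formulation of BM25F first combines per-field term frequencies into a single pseudo-frequency, using per-field weights and length normalisation, and only then applies the usual saturation:

\[
\widetilde{tf}_{t,d} = \sum_{f} w_f \,\frac{tf_{t,d,f}}{(1-b_f) + b_f\,\frac{l_{d,f}}{\mathrm{avgl}_f}},
\qquad
\mathrm{BM25F}(d,q) = \sum_{t \in q} \frac{\widetilde{tf}_{t,d}}{k_1 + \widetilde{tf}_{t,d}} \cdot \log\frac{N - df_t + 0.5}{df_t + 0.5}
\]

where $f$ ranges over the fields (title, body, anchor text, ...), $tf_{t,d,f}$ is the frequency of term $t$ in field $f$ of document $d$, $l_{d,f}$ and $\mathrm{avgl}_f$ are the field length and its collection average, and $w_f$, $b_f$ are per-field free parameters. Note that every additional field brings two new parameters, which is exactly the parameter explosion that motivates the machine-learning tuning methods discussed in this part.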
Topics:
The main course material will be a 20-page technical report on the PRM and BM25F, with accompanying slides, to be published in June 2007. This will be compiled together with the rest of the papers in the bibliography below.