
Information Retrieval (IR): Definition, Methods and Google

Information Retrieval (IR) is the science behind how Google and search engines work. Discover its history, key models (Boolean, vector, probabilistic), IR system metrics, and how AI transforms IR today.

Published on July 23, 2022. Reading time: 9 min. By Stan De Jesus Oliveira

Definition of Information Retrieval (IR)

Information retrieval (IR) is the science of building computer systems that retrieve information in response to queries. IR is a broad field that encompasses not only web search engines but also digital libraries, enterprise search systems, and question-answering systems. The goal of any IR system is to find, within a collection of documents, those that are most relevant to a user’s information need. When you enter a search term in the search field, the search engine returns relevant information about your search term from the stored data.

IR also underpins social networks, whose feeds rely on related systems such as information filtering and recommendation.

We also speak of an information retrieval system (IRS), which refers to all the software enabling the functions necessary for information retrieval. The performance of an IRS is traditionally evaluated using controlled experiments carried out on standardised test corpora, as in the Cranfield experiments, which established the foundational framework for measuring IR system quality.

What Is a Corpus?

A corpus is a collection of documents, for example texts, images, videos, gathered for a specific purpose. The quality, representativeness, and size of a corpus are fundamental to the performance of the IR system built on it.

In general, a corpus is divided into at least two corpora:

A training corpus: it provides enough material for the system to learn the statistical regularities of the domain;
A test corpus: it is used to verify the quality of the learning from the training corpus and evaluate how well the system generalises to unseen data.
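As a minimal sketch of this split, assuming a corpus small enough to hold in memory as a Python list: the function name `split_corpus`, the 20% test fraction and the fixed seed are illustrative choices, not something the article prescribes.

```python
import random


def split_corpus(documents, test_fraction=0.2, seed=0):
    """Shuffle a copy of the corpus and return (training, test) lists."""
    shuffled = list(documents)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


corpus = [f"document {i}" for i in range(10)]
training, test = split_corpus(corpus)
print(len(training), len(test))  # 8 2
```

The two parts never share a document, so the test corpus really does measure behaviour on unseen data.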

How to Establish a Score for Information Retrieval?

Information retrieval needs a score to rank documents by relevance. For example, Google’s patents sometimes refer to an IR score: an overall rating of a web source, used to compare it against other documents in the search results and so decide the order in which they appear.

As a trivial example, for the query “search engine optimisation”, a document containing both “optimisation” and “search engines” would in theory be a better answer than a document containing only “search engines”.

More elaborate scoring methods, such as TF-IDF and BM25, weight the importance of a term both within a document and across the entire corpus. These scoring systems evolved significantly from the 1970s onwards, building on the evaluation framework that the Cranfield experiments had established in the 1960s.
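To make the idea of a relevance score concrete, here is a rough sketch of BM25 in plain Python. It is not Google’s scoring function. The whitespace tokenizer, the parameter values `k1=1.5` and `b=0.75`, and the smoothed IDF variant (which keeps every weight positive) are common conventions assumed here for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive tokenizer (assumption): lowercase, split on whitespace.
    return text.lower().split()


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Return one BM25 score per document for the given query."""
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # Term frequency saturates, and long documents are penalised.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


documents = [
    "search engine optimisation guide",
    "search engines index the web",
    "cooking pasta at home",
]
scores = bm25_scores("search engine optimisation", documents)
for doc, score in zip(documents, scores):
    print(f"{score:.3f}  {doc}")
```

The first document matches all three query terms and scores highest; the second matches only “search” (this tokenizer treats “engines” and “engine” as different words); the third matches nothing and scores zero.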

Today, information retrieval goes far beyond counting the words a document shares with the query. In search engines and social networks alike, user interaction signals such as click-through rate (CTR) also help decide which documents are best.

The TREC and SIGIR conferences give a good overview of the diversity of research conducted today in the general field of IR.

Timeline of the most important events in the field of information retrieval (IR)

Measures of an IR System

If you want to understand patents or scientific papers on IR, such as those from Google, you will need to know four terms: precision, recall, noise, and silence.

An IR system is precise if almost all the documents returned are relevant.
An IR system has good recall if it returns most of the relevant documents in the corpus for a question.
An IR system is noisy if it returns too many documents of which few are relevant.
An IR system is silent if it does not return enough relevant documents.
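These four measures can be computed from two sets: the documents the system returned and the documents that are actually relevant. The sketch below assumes the usual formalisation in which noise is the share of returned documents that are irrelevant (1 minus precision) and silence is the share of relevant documents that were missed (1 minus recall). The function name and document identifiers are made up for the example.

```python
def evaluate(retrieved, relevant):
    """Compute precision, recall, noise and silence for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were returned

    # Empty sets score 0.0 by convention in this sketch.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    noise = (len(retrieved) - len(hits)) / len(retrieved) if retrieved else 0.0
    silence = (len(relevant) - len(hits)) / len(relevant) if relevant else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "noise": noise,
        "silence": silence,
    }


metrics = evaluate(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d2", "d5"],
)
print(metrics)
```

Here two of the four returned documents are relevant (precision 0.5, noise 0.5), and two of the three relevant documents were found (recall 2/3, silence 1/3).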

The Main Methods for Information Retrieval

The first systems used by libraries allowed for Boolean searches, i.e. searches where the presence (or absence) of a term in a document led to the selection of the document.

The Boolean model:

Content can only be found using the operators “and”, “or” and “not”.
Content is not sorted: there is no ranking of results.
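A Boolean search can be sketched with an inverted index that maps each term to the set of documents containing it; the three operators then become set operations. The tiny corpus and the whitespace tokenization below are assumptions for illustration.

```python
from collections import defaultdict


def build_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


documents = {
    1: "google search engine",
    2: "library search catalogue",
    3: "google patents",
}
index = build_index(documents)
all_ids = set(documents)

both = index["google"] & index["search"]    # google AND search
either = index["google"] | index["search"]  # google OR search
not_google = all_ids - index["google"]      # NOT google

print(sorted(both), sorted(either), sorted(not_google))
# [1] [1, 2, 3] [2]
```

Each result is an unordered set: a document either matches or it does not, which is exactly why the model offers no ranking.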

This system quickly showed its limitations. From the 1970s onwards, experiments showed that automatic ranking techniques, known as vector and probabilistic models, could work correctly on corpora of a few thousand documents.

The ontological model:

It is not based on content evaluation but on the evaluation of the link structure between documents. One example is Google’s PageRank algorithm, developed by Larry Page and Sergey Brin.
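The following is a simplified power-iteration sketch of the PageRank idea, not Google’s production algorithm. The damping factor of 0.85, the fixed number of iterations and the four-page link graph are assumptions chosen for illustration; every link target must also appear as a key in `links`.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate PageRank for a graph given as {page: [pages it links to]}."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page without outlinks spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
ranks = pagerank(links)
for page, value in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(value, 3))
```

Page C is linked to by three other pages and ends up with the highest rank, while page D, which nobody links to, keeps only the baseline share. The text of the pages plays no role at all.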

The text statistics model (algebraic and probabilistic approach):

Terms in a corpus are weighted using WDF and IDF, which allows documents to be ranked. The value of term weighting was identified at IBM in 1957 (by Hans Peter Luhn), and inverse document frequency was introduced by Karen Spärck Jones in 1972:

WDF: Within Document Frequency, the relative frequency of a term in a document
IDF: Inverse Document Frequency, an inverse measure of how many documents in the database contain a specific term; the rarer the term, the higher its weight
The vector model is also part of the text statistics model: each text corresponds to a point in space, and the angle between two vectors (Salton cosine or cosine similarity) indicates how similar the corresponding texts are.
Okapi, also known as Okapi BM25 or BM25, is a ranking function derived from the probabilistic model; it combines term frequency (TF), IDF and document length normalisation. This formula models the notion of document relevance in a corpus at the level of “quality evidence” of the text.
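The vector model can be illustrated with a small sketch that turns each text into a TF-IDF vector and compares vectors by cosine similarity. The weighting used here (relative term frequency multiplied by the logarithm of N divided by document frequency) is one of several common variants and is an assumption, as are the sample texts. A query can be handled the same way by treating it as one more short text.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Return one sparse TF-IDF vector (a dict of term weights) per text."""
    tokenized = [text.lower().split() for text in texts]
    n_texts = len(tokenized)

    # Document frequency: in how many texts each term occurs.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_texts / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


texts = [
    "search engine ranking",
    "search engine crawler",
    "pasta cooking recipe",
]
a, b, c = tfidf_vectors(texts)
print(cosine(a, b) > cosine(a, c))  # True
```

The first two texts share “search” and “engine”, so their vectors point in similar directions; the third shares no term with the first, so their cosine is zero.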

I discuss this in greater detail in the article on how Google understands a text.

Information Retrieval for Google

Google’s Main Information Retrieval Systems (IRS)

1. A web search system: generally called a crawler, it fetches documents from across the Web.

2. The indexer: distils the information contained in the documents of the corpus into a format that lends itself to rapid access by the query processor. This generally involves extracting document features by breaking documents down into their constituent terms, extracting statistics relating to the presence of terms in documents and the corpus, and computing any query-independent evidence. Once the index is created, the system is ready to process queries.

3. A query processor: it serves user queries by matching and ranking the documents in the index based on user input. Because the query processor interacts directly with the document index, the two are often confused.

4. Ranking in web search: The main component of the query processor is the document ranking function. The ranking functions of modern search systems frequently incorporate many forms of documentary evidence. Some of this evidence is textual (text-level evidence is covered in the article on how Google judges and gives a quality score to a text). Other evidence, such as external document descriptions or recommendations, is gathered through an examination of a document’s context in the web graph (for example via the PageRank algorithm).
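The division of labour between the indexer and the query processor can be sketched as follows. This is a toy model, not a description of Google’s systems: the class name `Index`, the `static_score` field (standing in for query-independent evidence such as a link-based score) and the weighting in `process_query` are all hypothetical.

```python
from collections import Counter, defaultdict


class Index:
    """Toy index: term statistics plus query-independent evidence."""

    def __init__(self):
        self.postings = defaultdict(dict)  # term -> {doc_id: term frequency}
        self.static_score = {}             # doc_id -> query-independent score

    def add(self, doc_id, text, static_score=0.0):
        # Indexing time: break the document into terms and store statistics.
        for term, tf in Counter(text.lower().split()).items():
            self.postings[term][doc_id] = tf
        self.static_score[doc_id] = static_score


def process_query(index, query, static_weight=0.5):
    """Query time: match documents in the index, then rank them."""
    text_score = Counter()
    for term in query.lower().split():
        for doc_id, tf in index.postings.get(term, {}).items():
            text_score[doc_id] += tf
    # Combine text evidence with the query-independent evidence.
    return sorted(
        text_score,
        key=lambda d: text_score[d] + static_weight * index.static_score[d],
        reverse=True,
    )


index = Index()
index.add("page-a", "information retrieval systems", static_score=0.2)
index.add("page-b", "information retrieval and search", static_score=0.9)
index.add("page-c", "gardening tips", static_score=1.0)

results = process_query(index, "information retrieval")
print(results)  # ['page-b', 'page-a']
```

Both matching pages have the same text score, so the query-independent score breaks the tie in favour of page-b. Page-c has the highest static score but is never returned, because it matches none of the query terms.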

The Elements Supported for Google’s Retrieval Tasks

Many kinds of evidence are used to rank documents in information retrieval. Here are the main ones.

Beyond text-based documentary evidence (vector space, probabilistic ranking, statistical ranking…), other evidence can be used for retrieval tasks, such as:

Metadata — Structured data
URLs
Document structure and tag information
Anchor text
Bibliometric measures
PageRank
Title (<title>)
…

A very good resource on this is Trystan Upstill’s thesis, Document ranking using web evidence. Its author now works at Google.

Today (2022), artificial intelligence techniques such as machine learning are at the heart of information retrieval at Google. For example, BERT is a machine learning model for natural language processing (NLP) that helps rank search result documents more intelligently. It enables a much deeper semantic understanding of both queries and documents than traditional statistical methods alone.

Whatever the case, classic information retrieval methods such as indexing, keywords, and word statistics still appear to be used by search engines to rank documents. Indeed, numerous SEO case studies report that covering a broad lexical field around the target topic and keyword, i.e. optimising for TF*IDF, brought better results in search engine optimisation.