
How Google Understands Text and Gives It a Quality Score

From TF-IDF to BERT, discover how Google reads, vectorizes and scores your content. This guide covers IR theory, word embeddings, BM25, Word2Vec, co-occurrence, n-grams and what it all means for SEO.

Definition: How Google Understands a Text and Gives It a Quality Score

Google cannot easily predict the quality of a web page from its content alone. It therefore uses probabilistic models to determine how well the page matches the query formulated by the user.

To do this, Google transforms words into vectors, as it needs a mathematical representation to make machines understand words.

Since 2018, BERT, a natural language processing (NLP) model, has helped Google recognise entities such as a person or a date, link them to its Knowledge Graph, and understand how everything is interconnected. Its bidirectional design also allows it to better grasp the meaning (semantics) of a word. Google is thus able to interpret a page better thanks to language processing models.

Finally, Google applies a score by examining, in particular, the presence of rare words. This is why it is important to produce good content that brings added value, while also running your content through SEO tools.

Google’s workings are so complex that it is impossible to explain everything, or even to claim that anyone at Google knows every process that makes Google what it is.

Here we will focus on what is important for an SEO: understanding the evidence in a text that Google looks for to judge part of the quality of a page.

Information Retrieval (IR)

It is important for an SEO (Search Engine Optimizer) to understand the concept of information retrieval (IR), as this is the foundation of the search engine.

Information retrieval (IR) is the field that studies how to retrieve information from a corpus. This corpus is composed of documents from one or more databases, which are described by content or associated metadata. The databases can be relational or unstructured.

For example, web pages are fundamentally unstructured and are connected simply by hyperlinks.

Information retrieval is historically linked to information science and library science, which aim to represent documents in order to retrieve information from them, through the construction of indexes in the same way that Google does today (to a certain extent).

What is Natural Language Processing (NLP)?

NLP, which stands for Natural Language Processing, is a discipline that focuses on the understanding, manipulation and generation of natural language by machines. NLP therefore sits at the interface between computer science and linguistics, and concerns the ability of machines to interact directly with humans.

What follows is Google’s approach to understanding the meaning of words.

Word Embedding: Vectorisation of Words

Word Embedding is a method that Google uses for language processing; word embedding is simply the vectorisation of words.

But what is the point of transforming a word into a vector?

Let us use an example.

Let us imagine that we have 2 words in the French language:

  1. Pain (Bread)
  2. Chocolat (Chocolate)

And that the query (the keyword) on Google is as follows: “pain chocolat chocolat”

How will Google determine that one page is more relevant than another?

By analysing the document in relation to other documents on the SERP, relative to the formulated query.

Let us say that the four documents on the SERP contain:

  1. pain pain chocolat pain
  2. chocolat chocolat chocolat chocolat chocolat pain
  3. pain chocolat
  4. pain chocolat pain chocolat chocolat

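This gives us the following word counts, which serve as the coordinates of each vector:

Vector representation in a table

                pain   chocolat
  Query           1       2
  Document 1      3       1
  Document 2      1       5
  Document 3      1       1
  Document 4      2       3

Now, let us plot our vectors in a vector space.

Representation of a vector space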

Intuitively, we can see that document 4 is closest to the query, so it is the most relevant. In this simplistic model, it is therefore the one that would rank in 1st position on the 1st page of Google.

The closeness between the query and a document is measured by the cosine of the angle between their vectors, known as the Salton cosine.

Salton is one of the pioneers in the field of information retrieval in computer science. One of his most important contributions is the development of the vector model.
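The pain/chocolat example can be checked with a few lines of Python. This is only a sketch of the simplistic model described above: the helper names `cosine` and `count_vector` are made up for the illustration, and words are split on whitespace.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def count_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in a whitespace-split text."""
    words = text.split()
    return [words.count(term) for term in vocabulary]

vocabulary = ["pain", "chocolat"]
query = count_vector("pain chocolat chocolat", vocabulary)
documents = {
    "document 1": "pain pain chocolat pain",
    "document 2": "chocolat chocolat chocolat chocolat chocolat pain",
    "document 3": "pain chocolat",
    "document 4": "pain chocolat pain chocolat chocolat",
}

# Score every document against the query, then sort from closest to farthest.
scores = {
    name: cosine(query, count_vector(text, vocabulary))
    for name, text in documents.items()
}
ranking = sorted(scores, key=scores.get, reverse=True)
for name in ranking:
    print(f"{name}: {scores[name]:.3f}")
```

Document 4 comes out first with a cosine of about 0.992, followed by documents 2, 3 and 1, which matches the intuition from the plot.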

When we have more than 2 words, we enter a 3D vector space, then 4D, etc.

The vector space model is based on the implicit assumption that the relevance of a document to a query is correlated with the distance between the query and the document. In the vector space model, each document (and query) is represented in an n-dimensional Euclidean space with an orthogonal dimension for each term in the corpus.

For example, a cat will be represented by a vector [0.43 0.88 0.98 1.3]. If we do this for all words in the language, it then becomes possible to compare word vectors with each other by measuring the angle between vectors. This will then allow us to predict that the word “dog” is closer to the word “cat” than it is to the word “skyscraper”. A vector space would also allow answering equations such as king – man + woman = queen, or the equation Paris – France + Spain = Madrid.
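The vector arithmetic can be illustrated with a toy example. The vectors below are hand-made for the demonstration (four invented dimensions: royal, male, female, young); real embeddings are learned from text and have hundreds of dimensions, so this is a sketch of the idea, not of how any real model stores words.

```python
import math

# Hand-made vectors; dimensions are (royal, male, female, young).
toy_vectors = {
    "king":     [1.0, 1.0, 0.0, 0.0],
    "queen":    [1.0, 0.0, 1.0, 0.0],
    "man":      [0.0, 1.0, 0.0, 0.0],
    "woman":    [0.0, 0.0, 1.0, 0.0],
    "prince":   [1.0, 1.0, 0.0, 1.0],
    "princess": [1.0, 0.0, 1.0, 1.0],
    "boy":      [0.0, 1.0, 0.0, 1.0],
    "girl":     [0.0, 0.0, 1.0, 1.0],
}

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if one of them is null)."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

def analogy(a, b, c, vectors):
    """Return the word closest to a - b + c, ignoring the three input words."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

print(analogy("king", "man", "woman", toy_vectors))   # queen
print(analogy("prince", "boy", "girl", toy_vectors))  # princess
```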

The point of all this is that Google previously only tried to match keywords, which is not at all practical for information retrieval.

Imagine that you are on a search engine, and that you want to find a restaurant that serves hamburgers.

A classic search engine would only return websites that contain the exact word: hamburger.

Yet, a restaurant that sold cheeseburgers would suit us, would it not?

TF*IDF

TF-IDF was used by Google (and probably still is). The idea of inverse document frequency comes from Karen Spärck Jones and dates back to 1972, before web search engines even existed.

Even earlier, in 1957, Hans Peter Luhn of IBM had already identified the principle and the value of term weighting in the context of information retrieval.

In any case, we saw previously that the problem is that all words carry the same weight, and that simply repeating the words of the query many times is enough to appear the most relevant.

TF-IDF solves this problem.

TF-IDF stands for Term Frequency-Inverse Document Frequency.

TF (Term Frequency): words are called terms, so this is simply the frequency of a word in a document.

IDF (Inverse Document Frequency): terms that are common across a corpus are less likely to convey useful relevance information. IDF is a frequently used measure of how discriminating a term is.

Example:

If a query contains the term “SEO”, a text is more likely to meet the information need if it contains this term: this is measured by the term frequency within the document (TF).

Nevertheless, if the term “SEO” is itself very frequent within the corpus, i.e. it is present in many texts, much like definite articles, it is what we call poorly discriminating. That is to say, it adds no value.

This is why it is proposed to increase the relevance of a term based on its rarity: what is called IDF. Thus, the presence of a rare term from the query in the content of a document increases the “score”.

Ultimately, the weight (the content score) is obtained by multiplying the two measures: TF * IDF

Once a set of pages has been identified as potentially answering the user’s query, Google must rank them by order of relevance. TF-IDF weighting is commonly used for this.

TF-IDF is, or has been, used for other search engine applications. The TF-IDF measure or similar variants has likely been used, for example, in scoring anchor text, notably via probabilistic models such as BM25.

TF-IDF is not a component of Word2vec, but it can be combined with language processing models such as Word2vec to obtain better weights for words.

Note: it is important to use all the words related to the query (everything that counts towards TF), and not just to think about rare, high-IDF words.

Document Length Normalisation

The term weighting function in the vector space model is often length-normalised, so that a term appearing in a short document is given more weight than a term appearing in a long document. This is called document length normalisation.

After observing relatively poor performance for the vector space model, Singhal (and others) hypothesised that the form of document length normalisation used in the model was inferior to that used in other models. To study this effect, they compared the length of known relevant documents with the length of documents otherwise retrieved by the retrieval system.

Their results indicated that longer documents were more likely to be relevant.

So ultimately, there is a certain length normalisation, but statistically long content is more relevant.

For an SEO, the idea is therefore generally to create medium-length content, even if reality is far more complex, notably because of search intent.

Search intent is inferred by machine learning models such as BERT. Such a model can detect whether a user wants short content, for example when they are searching for a definition. In that case, it is pointless to write more than 300 words. This is why it is generally better not to worry about whether your content is long enough, as long as it answers the user’s query.

OKAPI BM25

Okapi BM25 is a weighting method or rather a retrieval model based on the probabilistic retrieval framework used in information retrieval.

In information retrieval, Okapi BM25 (BM stands for Best Matching) is a ranking function used by search engines to rank documents based on their relevance to a search query. It is based on the probabilistic retrieval framework developed in the 1970s and 1980s primarily by Stephen E. Robertson and Karen Sparck Jones.

The name of the ranking function is BM25. However, it is generally called “Okapi BM25”, since the Okapi information retrieval system, implemented at the City University of London in the 1980s and 1990s, was the first system to implement this function.

BM25 is a bag-of-words model that ranks documents based on the frequency of terms that appear in each document, regardless of the relationships that may exist between these terms or their relative proximity within the document. There is a whole family of functions assigning a score to each document for a given query.
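To make the scoring concrete, here is a minimal sketch of one common form of the BM25 function. It is not Google's implementation: the parameter values `k1=1.5` and `b=0.75`, the smoothed IDF variant and the whitespace tokenisation are all assumptions chosen for the illustration, and `bm25_scores` is a made-up name.

```python
import math
from collections import Counter

def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document against the query with a common BM25 variant."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in tokenized for term in set(doc))

    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        # Documents longer than average are penalised, shorter ones boosted.
        length_norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a higher IDF; this variant never goes negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term frequency saturates: repeating a word brings less and less.
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores

documents = [
    "seo guide for beginners",
    "seo seo seo seo seo seo",
    "cooking recipes with chocolate",
]
scores = bm25_scores("seo guide", documents)
for doc, score in zip(documents, scores):
    print(f"{score:.3f}  {doc}")
```

In this toy corpus the first document wins: it matches the rarer term "guide", while repeating "seo" six times in the second document earns only a limited bonus because term frequency saturates.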

Okapi BM25 is a more highly regarded variant than TF*IDF. Indeed, it is described as one of the state-of-the-art methods in term weighting and document scoring.

BM25 and its more recent variants, for example BM25F (a version of BM25 that can take into account document structure, such as Hn tags and bold text, as well as anchor text), represent state-of-the-art TF-IDF type retrieval functions used in document retrieval.

Okapi BM25F, representation of weighting based on semantic HTML structure

This is an example and is not representative of what actually happens at Google.

The basic idea is to use an inverted index. This means keeping, for each word, a list of documents on the web that contain it.

Answering a query corresponds to retrieving matching documents (this is essentially done by cross-referencing lists for matching query words), processing documents (extracting quality signals corresponding to the query), ranking documents (using quality algorithms such as PageRank), then returning the top 10 documents.
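The first step, retrieving matching documents by cross-referencing lists, can be sketched in a few lines. The three documents and the function names are invented for the example; a real index also stores positions, weights and much more.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each word to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def matching_documents(query, index):
    """Return the ids of documents containing every word of the query."""
    postings = [index.get(word, set()) for word in query.lower().split()]
    if not postings:
        return set()
    # Cross-reference the lists: keep only ids present in all of them.
    return set.intersection(*postings)

documents = {
    1: "best hamburger restaurant in paris",
    2: "cheeseburger restaurant near the beach",
    3: "hamburger recipe with cheese",
}
index = build_inverted_index(documents)
print(sorted(matching_documents("hamburger restaurant", index)))  # [1]
```

Note that document 2 is not returned for "hamburger restaurant": this is exactly the limit of pure keyword matching discussed earlier with the cheeseburger example.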

The main advantage of BM25, and what makes it popular, is its effectiveness: it performs very well in many retrieval tasks.

BM25 and its variants are therefore a better probabilistic model, more representative of the web, for state-of-the-art TF-IDF type retrieval, because variants such as BM25F can use semantic HTML elements. Calling BM25 a “TF-IDF type” function does not mean that it relies on the TF-IDF framework (or the vector space model). It means that the BM25 score is calculated from two main components: TF and IDF. On top of these, it uses techniques to normalise document length and to satisfy the concavity constraint on term frequency (e.g. by considering a logarithmic TF instead of raw TF). Thanks to these heuristic techniques, BM25 often achieves better performance than TF-IDF.

TF-IDF in AI

In AI, TF, IDF and TF-IDF are measures used in the fields of information retrieval (IR) and machine learning to quantify the importance or relevance of string representations (words, phrases, lemmas, etc.) in a document among a set of documents (also called a corpus).

Term Frequency (TF) consists of counting, for each text, the number of occurrences of each token present in the corpus. Each text is then represented by a vector of occurrences. This is generally referred to as a bag of words (BoW).

Representation of vectors from the Term-Frequency (TF) method
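As a rough illustration of these occurrence vectors, the sketch below builds a bag of words for two invented sentences. The function name `bag_of_words` and the whitespace tokenisation are assumptions made for the example.

```python
from collections import Counter

def bag_of_words(texts):
    """Return the sorted vocabulary and one occurrence vector per text."""
    tokenized = [text.lower().split() for text in texts]
    vocabulary = sorted({word for doc in tokenized for word in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        # One component per vocabulary word, 0 when the word is absent.
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

texts = [
    "google reads text",
    "google scores text and google ranks text",
]
vocabulary, vectors = bag_of_words(texts)
print(vocabulary)  # ['and', 'google', 'ranks', 'reads', 'scores', 'text']
print(vectors[0])  # [0, 1, 0, 1, 0, 1]
print(vectors[1])  # [1, 2, 1, 0, 1, 2]
```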


Term Frequency-Inverse Document Frequency (TF-IDF): this method also counts, for each text, the number of occurrences of each token, but then weights each count by how rare the token is across the corpus: the more documents contain a token, the lower its weight.

For term x present in document y, we can define its weight by the following relationship (the TF-IDF formula):

$$w_{x,y} = tf_{x,y} \times \log\left(\frac{N}{df_x}\right)$$

Where:

  • tf_x,y is the frequency of term x in y;
  • df_x is the number of documents containing x;
  • N is the total number of documents.

This approach therefore gives each text a vector representation made of weights (thanks to TF-IDF) rather than raw occurrence counts.
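Here is a small sketch of this weighting, assuming the common form weight = tf × log(N / df) built from the three quantities defined above, with a natural logarithm and whitespace tokenisation (both assumptions). Library implementations such as scikit-learn's TfidfTransformer apply smoothing and normalisation by default, so their values will differ from these.

```python
import math
from collections import Counter

def tf_idf(texts):
    """Return, for each text, a dict mapping each term to tf * log(N / df)."""
    tokenized = [text.lower().split() for text in texts]
    n_docs = len(tokenized)
    # df: number of documents containing each term.
    df = Counter(term for doc in tokenized for term in set(doc))
    weights = []
    for doc in tokenized:
        tf = Counter(doc)
        weights.append(
            {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}
        )
    return weights

texts = [
    "seo tips for google",
    "seo tools for content",
    "seo and rare okapi weighting",
]
weights = tf_idf(texts)
for term, weight in sorted(weights[0].items(), key=lambda item: -item[1]):
    print(f"{term}: {weight:.3f}")
```

In this toy corpus "seo" appears in every document, so its weight drops to zero, while a term found in a single document, such as "okapi", receives the highest weight.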

Co-occurrence

SEO practitioners hear about occurrence and co-occurrence at every turn, but let us go back to the fundamentals of co-occurrences in vector spaces.

Two words, or other linguistic units, that co-occur have a close or distant semantic relationship; the notion of co-occurrence therefore underlies the notions of themes and lexical fields.

Quote from linguist John Firth:

You shall know a word by the company it keeps.

Corpus:

  • I’m riding in my car to the beach.
  • I’m riding in my jeep to the beach.
  • My car is a jeep.
  • My jeep is a car.
  • I ate a banana yesterday.
  • I ate a peach yesterday.

Let us imagine a vector of size k, where k is the number of distinct words.
Our corpus contains 16 distinct words, which gives us a vector of size 16:

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

Co-occurrence Matrix

To build a co-occurrence matrix, we must start with the complete vocabulary of words.

Let us break down the sentences of the corpus above into unigrams:

a, ate, banana, beach, car, I, I’m, in, is, jeep, my, peach, riding, the, to, yesterday

Then, to find the semantic link, the context of a word, we take a word from our corpus, for example “car”. We initialise a vector at 0 with as many 0s as there are distinct words in the corpus. This vector will then let us record how compatible each word is with our word “car”.
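A minimal sketch of this counting on the six sentences of the corpus is shown below. The window of two words on each side, the lowercasing and the regular-expression tokeniser are assumptions made for the illustration.

```python
import re
from collections import Counter, defaultdict

corpus = [
    "I'm riding in my car to the beach.",
    "I'm riding in my jeep to the beach.",
    "My car is a jeep.",
    "My jeep is a car.",
    "I ate a banana yesterday.",
    "I ate a peach yesterday.",
]

def tokenize(sentence):
    """Lowercase the sentence and keep words, apostrophes included."""
    return re.findall(r"[a-z']+", sentence.lower())

def cooccurrence_counts(sentences, window=2):
    """For each word, count the words found within `window` positions of it."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = tokenize(sentence)
        for i, word in enumerate(tokens):
            neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            counts[word].update(neighbours)
    return counts

counts = cooccurrence_counts(corpus)
vocabulary = sorted(counts)

# One row of the co-occurrence matrix: the context vector of "car".
car_vector = [counts["car"][word] for word in vocabulary]
print(vocabulary)
print(car_vector)
print(counts["car"] == counts["jeep"])  # True: same neighbours in this corpus
```

With these settings "car" and "jeep" end up with identical context vectors, as do "banana" and "peach", while "car" never co-occurs with "ate".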

Take the sentence “I ate a banana yesterday” from our corpus. Representing it mathematically with vectors shows how effective this method is: given the context words “ate”, “a” and “yesterday”, there is only an infinitesimal probability of finding an unrelated word in place of “banana”, as in:

I ate a jeep yesterday

or

I ate a bicycle yesterday

Thus, a co-occurrence is a word that appears near another word and helps provide its context.

However, for example, polysemy is considerably more complex because there are multiple contexts.

Avocat (lawyer/avocado) is a polysemous word – what does the word refer to in context?

If we have co-occurrences such as: Marc is a highly regarded avocat
Or a co-occurrence such as: The avocats are ripe

In that case, the co-occurrence(s) could reflect that the word “avocat” is a profession in the 1st sentence or a fruit in the 2nd.

Except that in the Word2vec model, polysemous words were not really understood. We talk rather of “broad context”, like the fact that the probability of someone eating a Jeep is close to zero. BERT addresses this challenge.

Thus, in SEO, a co-occurrence is not in itself a true content score weight, as the presence of a rare word would be in TF-IDF processing; it simply helps Google disambiguate a topic. If Google understands your content well because the co-occurrences are relevant, your page may have a better chance of ranking, but this happens relatively naturally. A co-occurrence may also be a rare word, but this would only be a coincidence.

Word2vec

Tomas Mikolov led a research team at Google that developed this technology in 2013. The idea was to transform words into vectors, and the result was Word2vec, a name that simply means “word to vector”.

In the Word2Vec method, unlike One Hot Encoding and TF-IDF, the vectors are learned: an artificial neural network is trained on unlabelled data (unsupervised learning) to produce the Word2Vec model, which generates word vectors.

Word2vec is a two-layer neural network that processes text by “vectorizing” words. Its input is a text corpus and its output is a set of vectors: feature vectors that represent words in that corpus. Although Word2vec is not a deep neural network, it transforms text into a numerical form that deep neural networks can understand.

The main hypothesis behind these methods is that a word can be characterised by the “context” in which it is found, i.e. the words with which it is often used. This is called the distributional hypothesis.

What is interesting is that this context makes it possible to build a space that brings together words that have not necessarily been found next to each other in a corpus! These vector representation methods have also made it possible to train word representation models on much larger corpora (hundreds of billions of words, for example…).

Word2Vec is not a single algorithm but a pair of techniques that use AI methods for natural language processing (NLP).

These two techniques are:

  • CBOW (Continuous bag of words)
  • SG (skip-gram – also called k-skip-n-gram)

The different methods for Word2vec language processing

In both cases, the neural network has two layers. The hidden layer contains several hundred neurons and, once training is complete, provides the word embedding that represents each word. The output layer implements a classification task using a softmax.

We will not go into the training methods in detail. The first, called “Continuous Bag of Words” (CBOW), trains the neural network to predict a word based on its context.

In the second method, we try to predict the context based on the word. This is the “skip-gram” technique.

The skip-gram is therefore the inverse architecture of CBOW.

In other words, the input to the neural network in the CBOW framework takes a window around the word and tries to predict the output word. In the skip-gram framework, we try to do the reverse, i.e. predict the surrounding words over a predetermined window using the studied input word.
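The difference between the two architectures is easiest to see in the training examples each one is fed. The sketch below only builds those examples from a sentence; it does not train a neural network. The window of two words and the function names are assumptions for the illustration.

```python
def cbow_examples(tokens, window=2):
    """(context words, target word) pairs: CBOW predicts the target from its context."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        examples.append((context, target))
    return examples

def skipgram_examples(tokens, window=2):
    """(input word, context word) pairs: skip-gram predicts each context word from the input."""
    examples = []
    for i, word in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        examples.extend((word, neighbour) for neighbour in context)
    return examples

tokens = "my jeep is a car".split()
print(cbow_examples(tokens)[2])       # (['my', 'jeep', 'a', 'car'], 'is')
print(skipgram_examples(tokens)[:2])  # [('my', 'jeep'), ('my', 'is')]
```

The same window produces one CBOW example per word but several skip-gram examples, one for each neighbour of the input word.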

In general, skip-gram is reported to achieve better performance.

Once trained, such a model can detect synonyms or suggest additional words for a partial sentence without having to specify how the words are linked or why.

BERT

In natural language processing (NLP), BERT, an acronym for Bidirectional Encoder Representations from Transformers, is a language model developed by Google in 2018. This method has significantly improved performance in natural language processing.

The problem with the previous approaches is that, for example, the words “solution” and “phial” are represented by relatively close vectors in the vector space where they are defined.

Yet if, in our text, “solution” is used as in “I found the solution”, its vector should not be close to the vectors of the “biology” theme…

BERT was trained to guess which word should appear in a given context. Moreover, being a bidirectional model, it analyses the meaning of a word in relation to the text both before and after it. This allows it to determine the context of a word.

For example:

BERT's understanding of semantics (vector space)

The sentence is “arms bend at the elbow”. Google indicates that the closest match for the word “arms” in this context is the sentence “wave your arms around” and not “Germany sells arms to Saudi Arabia”. So BERT understands that “arms” here refers to the body limbs, even though in English “arms” can also mean weapons. This is what linguistics calls a polysemous word.

What does this change compared to what we have seen previously? BERT is much more technical, much more complex and much more representative of reality, notably because it takes into account entire sentences before and after a word, and not just 5 words around it.

BERT is far more complex than this overview suggests; an entire article is dedicated to it.

BERT has therefore, in part, made it possible to better consider the meaning of content to judge the quality of a page. Not to mention that this has also helped further limit Web spam (such as content spinning).

BERT is a machine learning model, and models of this kind can be used in other applications such as voice recognition, image understanding and user search intent (understanding queries and refining search results).

Other NLP Language Processing Models in a Semantic Web

The Web is increasingly approaching a Web 3.0. In Tim Berners-Lee’s definition, this means that a word is not just a string of characters but something that has meaning (semantics).

This is why Google created a knowledge graph, the Google Knowledge Graph, in order to “really” understand things.

For example, historically Paris was just a vector. Today Paris is understood as a city in France in its own right, and not merely because it is close to vectors such as “France” and “capital”.

Semantic algorithms

Initially built manually by volunteers, the Knowledge Graph now relies on the Knowledge Vault to keep improving in an automated manner.

When Google detects an entity (this can also be a date, an event, a brand, …) it will try to see if it knows it in its knowledge graph in order to better grasp the meaning of the content as a whole.

Google's API for NLP (language processing)

Here, Google links Paris to the Wikipedia article “Paris” as a city.

Google tries to determine statistically that here, Paris refers to the city and not Paris Hilton. It then sees what Paris is and what Paris is truly linked to.

Then, it is pure linguistics work to examine how each word is related.

The syntax of words and phrases that Google determines with its NLP methods

Sentiment analysis also helps refine the understanding of language.

Sentiment analysis has many applications. For example, Google could use pre-trained continuous vector representations of texts to understand sentiments contained in customer reviews of a product.

Finally, Google will assign a category to the entirety of your text. For example, the category might be “/Beauty & Fitness/Cosmetology & Beauty Professionals”.

To understand all of this in depth, I strongly advise you to explore our complete article on how Google works as a structured and semantic search engine:

SEO in a Semantic Web.

Vector Spaces, Semantics, Semantic Algorithms

Think now about Hummingbird and RankBrain, and how these systems learn and rewrite queries and search results based on vectors that are semantically close in their vector spaces.

Think about how everything is interconnected – for relevance, the score of content, and for semantic search, and for the ranking of pages.

As an SEO, How Does This Concern Us?

Semantic SEO tools such as SEOQuantum, YourTextGuru, Cocon.se, InLinks and many others can help you increase the relevance of your content by adding words from the same lexical field, which helps Google’s understanding (co-occurrences), and potentially by surfacing rare words. These tools also have semantic SEO layers, such as suggesting verbs and adjectives that help identify entities, suggesting answers to questions, or other things such as lexies and metamots, which is what cocon.se does.

This will therefore allow you to:

  • Appear more easily on Google SERPs for keyword variations (relative to the targeted keyword and query)
  • Be more relevant in Google’s eyes
  • Beyond relevance, we can also say that this allows Google to better understand our pages by disambiguating our ideas by providing semantically close words

Beyond the tools, this can also help you learn to write content (SEO copywriting) so that Google better understands the meaning of your page. Covering a broad lexical field around your topic helps Google consider your page a quality page. In a semantic web, you would also need to learn to write in triples (subject, predicate, object) so that Google understands each entity in your text.

If you are more of a technical SEO, you could also build your own tools in code using word embedding libraries. Almost all of Google’s natural language processing (NLP) technologies are open source. For word vectors, you can also use tools that have done the work for you by using Google’s APIs to maximise your content score, such as https://wordgraph.io/.

The SEO tool Wordgraph based on Google's NLP APIs to maximise its SEO score

To give another example, it is possible to calculate your BERT score in order to determine whether your content is easily understandable by search engines.

N-grams are also important.

Analysing n-grams for your SEO with Oncrawl

N-grams are sequences of n consecutive words; analysing them shows which terms and phrases appear most often on your website. Google analyses your n-grams too. For example, a Google patent whose inventors include Navneet Panda, the engineer behind Google Panda, indicates that Google could generate site quality scores from language models derived from n-gram statistics, comparing a site with known high-quality sites. Other similar patents exist. In short, n-grams are important.

N-grams can also reflect your SEO strategy, as they indicate the different keywords for which you want to rank.

N-grams have yet other applications in SEO optimisation, such as probing the competition. You could thus obtain information on your competitors’ N-grams by seeing the different keywords they are positioned on.

Best practices for n-grams:

  • Make sure you are not using a keyword stuffing strategy, as this has a negative impact on your n-grams. (Google Panda looks at your n-grams).
  • On the other hand, establish an appropriate keyword strategy in order to rank for specific terms.
  • Regularly check your n-grams using a tool like OnCrawl to see if your keyword strategy is appropriate and working correctly.

If you wish to know more about how Google judges the quality of a site in terms of content (but not only) you should consult the article on EAT (expertise, authority, trust).

And finally, to calculate TF IDF I advise you to start here:

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer.fit_transform

All of this teaches you how Google works and how you can optimise.

As the saying attributed to Confucius goes: when a man is hungry, it is better to teach him to fish than to give him a fish. If you teach a man the art of fishing, he will eat for life.

Summary:

Word vectors and context vectors are at the foundation of Google, and of search engines in general. They use these vector-based mathematical representations for many things, such as Google RankBrain. This is indispensable knowledge in search engine optimisation (SEO). As always, use synonyms and cover a broad lexical field of words and keywords around a topic.

Also, this article is only a quick overview of a small part of what Google does and certainly does not suffice to explain everything. For example, Google also looks at whether your content is of poor quality: whether your spelling is good (to limit Web spam), or whether you use too many semantically close words to boost your page’s text score. Anchors are also powerful on-page factors for content optimisation, but this is simply not the direction I wanted to take in this blog post.

I therefore advise you to read up on anti-spam filters such as Panda, a filter that analyses spammy content, and on the algorithms that revolve around the concept of Google EAT, which could sometimes be linked to content scores.
More generally, I recommend a careful reading of the algorithms and the Google ranking factors listed on this site.

In any case, and as you have been able to realise, Google is both very intelligent and extremely limited. And Google still has a great deal of progress to make in improving information retrieval. This is why content is generally highlighted by search engines when it is well-known and has many links pointing to the document, as this is more reliable evidence than the content itself. Even though, obviously, everything is weighted and there are also other things taken into account to judge the ranking of a page.

If you have questions, or anything at all, please do not hesitate! Good comments will help clear up any ambiguities I may not have thought to clarify!

FAQ

What is a word vector?

In information retrieval and linguistics, word vectors are used to quantify the degree of semantic similarity between words within a vector space. The closer two vectors are in that space, the stronger the semantic proximity between the words. Proximity is expressed as an index between 0 and 1 (0 = zero proximity, 1 = maximum proximity) and is obtained from the angle between the vectors or from their distance. Polysemy is a limit of this approach: the word “solution”, having several meanings, will be close to words related to each of them, even when those words have no connection with the context in which it was used.

What is a context vector?

A context vector addresses the polysemy problem of word vectors. In the context of information retrieval at Google, BERT uses a bidirectional model to understand the context of a word, so it can handle polysemous words.
Historically, the approach was Word Embedding, which consisted of building fixed-size vectors that take into account the context in which the words are found.

What is Okapi BM25?

Okapi BM25 is a bag-of-words ranking function built on the TF-IDF principle. Variants such as BM25F can apply different weights to semantic HTML elements.

What is Word2Vec?

Word2Vec is composed of two architectures: the continuous bag of words model (CBOW) and the skip-gram model. CBOW aims to predict a word from its context in a sentence. Skip-gram has the mirror-image architecture: it predicts the context words from an input word. BERT later took the idea of learning from context much further.

What is FastText?

FastText is an openly accessible library developed by Facebook’s AI Research (FAIR) laboratory, whose team included Tomas Mikolov. It can train supervised or unsupervised models to obtain vector representations of words (word embeddings). The technology is very similar to Word2Vec but differs, for example, in its use of n-grams.
FastText is very easy to set up even for a beginner, so do not hesitate to familiarise yourself with it.

What is BERT?

BERT, for Bidirectional Encoder Representations from Transformers, is a natural language processing model that builds on the same ideas as its predecessor Word2Vec. Instead of CBOW or skip-gram, its training tasks are MLM (Masked Language Modeling) and NSP (Next Sentence Prediction). It provides a better understanding of the meaning and context of words based on the surrounding sentences. Its mechanism is also more complex, as it relies on attention.

What is NLP (natural language processing)?

Natural language processing (NLP) is a multidisciplinary field involving linguistics, computer science and artificial intelligence, which aims to create natural language processing tools for various applications.

What is Word Embedding?

Word embedding is a method of learning word representations used in natural language processing. It is simply the vectorisation of words in a vector space. The term is sometimes used loosely, however, which can make things confusing. Artificial intelligence and its professions change every day, just as SEO does.

What is TF*IDF?

TF-IDF is a weighting method used in information retrieval, and therefore by search engines, and particularly in text mining, although it has other applications. This statistical measure evaluates how important a term is to a document within a corpus, giving more weight to rare terms.

What is a bag of words?

We consider that the world can be described by means of a dictionary (of “words”). In its simplest version, a particular document is represented by the histogram of occurrences of the words composing it: for a given document, each word is assigned the number of times it appears in the document, which is called a “bag” (multi-set). A document is therefore represented by a vector of the same size as the dictionary, whose i-th component indicates the number of occurrences of the i-th word of the dictionary in the document. Two classic normalisations are lemmatisation and stemming. It is also quite common to define a rejection list (stop words) of words not to be considered (such as pronouns, articles, etc.) as they are too numerous in text corpora to be discriminating. In addition to the words in the dictionary, it is also possible to consider combinations of them, in other words N-grams, thus increasing the size of the dictionary.

What is an n-gram?

N-grams, sequences of n consecutive items such as words, are often used in language processing, and they have a wide range of applications. For example, they are used in artificial intelligence and in content duplication detection. Since a picture is worth a thousand words, here is what n-grams simply consist of:

Illustration of the n-gram model and operation
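In code, extracting word n-grams and counting the most frequent ones takes only a few lines. This is a sketch with whitespace tokenisation; the function names `ngrams` and `most_common_ngrams` and the sample texts are invented for the example.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def most_common_ngrams(text, n, top=3):
    """Count the n-grams of a text and return the `top` most frequent ones."""
    return Counter(ngrams(text.lower().split(), n)).most_common(top)

tokens = "i ate a banana yesterday".split()
print(ngrams(tokens, 1))  # unigrams
print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams

print(most_common_ngrams("seo tools help seo tools users", 2, top=1))
# [(('seo', 'tools'), 2)]
```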

Difference between skip-gram and n-gram?

k-skip-n-gram (or skip-gram for short) is a general concept: building n-grams while allowing some words of a sequence (e.g. a sentence) to be skipped. In the context of Word2vec, skip-gram is the name of one of its algorithms, which is much more complex than “just” skipping certain words, although it does use a context around a “skipped” word.
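The general concept can be sketched as follows, assuming the usual reading in which k is the total number of words that may be skipped inside one n-gram. The function name `skip_ngrams` is made up for the illustration, and n is expected to be at least 1.

```python
from itertools import combinations

def skip_ngrams(tokens, n, k):
    """All n-token subsequences (order kept) that skip at most k tokens in total."""
    result = []
    for positions in combinations(range(len(tokens)), n):
        # Tokens skipped between the first and the last selected position.
        skipped = positions[-1] - positions[0] - (n - 1)
        if skipped <= k:
            result.append(tuple(tokens[p] for p in positions))
    return result

tokens = "i ate a banana yesterday".split()
print(skip_ngrams(tokens, 2, 0))  # k = 0 gives the ordinary bigrams
print(skip_ngrams(tokens, 2, 1))  # adds pairs such as ('i', 'a') and ('ate', 'banana')
```

With k = 0 the function returns the 4 ordinary bigrams of the sentence; allowing one skipped word raises the count to 7.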

What are the SEO tools to boost your content score?

YourTextGuru, InLinks, SEOQuantum, 1.fr are the best known.

What is quality content for Google?

“Quality content” does not mean much in itself for Google. First, avoid internal and external content duplication. Then, offer content that adds value, and use semantic SEO tools to increase your “score”.

What is Google’s filter for detecting poor content?

Google Panda is the anti-spam filter detecting content “farms”. It uses in particular an n-gram model for duplicate content detection.

What is KELM?

Google AI Blog announced KELM, a method that could be used to reduce bias and toxic content in search (open-domain question answering). It uses a method called TEKGEN to convert Knowledge Graph facts into natural language text that can then be used to improve natural language processing models.