Google Panda: Everything You Need to Know About the Anti-Spam Algorithm

Google Panda is Google's algorithm for detecting and penalizing low-quality content. Deployed in 2011 and integrated into the core algorithm in 2016, it uses machine learning and Navneet Panda's patents to evaluate content quality at scale.

Published on June 14, 2022. Reading time: 17 min. By Stan De Jesus Oliveira

Illustration of the Google Panda algorithm

Google Panda Definition

Google Panda was deployed on 24 February 2011 with the primary aim of demoting content farms. It is above all a Google algorithm acting as an anti-spam filter: it detects duplicate, thin or poor-quality content and pushes it out of the top pages of search results.

Today this filter is an integral part of the core algorithm and therefore runs continuously. The algorithms that compose it are presumably applied at each iteration of the crawl process.

Where Does Google Panda Come From?

When we talk about Google Panda, we generally mean the first anti-spam filter to bear the name Panda. This filter detects spammy content and, more generally, low-quality content, while focusing particularly on sites in the medical niche.

But the team known as “Panda” at Google did not stop there: it continued to publish new patents on detecting good content.

It is obviously interesting to know about the Google Panda filter, but it is also important to remember that Google introduces new algorithms every year, notably on the detection and understanding of human language in order to detect good content.

The Inventor of Google Panda

As mentioned above, teams rather than a single person are behind this filter. Its name, however, comes from one engineer.

The Google Panda algorithm was developed by, among others, an engineer whose surname is Panda, and the filter was named after him. His full name is Navneet Panda.

According to Amit Singhal: “Well, we named it internally after an engineer, and his name is Panda. So, internally, we called it Panda. He was one of the key guys. He essentially proposed the breakthrough a few months ago that made this possible.”

He is an important person in the world of SEO: a real human being with a pretty cool name, and the update was based on his breakthrough. So, if Panda is a person whose recent breakthrough resulted in a massive change in how Google evaluates websites, what we can learn about him could help a world that is largely confused about Panda, Farmer or any other update.

What Does Google Panda Do Exactly?

Let us return to the premise of detecting good content and to the Google Panda anti-spam filter. It is very well known, and everyone can say in one sentence what it is for: determining, by looking at a site’s content, whether the site complies with Google’s guidelines.

That is indeed the working definition.

But here is something more interesting. Amit Singhal, an engineer at Google, published 23 guiding questions on which the anti-spam filter focuses:

Is the information presented in this article trustworthy?
Was this article written by an expert or someone with good knowledge of the subject, or does it only provide superficial information?
Does the site contain duplicate or overlapping articles on one or more similar topics, with slight variations in keywords?
Would you trust this site with your payment card details?
Does this article contain spelling mistakes, stylistic errors or inaccurate facts?
Does the proposed content correspond to readers’ genuine interests, or was it generated solely to improve the site’s search ranking?
Does the article provide original content, information, research, analysis or reports?
Does the page offer something additional compared to others?
To what extent is the quality of the proposed content controlled?
Does the article offer multiple perspectives on what happened?
Does the site have recognised authority in the area being addressed?
Is the content produced by a large number of creators, largely outsourced, or distributed across a wide network of sites, meaning that each page or site is not subject to thorough oversight?
Is the article polished or does it seem to have been written in a hurry?
If you had a question about a medical problem, would you trust the information on this site?
Does the name of this site evoke a competent authority?
Does this article offer a complete description of the topic?
Does this article offer in-depth analysis or information that required some reflection?
Is this the kind of page you would like to bookmark, share with a friend or recommend?
Does this article contain an excessive number of advertisements that distract the reader or prevent them from accessing the main content?
Could you find this article in a magazine, encyclopaedia or printed book?
Are the articles useless because they are too short, too superficial or too vague?
Have the pages been produced with a great deal of care and rigour or not?
Do internet users who access these types of pages have reason to complain?

To summarise this list, which I hope you read carefully, Singhal refers to the use of repeated words (that is, semantic optimisation at content level and, more broadly, content spinning), as well as to content duplication, author authority, webmaster common sense and even backlinks.

What Google Says to Improve Your Ranking

Here is an excerpt from Google Search Central explaining how to optimise your content for the Panda filter.

Many of you have asked us how to improve your ranking on Google, especially if you think you are affected by the Panda algorithm update. We advise you to keep the above questions in mind and do your best to offer quality content, rather than trying to optimise it solely to meet the requirements of Google’s algorithms.
Note also that if certain pages of a website are of poor quality, this can affect the ranking of the entire site. Therefore, you may remove these pages to optimise the ranking of those of better quality, or improve their content, integrate them into more useful pages or transfer them to another domain.

Obviously they are not wrong: simply focus on creating good content. But when you see that their questions refer to the presence of a “qualified author”, this can nonetheless give you ideas for optimising your SEO.

What I Say to Optimise Your Content

It is very easy to summarise all of this as “create good content”. But the real question arises: what actually is good content?

To summarise, without casting too wide a net, here is specifically what you should take care to do when writing good content:

Cite your sources and mention entities
Have an “author” section
Do not write “waffle”
Do not copy sentences from other content (internal or external to your site)
Add additional information compared to your competitors
Match your content to the search intent
Think about user experience at page level from a visual standpoint and, more specifically, avoid pop-ups
Make your content long enough to cover the topic in its entirety
Add internal links that let readers explore the topics mentioned further
Create “good content”: do not create superficial, vague or overly short article pages, and make your content shareable

An Overview of Some Panda Team Patents

We cannot say precisely whether a given patent concerns the first version of Panda’s main algorithm. But here is an overview of the patents filed by engineer Navneet Panda to improve search engine rankings.

The Navneet Panda Algorithm

Here is an example of a Panda team patent covering anti-spam but also the analysis of good content more broadly:
The Navneet Panda algorithm, a Google patent by the Panda team

Name: Navneet Panda algorithm.
Granted: 12 May 2015
Filed: 27 June 2012

This granted patent offered a means of measuring the quality of a website, and this measurement could influence a site’s ranking in search results for a particular query.

The patent explicitly indicates which features of a site it treats as signs that the site is a quality site.

The score is determined from quantities indicating user actions consisting of searching for and preferring particular sites and the resources found in particular sites. A site quality score for a particular site may be determined by calculating a ratio between a numerator that represents user interest in the site as reflected in user queries directed to the site and a denominator that represents user interest in resources found on the site as responses to queries of all kinds. The quality score of a site can be used as a signal to rank resources or to rank search results that identify resources found on one site relative to resources found on another site.

A query may be classified as referring to a particular site when it has been determined that the query is a navigational query to that particular site.

In other words, the patent describes how to derive a quality score from navigational intent, for example when you type “panda Google createur2site” into Google, that is, when the query mentions the website’s name.

Under this site quality score approach, the patent can also treat a “collection of resources” as a site. Such a collection may be one of several sites hosted on the same domain, or a single site divided into sub-domains or sub-directories.
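To make the ratio described in the patent excerpt more concrete, here is a minimal Python sketch. The function name, the two counts and the idea of feeding it raw query counts are my own assumptions for the example: the patent describes a numerator and a denominator, not an implementation.

```python
def site_quality_score(site_directed_queries: int, all_query_selections: int) -> float:
    """Ratio of interest in the site itself to interest in its resources.

    site_directed_queries: hypothetical count of queries classified as
        referring to the site (for example navigational queries naming it).
    all_query_selections: hypothetical count of times resources from the
        site were chosen as responses to queries of any kind.
    """
    if all_query_selections == 0:
        return 0.0  # assumption: no observed interest means no score
    return site_directed_queries / all_query_selections


# Two invented sites that receive the same number of visits from search.
brand_site = site_quality_score(site_directed_queries=400, all_query_selections=1000)
anonymous_site = site_quality_score(site_directed_queries=20, all_query_selections=1000)
print(brand_site, anonymous_site)
```

With these invented numbers, the site that people search for by name gets a much higher ratio than the site that only receives generic search traffic, which is the intuition behind the signal.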

The Panda Team’s “Gibberish” Patent

The patent “Identifying gibberish content in resources” describes calculating a gibberish score for a resource from a language model score and a query stuffing score, then using that gibberish score to decide whether to modify the resource’s ranking score.

This patent uses, among other things, n-grams to calculate the “gibberish rate”.

What Are N-gram Phrases?

An n-gram phrase can be a 2-gram, 3-gram, 4-gram or 5-gram phrase: pages are broken down into two-word, three-word, four-word or five-word phrases. Once a body of pages has been broken down into n-grams, these can be used to build language models or sentence patterns to compare with other pages.
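As an illustration, here is a short Python sketch that breaks a sentence into 2-grams through 5-grams. The function name and the naive whitespace tokenisation are assumptions made for the example; a real system would tokenise text far more carefully.

```python
def word_ngrams(text: str, n: int) -> list[tuple[str, ...]]:
    """Return every run of n consecutive words in text, in reading order."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence = "Google Panda detects low quality content"
for n in range(2, 6):
    print(n, word_ngrams(sentence, n))
```

A six-word sentence yields five 2-grams, four 3-grams, three 4-grams and two 5-grams. A text shorter than n words simply yields no n-gram.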

N-grams allow sentences to be broken down to see whether content is producing gibberish or duplicating other content.
How n-grams work

I will not explain n-grams mathematically.
Rather, understand this.

By breaking down sentences with this model, if someone copies a sentence and changes a few words, or their order, Google is able to detect that the content has been duplicated, because it will find identical “sets of shingles”.

The proportion of shingles that two pages have in common therefore gives the duplication rate.
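Here is a minimal sketch of that idea in Python. I am assuming three-word shingles and the Jaccard ratio (shared shingles divided by all distinct shingles) as the measure of “proportion in common”. That is one common choice, not something Google has confirmed, and the sample sentences are invented.

```python
def shingles(text: str, size: int = 3) -> set[tuple[str, ...]]:
    """Set of all runs of `size` consecutive words in text."""
    words = text.lower().split()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def duplication_rate(a: str, b: str, size: int = 3) -> float:
    """Shared shingles divided by all distinct shingles (Jaccard ratio)."""
    sa, sb = shingles(a, size), shingles(b, size)
    if not sa or not sb:
        return 0.0  # a text too short to shingle cannot be compared
    return len(sa & sb) / len(sa | sb)


original = "google panda demotes sites that publish thin duplicated content"
rewritten = "google panda demotes sites that publish weak duplicated content"
unrelated = "my grandmother bakes an excellent apple pie every sunday"

print(duplication_rate(original, rewritten))  # one word changed
print(duplication_rate(original, unrelated))  # nothing in common
```

Changing a single word in a nine-word sentence still leaves four of the ten distinct shingles in common, so the copy remains easy to spot.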

As an SEO, How to Comply With This?

If you are in full control of your site, you know whether you have duplicated things. If you work for a client, it is good to do the following:

Crawl the site and check its internal duplication rate (with Screaming Frog, for example)
Crawl the site and check its external duplication rate (with Kill Duplicate, for example)

Using N-gram Sentence Models to Generate Site Quality Scores

We have just seen that shingles allow us to detect whether content is duplicated, and that n-grams can also be used to calculate the gibberish rate of content.

But the n-gram sentence model also allows a relevance score to be assigned to content, as we can see in this 2017 patent.

In fact, this recent Google patent generates site quality scores based on linguistic models from n-gram statistics to compare with known high quality sites.
In addition to generating n-grams from text on sites, some versions of the implementation of this patent also generate n-grams from the anchor text of links pointing to pages on the sites. Building a sentence model involves calculating the frequency of n-grams on a site “based on the number of pages divided by the number of pages on the site”, which presumably means the number of pages containing the n-gram divided by the total number of pages on the site.
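Here is a small Python sketch of that frequency calculation. It relies on my reading of the quoted phrase (pages containing the n-gram divided by all pages of the site); the function names and the three sample pages are invented for the example.

```python
from collections import Counter


def page_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Distinct n-grams appearing on one page."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def phrase_model(pages: list[str], n: int = 2) -> dict[tuple[str, ...], float]:
    """Map each n-gram to the share of the site's pages that contain it."""
    if not pages:
        return {}
    page_counts = Counter()
    for page in pages:
        # page_ngrams returns a set, so each page counts once per n-gram
        page_counts.update(page_ngrams(page, n))
    return {gram: count / len(pages) for gram, count in page_counts.items()}


site_pages = ["cheap pills online", "buy cheap pills now", "contact our team"]
model = phrase_model(site_pages)
print(model[("cheap", "pills")])
```

In this toy site, “cheap pills” appears on two pages out of three. A model like this, built for sites already known to be good or bad, could then be compared with the model of a new site.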

Google Panda: Algorithmic Implementation

Wired.com: But how do you implement that algorithmically?

Cutts: I think you’re looking for signals that recreate that same intuition, that same experience that you have as an engineer and that users have. Every time we looked at the most blocked sites, it matched our intuition and experience, but the key is that you also have your experience of the types of sites that will add value to users rather than not add it. And we actually came up with a classifier to say, OK, IRS or Wikipedia or New York Times is on this side, and low quality sites are on this side. And you can really see mathematical reasons…

Singhal: You can imagine in a hyperspace a bunch of points, some points are red, some points are green, and in others there is a mix. Your job is to find a plane that says most things on this side of the plane are red, and most things on this side of the plane are the opposite of red.
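The image Singhal uses, a plane separating red points from green points, is the classic description of a linear classifier. Here is a minimal Python sketch using the perceptron learning rule. Nothing in the interview says Panda uses a perceptron, and the two features and six data points are invented purely to illustrate the geometry.

```python
def train_perceptron(points, labels, epochs=20, lr=1.0):
    """Learn weights w and bias b so that the sign of w.x + b matches labels."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # point on the wrong side: move the plane
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def side(w, b, x):
    """Return +1 or -1 depending on which side of the plane x falls."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1


# Invented features: (share of original text, ad density).
points = [(0.9, 0.1), (0.8, 0.2), (0.7, 0.1), (0.2, 0.8), (0.1, 0.9), (0.3, 0.7)]
labels = [1, 1, 1, -1, -1, -1]  # +1 = "green" (high quality), -1 = "red"

w, b = train_perceptron(points, labels)
print([side(w, b, p) for p in points])
```

In two dimensions the “plane” is just a line. With hundreds of quality signals it becomes the hyperplane Singhal describes, but the principle stays the same.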

Merged Search Results

The patent is:
Selective Generation of Alternative Queries
Inventors: Navneet Panda, April R. Lehman, Trystan G. Upstill

The patent indicates that the search engine can use a whitelist of high-quality sites and a blacklist of low-quality sites, either prepared manually or generated by an algorithmic method.

We are also told that if a certain number of the top ranked pages for the initial query are on low quality sites, a second query based on this first query may be used. The patent tells us that one way to proceed is to use a database that “includes replacement query terms and can generate an alternative query by substituting a replacement query term for one of the query terms in the first query”.
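The substitution mechanism can be sketched in a few lines of Python. Everything here is hypothetical: the replacement database, the blacklist, the threshold and the function names are invented to illustrate the logic, and the patent does not describe its implementation at this level of detail.

```python
def alternative_queries(query: str, replacements: dict[str, list[str]]) -> list[str]:
    """Build alternative queries by swapping one term for a replacement term."""
    terms = query.split()
    alternatives = []
    for i, term in enumerate(terms):
        for substitute in replacements.get(term, []):
            alternatives.append(" ".join(terms[:i] + [substitute] + terms[i + 1:]))
    return alternatives


def needs_alternative(top_sites: list[str], low_quality: set[str], threshold: int) -> bool:
    """True when at least `threshold` of the top results are on low-quality sites."""
    return sum(site in low_quality for site in top_sites) >= threshold


replacements = {"cheap": ["affordable", "budget"]}  # hypothetical database
low_quality = {"spam-farm.example"}                  # hypothetical blacklist
top_sites = ["spam-farm.example", "spam-farm.example", "good.example"]

if needs_alternative(top_sites, low_quality, threshold=2):
    print(alternative_queries("cheap flights paris", replacements))
```

The alternative queries are only generated when the first result set looks poor, which is what “selective” means in the patent title.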

This is reminiscent of certain Google patents covering substitute query terms, such as the patents likely behind Hummingbird, RankBrain or BERT, and more generally the algorithms that influence semantic SEO.

As an alternative, the patent tells us that the search engine could build “a conceptual query graph and traverse the graph to obtain one or more alternative queries”. Each node of the graph is defined by a query and a set of top-ranked search results obtained for that query. The links between nodes may indicate that queries are related or that one query is an alternative query for another.

This would be very different from the link graphs we think of when it comes to Google, but it is an interesting way to think about how alternative queries might be found. The patent builds on this graph-based approach.
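To picture the traversal, here is a minimal Python sketch using a breadth-first walk. The graph content, the function name and the hop limit are assumptions for the example. For brevity each node holds only the query text, whereas the patent also attaches a set of top-ranked results to every node.

```python
from collections import deque


def related_queries(graph: dict[str, list[str]], start: str, max_hops: int = 2) -> list[str]:
    """Breadth-first walk returning the queries reachable within max_hops links."""
    seen = {start}
    queue = deque([(start, 0)])
    found = []
    while queue:
        query, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not explore beyond the hop limit
        for neighbour in graph.get(query, []):
            if neighbour not in seen:
                seen.add(neighbour)
                found.append(neighbour)
                queue.append((neighbour, hops + 1))
    return found


# Invented query graph: an edge means "related or alternative query".
graph = {
    "cheap flights": ["budget flights", "low cost airlines"],
    "budget flights": ["cheap flights", "flight deals"],
    "flight deals": ["last minute flights"],
}
print(related_queries(graph, "cheap flights"))
```

The closest alternatives come out first, and the hop limit keeps the walk from drifting to queries that no longer have much to do with the original one.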

This search system can evaluate more than one possible alternative query before selecting the one with the highest confidence measure.
Depending on whether the merged result set reaches a threshold number of results from high-quality sites, it may try to accumulate more results from alternative queries that return high-quality sites.

FAQ: Google Panda

Google Panda: What Penalties?

SEO myths persist, notably because things change over time. Today Google does not care if you duplicate content, by which I mean that it does not penalise your site for it. It will, however, ignore duplicate or low-quality pages, in the same way that Google Penguin ignores low-quality backlinks.

What Is Google Panda For?

Google Panda is an algorithm acting as a content farm filter that penalises pages that do not add value compared to other documents in the SERP.

How to Protect Yourself From Google Panda?

As explained in this article, you need to create content that is relevant and user-focused.

Relevance: Synonymous With Having an Author Section?

One of the biggest problems is that many people think that Google Authorship and Google Agent Rank or Author Rank are the same thing. And they think that if you verify authorship on your site, your site automatically has a lot of author rank.

Having an author description for everything and anything does not improve your ranking.
Some SEO practitioners even prefer to remove the link from the author biography to optimise PageRank.

I cannot tell you which option is better. I do not know which would be the most optimised, and in any case “it depends”.

It depends on whether you are cited as an author on other sites, and on whether you are adding an author link to an Amazon product article in the hope of appearing more relevant.

Now, if you have a site that talks about health, having a mention specifying the author with a link redirecting to the biography is undoubtedly much better than optimising PageRank.
I would say that in most cases, if we are doing it solely for Google’s algorithms, the best approach is to always place an author mention without a link, while displaying a link to the author’s biography somewhere on the site.

Google Coati

Google Coati is the successor to the Panda algorithm, according to an announcement by Google Search’s vice-president at an SMX Next conference (November 2022). Coati is simply a rebranding of Google Panda. There is no additional information on any potential changes in the filtering criteria of the Google Coati algorithm compared to its predecessor Panda. This news could be considered an anecdote rather than relevant information for SEO practitioners and content creators.

Summary: What to Remember About Google Panda

Google Panda is usually presented as a single anti-spam algorithm that shook the web in 2011.

That is not entirely wrong.

It nonetheless seems important to me to mention that Navneet Panda, the person who invented the Panda anti-spam filter, is a human being who regularly releases new patents for detecting good content. Well beyond a simple anti-spam filter, he continues to improve search quality, sometimes building on the same algorithmic ideas.