Google Algorithms

SEO and PageRank: The Complete Guide

PageRank is Google's most famous algorithm. Discover its history, how it is calculated (random surfer model, damping factor, topical PageRank), how it was updated in 2018, and how to optimize it both on-page and off-page.

Definition of SEO and PageRank (PR)

PageRank is Google’s best-known algorithm for ranking pages in its search results. Put simply, a link from one site to another passes on popularity, which helps the linked page gain authority in Google’s eyes. Each link is treated as a vote of confidence. In search engine optimisation (SEO), we use the metaphor of “link juice” and call these links backlinks (hypertext links connecting one web page to another).

This algorithm stems from the following observation:

How do you judge the quality of a page?

The quality of a product, a person, or an entity of any kind is tied to its popularity. The more popular a brand is, for example, the more likely its products are to be of higher quality, and the more likely internet users are to want to find that brand and not another.

That is why, in SEO, creating links pointing to your site is one of the best techniques for appearing on the first page of Google. This is called netlinking or link building.

Obviously this system is not perfect: pages that are already popular attract more links and therefore become even more visible. This is known as the Matthew effect. It creates a loop that makes an increasingly restricted set of content more and more visible.

It has also given rise to the SEO myth that search engine optimisation does not need continuous effort.

The History of PageRank

Library science is at the origin of information retrieval (IR) and how Google works. In library science, the best scientific papers are those most frequently cited.

PageRank is fundamentally based on this same idea: ranking the best documents according to the number of times they are cited.

The PR as it was originally conceived was primarily a way of sweeping the Web in order to begin better organising and ordering relevant information.

Larry Page and Sergey Brin were both students working on a search engine while completing their doctoral degrees, and they drew on the concept of betweenness centrality discovered in the 1930s.

In graph theory and network theory, betweenness centrality is equal to the number of times a vertex lies on the shortest path between two other arbitrary nodes in the graph.

But the foundations of PR were primarily built on the work of individuals closely linked to the creation of the WWW, such as, to name just one, Massimo Marchiori (1997):

The Quest for Correct Information on the Web: Hyper Search Engines

https://www.w3.org/People/Massimo/papers/WWW6/

Overview of PageRank Papers and Patents

One of the first papers on PR: https://www.seobythesea.com/improved-text-searching-in-hypertext-systems.pdf

The idea formulated by Larry Page and Sergey Brin:
http://infolab.stanford.edu/~backrub/google.html

The 1998 PageRank patent by Larry Page:
Original PageRank 1998: Method for node ranking in a linked database (patent number 6,285,999, inventor Larry Page)

(PageRank test by Haveliwala on 18 October 1999: http://ilpubs.stanford.edu:8090/386/1/1999-31.pdf)

Grant of this patent on 4 September 2001: Method for node ranking in a linked database (patent number 6,285,999)

PageRank 28 September 2004: Method for scoring documents in a linked database — Patent number: 6 799 176

Taher H. Haveliwala publishes an article at Stanford entitled Topic-Sensitive PageRank: http://www-cs-students.stanford.edu/~taherh/papers/topic-sensitive-pagerank.pdf

PageRank 6 June 2006: Method for node ranking in a linked database — Patent number: 7 058 628

PageRank 11 September 2007: Scoring documents in a linked database — Patent number: 7 269 587

PageRank 20 October 2015 (granted in 2018): Producing a ranking for pages using distances in a web-link graph — Patent number: 9 165 040

In 2016, Google retired Toolbar PageRank (TBPR), the score out of 10 that the Google Toolbar displayed for websites.

We strongly recommend the thesis by Trystan Upstill (2005, Document ranking using web evidence), which contains numerous elements on PageRank, as well as other things Google does to rank web pages.

Today, PageRank is very different from the original method, and so is the way it is implemented in large-scale computer systems.

In 2020, an article showed how to apply PageRank to 100 billion web pages: https://storage.googleapis.com/pub-tools-public-publication-data/pdf/4aada2aea3fe924feb2904007f1e0f0a085e3b51.pdf – https://research.google/pubs/pub48942/

Introduction to PageRank

Before getting started, it is important to understand that not all pages on the web have a PageRank.

Simply put, if Google’s crawlers have never visited your page (crawling), Google cannot assign it a PageRank. But that is not all: Google may discover your page without including it in its index (indexation). In that case, your page has no PageRank either. It is therefore essential to verify that any page linking to you is known to Google and indexed.

Next, it is important to understand that Google does not see websites but web pages. This notion is one of the most important to grasp in natural search engine optimisation. Consequently, domain-level metrics such as domain authority do not correspond to anything Google calculates. These metrics, the best known being Domain Rating (DR) from Ahrefs and Domain Authority (DA) from Moz, are only there to give you an approximate overview.

The PageRank Calculation

The value of this fluid, this popularity, equals 1. Initially, all pages have 1/N.

A site therefore has a base amount of available PR (PageRank) proportional to its number of pages, called the “PageRank reservoir”.
The PageRank initialisation formula

Each page passes its popularity on through its links, divided equally by the number of links on the page and multiplied by c (c × PR). The constant c equals 0.85 and is called the damping factor; the remaining 1 - c = 0.15 is the teleportation probability.

In the following diagram, V1 transmits 85% of its PageRank to U. As V3 has 2 links, it transmits 85% of its PageRank divided by 2.

Diagram of the PageRank calculation
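
As a hedged illustration of the diagram described above (the figure itself is not reproduced here), the share passed along each link can be written as follows. The symbol L(V), the number of outbound links on page V, is notation introduced for this sketch, and V1 is assumed to have a single outbound link:

```latex
\[
\text{share}(V \to U) = c \cdot \frac{PR(V)}{L(V)}, \qquad c = 0.85
\]
\[
\text{share}(V_1 \to U) = 0.85 \cdot PR(V_1), \qquad
\text{share}(V_3 \to U) = 0.85 \cdot \frac{PR(V_3)}{2}
\]
```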

The calculation then continues, as it is an iterative calculation:

The iterative PageRank calculation

At the end of this calculation round, the PageRank calculation is not yet complete. The 15% of PageRank that was not passed on through links is distributed equally among all pages.

Through these iterations, a page inherits the PageRank of its neighbours and so on.
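
The iterations described above can be sketched in a few lines of Python. This is a minimal illustration of the classic textbook calculation, not Google's production implementation. The function name `pagerank`, the example graph, and the way pages without outbound links are handled are all assumptions made for this sketch.

```python
def pagerank(links, c=0.85, tol=1e-10, max_iter=200):
    """Iteratively compute PageRank for a dict {page: [pages it links to]}."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # initially, every page has 1/N
    for _ in range(max_iter):
        # The (1 - c) share is distributed equally among all pages.
        new_rank = {p: (1.0 - c) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # c * PR, divided by the number of outbound links.
                share = c * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no links spreads its rank over all pages.
                for p in pages:
                    new_rank[p] += c * rank[page] / n
        delta = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if delta < tol:  # stop once the scores no longer change
            break
    return rank


# Hypothetical four-page graph: V1, V2 and V3 link to U, and U links to V3.
graph = {"V1": ["U"], "V2": ["U"], "V3": ["U", "V1"], "U": ["V3"]}
scores = pagerank(graph)
for page, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.4f}")
```

In this toy graph, U ends up with the highest score because it inherits PageRank from three neighbours, and the scores always sum to 1.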

PageRank is a probability (due to the random surfer, c = 0.85):

PageRank, a click probability
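
In its usual textbook form (an assumption here, since the original figure is not reproduced), this probability can be written as follows. N is the total number of pages; B_u, the set of pages linking to u, and L(v), the number of outbound links on page v, are notation introduced for this sketch:

```latex
\[
PR(u) = \frac{1 - c}{N} + c \sum_{v \in B_u} \frac{PR(v)}{L(v)}, \qquad c = 0.85
\]
\[
\sum_{u} PR(u) = 1
\]
```

The second line holds when every page has at least one outbound link, and it is what allows PageRank to be read as the probability that the surfer is on page u.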

Note: The old PageRank and the new Google PageRank (they kept the same name) both run in O(N log N), but the replacement has a much smaller constant on the log N factor because it removes the need to iterate until the algorithm converges. This became unavoidable in 2006, as the Web had grown far beyond the roughly 1 to 10 million pages of its early days.

This formula concretely indicates that the fluid spreads proportionally based on the number of links on the page (we call these outbound links).

Thus, this representative diagram of classic PageRank may help you:

Standard PageRank distribution

However, this diagram is a simplification, as it does not take into account, for example, the reasonable surfer.

Classic PageRank explanations by Stanford: http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf

The Random Surfer

The constant c (= 0.85) comes from the random surfer principle. The idea was to mimic human behaviour: when a visitor arrives on a page, they either click on one of its links, with probability c, or leave the page and teleport to another page, with probability 1 - c. However, as the name suggests, the choice of link is random: every link on the page is equally likely to be clicked.
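
The random surfer can also be simulated directly, which shows why PageRank is often read as a visit probability. The following is a minimal sketch under stated assumptions: the function name `random_surfer`, the three-page graph, the number of steps, and the fixed random seed are all chosen for illustration.

```python
import random


def random_surfer(links, c=0.85, steps=200_000, seed=0):
    """Estimate visit frequencies by simulating a surfer on {page: [targets]}."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    pages = sorted(links)
    visits = {p: 0 for p in pages}
    current = rng.choice(pages)
    for _ in range(steps):
        visits[current] += 1
        targets = links[current]
        if targets and rng.random() < c:
            current = rng.choice(targets)  # follow a link chosen uniformly at random
        else:
            current = rng.choice(pages)  # teleport to a random page
    return {p: visits[p] / steps for p in pages}


# Hypothetical graph: every page appears as a key.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
frequencies = random_surfer(graph)
for page, freq in frequencies.items():
    print(f"{page}: {freq:.3f}")
```

Note that `rng.choice(targets)` gives every link on the page the same chance of being clicked, wherever it sits on the page.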

This problem therefore needed to be solved in order to obtain more reliable data.

The Reasonable Surfer

The reasonable surfer irons out the problems of the random surfer. The probability that a visitor clicks on a link in the menu, clicks on a link in the content, or leaves the page is not random at all and can be predicted more precisely.

Thus, the reasonable surfer assigns a coefficient to each part of a page. First, it breaks the page down into several parts, separating the menu, the footer, and the content. It then assigns each link a different click probability depending on whether it sits in the content or in the menu.

Thus, the first link in the content might have a click probability of 50%, whereas a link in the footer might be around 2%.

The key takeaway is that the transmission of popularity will be much greater for a link present in the content (especially the first link), while links in the footer or menu will carry far less.
To summarise, here is a simplified diagram:

Illustration of the reasonable surfer for PageRank
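
The split described above can be sketched as a weighted distribution. This is only an illustration: the function name `distribute_rank`, the URLs, and the click weights are hypothetical (the 50 and 2 simply echo the percentages quoted earlier), and the real coefficients used by Google are not public.

```python
def distribute_rank(page_rank, links, c=0.85):
    """Split c * page_rank across links in proportion to their click weight."""
    total = sum(weight for _, weight in links)
    return {url: c * page_rank * weight / total for url, weight in links}


# Hypothetical click weights for the links found on one page (URLs must be unique).
links_on_page = [
    ("/first-content-link", 50),
    ("/second-content-link", 30),
    ("/menu-link", 18),
    ("/footer-link", 2),
]
passed = distribute_rank(1.0, links_on_page)
for url, value in passed.items():
    print(f"{url}: {value:.3f}")
```

With these weights, the first content link receives 25 times more PageRank than the footer link, whereas the random surfer would have given each of the four links the same share.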

The distribution of popularity among the links on a page is therefore not uniform: it depends on the position of each link. But that is not all.

The Topical PageRank

The reasonable surfer is not the only reason why PageRank is not shared out equally: the topic of the linking page also influences how much authority is transferred.

We have seen previously the “PageRank reservoir”. A site has a base available PR proportional to its number of pages.

Let us now imagine 2 topics.

The first would be astrology, the second chemistry. Let us imagine that the chemistry topic is much more significant than astrology in the world of the Web, so the chemistry topic has a much larger PageRank reservoir.

If an internet user types the query “mercury” into Google, thanks to the higher PR reservoir, the sites displayed will answer that mercury is a shiny silver metal (chemistry topic). Yet the internet user might also be searching for information about the planet Mercury. The problem is that, assuming the astrology topic has a lower PageRank reservoir, the SERP does not display results about the planet Mercury. This is called semantic masking.

Historically, clever SEO specialists used large topic reservoirs to boost the authority of their pages without much context in the link, simply because it was far more powerful.

The topical PageRank put an end to this manipulation. Today, a so-called non-topical link has little or no benefit for increasing popularity. This means that if a car-related site links to a gardening site, the link will bring the gardening site almost no SEO benefit, or indeed none at all.

The topical PageRank calculation is:

The topical PageRank calculation
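
Since the original figure is not reproduced here, the following is a hedged sketch of the general idea behind topic-sensitive PageRank: the teleportation step is restricted to pages known to belong to a topic, so one score is obtained per topic. The function name `topic_pagerank`, the page names, and the handling of pages without links are assumptions made for this example, and it is not Google's implementation.

```python
def topic_pagerank(links, topic_pages, c=0.85, iterations=100):
    """PageRank whose teleportation only lands on the pages of one topic."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    teleport = {
        p: (1.0 / len(topic_pages) if p in topic_pages else 0.0) for p in pages
    }
    rank = dict(teleport)  # start on the topic pages
    for _ in range(iterations):
        new_rank = {p: (1.0 - c) * teleport[p] for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                for target in targets:
                    new_rank[target] += c * rank[page] / len(targets)
            else:
                # Assumption: a page with no links returns its rank to the topic pages.
                for p in pages:
                    new_rank[p] += c * rank[page] * teleport[p]
        rank = new_rank
    return rank


# Hypothetical mini-web with two separate topics and two "mercury" pages.
graph = {
    "chemistry-hub": ["mercury-metal", "chemistry-lab"],
    "chemistry-lab": ["chemistry-hub"],
    "mercury-metal": ["chemistry-hub"],
    "astrology-hub": ["mercury-planet"],
    "mercury-planet": ["astrology-hub"],
}
chemistry = topic_pagerank(graph, {"chemistry-hub", "chemistry-lab"})
astrology = topic_pagerank(graph, {"astrology-hub"})
print("chemistry view:", round(chemistry["mercury-metal"], 3), chemistry["mercury-planet"])
print("astrology view:", astrology["mercury-metal"], round(astrology["mercury-planet"], 3))
```

In the chemistry view, the page about the planet receives nothing, and the reverse holds in the astrology view. A link only transfers popularity within a topic when it comes from a page that the topic's fluid can actually reach.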

You can find the topical PageRank in the Babbar tool, called Semantic Value. This metric calculates the PageRank of a page based on the thematic proximity of the links obtained (backlinks). We will discuss later how you can calculate your PageRank for Google optimisation.

To summarise, the authority of a link does not spread as simply as one might imagine.

The 2018 PageRank Update

Granted on 24 April 2018, the patent is entitled: Producing a ranking for pages using distances in a web-link graph.

Created by Nissan Hajaj, principal engineer at Google, this update brings a trust (E-A-T) perspective to PageRank.

2018 PageRank on trust transmission from authority seed sites

Behind this PageRank patent update, we see how it might counter link spam by reinforcing trust in a link graph as follows:

A possible variant of PageRank that would reduce the effect of these techniques consists of selecting a few “trusted” pages (also called reference pages or seed pages) and discovering other pages likely to be good by following the links of trusted pages. For example, the technique can use a set of high-quality seed pages (s1, s2, . . . , sn), and for each seed page i = 1, 2, . . . , n, the system can iteratively compute PageRank scores for the set of web pages P.

The problem highlighted by the inventor is that this would require manual work to define the known and high-quality sites.

Consequently, what is needed is a method and apparatus for producing a ranking of pages on the Web using a large number of diverse starting pages without the problems of the techniques described above.

An embodiment of the present invention proposes a system that ranks pages on the Web based on distances between pages, where pages are interconnected with links to form a link graph. Specifically, a set of high-quality starting pages is chosen as references for ranking pages in the link graph. The shortest distances between the set of starting pages and each given page in the link graph are calculated. Each of the shortest distances is obtained by summing the lengths of a set of links that follows the shortest path from a source page to a given page, the length of a given link being assigned to the link based on the properties of the link and the properties of the page attached to the link. The calculated shortest distances are then used to determine the ranking scores of the associated pages.

The PageRank patent discusses the importance of diversity of topics covered by seed sites and the value of a large number of seed sites.

In response to a query, the search engine uses the ranking information to identify highly ranked documents that satisfy the query. The search engine then returns a response via the web browser, the response containing matching pages as well as ranking information and references to the identified documents.

Image from the 2018 PageRank update patent:

Illustration of the Google PageRank patent on seed sites

They speak of the “shortest distance” between the seed set and the web page. It is therefore not an accumulation of PageRank from links, but rather a distance in a link graph.

In the mathematical field of graph theory, the distance between two vertices (nodes) in a graph is the number of edges (links) in a shortest path connecting them.

This is known as the geodesic distance or shortest-path distance.

Thus Google could calculate the geodesic distance between seed sites and other websites.
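
To make the idea concrete, here is a minimal sketch of a shortest-distance calculation from a set of seed pages, using Dijkstra's algorithm with several starting points. The patent text quoted above says link lengths depend on properties of the link and of the page, but it gives no values here, so the lengths, page names, and the function name `seed_distances` are all hypothetical.

```python
import heapq


def seed_distances(links, seeds):
    """Shortest weighted distance from any seed page to every reachable page."""
    dist = {seed: 0.0 for seed in seeds}
    heap = [(0.0, seed) for seed in seeds]
    heapq.heapify(heap)
    while heap:
        d, page = heapq.heappop(heap)
        if d > dist.get(page, float("inf")):
            continue  # stale queue entry, a shorter path was already found
        for target, length in links.get(page, []):
            candidate = d + length
            if candidate < dist.get(target, float("inf")):
                dist[target] = candidate
                heapq.heappush(heap, (candidate, target))
    return dist


# Hypothetical link graph: {page: [(linked page, link length), ...]}.
graph = {
    "seed-news": [("blog-a", 1.0), ("blog-b", 2.5)],
    "seed-university": [("blog-b", 1.0)],
    "blog-a": [("shop", 2.0)],
    "blog-b": [("shop", 0.5)],
}
distances = seed_distances(graph, ["seed-news", "seed-university"])
for page, distance in sorted(distances.items(), key=lambda item: item[1]):
    print(f"{page}: {distance}")
```

Here the shop page is reached at distance 1.5 through blog-b, even though it has no direct link from a seed. Pages that no seed can reach do not appear in the result at all.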

As an SEO professional, your work might consist of identifying these seed sites and building a link profile that is close to them, because this PageRank also works as a kind of fluid.

Thus, if you cannot have a direct link to authoritative sites, you should have the shortest possible path to seed sites.

Indeed, if A trusts B and B trusts C, then A slightly trusts C.

This PageRank algorithm would give weight to such chains of trust. We are no longer really talking about popularity, as in classic PageRank, but about trust-based algorithms linked to Google’s E-A-T concepts, and specifically here about the distance between nodes.

For a competitive topic, this is something you should probably focus on rather than acquiring thousands of backlinks with little value and no real trust value (i.e. a long distance from seed sites). This is even more true for YMYL (Your Money Your Life) sites, such as medical websites.

Netlinking is not dead, far from it, but for certain topics or competitive niches, this trust-passing PageRank is essential to understand.
Note: To easily find Google seed sites: https://kalicube.pro/trusted-sources. Otherwise, large and established newspapers like the New York Times are also seed sites (the NYT is specifically mentioned in the patent).

Algorithms Related to the PageRank Concept

Other algorithms are linked to the idea behind PageRank, so it seems important to mention and briefly define them.

Google Penguin and PageRank

Google Penguin is an influential anti-spam filter that affects your PageRank if you attempt to manipulate the PR algorithm.

Thus, Google Penguin looks, among other things, for signals such as:

  • Abusive use of exact-match anchor text
  • Identical DNS records across domain names
  • Identical <title> tags
  • Identical hostnames
  • The same domain name
  • Too many outbound links
  • Identical IP address bytes
  • Spammy domain name extensions
  • The same contact address
  • Similar Whois records
  • PBNs (link chains)

EAT (Expertise, Authoritativeness, Trustworthiness)

Google’s EAT concept is a set of algorithms influencing the relevance of a website. One could theoretically mention PageRank, seed PageRank, Author Rank, Knowledge Based Trust, Google Penguin, and many others.

It would be worthwhile to take a look at E-A-T, as it would give you more leads for improving Google’s perception of your website’s authority, expertise, and trust.

KBT: Knowledge Based Trust

Knowledge Based Trust (KBT) is a trust measure linked to semantic algorithms. Therefore, focus on high-quality articles and verify the factual accuracy of your texts.

The role of Knowledge Based Trust in addressing PageRank gaps

More information here: https://www.vldb.org/pvldb/vol8/p938-dong.pdf

How to Optimise Your PageRank for Link Building (Off-Page)?

Now that you have the knowledge, let’s move on to practice.

All of this can be difficult to grasp, so here are 2 main techniques for building links.

The artificial way: using link platforms.
The “natural” way: contacting sites and explaining why they should place a link to you.

The reality is that, for certain topics, such as plumbing, no site links to plumbers. Why would they? The only thing to do is to register on directories, and then, when you realise you are still behind thousands of competitors, probably buy links.

If you want to know more, consult the guide on netlinking (various link creation techniques).

Overall, here is a summary list to help you optimise your links:

  • The site sending you a backlink must match your topic → If it is a site like the New York Times that covers a multitude of topics, it is still relevant to obtain a link from that type of site.
  • Semantic consistency → If Google does not perfectly understand the meaning of your text, it will not be able to understand the meaning of the link. → If possible, use the same keywords between your page and the page sending you link juice.
  • The site must not be spammy (Citation Flow for example) → Majestic
  • The site must be authoritative (Domain Authority), and the page or directory carrying the link must be authoritative too (URL Rating, “UR”). If the domain is strong but the backlink is located on a low-quality “partners” page hidden deep within the site, it will be worth very little.
  • Verify that the link is set to dofollow: our complete article on link attributes
  • Nothing stops you from having multiple backlinks from the same site (a single referring domain), but each additional link will be progressively less powerful.
  • If possible, the link to your site should be placed in the middle or beginning of the page rather than at the end (reasonable surfer).
  • Think about an anchor text strategy. Exact anchor or semantic anchor. https://createur2site.fr/en/seo/off-page/netlinking/ancres-liens/
  • Click-through rate

Finally, think of PageRank as a fluid and not just a score. It is truly not intuitive, but understanding how PR works is essential for understanding “cycles”.

Intuitively, you can try to build links from sites that truly discuss the same topics as you, so the PR cycles in a loop of sites that could be linked to one another.

And the non-intuitive thing is to create links pointing to your competitors so that the PR cycles back to you. For example, A links to B and B links to you. Then you can link to A.

Training Googlebot in this way should be part of your work: play with the reasonable surfer and place outbound links whose PageRank will come back to you.

Be careful not to do anything too borderline and not to overuse any single technique.

Note: when a page links to a URL and then links to the same URL a second time, Google does not take the second link into account. It is therefore possible to place an SEO-optimised link in running text and then repeat the same link as a UX-optimised “call to action” without harming the anchor optimisation or the link position.
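
To audit a page against this rule, a small script can list which anchor text comes first for each URL and which links are repeats. This is a minimal sketch using Python's standard `html.parser` module. The class name `FirstLinkCollector` and the sample HTML are hypothetical, and the script only reports duplicates: it does not prove how Google treats them.

```python
from html.parser import HTMLParser


class FirstLinkCollector(HTMLParser):
    """Record the first anchor text seen for each href and note repeated hrefs."""

    def __init__(self):
        super().__init__()
        self.first_anchor = {}  # href -> anchor text of its first occurrence
        self.duplicates = []    # hrefs that appear again later in the page
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            if self._href in self.first_anchor:
                self.duplicates.append(self._href)
            else:
                self.first_anchor[self._href] = "".join(self._text).strip()
            self._href = None


html = """
<p>Read our <a href="/guide">PageRank guide</a> for details.</p>
<p><a href="/guide">Click here</a> or <a href="/contact">contact us</a>.</p>
"""
collector = FirstLinkCollector()
collector.feed(html)
print(collector.first_anchor)
print(collector.duplicates)
```

In this sample, the anchor kept for `/guide` is the descriptive “PageRank guide”, while the later “Click here” call to action is reported as a duplicate.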

PageRank SEO Tools

Here are some tools that can help you optimise your PageRank and “calculate” it.

Babbar

Babbar is a relatively widely used SEO tool known for its highly innovative metrics on well-considered link building and PageRank in general, but it can also do other things.

Babbar's SEO metrics for calculating PageRank

Here are Babbar’s 4 PageRank-related metrics:

Host Value (HV): The overall power of the domain, sometimes called Domain Rating (DR), is referred to as Host Value in Babbar. It allows you to quickly check whether the website as a whole is authoritative.

Host Trust (HT): Host Trust, often called CF for citation flow (Majestic), is a metric that can give a certain idea of the quality of a site’s content. It measures the level of trust of the page.

Topical PageRank = Semantic Value (SV): This allows you to evaluate the importance of a link. This metric does not simply take into account the PageRank of the pages you target, but also the semantic/thematic connection with your own page. This means that Semantic Value calculates the PageRank of the page based on its own links pointing to it, weighted by their thematic relevance. (Babbar is probably the only operator able to provide this to you.)

BAS — Babbar Authority Score: An overall score that aggregates several pieces of information into a single value. The BAS allows you to compare or know your “real” PageRank quickly without needing to concern yourself with each metric separately.

Example:
Imagine a site with a Domain Rating of 70 — the Domain Rating calculates authority, i.e. the PageRank of a site (as with Ahrefs). Using the Babbar SEO tool, imagine the Semantic Value shows a score of 20. Well, in that case, this would mean that the site is not really authoritative. Although at first glance one might assume 70/100 is very good, in practice this is not the case since most links are not topically relevant. As the semantic proximity of the links is low, its popularity score is very different. In this imaginary scenario, the site owner has probably tried to manipulate their PageRank. If you are buying expired domains, we recommend using the BAS (Babbar Authority Score) metric to judge a site’s popularity rather than Ahrefs’ DR, for example.

Predicting the Power of a Backlink (Induced Force)

By calculating the induced force of a link using Babbar’s algorithms, it is possible to determine which link will be most profitable for you, based on its thematic proximity to your specific page. If you want to learn more about induced force, feel free to browse our dedicated article.

Finding Backlink Ideas

If you want to find web page ideas where you could place a link, the Babbar tool offers Spot Finder to obtain all potential links while ranking them by the highest thematic correlation (topical PR) relative to your web page.

 

Optimising PageRank Within a Site (On-Page)

Here are the main SEO tools that can help you calculate, predict PageRank, and generally optimise your internal linking.

PageRank Sculpting

PageRank Sculpting means paying attention to every small detail of the juice transmitted within your site. Think about the number of outbound links — the more there are, the more PageRank is spread — and think about topical PageRank.

Link Obfuscation

Link obfuscation is a PageRank sculpting method that consists of hiding links from Google in order to help Google’s robots better understand your website. Indeed, Google appears to only take href links into account, so by using a button you can better control the distribution of link juice.

Discover how to obfuscate a link.

Link obfuscation - PageRank Sculpting

Babbar

Internal Page Value: This is the internal popularity of the page within the host containing it. It is the equivalent of internal PageRank that you would calculate with, for example, Screaming Frog and Gephi, with the addition of a reasonable surfer model.

If you would like a tutorial on Gephi, PageRank, and SEO: https://www.briggsby.com/how-visualize-open-site-explorer-data-in-gephi

Oncrawl

Inrank is the PageRank metric of the Oncrawl SEO tool — a metric based on PageRank where the relative popularity of a page is ranked on a scale of 0 to 10 (like Google’s PageRank).

Several graphical representations can help you see more clearly, such as:

Visualisation of internal PageRank with the Oncrawl SEO tool
PageRank distribution based on link depth (Oncrawl)

This allows you to predict page rankings, control link juice, optimise crawlability, and especially in the PR context, optimise the most strategic pages.

What matters to us is how easy it is to find the content. So, especially if your home page is generally the most important page of your website and it takes several clicks to reach one of these stores, it is much harder for us to understand that these stores are actually quite important.

On the other hand, if it only takes one click from the home page to one of these stores, this tells us they are probably quite relevant and we should probably give them a little weight in the search results too.
It is therefore more a question of how many links it takes to reach that content.

– John Mueller, 1 June 2018 at the Google Webmaster Hangout
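
The number of clicks needed to reach a page from the home page can be measured with a breadth-first search over your internal links. The sketch below assumes you already have the internal link graph, for example exported from a crawler. The function name `click_depths` and the site structure are hypothetical.

```python
from collections import deque


def click_depths(links, home):
    """Return the minimum number of clicks from the home page to each page."""
    depth = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depth:  # first visit is the shortest path in a BFS
                depth[target] = depth[page] + 1
                queue.append(target)
    return depth


# Hypothetical internal link graph: {page: [pages it links to]}.
site = {
    "/": ["/stores", "/blog"],
    "/stores": ["/stores/paris"],
    "/blog": ["/blog/post-1"],
    "/blog/post-1": ["/stores/lyon"],
}
depths = click_depths(site, "/")
for page, clicks in sorted(depths.items(), key=lambda item: item[1]):
    print(f"{clicks} click(s): {page}")
```

In this example the Lyon store sits three clicks from the home page. Linking to it from `/stores` would bring it to two clicks, the same depth as the Paris store.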

ScreamingFrog

To visualise the transmission of PageRank on your site, you can launch a crawl from ScreamingFrog:

Crawling a website with ScreamingFrog for PageRank

Then click on “Visualisations” -> “Force-directed crawl diagram”.

Force-directed crawl diagram

However, this does not give you the PageRank distribution.

Therefore, browse the properties and choose “link score”.

Force diagram, PageRank distribution with the ScreamingFrog SEO tool

You can easily see the distribution of the most linked pages based on node size (i.e. the larger a node, the more PageRank it has).

Note: you can see here that my contact page is heavily linked (for example). I need to do some PR Sculpting on that.

You can also view the internal PageRank score percentage without the visualisation mode, in which case the PageRank shown will be on a user-friendly logarithmic scale of 0 to 100 points. The higher the value, the more linked the page is.

Internal PageRank on a logarithmic scale of 0 to 100 (SF)
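
To see what a logarithmic 0 to 100 scale does to raw scores, here is a minimal sketch. The exact formula used by the tool is not given in this article, so the mapping below, the function name `log_scale`, and the raw values are assumptions chosen purely for illustration.

```python
import math


def log_scale(scores, top=100):
    """Map positive raw scores to 0..top on a logarithmic scale."""
    lowest = min(scores.values())
    highest = max(scores.values())
    if highest == lowest:
        return {page: float(top) for page in scores}
    span = math.log(highest) - math.log(lowest)
    return {
        page: top * (math.log(score) - math.log(lowest)) / span
        for page, score in scores.items()
    }


# Hypothetical raw internal PageRank values (all must be greater than 0).
raw = {"/": 0.40, "/contact": 0.20, "/blog": 0.04, "/old-page": 0.004}
scaled = log_scale(raw)
for page, value in scaled.items():
    print(f"{page}: {value:.1f}")
```

With these values, `/blog` has ten times less raw PageRank than the home page yet still scores about 50 out of 100, which is why a logarithmic scale is easier to read than the raw values.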

SEOClarity

SEOClarity can help you with link graphs through its Link Graph tool. It can be used to analyse PageRank for both external and internal links.

By clicking on the nodes you will get the clusters for each page (this can also be done by clicking on a node in ScreamingFrog).

Here are a few other things:

Conducting Internal PageRank Tests

When you have a good understanding of SEO, you will be able to think for yourself, conduct your own tests, and predict what will work or not. If the resources are available, it would be more judicious to run your own internal PageRank optimisation tests in a test environment.

First, duplicate the website in a staging environment.
Second, make sure the copy is invisible to Google (prevent its indexation).
Then modify the structure according to your assumptions.
Finally, check in Oncrawl whether the PageRank distribution in this new structure is better.

If it looks good, optimise the site in the real world and measure the results provided by Google (i.e. whether your traffic has improved).

Here is an excellent e-book on this topic: https://fr.oncrawl.com/ebook/comment-votre-maillage-interne-influence-linrank/

FAQ:

What Is PageRank?

PageRank is Google’s algorithm that calculates the popularity of a web page based on the links that other pages and sites point to it. Google uses this popularity as a signal for ranking its search results in order of relevance.

What Is Semantic PageRank?

Semantic PageRank is the idea of a PageRank with a “smart surfer”. However, the smart surfer was far too complicated to implement algorithmically, and also less effective than the idea of topical PR. Therefore it was never implemented. The topical PR, on the other hand, does effectively convey semantic information, but under the name “Topic-sensitive”, so we call it topical PageRank.

What Is the Smart Surfer?

The smart surfer is an abandoned calculation. Its main goal was to achieve better performance by calculating the probability that a surfer interested in a given query follows the link from page A to page B. In other words, it was a semantic continuity index.

What Is Domain Rating?

The Domain Rating (DR) metric is a logarithmic scale from 0 to 100 by Ahrefs to measure the overall popularity of a website.

Note: The term “logarithmic scale” means that the gap between DR 75 and DR 76 is much greater than that between DR 18 and DR 19. In other words, the higher your DR, the harder it will be to grow it.

Google ranks pages, not websites. Focus on producing high-quality content and acquiring high-quality backlinks directly to that content. Your DR and search traffic will naturally increase as a by-product of this.

What Is Domain Authority?

Domain Authority (DA) is Moz’s counterpart to Ahrefs’ Domain Rating. Although calculated differently, it serves the same purpose.

In general, these metrics let you estimate the popularity of a site quickly and approximately. This is useful because even a page without backlinks of its own can pass on PR that it receives from the other pages of its site through internal links. However, it is generally essential to assess more thoroughly where the links come from: although these tools try to filter out spam, some black-hat SEOs have found ways to inflate the domain authority reported by these tools and exploit this loophole for profit (selling links at a higher price because of a high domain authority).

How to Know Your PageRank?

To estimate your PageRank from the links pointing to your page, you can use Ahrefs, Moz, or Semrush. However, if you want to take all factors into account, namely the topical PageRank and the reasonable surfer model, you can enter your URL into the Babbar tool.

What Are the PageRank Calculation Tools?

The best-known tools for calculating your PageRank are: Ahrefs, Semrush, Moz, Babbar, Oncrawl, ScreamingFrog.

What Is the Sandbox?

Many SEO professionals and webmasters have noticed that websites with a recent domain name have a much harder time ranking. This is called the sandbox. However, Google disagrees that there is a specific algorithm for this.

Is PageRank Dead?

PageRank is not dead. Google repeats every year that PR is still used in its algorithms to rank pages.
Two things to keep in mind. The first is the statement of 16 July 2019, when a Google research engineer said on a Hacker News thread that Google had stopped using the Stanford version of PageRank in 2006; PR has indeed changed significantly since then. The second is that many algorithms similar to PR are released every year, but they complement PageRank rather than replace it, even if they are sometimes also called PR (because they are based on the same formulas).