A massive leak of Google documents revealed.
On 28 May 2024, SEO experts Rand Fishkin and Mike King unveiled more than 2,500 confidential Google documents, accompanied by 14,000 technical attributes.
This revelation began when Erfan Azimi shared Google API documents with Rand Fishkin (SparkToro), who then partnered with Michael King (iPullRank).
The files come from API documentation named “yoshi-code-bot/elixir-google-api”.
You can find all the files here.
I have dissected these documents and recorded here the most interesting things I found, revealing the secrets of the Mountain View giant.
What you probably weren’t doing before, and what you need to do
- Focus on entities
- Have authors
- Have a site that focuses on a topic
- You can build thematic links between multiple topics, but start by becoming an expert on one core topic
- Place the most important content at the beginning of the article
- Google favours content that requires effort (images, videos, complexity, etc.)
- Optimise for Navboost: content that users enjoy, hire a community manager, relay your articles, etc.
- Do not forget to please Google (so do both)
- Bold your links and words
- For backlinks, do PR outreach and understand the seed PageRank
- All local factors are covered in this article; local is the one area we tend not to optimise, even though we already knew it mattered
Simplified formula for Google’s ranking
In summary, Google ranks you with:
User interaction scores
- UgcScore: user-generated content engagement
- TitleMatchScore: title/query relevance
- ChromeInTotal: total Chrome interactions
- SiteImpressions: total site impressions
- TopicImpressions: topical page impressions
- SiteClicks: site click-through rate
- TopicClicks: topical click-through rate
Content quality scores
- ImageQualityClickSignals: quality via image clicks
- VideoScore: video quality/engagement
- ShoppingScore: shopping content score
- PageEmbedding: page semantics
- SiteEmbedding: site semantics
- SiteRadius: semantic gap
- SiteFocus: main theme
- TextConfidence: text relevance/quality
- EffortScore: content creation effort
Link scores
- TrustedAnchors: backlink quality
- SiteLinkIn: inbound link value
- PageRank: authority score (0-2, ToolBar, NR)
Relevance boost
- TopicEmbedding: temporal relevance
- QnA: base quality
- STS: text/entity understanding
Quality boost
- SAS: link trust/authority
- EFTS: page effort (text/media/comments)
- FS: content freshness
Specific adjustments
- CDS: Chrome data score
- SDS: SERP adjustments
- EQSS: experimental variables
How Google works through the Google Leaks
Crawling:
- Trawler – Web crawling system. Manages the queue, crawl rates and page change frequency.
Indexing:
- Alexandria – Main indexing system.
- SegIndexer – System that ranks documents by levels in the index.
- TeraGoogle – Secondary indexing system for long-term disk-stored documents.
Rendering:
- HtmlrenderWebkitHeadless – Rendering system for JavaScript pages. The name refers to WebKit rather than Chromium, yet the documentation mentions Chromium, which suggests Google used WebKit before switching to Headless Chrome.
Processing:
- LinkExtractor – Extracts links from pages.
- WebMirror – Manages canonicalisation and deduplication.
Ranking:
- Mustang – Main scoring, ranking and serving system.
- Ascorer – Main ranking algorithm before adjustments.
- NavBoost – Re-ranking system based on user click logs.
- FreshnessTwiddler – Re-ranking system based on document freshness.
- WebChooserScorer – Defines the characteristics used for snippet scoring.
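The split between a base score (Ascorer) and re-ranking systems (NavBoost, FreshnessTwiddler) can be pictured as a small pipeline. The sketch below is only an illustration of that idea: the `Result` fields, the multipliers and the two twiddler functions are hypothetical and do not come from the leaked documentation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Result:
    url: str
    base_score: float       # stand-in for the score produced before adjustments
    good_click_rate: float  # hypothetical click signal in [0, 1]
    age_days: int           # hypothetical document age


# A twiddler here is just a function returning a multiplier for one result.
Twiddler = Callable[[Result], float]


def navboost_like(result: Result) -> float:
    # Hypothetical rule: boost results that users click and stay on.
    return 1.0 + 0.5 * result.good_click_rate


def freshness_like(result: Result) -> float:
    # Hypothetical rule: boost documents younger than 30 days.
    return 1.2 if result.age_days <= 30 else 1.0


def rerank(results: List[Result], twiddlers: List[Twiddler]) -> List[Result]:
    def final_score(result: Result) -> float:
        score = result.base_score
        for twiddler in twiddlers:
            score *= twiddler(result)
        return score

    return sorted(results, key=final_score, reverse=True)


results = [
    Result("a.example/old-guide", 10.0, 0.1, 400),
    Result("b.example/new-guide", 9.0, 0.6, 10),
]
reranked = rerank(results, [navboost_like, freshness_like])
print([r.url for r in reranked])
```

With no twiddlers the older page keeps first place on its base score; with both adjustments the newer, better-clicked page overtakes it.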
Serving:
- Google Web Server – Interface with the Google frontend. Receives data to display.
- SuperRoot – Google Search’s brain that communicates with servers and manages post-processing for re-ranking and presentation.
- SnippetBrain – Results snippet generation system.
- Glue – Results unification system based on user behaviour.
- Cookbook – Signal generation system, apparently created at runtime.
On-Page Factors:
- titlematchScore: Site-level title match score, indicating how well titles match user queries.
- fontsize: Font size of links; used by Google to assess link importance.
- OriginalContentScore: Score representing content originality, especially for pages with little content.
- Avg. Term Weight: Term reinforcement through the use of bold text or strategic terms.
- keywordStuffingScore: Spam score for keyword stuffing.
- spamWordScore: Score associated with words identified as spam.
- textConfidence: Confidence in text relevance and quality.
- effortScore: Effort and quality in content creation.
- Penguin Algorithm: Targets spammy links, including over-optimised internal links.
- Document Length: Limit on the number of words and punctuation; important content must be placed at the beginning of the text.
- Content Length: Google processes a limited number of characters; important content must be placed early on the page.
- Page Titles: Must be optimised and closely match query keywords.
- FreshnessTwiddler: Re-ranking based on content freshness.
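To see why placing important content early matters when only a limited amount of text is processed, here is a minimal sketch. The token limit and the whitespace tokenisation are assumptions chosen for illustration; the leaked documentation does not give this procedure.

```python
from typing import List


def leading_tokens(text: str, max_tokens: int) -> List[str]:
    # Keep only the first max_tokens whitespace-separated tokens.
    return text.split()[:max_tokens]


def phrase_survives(text: str, phrase: str, max_tokens: int) -> bool:
    # True if the phrase still appears once the text has been truncated.
    head = " ".join(leading_tokens(text, max_tokens)).lower()
    return " ".join(phrase.lower().split()) in head


article = "Seed PageRank explained. " + "filler " * 50 + "Our key takeaway about Navboost."

print(phrase_survives(article, "seed pagerank", max_tokens=20))  # phrase is at the start
print(phrase_survives(article, "key takeaway", max_tokens=20))   # phrase is past the cut
```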
Off-Page Factors
- Fresh Docs: Freshness multiplier for links; links from recent pages are ranked better.
- homePageInfo: Indicates whether the source page is a homepage and its trust level.
- SiteAuthority: Indicates the overall credibility or authority of a site.
- sourceType: Quality of the source page of a link, correlated with its indexing level.
- CreationDate: Date a link was first discovered and the last known date that link was found.
- TrustedAnchors: Quality and reliability of inbound links.
- SiteLinkIn: Average value of inbound links.
- PriorSignal: Information about URL history; poor prior quality can affect ranking.
- anchorDiversityScore: Diversity of anchor texts for links pointing to a site.
- trustTarget: Indicates whether a URL is on a trustworthy source; trusted sites have more latitude.
PageRank:
- PageRank: PageRank score taking various factors into account.
- homepagePagerankNs: PageRank of the site’s homepage.
- PagerankNS: Pagerank-NearestSeeds is a pagerank score for the document, calculated using the NearestSeeds method. This is the production PageRank value that teams should use. –> PageRank from 2018 – seed site, see my article on PageRank.
- pagerank: URL ranking value [0-65535]. DEPRECATED. Configuration in NearestSeeds.
- pagerank2: Experimental pagerank score. DEPRECATED in favour of MustangBasicInfo.
- crawlPagerank: Internal docjoiner use to transfer scores from source canonicals to final canonicals. –> using a canonical allows PageRank to be transferred.
- toolbarPagerank: Score [0-10]. If undefined, uses EstimatePreDemotion via MustangBasicInfo. –> the famous PageRank with the toolbar.
- FirstCoveragePagerankNs: Initial pagerank score at first indexing.
- feedPagerank: Normalised score [0-1] specific to RSS feeds. Distinct from the homepage pagerank.
- topPrOnsiteAnchorCount: Anchor quality – optimal >51000, standard <47000
- bookPagerank: PageRank score specific to book pages.
- anchorPhraseCount: The number of unique anchor phrases. Limited by the constant kMaxAnchorPhraseCountInStats (=5000)
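The leaked documentation names the NearestSeeds method but does not describe how it is computed. The sketch below only illustrates the general intuition of "distance from trusted seed sites" with a breadth-first search over a toy link graph; the graph, the domain names and the function `seed_distances` are all hypothetical.

```python
from collections import deque
from typing import Dict, List


def seed_distances(links: Dict[str, List[str]], seeds: List[str]) -> Dict[str, int]:
    """Return the number of link hops from the nearest seed to each reachable page."""
    distances = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in distances:  # first visit is the shortest path in BFS
                distances[target] = distances[page] + 1
                queue.append(target)
    return distances


# Toy link graph: each key links to the pages in its list.
links = {
    "seed.example": ["news.example", "blog.example"],
    "news.example": ["shop.example"],
    "blog.example": ["shop.example"],
    "shop.example": ["spam.example"],
}
distances = seed_distances(links, ["seed.example"])
print(distances)
```

Under this reading, a page reached in few hops from a seed would inherit more trust than one reached only through a long chain, and pages that no seed can reach get no distance at all.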
Spam
- Link Velocity: Rapidly acquiring many links can be flagged as spam.
- spamRank: Measures the probability that a document links to known spammers.
- phraseAnchorSpamCount: Number of spam phrases found in anchors.
- phraseAnchorSpamDays: Number of days over which 80% of these spam phrases were discovered.
- phraseAnchorSpamDemoted: Total number of anchors demoted due to spam.
- phraseAnchorSpamEnd: Time at which the anchor spam peak ended.
- phraseAnchorSpamFraq: Fraction of spam phrases among all document anchors.
- spamBrainTotalDocSpamScore: Spam score identified by SpamBrain (from 0 to 1).
- trendSpam: CTR manipulation indicator; number of matching trending spam queries.
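The anchor spam attributes above read like simple statistics over a document's inbound anchors. As a rough sketch, assuming each anchor has a discovery date and a spam flag, the fraction and the "80% window" could be computed as follows. The input format and the choice to measure the window from the first spam anchor are assumptions, not something the documentation specifies.

```python
from datetime import date
from typing import Dict, List, Tuple


def anchor_spam_stats(anchors: List[Tuple[date, bool]]) -> Dict[str, float]:
    """anchors: (first_seen, is_spam) pairs for one document."""
    spam_dates = sorted(seen for seen, is_spam in anchors if is_spam)
    if not spam_dates:
        return {"spam_fraction": 0.0, "spam_days": 0}
    # Fraction of spam anchors among all anchors (analogous to phraseAnchorSpamFraq).
    fraction = len(spam_dates) / len(anchors)
    # Index of the anchor at which 80% of spam anchors have been seen (integer ceiling).
    needed = -(-4 * len(spam_dates) // 5)
    # Days from the first spam anchor until that point (analogous to phraseAnchorSpamDays).
    days = (spam_dates[needed - 1] - spam_dates[0]).days + 1
    return {"spam_fraction": fraction, "spam_days": days}


anchors = [
    (date(2024, 1, 1), True),
    (date(2024, 1, 2), True),
    (date(2024, 1, 2), True),
    (date(2024, 1, 3), True),
    (date(2024, 3, 1), True),
    (date(2023, 6, 1), False),
    (date(2023, 7, 1), False),
    (date(2023, 8, 1), False),
    (date(2023, 9, 1), False),
    (date(2023, 10, 1), False),
]
stats = anchor_spam_stats(anchors)
print(stats)
```

A high spam fraction concentrated in a short window is the pattern that would match the Link Velocity point above.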
Technical:
- URLHistory: Google keeps the last 20 changes of a URL.
- mobileFriendlinessScore: Indicates whether a site is optimised for mobile devices.
- pageLoadTimeScore: Score based on page load time; impacts user experience.
- bylineDate: Date explicitly set on the page, used in search results.
- syntacticDate: Date extracted from the URL or document title.
- semanticDate: Date estimated from the document content. –> Date consistency (bylineDate, syntacticDate, semanticDate) on the page is important.
- Ranking Degradation Factors: Factors such as inconsistent links, poor UX, low CTR and poor quality content that can degrade rankings.
- NSR Data (chardVariance, chardScoreVariance, nsrdataFromFallbackPatternKey): Variance measurements for NSR scores applied to the site; predict site or page quality.
- hostAge: Date when Google first discovered content on the domain.
- YMYL Scores (ymylHealthScore, ymylNewsScore, encodedChardXlqYmylPrediction): Scores for YMYL content.
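Date consistency between bylineDate, syntacticDate and semanticDate is something you can audit on your own pages. The sketch below is one possible check, assuming a `/YYYY/MM/DD/` URL pattern and a seven-day tolerance; both are illustrative choices, and the function names are hypothetical.

```python
import re
from datetime import date
from typing import Optional

# Assumed URL convention: .../2024/05/28/slug
URL_DATE = re.compile(r"/(\d{4})/(\d{1,2})/(\d{1,2})(?:/|$)")


def date_from_url(url: str) -> Optional[date]:
    """Extract a date from the URL path, or None if there is no valid date."""
    match = URL_DATE.search(url)
    if not match:
        return None
    year, month, day = map(int, match.groups())
    try:
        return date(year, month, day)
    except ValueError:  # e.g. /2024/13/45/
        return None


def dates_consistent(byline: Optional[date], syntactic: Optional[date],
                     semantic: Optional[date], tolerance_days: int = 7) -> bool:
    """True if all known dates fall within tolerance_days of each other."""
    known = [d for d in (byline, syntactic, semantic) if d is not None]
    if len(known) < 2:
        return True  # nothing to compare
    return (max(known) - min(known)).days <= tolerance_days


url_date = date_from_url("https://example.com/2024/05/28/google-leak")
print(dates_consistent(date(2024, 5, 28), url_date, date(2024, 5, 30)))
print(dates_consistent(date(2021, 1, 1), url_date, date(2024, 5, 30)))
```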
Semantic:
- author: Document author(s) stored as text.
- isAuthor: Indicates whether an entity on the page is also the document’s author.
- Authors and Entities: Google considers whether authors are recognised entities in the Knowledge Graph.
- TopicEmbedding: Temporal relevance value.
- siteEmbedding: Compressed vector representation of the site for thematic analysis.
- pageEmbedding: Compressed vector representation of the page for thematic analysis.
- siteFocusScore: Measures how specialised a site is in a specific domain.
- siteRadius: Measures a page’s deviation from the site’s main topic.
- Semantic Text Scores (STS): Global score based on text understanding, salience and entities.
- Short Content Originality: Emphasis on short content originality.
- AI-Generated Content: Google can detect and treat AI-generated content differently.
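The relationship between pageEmbedding, siteEmbedding, siteRadius and siteFocusScore can be illustrated with a toy calculation. The documentation does not give the formulas, so everything below is an assumption: the site embedding is taken as the mean of the page embeddings, the radius as the cosine distance of a page from that mean, and the focus as one minus the average radius.

```python
import math
from typing import Dict, List, Tuple


def mean_vector(vectors: List[List[float]]) -> List[float]:
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine_distance(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        return 1.0  # treat a zero vector as maximally distant
    return 1.0 - dot / norm


def site_profile(page_embeddings: Dict[str, List[float]]) -> Tuple[Dict[str, float], float]:
    """Return (radius per page, focus score for the site) under the assumptions above."""
    site_embedding = mean_vector(list(page_embeddings.values()))
    radii = {url: cosine_distance(vec, site_embedding)
             for url, vec in page_embeddings.items()}
    focus = 1.0 - sum(radii.values()) / len(radii)
    return radii, focus


# Toy 2-dimensional embeddings: two SEO pages and one off-topic page.
pages = {
    "/seo/pagerank": [1.0, 0.0],
    "/seo/navboost": [0.9, 0.1],
    "/recipes/pancakes": [0.0, 1.0],
}
radii, focus = site_profile(pages)
print(max(radii, key=radii.get), round(focus, 2))
```

The off-topic page ends up with the largest radius and drags the focus score down, which is the intuition behind staying on one core topic.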
Local Factors:
- clickRadius50Percent: The radius (in miles) around the assigned location at which the document receives 50% of its clicks.
- localBusinessCompletenessScore: Completeness of local business information.
- businessReviewCount: Number of reviews and ratings for a local business.
- NAPConsistencyScore: Consistency of Name, Address and Phone information.
- contentRelevanceScore: Relevance of content for local searches.
- localMentionCount: Number of local online mentions.
- geoDistanceScore: Distance between the user and the target location.
- bestLocaleMatch: Relevance of local language and metadata.
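clickRadius50Percent is described as the radius within which a document receives 50% of its clicks. A minimal sketch of that definition, assuming click locations are available as latitude/longitude pairs and using the standard haversine distance, could look like this. The function names and sample coordinates are hypothetical.

```python
import math
from typing import List, Optional, Tuple

EARTH_RADIUS_MILES = 3958.8


def haversine_miles(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points, in miles."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(min(1.0, math.sqrt(a)))


def click_radius(location: Tuple[float, float],
                 click_points: List[Tuple[float, float]],
                 share: float = 0.5) -> Optional[float]:
    """Smallest radius (miles) around location that contains `share` of the clicks."""
    if not click_points:
        return None
    distances = sorted(haversine_miles(*location, lat, lon) for lat, lon in click_points)
    needed = max(1, math.ceil(share * len(distances)))
    return distances[needed - 1]


business = (0.0, 0.0)
clicks = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
radius = click_radius(business, clicks)
print(round(radius, 1))
```

A small radius suggests a business with a tight local catchment; a large one suggests the document attracts clicks from far away.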
User Engagement
- UgcScore: Score related to user-generated content engagement.
- squashed click, short click, long click: Types of clicks indicating the level of user satisfaction.
- ChromeInTotal: Total number of views via Chrome across the entire site.
- SiteImpressions, TopicImpressions, SiteClicks, TopicClicks: Engagement and relevance indicators.
- Modulators (Twiddlers): Adjust rankings based on content freshness and user engagement signals.
- Navboost: Re-ranking based on user click logs.
- Mustang Algorithm: Main ranking algorithm with boosts for factors such as CTR and content freshness.
- dailyClicks: Daily clicks.
- dailyGoodClicks: Daily good clicks.
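The difference between short and long clicks, and how they might roll up into daily counters, can be sketched as follows. The dwell-time thresholds, the "medium" bucket and the rule that a long click counts as a good click are assumptions made for illustration; the leaked documentation names the signals but not their definitions.

```python
from collections import Counter
from typing import Dict, List, Tuple

SHORT_MAX_SECONDS = 10  # hypothetical threshold
LONG_MIN_SECONDS = 60   # hypothetical threshold


def classify_click(dwell_seconds: float, returned_to_results: bool) -> str:
    # A user who never comes back to the results page is treated as satisfied.
    if not returned_to_results or dwell_seconds >= LONG_MIN_SECONDS:
        return "long"
    if dwell_seconds < SHORT_MAX_SECONDS:
        return "short"
    return "medium"


def daily_summary(clicks: List[Tuple[float, bool]]) -> Dict[str, int]:
    """clicks: (dwell_seconds, returned_to_results) pairs for one day."""
    labels = Counter(classify_click(dwell, returned) for dwell, returned in clicks)
    return {
        "dailyClicks": sum(labels.values()),
        "dailyGoodClicks": labels["long"],  # assumption: long click == good click
        "shortClicks": labels["short"],
    }


clicks = [(3, True), (45, True), (200, True), (15, False)]
summary = daily_summary(clicks)
print(summary)
```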
Demotion Algorithms
- Devaluation of Small Blogs: Small blogs can be devalued compared to authoritative sites.
- exact_match_domain_demotion: Demotion applied to exact match domains (EMD).
- Anchor Mismatch: Link text does not match the target site; the link is demoted.
- SERP Demotion: Demotion based on factors observed in results pages, indicating user dissatisfaction.
- Nav Demotion: Demotion for pages with poor navigation or user experience issues.
- Product Review Demotion: Demotion related to product review quality.
- Location Demotions: Global pages can be demoted in favour of more localised results.
- Panda Demotion: Quality_Coati.