Knowledge Vault and SEO: How Google's AI Knowledge Base Works

The Knowledge Vault is Google's knowledge base powered by probabilistic inference. Discover how it auto-generates semantic triples, works with entities, and why factual accuracy matters for your SEO.

Published on July 16, 2022 · 6 min read · By Stan De Jesus Oliveira

The Knowledge Vault is a knowledge base equipped with an inference system. It can therefore be seen as a Google Knowledge Graph 2.0, since it generates semantic triples automatically.

We extract approximately 1.6 billion candidate triples, covering 4,469 different types of relations and 1,100 different types of entities. Approximately 271 million of these facts have an estimated probability of being true above 90%.

– Google Search

Since we’ve already covered the classic definitions of semantic search and the Google Knowledge Graph elsewhere, we won’t redefine them here.

How the Knowledge Vault Works

Large-scale knowledge bases are increasingly popular — notably Wikipedia, Microsoft’s Satori, and Google’s Knowledge Graph.

However, Google wanted a knowledge base that could cover as many entities as there are things described on the Web.

To achieve this, they needed to use (and continue to use) automatic methods to build RDF triples (subject — predicate — object) through probabilistic inference systems.

In simple terms, an inference system predicts semantic triples based on other existing triples.
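To make the idea concrete, here is a minimal sketch of triples and of one inference step. The facts, the predicate names, and the single hand-written rule are all assumptions made for illustration. The Knowledge Vault itself assigns probabilities to candidate triples rather than applying a fixed rule like this one.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Toy knowledge base; the facts and predicate names are illustrative only.
KNOWN = {
    Triple("Marie Curie", "born_in", "Warsaw"),
    Triple("Warsaw", "located_in", "Poland"),
}


def infer_birth_country(known: set[Triple]) -> set[Triple]:
    """Hypothetical rule: born_in(X, C) and located_in(C, Y) => birth_country(X, Y)."""
    located = {(t.subject, t.obj) for t in known if t.predicate == "located_in"}
    inferred = set()
    for t in known:
        if t.predicate != "born_in":
            continue
        for city, country in located:
            if city == t.obj:
                inferred.add(Triple(t.subject, "birth_country", country))
    # Only return triples that were not already in the knowledge base.
    return inferred - known


new_triples = infer_birth_country(KNOWN)
```

The new triple is derived purely from two existing ones, which is the core of the idea. A probabilistic system would attach a score to it instead of accepting it outright.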

How the Knowledge Vault Automatically Generates Triples

Web content extraction (obtained via text analysis, tabular data, page structure and human annotations) is the first step in defining entities.

The Knowledge Vault extracts facts in the form of triples from across the entire Web, so it is built at web scale rather than from a single curated source. (The paper does rely on a local closed-world assumption, but only to label candidate triples against Freebase when training and evaluating its models.) In other words, the KV uses web pages to create new triples. To limit noise and bias, these extracted facts are merged with already established knowledge, such as Freebase or the Google Knowledge Graph.

Let’s formalize its operation a bit more.

Extractors

A Knowledge Vault (KV) first needs to extract “things” from the Web to expand its knowledge at scale. The systems that do this are called extractors.

These systems extract triples from a large number of web sources. Each extractor assigns a confidence score to a triple, representing uncertainty about the identity of the relationship and its corresponding arguments.

Graph-Based Priors

These systems learn the prior probability of each possible triple, based on triples stored in an existing knowledge base.
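As a rough intuition for what a prior is, the sketch below estimates how plausible an object is for a given predicate from counts in an existing knowledge base, with add-one smoothing. This is a deliberately simple stand-in: the paper uses learned models (a path ranking algorithm and a neural network), not raw frequencies, and the toy facts and predicate names here are assumptions.

```python
from collections import Counter

# Toy existing knowledge base of (subject, predicate, object) triples.
# The facts are placeholders chosen for illustration.
KB = [
    ("Victor Hugo", "nationality", "France"),
    ("Claude Monet", "nationality", "France"),
    ("Frederic Chopin", "nationality", "Poland"),
    ("Victor Hugo", "profession", "Writer"),
]

pair_counts = Counter((p, o) for _, p, o in KB)   # how often (predicate, object) occurs
predicate_counts = Counter(p for _, p, _ in KB)   # how often each predicate occurs
vocabulary = {o for _, _, o in KB}                # all objects seen in the KB


def prior(predicate: str, obj: str, alpha: float = 1.0) -> float:
    """Smoothed estimate of P(object | predicate) from the existing KB."""
    numerator = pair_counts[(predicate, obj)] + alpha
    denominator = predicate_counts[predicate] + alpha * len(vocabulary)
    return numerator / denominator


plausible = prior("nationality", "France")
implausible = prior("nationality", "Writer")
```

A candidate such as (someone, nationality, Writer) gets a low prior before any web page is read, simply because nothing in the existing knowledge base looks like it.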

Knowledge Fusion

This system calculates the probability that a triple is true, based on agreement between the different extractors and the priors. Naturally, facts extracted from the Web cannot always be trusted, so the main safeguard is to check them against established knowledge, essentially Freebase.
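The fusion step can be pictured as a classifier that takes extractor confidences and a graph prior as inputs and returns one probability. The sketch below uses a logistic combination with hand-set weights. The extractor names, weights, and bias are placeholders invented for this example, and the paper trains its fusion model on labeled data rather than setting weights by hand.

```python
import math


def fuse(
    extractor_scores: dict[str, float],
    prior: float,
    weights: dict[str, float],
    prior_weight: float,
    bias: float,
) -> float:
    """Combine extractor confidences and a graph prior into one probability."""
    z = bias + prior_weight * prior
    for name, score in extractor_scores.items():
        # Extractors without a configured weight contribute nothing.
        z += weights.get(name, 0.0) * score
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights: not taken from the paper.
WEIGHTS = {"text": 2.0, "table": 1.5}

# Two extractors agree and the prior is favorable.
p_agree = fuse({"text": 0.9, "table": 0.8}, prior=0.7,
               weights=WEIGHTS, prior_weight=2.0, bias=-3.0)

# One weak extraction and an unfavorable prior.
p_weak = fuse({"text": 0.3}, prior=0.1,
              weights=WEIGHTS, prior_weight=2.0, bias=-3.0)
```

With these placeholder values, the triple backed by several agreeing signals ends up with a much higher probability than the one supported by a single weak extraction. That is the behavior the fusion step is meant to produce.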

Key Technical Points About the Knowledge Vault

In the original Knowledge Vault paper (https://storage.googleapis.com/pub-tools-public-publication-data/pdf/45634.pdf), the authors describe how their inference engines would allow the KV to serve as a structured knowledge repository that is independent of language.

This sentence particularly struck me, as it reminded me of a Google patent I’d seen previously. That patent already existed but was updated following the incorporation of the KV into their information retrieval systems.

The patent is “Query language,” patent number: US20070198480A1.

This patent demonstrates how Google uses the KV to understand relationships and entities in the world — rather than within a specific country or language context.

Example of a Google patent on entities and their role in understanding web documents

The patent gives an example:

Bill Clinton — the value of a fact could be the text string “Bill Clinton was the 42nd President of the United States from 1993 to 2001.” Some object IDs may have one or more associated property facts while others may have no associated facts. The Figs. 2(a)-2(d) described above are examples only. The structure of the data in repository 115 may take other forms. Additional fields may be included in the facts and some fields described above may be omitted. Additionally, each object ID may have additional special facts beyond name facts and property facts — such as facts conveying a type or category.

Indeed, entities and their relationships are language-independent concepts: they are factual and ontological in nature. They do not need a specific language to be true, because they are not words, sentences, or character strings, but pure meanings.
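One way to picture this is a repository keyed by language-neutral object IDs, where names in any language are just facts attached to the same object. The class, the field names, and the IDs below are hypothetical and only loosely inspired by the patent's description of name facts and property facts. They are not the patent's actual data layout.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class KnowledgeObject:
    object_id: str                                             # language-neutral identifier
    name_facts: dict[str, str] = field(default_factory=dict)   # language code -> name
    property_facts: list[str] = field(default_factory=list)


# Made-up IDs; the property fact string is the example quoted from the patent.
REPOSITORY = [
    KnowledgeObject(
        object_id="obj-001",
        name_facts={"en": "United States", "fr": "États-Unis"},
    ),
    KnowledgeObject(
        object_id="obj-002",
        name_facts={"en": "Bill Clinton", "fr": "Bill Clinton"},
        property_facts=[
            "Bill Clinton was the 42nd President of the United States from 1993 to 2001."
        ],
    ),
]


def resolve(surface_name: str) -> Optional[str]:
    """Map a name written in any known language to its object ID."""
    wanted = surface_name.casefold()
    for obj in REPOSITORY:
        if any(name.casefold() == wanted for name in obj.name_facts.values()):
            return obj.object_id
    return None
```

Here `resolve("États-Unis")` and `resolve("United States")` point to the same object, so a fact attached to that object holds regardless of the language of the page that mentions it.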

This could give you some ideas. Most English Wikipedia articles are significantly richer in content than their French-translated counterparts. Since the facts specified in an English document are understood by the search engine as universal, mentioning them on a French-language web page would be perfectly understandable to a search engine — and could potentially give you more weight in search.

SEO and the Knowledge Vault

If you’re getting started with semantic SEO, here are some key points to understand:

The Knowledge Vault assigns each entity a relevance (and trust) score for a given query. A comparable measure is exposed by the Google Knowledge Graph Search API as resultScore.
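If you want to look at resultScore yourself, the sketch below builds a request URL for the Knowledge Graph Search API and reads the scores out of a response. The sample payload is made up to match the general shape of the API's JSON and its scores are not real. The API key is a placeholder, and no network call is made here.

```python
from urllib.parse import urlencode

ENDPOINT = "https://kgsearch.googleapis.com/v1/entities:search"


def build_search_url(query: str, api_key: str, limit: int = 5) -> str:
    """Build the request URL; fetch it with the HTTP client of your choice."""
    params = {"query": query, "key": api_key, "limit": limit}
    return ENDPOINT + "?" + urlencode(params)


def extract_scores(payload: dict) -> list[tuple[str, float]]:
    """Return (entity name, resultScore) pairs, highest score first."""
    rows = []
    for element in payload.get("itemListElement", []):
        name = element.get("result", {}).get("name", "")
        rows.append((name, float(element.get("resultScore", 0.0))))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Made-up payload for illustration; the scores are invented.
SAMPLE = {
    "itemListElement": [
        {"result": {"name": "Clinton"}, "resultScore": 12.5},
        {"result": {"name": "Bill Clinton"}, "resultScore": 310.0},
    ]
}

url = build_search_url("Bill Clinton", "YOUR_API_KEY")
ranked = extract_scores(SAMPLE)
```

Comparing the scores returned for your brand or topic entities against those of competitors gives a rough sense of how strongly Google associates each entity with a query.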

The Knowledge Vault and its related systems can check the factual accuracy of your texts. If they detect entities in your content (via NLP) but the statements about them contradict the KV, Knowledge-Based Trust (KBT) could lower your rankings, since you would essentially be spreading misinformation.

What this means in practice: verifying the factual accuracy of your texts is critical to ranking on the first page of Google. That said, simply mentioning facts isn’t in itself a ranking criterion — but if you mention facts that are incorrect, Google may revise your position downward.

Featured snippets, entity-based search results — all of these are connected to the Knowledge Vault. Understanding how these systems work, even at a basic level, will help you optimize for semantic SEO.