How to Manage Duplicate Content for SEO: Complete Guide

Duplicate content is a recurring SEO problem. Discover how Google detects it, the main causes (URL parameters, HTTP/HTTPS, multilingual sites), the best tools to find it, and how to fix it.

Published on August 17, 2022 Reading 9 min By Stan De Jesus Oliveira

Definition: How Do You Manage Duplicate Content for SEO?

Duplicate content is a recurring problem to address when optimizing your website for organic search.

Indeed, Google and search engines in general are perfectly capable of detecting duplicated content.

Concretely, this can involve duplicated text between two pages of your website, or between your site and an external website.

We can distinguish two types of duplicate content:

Internal duplicate content
External duplicate content (plagiarism)

Even if you’re not duplicating text between your web pages or other websites (plagiarism), you’re still probably affected by duplicate content issues — particularly due to parameterized URLs.

Simplified Explanation of How Google Calculates Duplication

Google patents describe how the search engine groups identical or very similar versions of a piece of content into clusters, then selects one document from each cluster as the representative content.

The representative content stands in for the whole cluster as the canonical version. The choice of canonical content generally depends on which site has the most authoritative domain (PageRank).

Interestingly, when the content cluster grows with future duplicates and web pages with no unique value, the representative content and the URL holding it gain more authority. This is called “link inversion.”
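The clustering idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Google's actual method: pages are clustered by an exact hash of their normalized text, and the `authority` value is a hypothetical score standing in for signals such as PageRank.

```python
import hashlib
import re
from collections import defaultdict


def fingerprint(text):
    """Hash the text after lowercasing and collapsing whitespace."""
    normalized = re.sub(r"\s+", " ", text.lower()).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def pick_representatives(pages):
    """Group pages by fingerprint and keep the highest-authority URL per cluster.

    `pages` is a list of (url, text, authority) tuples. `authority` is a
    hypothetical score used for illustration only.
    """
    clusters = defaultdict(list)
    for url, text, authority in pages:
        clusters[fingerprint(text)].append((authority, url))
    # max() compares (authority, url) tuples, so authority decides first.
    return {key: max(members)[1] for key, members in clusters.items()}


pages = [
    ("https://big-site.example/article", "Duplicate  content guide", 0.9),
    ("https://small-site.example/copy", "duplicate content guide", 0.2),
    ("https://other.example/unique", "Something else entirely", 0.5),
]
representatives = sorted(pick_representatives(pages).values())
print(representatives)
```

Here the first two pages fall into the same cluster, and the URL with the higher assumed authority is kept as its representative.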

Does Google Penalize Duplicate Content?

Search engines aim to find unique content — this is one of the many reasons not to duplicate text, whether from your own site or an external one.

If a search engine recognizes content as duplicated, the page can be devalued or removed from the index.

How Does Duplicate Content Occur?

In 2013, Matt Cutts, a Google engineer, stated that between 25 and 30% of the web was duplicated — and that this wasn’t necessarily a bad thing, as in most cases it’s not intentional. But how can you have duplicate content without doing it intentionally? Let’s see.

URL Parameters

URL parameters are a frequent cause of internal content duplication. A URL parameter is a key-value pair appended to a standard page URL, usually identifiable by the question mark that introduces it.

A typical example is faceted navigation. On an e-commerce site, each facet combination can generate its own URL for what is essentially the same page. So if your site displays a product in a different color, the original URL https://your-site.com/t-shirt/ and the parameterized URL https://your-site.com/t-shirt?color=red will serve duplicate content.

This can also happen with tracking codes, internal site search (?q=search-term), pagination, and other uses of URL parameters, as well as with AMP versions (site.com/page and amp.site.com/page). To avoid this issue, you can indicate the canonical version, i.e., the main version of the page, to search engines.

The canonical tag goes in the page's <head>: <link rel="canonical" href="https://example.com/example-page/" />
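As a rough sketch of how a site could derive the canonical URL for a parameterized page, the Python snippet below strips parameters that are assumed not to change the main content and renders the matching tag. The `IGNORED_PARAMS` list and the trailing-slash convention are assumptions made for this example; adapt both to your own URL scheme.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list: parameters assumed not to change the main content.
IGNORED_PARAMS = {"color", "utm_source", "utm_medium", "utm_campaign", "sessionid"}


def canonical_url(url):
    """Drop ignored parameters and the fragment, and enforce a trailing slash."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    ]
    # Assumption: this site uses trailing slashes on all page URLs.
    path = parts.path if parts.path.endswith("/") else parts.path + "/"
    return urlunsplit((parts.scheme, parts.netloc, path, urlencode(kept), ""))


def canonical_tag(url):
    """Render the <link rel="canonical"> tag for the given page URL."""
    return f'<link rel="canonical" href="{escape(canonical_url(url))}" />'


print(canonical_tag("https://your-site.com/t-shirt?color=red"))
```

Parameters that do change the content, such as a page number, are deliberately kept in the canonical URL in this sketch.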

HTTP / HTTPS

Beyond parameterized URLs, you can also have internal duplication if your web pages are accessible over both HTTP and HTTPS. To address this, serve your pages over HTTPS only and redirect every HTTP URL to its HTTPS equivalent with a 301 redirect. With good hosting, this is usually just a button click in your hosting dashboard.

www and non-www

Google can also detect duplicate content if your web pages are accessible both with and without “www.” before your domain name. Choose one version and redirect the other to it; this is usually handled from your hosting and registrar settings.
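To make the HTTP/HTTPS and www/non-www cases concrete, here is a minimal Python sketch that maps every scheme and host variant of a URL to a single preferred version, which is where a 301 redirect would point. The function name `preferred_url` and the choice of the non-www host as the default are assumptions for illustration; in practice the redirect itself is configured at the server or hosting level.

```python
from urllib.parse import urlsplit, urlunsplit


def preferred_url(url, use_www=False):
    """Return the single HTTPS URL that all scheme/host variants should map to.

    Note: this sketch ignores ports and credentials in the URL.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    bare = host[4:] if host.startswith("www.") else host
    host = "www." + bare if use_www else bare
    return urlunsplit(("https", host, parts.path or "/", parts.query, ""))


variants = [
    "http://example.com/page",
    "http://www.example.com/page",
    "https://www.example.com/page",
    "https://example.com/page",
]
targets = {preferred_url(url) for url in variants}
print(targets)
```

All four variants collapse to one target, which is the situation you want search engines to see.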

Duplicate Content on Multilingual Sites

You can also have duplicate content if you’ve incorrectly configured hreflang tags. This typically occurs when two countries share the same language, so the pages are nearly identical apart from adjustments like prices or phone numbers. An hreflang tag tells Google which language and region each version targets:

<link rel="alternate" hreflang="en-GB" href="https://www.example.com/content-in-en-uk" />

Without correct tags, Google could inadvertently treat these versions as duplicates. Make sure to follow the guidelines on hreflang tags: region codes follow ISO 3166-1 (the United Kingdom is GB, not UK), and each language version should list all the others as well as itself.
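One way to keep hreflang annotations consistent is to generate the same full set of tags for every language version from a single mapping. The Python sketch below illustrates this; the URLs and the `hreflang_tags` helper are hypothetical, and the optional x-default entry is included as the fallback for users who match none of the listed locales.

```python
from html import escape
from typing import Dict, List, Optional


def hreflang_tags(variants: Dict[str, str], default_url: Optional[str] = None) -> List[str]:
    """Build the full set of hreflang tags to place on every variant page."""
    tags = [
        f'<link rel="alternate" hreflang="{escape(code)}" href="{escape(url)}" />'
        for code, url in sorted(variants.items())
    ]
    if default_url is not None:
        tags.append(
            f'<link rel="alternate" hreflang="x-default" href="{escape(default_url)}" />'
        )
    return tags


# Hypothetical URLs for two English-language regional versions.
variants = {
    "en-US": "https://www.example.com/en-us/pricing",
    "en-GB": "https://www.example.com/en-gb/pricing",
}
tags = hreflang_tags(variants, default_url="https://www.example.com/pricing")
print("\n".join(tags))
```

Because the same list is emitted on each page, every version references itself and all of its alternates.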

Ranking Signal Division

You can also have a duplication problem if you have too many similar pages targeting the same search intent. If multiple pages have very similar content, they compete for the same queries and dilute each other's ranking signals, which tends to lower their rankings. This is more commonly known as keyword cannibalization. If you’re affected, I recommend either applying a canonical URL pointing to your preferred version or better categorizing your content.

How to Remove Duplicate Content

The primary approach is obviously to avoid duplicating text between your own pages or those on the web. However, as we’ve seen, you can still encounter duplicate content issues. If you work with writers, you also can’t always be certain they haven’t duplicated content. That’s why tools exist to detect both internal and external content duplication.

Tools to Detect Internal Duplication

Screaming Frog: A crawler that identifies technical errors on your website — and can help with canonical tag verification and detecting near-duplicate internal content.

Siteliner: An internal duplicate content analysis tool — more accessible for beginners.

Tools to Detect External Duplication

Kill Duplicate: A powerful tool for external duplication analysis.

Duplichecker: Checks whether content is plagiarized (page by page).

Copyfight: A content protection tool — comes with a WordPress plugin.

CopyScape: A traditional tool for detecting plagiarism.

Canonical Tag vs 301 Redirect vs Meta Robots noindex

Canonical URL: Points search engines to the original (preferred) URL. Use it for parameterized URLs.

301 Redirect: If you have an older version of content still known to Google, it could be detected as a duplication of the new version. In this case, redirect the old page to the new one with a 301 redirect.

Meta robots: If for any reason you need to keep a duplicate page out of search results, you can use the meta robots noindex tag: <meta name="robots" content="noindex"/>. With the page marked as not to be indexed, your duplication issue may disappear. A common bad practice is to also block noindex pages in robots.txt. This is counterproductive: a crawler that is not allowed to fetch the page never sees the noindex tag, so the page must remain crawlable for the directive to be taken into account.
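The robots.txt pitfall can be checked automatically. The following Python sketch uses the standard library's `urllib.robotparser` to list noindex pages that robots.txt prevents a crawler from fetching. The robots.txt content and URLs are invented for the example, and Python's parser only approximates how Googlebot interprets robots.txt (it does not handle wildcard patterns, for instance), so treat the result as a first-pass check.

```python
from urllib.robotparser import RobotFileParser


def blocked_noindex_urls(robots_txt, noindex_urls, user_agent="Googlebot"):
    """Return the noindex URLs that robots.txt prevents the crawler from fetching."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in noindex_urls if not parser.can_fetch(user_agent, url)]


# Hypothetical robots.txt and list of pages carrying a noindex tag.
robots_txt = """\
User-agent: *
Disallow: /private/
"""
noindex_urls = [
    "https://example.com/private/old-page",
    "https://example.com/t-shirt?color=red",
]
conflicts = blocked_noindex_urls(robots_txt, noindex_urls)
print(conflicts)
```

Any URL returned here carries a noindex directive that crawlers cannot read, so either the robots.txt rule or the noindex tag needs to change.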

FAQ: Duplicate Content for SEO

When is duplicate content a problem?

There’s no predefined quota or percentage of duplicate content that triggers an SEO problem. It’s more about adopting good practices daily.

What is near-duplicate content?

Near-duplicate refers to similar but not identical duplicated content — the most common form of external duplication. It can be detected by search engines. It’s not exact duplication but partial duplication calculated algorithmically. That’s why when writing content, it’s important not to simply recycle existing content.

Should I worry about duplicate content?

Yes. Webmasters should check their site for duplicate content regularly in order to achieve better search rankings. Leaving it unaddressed could send a negative signal to search engines.

What’s the difference between repetitive and duplicated content?

There’s not much difference from Google’s perspective. Your menu and footer are similar across all your pages, and often so are your CTAs (calls to action). Google refers to these as “recurring text modules” and can detect them as duplication. To avoid this, keep recurring modules brief and prefer using links. Avoid repeated modules or widgets or rewrite them (ideally per page), and consider using the semantic tag <blockquote> where the repeated text is a quotation.

Which algorithm detects duplicate content?

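Google Panda is the filter algorithm used to assess content quality, which includes (among other things) checking for duplicate content. This algorithm measures the “gibberish rate” of content and detects duplication by identifying identical shingle sets, a shingle being a short sequence of consecutive words taken from the text.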
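To illustrate the shingle idea, here is a minimal Python sketch that splits two texts into three-word shingles and compares the resulting sets with the Jaccard index (shared shingles divided by total distinct shingles). This is a textbook technique shown under simple assumptions, not a description of how Google implements it; the shingle size of 3 is an arbitrary choice for the example.

```python
import re


def shingles(text, size=3):
    """Return the set of overlapping word n-grams (shingles) in the text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(text_a, text_b, size=3):
    """Similarity between 0.0 (no shared shingles) and 1.0 (identical sets)."""
    set_a, set_b = shingles(text_a, size), shingles(text_b, size)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


original = "the quick brown fox jumps over the lazy dog"
rewritten = "the quick brown fox leaps over the lazy dog"
score = jaccard(original, rewritten)
print(round(score, 2))
```

Identical texts score 1.0, unrelated texts score close to 0.0, and lightly reworded copies fall in between, which is what makes near-duplicates detectable.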

How to find duplicate content using Google?

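To find duplicate content, you can use Google search operators such as site:my-site.com "the phrase", or operators like intitle, inurl, and inanchor. Example: site:my-site.com intitle:"the phrase". The quotation marks matter: without them, intitle: applies only to the first word that follows it.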

How to find duplicate content in Google Search Console?

Google Search Console offers a “Coverage” report (named “Pages” in more recent versions of the interface) that can reveal potential duplicate content. Click on “Excluded” and explore the URLs listed as “Crawled, currently not indexed” or “Discovered, currently not indexed.” If your URL appears here, duplicate content is one possible cause, and Google is choosing not to add it to its index. Statuses such as “Duplicate without user-selected canonical” point to duplication explicitly.

How to find duplicate content with Python?

If you’re a technical SEO professional with programming skills, check out the Advertools Python library, or use custom scraping scripts with frameworks like Scrapy and libraries such as urllib.
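As a starting point for such a custom script, the sketch below uses only the Python standard library to read the canonical tag from each page's HTML and report pages that lack one or that point elsewhere. The `pages` dictionary is a stand-in for HTML you would fetch yourself (for example with Scrapy or urllib), and the status labels are invented for this example.

```python
from html.parser import HTMLParser


class CanonicalParser(HTMLParser):
    """Collect the href of the first <link rel="canonical"> tag."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        rel = (attributes.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel and self.canonical is None:
            self.canonical = attributes.get("href")


def find_canonical(html_text):
    parser = CanonicalParser()
    parser.feed(html_text)
    return parser.canonical


def audit(pages):
    """Map each URL in {url: html} to a short canonical status."""
    report = {}
    for url, html_text in pages.items():
        canonical = find_canonical(html_text)
        if canonical is None:
            report[url] = "missing canonical"
        elif canonical == url:
            report[url] = "self-canonical"
        else:
            report[url] = f"canonicalized to {canonical}"
    return report


# Hypothetical pages; in practice the HTML would come from your crawler.
pages = {
    "https://your-site.com/t-shirt/": '<head><link rel="canonical" href="https://your-site.com/t-shirt/" /></head>',
    "https://your-site.com/t-shirt?color=red": '<head><link rel="canonical" href="https://your-site.com/t-shirt/" /></head>',
    "https://your-site.com/search?q=shirt": "<head><title>Search</title></head>",
}
report = audit(pages)
for url, status in report.items():
    print(url, "->", status)
```

Pages reported as missing a canonical are the first candidates to review for unintended duplication.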