
robots.txt and SEO: Everything You Need to Know

The robots.txt file tells search engines where they can and cannot go on your site. Discover how to create and configure it correctly, the key directives, 7 best practices and common mistakes to avoid.

Published on February 21, 2022 · Reading time: 8 min · By Stan De Jesus Oliveira

Everything you need to know about robots.txt to optimise a website's SEO

The robots.txt text file tells search engines where they can and cannot go on your website.

Primarily, it lists the content you do not want search engines such as Google to crawl. You can also give different search engines different crawling instructions.

Is a robots.txt Important for SEO?

Admittedly, for a site that does not use a CMS and has fewer than ten pages, a robots.txt file is not essential. However, to avoid any future problems, I strongly advise you to set one up.

In any case, it will not hurt your SEO, quite the contrary.

Here are some of the benefits of robots.txt for organic SEO:

Keeping crawlers out of private sections of a website (for example, your staging or test environment)
Preventing crawling of internal search result pages
Optimising your crawl budget
Preventing crawling of duplicate content
…

Where to Find Your robots.txt?

Whether you already have a robots.txt file or want to create one, it must sit at the root of your site, i.e.: https://your-site.com/robots.txt

Make sure the file is named exactly “robots.txt”, in lowercase.

How to Create a robots.txt?

The robots.txt is a simple text file that crawlers read before crawling your pages. It tells them what they are allowed to visit.

Here is what a typical robots.txt looks like for WordPress sites:

Sitemap: https://createur2site.fr/sitemap_index.xml
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

If you do not yet have a robots.txt file, it is easy to create one. Simply open a .txt document and start specifying your directives.

Warning: incorrect use of robots.txt can send your site to the bottom of search results, for example if you block search engines from crawling every URL on your site.
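
As a quick safeguard against that scenario, here is a minimal sketch using Python's standard urllib.robotparser module. The helper name blocks_everything and the your-site.com URL are illustrative assumptions. Keep in mind that this parser applies rules in file order with simple prefix matching, so it will not reproduce Google's behaviour on every file.

```python
from urllib.robotparser import RobotFileParser

def blocks_everything(robots_txt: str, user_agent: str = "*") -> bool:
    """Return True if the home page is off limits for the given user agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, "https://your-site.com/")

WORDPRESS_RULES = """\
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
"""

DANGEROUS_RULES = """\
User-agent: *
Disallow: /
"""

print(blocks_everything(WORDPRESS_RULES))  # False
print(blocks_everything(DANGEROUS_RULES))  # True
```

Running a check like this before uploading a new file is a cheap way to catch a stray “Disallow: /”.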

User Agents

Each search engine identifies itself with a different user agent. You can define custom instructions for each of them in your robots.txt file. There are hundreds of user agents, but here are a few useful ones for SEO:

Google: Googlebot
Google Images: Googlebot-Image
Bing: Bingbot
Yahoo: Slurp
Baidu: Baiduspider
DuckDuckGo: DuckDuckBot

You can simply use the wildcard asterisk character (*) to assign directives to all crawler bots.

For example, suppose you want to prevent all bots except Googlebot from crawling your site. Here is how to do it:

User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /
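
To see these two groups in action, the following sketch feeds the rules above to Python's standard urllib.robotparser. The allowed helper and the /some-page path are hypothetical, and real crawlers may differ in details from this simple parser.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

def allowed(user_agent: str, path: str = "/some-page") -> bool:
    """Ask the parsed rules whether this user agent may fetch the path."""
    return parser.can_fetch(user_agent, "https://your-site.com" + path)

print(allowed("Googlebot"))  # True
print(allowed("Bingbot"))    # False
```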

Warning: each user-agent group stands alone. A crawler follows only the group that matches it most specifically, so directives declared for another user agent (including “*”) are not inherited.

Suppose you specify the following:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

Crawlers understand that you do not want them to visit the wp-admin directory, except for admin-ajax.php. However, if you add a separate group of directives for Googlebot, you will need to repeat in that group that you do not want it to visit the wp-admin directory.
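
The sketch below illustrates this clean-slate behaviour with Python's standard urllib.robotparser. The /private/ path and the can_visit helper are assumptions made up for the example.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /wp-admin/

User-agent: Googlebot
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

def can_visit(user_agent: str, path: str) -> bool:
    """Check a path against the group that applies to this user agent."""
    return parser.can_fetch(user_agent, "https://your-site.com" + path)

# Googlebot has its own group, so the wp-admin rule from "*" no longer applies to it.
print(can_visit("Googlebot", "/wp-admin/"))  # True
print(can_visit("Bingbot", "/wp-admin/"))    # False
```

To keep Googlebot out of wp-admin as well, the Disallow line has to be repeated inside the Googlebot group.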

Basic Directives

Directives are rules that you want the specified user agents to follow.

Allow

The “allow” directive allows search engines to crawl a subdirectory or page, even within a specifically disallowed directory.

For example, if you want to prevent search engines from accessing all pages of your site except your blog, your robots.txt file could look like this:

User-agent: *
Disallow: /
Allow: /blog
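
Google's documentation and RFC 9309 resolve a conflict like this by applying the most specific (longest) matching rule, with Allow winning a tie, which is why the order of the two lines does not matter. Here is a minimal sketch of that logic, assuming plain prefix rules without wildcards; the is_allowed helper is hypothetical.

```python
def is_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    """Longest matching rule wins; on a tie, Allow wins."""
    best_length, verdict = -1, True  # no matching rule means the path is allowed
    for directive, prefix in rules:
        if not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        if len(prefix) > best_length or (len(prefix) == best_length and is_allow):
            best_length, verdict = len(prefix), is_allow
    return verdict

RULES = [("Disallow", "/"), ("Allow", "/blog")]

print(is_allowed(RULES, "/blog/my-post"))  # True
print(is_allowed(RULES, "/contact"))       # False
```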

Disallow

This directive tells search engines not to access files and pages that have a specific path.

For example, if you want to prevent all search engines from accessing your blog, your robots.txt file could contain these instructions:

User-agent: *
Disallow: /blog

Sitemap

This directive is usually placed at the beginning of robots.txt. Including the sitemap in your robots.txt tells crawlers where to find your site map. The Sitemap directive is independent of user agents: it applies to all crawlers rather than to just one, so you do not need to repeat it for each user agent or place it below “User-agent: *”.

Sitemap: https://example.com/sitemap_index.xml
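
As a small illustration that the sitemap is not tied to any user agent, this sketch reads it back with Python's standard urllib.robotparser (the site_maps method needs Python 3.8 or later). The Googlebot group and the /private/ path are invented for the example.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
Sitemap: https://example.com/sitemap_index.xml

User-agent: Googlebot
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# The sitemap is reported for the whole file, not for a particular user agent.
sitemaps = parser.site_maps()
print(sitemaps)  # ['https://example.com/sitemap_index.xml']
```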

7 Best Practices to Adopt for Your robots.txt

Let us look at seven best practices to avoid common mistakes.

1. One Line Per Directive

Each directive must be on a new line.

That is, you must write like this:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

And not like this:
User-agent: * Disallow: /wp-admin/ Allow: /wp-admin/admin-ajax.php
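
If you want to catch this mistake automatically, a rough check could look like the sketch below. The crowded_lines helper and its short list of directive names are assumptions for illustration, not an official validator.

```python
import re

# Directive names followed by a colon; \b stops "allow" matching inside "Disallow".
DIRECTIVE = re.compile(r"\b(user-agent|disallow|allow|sitemap)\s*:", re.IGNORECASE)

def crowded_lines(robots_txt: str) -> list[int]:
    """Return 1-based numbers of lines that contain more than one directive."""
    return [
        number
        for number, line in enumerate(robots_txt.splitlines(), start=1)
        if len(DIRECTIVE.findall(line.split("#", 1)[0])) > 1  # ignore comments
    ]

GOOD = "User-agent: *\nDisallow: /wp-admin/\nAllow: /wp-admin/admin-ajax.php\n"
BAD = "User-agent: * Disallow: /wp-admin/ Allow: /wp-admin/admin-ajax.php\n"

print(crowded_lines(GOOD))  # []
print(crowded_lines(BAD))   # [1]
```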

2. Use “*” to Avoid Hundreds of Useless Lines

The “*” character is a wildcard. It lets you target all search engines at once, but not only that: inside a path, it matches any sequence of characters.
For example, instead of listing:

Disallow: /products/t-shirts
Disallow: /products/hoodies
Disallow: /products/jackets
…
You can simply use the asterisk:
Disallow: /products/*

3. Use “$” to Indicate the End of a URL

The “$” symbol marks the end of a URL.

Suppose you want to prevent search engines from accessing all .pdf files on your site, your robots.txt file could look like this:

User-agent: *
Disallow: /*.pdf$

Thus, this directive tells bots that all PDFs on the site should not be crawled.
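
To get a feel for how “*” and “$” behave, here is a minimal sketch that translates a rule path into a regular expression. The rule_to_regex and matches helpers are hypothetical, and real crawlers may handle edge cases (such as percent-encoding) differently.

```python
import re

def rule_to_regex(rule_path: str) -> re.Pattern:
    """Translate a robots.txt path with * and $ into a regex anchored at the start."""
    anchored = rule_path.endswith("$")
    body = rule_path[:-1] if anchored else rule_path
    # Escape each literal piece, then join the pieces with "any characters".
    pattern = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile(pattern + ("$" if anchored else ""))

def matches(rule_path: str, url_path: str) -> bool:
    """True if the rule applies to the URL path (match() anchors at the start)."""
    return rule_to_regex(rule_path).match(url_path) is not None

print(matches("/*.pdf$", "/docs/guide.pdf"))         # True
print(matches("/*.pdf$", "/docs/guide.pdf?page=2"))  # False
print(matches("/products/*", "/products/hoodies"))   # True
```

Without the “$”, the rule “/*.pdf” would also match the second URL, because rules match from the start of the path and do not need to reach its end.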

4. Specify a User Agent Only Once

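This seems fairly logical, but avoid declaring the same user agent more than once. Google, for instance, merges repeated groups, but splitting the rules makes the file harder to read and easier to get wrong. For example, avoid this: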

User-agent: Googlebot
Disallow: /page/

User-agent: Googlebot
Disallow: /page-2/
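
A short script can flag such repetitions. This is only a sketch: the repeated_user_agents helper is a made-up name, and it simply counts User-agent lines without interpreting the rules.

```python
from collections import Counter

def repeated_user_agents(robots_txt: str) -> list[str]:
    """Return user agents that appear on more than one User-agent line."""
    agents = []
    for line in robots_txt.splitlines():
        # Drop comments, then split "field: value" on the first colon.
        field, _, value = line.split("#", 1)[0].partition(":")
        if field.strip().lower() == "user-agent":
            agents.append(value.strip().lower())
    return [agent for agent, count in Counter(agents).items() if count > 1]

RULES = """\
User-agent: Googlebot
Disallow: /page/

User-agent: Googlebot
Disallow: /page-2/
"""

print(repeated_user_agents(RULES))  # ['googlebot']
```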

5. Be Specific

Do not forget the trailing slash at the end of your directives.
For example, “Disallow: /en” is meant to prevent crawling of all pages in the English directory. However, it matches every URL that starts with “/en”, so a page such as “/encyclopedia” will not be crawled either. In this case, it is better to add the slash: “Disallow: /en/”.
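
The reason is that a rule without wildcards is matched as a plain prefix of the URL path. A minimal sketch, with a hypothetical blocked_by helper and an invented /en/about page:

```python
def blocked_by(rule_path: str, url_path: str) -> bool:
    """Plain prefix matching, as applied to rules without wildcards."""
    return url_path.startswith(rule_path)

print(blocked_by("/en", "/encyclopedia"))   # True: blocked by accident
print(blocked_by("/en/", "/encyclopedia"))  # False
print(blocked_by("/en/", "/en/about"))      # True
```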

6. One robots.txt Per Subdomain

Robots.txt controls crawling behaviour only on the subdomain where it is hosted. If you want to control crawling on another subdomain, you will need a new robots.txt file, placed at the root of the relevant subdomain.

For example, if your site is on your-site.com and your English site is on en.your-site.com, you will then need two robots.txt files. One in the root directory of the main domain and the other in the root directory of the “en” subdomain, i.e., “en.your-site.com/robots.txt”.
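
Put differently, the robots.txt that governs a page is always the one at the root of that page's own host. The sketch below derives it with Python's urllib.parse; the robots_txt_url helper and the example paths are assumptions.

```python
from urllib.parse import urlsplit

def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL that applies to the given page."""
    parts = urlsplit(page_url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"

print(robots_txt_url("https://your-site.com/blog/post"))
# https://your-site.com/robots.txt
print(robots_txt_url("https://en.your-site.com/blog/post?ref=home"))
# https://en.your-site.com/robots.txt
```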

7. Prevent Crawling of Parameterised URLs “?”

This is a pure SEO trick: parameterised URLs, also called filter URLs or faceted navigation, can lead to significant content duplication. This is not good for your SEO.

For example, “my-site.com/t-shirt?colour=blue” will be considered a different URL from “my-site.com/t-shirt/”, even though the user has simply chosen the colour blue.

Preventing the crawling of parameterised URLs is generally far more beneficial than harmful. Thus, you can include this directive in the robots.txt:

User-agent: Googlebot
Disallow: /*?

Note: “?” is not a special character in robots.txt; parameterised URLs simply use the “?” character.

Verify Your robots.txt

To be certain you have not made any mistakes, use Google Search Console to identify potential errors caused by robots.txt.
You will find them in the “Coverage” report of GSC. For example, a URL could be flagged with “this URL has been blocked by robots.txt”.

FAQ

Here are some frequently asked questions. If you have additional questions, let me know in the comments or ping me on Twitter (or elsewhere).

Does robots.txt Prevent Indexing?

No, robots.txt does not prevent a page from being indexed. If you want to deindex a page, you must first place a noindex meta tag on it or send an X-Robots-Tag in the HTTP header, and leave the page crawlable.
If you only block the URL in robots.txt, Google could indicate in the coverage report that the URL is indexed but blocked by the robots.txt file.
And if you block Google’s access to a page that carries a noindex directive, it will never see that directive because it cannot crawl the page.
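
To check whether a page actually sends one of these two signals, you could inspect its response along these lines. This is a rough sketch: has_noindex is a hypothetical helper, the headers and HTML are passed in by hand rather than fetched, and the substring test ignores finer points such as directives scoped to a single crawler.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Record whether a <meta name="robots"> tag contains noindex."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            if "noindex" in (attributes.get("content") or "").lower():
                self.noindex = True

def has_noindex(headers: dict[str, str], html: str) -> bool:
    """True if the X-Robots-Tag header or the robots meta tag says noindex."""
    header_value = next(
        (value for name, value in headers.items() if name.lower() == "x-robots-tag"),
        "",
    )
    if "noindex" in header_value.lower():
        return True
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.noindex

print(has_noindex({"X-Robots-Tag": "noindex"}, ""))                       # True
print(has_noindex({}, '<meta name="robots" content="noindex, follow">'))  # True
print(has_noindex({}, "<p>Hello</p>"))                                    # False
```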

Where Is robots.txt in WordPress?

If you use WordPress, your robots.txt should be at the root of your website, i.e.: domain.com/robots.txt.

How to Create a robots.txt in WordPress?

If you use WordPress, it will be automatically generated upon installation.

How to Modify robots.txt in WordPress?

Log in to your site via FTP then modify the robots.txt at the root of your site, probably at the address “/www/your-site/public”. Alternatively, you can use Yoast SEO and go to “Yoast SEO → Tools → File editor”.

What Is the Maximum Size of a robots.txt File?

About 500 KB (kilobytes, also written kB). Google, for example, ignores any content beyond its 500 KiB limit.