October 20, 2024

Robots.txt Guide for SEO

An optimized robots.txt strategy improves SEO. Blocking unnecessary URLs is one of the most critical steps in that strategy.

Robots.txt plays an essential role in SEO strategy. Beginners tend to make mistakes when they do not understand how robots.txt works on their websites.

It controls your website’s crawlability, which in turn affects its indexability.

An optimized Robots.txt file can significantly improve your website’s crawling and indexing.

Google has also recommended using robots.txt to block action URLs such as login, signup, checkout, and add-to-cart.


But how do you do it the right way?

Here is everything you need to know.

What is Robots.txt?

The robots.txt file is a plain text file that you place in your website’s root folder. It tells crawlers which parts of your website they are allowed to crawl.

A robots.txt file contains four critical directives:

  1. User-agent: Specifies whether the rules apply to every crawler or only to specific, targeted crawlers.
  2. Disallow: Pages you do not want search engines to crawl.
  3. Allow: Pages or sections of the website that you do want crawled.
  4. Sitemap: The URL of your XML sitemap.

The robots.txt file is case-sensitive: /Login/ and /login/ are treated as different paths.
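
A minimal robots.txt file that uses all four directives could look like this (the domain and paths here are only placeholders):

User-agent: *
Disallow: /checkout/
Allow: /checkout/help/
Sitemap: https://www.newexample.com/sitemap.xml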

Robots.txt Hierarchy:

Robots.txt rules should follow a clear, consistent order.
The most common robots.txt order is as follows:

  1. User-agent: *
  2. Disallow: /login/
  3. Allow: /login/registration/

The first line applies the rules to every crawler.

The second line disallows search bots from crawling login pages or URLs.

The third line allows the registration page to be crawled.

Simple Robots.txt rule:

User-agent: *
Disallow: /login/
Allow: /login/

In this format, search engines will still access the /login/ URL: when an Allow rule and a Disallow rule match a path with equal specificity, Google applies the least restrictive rule.

Importance of Robots.txt:

Robots.txt helps optimize your crawl budget. When you block unimportant pages, Googlebot spends its crawl budget only on relevant pages.

Search engines favor sites that use their crawl budget efficiently, and robots.txt makes that possible.

For example, you may have an eCommerce website where checkout, add-to-cart, filter, and category pages do not offer unique value. Such pages are often treated as duplicate content, so it is best not to spend your crawl budget on them.

Robots.txt is the best tool for this job.
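
As a sketch, the rules for such an eCommerce site could look like the following (the directory and parameter names are placeholders and will differ from platform to platform):

User-agent: *
Disallow: /cart/
Disallow: /checkout/
Disallow: *?filter=*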

When Must You Use Robots.txt?

It is always necessary to have a robots.txt file on your website. Use it to:

  • Block unnecessary URLs such as categories, filters, internal search, cart, etc.
  • Block private pages.
  • Block irrelevant JavaScript files.
  • Block AI Chatbots and content scrapers.

How to Use Robots.txt to Block Specific Pages?

Block Internal Search Results:

You do not want your internal search results to be indexed. Blocking these action URLs is pretty easy.

Just go to your robots.txt file and add the following code:

Disallow: *s=*

This line disallows search engines from crawling any URL whose query string contains “s=”, which covers internal search URLs such as ?s=keyword.

Block Custom Navigation:

Custom navigation is any navigation feature that you add to your website for users, such as favorite lists, sorting options, or filters.

Most e-commerce websites allow users to create “Favorite” lists, which are displayed as navigation in the sidebar.

Users can also create faceted navigation URLs through sorting and filter options.

Just go to your robots.txt file and add the following code:

Disallow: *sortby=*
Disallow: *favorite=*
Disallow: *color=*
Disallow: *price=*

Block Doc/PDF URLs:

Some websites upload documents in PDF or .doc formats.

You do not want them to be crawled by Google.

Here is the code to block .pdf and .doc URLs (the $ sign matches the end of the URL):

Disallow: /*.pdf$
Disallow: /*.doc$

Block a Website Directory:

You can also block entire website directories, such as a forms directory.

For example, add this line to your robots.txt file to block your forms directory; use the same pattern for other directories, such as user or chat pages:

Disallow: /form/

Block User Accounts:

You do not want user account pages to be indexed in search results.

Add this code to your robots.txt file:

Disallow: /myaccount/

Block Irrelevant JavaScript:

Add a simple line to block JavaScript files that are not relevant to rendering your pages, such as tracking pixels. Be careful not to block scripts that Google needs to render your content.

Disallow: /assets/js/pixels.js

Block Scrapers and AI Chatbots:

Many websites block AI chatbots and scrapers in robots.txt so that their content is not collected for training or republishing without permission.

Add this code to your Robots.txt file:

#ai chatbots
User-agent: anthropic-ai
User-agent: Applebot-Extended
User-agent: Bytespider
User-agent: CCBot
User-agent: ChatGPT-User
User-agent: ClaudeBot
User-agent: cohere-ai
User-agent: Diffbot
User-agent: FacebookBot
User-agent: GPTBot
User-agent: ImagesiftBot
User-agent: Meta-ExternalAgent
User-agent: Meta-ExternalFetcher
User-agent: Omgilibot
User-agent: PerplexityBot
User-agent: Timpibot
Disallow: /

To block scrapers, add this code:

#scrapers
User-agent: magpie-crawler
User-agent: omgilibot
User-agent: Node/simplecrawler
User-agent: Scrapy
User-agent: CCBot
User-agent: omgili
Disallow: /

Add Sitemap URLs:

Use the Sitemap directive in robots.txt to point search engines to your XML sitemaps.

Sitemap: https://www.newexample.com/sitemap/articlesurl.xml
Sitemap: https://www.newexample.com/sitemap/newsurl.xml
Sitemap: https://www.newexample.com/sitemap/videourl.xml

Crawl Delay:

The Crawl-delay directive is ignored by Google but supported by some other search bots. It tells a bot to wait a specific number of seconds between requests.
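
As a sketch, the following rule asks any bot that honors the directive to wait ten seconds between requests (the number is only an illustration):

User-agent: *
Crawl-delay: 10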

Google Search Console Robots.txt Validator

  • Go to Google Search Console.
  • Click on “Settings.”
  • Under “Crawling,” open the “robots.txt” report.
  • If you have updated the file, click “Request a recrawl.”

The report shows whether Google can fetch your robots.txt file and highlights any errors or warnings.

Conclusion:

Robots.txt is an important tool for optimizing the crawl budget. It impacts your website’s crawlability, which in turn impacts the indexing in search results.

Block unnecessary pages to allow Googlebot to spend time on valuable pages.

Save crawl resources with an optimized robots.txt file.
