October 20, 2024

Robots.txt Guide for SEO

An optimized robots.txt strategy improves SEO, and blocking unnecessary URLs is one of the most critical steps in that strategy.

Robots.txt plays an essential role in SEO strategy. Beginners often make mistakes when they do not understand how robots.txt works on a website.

It controls your website’s crawlability and, by extension, its indexability.

An optimized Robots.txt file can significantly improve your website’s crawling and indexing.

Google also recommends using robots.txt to block action URLs such as login, signup, checkout, and add-to-cart.


But how do you do it the right way?

Here is everything!

What is Robots.txt?

The robots.txt file is a plain-text file that you place in your website’s root folder. It tells search engine crawlers which parts of your website they may crawl.

Robots.txt uses four critical directives (a combined example follows this list):

  1. User-agent: Specifies which crawlers the rules apply to, whether every crawler or only a few targeted ones.
  2. Disallow: Pages you do not want search engines to crawl.
  3. Allow: Pages or sections of the website that you want crawled, even inside a disallowed path.
  4. Sitemap: The URL of your XML sitemap.
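
Taken together, a minimal robots.txt could look like the sketch below (the paths and sitemap URL are placeholders, not recommendations for your site):

User-agent: *
Disallow: /private/
Allow: /private/public-page/
Sitemap: https://www.example.com/sitemap.xml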

Robots.txt rules are case-sensitive, so /Login/ and /login/ are treated as different paths.

Robots.txt Hierarchy:

Robots.txt should follow a clear, consistent structure.
The most common order of rules is as follows:

  1. User-agent: *
  2. Disallow: /login/
  3. Allow: /login/registration/

The first line applies the rules to all search engine crawlers.

The second line disallows search bots from crawling login pages or URLs.

The third line allows the registration page to be crawled even though it sits inside the disallowed /login/ path.

Conflicting Robots.txt rules:

User-agent: *
Disallow: /login/
Allow: /login/

In this format, both rules match the /login/ path with equal specificity, so Google follows the least restrictive rule and will still crawl the login URL.

Importance of Robots.txt:

Robots.txt helps optimize your crawl budget. When you block unimportant pages, Googlebot spends its crawl budget only on relevant pages.

Search engines prefer an optimized crawl budget, and robots.txt makes that possible.

For example, you may have an eCommerce website where checkout, add-to-cart, filter, and category pages do not offer unique value. Such pages are often treated as duplicate content, so it is best not to waste your crawl budget on them.

Robots.txt is the best tool for this job.
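
For example, here is a sketch of such rules for a typical eCommerce site; the exact paths and parameter names depend on your platform, so adjust them before using:

# hypothetical eCommerce action and filter URLs
Disallow: /cart/
Disallow: /checkout/
Disallow: *add-to-cart=*
Disallow: *?filter=*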

When Should You Use Robots.txt?

It is always a good idea to use robots.txt on your website. Use it to:

  • Block unnecessary URLs such as categories, filters, internal search, cart, etc.
  • Block private pages.
  • Block irrelevant JavaScript files.
  • Block AI Chatbots and content scrapers.

How to Use Robots.txt to Block Specific Pages?

Block Internal Search Results:

You do not want your internal search results indexed, and these URLs are easy to block.

Just open your robots.txt file and add the following line:

Disallow: *s=*

This line disallows search engines from crawling any URL containing the s= parameter; the asterisks are wildcards that match any sequence of characters, so check that the pattern does not also catch parameters you want crawled.

Block Custom Navigation:

Custom navigation is a feature that you add to your website for users.

Most e-commerce websites allow users to create “Favorite” lists, which are displayed as navigation in the sidebar.

Faceted navigation, built from sorting and filtering options, creates similar parameterized URLs.

Just open your robots.txt file and add the following lines:

Disallow: *sortby=*
Disallow: *favorite=*
Disallow: *color=*
Disallow: *price=*

Block Doc/PDF URLs:

Some websites upload documents in PDF or .doc formats.

You do not want them to be crawled by Google.

Here is the code to block doc/pdf URLs:

Disallow: /*.pdf$
Disallow: /*.doc$
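
The $ sign anchors the match to the end of the URL, so the .doc rule above will not catch .docx files. If you also host .docx documents, a similar rule can be added:

Disallow: /*.docx$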

Block a Website Directory:

You can also block entire website directories, such as a forms directory.

For example, add this line to your robots.txt file to block the forms directory:

Disallow: /form/
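
The same pattern works for other directories, such as user or chat sections; the paths below are hypothetical, so match them to your actual directory names:

Disallow: /users/
Disallow: /chat/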

Block User Accounts:

You usually do not want user account pages indexed in search results.

Add this line to robots.txt:

Disallow: /myaccount/

Block Irrelevant JavaScript:

Add a simple line to block JavaScript files that search engines do not need, such as tracking scripts:

Disallow: /assets/js/pixels.js

Block Scrapers and AI Chatbots:

Many site owners also block AI chatbots and content scrapers so that their content is not used to train AI models.

Add this code to your Robots.txt file:

#ai chatbots
User-agent: anthropic-ai
User-agent: Applebot-Extended
User-agent: Bytespider
User-agent: CCBot
User-agent: ChatGPT-User
User-agent: ClaudeBot
User-agent: cohere-ai
User-agent: Diffbot
User-agent: FacebookBot
User-agent: GPTBot
User-agent: ImagesiftBot
User-agent: Meta-ExternalAgent
User-agent: Meta-ExternalFetcher
User-agent: Omgilibot
User-agent: PerplexityBot
User-agent: Timpibot
Disallow: /

To block scrapers, add this code:

#scrapers
User-agent: magpie-crawler
User-agent: omgilibot
User-agent: Node/simplecrawler
User-agent: Scrapy
User-agent: CCBot
User-agent: omgili
Disallow: /

Add Sitemap URLs:

Use the Sitemap directive to point search engines to your XML sitemaps:

Sitemap: https://www.newexample.com/sitemap/articlesurl.xml
Sitemap: https://www.newexample.com/sitemap/newsurl.xml
Sitemap: https://www.newexample.com/sitemap/videourl.xml

Crawl Delay:

Google ignores the Crawl-delay directive, but some other search bots respect it. It tells a bot to wait a specified number of seconds between requests.
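
For example, assuming you want a bot that supports the directive, such as Bingbot, to wait 10 seconds between requests, the rule would look like this:

User-agent: Bingbot
Crawl-delay: 10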

Google Search Console Robots.txt Validator

  • Go to Google Search Console.
  • Click on “Settings.”
  • Open the “robots.txt” report.
  • Click “Request a recrawl” if you have updated the file.

Search Console will fetch and validate your robots.txt file and report any errors it finds.

Conclusion:

Robots.txt is an important tool for optimizing the crawl budget. It impacts your website’s crawlability, which in turn impacts the indexing in search results.

Block unnecessary pages to allow Googlebot to spend time on valuable pages.

Save crawl resources with an optimized robots.txt file.
