🤖 Guide to Robots.txt for WordPress

If you’re reading this, you’ve already heard something about the robots.txt file, but before getting into it, it’s important to understand who this file is aimed at. Its main function is to make the work of search engines easier by indicating which web pages they should not waste time visiting.

Search engines discover and organize the content of websites with the aim of showing the user the most relevant content on the results pages (SERPs) for a specific search. To do this, search engines perform three functions:

  • Crawling: bots, also called crawlers or spiders, explore the content of different web pages and discover new pages by following links. But beware! For a website to be crawled, it must be accessible. And this is where robots.txt comes in, as you will see later.
  • Indexing: the crawled content is stored, organized and analyzed. At this point, search engines determine whether or not the page is relevant enough to be shown in search results, with factors such as authority, or indexing control tags such as noindex or canonical, coming into play.
  • Ranking: search engines weight the content and order it in the search results.

What is robots.txt?

Robots.txt is a file that provides a series of instructions about the pages we do not want bots to crawl (and, therefore, not index), in addition to other parameters such as the location of the sitemap.xml, which is the index of the website’s pages.

What is robots.txt for?

When a bot arrives at a website, in principle the first thing it looks for is the robots.txt. If we have a well-configured robots.txt, we make the search engines’ job easier. These are the things you can define in the robots.txt:

  • Blocking parts of our website that we do not want the crawler to crawl or index because they are not relevant. This way, bots will not waste crawl budget crawling pages of our website that we are not interested in indexing, and can focus their efforts on the pages that are most relevant to us.
  • Blocking a website in development. Until now, it was also common to use robots.txt directives to block the crawling of a website under construction that is already online, either by blocking the entire domain or by blocking the folders in which the new website is stored (when we have a published website and host another on the same domain).
  • Blocking internal parts of the website. Our website may have internal management pages that are only accessed by our employees or, in the case of an e-commerce site, by customers who log in. In the case of WordPress, there is also a directory dedicated to site administration, wp-admin, which it makes no sense for the crawler to crawl.
  • Address of the sitemap.xml. As we said before, the sitemap.xml is the index of the pages of our website, so it is useful to tell the bot where to find it in the first file it crawls when it enters our website. But beware! Remember that you also need to have a sitemap.xml generated.
  • Blocking certain bots. In the robots.txt you can set blocks by bot type. This way, we can define which bots we do not want to crawl certain pages or, more commonly, block them from crawling the entire site. This is used to avoid wasting resources on bots that do not interest us, or to avoid crawling by marketing tools that competitors may use to analyze our website. But keep in mind that not all bots respect robots.txt directives (see the combined example after this list).
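To make this concrete, here is a minimal sketch combining several of these uses. The internal path and the sitemap URL are illustrative placeholders, and AhrefsBot is just one example of a third-party tool’s crawler you might choose to block:

# Block one specific bot from the entire site
User-agent: AhrefsBot
Disallow: /

# Block an internal area for all other bots
User-agent: *
Disallow: /internal-area/

# Tell bots where the sitemap lives
Sitemap: https://yourdomain.com/sitemap.xml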

How to generate a robots.txt in WordPress?

When we install WordPress, a robots.txt is automatically generated in the root folder of the website, which you can access by typing “yourdomain.com/robots.txt” into the browser. If you have not configured additional parameters, you will find a file like the following:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

In previous versions of WordPress, the “Discourage search engines from indexing this site” checkbox was intended to prevent content from being indexed while the website was under development. Checking the box caused WordPress to create a robots.txt like the following:

User-agent: *
Disallow: /

What this rule does is block all bots from accessing the entire website. However, WordPress 5.3 changed the way search engines are told not to index the website. Instead of adding a directive to the robots.txt, sites with the “Discourage search engines from indexing this site” option enabled generate a robots meta tag with a noindex value in the page’s head.

This is because, although the usual path is from crawling to indexing to ranking, the truth is that a page blocked by robots.txt can still appear in search results if, for example, links are found pointing to it. In the words of Joost de Valk, founder of Yoast SEO:

“You don’t need to crawl a site to get it listed. If a link points to a page, domain or whatever, Google follows this link. If the robots.txt file on that domain prevents a search engine from crawling that page, it will still show the URL in results.”

Therefore, if what you want is to prevent a page from being indexed, the ideal is to place a noindex robots meta tag (<meta name="robots" content="noindex">) in the header of the content you do not want indexed, since Google will read it when crawling that content.

Robots.txt Rules

The robots.txt, both in WordPress and in any other CMS, is built by specifying, on the one hand, the bot the rules are aimed at and, on the other, the rules that bot must follow. These are the parameters used (a combined example follows the list):

  • User-agent: specifies which bots the rules below are intended for.
  • Disallow: the rule that blocks access. After “Disallow: ”, the path we want to block is set.
  • Allow: the rule that creates exceptions to Disallow. That is, it adds paths that the bot will be able to access even if they are inside previously excluded folders. After “Allow: ”, the path we want to unblock is set.
  • Crawl-delay: determines the time between the requests the bot makes to the website; however, its effectiveness is almost nil, since bots like Googlebot do not take this directive into account.
  • Sitemap: Specifies the path where the website’s sitemap is located.
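As a sketch of how these parameters fit together (the paths, the delay value and the sitemap URL are placeholders you would adapt to your own site):

User-agent: *
Disallow: /private/
Allow: /private/public-page.html
Crawl-delay: 10
Sitemap: https://yourdomain.com/sitemap.xml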

Additionally, there are wildcard characters that will help you configure your robots.txt (a combined sketch follows the list):

  • Asterisk (*): a wildcard used to mean “all”. For example:
    • User-agent: * > we are indicating that the rules will apply to all bots.
    • Disallow: / > we are indicating that all directories will be blocked.
  • Dollar ($): used with file extensions to indicate that the rule applies to all files ending in a specific extension. For example:
    • /*.html$ > we are indicating that the rule will apply to all .html files.
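As an illustrative sketch combining both wildcards (the patterns are hypothetical examples, not recommendations):

User-agent: *
# Block any URL that contains a query string
Disallow: /*?
# Block all URLs ending in .pdf
Disallow: /*.pdf$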

If nothing is specified in the robots.txt, it is understood that bots are allowed to crawl your entire website.

Robots.txt Rules for WordPress

In addition to the specific rules that each website requires, there are some common rules that are usually used in WordPress:

# Basic block
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /xmlrpc.php

With these rules we are blocking CMS resources that are not directly related to the content (which lives in the wp-content folder). However, blocking /wp-includes/ can cause problems in Google Search Console due to blocked resources. To allow Google to crawl the CSS and JavaScript files that also render and manage the content, you must add “Allow” rules for these resources:

Allow: /wp-includes/*.js
Allow: /wp-includes/*.css
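Putting the above together, the resulting file could look like the following sketch; it is only a starting point, and the sitemap URL is a placeholder to replace with your own:

User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /xmlrpc.php
Allow: /wp-admin/admin-ajax.php
Allow: /wp-includes/*.js
Allow: /wp-includes/*.css

Sitemap: https://yourdomain.com/sitemap.xml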

Edit WordPress robots.txt manually

As we have seen, WordPress automatically creates a robots.txt on your website, but now that you know more about this file, you may want to create one tailored to your website.

Generally, this file is located in the root folder of the domain, which you will find on your FTP as www or public_html. To edit this robots.txt, you just have to create a custom one from scratch and upload it to the root folder of your website via FTP to replace the previous one.

Edit robots.txt in WordPress with a plugin

Although it is very easy to edit the robots.txt file manually, if you do not want to touch the guts of your website, you can do it using a plugin. There are countless plugins that edit the robots.txt. Here are some of the most used:

Robots.txt in Yoast SEO

If you are involved in the WordPress SEO world, you surely already know this plugin. It is the most widely used SEO plugin due, among other things, to its ease of use. To access the robots.txt, go to the SEO tab > Tools > File editor (if this option does not appear, check that you have full permissions in the plugin). By clicking on “Create robots.txt”, you will access the robots.txt editor without leaving your dashboard. Once you’ve entered the rules you want, hit “Save changes to robots.txt” and voilà! This robots.txt will override any rules you previously had in the robots.txt of your root folder.

Robots.txt in All in One SEO

It is another of the most popular SEO plugins and, as you would expect, it also includes the option to edit the robots.txt from the WordPress interface. It is, if anything, even easier than with Yoast SEO. You can do it by going to “Utility Manager” in the left menu > Robots.txt.

You only have to select the type of rule, enter the bot it is aimed at and the path you want to block or unblock. In addition, this plugin allows you to block malicious bots directly. Simple, right?

Test robots.txt

As we have mentioned before, to see the robots.txt you only have to access yourdomain.com/robots.txt. But if, in addition to seeing it, you want to test it, you can do so with Google’s robots.txt Tester. It is a tool integrated into Google Search Console, so you will need an account and verified ownership of the property in GSC.

Once logged in, choose the property and the robots.txt of the website in question will appear. But the most useful thing about this tool is that it lets you check whether a specific URL is blocked by the rules in your robots.txt, which is very useful when you are using general patterns or many rules.

Conclusion

If what you are looking for is to improve the ranking and visibility of your website, you must make sure that bots, and especially Googlebot, crawl the content you are most interested in showing. And the robots.txt can help you with that. Just make sure you set it up correctly. And remember:

1) Just because you include a folder in your robots.txt file doesn’t mean Google won’t index it. If Google finds links pointing to that page, it can still show its URL in the search results.
