A guide to the robots.txt file, which lets us tell Google how to interact with our content and which pages it may index.
What is robots.txt?
It is simply a plain text file placed at the root of our website; that is, we can find it at the root of our domain, e.g. https://www.example.com/robots.txt
This file “suggests” to search engines (that is, to Mr. Google) how they should interact with the content of our website: which URLs they can index and which they can’t, since by default Google will index ALL the content it finds, like crazy. But be careful, because as I said, we are only “suggesting”; the search engine can still do whatever it wants, so don’t trust it too much.
Anyway, search engines usually respect it and use it. In fact, Google even has a section and a testing tool for the robots.txt file in Google Webmaster Tools.
On the other hand, be very careful, because that file is public! In other words, if we have a private page that we do not want indexed and we list it in robots.txt, it is true that it will not be indexed, but it is also true that we are placing it in a file anyone can read. In fact, many hackers start by taking a look there, to see if they find something interesting to attack. So we don’t use robots.txt to hide information, but simply to prevent useless content from being indexed.
User agents (that is, robots)
The first and most typical thing is to indicate the “User-agent”, that is, the robot or search engine we want to address, and what gives meaning to the name “robots.txt”. If we want to address every robot, we will simply write this:
User-agent: *
And done. This tells all robots that the rules we set below apply to them, whether they are from Google, Bing, Yahoo or anywhere else. But if we want to address one in particular, we must use its name. In the case of Google, it is Googlebot.
Note that there are many more bots than we might think. To give you an idea, Google alone already has nine! One for mobile phones, one for videos, one for images… So take a look at who you let in 🙂
Well, now we can ask for what we want. Let’s look at the two most typical directives: Allow and Disallow.
Allow and Disallow in robots.txt
For example, if we want it to NOT index a specific directory, we do it like this:
User-agent: Googlebot
Disallow: /directory/
With that we prevent that URL, and all the content hanging from it, from being indexed. A very typical case is to deindex the admin folders:
User-agent: Googlebot
Disallow: /wp-admin/
Disallow: /wp-includes/
Those two folders always come with WordPress. On the other hand, we should never block the /wp-content/ folder, since the images are usually stored there, and those we DO want indexed.
The two most radical ways to use “Disallow” are to allow access to everything:
User-agent: *
Disallow:
And not allowing access to ANYTHING (note that this makes us disappear from Google):
User-agent: *
Disallow: /
On the other hand, we also have the Allow directive, which lets us permit the indexing of a specific subdirectory or file inside a directory we had blocked from indexing. So, if we have:
User-agent: *
Disallow: /disallowed-directory/
We can create a specific exception, such as:
User-agent: *
Disallow: /disallowed-directory/
Allow: /disallowed-directory/exception/
So, inside the blocked directory, we can carve out an exception that does get indexed, if we want.
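We can check how a parser reads this kind of exception with Python’s standard library. A sketch, using the hypothetical directory names above; note that the stdlib parser applies rules in file order (first match wins), so the Allow line goes before the Disallow line here, while Google itself picks the most specific (longest) matching rule regardless of order:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: block /disallowed-directory/ but allow one subdirectory.
# Allow is listed first because the stdlib parser uses first-match semantics.
rules = """\
User-agent: *
Allow: /disallowed-directory/exception/
Disallow: /disallowed-directory/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Blocked by the Disallow rule:
print(rp.can_fetch("*", "https://example.com/disallowed-directory/page.html"))   # False
# Rescued by the Allow exception:
print(rp.can_fetch("*", "https://example.com/disallowed-directory/exception/x")) # True
```

Handy for sanity-checking a robots.txt before uploading it.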
There is a certain hierarchy of instructions that depends on how specific each User-agent rule is. If multiple instructions could apply, the most specific one wins. For example:
User-agent: *
Disallow: /directory/

User-agent: Googlebot
Allow: /directory/
In this case, Google matches both the first block, since the asterisk addresses all robots, and the second one, which is specific to it. Since the rules contradict each other, the second one wins, because it is more specific.
Alternative: Meta tag robots
An interesting alternative to the robots file is the robots meta tag, which instead of going in that file, goes at the page level:
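A typical example, assuming a page we don’t want indexed, is a tag like this in the page’s head:

```html
<head>
  <!-- Ask robots not to index this particular page -->
  <meta name="robots" content="noindex">
</head>
```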
Unlike the robots.txt file, this tag is not listed in a single public file; it lives in the code of each page. In addition, besides the “noindex” parameter, we can also use “nofollow”, so that the links on that page are not followed.
Crawl-delay in robots.txt
Another useful parameter is Crawl-delay, which tells a robot how long to wait between requests while crawling our site for new content. It works like this:
User-agent: *
Crawl-delay: 3600
Here we are telling all robots to wait 3600 seconds between requests, that is, an hour. Adjusting it can make sense for newspapers or sites that update content several times a day.
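We can verify how a parser reads this value, again with Python’s standard library (a sketch, using the same hypothetical rules):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt asking all bots to wait 3600 s between requests
rules = """\
User-agent: *
Crawl-delay: 3600
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# crawl_delay() returns the delay declared for the given user agent, or None
print(rp.crawl_delay("*"))  # 3600
```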
Although for that, Google already has a crawl-rate tool in Google Webmaster Tools, which I recommend using before fiddling with the robots.txt file.
Another very interesting option of the robots.txt file is the possibility of using wildcards. For example, if we want to block all URLs that contain a question mark, we will do it like this:
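One common way to write this rule, assuming we want it for all robots, is:

```txt
User-agent: *
Disallow: /*?
```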
This is useful if we don’t want URLs with parameters to be indexed, such as search, comments, custom campaigns, etc. In the specific case of WordPress searches, the “s” parameter is used, so we could go into more detail:
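A rule along these lines, assuming WordPress’s default ?s= search parameter, would block just the search result pages:

```txt
User-agent: *
Disallow: /*?s=
```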
This way we make sure that no search page is indexed, even if we have it in an internal or external link, and thus we avoid duplicate content problems.
Another wildcard is the dollar sign $, which anchors the rule to the end of the URL, letting us target any URL that ends with a certain string. For example, if we want to deindex all the files ending in .php, we tell it like this:
User-agent: Googlebot
Disallow: /*.php$
Obviously, we could do this with any extension or string, and it lets us make sure that no file type we don’t want indexed slips through.
The robots.txt file is very convenient and practical, and is even integrated into Webmaster Tools, so it is advisable to have it active, even with just the bare minimum, and then complement it with other tools such as the robots meta tag.
But above all, it is important to make sure we are not blocking any content we do want indexed, because certain plugins can modify that file, and if they do it wrong, things can get messy.
In other words, at least we verify that we have it, that it exists, that Google Webmaster Tools validates it, and we continue with our lives 😉