The other day, I was looking at my Google Webmaster Tools account and spotted something strange - a whole bunch of pages on this domain blocked from being crawled due to robots.txt.
Strange, I thought. I don't have a robots.txt file on my domain. However, I went to one of the other neat tools Google Webmaster Tools offers - the robots.txt viewer - and found that this is the contents of my robots.txt:
User-agent: *
Disallow: /search
Sitemap: http://www.sephyroth.net/feeds/posts/default?orderby=updated
At first, I was surprised, since I don't have the access to my domain needed to put a robots.txt there. Then I was annoyed that it existed without my permission, and that there is nothing I can do to change or remove it (short of moving to a proper hosting arrangement).
But then I thought about it for a few seconds, and it clicked. What the robots.txt file does, in this case, is stop any spiders from crawling the contents of the search folder on my domain. This folder only contains duplicates of posts that are already being indexed. The reality is that this is a good thing since it will give more accurate results for people using search to get to my site. A perfect example of this was a person searching for words in the content of this post (in which the video’s already been removed from YouTube). Instead of coming back with just that post page, they were given a link to my music label.
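To see the effect of that Disallow rule, here's a small sketch using Python's standard urllib.robotparser, fed the same rules Blogger serves (the example URLs are hypothetical post and search paths, chosen only for illustration):

```python
from urllib.robotparser import RobotFileParser

# The rules Blogger serves for this domain (minus the Sitemap line,
# which doesn't affect crawl permissions).
rules = """User-agent: *
Disallow: /search
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Anything under /search - label and search-result pages - is blocked...
print(parser.can_fetch("*", "http://www.sephyroth.net/search/label/music"))
# ...while ordinary post pages remain crawlable.
print(parser.can_fetch("*", "http://www.sephyroth.net/2008/01/some-post.html"))
```

The first check comes back False and the second True, which is exactly the behavior described above: the duplicate listings under /search disappear from the index while the canonical post pages stay in.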
By adding the disallow on /search, results like that will no longer be possible, unless the person searching clicks on one of the labels for that post. That also means less confusion, because you will not see one result from the post page, one from (possibly) a "monthly archive" page, and more from any label pages the post might be on.
However, even though I think it's a good thing to have this in place, the restrictions on spiders in robots.txt should definitely be left up to each blog's owner to choose. Incidentally, you'll notice the Sitemap line - I think that's a good thing to have in there, especially if you have a blog that is somewhat complex. Verifying my pages as mine is something I'd been having issues with in Google Webmaster Tools.
In the end, all I want is for us to have control over our blogs' robots.txt files. If that cannot happen, then the Blogger people should be upfront about why it is not possible. There's no reason to hide behind curtains when you make policy changes that affect every person who has signed up for your service.