I am always surprised at how the vast majority of things that I see during audits are simple “SEO 101” infractions. But then again, it makes sense; clients come to us because they aren’t SEOs themselves. So in today’s blog post I am going to cover what many SEOs consider basic SEO, with a few advanced tips that I often see overlooked.
Today we are talking about the infamous robots.txt file. No other element within an SEO campaign can do so much for your rankings, both positive and negative. One small error can screw up your entire approach, but execute it correctly and you can control the engines like a dog on a leash. Therefore it is incredibly important to get it right the first time.
Top to Bottom
A commonly misunderstood aspect of every robots.txt file is how the search engines actually read it. When a search engine crawler fetches a robots.txt file, it reads the document from top to bottom. This means that when a syntax error or any other problem is present, the crawler may ignore all directives below the error. The lesson here: if you are unsure of your syntax, or you are attempting something “unique”, put it at the bottom of the file so that if an error is found, the other directives will not be ignored.
Use Wildcards Correctly
The wildcard character (*) can be very handy. With it you can write simple statements that disallow patterns found in URLs. However, if used incorrectly it can screw everything up. One important thing to remember is that not all search engine crawlers support wildcards. Because of this, it’s a good idea to put any wildcard statements at the bottom of the file, so that a crawler that trips over them does not ignore the other directives, as discussed above.
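To make the pattern matching concrete, here is a minimal Python sketch of how a wildcard-aware crawler might test a URL path against a Disallow pattern. It assumes the common extension where * matches any run of characters and a trailing $ anchors the end of the URL. The function names and sample paths are made up for illustration, and real crawlers may differ in the details.

```python
import re


def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a robots.txt path pattern into a compiled regex.

    Assumption: '*' matches any run of characters and a trailing '$'
    anchors the match at the end of the URL path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with '.*'.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_blocked(path: str, disallow_pattern: str) -> bool:
    """Return True if the path matches the Disallow pattern."""
    return pattern_to_regex(disallow_pattern).match(path) is not None


# Hypothetical patterns: block session IDs anywhere, and block PDF files.
print(is_blocked("/shop/item?sessionid=123", "/*?sessionid="))
print(is_blocked("/files/report.pdf", "/*.pdf$"))
print(is_blocked("/files/report.pdf?download=1", "/*.pdf$"))
```

Note how the trailing $ changes the result for the last path: without it, the same pattern would also catch PDF URLs that carry a query string. A crawler with no wildcard support would instead read the * as a literal character, which is why these patterns deserve extra care.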
I know that for some of you this is going to sound obvious. But it’s important to remember that the robots.txt file is only used to “block” or disallow crawlers from sections of a site. It is not intended to point crawlers in the direction of URLs that should be indexed; that’s what sitemaps are for. I mention this only because during several audits I have seen robots.txt files contain the directive “Allow: /example/”. This “Allow:” directive does not exist and will only cause errors. UPDATE (4/19/2013): @Zen2Seo pointed out that Googlebot does accept the Allow: directive. However, it should only be used to allow subdirectories of directories that have been previously blocked. It does not help the crawler find new URLs. Also, not every user agent is guaranteed to support this directive.
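Here is a small sketch of that one legitimate use of Allow: re-opening a subdirectory inside a blocked directory. It uses Python’s standard-library urllib.robotparser as a stand-in for a crawler, so it is not a statement about how Googlebot itself resolves rules. The paths and example.com are placeholders. The Allow line is listed first because this particular parser applies the first rule that matches; other crawlers may resolve conflicts differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: /private/ is blocked, one subdirectory is re-opened.
ROBOTS_TXT = """\
User-agent: *
Allow: /private/press-kit/
Disallow: /private/
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt text held in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

for url in (
    "https://example.com/private/press-kit/logo.png",  # carved out by Allow
    "https://example.com/private/notes.html",          # still blocked
    "https://example.com/blog/",                       # never blocked
):
    print(url, parser.can_fetch("Googlebot", url))
```

The /blog/ URL is fetchable without any Allow line at all, which is the point: Allow only carves exceptions out of a Disallow, it does not advertise URLs.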
Use Line Breaks
Search engine crawlers read robots.txt files in segments. First the user agent is defined, and then the following block contains the Disallow directives that are associated with that user agent. The safest format is to define the user agent and then put each Disallow statement on its own line directly below it. If a new user agent needs to be defined, a blank line should separate the last Disallow statement from the new user agent line. Without the proper use of line breaks, errors will be created and the remaining directives will be ignored.
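As a rough illustration of the segment idea, the sketch below feeds a two-group file to Python’s standard-library urllib.robotparser. The bot name OtherBot, the paths, and example.com are placeholders. Parsers differ in how strictly they treat blank lines, so the sample keeps each User-agent line directly above its rules and uses a blank line only between groups.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file with two groups, separated by a single blank line.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /drafts/

User-agent: *
Disallow: /drafts/
Disallow: /tmp/
"""


def load_rules(text: str) -> RobotFileParser:
    """Parse robots.txt text held in memory (no network access)."""
    rules = RobotFileParser()
    rules.parse(text.splitlines())
    return rules


rules = load_rules(ROBOTS_TXT)

# Googlebot is matched by its own group, so only /drafts/ is off limits.
print(rules.can_fetch("Googlebot", "https://example.com/tmp/cache.html"))
# Any other bot falls back to the * group, which also blocks /tmp/.
print(rules.can_fetch("OtherBot", "https://example.com/tmp/cache.html"))
```

Each bot only obeys the group that names it, so a rule meant for everyone has to be repeated in every group, as /drafts/ is here.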
I took a creative writing course in college once. I remember one of my first assignments was handed back to me with the words “KISS this.” written in red. I couldn’t understand what the heck my professor was talking about, so I asked her after class. She explained that KISS stood for Keep It Super Simple. Apparently, I was way too “wordy” in my assignment and needed to trim some of the adjectives. When optimizing a robots.txt file, my best advice is to KISS it. The more complicated the file, the more likely an error will occur. A few tips to keep things simple: Don’t use the robots.txt file to block individual URLs one at a time. If you need to block just a handful of specific URLs, use the NOINDEX meta robots tag on the page itself. If you need to use a wildcard, think about the simplest way to execute it. Keeping things simple will cut down on the likelihood that mistakes are made, and will create a smaller robots.txt file for faster processing.
All of the tips above are geared towards avoiding serious errors. However, no one is perfect and everyone’s work should be checked. That is why, when I write a robots.txt file, I use a validator to check my work.
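If you want a quick sanity check before reaching for a full validator, something like the Python sketch below can catch the most common slips. It is a toy under stated assumptions, not a replacement for a real validator: the list of known fields, the function name lint_robots, and the warning messages are all my own choices for illustration.

```python
from typing import List

# Assumed set of field names worth accepting; adjust to taste.
KNOWN_FIELDS = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay"}


def lint_robots(text: str) -> List[str]:
    """Return human-readable warnings for a robots.txt body."""
    warnings = []
    seen_user_agent = False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if ":" not in line:
            warnings.append(f"line {number}: missing ':' separator")
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field not in KNOWN_FIELDS:
            warnings.append(f"line {number}: unknown field '{field}'")
        elif field == "user-agent":
            seen_user_agent = True
        elif field in ("disallow", "allow"):
            if not seen_user_agent:
                warnings.append(
                    f"line {number}: rule appears before any User-agent"
                )
            if value and not value.startswith(("/", "*")):
                warnings.append(
                    f"line {number}: path '{value}' should start with '/'"
                )
    return warnings


# Hypothetical file with a misspelled field and a missing colon.
sample = "User-agent: *\nDisalow: /tmp/\nDisallow private\nDisallow: /ok/\n"
for warning in lint_robots(sample):
    print(warning)
```

A check like this only looks at syntax. It cannot tell you whether a rule blocks the URLs you meant to block, so it is still worth testing real URLs against the finished file.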
Until next time, happy roboting!