One of the worst things a self-respecting SEO can do is feel out of control. “It doesn’t matter what I do, Google will decide how my site ranks in the end.” That’s dangerous talk, for sure. There are plenty of straightforward tasks every site owner can perform to take control of their SEO destiny (or rankings, if you prefer). Of those tasks, the creation and correct implementation of a robots.txt file is among the most important.
What is a robots.txt file? In its simplest form, the robots.txt file is a plain text file that tells search engine spiders which directory paths and pages should and shouldn’t be crawled. With this one document, you can communicate with all web crawlers at once, or address each individual crawler separately to pass along specific instructions.
Web site owners use the /robots.txt file to give instructions about their site to web robots; this is called The Robots Exclusion Protocol.
Why is it important to use robots.txt? Put simply, not every page of your website needs to be crawled and potentially ranked in the SERPs. That could mean utility pages (like a Terms of Service) or a directory full of information your visitors would have no use for. The robots.txt file lets you Allow or Disallow the crawling of that content. To think of it another way, robots.txt can be used to “shape” how the search engines see your site – restricting non-essential content so that the robots’ energy is spent crawling the pages you actually want indexed and ranked in the SERPs. And if all of that weren’t enough, Google’s official Webmaster Guidelines explicitly recommend its use.
Make use of the robots.txt file on your web server. This file tells crawlers which directories can or cannot be crawled.
Basic Creation of a Robots.txt File
The syntax of a very basic robots.txt uses two directives: User-agent and Disallow. The first designates which robot you’re communicating with (e.g. Googlebot, Slurp, etc.), or addresses all of them at once (with an asterisk, *). The second lists the paths you want blocked.
These two lines are considered a single entry in the file. You can include as many entries as you want. You can include multiple Disallow lines and multiple user-agents in one entry.
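A minimal file putting those pieces together might look like this (the directories below are purely illustrative):

```
# This entry applies to every crawler
User-agent: *
Disallow: /admin/

# This entry applies only to Yahoo's crawler; note the multiple Disallow lines
User-agent: Slurp
Disallow: /drafts/
Disallow: /testing/
```

Each blank-line-separated block is one entry: a User-agent line (or several) followed by the Disallow lines that apply to it.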
Using the Disallow directive, you can block your entire site (Disallow: /), a directory (Disallow: /directory/), a page (Disallow: /page.html), images, and even specific file types (Disallow: /*.gif$). Note that wildcard patterns using * and $ are extensions honored by the major search engines rather than part of the original robots.txt standard.
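Those patterns look like this in practice (the paths are examples, not recommendations):

```
User-agent: *
# To block the whole site, a single rule suffices
# (it would make the more targeted rules below redundant):
# Disallow: /

Disallow: /directory/        # block one directory
Disallow: /page.html         # block one page
Disallow: /images/photo.jpg  # block one image
Disallow: /*.gif$            # block every URL ending in .gif
```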
Things to Watch Out For
As with most things SEO, there are several pitfalls and oddities to avoid when using a robots.txt file. Here are some of the most important:
Don’t assume that the robots.txt will 100% block your pages from being indexed by the search engines.
- If the page you’re blocking is linked to from other pages, it can still be indexed (as a “URL-only listing”). Add a meta noindex tag to the <head> of these pages to be absolutely sure.
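The noindex tag mentioned above is a standard HTML meta element placed in the page’s head:

```html
<head>
  <meta name="robots" content="noindex">
</head>
```

One caveat: a crawler can only see this tag if it is allowed to fetch the page, so don’t block a page in robots.txt if you’re relying on its noindex tag to keep it out of the index.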
- Be sure to follow robots.txt syntax exactly so that the search engines interpret your instructions correctly; otherwise you may accidentally block important pages. Detailed syntax references are available in Google’s Webmaster documentation and at robotstxt.org.
- If you use both a “global” entry for all search engines and a specific entry for Googlebot (for example) in your robots.txt, any Disallow statements made in the global entry must be repeated in the Googlebot entry. Each crawler obeys only the single most specific User-agent entry that matches it, so beware.
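You can sanity-check this behavior with Python’s built-in urllib.robotparser module (the rules and URLs below are hypothetical):

```python
from urllib import robotparser

# A hypothetical robots.txt with a global entry and a Googlebot-specific entry.
rules = """
User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /tmp/
""".splitlines()

parser = robotparser.RobotFileParser()
parser.parse(rules)

# Googlebot matches its own entry, so the global Disallow does NOT apply to it:
print(parser.can_fetch("Googlebot", "https://example.com/private/page.html"))     # True
# A crawler with no specific entry falls back to the global one:
print(parser.can_fetch("SomeOtherBot", "https://example.com/private/page.html"))  # False
```

To keep Googlebot out of /private/ as well, you would repeat that Disallow line inside the Googlebot entry.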
So, as you can see, the robots.txt file is an important weapon in your SEO arsenal. You should never feel that the way search engines crawl and view your website is out of your control. All you need to do is create a basic text file, add a few statements, and voilà! You’ve taken a step towards being a smart, savvy SEO.