Welcome to SEO Boy, the authority on search engine optimization: how-to articles, industry news, insider tips, and more! If you like what you see, you can receive free daily updates via email or RSS.

Just Grin and Bear It, Your Website Needs a Robots.txt File

Websites are dynamic creations, with loads of content lurking around every dark corner. Some of those dark corners are portions of your site that weren’t created with search engines in mind. This poses a unique challenge: how can you instruct search engines to ignore that content and leave it out of their index of your site? The answer is to create a Robots.txt file. Don’t let the name intimidate you: a Robots.txt file is just a simple text file, placed in the root of your site, that tells the search bots what content should and shouldn’t be indexed.

Anatomy of a Robots.txt File

The basic structure of a Robots.txt file contains two parts: the “User-Agent” and “Disallow” statements. Together, these statements tell search engine robots precisely which pages they should ignore (not index) when crawling your website.

User Agent

There are two paths you can follow when implementing the User-Agent portion of your Robots.txt file. First, you can choose to have ALL search engines follow the same rules. You do this by entering an asterisk (*), which acts as a wildcard entry.

  • Example: User-Agent: *

The second path you can follow involves separating the search bots out individually, providing different instructions for each search engine.

  • Example: User-Agent: Googlebot (Google) or User-Agent: Slurp (Yahoo!)

Disallow

This is where the fun begins. The Disallow statement lets you insert commands to block directories, pages/files, images, and even your entire website if need be. If nothing is listed, all URLs are fair game to be crawled. Here are examples of Disallow statements at work:

  • Block entire website: Disallow: /
  • Block a directory: Disallow: /directoryname/
  • Block a page: Disallow: /page.html
  • Block specific file types: Disallow: /*.gif$ (exchange .gif for whatever file type you need blocked; note that the * and $ wildcards are extensions honored by major engines such as Google and Bing, not part of the original Robots.txt standard)
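If you want to sanity-check a set of Disallow rules before publishing them, Python’s standard-library urllib.robotparser can evaluate them for you. This is just a sketch: the paths are made-up examples, and the module implements the original Robots.txt standard, so the wildcard pattern from the last bullet is left out.

```python
# Sketch: verify simple Disallow rules with Python's standard-library
# urllib.robotparser. The file contents and paths below are hypothetical.
from urllib import robotparser

ROBOTS_TXT = """\
User-Agent: *
Disallow: /admin/
Disallow: /private-page.html
"""

parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Anything not matched by a Disallow rule is allowed by default.
for path in ("/index.html", "/admin/settings.html", "/private-page.html"):
    verdict = "allowed" if parser.can_fetch("*", path) else "blocked"
    print(path, "->", verdict)
```

Here can_fetch answers the same question a compliant crawler asks: /index.html comes back allowed, while the two disallowed paths come back blocked.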


Why Should You Care About Robots.txt?

The Robots.txt protocol gives you control over which of your site’s content the search engines index. Put another way: by telling the search engines which content is non-essential, you focus their crawling and indexing attention on your most important pages. Specific examples of content you may want to block in your Robots.txt file include privacy statements, terms of use, other “utility” pages, administrative functions, and in general any content you don’t want showing up in the SERPs.

Robots.txt Myths Explained

MYTH: Your site needs a Robots.txt file in order to be indexed.
FACT: No, your site will be indexed whether you create a Robots.txt file or not. A Robots.txt file will not draw robots to your site any faster than normal.

MYTH: Your site needs a Robots.txt file in order to rank higher.
FACT: No, your Robots.txt file only tells the robots which pages and links can or cannot be crawled. It can, however, have a secondary effect on your site’s rankings: by keeping crawlers away from unimportant content, you improve your site’s crawlability, and better crawlability can improve your rankings.

MYTH: You can block pages completely by using “Disallow” statements.
FACT: No, though the Disallow statement is powerful, it cannot guarantee a 100% invisible page. Robots.txt is advisory: well-behaved search engines honor it, but rogue bots are free to ignore it. Just as important, Robots.txt files block crawling, not indexing. If other sites link to a disallowed URL, the search engines can still list that URL in their results without ever crawling the page. That’s an important distinction to remember. If you want to keep a page out of the index entirely, consider the Meta Robots tag instead, e.g. <meta name="robots" content="noindex, nofollow"> (and note that the page must remain crawlable for the bots to see that tag).

MYTH: The more bots accessing your site the better.
FACT: No, some search bots are out there simply to scour your site for e-mail addresses for spamming purposes. Knowing how to block them will aid in the ongoing spam war. If you are aware of unwanted bots crawling your website, add them individually (with separate User-Agent statements) to your Robots.txt file, and use Disallow: / to block each one from your entire site. Keep in mind, though, that truly malicious bots often ignore Robots.txt entirely, so server-level blocking may be the only reliable defense against them.
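A ban on a single bot can be checked the same way as before with urllib.robotparser. “BadBot” below is a made-up name standing in for whatever crawler you want to ban; compliant bots match their own User-Agent group before falling back to the * group.

```python
# Sketch: a Robots.txt that bans one hypothetical bot, "BadBot",
# while leaving the whole site open to every other crawler.
from urllib import robotparser

ROBOTS_TXT = """\
User-Agent: BadBot
Disallow: /

User-Agent: *
Disallow:
"""

parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(parser.can_fetch("BadBot", "/index.html"))     # False: banned site-wide
print(parser.can_fetch("Googlebot", "/index.html"))  # True: falls under *
```

The empty Disallow line in the * group means “nothing is blocked,” which is why every bot other than BadBot stays welcome.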
