14 Feb 2012

The Ultimate Guide to Blocking Your Content in Search

We all work so hard to make sure all of our content is crawled and indexed by the search engines. So it’s ironic that sometimes we must also struggle to keep otherwise private content out of the indexes, or to get it removed once it’s there.

The process of blocking content from search can be frustrating, removal can be slow, and the whole experience exasperating – especially if you don’t know what options you have. Let’s talk about the various options you have for removing content from the search indexes and for preventing it from being indexed in the first place.


Find all of the affected URLs

Before you leap into the URL removal process, look to see which URLs point to the content you want removed. Think in terms of reverse canonicalization. If the content is older, it might be indexed under multiple URLs, such as:

  • xyz.com/mystuff
  • xyz.com/mystuff/
  • www.xyz.com/mystuff
  • www.xyz.com/mystuff/
  • www.xyz.com/mystuff/Index.htm
  • www.xyz.com/mystuff/index.htm

and many other variations. Identify all of the URLs pointing to the content you want removed so you are ready to remove all references to it. For more information on canonicalization concepts, see this helpful post on canonicalization.


Remove indexed content from search

There are several ways to tell the search engines the content is no longer available. Let’s jump right in.

Remove it from the web server

The easiest way to remove content from the search indexes is simply to remove it from your site. When a search crawler comes back to your site to check the status of your published content, its request for the removed content will return an HTTP 404 status code, which tells the crawler the file can’t be found. That result kicks off the automatic (albeit slow) process of removing the URL from the index.

Set the web server to return a 404 (or 410) for the URL

If you must leave the content on the server, you can configure the web server to still return either a 404 “File Not Found” or a 410 “Gone” status for the given URL. The process of configuring a specific, non-default HTTP status for a URL depends upon the web server platform used. See your web server documentation for details. Note that because the status is returned by the server itself, this technique also works for non-HTML content, such as PDFs and Microsoft Word DOC files.
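
For instance, here is a minimal sketch assuming an Apache server with mod_alias enabled (the /offlimits/old-report.htm path is hypothetical); a single .htaccess entry can return the desired status for one URL:

# Return 410 "Gone" for one retired URL (Apache mod_alias)
Redirect gone /offlimits/old-report.htm
# To return a 404 instead:
# Redirect 404 /offlimits/old-report.htm

Other platforms, such as IIS and nginx, offer their own ways to map a URL to a specific status code.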

Permanently redirect a URL

Assigning a 301 (aka permanent) redirect to a URL tells the search crawler that the requested URL is no longer available and has been permanently replaced by a substitute (the URL receiving the redirect traffic).
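
For example, here is a minimal sketch of a 301 on an Apache server with mod_alias (the /mystuff/old-page.htm and /mystuff/new-page.htm paths are hypothetical):

# Permanently redirect the retired URL to its replacement (Apache mod_alias)
Redirect 301 /mystuff/old-page.htm http://www.xyz.com/mystuff/new-page.htm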

All of the above methods take time to take effect. They depend on the search crawler returning to the site, requesting the affected URL and receiving the actionable HTTP status code, and then on the search engine algorithm eventually purging the content. If the issue is an emergency, such as when proprietary business or confidential personal information is accidentally exposed, you need immediate action to get that content purged. Here’s how to do that:

Use the search engines’ webmaster tools to remove specific pages

Both Google and Bing offer tools for requesting the immediate removal of indexed content. Before you can access them, you must be a registered user of Google Webmaster Tools and Bing Webmaster Center Tools (this alone is reason enough to register your site now before an urgent problem arises).

  • Google:
  1. Log in to Google Webmaster Tools and click Site configuration > Crawler access > URL removals tab.
  2. Click Create a new removal request, type or paste the URL to be removed, and then click Continue. Remember that URLs are case sensitive, so I recommend copying and pasting the URL to be removed.
  3. From the dropdown list, select the type of data removal you want (cache only, cache and SERP, or entire directory), and then click Submit Request. Your request will appear as a listing in the tool, where you can monitor the status of the request.
  • Bing (which includes organic SERPs in Yahoo!):
  1. Log in to Bing Webmaster Center Tools and click the Index tab > Block URLs.
  2. Select the type of data removal you want (click either Block URL and Cache or Block Cache).
  3. Select what to block (page only, directory, or entire site).
  4. Copy and paste the URL to be removed, click Next, click Confirm, and then click Finish.

Note that the search engine-provided URL removal tools are typically intended for urgently needed data removals. In addition to the above techniques, there are other ways to remove content that also proactively prevent it from being indexed in the first place. Let’s explore those.


Block URLs to prevent duplicate content in the index

The most commonly used method of managing search crawler access to your site’s content is to use Robots Exclusion Protocol (REP) directives. This can be achieved through several methodologies:

Use a robots.txt file on the site

The robots.txt file is a plain text file containing crawling exclusion directives aimed at one or more REP-compliant crawlers (or, most commonly, generic directives applicable to all REP-compliant crawlers). When the file is uploaded to the domain (or subdomain) root of a website, it will automatically be read by REP-compliant crawlers before any URLs are fetched (all major search engine crawlers are REP-compliant). If a targeted URL is blocked by a robots.txt directive, the URL is not fetched.

The robots.txt file (note that, by protocol, this file name always uses lower-case letters) enables webmasters to block crawlers from accessing one or more particular files in a directory, whole directories, or the entire site. (Note: Per Google, this is the only approved method for removing entire directories from their index.) It also supports wildcard characters to make it extremely versatile.

The most common robots.txt instruction targets all crawlers (referred to as “user-agents” in REP). It’s followed by a specific directive, such as blocking access to a file, directory, or the site. Sample robots.txt directive code for generic user-agents looks like this:

User-agent: *
Disallow: /private.htm
Disallow: /offlimits/

You can also use Allow directives to allow crawlers access to a specific file within an otherwise blocked directory, such as in the following example:

Allow: /offlimits/index-me.htm
Disallow: /offlimits/

Note: The Allow directive takes precedence in any logic conflicts between Allow and Disallow directives, so be careful. It’s an SEO best practice to isolate allowed and disallowed files on a per-directory basis to eliminate confusion.

Wildcards in directives

The “*” wildcard represents any sequence of characters in a URL, meaning that the following directive,

Disallow: *cars

would block crawler access to a variety of content such as:

  • /redcars.htm
  • /roadsters/blue-cars.htm
  • /cars/black-roadster.htm
  • /2012/cars/bmw

and so on. Note that asterisks are not needed as a wildcard suffix, as the directive, by default, applies to any child content underneath the listed location in robots.txt.

The “$” character anchors a pattern to the end of a URL, which makes it useful for filtering by file name extension, such as in the following sample:

Disallow: *.pdf$

The sample code blocks crawlers from accessing all URLs that end with “.pdf”. By comparison, omitting the $ wildcard would block any URL containing the string “.pdf” anywhere in its path, such as /docs.pdf/newcars.htm.

Wildcards can create very powerful, wide-reaching directives. However, wildcard usage in robots.txt often contains logical coding errors, which can result in unintended crawler behavior. Search engines routinely receive complaints about wildly incomplete site crawls when, in fact, a misconfigured robots.txt file is to blame and the crawlers were simply abiding by the directives listed.

Don’t attempt to hide confidential content with robots.txt

Some webmasters, in their effort to block search crawlers from accessing their business-confidential files and directories, mistakenly list them in robots.txt. What they fail to realize is that the robots.txt file is always in the same location on a site and is always available to be read, including by people. For example, let’s say you had a robots.txt file that contained the following code:

User-agent: *
Disallow: /private/
Disallow: /client-list.php
Disallow: /secrets/

You can rest assured that your competitors, the ones who know web technologies, are snooping around your site and will see these references. They will then attempt to browse to the listed files and directories to see what’s there, such as a client list or a business expansion plan. Listing such content in robots.txt is effectively advertising where you keep your confidential documents!

To block snoopers from probing the depths of your website for business intelligence, protect the directory by restricting access to authenticated users with passwords. If the site structure or functionality prevents you from doing that, make sure you at least have an index page in the directory so the browser doesn’t return a directory listing showing all of the files up for grabs. You may even try renaming the directory to be blocked in robots.txt to a more innocuous name or burying it in a deep subdirectory (but first be sure any such change in the URL path won’t break any functionality within your site!).
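
If your server happens to be Apache, a related safeguard (a minimal sketch, and a complement to keeping an index page in the directory) is to switch off automatic directory listings altogether:

# Disable automatic directory listings (Apache; requires that Options overrides be allowed)
Options -Indexes

With listings disabled, a request for the bare directory returns a 403 error rather than an inventory of everything stored inside it.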

Use a <meta> robots tag on the page

The <meta> tag (or “element” for you HTML grammarian purists) can be used with REP directives. These directives apply only to the page on which they appear. The following sample code demonstrates a common usage, in which the crawler is disallowed from both indexing the content and following any of the links on the page:

<meta name="robots" content="noindex, nofollow">

Note the name attribute uses the generic value “robots”, which is applicable to all REP-compliant crawlers. You can alternatively specify the exact name of a user agent, such as googlebot or bingbot. If you do choose to target individual user agents, be sure the name is exactly right, or the directive may be ignored by the targeted crawler. Any crawler not addressed by either a specific or a generic <meta> robots directive will default to crawling the page, potentially indexing its content and following its links.
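
For example, a page-level directive aimed only at Google’s crawler (and ignored by other crawlers) would look like this:

<meta name="googlebot" content="noindex">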

The following values for the content attribute can be used in the <meta> robots tag:

Value | Function | Supported by
noindex | Prevents the bot from indexing the contents of the page, but links on the page can be followed. | Bing, Google
nofollow | Prevents the bot from following the links on the page, but the page can be indexed. | Bing, Google
none | Equivalent to “noindex, nofollow”. | Google
nosnippet | Prevents the display of the descriptive snippet text for that page in the SERPs. | Bing, Google
noarchive | Prevents the display of a cache link for that page in the SERP. | Bing, Google
nocache | Same as noarchive. | Bing
noodp | Instructs the bot to not use a title and snippet from the Open Directory Project (ODP) for that page in the SERP. | Bing, Google
notranslate | Prevents translation of the page in the SERP. | Google
noimageindex | Prevents indexing of images on the page. | Google
unavailable_after: [date/time] | Prevents the page from showing in the SERPs after the specified date/time. The date/time must be in RFC 850 format. | Google

Note: The attribute and value data fields are not case-sensitive.

Use the HTTP header X-Robots-Tag on the web server

For non-HTML-based content, such as TXT, DOC, and PDF documents, there is no way to apply REP directives via <meta> robots tags. Assuming you don’t use robots.txt for this, you can instead set REP directives for individual URLs using the HTTP header X-Robots-Tag. This header uses the same content values shown in the table above for the <meta> robots tag. The following is an example of a commonly used X-Robots-Tag header that applies to all REP-compliant crawlers:

X-Robots-Tag: noindex, nofollow

You can optionally identify a specific crawler for a directive, and pair that with a separate directive for all other crawlers not specified, as shown in the following sample:

X-Robots-Tag: googlebot: noindex, nofollow
X-Robots-Tag: otherbot: noindex

The process for implementing custom HTTP headers depends upon the web server platform used. Review your web server documentation for details.
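
As one illustration, here is a minimal sketch assuming an Apache server with mod_headers enabled; it attaches the header to every PDF file on the site:

# Add an X-Robots-Tag header to all PDF responses (Apache mod_headers)
<FilesMatch "\.pdf$">
    Header set X-Robots-Tag "noindex, nofollow"
</FilesMatch>

Other platforms, such as IIS and nginx, have their own mechanisms for adding custom response headers.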

REP methodology precedence

Generally speaking, it’s best to use only one REP method of controlling crawler access for your website. Redundant methods typically result in logic conflicts, crawler access problems, and indexing shortfalls, which can be difficult to resolve.

Note that from the search engine perspective, robots.txt blocking directives take precedence. Before a page on a site is accessed, the crawler first checks the robots.txt file to see whether access is blocked; if so, the page is never fetched. By contrast, the directives in a <meta> robots tag or an X-Robots-Tag header can only be read after the page has been fetched, and only then is the page discarded. As a result, the URL of a page blocked by robots.txt may still get indexed (the crawler never fetches it, so it never sees any on-page blocking directive), but no content from that page will be included in the index.

There is one caveat to the precedence of robots.txt directives: when a crawler is specifically given access in robots.txt with an Allow directive but then encounters a blocking directive in either a <meta> robots tag or an X-Robots-Tag header, the blocking directive overrides the Allow directive.

Lastly, REP directives do more than identify what content is off-limits to crawlers: if a new REP directive appears that blocks content that has already been indexed, that content is purged from the index. For more information on the robots.txt protocol, see www.robotstxt.org.


Require authentication

Another method of preventing the search crawler from accessing content is to require authentication. If a password is required to reach content on a site, the search crawler will not be able to access it. Note that using Secure HTTP (https) by itself, without requiring authentication, will not block the crawler. This is a common misunderstanding and is one of the reasons why so many duplicate pages are indexed by search.


Password protect a directory on the web server

Alternatively, instead of requiring authentication to use the site, a webmaster can put content in a password-protected directory on the server to prevent crawler access. This method can be used for web server administrator-related content.
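
As a rough sketch, assuming an Apache server with basic authentication available (and a hypothetical /full/path/to/.htpasswd credentials file), the protected directory’s .htaccess might contain:

# Require a valid username and password for everything in this directory (Apache basic auth)
AuthType Basic
AuthName "Restricted area"
AuthUserFile /full/path/to/.htpasswd
Require valid-user

Since crawlers can’t supply credentials, requests for the directory’s contents are answered with a 401 rather than the files themselves.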


Block dynamic URL parameters

For sites that use dynamic URL parameters to track referrer data, content duplication can become a significant problem. To prevent URLs carrying specific parameters from being indexed, and thus avoid content duplication, you can tell the search engines via their webmaster tools to ignore those parameters. Here’s how:

  • In Google:
  1. Log in to Google Webmaster Tools and click Site configuration > URL parameters.
  2. Click Configure URL parameters, and then click Add parameter.
  3. Type the parameter name, select whether the parameter changes what the user sees in the page, and then click Save.
  • In Bing (which also covers organic content found in Yahoo!):
  1. Log in to Bing Webmaster Center Tools and click Index tab > URL Normalization.
  2. Click Add Parameter, type the parameter name, and then click Submit.

Be careful of what you add to these lists. If your site uses URL parameters to define the page contents rather than to track referrers, you could accidentally purge a large number of pages from the search index.


Canonicalization techniques

Canonicalization is the process of redirecting unwanted URL variants for a given page to that page’s designated primary URL. Canonicalization effectively blocks those URL variants from the search engine indexes by using 301 permanent redirects. I discuss canonicalization techniques in detail, including the use of the <link> rel=canonical tag and how to set up 301 redirects in the recent blog post, The Ultimate How-To Guide on 301 Redirects. This post is long enough as is. I’ll refer you there for those details. You’re welcome!


Ineffective methods

Lastly, I’ll briefly mention what doesn’t work. For years it was a given that text content embedded within images, Flash, and other non-text media on a webpage was a deal-breaker for indexing. Well, crawlers have come a long way in recent years. But don’t misunderstand me – I am not saying you should feel free to put text content you want indexed within these types of media. They are still very difficult to crawl and parse for content, and search engine success rates are not great. That all said, very difficult is not impossible.

For crawling efficiency purposes, always spoon-feed your content to crawlers as pure, on-page text. But thanks to the use of optical character recognition (OCR) technologies and improvements in crawling JavaScript and rich Internet application technologies, a portion of this once-lost content is today being crawled and indexed. As a result, you can’t depend upon these technologies to be impenetrable walls shielding content from the prying eyes of search. You can’t rely on them to work, and you can’t rely on them to fail. What a world we live in!

Getting content out of a search engine index can be a frustrating and time-consuming experience, but it can be done. By reviewing and implementing the techniques described above, you can get confidential content purged relatively quickly as well as prevent it from being indexed again.

Be careful out there. There’s very little that’s private anymore on the web.