14 Feb 2012

The Ultimate Guide to Blocking Your Content in Search

We all work so hard to make sure all of our content is crawled and indexed by the search engines. So it’s ironic when sometimes we must also struggle to remove or prevent some otherwise private content from getting into the indexes.

The process of blocking content from search can be frustrating, removal can be slow, and the whole experience exasperating, especially if you don’t know what options you have. Let’s talk about the options for both removing content from the search indexes and preventing it from being indexed in the first place.

Find all of the affected URLs

Before you leap into the URL removal process, look to see which URLs point to the content you want removed. Think in terms of reverse canonicalization. If the content is older, it might be indexed under multiple URLs, such as:

  • xyz.com/mystuff
  • xyz.com/mystuff/
  • www.xyz.com/mystuff
  • www.xyz.com/mystuff/
  • www.xyz.com/mystuff/Index.htm
  • www.xyz.com/mystuff/index.htm

and many other variations. Identify all of the URLs pointing to the content you want removed so you are ready to remove all references to it. For more information on canonicalization concepts, see this helpful post on canonicalization.
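As a rough aid for this inventory, the Python sketch below enumerates the common variants of a single path. The helper name url_variants and the default index file names are illustrative assumptions, not part of any search engine tool. The real list depends on how your web server is configured.

```python
from itertools import product


def url_variants(host, path, index_names=("index.htm", "Index.htm")):
    """Return likely duplicate URLs for one piece of content.

    host        -- bare domain without "www.", e.g. "xyz.com"
    path        -- directory path without a trailing slash, e.g. "/mystuff"
    index_names -- default-document names to try (an assumption; adjust
                   to whatever your web server actually serves)
    """
    hosts = (host, "www." + host)
    suffixes = ("", "/") + tuple("/" + name for name in index_names)
    # Combine every host form with every path ending.
    return [h + path + s for h, s in product(hosts, suffixes)]


for url in url_variants("xyz.com", "/mystuff"):
    print(url)
```

Treat the output as a checklist of URLs to look for in the index, not as proof that each one is actually indexed.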

Remove indexed content from search

There are several ways to tell the search engines the content is no longer available. Let’s jump right in.

Remove it from the web server

The easiest way to remove content from the search indexes is simply to remove it from your site. When a search crawler comes back to your site to check the status of your published content, its request for the removed content will return an HTTP 404 status code, which tells the crawler the file can’t be found. That result kicks off the automatic (albeit slow) process of removing the URL from the index.

Set the web server to return a 404 (or 410) for the URL

If you must leave the content on the server, you can configure the web server to still return either the 404 “Not Found” or 410 “Gone” status for the given URL. The process of configuring a specific, non-default HTTP status code for a URL on your web server depends upon the platform used. See your web server documentation for details. Note that this technique won’t work for non-HTML content, such as PDFs and Microsoft Word DOC files.

Permanently redirect a URL

Assigning a 301 (aka permanent) redirect to a URL tells the search crawler that the requested URL is no longer available and has been permanently replaced by a substitute (the URL receiving the redirect traffic).
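To make the status-code techniques concrete, here is a minimal Python sketch of a WSGI application that answers 410 for retired URLs and 301 for redirected ones. The paths in GONE and MOVED are hypothetical, and in practice most sites would set these responses in the web server configuration rather than in application code.

```python
# Hypothetical URL tables for illustration only.
GONE = {"/old-press-release.htm"}
MOVED = {"/mystuff/index.htm": "/mystuff/"}


def response_for(path):
    """Return (status_line, headers) that a crawler should receive for path."""
    if path in GONE:
        # 410 tells the crawler the content was removed on purpose.
        return "410 Gone", []
    if path in MOVED:
        # 301 points the crawler at the permanent replacement URL.
        return "301 Moved Permanently", [("Location", MOVED[path])]
    return "200 OK", []


def app(environ, start_response):
    """Minimal WSGI application wrapping response_for()."""
    status, headers = response_for(environ.get("PATH_INFO", "/"))
    start_response(status, headers + [("Content-Type", "text/plain")])
    return [status.encode("ascii")]
```

Any WSGI server can host app for a quick local test, for example wsgiref.simple_server from the Python standard library.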

All of the above methods take time for the results to take effect. They are dependent upon waiting for the search crawler to return to the site, request the affected URL in order to receive the actionable HTTP status code, and then for the search engine algorithm to eventually purge the content. If the issue is an emergency, such as when proprietary business or confidential personal information is accidentally exposed, you need immediate action to get that content purged. Here’s how to do that:

Use the search engines’ webmaster tools to remove specific pages

Both Google and Bing offer tools for requesting the immediate removal of indexed content. Before you can access them, you must be a registered user of Google Webmaster Tools and Bing Webmaster Center Tools (this alone is reason enough to register your site now before an urgent problem arises).

  • Google:
  1. Log in to Google Webmaster Tools and click Site configuration > Crawler access > URL removals tab.
  2. Click Create a new removal request, type or paste the URL to be removed, and then click Continue. Remember that URLs are case sensitive, so I recommend copying and pasting the URL to be removed.
  3. From the dropdown list, select the type of data removal you want (cache only, cache and SERP, or entire directory), and then click Submit Request. Your request will appear as a listing in the tool, where you can monitor the status of the request.
  • Bing (which includes organic SERPs in Yahoo!):
  1. Log in to Bing Webmaster Center Tools and click the Index tab > Block URLs.
  2. Select the type of data removal you want (click either Block URL and Cache or Block Cache).
  3. Select what to block (page only, directory, or entire site).
  4. Copy and paste the URL to be removed, click Next, click Confirm, and then click Finish.

Note that the search engine-provided URL removal tools are typically intended for urgently needed data removals. In addition to the above techniques, there are other ways to remove content that also proactively prevent it from being indexed in the first place. Let’s explore those.

Block URLs to prevent duplicate content in the index

The most commonly used method of managing search crawler access to your site’s content is to use Robots Exclusion Protocol (REP) directives. This can be achieved through several methodologies:

Use a robots.txt file on the site

The robots.txt file is a plain text file containing crawling exclusion directives aimed at one or more REP-compliant crawlers (or most commonly, generic directives applicable to all REP-compliant crawlers). When the file is uploaded to the domain (or subdomain) root of a website, it will automatically be read by REP-compliant crawlers before any URLs are fetched (all major search engine crawlers are REP-compliant). If a targeted URL is blocked by a robots.txt directive, the URL is not fetched.

The robots.txt file (note that, by protocol, this file name always uses lower-case letters) enables webmasters to block crawlers from accessing one or more particular files in a directory, whole directories, or the entire site. (Note: Per Google, this is the only approved method for removing entire directories from their index.) It also supports wildcard characters to make it extremely versatile.

The most common robots.txt instruction targets all crawlers (referred to as “user-agents” in REP). It’s followed by a specific directive, such as blocking access to a file, directory, or the site. Sample robots.txt directive code for generic user-agents looks like this:

User-agent: *
Disallow: /private.htm
Disallow: /offlimits/

You can also use Allow directives to allow crawlers access to a specific file within an otherwise blocked directory, such as in the following example:

Allow: /offlimits/index-me.htm
Disallow: /offlimits/

Note: When Allow and Disallow directives conflict, the major search engines generally apply the more specific (longer) matching rule, and the Allow directive wins a tie, so be careful. It’s an SEO best practice to isolate allowed and disallowed files on a per-directory basis to eliminate confusion.
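To sanity-check simple rules like these before uploading them, you can run them through the robots.txt parser in the Python standard library. This is only a sketch: the is_allowed helper is my own, the User-agent line is added because the parser needs one, and the www.xyz.com host is a placeholder. Be aware that Python’s parser applies rules in file order (first match wins) and does not interpret wildcards, so it may not resolve every conflict the way a given search engine does.

```python
from urllib.robotparser import RobotFileParser

# The sample directives from this section, grouped under a generic user-agent.
ROBOTS_TXT = """\
User-agent: *
Allow: /offlimits/index-me.htm
Disallow: /offlimits/
Disallow: /private.htm
"""


def is_allowed(path, robots_txt=ROBOTS_TXT, agent="*"):
    """Return True if Python's REP parser would let agent fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # The host is a placeholder; only the path is compared against the rules.
    return parser.can_fetch(agent, "http://www.xyz.com" + path)


for path in ("/offlimits/index-me.htm", "/offlimits/other.htm", "/private.htm"):
    print(path, "allowed" if is_allowed(path) else "blocked")
```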

Wildcards in directives

The “*” wildcard matches any sequence of characters at the point where it appears in the directive, meaning that the following directive,

Disallow: *cars

would block crawler access to a variety of content such as:

  • /redcars.htm
  • /roadsters/blue-cars.htm
  • /cars/black-roadster.htm
  • /2012/cars/bmw

and so on. Note that asterisks are not needed as a wildcard suffix, as the directive, by default, applies to any child content underneath the listed location in robots.txt.

The “$” character is used to filter by file name extension, such as in the following sample:

Disallow: *.pdf$

The sample code blocks crawlers from accessing all URLs that end in “.pdf”. By comparison, omitting the $ character would block any file path containing the string “.pdf”, such as /docs.pdf/newcars.htm.

Wildcards can create very powerful, wide-reaching directives. However, wildcard usage in robots.txt often contains logical coding errors, which can result in unintended crawler behavior. It is extremely common for search engines to receive complaints about wildly incomplete site crawls when in fact a misconfigured robots.txt file is actually to blame, and the crawlers were simply abiding by the directives listed.
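Because wildcard mistakes are so easy to make, it can help to test a pattern against sample paths before publishing it. The Python sketch below is a hand-written approximation of the “*” and “$” behavior described above. The rule_matches helper is hypothetical and is not any search engine’s actual matching code, so treat its answers as a first check only.

```python
import re


def rule_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regular expression.

    "*" matches any run of characters; a trailing "$" anchors the match to
    the end of the URL path. Without "$" the pattern is a prefix match.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with ".*" for each "*".
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def rule_matches(pattern, path):
    """Return True if the directive pattern would apply to the URL path."""
    # re.match only matches at the start of the path, giving prefix behavior.
    return rule_to_regex(pattern).match(path) is not None


samples = ["/redcars.htm", "/2012/cars/bmw", "/docs/report.pdf", "/docs.pdf/newcars.htm"]
for rule in ("*cars", "*.pdf$", "*.pdf"):
    print(rule, [p for p in samples if rule_matches(rule, p)])
```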

Don’t attempt to hide confidential content with robots.txt

Some webmasters, in their effort to block search crawlers from accessing their business confidential files and directories, mistakenly list them in robots.txt. What they fail to realize is that the robots.txt file is always in the same location on a site, and is always available to be read, including by people. For example, let’s say you had a robots.txt file that contained the following code:

User-agent: *
Disallow: /private/
Disallow: /client-list.php
Disallow: /secrets/

You can rest assured that your competitors, the ones who know web technologies, are snooping around your site and will see these references. They will then attempt to browse to the listed files and directories to see what’s there, such as a client list or a business expansion plan. Listing such content in robots.txt is effectively advertising where you keep your confidential documents!
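To see how little effort this snooping takes, consider the short Python sketch below. It pulls every Disallow path out of robots.txt text that anyone can download. The disallowed_paths helper is my own illustration, not a standard tool.

```python
def disallowed_paths(robots_txt):
    """Return every non-empty Disallow path found in robots.txt text."""
    paths = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        field, _, value = line.partition(":")
        if field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


sample = """\
User-agent: *
Disallow: /private/
Disallow: /client-list.php
Disallow: /secrets/
"""
print(disallowed_paths(sample))
```

Anything this function returns for your own site is a path a curious competitor can find just as easily.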

To block the snoopers from probing the depths of your website for business intelligence, you can protect the directory by restricting access to authenticated usernames with passwords. If the site structure or functionality prevents you from doing that, make sure you at least have an index page in the directory so the browser doesn’t return a directory listing showing all of the files up for grabs. You may even try renaming the directory to be blocked in robots.txt to a more innocuous name or burying it in a deep subdirectory (but first be sure any such change in the URL path won’t break any functionality within your site!).

Use a <meta> robots tag on the page

The <meta> tag (or “element” for you HTML grammarian purists) can be used with REP directives. These directives apply only to the page on which they appear. The following sample code demonstrates a common usage, in which the crawler is disallowed from both indexing the content and following any of the links on the page:

<meta name="robots" content="noindex, nofollow">

Note the name attribute uses the generic value “robots”, which is applicable to all REP-compliant crawlers. You can alternatively choose to specify the exact name of a user agent as well, such as googlebot or bingbot. If you do choose to specify individual user agents, be sure the name is exactly right, or the directive may be ignored by the targeted crawler. Any crawler not identified by a specific or a generic <meta> robots directive will default to crawling the page for purposes of potentially indexing its content and following its links.
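If you want to confirm which robots directives a page actually carries, a small script can read them back out of the HTML. The following Python sketch uses the standard library HTML parser. The class name, the helper name, and the short list of crawler names in CRAWLER_NAMES are my assumptions for illustration, so extend the list to match the crawlers you target.

```python
from html.parser import HTMLParser

# Crawler names to look for in the name attribute (an illustrative list).
CRAWLER_NAMES = {"robots", "googlebot", "bingbot"}


class RobotsMetaParser(HTMLParser):
    """Collect the REP directives found in <meta> robots tags."""

    def __init__(self):
        super().__init__()
        self.directives = {}  # crawler name -> set of directive values

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        # Names and values are not case-sensitive, so normalize both.
        name = attr.get("name", "").strip().lower()
        if name in CRAWLER_NAMES:
            values = {v.strip().lower() for v in attr.get("content", "").split(",") if v.strip()}
            self.directives.setdefault(name, set()).update(values)


def robots_directives(html):
    """Return a dict mapping crawler name to its set of directives."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
print(robots_directives(page))
```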

The following values for the content attribute can be used in the <meta> robots tag:

| Value | Function | Supported by |
|---|---|---|
| noindex | Prevents the bot from indexing the contents of the page, but links on the page can be followed. | Bing, Google |
| nofollow | Prevents the bot from following the links on the page, but the page can be indexed. | Bing, Google |
| none | Equivalent to “noindex, nofollow”. | Google |
| nosnippet | Prevents the display of the descriptive snippet text for that page in the SERPs. | Bing, Google |
| noarchive | Prevents the display of a cache link for that page in the SERP. | Bing, Google |
| nocache | Same as noarchive. | Bing |
| noodp | Instructs the bot to not use a title and snippet from the Open Directory Project (ODP) for that page in the SERP. | Bing, Google |
| notranslate | Prevents translation of the page in the SERP. | Google |
| noimageindex | Prevents indexing of images on the page. | Google |
| unavailable_after: [date/time] | Prevents the page from showing in the SERPs after the specified date/time. The date/time data must be in RFC 850 format. | Google |

Note: The attribute and value data fields are not case-sensitive.

Use the HTTP header X-Robots-Tag on the web server

For non-HTML-based content, such as TXT, DOC, and PDF documents, there is no way to apply REP directives via <meta> robots tags to them. Assuming you don’t use robots.txt for this, you can instead set REP directives for individual URLs using the HTTP header X-Robots-Tag. This header uses the same content values as shown in the table above for the <meta> robots tag. The following is an example of a commonly used X-Robots-Tag header that applies to all REP-compliant crawlers:

X-Robots-Tag: noindex, nofollow

You can optionally identify a specific crawler for a directive, and pair that with a separate directive for all other crawlers not specified, as shown in the following sample:

X-Robots-Tag: googlebot: noindex, nofollow
X-Robots-Tag: otherbot: noindex

The process for implementing custom HTTP headers is dependent upon the web server platform used. Review your web server documentation for details.
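As one possible illustration, the Python sketch below wraps a WSGI application so that responses for PDF and DOC files carry the header. The add_x_robots_tag helper and the BLOCKED_EXTENSIONS list are assumptions of mine, and most sites would set this header in the web server configuration instead.

```python
# File extensions that should carry the header (an illustrative list).
BLOCKED_EXTENSIONS = (".pdf", ".doc")


def add_x_robots_tag(app, value="noindex, nofollow"):
    """Wrap a WSGI app so matching documents are served with X-Robots-Tag."""

    def wrapped(environ, start_response):
        path = environ.get("PATH_INFO", "").lower()

        def patched_start(status, headers, exc_info=None):
            # Append the header only for the targeted file types.
            if path.endswith(BLOCKED_EXTENSIONS):
                headers = list(headers) + [("X-Robots-Tag", value)]
            return start_response(status, headers, exc_info)

        return app(environ, patched_start)

    return wrapped


def demo_app(environ, start_response):
    """Stand-in for the real application that serves the files."""
    start_response("200 OK", [("Content-Type", "application/octet-stream")])
    return [b""]


site = add_x_robots_tag(demo_app)
```

Here site behaves exactly like demo_app except that requests for paths ending in .pdf or .doc receive the extra header.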

REP methodology precedence

Generally speaking, it’s best to only use one REP method of controlling crawler access for your website. Redundant methods typically result in logic conflicts, crawler access problems, and indexing shortfalls, which can be difficult to resolve.

Note that from the search engine perspective, robots.txt blocking directives take precedence. This is because before a page on a site is accessed, the crawler first checks for the presence of a robots.txt file to see if access is blocked. If so, the page is not fetched. However, to read the directives in either <meta> robots tags or the HTTP header X-Robots-Tag, the page has to first be fetched. If blocking directives are found there, only then is the page discarded. As a result, this means that the URL of the page may get indexed, but no content from that page with blocking directives will be included in the index.

There is one caveat to the precedence of robots.txt directives: when a crawler is specifically given access in robots.txt with an Allow directive but then encounters a blocking directive in either <meta> robots or X-Robots-Tag, the blocking directive overrides the Allow directive.

Lastly, REP directives not only identify what content is off-limits to crawlers; if a new REP directive blocks content that has already been indexed, that content is also purged from the index. For more information on the robots.txt protocol, see www.robotstxt.org.
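The precedence rules in this section can be summarized as a small decision function. This Python sketch is only a simplified model of the order of checks described above, with a hypothetical index_outcome helper. It is not how any search engine is actually implemented.

```python
# Page-level values that keep content out of the index.
BLOCKING = {"noindex", "none"}


def index_outcome(robots_txt_allows, page_directives=()):
    """Model what happens to a URL under the REP precedence described above.

    robots_txt_allows -- True if robots.txt permits fetching the URL
    page_directives   -- values from <meta> robots or X-Robots-Tag, if any
    """
    directives = {d.strip().lower() for d in page_directives}
    if not robots_txt_allows:
        # The page is never fetched, so page-level directives go unseen.
        return "not fetched; the bare URL may still be indexed without content"
    if directives & BLOCKING:
        # Page-level blocking overrides an Allow in robots.txt.
        return "fetched, then kept out of the index"
    return "fetched and eligible for indexing"


print(index_outcome(False, ["noindex"]))
print(index_outcome(True, ["noindex", "nofollow"]))
print(index_outcome(True))
```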

Require authentication

Another method of preventing the search crawler from accessing content is to require authentication for access. If a password is required on a site, the search crawler will not be able to access its content. Note that using Secure HTTP (https) by itself (without requiring authentication) will not block the crawler. This is a common misunderstanding and is one of the reasons why so many duplicate pages are indexed by search.

Password protect a directory on the web server

Alternatively, instead of requiring authentication to use the site, a webmaster can put content in a password-protected directory on the server to prevent crawler access. This method can be used for web server administrator-related content.
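For a sense of what the crawler runs into, here is a minimal Python sketch of HTTP Basic authentication guarding one directory in a WSGI application. The /private/ prefix and the demo credentials are hypothetical, and a real site would normally rely on the web server’s built-in password protection and never hard-code a password.

```python
import base64
import hmac

# Hypothetical values for illustration only.
USERNAME, PASSWORD = "editor", "change-me"
PROTECTED_PREFIX = "/private/"


def credentials_ok(auth_header):
    """Check an HTTP Basic Authorization header against the demo login."""
    scheme, _, encoded = (auth_header or "").partition(" ")
    if scheme.lower() != "basic":
        return False
    try:
        decoded = base64.b64decode(encoded.strip(), validate=True).decode("utf-8")
    except ValueError:  # covers malformed base64 and invalid UTF-8
        return False
    expected = f"{USERNAME}:{PASSWORD}"
    # Constant-time comparison avoids leaking how much of the value matched.
    return hmac.compare_digest(decoded.encode("utf-8"), expected.encode("utf-8"))


def app(environ, start_response):
    """Serve everything, but demand a login for the protected directory."""
    path = environ.get("PATH_INFO", "/")
    if path.startswith(PROTECTED_PREFIX) and not credentials_ok(environ.get("HTTP_AUTHORIZATION")):
        start_response("401 Unauthorized", [
            ("WWW-Authenticate", 'Basic realm="private"'),
            ("Content-Type", "text/plain"),
        ])
        return [b"Authentication required"]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"OK"]
```

A crawler requesting anything under the protected prefix receives the 401 response and never sees the content.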

Block dynamic URL parameters

For sites that use dynamic URL parameters to track referrer data to their pages, content duplication can become a significant problem. To keep URLs that differ only by such parameters from being indexed as duplicates, you can tell the search engines via their webmaster tools to ignore specified URL parameters. Here’s how:

  • In Google:
  1. Log in to Google Webmaster Tools and click Site configuration > URL parameters.
  2. Click Configure URL parameters, and then click Add parameter.
  3. Type the parameter name, select whether the parameter changes what the user sees in the page, and then click Save.
  • In Bing (which also covers organic content found in Yahoo!):
  1. Log in to Bing Webmaster Center Tools and click Index tab > URL Normalization.
  2. Click Add Parameter, type the parameter name, and then click Submit.

Be careful of what you add to these lists. If your site uses URL parameters to define the page contents rather than to track referrers, you could accidentally purge a large number of pages from the search index.
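Before you add a parameter to either tool, it can be worth checking which of your URLs would collapse together once that parameter is ignored. The Python sketch below strips a set of tracking parameters from a URL. The strip_tracking helper and the parameter names in TRACKING_PARAMS are hypothetical examples, so substitute the names your site really uses.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical tracking parameter names for illustration only.
TRACKING_PARAMS = {"ref", "source", "utm_source", "utm_medium", "utm_campaign"}


def strip_tracking(url, tracking=TRACKING_PARAMS):
    """Return url with the listed tracking parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in tracking
    ]
    # Rebuild the URL with only the parameters that define page content.
    return urlunsplit(parts._replace(query=urlencode(kept)))


print(strip_tracking("http://www.xyz.com/mystuff/?ref=newsletter&page=2"))
```

If two URLs that show different content come out identical after stripping, that parameter defines page content and should not go on the ignore list.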

Canonicalization techniques

Canonicalization is the process of redirecting unwanted URL variants for a given page to that page’s designated primary URL. Canonicalization effectively blocks those URL variants from the search engine indexes by using 301 permanent redirects. I discuss canonicalization techniques in detail, including the use of the <link rel="canonical"> tag and how to set up 301 redirects, in the recent blog post, The Ultimate How-To Guide on 301 Redirects. This post is long enough as is. I’ll refer you there for those details. You’re welcome!

Ineffective methods

Lastly, I’ll briefly mention what doesn’t work. For years it was a given that crawlers could not read text content embedded within images, Flash, and other non-text media on a webpage. Well, crawlers have come a long way in recent years. But don’t misunderstand me: I am not saying you should feel free to put text content you want indexed within these types of media. They are still very difficult to crawl and parse for content, and search engine success rates are not great. That all said, very difficult is not impossible.

For crawling efficiency purposes, always spoon-feed your content to crawlers as pure, on-page text. But thanks to the use of optical character recognition (OCR) technologies and improvements in crawling JavaScript and rich Internet application technologies, a portion of this once-lost content is today being crawled and indexed. As a result, you can’t depend upon these media types to be impenetrable walls shielding content from the prying eyes of search. You can’t rely on them to work, and you can’t rely on them to fail. What a world we live in!

Getting content out of a search engine index can be a frustrating and time-consuming experience, but it can be done. By reviewing and implementing the techniques described above, you can get confidential content purged relatively quickly as well as prevent it from being indexed again.

Be careful out there. There’s very little that’s private anymore on the web.


  1. Dewaldt Huysamen February 15, 2012 at 12:42 AM

    I have found that its best to only use noindex as link juice can still be passed by links followed, even though the content you trying to remove from the index might be bad performing post panda etc, your penalty will be lifted and link juice is still passed.

    Have tested this although I know this article is for removing content permanently not related to panda or anything else.

  2. David Hehr February 15, 2012 at 12:41 PM

    Nice recap Rick.

    Helpful reconciliation of tactics and approaches.
    Depending on the situation, as you have it’s often best to employ multiple tactics with the same aim, in part to reflect multiple potential use cases (domain based base crawling by googlebot vs. inbound deep linking to an interior page from an outside source, etc.).

    HTTP header + robots are a favorite if it’s an entire site section.

    If it’s a Panda related issue, note that tactics like meta NOINDEX can be helpful in preventing the content from being indexed, but technically googlebot still has to crawl the page and knows the content (perhaps duplicate, licensed, etc.) is still there — which can be problematic — so it depends on the situation. Nice recap.

    1. Rick DeJarnette March 2, 2012 at 5:07 PM

      Thanks, David! I’m glad you found it useful.

  3. Crow April 29, 2012 at 5:38 PM

    Rick, if I could trouble you with a question:

    I have two blog pages that I need totally removed from Bing/Yahoo search. They are hosted on wordpress.com, so I’m unable to insert meta tags or upload a robots.txt file (trust me, I’ve tried all day). I’ve used Bing Webmaster to manually block both URL and cache, however, when I search for these pages, they still come up in the search results. How can I stop this? Is it just a matter of time? Thanks for any advice.

  4. Rick DeJarnette April 30, 2012 at 2:14 PM

    Crow, I maintain my own WordPress site and am able to upload a custom robots.txt file using FTP to the host. That would certainly be a good approach for you to take in this case (if you have not done so, talk to your host provider’s support team to set up FTP access to the site root). Of course, this assumes you have administrative access to the host account.

    Note that a URL blocked by robots.txt may not be completely eliminated from the SERPs. If other sites link to the URL you want blocked, the URL of the page may still be indexed, but your robots.txt block will remove all of the content associated with that indexed URL.

    How long has it been since you implemented the URL block request from Bing? If it’s only been a matter of minutes or hours, then keep checking back. However, if it’s been a matter of a couple of business days, then it’s time to escalate the matter.

    At that point, I’d recommend posting a request for assistance on the Bing Webmaster Forum (http://www.bing.com/community/webmaster/default.aspx). I hope that helps. Thanks for writing!

    1. Crow April 30, 2012 at 9:31 PM

      Thanks for your reply. I should add that I am using the free hosting on wordpress.com, not wordpress.org, so I have very little room to work with. WordPress.com even explicitly forbids ftp connection. I can only upload certain “media” through their platform, no .txt files (even if I did, I don’t think it would go to the root folder).

      But the “block URL” on Bing seems to have worked – I can no longer find the pages by searching for key phrases. I guess it needed several hours to kick in. Hopefully that should do it for my purposes. Thanks again, great article.

  5. nauval2007 June 21, 2012 at 9:49 PM

    Hei…I didn’t know we could use * directive on Disallow. Nice guide by the way, I wanna post article about blocking content. Your post help me a lot.

  6. ashok sharma July 23, 2012 at 3:12 AM

    hi,Thanks for sharing the great article.can u provide me the following:

    i have a website abc.com my google webmaster tool returning a soft 404 error, i want to block these urls by robots.txt


    Noindex: /fabrics.html/*?step1&category_id=*



    Noindex: /fabrics.html/*?page=shop.pdf_output&showpage=*

    if we apply above these should this work for me

    I will be thankful for your help….

  7. Lydie July 30, 2012 at 9:48 AM

    Do you think blocking dynamic parameter in GWT is working ? because I don’t think so… When I do search queries searching for the parameter, I still have search results…

    1. ashok July 31, 2012 at 12:40 AM

      ok i understand this.. but can u please let me know how can block these url from search results….. Also what will be way to decrease soft404 errors…

  8. GAURAV KUMAR September 20, 2012 at 2:23 PM

    great article about SEO

  9. Sumit October 4, 2013 at 10:55 PM

    I removed my urls from google index using Webmaster but after 12-13 days it again appeared in search results (even I had trashed those articles and did a permanent 301 redirection)
    My old blog was http://www.tricksaddiction.in and I redirected its url to http://www.rigglr.com
