01 May 2012

The Ultimate Guide to XML Sitemaps

The World Wide Web is a massive place and continues to grow at a phenomenal rate. According to a March 2012 Netcraft survey, there are well over half a billion websites, 644,275,754 to be exact, and that figure represents a rise of 31.4 million (5.1% growth) over just the previous month! And that’s only counting websites. How many published webpages does that add up to? Projections vary wildly, given the rate of change and growth, not to mention the extensive duplication of page content reachable through multiple URL variations. But whatever the number is, let’s face it: it’s massive.

So what are small website owners to do to get their content pages indexed in such a massively crowded universe of web content? (And frankly, big site owners ask the same question.) Addressing that issue is a big part of the work SEOs perform, and one of the best tools we have for feeding the search engines is the XML-based Sitemap.


What is a Sitemap?

A Sitemap is not the same thing as a sitemap. Allow me to clarify. A Sitemap file (with a capital S) is an XML-encoded listing of the most important content files within a site, built specifically for search engine crawlers to consume as a data feed. By contrast, a sitemap file (with a lowercase s) is typically an HTML file that lists the most important content files within a site, intended for human users to browse and find the content they want to read. The key difference is the intended audience, and thus the markup coding used within the file.

Search engines use Sitemaps to learn about a site’s structure, to better plan their crawl activity budgets, and as a webmaster-generated, suggested crawler feed. Please understand that listing a webpage’s URL in a Sitemap does not guarantee its inclusion in the search index. Not at all. However, it does mean that if the Sitemap uses well-formed XML code, supplies clean, valid URLs, and meets the other requirements of the search engines, the URLs it contains will at least be noted for consideration in future crawling activity. That alone improves the odds of discovery and crawling for pages that would otherwise have to rely solely on links.

While Sitemaps are helpful for websites to have under any circumstance, they are especially helpful in the following conditions:

  • New sites with new pages not yet well-linked (both internally and externally)
  • Sites that use dynamic URLs for their content pages
  • Sites with archived content that’s not well-linked to its currently active pages
  • Sites with hard-to-discover pages that use hard-to-crawl links (such as those in scripts) or are heavy in non-text content, such as rich Internet applications (Flash or Silverlight)

Sitemaps are essentially content discoverability feeds to the search engines.


Structure of Sitemap XML

All of the major search engines support the formalized XML data protocol as defined on Sitemaps.org. A sample of the XML code used in a Sitemap looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2012-04-30</lastmod>
    <changefreq>monthly</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>http://www.example.com/about-us.html</loc>
    <changefreq>yearly</changefreq>
    <priority>0.4</priority>
  </url>
</urlset>

The Sitemap XML code includes both required and optional tags. Valid Sitemap XML code starts with the standard <?xml ... ?> declaration and its required attributes. It’s followed by one instance of the <urlset> tag, which includes at least one xmlns attribute referencing the namespace that defines the XML schema structure. Each URL referenced in the Sitemap requires both a <url> tag and a <loc> tag nested within it. The remaining tags are optional.
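
To make the minimum requirements concrete, here is the smallest valid Sitemap, using only the required elements (the example.com URL is a placeholder):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
  </url>
</urlset>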

The tags used in the Sitemap XML protocol are defined below:

  • <?xml> (Required): The opening declaration of the file, which includes the required attributes version="1.0" and encoding="UTF-8".
  • <urlset> (Required): Used only once, this tag includes the required namespace attribute xmlns="http://www.sitemaps.org/schemas/sitemap/0.9". Google supports including additional namespace attributes and values for mixed data types, including:
      • Images: xmlns:image="http://www.google.com/schemas/sitemap-image/1.1"
      • Video: xmlns:video="http://www.google.com/schemas/sitemap-video/1.1"
      • Mobile: xmlns:mobile="http://www.google.com/schemas/sitemap-mobile/1.0"
      • Code search: xmlns:codesearch="http://www.google.com/codesearch/schemas/sitemap/1.0"
      • News: xmlns:news="http://www.google.com/schemas/sitemap-news/0.9"
  • <url> (Required): Parent tag for each URL added. All remaining tags are nested within this tag.
  • <loc> (Required): The webpage URL. Use the full URL, including protocol, not to exceed 2,048 characters.
  • <lastmod> (Optional): Date of the page’s last revision, written in YYYY-MM-DD format (per W3C Datetime).
  • <changefreq> (Optional): Expected frequency of page revisions, treated as a hint by search engines. Valid values include:
      • always (changes each time the page is accessed)
      • hourly
      • daily
      • weekly
      • monthly
      • yearly
      • never (used for archived pages)
  • <priority> (Optional): The priority of a page relative to others on your site. Valid values range from 0.0 to 1.0, with the default at 0.5.

File formats and types

Both Google and Bing accept the XML format for Sitemaps as specified in the protocol description on Sitemaps.org. However, they also accept “Sitemap” feeds in other formats, such as RSS 2.0, Atom 1.0, and basic ASCII text files. Sitemaps can be posted on websites either in normal XML format (as .xml files) or using Gzip file compression (as .gz files).

In addition to the standard Sitemap for web content (referencing HTML and other common webpage content), Google supports specialized Sitemap extensions dedicated to specific media types. These include Sitemaps for video, images, mobile content, software source code, and news content, all of which include additional media-specific metadata for search engines to use in classifying the media files found on your site. As long as the appropriate Sitemap extension is declared in the Sitemap namespace, all of this additional content can be added to the same Sitemap file.
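
To illustrate, here is a sketch of a Sitemap entry mixing standard web content with Google’s image extension (the gallery page, image URL, and caption are hypothetical placeholders):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>http://www.example.com/gallery.html</loc>
    <image:image>
      <image:loc>http://www.example.com/images/photo1.jpg</image:loc>
      <image:caption>An example caption for the image</image:caption>
    </image:image>
  </url>
</urlset>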

Notes:

  • Bing has not announced support for the Google-supported Sitemap extensions, but Duane Forrester of Bing said “anything not supported will be ignored”, so webmasters should feel free to include these references in their Sitemap files.
  • Google recommends that news content Sitemaps be separate files as they are crawled at a much higher frequency.

As an alternative to using a Sitemap to identify video metadata for search, both Google and Bing support the use of media RSS (aka mRSS) files for this purpose. And since Bing supports mRSS but has not announced support for the video Sitemap extension, developing an mRSS video feed that serves both Google and Bing is the smarter bet for sites containing a large number of video files they want indexed in search. Check out the details on both Google and Bing for more information about their officially supported mRSS feeds.
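
As a rough sketch of what such a feed looks like (the feed title, URLs, and descriptions below are placeholders; consult each engine’s documentation for the full set of supported elements), a minimal mRSS entry might be:

<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <title>Example Video Feed</title>
    <link>http://www.example.com/videos/</link>
    <description>Videos published on example.com</description>
    <item>
      <title>Sample video title</title>
      <link>http://www.example.com/videos/sample-video.html</link>
      <media:content url="http://www.example.com/videos/sample-video.mp4" type="video/mp4">
        <media:title>Sample video title</media:title>
        <media:description>A short description of the video content</media:description>
        <media:thumbnail url="http://www.example.com/images/sample-video-thumb.jpg"/>
      </media:content>
    </item>
  </channel>
</rss>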


Limitations on Sitemap files

There are a number of rules and limits imposed on Sitemap files, either by the protocol or by the search engines:

  • Identify the basic Sitemap namespace. The Sitemap must specify the default XML namespace: xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
  • Size limits. Sitemap files should not exceed 50,000 URL entries or 50 MB in size (uncompressed). If you need to have more than 50,000 entries in your Sitemap, use a Sitemap index file (discussed in the next section below).
  • Text requirements. Sitemap files must be UTF-8 encoded and use entity-escaped characters in URLs when needed, such as replacing “&” with “&amp;” in a dynamic URL variable (see the example following this list).
  • Consistent syntax. All URLs within a Sitemap must use the same URL syntax. This means don’t mix URLs starting with the “www.” prefix with those that omit that prefix. Also, don’t include URLs that incorporate session IDs.
  • Location matters. Sitemap files stored in a directory can only reference URLs stored in that directory or its child directories. URLs in parallel directories, parent directories to the Sitemap’s directory, different subdomains, or those using a different protocol (such as https: versus http:) are not valid references. As such, storing your Sitemap in the root directory helps avoid invalid references.
  • Clean links. There should be no more than 1% link errors or the entire Sitemap may be discarded. (A link error is any HTTP response code other than 200, including 404 for broken links, as well as 301 and 302 for redirected links.) This is specifically a known rule with Bing, but using clean links is an SEO best practice for all search engines. Note: You can check the HTTP response code for an individual URL using the Header Checker Tool, or for multiple URLs on the same site, use the Find Broken Links, Redirects & Google Sitemap Generator Free Tool, both available for free at Internet Marketing Ninjas Tools.
  • Cross-site references not universally compatible. Google allows for cross-site URL submission in a Sitemap only if you prove you own all of the referenced sites, which is done by verifying them in your Google Webmaster Tools account. Note that cross-site Sitemaps are not compatible with Bing, and thus are not recommended for universal usage.
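
For example, the entity-escaped form of a dynamic URL in its <loc> tag looks like this:

<loc>http://www.example.com/catalog?item=12&amp;desc=vacation</loc>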

Sitemap index files

Many large sites have more than 50,000 content URLs that they consider worthy of indexing in search. To get around the 50,000-entry limit on Sitemap files, webmasters can instead create a Sitemap index file. A Sitemap index file references other Sitemap files rather than page URLs directly. A Sitemap index also can include up to 50,000 entries, which theoretically means you can submit up to 2.5 billion URLs from your site to search.

The XML code structure of Sitemap index files is very similar to that of Sitemap files. A sample of the Sitemap index XML code looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://www.example.com/sitemap01.xml</loc>
    <lastmod>2012-04-30</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://www.example.com/sitemap02.xml</loc>
    <lastmod>2012-04-30</lastmod>
  </sitemap>
</sitemapindex>

Similar to the standard Sitemap file format, the Sitemap index XML code includes both required and optional tags. Valid Sitemap index XML code starts with the standard <?xml ... ?> declaration and its required attributes. It’s followed by one instance of the <sitemapindex> tag, which includes an xmlns attribute referencing the namespace that defines the Sitemap index XML schema structure.

Each Sitemap referenced in the Sitemap index requires both a <sitemap> tag and a <loc> tag nested within it. The tags used in the Sitemap index XML protocol are defined below:

  • <?xml> (Required): The opening declaration of the file, which includes the required attributes version="1.0" and encoding="UTF-8".
  • <sitemapindex> (Required): Used only once, this tag includes the required namespace attribute xmlns="http://www.sitemaps.org/schemas/sitemap/0.9".
  • <sitemap> (Required): Parent tag for each Sitemap added. All remaining tags are nested within this tag.
  • <loc> (Required): The Sitemap URL. Use the full URL, including protocol.
  • <lastmod> (Optional): Date of the Sitemap’s last revision, written in YYYY-MM-DD format (per W3C Datetime).

Implement your Sitemap

The process of implementing your own Sitemap so it’s available to the search engine crawlers is fairly straightforward. Just follow these steps:

  1. Identify your most important content pages for search. You can ignore shopping cart pages, user login pages and non-canonical pages (URLs using dynamic URL variables when none are needed to access the page).
  2. Create your Sitemap with a CMS or external tool. If you publish to the web with a content management system, it likely already has a Sitemap generator tool included (this will be the best route to go if you need to create multiple Sitemap files and a Sitemap index). However, if you have no CMS or have a smaller site, you can also choose to use a third-party Sitemap generator tool, such as the Find Broken Links, Redirects & Google Sitemap Generator Free Tool from Internet Marketing Ninjas. In any case, you may want to review and edit the resulting file to ensure your non-content pages were excluded.
  3. Validate your Sitemap file. Before you publish the file for search crawlers to consume, confirm that it’s valid. There are many Sitemap validation tools available on the web.
  4. Post the Sitemap file on your website. As suggested earlier, the root directory is a great place to put it, as you avoid most invalid URL issues there. That said, when using a Sitemap index, posting the individual Sitemaps in the directories of the URLs they contain is a perfectly valid strategy. Also, once the Sitemap (or the Sitemap index) has been posted, be sure to update your site’s robots.txt file with a reference to the Sitemap’s location. Add a line of text similar to the following sample to the end of your robots.txt file:
    Sitemap: http://www.example.com/sitemap.xml
  5. Register your Sitemap with Google and Bing. Unlike with the robots.txt file, search engines do not automatically look for a Sitemap and read it if found. You need to explicitly register your Sitemap with them. The best way to do this is to use the Webmaster Tools of Google (click Site configuration > Sitemaps > Add/Test Sitemap) and Bing (click Crawl > Sitemaps (XML, Atom, RSS) > Add Feed). If you don’t already have an active account with each, consider this to be the reason to create one today.
  6. Update and repost your Sitemap file as site content changes. Your site is not likely a static entity. Reflect the new and changed content on your site with periodic updates to your Sitemap file. How often you need to do so depends on how often you change or add content to your site. Some sites can update their Sitemaps monthly and keep up with new content. News sites, on the other hand, will do so on a minute-by-minute basis.
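
Regarding that last step, if you want to nudge a crawler between its regular visits, both Google and Bing also accept a simple HTTP “ping” request that prompts a re-read of an updated Sitemap. The request is just a URL fetch of the following form (substitute your own Sitemap’s URL-encoded address for the example.com placeholder):

http://www.google.com/ping?sitemap=http%3A%2F%2Fwww.example.com%2Fsitemap.xml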

Search engines eat good web content like food. Feed them your best content with a valid, clean Sitemap.

Comments

  1. Paul May 1, 2012 at 10:37 AM

    Great article and I’m glad we follow all of these rules on our site!

    I wonder if you have any opinion on any other search engines which are useful? We have a growing proportion of traffic from Yandex (Russia) and Baidu (China). Would you recommend taking any action with these sites?

  2. Rick DeJarnette May 1, 2012 at 2:01 PM

    I looked into both Yandex and Baidu with regard to creating and submitting XML-based Sitemaps, and it appears they both support the standard protocol. Check out http://company.yandex.com/press_center/press_releases/2008/2008-08-15.xml and http://www.baidu.com/search/sitemap_help.html for details (I used http://translate.google.com to read the Baidu page).

    Bottom line: Create your standard XML-based Sitemap file following the protocol defined on http://www.sitemaps.org/protocol.html (per the advice given in this post). Identify the regional markets in which you want to participate, note which search engines have substantial market share there, and then go through each of those search engines’ formal Sitemap feed submission process to ensure it is accepted. As backup insurance, however, also list the Sitemap in your robots.txt file as well.

    Thanks for writing!

  3. Zoe Alexander May 2, 2012 at 5:48 AM

    Hi Rick, this is a very comprehensive post. Being new to the internet world where should I start first? Where can I find my Sitemap if I don’t have access to my site code? I have a retail e-commerce site and content changes quite regularly. I have access to my administration area and I can ‘ping’ my sitemap to google for the bots to search. Should I do this every time I do a major upload of new content? Also where would I find the robots.txt file you mention? Many thanks!

    1. Rick DeJarnette May 2, 2012 at 1:25 PM

      Zoe,

      The robots.txt file is always stored in the root directory for a site domain (I created a link to my earlier post on using robots.txt files in the post above — thanks for the nudge to do that!). Be sure to check that out. Sitemaps, on the other hand, might be stored practically anywhere in the site structure and can use most any file name the webmaster wishes (within the limitations stated in the post above). That said, most commonly the file is named sitemap.xml and is stored in the root directory of the site. Your site may or may not already have one in place.

      Reading between the lines of your question leads me to believe you may not have admin access to the site in question. Without that, you will have a hard time working with robots.txt and Sitemap files. How you create a Sitemap depends on whether you use a content management system (CMS) or not, how many content pages are on your site, whether your site is a WordPress blog or not, and several other criteria. CMS users can typically use the CMS to publish a Sitemap along with the content pages. WordPress users can install plugins that create and manage Sitemaps. If the site is small, you can create one manually as well (but you’ll need that admin access to post it).

      Once a search engine is notified that your site uses a Sitemap (via registration), you don’t need to re-register the Sitemap or ping that search engine to reread the file each time it is updated. Registering the Sitemap the first time makes reading the contents of that file an automatic part of the crawl process for the site.

      I hope this helps! Thanks for writing!

  4. Zoe Alexander May 2, 2012 at 6:06 AM

    Hi Rick! Sorry another question to fire over to you! I don’t know why despite my daily progression up the alexa rankings I still struggle to get any visibility on the front pages at Google. I can’t see where it is going wrong! Am I missing something vital? How can other competitor sites with significantly worse alexa rankings still be on pages 1-4 when we seem to be stuck on pages 7 and lower? Many many thanks!

    1. Rick DeJarnette May 2, 2012 at 1:38 PM

      Zoe,

      There isn’t always a clear correlation between data shown in sites like Alexa and how Google ranks specific pages. You’re really asking how to use SEO to improve rankings, and that is a very big question. There are so many variables involved that there’s no quick answer that applies to all sites in all circumstances.

      As such, I suggest you go to http://www.theseoace.com/resources/ (that’s my personal list of great SEO resources) and check the links under “Learn more about SEO”, “Search Engine Guidelines”, and “Blogs”. There’s a university education on SEO there to be had. Good luck!

  5. Patricia McDermond July 25, 2012 at 7:10 PM

    Hi Rick,
    I just created a web site with text embedded in images. (not recommended I know :() In order to become “discovered”, I have made a site map that lists links to images on the web site; I have also written descriptions in the “alt” tags. I also want to make an image sitemap. I see samples of the coding but I don’t know what file name to save it as! Here’s the link to the site I’m talking about, http://www.nypleinairpainters.com. There are captions, titles and links buried in the image files!
    Can you help me?
    Many thanks,

    Patricia

  6. Scott August 31, 2012 at 10:57 AM

    Rick, I have a one page site that is dynamically generated with JS and Prototype so none of the content is contained in the HTML. All of the content is contained in XML files and as a user clicks around in the site, the XML content is used to populate content sections that get created when needed. How can I get Search Engines to index the contents of my xml data files (which are generated via a backend system and change routinely) and point them at index.html? I can get to any of the content with Query String parameters.
    example site: http://norcrossbluedevils.org/

    1. nick November 15, 2012 at 5:31 AM

      I would like to understand why you said cross domain sitemaps are not compatible with Bing. Since 2008 Bing and Yahoo stated they support, together with Google cross-domain sitemaps: http://www.bing.com/community/site_blogs/b/webmaster/archive/2008/02/28/microsoft-to-support-cross-domain-sitemaps.aspx

  7. Chris Thurston April 15, 2013 at 9:48 PM

    Hi Rick, great post! My question is very simple: can you reference the same URL(s) in different Sitemaps? As a simple example, let’s say that you’re a car site and you create one Sitemap for all of your Sedans, and another Sitemap for all of your Ford related pages. You might do this to build a theme around some pages so Google knows what you’re about. So if you had a page for Ford Sedans could you list it in both Sitemaps (if they’re both stored in the root directory)? Is that acceptable or should each URL be in one Sitemap only? Thanks in advance.

  8. twitter_joomladka September 13, 2013 at 12:29 PM

    Thanks for this article. It really saved my day while I was learning the ropes with sitemap.xml.

    Cheers,

    Alex
