01 May 2012

The Ultimate Guide to XML Sitemaps

The World Wide Web is a massive place and continues to grow at a phenomenal rate. According to Netcraft, a March 2012 survey revealed there are well over half a billion websites, 644,275,754 to be exact, and that figure represented a rise of 31.4 million (a 5.1% growth) from just the previous month! And that’s only counting websites. How many published webpages does that add up to? The projected numbers vary wildly, given the rate of change and of growth, not to mention so much page content duplication pointed to by multiple URL variables. But whatever it is, let’s face it – it’s massive.

So what are little website owners to do in getting their content pages indexed in such a massively crowded universe of web content? (And frankly, big site owners ask the same question.) Addressing that issue is a big part of the work SEOs perform, and one of the best tools we have for feeding the search engines is the XML-based Sitemap.


What is a Sitemap?

A Sitemap is not the same thing as a sitemap. Allow me to clarify. A Sitemap file (using the capital S) is an XML-encoded listing of the most important content files within a site, built specifically for search engine crawlers to consume as a data feed. By contrast, a sitemap file (written in lower-case s) is typically an HTML file that lists the most important content files within a site, but this file is intended for human users to browse and find the content they want to read within a site. The key difference in purpose is the intended audience, and thus the markup coding used within the file.

Search engines use Sitemaps to learn about a site’s structure, to better plan their crawl budgets, and as a webmaster-generated feed of suggested URLs to crawl. Please understand that listing a webpage’s URL in a Sitemap does not guarantee its inclusion in the search index. Not at all. However, it does mean that if the Sitemap uses well-formed XML code, supplies clean, valid URLs, and meets the other requirements of the search engines, the URLs it contains will at least be noted for consideration by the search engines for future crawling activity. That alone improves the chances of discovery and crawling for many pages that would otherwise have to rely solely on links.

While Sitemaps are helpful for websites to have under any circumstance, they are especially helpful in the following conditions:

  • New sites with new pages not yet well-linked (both internally and externally)
  • Sites that use dynamic URLs for their content pages
  • Sites with archived content that’s not well-linked to its currently active pages
  • Sites with hard-to-discover pages that use hard-to-crawl links (such as those in scripts) or are heavy in non-text content, such as rich Internet applications (Flash or Silverlight)

Sitemaps are essentially content discoverability feeds to the search engines.


Structure of Sitemap XML

All of the major search engines support the formalized XML data protocol as defined on Sitemaps.org. A sample of the XML code used in a Sitemap looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2012-04-30</lastmod>
    <changefreq>monthly</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>http://www.example.com/about-us.html</loc>
    <changefreq>yearly</changefreq>
    <priority>0.4</priority>
  </url>
</urlset>

The Sitemap XML code includes both required and optional tags. Valid Sitemap XML code starts with the standard XML declaration, “<?xml>”, and its required attributes. It’s followed by a single “<urlset>” tag, which includes at least one “xmlns” attribute referencing the namespace that defines the XML schema structure. Each URL referenced in the Sitemap requires both a <url> tag and a <loc> tag nested within it. The remaining tags are optional.

The tags used in the Sitemap XML protocol are defined in the table below:

Tag | Status | Description
<?xml> | Required | Opening tag of file. Includes the required attributes version="1.0" and encoding="UTF-8".
<urlset> | Required | Used only once, this tag includes the required namespace attribute xmlns="http://www.sitemaps.org/schemas/sitemap/0.9". Google supports including additional namespace attributes and values for mixed data types, including Images (xmlns:image="http://www.google.com/schemas/sitemap-image/1.1"), Video (xmlns:video="http://www.google.com/schemas/sitemap-video/1.1"), Mobile (xmlns:mobile="http://www.google.com/schemas/sitemap-mobile/1.0"), Code search (xmlns:codesearch="http://www.google.com/codesearch/schemas/sitemap/1.0"), and News (xmlns:news="http://www.google.com/schemas/sitemap-news/0.9").
<url> | Required | Parent tag for each URL added. All remaining tags are nested within this tag.
<loc> | Required | The webpage URL. Use the full URL, including protocol, not to exceed 2,048 characters.
<lastmod> | Optional | Date of page’s last revision, written in YYYY-MM-DD format (per W3C Datetime).
<changefreq> | Optional | Expected frequency of page revisions, treated as a hint by search engines. Valid values: always (changes each time the page is accessed), hourly, daily, weekly, monthly, yearly, never (used for archived pages).
<priority> | Optional | The priority value of a page relative to others on your site. Valid values range from 0.0 to 1.0, with the default at 0.5.
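To make the tag rules concrete, here is a minimal Python sketch that writes a Sitemap using only the standard library. The function name build_sitemap and the dictionary layout of each entry are my own assumptions for illustration, not part of the protocol. Note that ElementTree entity-escapes characters such as “&” in URLs automatically.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(entries):
    """Return Sitemap XML as UTF-8 bytes (hypothetical helper).

    Each entry is a dict with a required "loc" key and optional
    "lastmod", "changefreq" and "priority" keys.
    """
    # Setting xmlns as a plain attribute puts every child tag in the
    # Sitemap namespace once the file is parsed.
    urlset = ET.Element("urlset", {"xmlns": SITEMAP_NS})
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = entry["loc"]  # required tag
        for tag in ("lastmod", "changefreq", "priority"):  # optional tags
            if tag in entry:
                ET.SubElement(url, tag).text = str(entry[tag])
    body = ET.tostring(urlset, encoding="unicode")
    # Prepend the XML declaration with its version and encoding attributes.
    return ('<?xml version="1.0" encoding="UTF-8"?>\n' + body).encode("utf-8")

sitemap_xml = build_sitemap([
    {"loc": "http://www.example.com/", "lastmod": "2012-04-30",
     "changefreq": "monthly", "priority": 1.0},
    {"loc": "http://www.example.com/about-us.html",
     "changefreq": "yearly", "priority": 0.4},
])
print(sitemap_xml.decode("utf-8"))
```

The output is written on a single line rather than indented like the sample above, which makes no difference to a crawler.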

File formats and types

Both Google and Bing accept the XML format for Sitemaps as specified in the protocol description on Sitemaps.org. However, they also accept “Sitemap” feeds in other formats, such as RSS 2.0, Atom 1.0, and basic text files (one URL per line). Sitemaps can be posted on websites either as normal XML (.xml files) or with Gzip file compression (.gz files).
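As a rough illustration of two of these alternatives, the Python sketch below builds a text-file Sitemap (one full URL per line) and Gzip-compresses it with the standard library. The helper names are hypothetical, and the same compression step would apply to an XML Sitemap.

```python
import gzip

def build_text_sitemap(urls):
    """Return a plain-text Sitemap: one full URL per line, nothing else."""
    return "\n".join(urls) + "\n"

def compress_sitemap(data):
    """Gzip-compress Sitemap bytes, e.g. for publishing as sitemap.xml.gz."""
    return gzip.compress(data)

urls = ["http://www.example.com/", "http://www.example.com/about-us.html"]
text_sitemap = build_text_sitemap(urls).encode("utf-8")
compressed = compress_sitemap(text_sitemap)

# Decompressing returns the original bytes, so no content is lost.
assert gzip.decompress(compressed) == text_sitemap
```

Compression only reduces what travels over the wire. The crawler still reads the same file once it is unpacked.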

In addition to the standard Sitemap for web content (referencing HTML and other common webpage content), Google supports specialized Sitemap extensions that are dedicated to specific media types. These include Sitemaps for video, images, mobile content, software source code, and news content, all of which include additional, specific metadata for search engines to use for classifying the data about the media files found on your site. As long as the appropriate Sitemap extension is declared in the Sitemap namespace, all of this additional content can be added to the same Sitemap file.

Notes:

  • Bing has not announced support for the Google-supported Sitemap extensions, but Duane Forrester of Bing said “anything not supported will be ignored”, so webmasters should feel free to include these references in their Sitemap files.
  • Google recommends that news content Sitemaps be separate files as they are crawled at a much higher frequency.
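Here is a hedged Python sketch of how an extension namespace can sit alongside the standard one in a single file, using Google’s image extension and its <image:image> and <image:loc> tags. The function name and the input format (a page URL mapped to its image URLs) are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
IMAGE_NS = "http://www.google.com/schemas/sitemap-image/1.1"

# Tell ElementTree which prefixes to write: none for the core protocol,
# "image" for the extension. This is process-wide state.
ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("image", IMAGE_NS)

def qname(namespace, tag):
    """Build the '{namespace}tag' form ElementTree uses for qualified names."""
    return "{%s}%s" % (namespace, tag)

def build_image_sitemap(pages):
    """Return Sitemap XML text (hypothetical helper).

    pages maps each page URL to a list of image URLs found on that page.
    """
    urlset = ET.Element(qname(SITEMAP_NS, "urlset"))
    for page_url, image_urls in pages.items():
        url = ET.SubElement(urlset, qname(SITEMAP_NS, "url"))
        ET.SubElement(url, qname(SITEMAP_NS, "loc")).text = page_url
        for image_url in image_urls:
            image = ET.SubElement(url, qname(IMAGE_NS, "image"))
            ET.SubElement(image, qname(IMAGE_NS, "loc")).text = image_url
    # The XML declaration is left out here; add it when writing the file.
    return ET.tostring(urlset, encoding="unicode")

print(build_image_sitemap({
    "http://www.example.com/gallery.html": [
        "http://www.example.com/images/photo1.jpg",
    ],
}))
```

The other extensions have their own tags and required metadata, so check the documentation for each before adapting this pattern.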

As an alternative to using a Sitemap to identify video metadata for search, both Google and Bing support the use of media RSS (aka mRSS) files for this purpose. And since Bing does support this method, developing an mRSS video feed for both Google and Bing will be a smarter bet for sites containing a large number of video files they want indexed in search. Check out the details on both Google and Bing for more information about their officially supported mRSS feeds.


Limitations on Sitemap files

There are a number of rules and limits imposed on Sitemap files, either by the protocol or by the search engines:

  • Identify the basic Sitemap namespace. The Sitemap must specify the default XML namespace: xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
  • Size limits. Sitemap files should not exceed 50,000 URL entries or 50 MB in size (uncompressed). If you need to have more than 50,000 entries in your Sitemap, use a Sitemap index file (discussed in the next section below).
  • Text requirements. Sitemap files must be UTF-8 encoded and use entity-escaped characters in URLs when needed (such as replacing “&” with “&amp;” in a dynamic URL variable).
  • Consistent syntax. All URLs within a Sitemap must use the same URL syntax. This means don’t mix URLs starting with the “www.” prefix with those that omit that prefix. Also, don’t include URLs that incorporate session IDs.
  • Location matters. Sitemap files stored in a directory can only reference URLs stored in that directory or its child directories. URLs in parallel directories, parent directories to the Sitemap’s directory, different subdomains, or those using a different protocol (such as https: versus http:) are not valid references. As such, storing your Sitemap in the root directory helps avoid invalid references.
  • Clean links. There should be no more than 1% link errors or the entire Sitemap may be discarded. (A link error is considered any HTTP response code other than 200, including 404 for broken links, as well as 301 and 302 for redirected links.) This is specifically a known rule with Bing, but using clean links is an SEO best practice for all search engines. Note: You can check the HTTP response code for an individual URL using the Header Checker Tool, or for multiple URLs on the same site, use the Find Broken Links, Redirects & Google Sitemap Generator Free Tool, both available for free at Internet Marketing Ninjas Tools.
  • Cross-site references not universally compatible. Google allows for cross-site URL submission in a Sitemap only if you prove you own all of the referenced sites, which is done by verifying them in your Google Webmaster Tools account. Note that cross-site Sitemaps are not compatible with Bing, and thus are not recommended for universal usage.
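Several of the rules above can be checked before you ever publish the file. The Python sketch below is one possible pre-flight check covering the entry limit, URL length, consistent protocol and host, and the directory rule. The function name and message wording are my own assumptions, and it does not fetch URLs, so the 1% link-error rule still needs a separate check.

```python
from urllib.parse import urlsplit
from xml.sax.saxutils import escape

MAX_URLS = 50_000        # entry limit per Sitemap file
MAX_LOC_LENGTH = 2_048   # character limit per URL

def check_sitemap_urls(sitemap_url, urls):
    """Return a list of problems found; an empty list means none (hypothetical helper)."""
    problems = []
    if len(urls) > MAX_URLS:
        problems.append("%d URLs exceeds the %d-entry limit" % (len(urls), MAX_URLS))
    sitemap = urlsplit(sitemap_url)
    # Directory that holds the Sitemap, e.g. "/blog/" for /blog/sitemap.xml.
    base_dir = sitemap.path.rsplit("/", 1)[0] + "/"
    for url in urls:
        parts = urlsplit(url)
        path = parts.path or "/"  # treat a bare host as the root page
        if len(url) > MAX_LOC_LENGTH:
            problems.append("too long: %s..." % url[:60])
        if (parts.scheme, parts.netloc) != (sitemap.scheme, sitemap.netloc):
            problems.append("different protocol or host: %s" % url)
        elif not path.startswith(base_dir):
            problems.append("outside the Sitemap's directory: %s" % url)
    return problems

for problem in check_sitemap_urls(
    "http://www.example.com/blog/sitemap.xml",
    [
        "http://www.example.com/blog/post-1.html",   # fine
        "https://www.example.com/blog/post-2.html",  # different protocol
        "http://example.com/blog/post-3.html",       # missing "www." prefix
        "http://www.example.com/about-us.html",      # parent directory
    ],
):
    print(problem)

# Entity-escaping a dynamic URL by hand, if you are not using an XML library:
print(escape("http://www.example.com/view?id=1&page=2"))
```

Running the same check against a Sitemap stored in the root directory passes the about-us.html URL, which is why the root is the safest place to keep the file.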

Sitemap index files

Many large sites have more than 50,000 content URLs that they consider worthy of indexing in search. To get around the 50,000-entry limit on Sitemap files, webmasters can instead create a Sitemap index file. A Sitemap index file references other Sitemap files rather than page URLs directly. A Sitemap index also can include up to 50,000 entries, which theoretically means you can submit up to 2.5 billion URLs from your site to search.

The XML code structure of Sitemap index files is very similar to Sitemap files. A sample of the Sitemap index XML code looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://www.example.com/sitemap01.xml</loc>
    <lastmod>2012-04-30</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://www.example.com/sitemap02.xml</loc>
    <lastmod>2012-04-30</lastmod>
  </sitemap>
</sitemapindex>

Similar to the standard Sitemap file format, the Sitemap index XML code also includes both required and optional tags. Valid Sitemap index XML code starts with the standard XML declaration, “<?xml>”, and its required attributes. It’s followed by a single “<sitemapindex>” tag, which includes one “xmlns” attribute referencing the namespace that defines the Sitemap index XML schema structure.

Each Sitemap referenced in the Sitemap index requires both a <sitemap> tag and a <loc> tag nested within. The tags used in the Sitemap index XML protocol are defined in the table below:

Tag | Status | Description
<?xml> | Required | Opening tag of file. Includes the required attributes version="1.0" and encoding="UTF-8".
<sitemapindex> | Required | Used only once, this tag includes the required namespace attribute xmlns="http://www.sitemaps.org/schemas/sitemap/0.9".
<sitemap> | Required | Parent tag for each Sitemap added. All remaining tags are nested within this tag.
<loc> | Required | The Sitemap URL. Use the full URL, including protocol.
<lastmod> | Optional | Date of Sitemap’s last revision, written in YYYY-MM-DD format (per W3C Datetime).
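For a large site, the split-and-index approach might look like the following Python sketch. It cuts a URL list into chunks of at most 50,000 and builds an index that points to one Sitemap per chunk. The function names and the sitemapNN.xml naming pattern are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_ENTRIES = 50_000  # limit for both Sitemap files and Sitemap index files

def chunk_urls(urls, size=MAX_ENTRIES):
    """Yield successive slices of at most `size` URLs."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]

def build_sitemap_index(sitemap_urls, lastmod=None):
    """Return Sitemap index XML text for the given Sitemap file URLs."""
    index = ET.Element("sitemapindex", {"xmlns": SITEMAP_NS})
    for sitemap_url in sitemap_urls:
        sitemap = ET.SubElement(index, "sitemap")
        ET.SubElement(sitemap, "loc").text = sitemap_url  # required tag
        if lastmod:
            ET.SubElement(sitemap, "lastmod").text = lastmod  # optional tag
    body = ET.tostring(index, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body

# 120,000 page URLs need three Sitemap files: 50,000 + 50,000 + 20,000.
all_urls = ["http://www.example.com/page%d.html" % n for n in range(120_000)]
chunks = list(chunk_urls(all_urls))
sitemap_urls = [
    "http://www.example.com/sitemap%02d.xml" % n
    for n in range(1, len(chunks) + 1)
]
index_xml = build_sitemap_index(sitemap_urls, lastmod="2012-04-30")

print(len(chunks))                # 3
print(MAX_ENTRIES * MAX_ENTRIES)  # 2500000000, the theoretical ceiling
```

Each chunk would then be written out as its own Sitemap file under the matching name before the index is published.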

Implement your Sitemap

The process of implementing your own Sitemap so it’s available to the search engine crawlers is fairly straightforward. Just follow these steps:

  1. Identify your most important content pages for search. You can ignore shopping cart pages, user login pages and non-canonical pages (URLs using dynamic URL variables when none are needed to access the page).
  2. Create your Sitemap with a CMS or external tool. If you publish to the web with a content management system, it likely already has a Sitemap generator tool included (this will be the best route to go if you need to create multiple Sitemap files and a Sitemap index). However, if you have no CMS or have a smaller site, you can also choose to use a third-party Sitemap generator tool, such as the Find Broken Links, Redirects & Google Sitemap Generator Free Tool from Internet Marketing Ninjas. In any case, you may want to review and edit the resulting file to ensure your non-content pages were excluded.
  3. Validate your Sitemap file. Before you publish the file for search crawlers to consume, confirm that it’s valid. There are many Sitemap validation tools available on the web.
  4. Post the Sitemap file on your website. As suggested earlier, the root directory is a great place to put it, as you avoid most invalid URL issues there. That said, when using a Sitemap index, posting the individual Sitemaps in the directories of the URLs they contain is a perfectly valid strategy. Also, once the Sitemap (or the Sitemap index) has been posted, be sure to update your site’s robots.txt file with a reference to the Sitemap’s location. Add a line of text similar to the following sample to the end of your robots.txt file:
    Sitemap: http://www.example.com/sitemap.xml
  5. Register your Sitemap with Google and Bing. Unlike with the robots.txt file, search engines do not automatically look for a Sitemap and read it if found. You need to explicitly register your Sitemap with them. The best way to do this is to use the Webmaster Tools of Google (click Site configuration > Sitemaps > Add/Test Sitemap) and Bing (click Crawl > Sitemaps (XML, Atom, RSS) > Add Feed). If you don’t already have an active account with each, consider this to be the reason to create one today.
  6. Update and repost your Sitemap file as site content changes. Your site is not likely a static entity. Reflect the new and changed content on your site with periodic updates to your Sitemap file. How often you need to do so depends on how often you change or add content to your site. Some sites can update their Sitemaps monthly and keep up with new content. News sites, on the other hand, will do so on a minute-by-minute basis.
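If your Sitemap is regenerated by a script, the robots.txt update from step 4 can be automated as well. The Python sketch below appends the Sitemap line only when it is missing, so it is safe to run on every update. The function name is hypothetical, and it works on the file’s text rather than touching the file system.

```python
def add_sitemap_to_robots(robots_txt, sitemap_url):
    """Return robots.txt text with a Sitemap line appended if it is missing."""
    directive = "Sitemap: " + sitemap_url
    existing = [line.strip() for line in robots_txt.splitlines()]
    if directive in existing:
        return robots_txt  # already referenced, nothing to change
    if not robots_txt.strip():
        return directive + "\n"
    # Keep the existing rules and add the reference at the end of the file.
    return robots_txt.rstrip("\n") + "\n\n" + directive + "\n"

original = "User-agent: *\nDisallow: /cart/\n"
updated = add_sitemap_to_robots(original, "http://www.example.com/sitemap.xml")
print(updated)
```

Running it a second time on the updated text returns that text unchanged.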

Search engines eat good web content like food. Feed them your best content with a valid, clean Sitemap.