How does Siteimprove use sitemaps?
By Sean Needham
What is a sitemap?
When we talk about a sitemap in this article, we are referring to the XML file found on a website that identifies pages to be indexed by a search engine crawler.
The sitemap’s location is normally declared in the website’s robots.txt file, and the file itself is usually found in the root directory of your web server, e.g. https://your_website_domain.com/sitemap.xml.
Note: In this article, we are not referring to the sitemap found in the Quality Assurance Inventory, which shows how your site is divided into directories and folders based on Siteimprove’s crawler data.
How does Siteimprove use sitemaps to determine the content to be checked for issues in QA, Accessibility, etc.?
By default, Siteimprove will not purposefully crawl all content listed in a website’s sitemap. The Siteimprove crawler usually starts at the Primary Index URL and then follows the links found on each page to discover the content on your website. For further information on the crawler, see “Siteimprove's Crawler: Frequently Asked Questions”.
We will only crawl the content of a sitemap if it has been added as the Primary Index URL (Site URL) or added as an Extra Index URL by our Technical Support team. Otherwise, we only consider the sitemap file for presenting results within the SEO module.
How does Siteimprove use sitemaps to determine the results that appear in SEO “Pages not included in the sitemap”?
Siteimprove will review sitemaps listed in the robots.txt file, including any sitemap index files that link to other sitemaps (i.e. nested sitemaps). If no sitemap files are listed there, we will look in the root directory of your web server, e.g. https://your_website_domain.com/sitemap.xml.
If no sitemap is found in these locations, then a ‘Missing Sitemaps’ warning will be presented within SEO Issues and Recommendations.
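The lookup order described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Siteimprove's actual implementation: the function name sitemap_candidates and its parameters are hypothetical, and fetching the candidate URLs (and raising a warning when none of them resolves) is left out.

```python
from urllib.parse import urljoin


def sitemap_candidates(robots_txt: str, site_root: str) -> list[str]:
    """Return sitemap URLs declared in robots.txt, else the root-directory fallback.

    Hypothetical helper for illustration only.
    """
    found = []
    for line in robots_txt.splitlines():
        # Drop trailing comments, then split "Field: value" on the first colon only,
        # so the colon inside "https://" stays part of the value.
        line = line.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    if found:
        return found
    # No Sitemap lines in robots.txt: fall back to /sitemap.xml at the site root.
    return [urljoin(site_root, "/sitemap.xml")]
```

For example, a robots.txt with no Sitemap line for https://your_website_domain.com/ yields the single candidate https://your_website_domain.com/sitemap.xml, matching the fallback location named above.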
If a sitemap is found, then any pages that are indexed by our crawler but not mentioned in the sitemap will be listed as ‘Pages not included in the sitemap’ within SEO Issues and Recommendations.
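Conceptually, this check is a set difference between the pages the crawler indexed and the URLs listed in the sitemap. The sketch below illustrates the idea; the function name is hypothetical, and it assumes exact string comparison of URLs, which may differ from how Siteimprove normalizes or matches URLs in practice.

```python
def pages_not_in_sitemap(crawled_urls, sitemap_urls):
    """Return crawled URLs that are absent from the sitemap, sorted for stable output.

    Assumption: URLs are compared as exact strings, with no normalization.
    """
    listed = set(sitemap_urls)
    return sorted(url for url in set(crawled_urls) if url not in listed)
```

Pages that appear in the sitemap but were never reached by the crawler are not part of this result; only crawled pages missing from the sitemap are reported.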
How does Siteimprove handle nested sitemaps?
Nested sitemaps are supported if they are listed in sitemap index XML files, as specified by the sitemaps XML format protocol. This is true both for the discovery of URLs to crawl when sitemaps are added as Index URLs, and for SEO purposes when they are found in robots.txt or the root directory of the web server.
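To illustrate what following a sitemap index involves, here is a minimal Python sketch that expands an index file into the page URLs of the sitemaps it lists. It is not Siteimprove's implementation: collect_page_urls and the fetch callable (which takes a URL and returns the raw XML bytes) are hypothetical names introduced for the example. The namespace is the one defined by the sitemaps XML format protocol.

```python
import xml.etree.ElementTree as ET
from typing import Callable

# Namespace defined by the sitemaps XML format protocol.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def collect_page_urls(sitemap_url: str, fetch: Callable[[str], bytes], seen=None) -> list[str]:
    """Return page URLs from a sitemap, following sitemap index files.

    `fetch` is a hypothetical caller-supplied function: URL in, XML bytes out.
    """
    seen = set() if seen is None else seen
    if sitemap_url in seen:
        # Guard against files that reference each other in a loop.
        return []
    seen.add(sitemap_url)

    root = ET.fromstring(fetch(sitemap_url))
    locs = [el.text.strip() for el in root.iter(f"{NS}loc") if el.text]

    if root.tag == f"{NS}sitemapindex":
        # An index file: each <loc> points at another sitemap to expand.
        urls = []
        for child_url in locs:
            urls.extend(collect_page_urls(child_url, fetch, seen))
        return urls

    # A regular <urlset>: each <loc> is a page URL.
    return locs
```

Passing fetch in as a parameter keeps the parsing logic separate from how the files are downloaded, so the same function can be exercised with in-memory documents.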
What formats of sitemap are indexed by default for SEO purposes (e.g. .xml, .gz, .aspx)?
Siteimprove will download any sitemap listed as a sitemap in the Primary Index URL's robots.txt file for SEO purposes, regardless of file ending and server technology. A .gz file ending indicates that the content is gzipped, in which case we will decompress the sitemap before reading it.
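As an illustration of the .gz handling, the following Python sketch decompresses a downloaded sitemap body when its URL ends in .gz and passes any other file ending through unchanged. The function name and parameters are hypothetical, and keying on the file ending simply mirrors the description above rather than documenting Siteimprove's code.

```python
import gzip


def sitemap_bytes(url: str, body: bytes) -> bytes:
    """Return the sitemap content, decompressing it when the URL ends in .gz.

    Hypothetical helper: any other file ending (.xml, .aspx, ...) is returned as-is.
    """
    if url.lower().endswith(".gz"):
        return gzip.decompress(body)
    return body
```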
How does Siteimprove store XML-sitemaps used as index-URLs?
See the article "Storing XML-sitemaps used as index-URLs".