
Siteimprove's Crawler: Frequently Asked Questions

Sean Needham


What is the Siteimprove crawler?

Web crawlers are computer programs that scan the web, ‘reading’ everything they find. A crawler starts out by visiting your website and systematically identifying all hyperlinks on every page; it then follows each link to its destination.

Our crawlers scan your website from Siteimprove servers, using specific IP addresses and identifiable user agents. They use HTTP (Hypertext Transfer Protocol) requests to collect the HTML code on which the error checks are carried out.
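
To illustrate the general idea (this is a simplified sketch, not Siteimprove's actual implementation), a minimal crawler extracts hyperlinks from each page and follows any new ones within the same site. The `fetch` callable stands in for an HTTP request made with an identifiable user agent:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag found in an HTML document."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(base_url, html):
    """Return absolute, fragment-free URLs for all hyperlinks in a page."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urldefrag(urljoin(base_url, href))[0] for href in parser.links]

def crawl(start_url, fetch, limit=100):
    """Breadth-first crawl: visit start_url, queue every new same-site
    link, and repeat until the queue is empty or the limit is reached.

    `fetch` is a callable returning the HTML body for a URL; a real
    crawler would issue an HTTP GET with an identifiable User-Agent here.
    """
    seen, queue = {start_url}, [start_url]
    while queue and len(seen) < limit:
        url = queue.pop(0)
        for link in extract_links(url, fetch(url)):
            if link.startswith(start_url) and link not in seen:
                seen.add(link)
                queue.append(link)
    return seen
```

Running `crawl` against a stubbed three-page site returns all three URLs, with the `#fragment` stripped and external links ignored.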

The data harvested by the crawler is stored in Siteimprove's databases. Based on the content found on each page, information is reported to Siteimprove's online platform, such as accessibility issues, misspellings, and broken links.
Learn more about the Siteimprove crawler and how it identifies broken links.

Where can I find more information on my website crawl status?

You can find the most recent scan dates and the scan times for your sites in Crawler Management. 

Go to Settings > Crawler Management.

Why does Crawler Management show that a crawl is finished but I still can’t see it in QA Check history?

The crawl will show as finished in Crawler Management as soon as the crawl itself is complete. However, the QA check history will only update when the full scan, including the processing of data (link checking, accessibility, etc.), is complete.

At Settings > Crawler Management > Scan History, we show each stage of the scan and its status. If any stage in the scan history table says “Pending”, that scan is not complete.

The QA check history, along with all the data in the platform, will only update when a full scan is complete.

In the screenshot below, the crawl finished but processing of the data found in the crawl did not. Therefore, the QA check history won’t update.

Scan_history.png

You can read more about the scan stages in the scan process description.

Why does Crawler Management show a different number of pages and links than the QA Check history for a specific site?

When crawling a site, we analyze (parse) all the URLs. Afterward, we process the data, which includes removing links/pages based on exclusions, aliases, deduplication rules, etc. configured for your website.

  • Crawler Management shows all the pages and links found during a crawl.
  • QA Check history will show the pages and links that have been stored after exclusions, aliases, deduplication rules, etc. have been processed.
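
The post-crawl processing described above can be sketched roughly as follows. This is an illustration only: the function name, the prefix-based exclusions, and the hostname-alias rules are hypothetical simplifications of the site settings Siteimprove actually applies:

```python
from urllib.parse import urlparse

def apply_site_settings(urls, excluded_prefixes=(), aliases=None):
    """Hypothetical post-processing step: rewrite aliased hostnames to a
    canonical one, drop excluded paths, and de-duplicate what remains."""
    aliases = aliases or {}
    kept, seen = [], set()
    for url in urls:
        parts = urlparse(url)
        host = aliases.get(parts.netloc, parts.netloc)  # fold aliases together
        canonical = f"{parts.scheme}://{host}{parts.path}"
        if any(parts.path.startswith(p) for p in excluded_prefixes):
            continue  # removed by an exclusion rule
        if canonical in seen:
            continue  # duplicate after alias rewriting
        seen.add(canonical)
        kept.append(canonical)
    return kept
```

In this sketch, the input list plays the role of the Crawler Management count (everything found), and the returned list plays the role of the QA Check history count (everything stored), which is why the second number can be smaller.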

Why does Crawler Management show more pages and links for a site than the products (QA, Accessibility, Policy, SEO, Data Privacy)?

  • Crawler Management shows all the pages and links that we have seen during a crawl.
  • QA Check History shows the pages and links that have been stored after the crawl data has been processed, meaning those we have found minus the pages/links that have been excluded due to site settings.

See "Aliases and exclusions: How to add and remove content from a crawl" for information on exclusions. 

Why does Crawler Management show 0 pages for a site but the products (QA, Accessibility, Policy, SEO, Data Privacy) show all pages?

If we find 0 pages in a crawl, then Crawler Management will show 0 pages, but QA still stores all the pages from the last successful scan. This state will remain until there is a new successful scan that completes all three stages (queue, crawl, processing).

The crawl may find 0 pages due to a site being down temporarily, but this mechanism means users can still work on the results of the last successful scan until the next scan completes. See also "Typical Reasons for Crawl Problems".

 

How often is my website crawled?

By default, our servers crawl your website every 5 days. Between these crawls, we carry out periodic re-checks of broken links and of pages containing broken links, if the content has changed.

It is possible to change this schedule. A more frequent crawl will mean an extra cost. If you would like to change the frequency of your crawl, please contact Siteimprove.

How would switching from a non-JS crawl to a JS crawl affect my data?

See the article "How would switching from a non-JS crawl to a JS crawl affect my data?"

Where can I see the last and next crawl date?

You can see the last scan and next crawl dates in Settings > Crawler Management > Site overview.

You'll also see both the last and next crawl dates on the Crawl details widget on the Quality Assurance (QA), Accessibility, and SEO Overview pages.

Why does “Next crawl scheduled” show a date in the past on the Crawl details widget?

The Crawl details widget can show a “Next crawl scheduled” date in the past when the site is in a queue waiting for a crawl slot. See Crawler queuing for more information on queuing.

This can occur if your account allows too few simultaneous crawls. It will usually resolve itself; otherwise, feel free to contact Siteimprove Technical Support. Read more about Maximum simultaneous crawls.

Where can I see when a specific page was crawled?

At the top of the Page Report menu, you can see the date and time that specific page was last checked.

date_and_time_on_page_report

Can I recheck my site or pages outside of the normal crawl schedule?

Yes, it is possible to initiate a recheck at the following levels:

  • Single page
  • Multiple pages
  • Group of pages
  • Entire site

Learn more on how to re-crawl your pages, groups, and sites.

Note: Crawl duration varies depending on the number of pages on your site and the number of sites on your account crawling simultaneously.

Can I prevent specific sections of my site from being crawled?

Yes, you can set up exclusions to tell our crawler not to check certain sections of your website.

Does the Siteimprove crawler consider "noindex" or "nofollow" when deciding what pages to include?

No, our crawler does not consider "noindex" or "nofollow" when determining what content to crawl. We crawl based on your crawl settings. See also "Aliases and exclusions: How to add and remove content from a crawl".

How do I cancel the crawl of a website?

To cancel or stop a crawl on a website, please contact the Siteimprove technical support team with details of the site account and URL.

What steps can be taken to reduce unnecessary load on the webserver during crawling?

  • Siteimprove uses intelligent algorithms and looks at several parameters to determine when and what to re-check. For example, we use an MD5 key to determine whether a page has changed; if the page has not changed, there is no need for a recheck.
  • The default delay between HTTP requests is 200 milliseconds. Pauses of up to 20,000 milliseconds between requests will be added automatically if we suspect the crawler is affecting the site's performance.
  • If necessary, pauses between HTTP requests can be added manually by Siteimprove.
  • We automatically stop crawling the site if we get several timeouts or if we notice internal errors from the website's server.
  • Siteimprove can limit the crawl by page number, page level, or by the number of links.
  • The crawl can be configured to start at a particular time/day by request.
  • Siteimprove can exclude parts of the site from a crawl.
  • By request, we can check the site less frequently than every 5 days.
  • By default, we limit the number of simultaneous crawls running on one account to two at a time.

If you would like any of the above settings changed for a crawl on your website, please contact Siteimprove Support.
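
The first two load-reduction ideas above (MD5-based change detection and adaptive pauses between requests) can be sketched as follows. The class and function names are hypothetical; only the 200 ms default delay and the 20,000 ms cap come from the description above, and the doubling back-off is an assumption for illustration:

```python
import hashlib
import time

def fingerprint(html: str) -> str:
    """MD5 digest of a page body; if it matches the stored value from the
    last crawl, the page is unchanged and a full recheck can be skipped."""
    return hashlib.md5(html.encode("utf-8")).hexdigest()

class PoliteFetcher:
    """Sketch of request throttling: a base delay between HTTP requests,
    increased (up to a cap) when the server appears to be under strain."""
    def __init__(self, base_delay=0.2, max_delay=20.0):
        self.delay = base_delay        # 200 ms default between requests
        self.max_delay = max_delay     # never pause longer than 20,000 ms

    def back_off(self, response_seconds, threshold=2.0):
        """Double the pause if responses are slow, up to the cap.
        (The doubling rule is an assumption, not Siteimprove's algorithm.)"""
        if response_seconds > threshold:
            self.delay = min(self.delay * 2, self.max_delay)

    def wait(self):
        """Sleep for the current delay before the next request."""
        time.sleep(self.delay)
```

A crawler using this pattern would call `fingerprint` on each downloaded page, skip rechecking pages whose digest is unchanged, and call `wait` between requests, feeding observed response times into `back_off`.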

Are all checks performed during the crawl?

No. Many checks are performed after the crawl is complete. The image below can be used as a rough guide to illustrate checks that will typically continue after the crawl has ended.

Crawl_and_check_sequence

What are some typical reasons for a problem with a website crawl?

For information on this, see the article "Typical reasons for crawl problems".
