
Siteimprove's Crawler: Frequently Asked Questions

By Sean Needham
  • What is the Siteimprove crawler?
    Web crawlers are computer programs that scan the web, ‘reading’ everything they find. The crawler starts by visiting your website and systematically identifying all hyperlinks on all pages.

    Our crawlers scan your website using Siteimprove servers from specific IP addresses with identifiable user agents. Our crawlers use HTTP (Hypertext Transfer Protocol) requests to collect the HTML code on which to carry out error checks.

    The data harvested by the crawler is stored in Siteimprove's databases. Based on the content found on each page, issues such as accessibility problems, misspellings, and broken links are reported to Siteimprove's online platform.
    Learn more about the Siteimprove crawler and how it identifies broken links.
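
To illustrate the link-discovery step described above, here is a minimal sketch using only the Python standard library. It is not Siteimprove's implementation: the class name `LinkCollector`, the function `extract_links`, and the `example.com` URLs are hypothetical, chosen for illustration. It shows how the hyperlinks on one fetched page can be identified and resolved to absolute URLs.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop the #fragment part,
                # since fragments point into the same page.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                if absolute not in self.links:
                    self.links.append(absolute)


def extract_links(base_url, html):
    """Return the unique absolute link targets found in one HTML page."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# Hypothetical sample page; a real crawler would fetch the HTML over HTTP.
sample = '<p><a href="/about">About</a> <a href="contact.html#form">Contact</a></p>'
print(extract_links("https://example.com/index.html", sample))
```

A crawler would repeat this for every newly discovered URL on the same site until no unvisited pages remain.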

  • How often is my website crawled?
    By default, our servers crawl your website every 5 days. Between these 5-day crawls, we periodically re-check broken links and pages containing broken links if their content has changed.

    It is possible to change this schedule. A more frequent crawl will mean an extra cost. If you would like to change the frequency of your crawl, please contact Siteimprove.

  • Where can I see the last crawl and next crawl date?
    At the bottom of the Quality Assurance summary page, Accessibility summary page or Dashboard, you can see both the last crawl date and the next scheduled crawl date.

    Screenshot with last crawl date and next scheduled crawl date

  • Where can I see when a specific page was crawled?
    At the top of the Page Report menu you can see the date and time that specific page was last checked.

    Page Report menu with date and time highlighted

  • Can I recheck my site or pages outside of the normal crawl schedule?
    Yes, it is possible to initiate a recheck at the following levels:
    • Single page
    • Multiple pages
    • Group of pages
    • Entire site

Learn more on how to re-crawl your pages, groups and sites.

Note: Crawl duration varies depending on the number of pages on your site and the number of sites on your account being crawled simultaneously.

  • Can I prevent specific sections of my site from being crawled?
    Yes. You can set up exclusions to tell our crawler not to check certain sections of your website.

  • What steps can be taken to reduce unnecessary load on the web server during crawling?
    • Siteimprove uses intelligent algorithms and looks at several parameters to determine when and what to re-check. For example, we use an MD5 key to determine whether a page has changed; if the page has not changed, there is no need for a recheck.

    • The default delay between HTTP requests is 200 milliseconds. Pauses of up to 2,000 milliseconds between requests are added automatically if we suspect the crawler is affecting the site's performance.

    • If necessary, pauses between HTTP requests can be added manually by Siteimprove.

    • We automatically stop crawling the site if we get several timeouts or if we notice internal errors from the website's server.

    • Siteimprove can limit the crawl by page number, page level, or by the number of links.

    • The crawl can be configured to start at a particular time/day by request.

    • Siteimprove can exclude parts of the site from a crawl.

    • By request, we can check the site less frequently than every 5 days.

    • By default, we limit the number of simultaneous crawls running on one account to two at a time.

If you would like any of the above settings changed for a crawl on your website, please contact Siteimprove Support.
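
Two of the measures listed above, MD5-based change detection and the automatic pauses between requests, can be sketched as follows. This is an illustration under stated assumptions, not Siteimprove's actual logic: the function names are hypothetical, and the doubling back-off rule in `next_delay_ms` is an assumption, since the text gives only the 200 millisecond default and the 2,000 millisecond upper bound.

```python
import hashlib

DEFAULT_DELAY_MS = 200   # default delay between HTTP requests (from the list above)
MAX_DELAY_MS = 2000      # upper bound on automatically added pauses (from the list above)


def page_fingerprint(html):
    """Return the MD5 hex digest of the page content."""
    return hashlib.md5(html.encode("utf-8")).hexdigest()


def needs_recheck(html, previous_fingerprint):
    """A page needs a recheck only if its fingerprint differs from the stored one."""
    return page_fingerprint(html) != previous_fingerprint


def next_delay_ms(current_delay_ms, site_seems_slow):
    """Assumed back-off rule: double the delay while the site seems slow,
    capped at MAX_DELAY_MS; return to the default once it recovers."""
    if site_seems_slow:
        return min(current_delay_ms * 2, MAX_DELAY_MS)
    return DEFAULT_DELAY_MS


stored = page_fingerprint("<h1>Hello</h1>")
print(needs_recheck("<h1>Hello</h1>", stored))   # unchanged content
print(needs_recheck("<h1>Hello!</h1>", stored))  # edited content
```

Comparing fingerprints rather than full page bodies means only a short hash per page has to be stored between crawls.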

  • How long does it take to crawl my entire website?
    Below you can see estimated times for a site to be crawled in relation to the number of pages on the site.

    Note:
    These are approximations. As each site is unique, crawl times can be longer or shorter. Crawls are queued and in some cases may not start until several hours after the crawl is ordered.
    Pages      Approximate time for a crawl to complete (avg.)
    1,000      45 minutes
    2,500      1.5 hours
    5,000      3 hours
    10,000     6.5 hours
    20,000     14 hours
    50,000     28 hours
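
For page counts that fall between the rows above, a rough figure can be obtained by interpolating between neighboring rows. The sketch below assumes linear interpolation, which the article does not state, and `estimate_crawl_hours` is a hypothetical helper name. The result is as approximate as the table itself and does not account for queueing time.

```python
from bisect import bisect_left

# (pages, hours) pairs copied from the table above; 45 minutes = 0.75 hours.
CRAWL_TIME_TABLE = [
    (1_000, 0.75),
    (2_500, 1.5),
    (5_000, 3.0),
    (10_000, 6.5),
    (20_000, 14.0),
    (50_000, 28.0),
]


def estimate_crawl_hours(pages):
    """Linearly interpolate between table rows; refuse to extrapolate."""
    sizes = [p for p, _ in CRAWL_TIME_TABLE]
    if not sizes[0] <= pages <= sizes[-1]:
        raise ValueError("page count is outside the range covered by the table")
    i = bisect_left(sizes, pages)
    p_hi, h_hi = CRAWL_TIME_TABLE[i]
    if p_hi == pages:
        return h_hi  # exact table row
    p_lo, h_lo = CRAWL_TIME_TABLE[i - 1]
    return h_lo + (h_hi - h_lo) * (pages - p_lo) / (p_hi - p_lo)


print(estimate_crawl_hours(7_500))  # 4.75, halfway between the 5,000 and 10,000 rows
```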


Additional resources:
- Typical reasons for crawl problems
