- What is the Siteimprove crawler?
- How often is my website crawled?
- How would switching from a non-JS crawl to a JS crawl affect my data?
- Where can I see the last crawl and next crawl date?
- Where can I see when a specific page was crawled?
- Can I recheck my site or pages outside of the normal crawl schedule?
- Can I prevent specific sections of my site from being crawled?
- Does the Siteimprove crawler consider "noindex" or "nofollow" when deciding what pages to include?
- How do I cancel the crawl of a website?
- What steps can be taken to reduce unnecessary load on the web server during crawling?
- How long does it take to crawl my entire website?
- Are all checks performed during the crawl?
- What are some typical reasons for a problem with a website crawl?
What is the Siteimprove crawler?
Web crawlers are computer programs that scan the web, ‘reading’ everything they find. A crawler starts by visiting your website, systematically identifies the hyperlinks on every page, and then follows each link to its destination.
Our crawlers scan your website using Siteimprove servers from specific IP addresses with identifiable user agents. Our crawlers use HTTP (Hypertext Transfer Protocol) requests to collect the HTML code on which to carry out error checks.
The data harvested by the crawler is stored in Siteimprove's databases. Based on the content found on each page, information such as accessibility issues, misspellings, and broken links is reported in Siteimprove's online platform.
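The link-following behavior described above can be illustrated with a short sketch. This is not Siteimprove's implementation; it is a minimal breadth-first crawl using only the Python standard library. The names `LinkCollector`, `extract_links`, `crawl`, and the in-memory `PAGES` dictionary are hypothetical, and the pages are served from a dictionary rather than fetched over HTTP so the example runs offline.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the target of every <a href="..."> on a single page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop any "#fragment" part.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


def extract_links(base_url, html):
    """Return the absolute URLs of all hyperlinks found in html."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


def crawl(start_url, fetch):
    """Breadth-first crawl. fetch(url) returns HTML text, or None on failure.

    Returns (internal pages seen, external links found).
    """
    site = urlparse(start_url).netloc
    seen = {start_url}
    external = set()
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        for link in extract_links(url, html):
            if urlparse(link).netloc != site:
                external.add(link)      # recorded, but not followed
            elif link not in seen:
                seen.add(link)
                queue.append(link)
    return seen, external


# Hypothetical three-page site held in memory instead of fetched over HTTP.
PAGES = {
    "https://example.com/": '<a href="/about">About</a> <a href="https://other.example/">Other</a>',
    "https://example.com/about": '<a href="/">Home</a> <a href="/contact#form">Contact</a>',
    "https://example.com/contact": "<p>No links here</p>",
}
```

Calling `crawl("https://example.com/", PAGES.get)` visits the three internal pages and records the one external link without following it. A production crawler would also handle redirects, timeouts, and request pacing.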
Learn more about the Siteimprove crawler and how it identifies broken links.
How often is my website crawled?
By default, our servers crawl your website every 5 days. Between the 5-day crawls, we carry out periodic re-checks of broken links and pages with broken links, if the content has changed.
It is possible to change this schedule. A more frequent crawl will mean an extra cost. If you would like to change the frequency of your crawl, please contact Siteimprove.
How would switching from a non-JS crawl to a JS crawl affect my data?
Where can I see the last crawl and next crawl date?
On the Quality Assurance (QA), Accessibility and SEO Overview pages, you can see both the last crawl date and the next scheduled crawl date.
Note: If the "next crawl date" shown in the platform is a date in the past, this means that the site crawl is queued because several crawls are running simultaneously on your account. Usually, this situation resolves itself after a short period of time.
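The relationship between the two dates can be expressed as simple date arithmetic. The sketch below assumes that the next crawl is scheduled exactly one interval after the last one, which the platform may not do precisely; `next_crawl_date` and `is_queued` are hypothetical helper names, not part of any Siteimprove API.

```python
from datetime import date, timedelta

CRAWL_INTERVAL_DAYS = 5  # default crawl frequency stated in this article


def next_crawl_date(last_crawl, interval_days=CRAWL_INTERVAL_DAYS):
    """Assumed rule: the next crawl is due one interval after the last crawl."""
    return last_crawl + timedelta(days=interval_days)


def is_queued(next_crawl, today):
    """A next crawl date that is already in the past indicates a queued crawl."""
    return next_crawl < today
```

For example, a site last crawled on 1 March would be due again on 6 March; if the platform still shows 6 March on 8 March, the crawl is waiting in the queue.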
Where can I see when a specific page was crawled?
At the top of the Page Report menu, you can see the date and time the page was last checked.
Can I recheck my site or pages outside of the normal crawl schedule?
Yes, it is possible to initiate a recheck at the following levels:
- Single page
- Multiple pages
- Group of pages
- Entire site
Note: Crawl duration varies depending on the number of pages on your site and the number of sites on your account crawling simultaneously.
Can I prevent specific sections of my site from being crawled?
Yes. You can set up exclusions to tell our crawler not to check certain sections of your website.
Does the Siteimprove crawler consider "noindex" or "nofollow" when deciding what pages to include?
No, our crawler does not consider "noindex" or "nofollow" when determining what content to crawl. We crawl based on your crawl settings. See also "Aliases and exclusions: How to add and remove content from a crawl".
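To make the distinction concrete, here is a minimal sketch of a crawl decision driven by exclusion settings alone. The exclusion format shown (URL path prefixes) and the names `EXCLUDED_PREFIXES` and `should_crawl` are assumptions for illustration only; the actual exclusion options are described in the article referenced above.

```python
from urllib.parse import urlparse

# Hypothetical exclusion settings: sections of the site to leave out of the crawl.
EXCLUDED_PREFIXES = ("/archive/", "/internal/")


def should_crawl(url, robots_meta="", excluded_prefixes=EXCLUDED_PREFIXES):
    """Decide whether a URL is crawled, based only on the exclusion settings.

    robots_meta (for example "noindex, nofollow") is accepted purely to show
    that it plays no part in the decision.
    """
    path = urlparse(url).path
    return not any(path.startswith(prefix) for prefix in excluded_prefixes)
```

With these settings, a page at `/news/` marked "noindex, nofollow" would still be crawled, while any page under `/archive/` would be skipped regardless of its meta tags.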
How do I cancel the crawl of a website?
To cancel or stop a crawl on a website, please contact the Siteimprove technical support team with details of the site account and URL.
What steps can be taken to reduce unnecessary load on the web server during crawling?
- Siteimprove uses intelligent algorithms and looks at several parameters to determine when and what to re-check. For example, we use an MD5 key to determine if the page has changed; if the page has not changed, there is no need for a recheck.
- The default delay between HTTP requests is 200 milliseconds. Pauses of up to 20,000 milliseconds between requests are added automatically if we suspect the crawler is affecting the site's performance.
- If necessary, pauses between HTTP requests can be added manually by Siteimprove.
- We automatically stop crawling the site if we get several timeouts or if we notice internal errors from the web server.
- Siteimprove can limit the crawl by page number, page level, or by the number of links.
- The crawl can be configured to start at a particular time/day by request.
- Siteimprove can exclude parts of the site from a crawl.
- By request, we can check the site less frequently than every 5 days.
- By default, we limit the number of simultaneous crawls running on one account to two at a time.
If you would like any of the above settings changed for a crawl on your website, please contact Siteimprove Support.
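Two of the measures listed above, MD5-based change detection and automatic pauses between requests, can be sketched as follows. Only the 200 ms default and the 20,000 ms ceiling come from this article. The doubling policy, the `slow_threshold_ms` parameter, and all function names are assumptions made for illustration and do not describe how Siteimprove actually decides when to slow down.

```python
import hashlib

DEFAULT_DELAY_MS = 200   # default delay between HTTP requests (from this article)
MAX_DELAY_MS = 20_000    # longest automatic pause (from this article)


def content_fingerprint(html):
    """Return the MD5 hex digest of the page content."""
    return hashlib.md5(html.encode("utf-8")).hexdigest()


def needs_recheck(previous_fingerprint, html):
    """A page needs a recheck only if its fingerprint has changed."""
    return previous_fingerprint != content_fingerprint(html)


def next_delay_ms(current_delay_ms, response_ms, slow_threshold_ms=2_000):
    """Assumed back-off policy: double the pause while the server responds
    slowly, capped at MAX_DELAY_MS; return to the default once it recovers."""
    if response_ms > slow_threshold_ms:
        return min(current_delay_ms * 2, MAX_DELAY_MS)
    return DEFAULT_DELAY_MS
```

Storing the fingerprint from the previous crawl lets unchanged pages be skipped, and the capped back-off keeps the pause between requests within the 200 to 20,000 millisecond range described above.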
How long does it take to crawl my entire website?
Below you can see estimated times for a site to be crawled in relation to the number of pages on the site.
Note: These are approximations. As each site is unique, crawl times can be longer or shorter. Crawls are queued and in some cases may not start until several hours after the crawl is ordered.
|Pages|Time approximation for a crawl to complete (avg.)|
Are all checks performed during the crawl?
No. Many checks are performed after the crawl is complete. The image below can be used as a rough guide to illustrate checks that will typically continue after the crawl has ended.
What are some typical reasons for a problem with a website crawl?
For information on this see the article "Typical reasons for crawl problems".