The Siteimprove Suite is primarily used on publicly available websites, but under specific conditions it can also be used on internal and non-public websites such as intranets, pre-production sites, and staging sites.
Siteimprove uses web-based crawlers to index your websites and check them for errors. This article outlines some of the requirements and considerations for crawling non-public websites.
- Requirements to check sites behind a login
- Whitelisting of IP addresses
Note: Before a non-public website is crawled, customers must sign and return the Siteimprove indemnification statement. If you have not already returned the signed statement, then please do so before you give access to Siteimprove. Reach out to your Siteimprove contact or technical support for a copy of the indemnification statement.
Requirements to check sites behind a login
To use the Siteimprove Content Suite on your non-public websites, you must sign the indemnification statement and also meet the requirements below:
Access to the Website via the Internet
The website must be reachable over the internet, i.e. it cannot be available only on your internal network.
This requires one of the following:
- A subdomain pointing to the website's public IP address
- A hostname paired with a public IP address
- A public IP address that leads directly to the website
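As a rough pre-check before requesting a crawl, you can verify that the address your site resolves to is actually public. The following is a minimal sketch using only the Python standard library; the function names are hypothetical and are not part of any Siteimprove tooling.

```python
import ipaddress
import socket


def is_public_address(address: str) -> bool:
    """Return True if the IP address is globally routable.

    Private ranges (e.g. 10.0.0.0/8, 192.168.0.0/16), loopback and
    link-local addresses all return False.
    """
    return ipaddress.ip_address(address).is_global


def resolves_to_public_address(hostname: str) -> bool:
    """Return True if at least one DNS result for the hostname is public.

    Requires working DNS resolution on the machine running the check.
    """
    results = socket.getaddrinfo(hostname, None)
    # Each result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address. Drop any IPv6 zone id ("%eth0").
    addresses = {info[4][0].split("%")[0] for info in results}
    return any(is_public_address(addr) for addr in addresses)
```

For example, `is_public_address("192.168.1.10")` is false, which would indicate that the site is only reachable on an internal network. A public address is necessary but not sufficient: a firewall may still block access.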
Login Credentials for Password Protected Websites
Login credentials are required for password-protected websites. Please make sure that the user account created for Siteimprove does NOT expire and is NOT subject to a password renewal policy. The user should NOT have permission to modify or delete content on the website/intranet.
Supported Authentication Methods
We support the following authentication methods:
- Basic Authentication - Please supply the username, password, domain, and realm.
- Windows Authentication - Please supply the username, password, and login domain. Some types of Windows Authentication are not compatible with our Perl-based crawlers.
- Token-based Login - We will send a GET request supplying an agreed-upon token that authenticates our crawlers for the session. Please supply the authentication URL and the token/etag you have assigned to Siteimprove.
- Form-based POST Request - We support a variety of POST login methods, including pre-fetching of dynamic, server-side generated variables. Please supply the username, password, and login form URL.
- Authentication proxy - Used for sites that require single sign-on (SSO) authentication and for sites where the negotiation to establish a session is dynamic. The proxy's code mimics a sequence of actions that closely resembles what a user does to establish an authenticated session: it steps through the HTTP interactions and records the cookies set during the negotiation, then adds the relevant cookies to each subsequent request to the site.
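To illustrate two of the ideas above, pre-fetching dynamic server-side variables from a login form and replaying recorded cookies on later requests, here is a minimal sketch in Python. It is not Siteimprove's implementation; the class and function names are hypothetical, and it assumes the dynamic variables are delivered as hidden form inputs.

```python
from html.parser import HTMLParser
from http.cookies import SimpleCookie


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs from <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if is_hidden and attributes.get("name"):
            self.fields[attributes["name"]] = attributes.get("value") or ""


def prefetch_hidden_fields(login_page_html: str) -> dict:
    """Return the hidden inputs (e.g. a CSRF token) found in a login page."""
    parser = HiddenFieldParser()
    parser.feed(login_page_html)
    return parser.fields


class CookieRecorder:
    """Record cookies set during login and replay them on later requests."""

    def __init__(self):
        self.cookies = {}

    def record(self, set_cookie_headers):
        """Store the name/value pair from each Set-Cookie header value."""
        for header in set_cookie_headers:
            parsed = SimpleCookie()
            parsed.load(header)
            for name, morsel in parsed.items():
                self.cookies[name] = morsel.value

    def cookie_header(self) -> str:
        """Build the Cookie header value for a subsequent request."""
        return "; ".join(f"{n}={v}" for n, v in self.cookies.items())
```

In use, the hidden fields would be merged with the username and password before the login POST, and `cookie_header()` would be attached to every request made after login. Real login flows often involve redirects and cookie scoping rules (path, domain, expiry) that this sketch ignores.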
There’s no way of knowing if it is possible to crawl a site behind a login until we have tested the process. We are currently aware of the following constraints:
- We cannot crawl sites that require a Citrix/VPN based login.
- We cannot crawl sites that use multi-factor authentication.
- We are unable to support the use of CAPTCHA.
- Sites that are protected by non-standard (i.e. custom-made) security methods require extra time to configure. In a number of cases, it may not be possible to crawl these sites.
- In some cases, you may need to allow (whitelist) our crawler's IP addresses. The default IP address used by our crawler: 18.104.22.168.
- We are unable to support logins that use dynamic session storage.
Each login scenario is different. Due to the complexities and security restrictions involved, configuring a login can take days, or weeks if it needs to be escalated to our development team.
In some cases, it may not be possible to do a full Accessibility check behind a login.
Whitelisting of IP addresses
In some cases, non-public website domains can be made accessible for certain IP addresses without requiring a login. This is called IP Whitelisting and can be configured by your domain administrator. If this is an option for your website, you can find the IP addresses to be whitelisted in the article "What IP addresses and User agents are used by Siteimprove?".
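The check an IP whitelist performs can be sketched as follows. This is an illustration only: in practice the rule is usually configured in your firewall or web server rather than in application code. The function name is hypothetical, and the addresses are reserved documentation ranges used as placeholders, not Siteimprove's crawler addresses.

```python
import ipaddress
from typing import Iterable


def is_whitelisted(client_ip: str, whitelist: Iterable[str]) -> bool:
    """Return True if client_ip matches any whitelist entry.

    Entries may be single addresses ("203.0.113.10") or CIDR ranges
    ("198.51.100.0/24").
    """
    address = ipaddress.ip_address(client_ip)
    return any(
        address in ipaddress.ip_network(entry, strict=False)
        for entry in whitelist
    )


# Placeholder entries; replace with the addresses listed in the
# Siteimprove article on IP addresses and user agents.
EXAMPLE_WHITELIST = ["203.0.113.10", "198.51.100.0/24"]
```

With the placeholder list, `is_whitelisted("198.51.100.77", EXAMPLE_WHITELIST)` is true because the address falls inside the listed range, while `is_whitelisted("192.0.2.1", EXAMPLE_WHITELIST)` is false.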
If you would like to crawl a site that is not publicly available, is behind a firewall, or requires a login/authentication, please submit a support ticket to request assistance with adding the site to your subscription.