Can Siteimprove crawl an intranet and other non-public sites?
By Nicolaj Dannemann
The Siteimprove Suite is primarily used on publicly available websites, but under specific conditions it can also be used on internal and non-public websites such as intranets, pre-production sites, and staging sites.
Siteimprove uses web-based crawlers to index and check your websites for errors. This article outlines some of the requirements and considerations when it comes to crawling non-public websites.
Note: Before a non-public website is crawled, customers must sign and return the Siteimprove indemnification statement. If you have not already returned the signed statement, then please do so before you give access to Siteimprove. Reach out to your Siteimprove contact or your account Customer Success Manager (CSM) for a copy of the indemnification statement.
Requirements to check sites behind a login
To use the Siteimprove Content Suite on your non-public websites, you must sign an indemnification statement and also meet the requirements below:
Access to the Website via the Internet
The website must be available over the internet, i.e. not only available on your internal network.
This requires one of the following:
- A subdomain pointing to the website's public IP address
- A hostname paired with a public IP address
- A public IP address that leads directly to the website
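As a rough pre-check before requesting a crawl, you could verify that the site's hostname resolves to a publicly routable address. The sketch below uses only the Python standard library; the function names `is_public_ip` and `hostname_is_publicly_reachable` are illustrative and not part of any Siteimprove tooling. It only inspects DNS results and address ranges, so a firewall may still block the connection.

```python
import ipaddress
import socket


def is_public_ip(address: str) -> bool:
    """Return True if the address is globally routable (not private, loopback, etc.)."""
    # Drop an IPv6 zone identifier such as "%eth0" if one is present.
    return ipaddress.ip_address(address.split("%")[0]).is_global


def hostname_is_publicly_reachable(hostname: str, port: int = 443) -> bool:
    """Resolve a hostname and report whether any resolved address is public.

    This checks DNS and address ranges only; it does not open a connection.
    """
    try:
        infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # The name does not resolve from this machine.
        return False
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return any(is_public_ip(info[4][0]) for info in infos)
```

For example, `is_public_ip("10.0.0.1")` is false because 10.0.0.0/8 is reserved for private networks, which is the typical situation for an intranet that is only reachable internally.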
Login credentials for password-protected websites
Login credentials are required for password-protected websites. Please make sure that the user created for Siteimprove does NOT expire and is NOT subject to a password renewal policy. The user should NOT have access to modify or delete content on the website/intranet.
Supported authentication methods
We support the following authentication methods:
- Basic authentication - Please supply Username, Password, Domain, and Realm
- Windows authentication - Please supply the Username, Password, and login domain.
- Token-based login - We will send a GET request supplying an agreed upon token that will authenticate our crawlers for the session. Please supply the authentication URL and the token/etag you have assigned to Siteimprove.
- Form-based POST request - We support a variety of POST login methods with pre-fetching of dynamic server-side generated variables. Please supply Username, Password, and Login form URL.
- VPN - We support a variety of VPN tunnel protocols. If you are configuring a VPN connection in cooperation with Siteimprove, please read the article "Siteimprove VPN Technical Specifications".
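To illustrate what "pre-fetching of dynamic server-side generated variables" can mean for a form-based POST login, the sketch below reads the hidden inputs (for example a CSRF token) from a login page and merges them with the credentials into a POST body. This is a minimal illustration, not Siteimprove's implementation; the class `HiddenFieldParser`, the function `build_login_body`, and the field names `username` and `password` are assumptions chosen for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs from <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        # Keep only hidden inputs that carry a name; a missing value becomes "".
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def build_login_body(login_page_html, username, password,
                     user_field="username", pass_field="password"):
    """Return a URL-encoded POST body: hidden form fields plus credentials."""
    parser = HiddenFieldParser()
    parser.feed(login_page_html)
    payload = dict(parser.fields)  # server-generated values fetched first
    payload[user_field] = username
    payload[pass_field] = password
    return urlencode(payload)
```

The returned string would be sent as the body of the POST request to the login form URL. Real login forms often differ in field names and may set additional values with JavaScript, which is one reason each configuration has to be tested individually.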
Siteimprove authentication proxy
Siteimprove uses an authentication proxy as an authentication layer. To establish a connection, the login configuration needs to be customized by our technical support staff.
The code will mimic a sequence of actions that closely resembles the interactions performed by a user to establish an authenticated session.
The authentication proxy steps through the HTTP interactions and records the cookies set in the negotiation, then for each subsequent request to the site, the relevant cookies are added to the request.
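The cookie handling described above can be sketched as follows. This is a simplified illustration and not the actual proxy code: the class name `CookieRecorder` is hypothetical, and the sketch ignores cookie attributes such as domain, path, and expiry that a real implementation would need to respect.

```python
from http.cookies import SimpleCookie


class CookieRecorder:
    """Record cookies seen during a login negotiation and replay them later."""

    def __init__(self):
        self._cookies = {}

    def record(self, set_cookie_headers):
        """Store the name/value pair from each Set-Cookie header value."""
        for header in set_cookie_headers:
            parsed = SimpleCookie()
            parsed.load(header)
            for name, morsel in parsed.items():
                # A later cookie with the same name overwrites the earlier value.
                self._cookies[name] = morsel.value

    def cookie_header(self):
        """Build the Cookie header value to attach to a subsequent request."""
        return "; ".join(f"{name}={value}" for name, value in self._cookies.items())
```

After each step of the login sequence, the `Set-Cookie` values from the response would be passed to `record`, and the result of `cookie_header` would be sent as the `Cookie` header on every later request to the site.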
There’s no way of knowing if it is possible to crawl a site behind a login until we have tested the process. We are currently aware of the following constraints:
- We cannot crawl sites that require a Citrix-based login.
- We cannot crawl sites that use multi-factor authentication.
- We are unable to support the use of Captcha.
- Sites that are protected by non-standard (i.e. custom-made) security methods require extra time to configure. In a number of cases, it may not be possible to crawl these sites.
- In some cases, you may need to allow our crawler IP addresses. The default IP address used by our crawler: 188.8.131.52.
- We are unable to support logins that use dynamic session storage.
- We cannot crawl authentication implementations that utilize dynamic local storage values.
Each login scenario is different, and because of the complexities and security restrictions involved, configuring a login can take days, or weeks if it needs to be escalated to our development team.
In some cases, it may not be possible to do a full Accessibility check behind a login.
Allowing IP addresses
In some cases, non-public website domains can be made accessible for certain IP addresses without requiring a login. An allowlist can be configured by your domain administrator. If this is an option for your website, you can find the IP addresses to be allowed in the article "What IP addresses and User agents are used by Siteimprove?".
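Conceptually, an allowlist check compares the address of an incoming request against a list of permitted addresses or ranges. The sketch below shows that logic with the Python standard library. The networks in `ALLOWED_NETWORKS` are placeholder documentation ranges, not Siteimprove's crawler addresses; the real values come from the article referenced above, and in practice the rule is usually configured in a firewall or web server rather than in application code.

```python
import ipaddress

# Placeholder ranges reserved for documentation (RFC 5737).
# Replace with the addresses published by Siteimprove.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.7/32"),
]


def is_allowed(client_ip, networks=ALLOWED_NETWORKS):
    """Return True if client_ip falls inside any allowed network."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in networks)
```

A request from `203.0.113.42` would pass this check, while one from `198.51.100.8` would not, because only the single address `198.51.100.7` is listed in that range.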
If you would like to crawl a site that is not publicly available, is behind a firewall, or requires a login/authentication, please submit a Support ticket to request assistance with adding the site to your subscription.