Can Siteimprove crawl an intranet and other non-public sites?

Modified on: Mon, 13 Feb, 2023 at 9:36 PM

The Siteimprove Suite is primarily used on publicly available websites, but under specific conditions it can also be used on internal and non-public websites such as intranets, pre-production sites, and staging sites.

Siteimprove uses web-based crawlers to index your websites and check them for errors. This article outlines the requirements and considerations for crawling non-public websites.

Requirements to check sites behind a login
Siteimprove authentication proxy
Allowing IP addresses

Note: Before a non-public website is crawled, customers must sign and return the Siteimprove "non-public website statement" agreement. If you have not already returned the signed statement, please do so before giving Siteimprove access. Reach out to your Siteimprove contact or your account Customer Success Executive (CSE) for a copy of the "non-public website statement".

Requirements to check sites behind a login

To use the Siteimprove Content Suite on your non-public websites, you must sign the "non-public website statement" and also meet the requirements below:

Access to the Website via the Internet

The website must be reachable over the internet, i.e. it cannot be available only on your internal network.
This requires one of the following:

A subdomain pointing to the website's public IP address
A hostname paired with a public IP address
A public IP address that leads directly to the website
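As a rough pre-check of the requirement above, the following Python sketch tests whether an address is publicly routable and whether a hostname resolves only to such addresses. It is an illustration, not part of Siteimprove's tooling; the function names are hypothetical, and a public address alone does not guarantee that a firewall lets the crawler through.

```python
import ipaddress
import socket


def is_public_address(address: str) -> bool:
    """Return True if the IP address is globally routable.

    Private (e.g. 10.0.0.0/8), loopback, and link-local addresses return False.
    """
    return ipaddress.ip_address(address).is_global


def resolves_to_public_address(hostname: str) -> bool:
    """Resolve a hostname via DNS and check that every resolved address is public."""
    infos = socket.getaddrinfo(hostname, None)
    # Each entry's sockaddr starts with the IP string; drop any IPv6 scope id.
    addresses = {info[4][0].split("%")[0] for info in infos}
    return bool(addresses) and all(is_public_address(a) for a in addresses)


# An internal-only address fails the check.
print(is_public_address("10.0.0.5"))  # False
```

To check a real site, call resolves_to_public_address with the hostname you plan to give to Siteimprove, from a machine outside your internal network so that internal DNS answers do not skew the result.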

Login credentials for password-protected websites

Login credentials are required for password-protected websites. Please make sure that the user created for Siteimprove does NOT expire and is NOT subject to a password renewal policy. The user should NOT have access to modify or delete content on the website/intranet.

Please note that, for security reasons, Siteimprove does NOT provide email addresses for this purpose.

Supported authentication methods

We support the following authentication methods:

Basic authentication - Please supply Username, Password, and Domain
Form-based POST request - We support a variety of POST login methods with pre-fetching of dynamic server-side generated variables. Please supply Username, Password, and Login form URL
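To make the two methods concrete, here is a minimal Python sketch of what each involves at the HTTP level. It is not Siteimprove's implementation: the form field names username and password, the hidden field csrf_token, and the helper names are all assumptions chosen for illustration, and the Domain value requested for Basic authentication is not used here.

```python
import base64
from html.parser import HTMLParser
from urllib.parse import urlencode


def basic_auth_header(username: str, password: str) -> str:
    """Build the value of an HTTP Basic "Authorization" header."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


class HiddenFieldParser(HTMLParser):
    """Collect hidden <input> fields (e.g. server-generated tokens) from a login page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "input" and attributes.get("type") == "hidden" and attributes.get("name"):
            self.fields[attributes["name"]] = attributes.get("value") or ""


def build_login_body(login_page_html: str, username: str, password: str) -> str:
    """Pre-fetch dynamic hidden variables, then build the form-encoded POST body."""
    parser = HiddenFieldParser()
    parser.feed(login_page_html)
    form = dict(parser.fields)
    form["username"] = username  # hypothetical field name; real forms vary
    form["password"] = password  # hypothetical field name; real forms vary
    return urlencode(form)


# Hypothetical login page containing one server-generated token.
page = '<form method="post"><input type="hidden" name="csrf_token" value="abc123"></form>'
print(build_login_body(page, "crawler", "s3cret"))
```

The body returned by build_login_body would be sent as a POST to the login form URL. Real login forms differ in field names and in how tokens are issued, which is one reason each login has to be configured individually.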

Siteimprove authentication proxy

Siteimprove uses an authentication proxy as an authentication layer. To establish a connection, our technical support staff must customize the login configuration for your site.

The customized login code mimics the sequence of interactions a user performs to establish an authenticated session.

The authentication proxy steps through the HTTP interactions and records the cookies set during the negotiation. For each subsequent request to the site, it adds the relevant cookies to the request.
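The record-and-replay behavior described above can be sketched in Python as follows. This is a simplified illustration rather than the actual proxy: the class name CookieRecorder is hypothetical, and the sketch ignores cookie domain, path, and expiry rules that a real proxy would need to honor when choosing the "relevant" cookies.

```python
from http.cookies import SimpleCookie


class CookieRecorder:
    """Record cookies set during a login negotiation and replay them later."""

    def __init__(self):
        self._cookies = {}

    def record(self, response_headers):
        """Store every cookie found in Set-Cookie headers.

        response_headers is a list of (name, value) tuples from one HTTP response.
        """
        for name, value in response_headers:
            if name.lower() == "set-cookie":
                parsed = SimpleCookie()
                parsed.load(value)
                for key, morsel in parsed.items():
                    self._cookies[key] = morsel.value

    def apply(self, request_headers):
        """Return a copy of the request headers with the recorded cookies added."""
        headers = dict(request_headers)
        if self._cookies:
            headers["Cookie"] = "; ".join(f"{k}={v}" for k, v in self._cookies.items())
        return headers


recorder = CookieRecorder()
# Step 1 of a hypothetical login: the server issues a session cookie.
recorder.record([("Set-Cookie", "session=abc123; Path=/; HttpOnly")])
# Every later crawl request carries that cookie.
print(recorder.apply({"Accept": "text/html"}))
```

In this model the login steps run once, each response is passed to record, and every subsequent crawl request goes through apply so the site sees an authenticated session.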

Considerations

We cannot know whether a site behind a login can be crawled until we have tested the process.

However, we are currently aware of the following constraints:

We cannot configure logins for sites using Citrix or Google as their login providers
We cannot crawl sites that use multifactor authentication, also known as 2FA or MFA
We cannot route the crawler's traffic over a site-to-site VPN tunnel
We are unable to support the use of CAPTCHA
Sites that are protected by non-standard (i.e. custom-made) security methods require extra time to configure. In some cases, it may not be possible to crawl these sites
In some cases, you may need to allow our crawler IP addresses

Each login scenario is different. Because of the complexities and security restrictions involved, configuring a login can take days, or weeks if it needs to be escalated to our development team.

In some cases, it may not be possible to do a full Accessibility check behind a login.
For example, assets such as style sheets, JavaScript, and images that sit behind a login can cause inaccuracies in the accessibility results. This depends on the type of login and the login configuration on the specific website. To find out more, or if you have any questions, please contact Siteimprove technical support.

Allowing IP addresses

In some cases, non-public website domains can be made accessible for certain IP addresses without requiring a login. An allow list can be configured by your domain administrator. If this is an option for your website, you can find the IP addresses to be allowed in the article "What IP addresses and User agents are used by Siteimprove?".
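The logic of an allow list can be sketched in Python as shown below. The networks listed are reserved documentation ranges used purely as placeholders, not Siteimprove's crawler addresses; take the real values from the article named above. The function name is_allowed is hypothetical, and in practice the check is usually enforced in a firewall or web server configuration rather than in application code.

```python
import ipaddress

# Placeholder networks (documentation ranges), NOT real Siteimprove crawler IPs.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.17/32"),
]


def is_allowed(client_ip: str, networks=ALLOWED_NETWORKS) -> bool:
    """Return True if the client IP falls inside any allowed network."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in networks)


print(is_allowed("203.0.113.42"))   # True: inside the /24 range
print(is_allowed("198.51.100.18"))  # False: only .17 is listed
```

Requests from addresses on the list would be served without a login, and all other requests would still be challenged or blocked.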

If you would like to crawl a site that is not publicly available, is behind a firewall, or requires a login/authentication, please submit a support ticket to request assistance adding the site to your subscription.
