
Why are some PDFs not being checked by Siteimprove?

Sean Needham


Customers who subscribe to the Quality Assurance and/or the Accessibility module can have PDFs checked for broken links and accessibility issues.

If you find that some PDFs on your website are not being checked by Siteimprove, then we recommend you consider the following:

Note: Only Administrators and Account Owners can edit crawl settings.

Review PDFs to be checked settings

Are the PDFs included in the URL match?

Review the URL match used to identify PDFs that will be checked.

  1. Go to Settings > Content > Crawl settings
  2. Select the site that you are investigating
  3. Select the tab “PDF Content check”

This tab shows the configuration options, links to ‘PDFs to be checked’ and ‘PDFs NOT being checked’, and the URL matches for the PDFs included in the check. In the configuration section, you can add an extra URL match to include PDFs that aren't currently being checked.

  1. To include all PDFs that the crawler has found but that are not included in checks, enter “/” in the match field.
  2. Click ‘Add match’.

PDF check configuration within crawl settings
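
To see why a match of “/” captures every PDF, here is a minimal sketch. It assumes a URL match behaves like a simple "contains" test, which is an assumption made for illustration and not a description of how Siteimprove evaluates matches internally. The function name `pdfs_included` and the example.com URLs are hypothetical.

```python
def pdfs_included(pdf_urls, matches):
    """Return the PDF URLs that contain at least one of the match strings.

    Assumption: a URL match is a plain substring test. The real matching
    rules in the platform may differ.
    """
    return [url for url in pdf_urls if any(m in url for m in matches)]


# Hypothetical PDF URLs found by a crawl.
pdf_urls = [
    "https://www.example.com/docs/report.pdf",
    "https://www.example.com/media/brochure.pdf",
]

# A narrow match only picks up PDFs under /docs/.
print(pdfs_included(pdf_urls, ["/docs/"]))

# Every URL contains "/", so this match picks up all PDFs.
print(pdfs_included(pdf_urls, ["/"]))
```

Under that assumption, a narrow match such as “/docs/” leaves out PDFs stored elsewhere, while “/” includes all of them.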

Are the PDFs seen as internal?

Only PDFs seen as internal to your website will be checked.
To check that a PDF is seen as internal, go to Quality Assurance > Inventory > Documents.
PDFs listed as external are normally hosted on an external domain and will not be checked. If these PDFs should be seen as internal, you can edit the crawl settings. See “Aliases and exclusions: How to add and remove content from a crawl”.
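
If you have a list of PDF URLs, you can do a rough first pass yourself before opening the inventory. The sketch below compares each URL's hostname with the hostnames you consider part of your site. The helper `is_internal` and the example hostnames are hypothetical, and the comparison is an assumption about what "internal" means here. The Documents inventory remains the authoritative view.

```python
from urllib.parse import urlsplit


def is_internal(pdf_url, internal_hosts):
    """Return True if the URL's hostname is one of the given site hostnames.

    Assumption: "internal" simply means "hosted on one of these hostnames".
    """
    host = urlsplit(pdf_url).hostname or ""
    return host in {h.lower() for h in internal_hosts}


# Hypothetical hostnames that make up the site, including an alias.
site_hosts = ["www.example.com", "docs.example.com"]

print(is_internal("https://www.example.com/files/guide.pdf", site_hosts))
print(is_internal("https://cdn.example.net/files/guide.pdf", site_hosts))
```

A PDF hosted on a hostname that is not in the list, such as a separate CDN domain, would show up as external in this sketch.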

Exclusions/alias settings

PDFs may also be removed from the check by exclusion and alias settings. Review the URL matches added and compare them with the PDF URLs. 

  1. Go to Settings > Content > Crawl settings
  2. Select the site that you are investigating
  3. Click on the Exclude or Alias tab to review the settings
  4. To edit these settings, please refer to the article “Aliases and exclusions: How to add and remove content from a crawl”

Note: Changing these settings can result in a variation in the number of pages being checked on your website.

Other limitations

There can be other reasons why PDFs are not checked, for example:

  • PDF checking must be included in your subscription, and you must be within the number of documents your subscription allows. This information is available in your Siteimprove agreement.
  • PDFs over 20 MB will not be checked.
  • PDFs must be served with the correct MIME type. PDFs that do not identify themselves with the value 'application/pdf' will not be checked.
  • Your website's robots.txt file may direct our crawler not to check a section of the website.
  • PDFs that are blocked by a firewall or sit behind authentication will not be checked.
  • PDFs that are images will not be checked.
  • A PDF will not be crawled if the link to the PDF is broken.
  • If your site is set up to crawl your XML sitemap only, and the PDFs are not listed as links in that sitemap, then the PDFs will not be checked.
  • If the link to a PDF is only found via a page that is inserted into your site using a Single Page Check, then the PDF will not be crawled. This is because the Single Page Check analyzes the HTML of the page and checks all available links on it, but does not crawl any further than that. Single Page Checks are typically inserted via the CMS Plugin, Siteimprove Ads, an Integration (Marketing Automation), or directly via the Siteimprove platform.
  • If the PDF's level in the website structure exceeds the max page level setting, it will not be checked. The default is 50 levels. Contact Siteimprove to have this increased if required.
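
Some of the limitations above (file size, MIME type, robots.txt) can be tested outside the platform. The following is a minimal sketch using only the Python standard library. It assumes you have already collected a PDF's HTTP response headers and the text of your robots.txt file. The function names, the user agent "ExampleBot", and the reading of "20 MB" as 20 × 1024 × 1024 bytes are all assumptions for illustration. Substitute the crawler user agent that applies to your account.

```python
from urllib.robotparser import RobotFileParser

# Assumption: "20 MB" is interpreted as 20 * 1024 * 1024 bytes.
MAX_PDF_BYTES = 20 * 1024 * 1024


def header_problems(headers):
    """Return a list of reasons why a PDF response might be skipped."""
    lowered = {k.lower(): v for k, v in headers.items()}
    problems = []

    # Ignore parameters such as "; charset=..." after the MIME type.
    content_type = lowered.get("content-type", "").split(";")[0].strip().lower()
    if content_type != "application/pdf":
        problems.append(f"MIME type is {content_type or 'missing'}, not application/pdf")

    length = lowered.get("content-length", "")
    if length.isdigit() and int(length) > MAX_PDF_BYTES:
        problems.append(f"file is {int(length)} bytes, over the 20 MB limit")

    return problems


def blocked_by_robots(robots_txt, user_agent, url):
    """Return True if the robots.txt text disallows the URL for the user agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content and response headers.
robots_txt = "User-agent: *\nDisallow: /private/\n"
headers = {"Content-Type": "application/octet-stream", "Content-Length": "26214400"}

print(header_problems(headers))
print(blocked_by_robots(robots_txt, "ExampleBot", "https://www.example.com/private/a.pdf"))
```

An empty list from `header_problems` and `False` from `blocked_by_robots` only rule out these three causes. The other limitations in the list still need to be checked in the platform.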

If you have any questions regarding this, please contact Siteimprove technical support.
