
Aliases and exclusions: How to add and remove content from a crawl

Sean Needham


Aliases and exclusions let you specify which domains, folders or pages should be included in or excluded from your website crawl.

Note: Only Administrators or Account owners can add/edit aliases and exclusions.

Aliases and exclusions are configured under Settings > Content > Crawl Settings.

What is an exclusion?

An exclusion is a method of specifying what pages should not be crawled using a URL match (e.g. an exclusion of /archive/ would let the crawler know to skip over any page with a URL containing "/archive/").

Matching pages will not be checked for broken links, misspellings, accessibility or SEO issues. They will not be included in the Site Inventory.

Example reasons to add an exclusion: 

  • The URLs (pages/links) should not be checked, e.g. Archive.
  • You have duplicate pages on your website already being checked, e.g. ?sort=ascen.
  • Your website contains anchor links (e.g. domain.com/page1/#contact, domain.com/page2/#contact and domain.com/page3/#contact) that are not real pages but are seen as duplicates. In this case the exclusion would be /#contact.
  • You have a large number of links leading to URLs with the same pattern (for example, different intranet pages) that our crawler cannot access and that are therefore seen as broken links (403 Forbidden).

Note: When setting up exclusions only a partial match on the link is needed. A match of "/archive/" will apply to all links and pages containing "/archive/". 
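The partial-match behaviour described in the note can be pictured as a plain substring check. The following is a minimal sketch, not the crawler's actual implementation: the function name `is_excluded` and the sample URLs are hypothetical, and it assumes matching is a case-sensitive "contains" test.

```python
def is_excluded(url: str, exclusions: list[str]) -> bool:
    """Return True if any exclusion string appears anywhere in the URL."""
    # Assumption: a partial match is a simple, case-sensitive substring test.
    return any(match in url for match in exclusions)


# Hypothetical exclusion list and URLs, for illustration only.
exclusions = ["/archive/", "/#contact"]
urls = [
    "https://www.example.com/archive/2019/report.html",
    "https://www.example.com/page1/#contact",
    "https://www.example.com/news/",
]

for url in urls:
    status = "skipped" if is_excluded(url, exclusions) else "crawled"
    print(f"{status}: {url}")
```

Under these assumptions the first two URLs are skipped and the third is crawled. Note that "/archive/" would not match a URL ending in "/archive" without the trailing slash, so it is worth choosing the match string carefully.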

Also, if you exclude a page, you also exclude the links on that page unless the crawler can navigate to those links via a different page. Consider the structure in the example below, where each letter represents a page. If you exclude page C, the crawler will never find pages E, F and G (unless they are linked to from another page).

[Picture of the page hierarchy: the top-level page A links to pages B and C; page B links to page D; page C links to page E; page E links to pages F and G.]
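The effect of excluding page C can be reproduced with a small breadth-first traversal over the page hierarchy described above. This is an illustrative sketch only: the `crawl` function and the `SITE` dictionary are hypothetical, and the real crawler's traversal order is not documented here.

```python
from collections import deque


def crawl(links: dict[str, list[str]], start: str, excluded: set[str]) -> list[str]:
    """Return the pages reached from `start`, never visiting excluded pages."""
    if start in excluded:
        return []
    found = []
    visited = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        found.append(page)
        for target in links.get(page, []):
            # An excluded page is never fetched, so its links are never seen.
            if target in visited or target in excluded:
                continue
            visited.add(target)
            queue.append(target)
    return found


# The hierarchy from the example: A -> B, C; B -> D; C -> E; E -> F, G.
SITE = {"A": ["B", "C"], "B": ["D"], "C": ["E"], "E": ["F", "G"]}

print(crawl(SITE, "A", set()))   # ['A', 'B', 'C', 'D', 'E', 'F', 'G']
print(crawl(SITE, "A", {"C"}))   # ['A', 'B', 'D']

# If page D also linked to E, then E, F and G would be found again.
SITE_WITH_EXTRA_LINK = {**SITE, "D": ["E"]}
print(crawl(SITE_WITH_EXTRA_LINK, "A", {"C"}))  # ['A', 'B', 'D', 'E', 'F', 'G']
```

The last call shows the "unless they are linked to from another page" case: a second route to E is enough for the crawler to reach E, F and G even though C stays excluded.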

How do I add an exclusion on my site?

  1. Select Settings > Content > Crawl Settings.
  2. Select the site for which you would like to add the exclusions.
  3. Click Exclude.
  4. Type in the URL of the exclusion match and click "Create exclusion".
  5. These setting changes will take effect after your next website crawl.

 

What is an alias?

An alias helps our crawler better determine what content is considered "internal" or "external" to your website using a URL match.

For example, an alias can be used to specify whether pages on a subdomain should be included in your website crawl results (internal: will be checked) or factored out (external: will not be checked).

Reasons to add an alias include: 

  • You have just taken over responsibility for a new subdomain (e.g. https://news.example.com) of your website domain (e.g. https://www.example.com) and you would like it to be checked as part of the original site.
  • You want to stop a section (e.g. /calendar/) from being crawled, but you would still like links from your main site to that section to be flagged if they are broken.

Internal content

An internal page is considered a part of your site and will be checked for broken links, misspellings, accessibility issues, etc. Content is treated as internal unless you select the "Crawl as external content" option when adding an alias.

Note: A link to the aliased domain must exist on the website for our crawler to index it. If no such link is available, contact technical support, who can add an 'extra index URL' to achieve the same purpose. For example, if you want https://myothersite.demosite.com to be considered part of your site https://demosite.com then, in addition to adding an alias, you will need a link to https://myothersite.demosite.com on at least one page of https://demosite.com.

External content

External content is not considered part of your site and will not be checked for broken links, misspellings, accessibility issues, etc. Content is treated as external if you select the "Crawl as external content" option when adding an alias.

You would add an alias for external content if you want to make sure links to that content are not broken, but you do not necessarily want to check the content of the pages themselves.

For example, if https://www.demosite.com/calendar/ is added as an alias, with "Crawl as external content" selected, any link URL containing 'https://www.demosite.com/calendar/' will be checked to make sure it is available (i.e. not broken). However, the content and links on the page(s) associated with that URL will not be evaluated.
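The internal/external distinction can be summarised as a small classification rule. The sketch below is an assumption-laden illustration rather than the product's logic: the `Alias` class, the `classify` function, the rule that the first matching alias wins, and the fallback for URLs that match no alias are all hypothetical.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class Alias:
    match: str                       # domain or URL fragment to match
    crawl_as_external: bool = False  # mirrors the "Crawl as external content" option


def classify(url: str, site_host: str, aliases: list[Alias]) -> str:
    """Label a URL "internal" (content checked) or "external" (link checked only)."""
    for alias in aliases:
        # Assumption: aliases use the same partial (substring) match as exclusions.
        if alias.match in url:
            return "external" if alias.crawl_as_external else "internal"
    # Assumption: with no matching alias, only the site's own host is internal.
    host = urlparse(url).hostname or ""
    return "internal" if host == site_host else "external"


aliases = [
    Alias("https://www.demosite.com/calendar/", crawl_as_external=True),
    Alias("myothersite.demosite.com"),
]

for url in [
    "https://www.demosite.com/calendar/2024/",
    "https://myothersite.demosite.com/news",
    "https://www.demosite.com/about",
]:
    print(classify(url, "www.demosite.com", aliases), url)
```

In this sketch the calendar URL is labelled external, so only its availability would be checked, while the subdomain and the ordinary page are labelled internal and would be checked in full.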

How do I add an alias on my website?

Note: When setting up an alias only a partial match on the link is needed. A match of "/calendar/" will apply to all links and pages containing "/calendar/".

  1. Select Settings > Content > Crawl Settings.
  2. Select the site for which you would like to add the Alias.
  3. Add the domain or URL match for the Alias you are adding. 
  4. Select "Crawl as external content" if you are creating an external Alias. Do not select this option for an internal Alias.
  5. Click on "Create alias".
  6. These setting changes will take effect after your next website crawl.

Note: If you are setting up a domain alias, only a domain name is required. Typing in "example.com" automatically ensures that all subdomains are included; i.e. www.example.com, news.example.com, and any other subdomains that you may have. Conversely, if you identify a subdomain by typing in the alias news.example.com, only this subdomain will be included.
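The difference between a domain alias and a subdomain alias can be sketched as a host-suffix check. This is only one plausible reading of the note: the function `host_matches_alias` is hypothetical, and it assumes the alias is compared against the host name on label boundaries, which is stricter than a plain substring match.

```python
from urllib.parse import urlparse


def host_matches_alias(url: str, alias_domain: str) -> bool:
    """Return True if the URL's host is the alias domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    alias_domain = alias_domain.lower()
    # Assumption: "example.com" covers "example.com" itself and "*.example.com".
    return host == alias_domain or host.endswith("." + alias_domain)


# A domain alias covers every subdomain.
print(host_matches_alias("https://www.example.com/", "example.com"))        # True
print(host_matches_alias("https://news.example.com/today", "example.com"))  # True

# A subdomain alias covers only that subdomain.
print(host_matches_alias("https://news.example.com/today", "news.example.com"))  # True
print(host_matches_alias("https://www.example.com/", "news.example.com"))        # False
```

Matching on label boundaries keeps an unrelated host such as notexample.com from being picked up by the alias "example.com". Whether the crawler applies the same boundary rule is not stated in this article.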

Is it possible to use regular expressions for exclusions and aliases?

Yes, it is possible. Please contact our Technical Support team who will help you set up exclusions/aliases with regular expressions.
