Readability: How does Siteimprove identify content on my page?

Modified on: Thu, 27 Oct, 2022 at 12:33 PM

Siteimprove's readability algorithm makes a best-effort attempt to identify the content on your page.

We identify content using a machine learning model that attempts to detect the parts of a web page that are actual text content. This means we try to disregard template elements, known in technical terms as “boilerplate” content (e.g., sitewide navigation, headers, and footers).

We also disregard sentences that are fewer than 6 words long or contain fewer than 45 characters. This is done to increase the accuracy of the Readability results provided by our machine learning model.
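To make the sentence rule concrete, here is a minimal Python sketch of such a filter. Only the two thresholds (6 words, 45 characters) come from this article. The function names, the naive sentence splitter, and the choice to count spaces and punctuation as characters are assumptions made for illustration, not Siteimprove's actual implementation.

```python
import re

MIN_WORDS = 6    # threshold stated in this article
MIN_CHARS = 45   # threshold stated in this article

def split_sentences(text: str) -> list[str]:
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def is_scored(sentence: str) -> bool:
    """Keep a sentence only if it has at least 6 words AND at least 45 characters.

    Assumption: the character count includes spaces and punctuation.
    """
    return len(sentence.split()) >= MIN_WORDS and len(sentence) >= MIN_CHARS

def scored_sentences(text: str) -> list[str]:
    """Return the sentences that would not be disregarded under the rule above."""
    return [s for s in split_sentences(text) if is_scored(s)]
```

Under these assumptions, a short link label such as "Read more." is dropped for having too few words, and a sentence such as "We do it so you can go." is dropped because it has 7 words but far fewer than 45 characters.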

Including/excluding content for Readability

If there’s content that we are currently not considering for Readability, you can add the class SI-CONTENT-YES so that it is included.

If there’s content on the page that you don’t want Siteimprove to consider in the readability calculations, you can use the class SI-CONTENT-NO to exclude it.

Note: Using the class SI-CONTENT-YES can make readability results less accurate. When this class is used, we no longer disregard content that reduces the accuracy of our Readability checks (e.g., sentences with fewer than 6 words).
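As an illustration of how the two classes might be applied in markup, the Python sketch below parses a small sample page and labels each piece of text as "include", "exclude", or "auto" (left to the model). The class names come from this article. The sample markup, the OverrideCollector name, and the rule that nested elements inherit the label of their nearest ancestor are assumptions for the sake of the example and may not match how Siteimprove processes a page.

```python
from html.parser import HTMLParser

INCLUDE, EXCLUDE = "SI-CONTENT-YES", "SI-CONTENT-NO"  # class names from this article
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}

class OverrideCollector(HTMLParser):
    """Label each text node 'include', 'exclude', or 'auto'."""

    def __init__(self):
        super().__init__()
        self.stack = []   # one label per currently open element
        self.texts = []   # (label, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements have no end tag and hold no text
        classes = (dict(attrs).get("class") or "").split()
        if EXCLUDE in classes:
            label = "exclude"
        elif INCLUDE in classes:
            label = "include"
        else:
            # Assumption: children inherit the nearest ancestor's label.
            label = self.stack[-1] if self.stack else "auto"
        self.stack.append(label)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.texts.append((self.stack[-1] if self.stack else "auto", text))

# Hypothetical sample page: one excluded block, one forced-in block, one left alone.
page = """
<div class="SI-CONTENT-NO"><p>Cookie notice text.</p></div>
<aside class="tip SI-CONTENT-YES"><p>Short tip.</p></aside>
<p>Regular paragraph.</p>
"""
collector = OverrideCollector()
collector.feed(page)
print(collector.texts)
```

The class can sit alongside other classes on the same element, as in the aside above. Note that "Short tip." is only two words long, which is exactly the kind of short sentence the note above warns will be counted once SI-CONTENT-YES is applied.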

Web pages differ greatly from each other in structure and content, so it is often difficult to identify which parts of a page are in fact content.

Identifying content

Here are some of the main methods we use to identify content:

What is the enclosing HTML tag of the text? Some tags are predominantly used for content, and some for boilerplate. We also look at the attributes of the tag, especially the "class" attribute.
Text-to-tag ratio: template elements usually contain much more HTML markup compared to areas with actual content.
Where on the page is the text located? The first and last part of an HTML document tends to be the header and footer, respectively.
To what degree are the surrounding text blocks content or template?

The above criteria are based on what is generally considered good or standard practice in HTML markup.
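Two of these signals, the text-to-tag ratio and the position of a block on the page, are simple enough to sketch in Python. The example below is illustrative only: the function names, the exact ratio definition (visible text characters divided by the number of opening tags), and the 0-to-1 position scale are assumptions, and the real model combines its signals in ways this article does not describe.

```python
from html.parser import HTMLParser

class _Counter(HTMLParser):
    """Count opening tags and visible text characters in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.tags = 0
        self.text_chars = 0

    def handle_starttag(self, tag, attrs):
        self.tags += 1

    def handle_data(self, data):
        self.text_chars += len(data.strip())

def text_to_tag_ratio(fragment: str) -> float:
    """Text characters per opening tag; a higher value suggests content-like markup."""
    counter = _Counter()
    counter.feed(fragment)
    counter.close()
    return counter.text_chars / max(counter.tags, 1)

def block_features(blocks: list[str]) -> list[dict]:
    """Pair each block's ratio with its relative position (0.0 = first, 1.0 = last)."""
    last = max(len(blocks) - 1, 1)
    return [
        {"ratio": text_to_tag_ratio(block), "position": index / last}
        for index, block in enumerate(blocks)
    ]

# Hypothetical blocks: a navigation list and a paragraph of body text.
nav = '<ul><li><a href="/">Home</a></li><li><a href="/about">About</a></li></ul>'
article = "<p>Readability scores estimate how easy a text is to understand.</p>"

for features in block_features([nav, article, nav]):
    print(features)
```

In this sample, the navigation list has five tags wrapped around nine characters of text, while the paragraph has a single tag around a full sentence, so the paragraph scores a much higher ratio. A classifier could weigh a feature like this together with the block's position and the labels of neighboring blocks.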

