Readability: How does Siteimprove identify content on my page?
By Guðrún Gústafsdóttir
Siteimprove's readability algorithm makes a best-effort attempt to identify the content on your page.
We identify content using a machine learning model that attempts to detect the parts of a webpage that are actual text content. This means that we try to disregard template elements, known in technical terms as “boilerplate” content (e.g. sitewide navigation, header, footer, etc.).
However, webpages differ greatly from one another in both structure and content, so it is often difficult to determine which parts of a page are in fact content.
Here are some of the main ways that we try to identify content:
- What is the enclosing HTML tag of the text? Some tags are predominantly used for content, some for boilerplate. We also look at the attributes of the tag, especially the "class" attribute.
- Text-to-tag ratio: template elements usually contain far more HTML markup relative to their text than areas with actual content do.
- Where on the page is the text located? The very first and very last parts of an HTML document tend to be header and footer, respectively.
- To what degree are the surrounding text blocks content or template?
The above criteria are based on what would generally be considered good or normal practice in HTML markup.
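To make the text-to-tag ratio criterion more concrete, here is a minimal Python sketch. It is not Siteimprove's actual model, which is a machine learning model combining several signals; it only illustrates how one such signal could be computed. The class `TextTagCounter`, the function `text_to_tag_ratio`, and the sample HTML fragments are hypothetical names and data made up for this illustration.

```python
from html.parser import HTMLParser


class TextTagCounter(HTMLParser):
    """Counts text characters and opening tags in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.text_chars = 0
        self.tag_count = 0

    def handle_starttag(self, tag, attrs):
        # Every opening tag counts as one unit of markup.
        self.tag_count += 1

    def handle_data(self, data):
        # Ignore leading/trailing whitespace around text runs.
        self.text_chars += len(data.strip())


def text_to_tag_ratio(fragment: str) -> float:
    """Text characters per opening tag (illustrative metric, an assumption)."""
    counter = TextTagCounter()
    counter.feed(fragment)
    counter.close()
    # Avoid division by zero for fragments without any tags.
    return counter.text_chars / max(counter.tag_count, 1)


# Hypothetical fragments: a navigation menu and a paragraph of body text.
nav = '<ul><li><a href="/">Home</a></li><li><a href="/about">About</a></li></ul>'
article = "<p>Readability scores are computed only on the main text of a page.</p>"

print("navigation:", text_to_tag_ratio(nav))
print("article:   ", text_to_tag_ratio(article))
```

In this sketch the navigation fragment has many tags and little text, so its ratio is low, while the paragraph has a single tag wrapping a full sentence, so its ratio is much higher. A real classifier would weigh a signal like this together with the tag type, the "class" attribute, the position on the page, and the neighbouring blocks rather than relying on it alone.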