This article goes into the nitty-gritty details of how reading mode parses certain metadata about articles from webpages.
This article is part three in a series on web reading mode and reading mode parsers.
What metadata is extracted, and why?
Every reading mode wants to display an appropriate title for the displayed article. Some also include a byline. A byline is often the very first paragraph below the headline of an article, crediting the article author. Bylines often include the publication time as well as the name of the author.
The following table gives an overview of which implementations use which parsers, the metadata they extract, and the method they use to extract it.
Product | Parser | Metadata | ||
---|---|---|---|---|
Title | Author | Date | ||
GNOME Web | Mozilla Readability | document.title | meta author | none |
Mozilla Firefox | Readability-specific | |||
Vivaldi | none | |||
Yandex Browser | ||||
Samsung Browser (Android) | ||||
Apple Safari | Safari Reader | Safari-specific | ||
Maxthon | Maxthon Reader | Maxthon-specific | none | |
Microsoft Edge | EdgeHTML | Edge-specific | ||
Microsoft Edge (Android) | DOM Distiller | DOM Distiller-specific | none | |
Google Chrome (Android) | ||||
Mercury Reader | Web Reader | Mercury-specific | meta author | Open Graph |
Instapaper | Instaparser | Open Graph | ||
Mozilla Pocket | Unknown | Site-specific parsing |
Dates, when included, are displayed in the format they’re in when detected on the page. None of the parsers try to parse the date string or do any kind of localization on it.
The rest of the article will go into details about each of the above parsers and how they pick out metadata. Before I get into that, I’ll quickly address a common problem which may very well be the reason why you’re reading this article.
Avoiding duplicate bylines and dates
You may see an unwanted byline or publication date (in reading mode implementations that don’t support them) or a duplicated one (in implementations that do). It can be difficult to work out how to get rid of these.
The following markup will solve this problem in all reading mode implementations assuming that the parser in question can properly detect both the author and publication time. This works even in browsers that don’t make use of the byline.
Note that the above isn’t a one-stop answer to proper metadata extraction in all reading mode implementations. However, the above markup example will allow the parser to remove the byline paragraph. The dateline or byline will be removed from the document (assuming they match the value held by the parser), leaving a paragraph of fewer than 25 characters which, as discussed in Part 1, will also be removed from the document.
Now that that’s out of the way, let’s get into the real nitty-gritty details of reading mode metadata extraction.
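The removal step described above can be sketched roughly as follows. This is a simplified Python model, not any parser's actual code: the helper name `drop_byline_paragraphs`, the plain-string paragraphs, and the sample byline text are all made up for illustration. Only the 25-character threshold comes from the text (via Part 1).

```python
MIN_PARAGRAPH_LENGTH = 25  # threshold cited from Part 1 of the series


def drop_byline_paragraphs(paragraphs, author, date):
    """Return the paragraphs that survive once known metadata is removed.

    Hypothetical helper: real parsers work on DOM nodes, not plain strings.
    """
    kept = []
    for text in paragraphs:
        remainder = text
        # Remove the values the parser already holds, if they occur verbatim.
        for known in (author, date):
            if known:
                remainder = remainder.replace(known, "")
        # Collapse leftover whitespace before measuring what remains.
        remainder = " ".join(remainder.split())
        if len(remainder) >= MIN_PARAGRAPH_LENGTH:
            kept.append(text)
    return kept


paragraphs = [
    "By Cave Johnson on 1 January 2018",  # made-up byline paragraph
    "The article body starts here and runs well past the length threshold.",
]
print(drop_byline_paragraphs(paragraphs, "Cave Johnson", "1 January 2018"))
```

If the parser fails to detect the author and date (modelled here by passing empty strings), the byline paragraph stays long enough to survive, which is the duplicate-byline situation described at the start of this section.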
Author name in Mozilla Readability, Maxthon Reader, and Mercury Reader
This is the most straightforward and even web-standards-compliant detection method. Mozilla Readability, Maxthon Reader, and Mercury Reader will find the first instance of a meta[name="author"] element in the document and use its value as the primary candidate for the author name.
The below example will set the author name to “Cave Johnson”:
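As a hedged illustration, the Python sketch below embeds a minimal page carrying that meta element and picks out the first author value using the standard library's `html.parser`. The page content, the class name `MetaAuthorParser`, and the function `first_meta_author` are assumptions for this example. None of the parsers discussed here are written this way.

```python
from html.parser import HTMLParser

# Made-up page; only the meta element matters for this example.
PAGE = """\
<!DOCTYPE html>
<html lang="en">
<head>
  <title>Article Title - Example Website</title>
  <meta name="author" content="Cave Johnson">
</head>
<body><p>Article body text.</p></body>
</html>
"""


class MetaAuthorParser(HTMLParser):
    """Remembers the content of the first <meta name="author"> element."""

    def __init__(self):
        super().__init__()
        self.author = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.author is not None:
            return  # not a meta element, or an author was already found
        attributes = {key: value or "" for key, value in attrs}
        if attributes.get("name") == "author":
            self.author = attributes.get("content", "").strip()


def first_meta_author(html):
    parser = MetaAuthorParser()
    parser.feed(html)
    parser.close()
    return parser.author


print(first_meta_author(PAGE))  # Cave Johnson
```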
Mozilla Readability and Maxthon Reader also support other detection mechanisms, including *[rel="author"], but these methods are unreliable and difficult to work with as they involve DOM-distance and depth-difference calculations from the perceived beginning of the article body text.
Titles in Mozilla Readability
There are two primary methods used to determine the article title, and both derive it from the first <title> element in the document (same as the document.title property). GNOME Web is the exception here and just uses the document.title verbatim.
The first method looks for a segment separator consisting of the character sequence space + separator + space, where the separator is one of the following characters: -, >, », |, \, or /. If such a sequence is found, the leftmost segment is considered the candidate for the document title. A document.title of “Article Title - Example Website” would get “Article Title” as the candidate title.
If the first method fails, the second method is tried instead. This method looks for a word followed by a colon (:) and a space character, and the second segment from the left is considered the title candidate. A document.title of “Example Website: Article Title: Extra” would get “Article Title” as the candidate title.
There’s no support for right-to-left (RTL) script systems and languages.
In both of our above examples, the candidate title is “Article Title”. The candidate title is then evaluated against the following criteria:
- The title must contain at least five space-separated words
- The title must contain between 16 and 150 characters (sorry CJK languages!)
Both of the example candidates above resulted in the candidate title “Article Title”, which doesn’t pass the above criteria: it has fewer than five words and is shorter than 16 characters. In this situation, the original document.title is used instead.
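The two methods and the fallback can be put together in a short Python sketch. It follows the description in this section rather than Readability's actual source, so the function names `pick_title` and `is_acceptable` are illustrative. The colon check is also simplified to a plain search for a colon followed by a space.

```python
import re

# " - ", " > ", " » ", " | ", " \ " or " / ", per the first method above.
SEGMENT_SEPARATOR = re.compile(r" [-»>|\\/] ")


def is_acceptable(candidate):
    """Apply the word-count and length criteria listed above."""
    return len(candidate.split(" ")) >= 5 and 16 <= len(candidate) <= 150


def pick_title(document_title):
    title = document_title.strip()
    match = SEGMENT_SEPARATOR.search(title)
    if match:
        # Method 1: keep the leftmost segment.
        candidate = title[:match.start()]
    elif ": " in title:
        # Method 2: keep the second segment from the left.
        candidate = title.split(": ")[1]
    else:
        candidate = title
    candidate = candidate.strip()
    # Fall back to the full document.title when the candidate is rejected.
    return candidate if is_acceptable(candidate) else title


for sample in (
    "Article Title - Example Website",
    "Example Website: Article Title: Extra",
    "How reading mode picks a title | Example Website",
    "Example Website: How reading mode picks a title",
):
    print(repr(sample), "->", repr(pick_title(sample)))
```

The first two samples fall back to the full document.title because “Article Title” is too short, while the last two yield “How reading mode picks a title”.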
Many websites and content management systems use other segment separators, such as mid-dots (·) and bullets (•), which aren’t compatible with Readability either. WordPress, for example, uses an en-dash (–) instead of a hyphen-minus (-) character as the segment separator, and as such isn’t supported by default.
Safari Reader
Apple’s fork of Readability has little in common with Mozilla Readability and is more than twice the size.
Safari Reader finds the title by visually examining the document to find what it determines to be the most likely document title. Candidates are evaluated by how near they are to what Safari Reader determines is the visual top of the article as well as their Levenshtein distance from the document.title.
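The text-distance half of that evaluation can be sketched in Python. How Safari Reader weighs the Levenshtein distance against visual position isn't covered here, so this example only ranks candidate strings by edit distance. The function names and the sample candidates are hypothetical.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def closest_to_document_title(candidates, document_title):
    """Pick the candidate with the smallest edit distance to document.title."""
    return min(candidates, key=lambda text: levenshtein(text, document_title))


# Made-up heading texts competing to be the article title.
candidates = ["Related stories", "Article Title"]
print(closest_to_document_title(candidates, "Article Title - Example Website"))
```

A real implementation would combine this score with layout measurements, which a plain-string example cannot model.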
Once the title has been selected, Safari can pick out the author byline from elements with class names such as author-name and article-author, among others, or from other candidates like a[rel="author"]. The publication date is usually the date element nearest the title, or an element with class names like dateline or entry-date. The candidate that’s visually nearest the title is usually selected.
Safari sources the Schema.org microdata and Open Graph Protocol metadata from the document, but these are never used to populate the byline. They’re only used to confirm the selection after candidates have already been selected.
To ensure Safari Reader compatibility, you can add class names it wants to see. In the end, Safari Reader requires trial and error to ensure compatibility.
Maxthon Reader
Maxthon Reader is a fork of Mozilla Readability that is also clearly inspired by Safari Reader, and it has been adjusted and optimized for Chinese, Japanese, and Korean (CJK) language websites. Maxthon Reader employs several site-specific fixes for popular Chinese-language news websites.
The title selection algorithm is mostly based on going through all the h1–h6 elements in the document, measuring their visual size and their distance from the visual top of the document, and determining whether they’re partial matches of the document.title. The title has to be at least four characters long. The document.title is used in its entirety if no matching headline can be found in the document.
Microsoft Edge
Edge is the only web browser that has published any guidelines for web developers! 🎉
The title is selected from a meta[name="title"] element. Its value should match either an h1–h3 element (preferably with a title class name), OR yield a partial match in the document.title. That is astonishingly straightforward compared to many of the other methods discussed in this article, and it’s even officially documented!
The author byline is selected from the first element in the article that contains a byline-name class name. The publication date is likewise the first element that contains the dateline class name, OR it can pick a meta[name="displaydate"] element. Simple, right?
However, Edge seems quite buggy with these two items and will often ignore the byline and dateline entirely. Reloading and trying again usually helps.
A complete example for Microsoft Edge looks like this:
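As a hedged sketch, the Python snippet below embeds a small page that uses the hooks described in this section (meta title, meta displaydate, a heading with a title class, byline-name, and dateline) and collects them with the standard library's `html.parser`. The page content, the date, and every Python name are placeholders. Treating “partial match” as a substring test is an assumption, not documented Edge behavior.

```python
from html.parser import HTMLParser

# Made-up page using the hooks described above; all values are placeholders.
EDGE_EXAMPLE = """\
<!DOCTYPE html>
<html lang="en">
<head>
  <title>Article Title - Example Website</title>
  <meta name="title" content="Article Title">
  <meta name="displaydate" content="1 January 2018">
</head>
<body>
  <article>
    <h1 class="title">Article Title</h1>
    <p>
      <span class="byline-name">Cave Johnson</span>
      <span class="dateline">1 January 2018</span>
    </p>
    <p>Article body text.</p>
  </article>
</body>
</html>
"""


class EdgeHintParser(HTMLParser):
    """Collects the first occurrence of each hint named in this section."""

    def __init__(self):
        super().__init__()
        self.hints = {}
        self._target = None  # key of the hint currently receiving text

    def _start(self, key):
        if key not in self.hints:  # only the first match counts
            self.hints[key] = ""
            self._target = key

    def handle_starttag(self, tag, attrs):
        attributes = {name: value or "" for name, value in attrs}
        if tag == "meta":
            if attributes.get("name") in ("title", "displaydate"):
                self.hints.setdefault("meta:" + attributes["name"],
                                      attributes.get("content", ""))
        elif tag == "title":
            self._start("document.title")
        elif tag in ("h1", "h2", "h3"):
            self._start("heading")
        else:
            for class_name in attributes.get("class", "").split():
                if class_name in ("byline-name", "dateline"):
                    self._start(class_name)

    def handle_data(self, data):
        if self._target is not None:
            self.hints[self._target] += data

    def handle_endtag(self, tag):
        self._target = None  # simplification: nested elements are not handled


def edge_hints(html):
    parser = EdgeHintParser()
    parser.feed(html)
    parser.close()
    return {key: " ".join(value.split()) for key, value in parser.hints.items()}


def title_is_confirmed(hints):
    """True if meta title equals the heading or occurs in document.title."""
    meta_title = hints.get("meta:title", "")
    return bool(meta_title) and (
        meta_title == hints.get("heading")
        or meta_title in hints.get("document.title", "")
    )


hints = edge_hints(EDGE_EXAMPLE)
print(hints)
print(title_is_confirmed(hints))
```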
Edge also recognizes some other (undocumented) detection methods like meta[name="author"], and class names like entry-date, publish-date, and .fn.author.
Chrome DOM Distiller
The Chrome DOM Distiller by Google is only reluctantly included in Chrome for Android and hidden under an option in the accessibility menu. It’s featured more prominently in Microsoft Edge for Android, however.
I frankly don’t have any idea how it works. Chrome DOM Distiller is more than 23 000 lines of Java and I haven’t been able to make heads or tails of it.
Mercury Reader
Mercury Reader, available as a browser extension for Google Chrome, uses the h1–h4 element nearest to the top of what it determines to be the article as the title.
As discussed above, Mercury Reader extracts the article author from the meta[name="author"] element. It extracts the publication date through the Open Graph Protocol, which is the same method we’ll discuss next.
Instaparser (Open Graph Protocol)
I was surprised to learn that only Instaparser uses Open Graph Protocol metadata. These are the metadata tags that Facebook and other social networks use to extract metadata from websites and generate rich-link sharing experiences. They’re widely supported by all major news websites and blogs.
I’ve discussed limitations of RDFa parsers used by social media websites in the past. The property attribute is supposed to contain space-separated values, but the social networks as well as Instaparser can only process metadata if the property attribute is a single value and contains no spaces. Instapaper also has issues recognizing meta tags if they’ve got unexpected extra attributes.
Instaparser uses a non-standards-compliant Resource Description Framework in Attributes (RDFa) parser to extract Open Graph Protocol (OGP) metadata, including the title and the byline author complete with publication date.
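The single-value limitation can be modelled with a short Python sketch. It is not Instaparser's code: the class and function names are made up, the choice of og:title, article:author, and article:published_time as the relevant properties is an assumption, and the second token in the last meta tag is only there to show a multi-value property being ignored. The issue with unexpected extra attributes is not modelled.

```python
from html.parser import HTMLParser

# Made-up head fragment; the last tag uses a space-separated property value.
HEAD = """\
<meta property="og:title" content="Article Title">
<meta property="article:author" content="Cave Johnson">
<meta property="article:published_time dcterms:issued" content="2018-01-01">
"""


class SingleValueOpenGraphParser(HTMLParser):
    """Keeps meta properties only when the property attribute is one token."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {name: value or "" for name, value in attrs}
        prop = attributes.get("property", "")
        # Mimic the limitation: any whitespace in the value means "ignored".
        if not prop or any(char.isspace() for char in prop):
            return
        self.properties.setdefault(prop, attributes.get("content", ""))


def read_open_graph(html):
    parser = SingleValueOpenGraphParser()
    parser.feed(html)
    parser.close()
    return parser.properties


print(read_open_graph(HEAD))
```

In this model the publication time is lost even though the markup is valid RDFa, which is why keeping each property attribute to a single value is the safer choice.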
Pocket
Lastly, there’s Pocket, and my recommendation is not to bother with it at all. Pocket doesn’t publish any recommendations for web developers. The only thing that’s known about their parser is that it relies heavily on site-specific parsing.
I’ve run the same test documents through Pocket from different URLs and ended up with different results. Pocket’s parser picks up odd links in the middle of the article or in surrounding site navigation and sets them as the author. Pocket also struggles with articles that are split up into sections with subheadings, often selecting a random section in the middle of the article instead of the full article.
I’ve contacted Pocket several times asking for details about their parser, and asking specifically for help resolving compatibility issues with my own websites, to no avail. Pocket will either work to support your site or they won’t. There isn’t much you as a web developer can do to influence it.