Why are Facebook and Twitter so bad at parsing RDFa metadata?

Facebook and Twitter both chose to use RDFa for their “optimized link sharing” metadata formats. Well, it would seem Twitter didn’t realize that was what they had done until a later stage. In any case, why are both of these social media platforms with vast engineering resources at their disposal so bad at parsing RDFa data from webpages?

The gist of this article is that Facebook chose RDFa, and Twitter walked backwards into it, yet neither implements even a minimal RDFa parser to retrieve metadata from websites. This means, for example, that a meta element whose content attribute applies to multiple vocabularies (Twitter Cards and OpenGraph) must needlessly be duplicated, and that even the meta elements themselves are redundant. But I’m getting ahead of myself; let me get back on track.

RDFa “ultra-light”

‘Twitter Cards’ is Twitter’s name for their tiny metadata format for enabling rich snippets in shared links. These snippets can include details such as a thumbnail and the title of the link being shared among the platform’s users. Twitter Cards rely on markup that appears to be RDFa at first glance, but is apparently the result of cargo-cult copying something that resembles a metadata format without a real understanding of its purpose or its context on the web.

Facebook’s ‘OpenGraph Protocol’ (OGP) is more technically capable, if only in theory, as it’s based on an “ultra-light” variant of RDFa. From the available documentation, the main difference between RDFa and whatever OGP is using is that the latter can only be set on meta elements inside the head element. This is not a restriction inherited from RDFa, nor is it really necessary for OGP’s purposes; it’s an arbitrary limitation set by Facebook’s engineers.

Speaking of RDF attributes (RDFa), you should be at least somewhat familiar with RDFa Core syntax to follow along with this article. Here is a quick refresher: RDF attributes allow embedding semantically meaningful content within HTML tags. The syntax relevant to this article is limited to embedded key-value pairs of metadata using the property attribute (key) and content attribute (value):

“@content: a string, for supplying machine-readable content […].”

“@property: a white space separated list of [terms.]”

Note that there is no mention of the name attribute as this holds no significance in RDFa whatsoever. When used in examples in this article, it’s for the benefit of non-RDFa metadata consumers and legacy systems (and to make a point). I’ll get back to this in the next section.

For brevity, please just imagine that all the examples in this article have already declared the following namespaces. Neither Twitter nor Facebook care about namespaces and only support their own hardcoded namespace prefixes. That is a discussion for another time, however.

<html prefix="dc: http://purl.org/dc/terms/
              og: http://ogp.me/ns#
              twitter: https://dev.twitter.com/cards/markup#"> …

Twitter has not published an official namespace URL for the Twitter Cards vocabulary. The URL shown in the example above is the one used by convention.
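To make the role of the prefix declaration concrete, here is a minimal Python sketch of how a consumer could turn a prefix attribute into a lookup table and expand a term such as og:title into a full IRI. The function names parse_prefixes and expand_term are made up for this illustration, and the sketch ignores the rest of the RDFa prefix rules (default vocabularies, predefined prefixes, and so on).

```python
def parse_prefixes(prefix_attr: str) -> dict[str, str]:
    """Turn 'og: http://ogp.me/ns# ...' into {'og': 'http://ogp.me/ns#', ...}."""
    tokens = prefix_attr.split()
    # Tokens alternate between a prefix name ending in ':' and the IRI it maps to.
    return {name.rstrip(":"): iri for name, iri in zip(tokens[0::2], tokens[1::2])}


def expand_term(term: str, prefixes: dict[str, str]) -> str:
    """Expand 'og:title' to a full IRI; terms with unknown prefixes are returned as-is."""
    prefix, sep, reference = term.partition(":")
    if sep and prefix in prefixes:
        return prefixes[prefix] + reference
    return term


# The og and twitter prefixes from the declaration above.
PREFIXES = parse_prefixes("""og: http://ogp.me/ns#
                             twitter: https://dev.twitter.com/cards/markup#""")

print(expand_term("og:title", PREFIXES))
print(expand_term("twitter:title", PREFIXES))
```

A consumer that worked this way would not need hardcoded prefixes: any prefix name mapped to the same IRI would resolve to the same term.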

I’ve done some experiments to see how Twitter’s and Facebook’s bots process metadata in the Twitter Cards and OpenGraph Protocol formats. The next two sections deal with extracting the page title using these formats.

Let’s observe Twitterbot parsing

Even though Twitter incorrectly recommends specifying the keys in name attributes, I’ll disregard that and use property attributes in all examples. The property attribute is the correct choice as per the RDFa standard and it’s also supported by Twitterbot. The decision to use name is likely a poor design choice that I suspect is rooted in the engineers not being familiar with RDFa when they designed their RDFa-imitation format.

Let’s just look at some non-working but perfectly valid examples:

<meta content="Great Title" name="title" property="twitter:title">
<meta content="Great Title" property="dc:title twitter:title">
<meta content="Great Title" data-sub="marine" property="twitter:title">

Then please compare the above examples to these working examples:

<meta content="Great Title" property="twitter:title" name="title">
<meta property=" twitter:title " name="title" content="Great Title">

All five examples are perfectly valid and should have worked. Yet, Twitter can only read the metadata from the last two examples. Worryingly, significance seems to be placed on the order of the element’s attributes. The contents of the property attribute are also treated as an exact-match key rather than a list of tokens.
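For contrast, here is a minimal sketch of order-insensitive, token-aware extraction using the HTML parser in Python’s standard library. It is only an illustration of the expected behavior, not a full RDFa processor, and the names MetaPropertyFinder and find_meta are made up for this example.

```python
from html.parser import HTMLParser


class MetaPropertyFinder(HTMLParser):
    """Collect the content of every meta element whose property list contains a term."""

    def __init__(self, term: str) -> None:
        super().__init__()
        self.term = term
        self.values: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)  # attribute order no longer matters from here on
        # @property is a whitespace-separated list of terms, so split it into tokens.
        terms = (attributes.get("property") or "").split()
        if self.term in terms and attributes.get("content") is not None:
            self.values.append(attributes["content"])


def find_meta(html: str, term: str) -> list[str]:
    finder = MetaPropertyFinder(term)
    finder.feed(html)
    finder.close()
    return finder.values


EXAMPLES = [
    '<meta content="Great Title" name="title" property="twitter:title">',
    '<meta content="Great Title" property="dc:title twitter:title">',
    '<meta content="Great Title" data-sub="marine" property="twitter:title">',
    '<meta content="Great Title" property="twitter:title" name="title">',
    '<meta property=" twitter:title " name="title" content="Great Title">',
]

for markup in EXAMPLES:
    print(find_meta(markup, "twitter:title"))  # ['Great Title'] for every example
```

Because the attributes end up in a dictionary and the property value is split on whitespace, extra attributes, attribute order, and additional terms make no difference to the result.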

There are only two conclusions that can be drawn from the differences between the working and non-working examples:

  1. Twitter engineers didn’t know anything about HTML and RDFa when they designed Twitter Cards.
  2. Twitterbot is committing the greatest sin imaginable when parsing HTML! They’re regex-soup-matching tags and attributes rather than actually parsing the markup! They’re doing it quite badly too.

[Video: a shocked owl turning dramatically]

Shocked owl finds it shocking that Twitter doesn’t actually parse HTML.

Shock and horror aside, what does this mean for the web? More redundant markup. In a perfect world every metadata consumer would have agreed on a common core of metadata descriptors. In lieu of one metadata standard, every metadata consumer should at least be expected to parse RDFa data in a sensible manner. The following example should have covered all bases:

Let’s observe Facebot parsing

But wait, what about Facebook? Their OpenGraph Protocol implementation seems to be actually designed around an understanding of RDFa syntax and not just cargo-culting (imitating) the syntax like we see in Twitter Cards. Facebook’s facebookexternalhit bot, lovingly known as Facebot, seems to be somewhat more competent than Twitterbot, and its error messages indicate that it’s actually parsing the markup. However, the implementation still leaves a lot to be desired.

Let’s look at some non-working but perfectly valid examples:

<meta content="Great Title" property="dc:title og:title">
<meta content="Great Title" property="  og:title  ">
<meta content="Great Title" name="title" property="og:title">

Then please compare the above examples to these working examples:

<meta content="Great Title" property="og:title">
<meta content="Great Title" property="og:title" name="title">

Facebot, exactly like Twitterbot, seems to not have got the memo about the property attribute being a space-separated list of terms. Facebot will only do an exact match of property="og:title", without even stripping surrounding whitespace. Twitterbot at least strips whitespace, even though it doesn’t understand that the value can be a space-separated list.

Like Twitterbot, Facebot treats the name attribute as significant even though it holds no significance in RDFa parsing. The two bots only have a problem with unexpected attributes if they appear before the property attribute in the element. Again, I’d like to remind readers that the order of attributes holds no significance in either RDFa or HTML.

Wouldn’t you know … facebookexternalhit bot is also regex-soup-matching tags and attributes rather than parsing the data like any metadata consumer should!
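To show how this kind of order sensitivity can arise, here is a purely hypothetical Python regular expression. It is not Facebot’s actual code, which isn’t public; it is just one naive pattern that happens to reproduce the results observed for the five examples above when they are written with the og:title term. The names NAIVE_OG_TITLE and naive_og_title are made up for this sketch.

```python
import re

# Hypothetical pattern, NOT Facebot's real implementation: it expects the content
# attribute first, immediately followed by a property attribute whose value is
# exactly "og:title".
NAIVE_OG_TITLE = re.compile(r'<meta\s+content="([^"]*)"\s+property="og:title"')


def naive_og_title(html: str):
    match = NAIVE_OG_TITLE.search(html)
    return match.group(1) if match else None


NON_WORKING = [
    '<meta content="Great Title" property="dc:title og:title">',      # extra term
    '<meta content="Great Title" property="  og:title  ">',           # whitespace
    '<meta content="Great Title" name="title" property="og:title">',  # name first
]
WORKING = [
    '<meta content="Great Title" property="og:title">',
    '<meta content="Great Title" property="og:title" name="title">',
]

for markup in NON_WORKING + WORKING:
    print(naive_og_title(markup), "<-", markup)
```

The first three inputs yield None and the last two yield the title, which matches the behavior described above. None of the five inputs would trouble an actual HTML parser.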

[Video: a shocked owl turning dramatically]

Shocked owl is shocked over the state of metadata extraction from webpages.

Enough with the stupid owls!

How should this have been solved?

The answer to this question is almost always to use XPath on a parsed XML representation like the DOM or XDM. Running XPath against raw HTML documents is unlikely to succeed, as authors write in HTML and XPath only operates on well-formed XML documents. Luckily, there are standardized processes for turning HTML into a parsed XML representation like the DOM or XDM. This document as displayed in your web browser right now is an example of that process. There are plenty of libraries for developers to choose from that can create such representations from HTML documents.

The following XPath will read out OpenGraph Protocol title metadata from a document while respecting the current limitations in the OGP standard:

/html/head/meta[contains(concat(
  ' ', normalize-space(@property), ' '),
  ' og:title ')]/@content

Or if you actually want full RDFa support, then this will let you read the value from the content attribute, or fall back to the element’s text node if no content attribute is set. This can read the data from any element (the title element seems appropriate, or maybe an h1 element somewhere further down on the page) and not just the meta element.

(
  //*[contains(concat(
    ' ', normalize-space(@property), ' '),
    ' og:title ')]/@content
  | //*[contains(concat(
    ' ', normalize-space(@property), ' '),
    ' og:title ')]/text()
)[position() = 1]

The above XPath would return the string “Great Title” in accordance with the RDFa standard from all of the following examples:

<title content="Great Title" property="og:title">Okay Title</title>
<meta name="title" property="og:title" content="Great Title">
<meta content="Great Title" name="title" property="dc:title og:title">
<h1 property="og:title">Great Title</h1>

In conclusion

So, is XPath the right tool for Twitterbot and Facebot? Well, yeah. Probably. Why don’t they use it? They don’t care in the slightest. Publishers will adjust their markup to work with the weak implementations that are in actual use.

To answer the leading question of the headline: Facebook and Twitter are really bad at extracting metadata from websites in their own standard formats because they’re not parsing the data, but rather rely on metadata extraction techniques based on wishful thinking. If their solution had been suggested as an answer to this problem on Stack Overflow, it would have been down-voted to oblivion.

Now that you’ve read this article and are better informed, there is no excuse for you not to implement proper document parsing in your own projects. Don’t repeat the mistakes of Facebook and Twitter.

1 comment

  1. I didn’t even recognize these “social tags” as RDFa. It’s such a shame Facebook and friends didn’t choose to use an established metadata standard like the Dublin Core.
