Unpicking Web Metadata

Published: 19 May 2016

Note: this is the second in a series of posts on the genesis of, and ideas behind, our project on Editorial Algorithms. Having derived a number of features (length, tone, readability) which we thought could help automate some of the rather tedious work needed to curate content, we tackled the following question: assuming we want to curate “the best of the internet” on a daily basis, how many of these features were already available for us to use in web metadata, and how many would we have to compute ourselves?

We’ve been working on a project that explores ways in which we can automatically extract editorial metadata (such as topic, entities, language and tone) from web content. Our aim is to be able to surface content from across the web that not only matches an audience’s specific interests, but also surprises them. We’re hot on the trail of the holy grail: not only do we want accuracy in terms of ‘here’s what you asked for’, we also want serendipity.

To make serendipity a possibility, we knew we’d need more than just a few search algorithms. We needed to build a detailed metadata model around each piece of web content we collected.

The first step towards this level of modelling was to understand the type and quality of metadata openly ‘shared’ by publishers. So, we did a targeted audit. We chose 30 publishers we thought were fairly representative of the web, and unpicked their metadata habits. We wanted to learn what metadata was easily extractable, what (if anything) was consistent, and what might be useful to us.

A study of 30 publishers doesn’t look particularly extensive, especially considering the size of the web, but a small and well-chosen sample set, along with a swift and unrecorded review of dozens of other articles, was enough for us to learn what we needed.

Here’s what we did and what we discovered.

We built a tool called The Metadata Extractor

One of our developers built a tool called The Metadata Extractor. (I’m sure we could have come up with a far more exciting name, like The Metadator™, but we didn’t.)

The tool allows you to enter a web page URL, and hit ‘extract’. It then lists all the immediately extractable metadata associated with that page. You can watch an example of it in action using a BBC news article in the video at the top of this post.
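
For a flavour of what the tool does, here’s a minimal sketch of the same kind of lookup in Python, using the requests and BeautifulSoup libraries. It’s illustrative only, not the Metadata Extractor’s actual code:

    import requests
    from bs4 import BeautifulSoup

    def extract_metadata(url):
        """List the immediately extractable metadata declared by a web page."""
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")

        metadata = {"title": soup.title.string if soup.title else None}

        # <meta> tags carry either name="..." (description, keywords, ...)
        # or property="..." (e.g. Open Graph tags such as og:title).
        for tag in soup.find_all("meta"):
            key = tag.get("name") or tag.get("property")
            if key and tag.get("content"):
                metadata[key] = tag["content"]

        # Any syndication feeds the page advertises in its <head>.
        metadata["feeds"] = [
            link.get("href")
            for link in soup.find_all("link")
            if link.get("type") in ("application/rss+xml", "application/atom+xml")
        ]
        return metadata

    print(extract_metadata("https://www.bbc.co.uk/news"))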

We logged metadata from 30 publishers in a spreadsheet

Using the Metadator™, we logged all the different types of metadata per published page in a spreadsheet. The spreadsheet did a good job of quickly showing us which metadata elements were most consistent. 

What we learned 

Our goal was “to understand the type and quality of metadata openly ‘shared’ by publishers” - that is, the structured information published on the web about and alongside articles.

Essentially, we audited two sources of metadata:

  • Metadata associated with articles in the publisher’s syndication feeds (using formats such as RSS and Atom, which for the sake of simplicity we’ll call ‘RSS feeds’)

  • Metadata found in HTML markup

In terms of HTML markup, we mostly focused on explicit metadata such as title, keywords, description etc., which are usually automatically added by a content management or publishing system to help improve discoverability of the article via search engines.

The web is a messy and inconsistent place

We expected metadata across feeds to be pretty consistent in structure - we assumed that publishers would follow a clear syndication standard so that aggregators, and other similar software, could easily share their articles. In practice, this was only partly true. 

Some fields in the syndication formats returned by the feeds, such as title, publication date, description and author - as well as keywords, to some extent - were quite consistent.
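
These are the fields a standard feed parser hands back more or less directly. As a rough illustration (not part of our pipeline), here’s how they typically come back from the Python feedparser library, using a BBC news feed purely as an example:

    import feedparser

    # Illustrative only: any RSS/Atom feed URL would do here.
    feed = feedparser.parse("https://feeds.bbci.co.uk/news/rss.xml")

    for entry in feed.entries[:5]:
        print(entry.get("title"))      # article title
        print(entry.get("published"))  # publication date
        print(entry.get("summary"))    # description
        print(entry.get("author"))     # author, where present
        print([tag.term for tag in entry.get("tags", [])])  # keywords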

Other metadata fields were, however, very inconsistent:  

  • Article description fields ranged from just a few words to the contents of the full article, usually scraped from the first few paragraphs of the text. Very occasionally, a custom-written description was provided.

  • Publishing date suffered from misconfigured clocks and careless administrators. Dates ranged from 1970 to a timeline more akin to human occupation of Mars.

  • Fields such as author were often blank.

  • The keywords field, while often filled in, used a completely different vocabulary and granularity from publisher to publisher, making the keywords useless, even in aggregate.

The above observations also apply to the ‘meta tags’ in web page HTML code, largely because they’re ‘filled in’ by the same content management systems.

There is, however, one set of structured information that’s reasonably well maintained and consistent across publishers: social media.

Social media is the number one incentive for metadata consistency

The Open Graph protocol (OGP) is the most consistent metadata element across publishers. Here’s why:

“The Open Graph protocol enables any web page to become a rich object in a social graph. For instance, this is used on Facebook to allow any web page to have the same functionality as any other object on Facebook.”

The same is true for Twitter, whose ‘card’ markup works in much the same way.

The OPG tags <http://ogp.me/ns#title>, <http://ogp.me/ns#site_name> and <http://ogp.me/ns#description> are particularly well used. 

In short, publishers, content strategists, platform developers - anyone who cares about content metadata - will only decide to support and maintain metadata when there’s a really decent incentive, and social media is it. 

Build your own (aggregator) or go home

We wanted to see if we could make use of innate metadata in accurately extracting information about topic, entities, language and tone of online content. The answer was no. 

Even a conservative audit of web content metadata shows that whilst publishers are fairly consistent when it comes to their own metadata habits, on aggregate, metadata across publishers is very inconsistent.

We did make efforts to standardise and harmonise the data, but before long concluded that the reality of web publishing is messy, and, at least currently, the only way of building a metadata model - never mind a detailed metadata model - around web content is either to rely on one of the big aggregators (e.g. search engines like Google) or to build our own.

 

And so, we started exploring how much metadata we could reliably derive, retrieve and infer from web content. We knew we could derive a lot from raw web page text, but once we’d done that, what could we do with it?

 

Watch out for the next blog post in this series: How algorithms could facilitate editorial decisions in the future.



