There is no “metadata”

For a while I’ve been avoiding the term metadata, for a few reasons. I’ve had conversations with people about why, so I thought I’d jot the thoughts down here.

First of all, the main reason I stopped using the term is that it means too many different things. Wikipedia recognises metadata as an ambiguous term:

The term metadata is an ambiguous term which is used for two fundamentally different concepts (types). Although the expression “data about data” is often used, it does not apply to both in the same way. Structural metadata, the design and specification of data structures, cannot be about data, because at design time the application contains no data. In this case the correct description would be “data about the containers of data”. Descriptive metadata, on the other hand, is about individual instances of application data, the data content. In this case, a useful description (resulting in a disambiguating neologism) would be “data about data content” or “content about content” thus metacontent. Descriptive, Guide and the National Information Standards Organization concept of administrative metadata are all subtypes of metacontent.

and even within the world of descriptive metadata the term is used in many different ways.

I have always been able to find a better, more accurate and more consistent term, such as catalogue, provenance, audit, licensing and so on. I haven’t yet come across a situation where a more specific term hasn’t helped everyone understand the data better.

Data is just descriptions of things, and if you say what aspects of a thing you are describing then everyone gets a better sense of what they might do with it. Once we realise that data is just descriptions of things, written in a consistent form to allow for analysis, we can see the next couple of reasons to stop using metadata.

Meta is a relative term. Ralph Swick of W3C is quoted as saying:

What’s metadata to you, is someone else’s fundamental data.

That is to say, whether you consider something meta or not depends entirely on your context and the problem you’re trying to solve. Often several people in the same room will see this differently.

If we combine that thought with the more specific naming of our data then we get the ability to think about descriptions of descriptions of descriptions. Which brings me on to something else I observe. By thinking in terms of data and metadata we talk, and think, in a vocabulary limited to two layers. Working with Big Data and Graphs I’ve learnt that’s not enough.

Taking the example of data about TV programming from today’s RedBee post, we could say:

  1. The Mentalist is a TV Programme
  2. The Mentalist is licensed to Channel 5 for broadcast in the UK
  3. The Mentalist will be shown at 21.00 on Thursday 12 April 2012

Statement 2 in that list is licensing data; statement 3 is schedule data. This all comes under the heading of descriptive metadata. Now, RedBee are a commercial organisation who put constraints on the use of their data, so we also need to be able to say things like:

  • Statements 1, 2 and 3 are licensed to BBC for competitor analysis

This statement is also licensing data, about the metadata… So what is it? Descriptive metametadata?
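To make that layering concrete, here is a minimal sketch in Python (all names and identifiers are hypothetical) of how the statements above, and statements about those statements, can be held in exactly the same structure:

    # Each statement gets its own identifier, so licensing statements can point
    # at programme statements in the same way programme statements point at the
    # programme. "Data about data" is just more data.
    statements = {
        "s1": {"subject": "The Mentalist", "predicate": "is a",        "object": "TV Programme"},
        "s2": {"subject": "The Mentalist", "predicate": "licensed to", "object": "Channel 5 (UK broadcast)"},
        "s3": {"subject": "The Mentalist", "predicate": "shown at",    "object": "2012-04-12T21:00"},
        # Licensing data about the three statements above -- same shape, one level up.
        "s4": {"subject": ["s1", "s2", "s3"], "predicate": "licensed to",
               "object": "BBC (competitor analysis)"},
    }

    def describe(thing):
        """Return every statement that describes the given thing (programme or statement)."""
        return [s for s in statements.values()
                if s["subject"] == thing
                or (isinstance(s["subject"], list) and thing in s["subject"])]

    print(describe("The Mentalist"))  # descriptions of the programme
    print(describe("s2"))             # a description of a description

Nothing about s4 is structurally special; it only looks “meta” when you choose to stand at the level of the programme.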

Data about data is not a special case. Data is just descriptions of things and remains so whether the things being described are people, places, TV programmes or other data.

That’s why I try to replace the term metadata with something more useful whenever I can.

NESTA Birmingham

Friday afternoon was an interesting few hours: Simon Whitehouse of Digital Birmingham had organised an event for anyone interested in putting in a bid for the NESTA Make It Local competition.

I got to meet Hadley Beeman who has been putting together some really exciting ideas on crowdsourcing data conversion for the public data that’s been released recently — hoping to help get the data out of Excel and other tabular formats and into something more flexible.

The NESTA competition is focussed on bringing together local government and local digital media businesses; bids have to be led by a local authority, must use local firms for implementation and must use previously unreleased data. As Simon pointed out, that puts those who have already released data at a disadvantage to those who haven’t, though helping those who haven’t started releasing data to get going can’t be a bad thing.

Talking with others there brought out some great ideas.

Government Data, Openness and Making Money

Over on the UK Government Data Developers group there’s been a great discussion about openness, innovation and how Government makes money from its data; and, of course, whether it should make money at all. I can’t link to the discussion as the group is closed – sign up, it’s a great group.

Philosophically there’s always the stance that Government data has already been paid for by the public through general taxation.

Tim Berners-Lee even says so in his guest piece for Times Online.

This is data that has already been collected and paid for by the taxpayer, and the internet allows it to be distributed much more cheaply than before. Governments can unlock its value by simply letting people use it.

While that’s true, the role of Government is to maximise the return we get on our taxes, so if more money can be made from the assets we have then surely we should make it.

This is where the discussion breaks off into various arguments as to where on the spectrum the licensing of Government data should sit, and how open to re-use it should be.

The discussion covers notions of Copyleft licensing, attribution, commercial and non-commercial use as well as models of innovation.

What I always come back to is the notion that to make money you have to have something that is not “open”, a scarce resource. I have a blog post on that in the context of software and the web that has been drafted but unfinished for some time, so I’m coming at this with some existing thinking.

To make money something has to be closed.

In the case of creative works, the thing that is closed is the right to produce copies (closed through Copyright law). An author makes money by selling that right (or a limited subset of it) to a publisher who makes money from exploiting the right to copy. The publisher has exclusivity.

In the case of open source software companies the dominant model is support and consultancy. They make money by exploiting the specialist knowledge they have in their heads – a careful balance exists for companies doing this between making the product great and needing the support revenue. This balance leads to other monetisation strategies, like using the closed nature of being the only place to go for that software to sell default slots in the software (think search boxes), or advertising.

In the case of closed-source commercial software it is the code, the product itself, that remains closed.

Commercial organisations with data assets have to keep the data closed in order to make money. The Government, however, does not. The Government can give the data away for free because it has something else that is closed – the UK economy. To be a part of the (legitimate) UK economy you have to pay taxes, giving the UK a 20% to 40% share of all profits.

If people find ways to make money using Government data, those taxes dwarf any potential licensing fee – can you imagine a commercial data provider asking for up to 40% of a company’s profit as the cost of a data licence?
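As a back-of-the-envelope illustration (the figures below are purely hypothetical, just to show the orders of magnitude involved):

    # Hypothetical numbers only: compare the tax take on profits made with freely
    # available government data against a plausible annual data licence fee.
    profit = 250_000        # a company's annual profit built on the data (hypothetical)
    tax_rate = 0.28         # somewhere in the 20%-40% band mentioned above
    licence_fee = 10_000    # what a commercial provider might charge per year (hypothetical)

    print(f"Tax on profits: £{profit * tax_rate:,.0f}")   # £70,000
    print(f"Licence fee:    £{licence_fee:,.0f}")         # £10,000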

This is why it makes sense for the Government to make data available with as few restrictions as possible – ultimately that means Public Domain.

That seems to be the direction the mailing list is heading thanks to some great contributors. If open data, government data and innovation interest you then sign up and join in.

Schneier on Security: A Taxonomy of Social Networking Data

A Taxonomy of Social Networking Data

At the Internet Governance Forum in Sharm El Sheikh this week, there was a conversation on social networking data. Someone made the point that there are several different types of data, and it would be useful to separate them. This is my taxonomy of social networking data.

from Schneier on Security: A Taxonomy of Social Networking Data.

Follow the link for a useful breakdown of data in any community site or service.

Ito World: Visualising Transport Data for Data.gov.uk

It can be hard to make meaningful information from huge amounts of data, a graph and a table doesn’t always communicate all it should do. We have been working hard on technology to visualise big datasets into compelling stories that humans can understand. We were really pleased with what we came up with in just one and a half days, see for yourself

from Ito World: Visualising Transport Data for Data.gov.uk.

Nice work on visualising traffic data.

Putting Government Data online – Design Issues

Government data is being put online to increase accountability, contribute valuable information about the world, and to enable government, the country, and the world to function more efficiently. All of these purposes are served by putting the information on the Web as Linked Data.

Start with the “low-hanging fruit”. Whatever else, the raw data should be made available as soon as possible. Preferably, it should be put up as Linked Data. As a third priority, it should be linked to other sources. As a lower priority, nice user interfaces should be made to it — if interested communities outside government have not already done it.

The Linked Data technology, unlike any other technology, allows any data communication to be composed of many mixed vocabularies. Each vocabulary is from a community, be it international, national, state or local; or specific to an industry sector. This optimizes the usual trade-off between the expense and difficulty of getting wide agreement, and the practicality of working in a smaller community. Effort toward interoperability can be spent where most needed, making the evolution with time smoother and more productive.

from Tim Berners-Lee Putting Government Data online – Design Issues.
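The point about mixed vocabularies is easy to see in practice. Here is a minimal sketch using the Python rdflib library, describing one hypothetical dataset with terms drawn from Dublin Core, FOAF and a made-up local-government vocabulary, all in a single graph:

    # One description, several vocabularies: widely agreed terms (Dublin Core, FOAF)
    # sit alongside a sector-specific vocabulary in the same graph.
    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import DCTERMS, FOAF, RDF

    LOCALGOV = Namespace("http://example.gov.uk/def/")  # hypothetical local vocabulary

    g = Graph()
    g.bind("dcterms", DCTERMS)
    g.bind("foaf", FOAF)
    g.bind("localgov", LOCALGOV)

    dataset = URIRef("http://example.gov.uk/data/road-maintenance")
    council = URIRef("http://example.gov.uk/id/council")

    g.add((dataset, RDF.type, LOCALGOV.MaintenanceSchedule))               # sector-specific term
    g.add((dataset, DCTERMS.title, Literal("Road maintenance schedule")))  # widely agreed term
    g.add((dataset, DCTERMS.publisher, council))
    g.add((council, FOAF.name, Literal("Example City Council")))

    print(g.serialize(format="turtle"))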

Sir Tim Berners-Lee to advise the Government on public information delivery – PublicTechnology.net

From: Sir Tim Berners-Lee to advise the Government on public information delivery – PublicTechnology.net

The Prime Minister has announced the appointment of the man credited with inventing the World Wide Web, Sir Tim Berners-Lee as expert adviser on public information delivery. The announcement was part of a statement on constitutional reform made in the House of Commons this afternoon.

Sir Tim Berners-Lee, who is currently director of the World Wide Web Consortium, which oversees the web’s continued development, will head a panel of experts who will advise the Minister for the Cabinet Office on how government can best use the internet to make non-personal public data as widely available as possible.

He will oversee the work to create a single online point of access for government held public data and develop proposals to extend access to data from the wider public sector, including selecting and implementing common standards. He will also help drive the use of the internet to improve government consultation processes.

TimBL talked about this in his TED 2009 talk.

This is fantastic news, of course. Ambitious timescales, following the lead of the Obama administration, opening up government data for re-use as well as public oversight. All very good things.

The technical challenges in doing this will be very interesting. First off, the service will undoubtedly be Linked Data – the pattern of the Semantic Web or Web of Data. TimBL has been describing the efforts of the Linked Open Data community as “the web done right” for some time now. Linked data is also the approach taken by the US administration and is really starting to gather pace, just like the early days of the document web. That will be interesting to see, as it’s a different discipline to developing a basic HTML site, with a different set of balances and trade-offs in the data modelling, granularity, URI design and so on.

Second up will be scaling to meet the traffic demand. As both a high-profile linked data service and UK government data, it will be in high demand from day one. Coping with peak traffic loads is not technically difficult as long as someone has their eye on that ball from the start. It’s likely that demand for this data will be global, at least from those exploring what has been published, so traffic could get very high indeed. One aspect that might make this easier is that the service will almost certainly be read-only for the foreseeable future, and that allows far more flexibility (and simplicity) in the approach to scaling.
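A minimal sketch of why read-only helps, using a hypothetical Flask endpoint rather than whatever stack is actually chosen: every response can carry long-lived cache headers, so once a reverse proxy or CDN sits in front, most requests never reach the application at all.

    # Hypothetical read-only endpoint: aggressive caching is safe because the
    # data never changes in response to user requests.
    from flask import Flask, jsonify, make_response

    app = Flask(__name__)

    # Stand-in data; a real service would query a data store.
    SCHOOLS = {"123": {"name": "Example Primary School", "opened": "1974"}}

    @app.route("/id/school/<school_id>")
    def school(school_id):
        data = SCHOOLS.get(school_id)
        if data is None:
            return make_response(jsonify({"error": "not found"}), 404)
        resp = make_response(jsonify(data))
        # Read-only data: let intermediaries and clients cache for a day.
        resp.headers["Cache-Control"] = "public, max-age=86400"
        return resp

    if __name__ == "__main__":
        app.run()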

Talking of it being read-only… Being a high-profile data source there will need to be a focus on securing it, not to prevent access, but to prevent unauthorised changes. Given the current atmosphere surrounding MPs’ expense claims and the level of voting in the recent European parliament elections, it seems obvious that this will be a target for disgruntled and technically adept individuals both here and abroad. The read-only nature of the service helps make this easier, as does the linked data approach, which is the same in many security respects as the web of documents we have today – that is, securing it is well understood.

Definitely a project to watch closely.

[Disclosure – I work for Talis, a software company that offers a semantic web platform for doing this kind of publishing]

Panlibus » Blog Archive » Library of Congress launch Linked Data Subject Headings

Agree with this summary from Richard:

On the surface, to those not yet bought in to the potential of Linked Data, and especially Linked Open Data, this may seem like an interesting but not necessarily massive leap forward. I believe that what underpins the fairly simple functional user interface they provide will gradually become core to bibliographic data becoming a first-class citizen in the web of data.

Overnight this uri ‘http://id.loc.gov/authorities/sh85042531’ has now become the globally available, machine and human readable, reliable source for the description for the subject heading of ‘Elephants’ containing links to its related terms (in a way that both machines and humans can navigate). This means that system developers and integrators can rely upon that link to represent a concept, not necessarily the way they want to [locally] describe it. This should facilitate the ability for disparate systems and services to simply share concepts and therefore understanding – one of the basic principles behind the Semantic Web.

from Panlibus » Blog Archive » Library of Congress launch Linked Data Subject Headings.

Great to see LoC doing this stuff and getting it out there.
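To see what “machine and human readable” means here, a small sketch using the Python requests library: the same subject-heading URI, asked for HTML and for RDF via content negotiation (this assumes id.loc.gov honours an RDF Accept header, as it did at launch):

    # Resolve the 'Elephants' subject heading twice: once for humans, once for machines.
    import requests

    uri = "http://id.loc.gov/authorities/sh85042531"

    html = requests.get(uri, headers={"Accept": "text/html"})
    rdf = requests.get(uri, headers={"Accept": "application/rdf+xml"})

    print(html.status_code, html.headers.get("Content-Type"))  # an HTML page about Elephants
    print(rdf.status_code, rdf.headers.get("Content-Type"))    # an RDF description of the concept
    print(rdf.text[:300])                                      # first few hundred characters of RDF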

LDOW2008: Open Data Commons

Another paper from LDOW2008 that I worked on with Tom Heath and Paul Miller. Open Data Commons is about providing clear licensing for data shared on the web. It’s not like Creative Commons, because it is for data that doesn’t qualify for Copyright protection, whereas Creative Commons relies on an underlying Copyright ownership.

Open Data Commons, A License for Open Data is predominantly a position paper explaining what’s been happening with Open Data Commons and its predecessor, the Talis Community License.