Semantic Web

Distributed, Linked Data has significant implications for Intellectual Property Rights in Data.

What P2P networks have done for distribution of digital media is phenomenal. It is possible, easy even, to get almost any TV show, movie, track or album you can think of by searching one of the many torrent sites. As fast as the media industry take down one site through legal action, another appears to take its place.

I don’t want to discuss the legal, moral or social implications of this, but discuss how the internet changes the nature of our relationship with media – and data. The internet is a great big copying machine, true enough, but it’s also a fabric that allows mass co-operation. It’s that mass peer-to-peer co-operation that makes so much content available for free; content that is published freely by its creator as well as infringing content.

Sharing of copyrighted content is always likely to be infringing on p2p networks, regardless of any tricks employed, but for data the situation may be different and the Linked Data web has real implications in this space.

Taking the Royal Mail’s Postcode Address File as my working example, because it’s been in the news recently as a result of the work done by ErnestMarples.com, I’ll attempt to show how the Linked Data web changes the nature of data publishing and intellectual property.

First, in case you’re not familiar, a quick introduction to Linked Data. In Linked Data we use http web addresses (which we call URIs) not only to refer to documents containing data but also to refer to real-world things and abstract concepts. We then combine those URIs with properties and values to make statements about the things the URIs represent. So, I might say that my postcode is referred to by the URI http://someorg.example.com/myaddress/postcode. Asking for that URI in the browser would then redirect you to a document containing data about my postcode, maybe sending you to http://someorg.example.com/myaddress/postcode.rdf if you asked for data and http://someorg.example.com/myaddress/postcode.html if you asked for a web page (that’s called content negotiation). All of that works today and organisations like the UK Government, BBC, New York Times and others are publishing data this way.
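The content negotiation step described above can be sketched in a few lines. This is only an illustration under assumptions: the media types in the table, the fallback to HTML and the `negotiate` function name are mine, not part of any particular server.

```python
# Hypothetical sketch: choose which document to redirect to for the postcode
# URI, based on the client's Accept header. Media types and the HTML fallback
# are assumptions made for illustration.
THING_URI = "http://someorg.example.com/myaddress/postcode"

REPRESENTATIONS = {
    "application/rdf+xml": THING_URI + ".rdf",
    "text/html": THING_URI + ".html",
}


def negotiate(accept_header: str) -> str:
    """Return the document URI to redirect to for a given Accept header."""
    prefs = []
    for position, entry in enumerate(accept_header.split(",")):
        parts = [p.strip() for p in entry.split(";")]
        media_type = parts[0].lower()
        q = 1.0  # quality value defaults to 1.0 when not given
        for param in parts[1:]:
            if param.startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0
        if q > 0:
            # Sort key: highest q first, then the order the client listed them
            prefs.append((-q, position, media_type))
    for _, _, media_type in sorted(prefs):
        if media_type in REPRESENTATIONS:
            return REPRESENTATIONS[media_type]
    # Nothing we recognise (e.g. "*/*"): fall back to the web page
    return REPRESENTATIONS["text/html"]


print(negotiate("application/rdf+xml"))
print(negotiate("text/html,application/xhtml+xml;q=0.9"))
```

A real server would send the chosen URI back in a redirect response rather than printing it.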

Also worth noting is the distinction between Linked Data (the technique described above) and Linked Open Data, the output of the W3C’s Linking Open Data project. This is an important distinction, as I’m talking about how commercially owned and protected databases may be disrupted by Linked Data, whereas Linked Open Data is data that is already published on the web under an Open license.

Now, Royal Mail own the Postcode Address File, and other postcode data such as geo co-ordinates. This data is covered in the UK by Copyright and Database Right (which right covers which parts is a different story), so we assume it is “owned”. The database contains more than 28 million addresses, so publishing my own postcode could not be considered an infringement in any meaningful way. Publishing the data for all the addresses within a single postcode would also be unlikely to infringe, as it’s such a small fraction of the total data.

So I might publish some data like this (the format is Turtle, a way to write down Linked Data)

<http://someorg.example.com/myaddress/postcode>
  a paf:Postcode;
  paf:correctForm "B37 7YB";
  paf:normalisedForm "b377yb";
  geo:long -1.717336;
  geo:lat 52.467971;
  paf:ordnanceSurveyCode "SP1930085600";
  paf:gridRefEast 41930;
  paf:gridRefNorth 28560;
  paf:containsAddress <http://someorg.example.com/myaddress/postcode#1>;
  paf:googleMaps <http://maps.google.co.uk/maps?hl=en&source=hp&q=B377YB&ie=UTF8&hq=&hnear=Birmingham,+West+Midlands+B377YB,+United+Kingdom&gl=uk&ei=Zs8HS_KVNNOe4QbIpITTCw&ved=0CAgQ8gEwAA&t=h&z=16>.

<http://someorg.example.com/myaddress/postcode#1>
  a paf:Address;
  paf:organisationName "Talis Group Ltd";
  paf:dependentThoroughfareName "Knight's Court";
  paf:thoroughfareName "Solihull Parkway";
  paf:dependentLocality "Birmingham Business Park";
  paf:postTown "Birmingham";
  paf:postcode <http://someorg.example.com/myaddress/postcode>.

I’ve probably made some mistakes in terms of the PAF properties as it’s a long time since I worked with PAF, but it’s clear enough to make my point with. So, I publish this file on my own website as a way of describing the office where I work. That’s not an infringement of any rights in the data and is a perfectly legitimate thing to do with the address.

As the web of Linked Data takes off, and the same schemas become commonly used for this kind of thing, we start to build a substantial copy of the original database. This time, however, the database is not on a single server as ErnestMarples.com was, but spread across the web of Linked Data. There is no single infringing organisation who can be made to take the data down again. If I were responsible for the revenue brought in from sales of PAF licenses this would be a concern, but not a major one, as the distributed nature means it can’t be queried.

The distributed nature of the web means the web itself can’t be queried, but we already know how to address that technically – we build large aggregations of the web, index them and call them search engines. That is also already happening for the Linked Data web. As with the web of documents, some people are attempting to create general purpose search engines over the data and others specialised search engines for specific areas of interest. It’s easy to see that areas of value, such as address data, are likely to attract specialist attention.

Here though, while the individual documents do not infringe, an aggregate of many of them would start to infringe. The defence of crowd-sourcing used in other contexts (such as Open Street Map) does not apply here as the PAF is not factual data – the connection between a postcode and address can only have come from one place, PAF, and is owned by Royal Mail however it got into the database.

So, with the aggregate now infringing it can be taken down through request, negotiation or due process. The obvious answer to that might be for the aggregate to hold the URIs only, not the actual values of the data. This would leave it without a useful search mechanism, however. This could be addressed by having a well-known URI structure as I used in the example data. We might have

<http://addresssearch.example.net/postcodes/B37_7YB>
  owl:sameAs <http://someorg.example.com/myaddress/postcode>.
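A well-known URI structure like this only works if everyone builds the URI from a postcode in the same way. A minimal sketch, assuming the convention is "upper-case the postcode and replace the space with an underscore" (the function name and the normalisation rule are my assumptions, inferred from the single example URI):

```python
# Hypothetical sketch of a shared convention for building lookup URIs.
BASE = "http://addresssearch.example.net/postcodes/"


def postcode_uri(postcode: str) -> str:
    """Build the lookup URI for a postcode, e.g. 'B37 7YB' -> .../B37_7YB."""
    # Upper-case, collapse any run of whitespace, join the parts with "_"
    normalised = "_".join(postcode.upper().split())
    return BASE + normalised


print(postcode_uri("B37 7YB"))
```

Anyone who knows a postcode and the convention can then go straight to the URI and follow the owl:sameAs links from there.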

This gets around the data issue, but the full list of postcodes itself may be considered infringing and they are clearly visible in the URIs. Taking them out of the URIs would leave no mechanism to go from a known postcode to data about it and addresses within it, the main use case for PAF. It doesn’t take long to make the link with other existing technology though, an area where we want to match a short string we know with an original short string, but cannot make the original short string available in clear text… Passwords.

Password storage uses one-way hashes so that the password is not available in its original form once stored, but requests to login can be matched by putting the attempted password through the same hash. Hashes are commonplace in the P2P world for a variety of functions, so are well-known and could be applied by an aggregator, or co-operative group, to solve this problem.

If I push the correct form of “B37 7YB” through MD5, I get “bdd2a7bf68119d001ebfd7f86c13a4c7”, but there is no direct way to get back from that to the original postcode. So a service that uses hashed URIs would not be publishing the postcode list in a directly useable form, but could be searched easily by anyone knowing how the URIs were structured and hashed.

<http://addresssearch.example.net/postcodes/bdd2a7bf68119d001ebfd7f86c13a4c7>
  owl:sameAs <http://someorg.example.com/myaddress/postcode>.
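The hashed lookup can be sketched as follows. This is an illustration only: it assumes the hash is taken over the exact correct-form string encoded as UTF-8, and the in-memory `index` dictionary stands in for whatever store an aggregator would really use. The digest is computed at run time rather than hard-coded.

```python
import hashlib

BASE = "http://addresssearch.example.net/postcodes/"


def hashed_postcode_uri(postcode: str) -> str:
    """Build a lookup URI from the MD5 hex digest of the postcode string."""
    # Assumption: the exact string, spaces included, is hashed as UTF-8 bytes
    digest = hashlib.md5(postcode.encode("utf-8")).hexdigest()
    return BASE + digest


# Hypothetical aggregator index: hashed URI -> URI of the published data.
# The clear-text postcode is only used to build the key and is not stored.
index = {
    hashed_postcode_uri("B37 7YB"): "http://someorg.example.com/myaddress/postcode",
}


def lookup(postcode: str):
    """Return the linked data URI for a known postcode, or None if absent."""
    return index.get(hashed_postcode_uri(postcode))


print(lookup("B37 7YB"))
```

Someone who already knows the postcode reproduces the hash and finds the entry, while the index itself only ever holds the hashed form.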

Of course, a specialist address service, advertising address lookups and making money, could still be considered as infringing by the courts regardless of the technical mechanisms employed, but what of more general aggregations or informal co-operative sites? sameAs, a site for sharing sameAs statements, already provides the infrastructure that would be needed for a service like this. The ease with which such sites can be set up and mirrored would make it hard to defend against them using the law, in the same way that torrent listing sites are difficult for the film and music industries to stop. Regardless of the technical approach and the degree to which it provides legal and practical defence, this is still publishing data in a way that is against the spirit of Copyright and Database Right.

The situation I describe above is one where many, many organisations and individuals are publishing data in a consistent form and that is likely to happen over the next few years for common data like business addresses and phone numbers, but much less likely for less mainstream data. The situation with addresses is one where it is clear there is a reason to publish your individual data other than to be part of the whole. In more contrived cases, where the only reason to publish is to contribute to a larger aggregate, the notion of fair-use for a small amount of the data may not stand up. That is, over the longer term, address data will not be crowd-sourced – people deliberately creating a dataset – but web-sourced – the data will be on the web anyway.

We can see from this small example that the kinds of data that may be vulnerable to distributed publishing in this way are wide-ranging. The Dewey Decimal Classification scheme used by libraries, telephone directories (with lookups in both directions), Gracenote’s music database and legal case numbering schemes could all be published this way. The problem, of course, is that the data has to be distributed sufficiently that no individual host can be seen as infringing. For common data this will happen naturally, but the co-ordination overhead for a group trying to do this pro-actively would be significant; though that might be solved easily by someone else thinking about how to do this.

As I see it a small group of unfunded individuals would have difficulty achieving the level of distribution necessary to be defensible. Though could 1% of a database be considered infringing? Could/Would 100 people use their existing domains to publish 1% of the PAF each? Would 200 join in for ½% each? Then, of course, there are the usual problems of trust, timeliness and accuracy associated with non-authoritative publication.

These problems notwithstanding, Linked Data has the potential to provide a global database at web-scale. Ways of querying that web of data will be invented; what I sketch out above is just one very basic approach. The change the web of data brings has huge implications for data owners and intellectual property rights in data.

Government Data, Openness and Making Money

Friday, November 20th, 2009 | Internet Social Impact, Open Data, Semantic Web | 2 Comments

Over on the UK Government Data Developers group there’s been a great discussion about openness, innovation and how Government makes money from its data; and of course if it should make money. I can’t link to the discussion as the group is closed – sign up, it’s a great group.

Philosophically there’s always the stance that Government data has already been paid for by the public through general taxation.

Tim Berners-Lee even says so in his guest piece for Times Online.

This is data that has already been collected and paid for by the taxpayer, and the internet allows it to be distributed much more cheaply than before. Governments can unlock its value by simply letting people use it.

While that’s true, the role of Government is to maximise the return we get on our taxes so if more money can be made from the assets we have then surely we should.

This is where discussion breaks off into various arguments as to where on the spectrum licensing of Government data should sit, and how open to re-use it should be.

The discussion covers notions of Copyleft licensing, attribution, commercial and non-commercial use as well as models of innovation.

What I always come back to is the notion that to make money you have to have something that is not “open”, a scarce resource. I have a blog post talking about that in the context of software and the web that’s been drafted but not finished for some time, so I’m coming at this from a point of existing thinking.

To make money something has to be closed.

In the case of creative works, the thing that is closed is the right to produce copies (closed through Copyright law). An author makes money by selling that right (or a limited subset of it) to a publisher who makes money from exploiting the right to copy. The publisher has exclusivity.

In the case of open source software companies the dominant model is support and consultancy. They make money by exploiting the specialist knowledge they have in their heads – a careful balance exists for companies doing this between making the product great and needing the support revenue. This balance leads to other monetization strategies, like using the closed nature of being the only place to go for that software to sell default slots in the software (think search boxes), or advertising.

In the case of closed-source commercial software it is the code, the product itself, that remains closed.

Commercial organisations with data assets have to keep the data closed in order to make money. The Government, however, does not. The Government can give the data away for free because it has something else that is closed – the UK economy. To be a part of the (legitimate) UK economy you have to pay taxes, giving the UK a 20% to 40% share of all profits.

If people find ways to make money using Government data those taxes dwarf any potential licensing fee – can you imagine a commercial data provider asking for up to 40% of a company’s profit as the cost of a data license?

This is why it makes sense for the Government to make data available with as few restrictions as possible – ultimately that means Public Domain.

That seems to be the direction the mailing list is heading thanks to some great contributors. If open data, government data and innovation interest you then sign up and join in.

ShelterIt – My digital think-tank: On identity

Wednesday, October 28th, 2009 | Library Tech, Semantic Web | 1 Comment

Did you notice what just happened? I used used an URI as an identifier for a subject. If you popped that URI into your browser, it will take you to WikiPedia’s article on the book and provide a lot of info there in human prose about this book, and this would make it rather easy for Bob to say that, yes indeed, that’s the same book I’ve got. So now we’ve got me and Bob agreeing that we have the same book.

from ShelterIt – My digital think-tank: On identity.

Great piece by Alexander Johannesen about the future of library data, semantic web and the difficulties of getting from here to there.

Ito World: Visualising Transport Data for Data.gov.uk

Tuesday, October 27th, 2009 | Open Data, Semantic Web | 1 Comment

It can be hard to make meaningful information from huge amounts of data, a graph and a table doesn’t always communicate all it should do. We have been working hard on technology to visualise big datasets into compelling stories that humans can understand. We were really pleased with what we came up with in just one and a half days, see for yourself

from Ito World: Visualising Transport Data for Data.gov.uk.

Nice work on visualizing traffic data.

Intensional and Extensional Sets

Wednesday, August 19th, 2009 | Ontologies, Semantic Web | No Comments

One of my collegaues called the other day and asked if we still relied on the distinction between intensional and extensional sets (really intensionally and extensionally defined sets). Yes, even more so now.

from Intensional and Extensional Sets.

If you don’t know the difference (I didn’t) then it’s worth reading.

If you use numbers…

Thursday, August 13th, 2009 | Random Thought, Semantic Web | 2 Comments

Still not sure what’s for the best, but the idea that opaque URIs are better because they’re language independent doesn’t ring true for me. A word is just as opaque as a GUID if you don’t speak the language, but for those who can read it may be far clearer and easier to work with.

Conversation with Bruce D’Arcus on Motivation for MODS Ontology « Musings

Tuesday, August 11th, 2009 | Library Tech, Ontologies, Semantic Web | No Comments

The problem from my standpoint is that MODS has some really odd, library-specific, design choices that I don’t think map very well to the wider world. A central concept like mods:name, with mods:role as a child of that, really makes no sense, and conflicts with more common modeling you see in DC, FRBR ,etc.

It’s semantics are also really loose.

So you have to ask yourself, just how linked could a MODS view in RDF really be?

from Conversation with Bruce D’Arcus on Motivation for MODS Ontology.

Excel RDF

Wednesday, July 22nd, 2009 | Semantic Web | 4 Comments

Introduction

When worlds collide sometimes things happen that can be useful. This is not one of those useful things, but a collision of two worlds nonetheless…

ExcelRDF is a proposed serialisation for RDF using the Microsoft Excel Spreadsheet format. This work was inspired by the discussions in the semantic web community about Linked Data and whether or not it mandates the use of RDF. This document is not trying to prove a point, insult anyone or come down on either side of the argument. I just noticed that it hadn’t been done and it didn’t seem too difficult. Of course, that it hadn’t been done should have been enough of a warning to me that it is not, in any sense, desirable.

Conventions used in this document

The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in RFC 2119.

Overview

When a server receives an HTTP or HTTPS request for a resource that is described in RDF and the client indicates that it is willing to accept content of type application/vnd.ms-excel the server MAY respond with a Microsoft Excel spreadsheet meeting the following conventions.

  • The spreadsheet MUST contain one or more sheets that meet the following conventions.
  • A sheet SHOULD contain zero or more rows of RDF data.
  • Column A of each non-empty row of a sheet SHOULD contain a URI indicating the Subject of a statement.
  • Column B of each non-empty row of a sheet SHOULD contain a URI indicating the Property of a statement.
  • Column C MAY contain either a URI or a literal value as the Object of a statement.
  • If Column C contains a literal value then Column D MAY contain a language identifier in accordance with IETF BCP 47.
  • If Column C contains a literal value then Column E MAY contain a type specifier indicating the type of the literal value.
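The column conventions above can be sketched as a simple row structure. This is a minimal illustration, not a reference implementation: the `Row` type and helper names are hypothetical, the sample statements are made up for the example rather than taken from the attached file, and actually writing the cells into an Excel sheet is left to whichever spreadsheet library you prefer.

```python
from typing import List, NamedTuple, Optional


class Row(NamedTuple):
    subject: str                    # Column A: URI of the Subject
    prop: str                       # Column B: URI of the Property
    obj: str                        # Column C: URI or literal value (Object)
    lang: Optional[str] = None      # Column D: BCP 47 language tag, literals only
    datatype: Optional[str] = None  # Column E: type specifier, literals only


def uri_row(subject: str, prop: str, obj_uri: str) -> Row:
    """A statement whose Object is a URI; Columns D and E stay empty."""
    return Row(subject, prop, obj_uri)


def literal_row(subject: str, prop: str, value: str,
                lang: Optional[str] = None,
                datatype: Optional[str] = None) -> Row:
    """A statement whose Object is a literal, optionally tagged or typed."""
    return Row(subject, prop, value, lang, datatype)


def as_cells(row: Row) -> List[str]:
    """Flatten a row into five cell values (A to E), empty string for blanks."""
    return [cell if cell is not None else "" for cell in row]


# Illustrative statements only; the class URI is hypothetical
airport = "http://dbpedia.org/resource/Annette_Island_Airport"
sheet = [
    uri_row(airport,
            "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
            "http://example.org/Airport"),
    literal_row(airport,
                "http://www.w3.org/2000/01/rdf-schema#label",
                "Annette Island Airport", lang="en"),
]

for row in sheet:
    print(as_cells(row))
```

Note that the conventions do not say how a reader should tell a URI in Column C from a literal that happens to look like one, so the sketch leaves that question open too.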

Example

The attached example Microsoft Excel spreadsheet contains RDF from the dbpedia project describing Annette Island Airport using the conventions described above.

ExcelRDF Example File

Use Cases

ExcelRDF may be useful where it is desirable to produce charts showing characteristics of a dataset, such as the relative distribution of types within a dataset. Perhaps analysing the count of particular properties. I can think of no obvious way to assess graph characteristics such as linkiness, but you could do things like word counts in literals, or working out how much of the literal data is in French.
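The same kind of counting is easy to sketch outside a spreadsheet. The rows below are invented sample data in the five-column layout (subject, property, object, language, type), and "share of French" is taken here to mean the share of language-tagged literals tagged `fr`, which is my assumption about what you would want to measure.

```python
from collections import Counter

LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Invented sample rows: (subject, property, object, language, type)
rows = [
    ("http://example.org/thing/1", TYPE, "http://example.org/Airport", "", ""),
    ("http://example.org/thing/1", LABEL, "Airport", "en", ""),
    ("http://example.org/thing/1", LABEL, "Aéroport", "fr", ""),
]


def property_counts(rows):
    """Count how often each property (Column B) is used."""
    return Counter(row[1] for row in rows)


def language_share(rows, lang):
    """Fraction of language-tagged literals (Column D set) tagged with lang."""
    tagged = [row for row in rows if row[3]]
    if not tagged:
        return 0.0
    return sum(1 for row in tagged if row[3] == lang) / len(tagged)


print(property_counts(rows).most_common())
print(language_share(rows, "fr"))
```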

ExcelRDF may be useful where a specific contract, policy or agreement means that the data must be delivered as an Excel spreadsheet while the underlying data is more useful in RDF.

ExcelRDF may be useful if you wish to be deliberately obtuse.

What else? « Web of Data

Tuesday, July 21st, 2009 | Semantic Web | No Comments

Great explanation from Dan Brickley:

The non-RDF bits of the data Web are – roughly – going to be the leaves on the tree. The bit that links it all together will be, as you say, the typed links, loose structuring and so on that come with RDF. This is also roughly analagous to the HTML Web: you find JPEGs, WAVs, flash files and so on linked in from the HTML Web, but the thing that hangs it all together isn’t flash or audio files, it’s the linky extensible format: HTML. For data, we’ll see more RDF than HTML (or RDFa bridging the two). But we needn’t panic if people put non-RDF data up online…. it’s still better than nothing. And as the LOD scene has shown, it can often easily be processed and republished by others. People worry too much! :)

from What else? « Web of Data.

Paul Miller is right… and so is Ian Davis

Monday, July 20th, 2009 | Semantic Web | 8 Comments

Paul Miller, a good friend and ex-colleague, has been having a tough time arguing that perhaps Linked Data doesn’t need RDF. Don’t misunderstand that, he thinks RDF is a Good Thing and Best Practice for Linked Data. But he thinks a dogmatic stance is unhelpful.

The problem, I contend, comes when well-meaning and knowledgeable advocates of both Linked Data and RDF conflate the two and infer, imply or assert that ‘Linked Data’ can only be Linked Data if expressed in RDF.

This dogmatism makes me deeply uncomfortable, and I find myself unable to agree with the underlying premise.

In the twitter stream that Paul links to there is some comment reminding people that RDF can take many forms, not just RDF/XML.

kidehen: @andypowe11 re. #rdf, it’s the data model for #linkeddata based #metadata. Remember #rdf != RDF/XML, no escaping RDF model re. #linkeddata.

Ian Davis (my boss) took a strong stance saying that if things weren’t RDF then they weren’t linked data. Perhaps the very thing Paul sees as a dogmatic stance. Ironic as Ian is far from dogmatic. But Ian is defending the term Linked Data, not saying that’s the only way to publish data on the web…

TallTed: @iand “I think LD better for many cases, but there are times i’d rather hv a spreadsheet.” What? Can a spreadsheet not hold #LinkedData?

Well, it seems to me both Paul and Ian are right to a strong degree and are essentially arguing over only one thing – the meaning of the term Linked Data.

Paul quotes Tim Berners-Lee’s design note on Linked Data:

1. Use URIs as names for things

2. Use HTTP URIs so that people can look up those names

3. When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL)

4. Include links to other URIs. so that they can discover more things.

The emphasis is Paul’s. I would emphasise a different point:

4. Include links to other URIs. so that they can discover more things.

And in point four lies the reason that Ian is saying a spreadsheet isn’t Linked Data, even if it’s on the web and even if it’s linked to. The only standard for describing how one resource relates to others using URIs is RDF. Sure, you can put URIs into a spreadsheet, but there is no standard interpretation of what the sheets, rows and columns mean. Sure, you can put URIs into a CSV file, but again, there is no standard interpretation of what the fields mean.

The end result of that is data published on the web that can be linked to but not from.

At this early time, though, Paul argues that what we really want is to get more and more data published and open. We all agree on that, I know. Ian does for sure, he runs Data Incubator for exactly that reason – well, that and helping show those publishing spreadsheets and CSV why they should move to RDF and Linked Data.

In the comments on Paul’s post Justin (another senior manager at Talis) says:

Yes the same mistake was made with the rise of the web.

Once you had URIs and HTTP you already had plain text which is a perfectly good way to encode content. By adopting the STANDARD convention of HTML, all sort of existing text based formats with their various mark ups were locked out. That locked out a lot of content that already existed and required anyone who wanted to play to convert existing content into a html format.

Of course it did have the small side effect that to consume web content you only needed a browser that understood one convention i.e. html.

The same is true of RDF. XML is the equivalent of ascii in this regard.

And that’s the point. XML is the equivalent of ASCII, as is a spreadsheet or a CSV file, not because they’re simple, but because they have no mechanism for embedding the relationships and links necessary to link out from your data. Yes, they can contain URIs and clients can decide to make those into links, but there is no way to describe the meaning.

I agree with both sides of this argument – If it isn’t RDF then it isn’t Linked Data, but I wouldn’t keep pushing that point if someone was willing to publish data yet unable or unwilling to publish RDF (in any of its many forms).
