What P2P networks have done for distribution of digital media is phenomenal. It is possible, easy even, to get almost any TV show, movie, track or album you can think of by searching one of the many torrent sites. As fast as the media industry take down one site through legal action another has appeared to take its place.
I don’t want to discuss the legal, moral or social implications of this, but discuss how the internet changes the nature of our relationship with media – and data. The internet is a great big copying machine, true enough, but it’s also a fabric that allows mass co-operation. It’s that mass peer-to-peer co-operation that makes so much content available for free; content that is published freely by its creator as well as infringing content.
Sharing of copyrighted content is always likely to be infringing on p2p networks, regardless of any tricks employed, but for data the situation may be different and the Linked Data web has real implications in this space.
Taking the Royal Mail’s Postcode Address File as my working example, because it’s been in the news recently as a result of the work done by ErnestMarples.com, I’ll attempt to show how the Linked Data web changes the nature of data publishing and intellectual property.
First, in case you’re not familiar, a quick introduction to Linked Data. In Linked Data we use http web addresses (which we call URIs) not only to refer to documents containing data but also to refer to real-world things and abstract concepts. We then combine those URIs with properties and values to make statements about the things the URIs represent. So, I might say that my postcode is referred to by the URI http://someorg.example.com/myaddress/postcode. Asking for that URI in the browser would then redirect you to a document containing data about my postcode, maybe sending you to http://someorg.example.com/myaddress/postcode.rdf if you asked for data and http://someorg.example.com/myaddress/postcode.html if you asked for a web page (that’s called content negotiation). All of that works today and organisations like the UK Government, BBC, New York Times and others are publishing data this way.
Also worth noting is the distinction between Linked Data (the technique described above) and Linked Open Data, the output of the W3C’s Linking Open Data project. An important distinction as I’m talking about how commercially owned and protected databases may be disrupted by Linked Data, whereas Linked Open Data is data that is already published on the web under an Open license.
Now, Royal Mail own the Postcode Address File, and other postcode data such as geo co-ordinates. They are covered in the UK under Copyright and Database Right (which for which bits is a different story) so we assume it is “owned”. The database contains more than 28 million postcodes, so publishing my own postcode could not be considered an infringement in any meaningful way, publishing the data for all the addresses within a single postcode would also be unlikely to infringe as it’s such a small fraction of the total data.
So I might publish some data like this (the format is Turtle, a way to write down Linked Data)
paf:correctForm "B37 7YB";
paf:organisationName "Talis Group Ltd";
paf:dependentThoroughfareName "Knight's Court";
paf:thoroughfareName "Solihull Parkway";
paf:dependentLocality "Birmingham Business Park";
I’ve probably made some mistakes in terms of the PAF properties as it’s a long time since I worked with PAF, but it’s clear enough to make my point with. So, I publish this file on my own website as a way of describing the office where I work. That’s not an infringement of any rights in the data and perfectly legitimate thing to do with the address.
As the web of Linked Data takes off, and the same schema become commonly used for this kind of thing, we start to build a substantial copy of the original database. This time, however, the database is not on a single server as ErnestMarples.com was, but spread across the web of Linked Data. There is no single infringing organisation who can be made to take the data down again. If I were responsible for the revenue brought in from sales of PAF licenses this would be a concern, but not major as the distributed nature means it can’t be queried.
The distributed nature of the web means the web itself can’t be queried, but we already know how to address that technically – we built large aggregations of the web, index them and call them search engines. That is also already happening for the Linked Data web. As with the web of documents, some people are attempting to create general purpose search engines over the data and others specialised search engines for specific areas of interest. It’s easy to see that areas of value, such as address data, are likely to attract specialist attention.
Here though, while the individual documents do not infringe, an aggregate of many of them would start to infringe. The defence of crowd-sourcing used in other contexts (such as Open Street Map) does not apply here as the PAF is not factual data – the connection between a postcode and address can only have come from one place, PAF, and is owned by Royal Mail however it got into the database.
So, with the aggregate now infringing it can be taken down through request, negotiation or due process. The obvious answer to that might be for the aggregate to hold the URIs only, not the actual values of the data. This would leave it without a useful search mechanism, however. This could be addressed by having a well-known URI structure as I used in the example data. We might have
This gets around the data issue, but the full list of postcodes itself may be considered infringing and they are clearly visible in the URIs. Taking them out of the URIs would leave no mechanism to go from a known postcode to data about it and addresses within it, the main use case for PAF. It doesn’t take long to make the link with other existing technology though, an area where we want to match a short string we know with an original short string, but cannot make the original short string available in clear text… Passwords.
Password storage uses one-way hashes so that the password is not available in its original form once stored, but requests to login can be matched by putting the attempted password through the same hash. Hashes are commonplace in the P2P world for a variety of functions, so are well-known and could be applied by an aggregator, or co-operative group, to solve this problem.
If I push the correct form of “B37 7YB” through MD5, I get “bdd2a7bf68119d001ebfd7f86c13a4c7”, but there is no way to get back from that to the original postcode. So a service that uses hashed URIs would not be publishing the postcode list in a directly useable form, but could be searched easily by anyone knowing how the URIs were structured and hashed.
Of course, a specialist address service, advertising address lookups and making money could still be considered as infringing by the courts regardless of the technical mechanisms employed, but what of more general aggregations or informal co-operative sites? sameAs, a site for sharing sameAs statements, already provides the infrastructure that would be needed for a service like this and the ease with which sites that do this can be setup and mirrored would make it hard to defend against using the law in the same way that torrent listing sites are difficult for the film and music industries to stop. Regardless of the technical approach and the degree to which that provide legal and practical defence, this is still publishing data in a way that is against the spirit of Copyright and Database Right.
The situation I describe above is one where many, many organisations and individuals are publishing data in a consistent form and that is likely to happen over the next few years for common data like business addresses and phone numbers, but much less likely for less mainstream data. The situation with addresses is one where it is clear there is a reason to publish your individual data other than to be part of the whole, in more contrived cases where the only reason to publish is to contribute to a larger aggregate the notion of fair-use for a small amount of the data may not stand up. That is, over the longer term, address data will not be crowd-sourced – people deliberately creating a dataset – but web-sourced – the data will be on the web anyway.
We can see from this small example that the kinds of data that may be vulnerable to distributed publishing in this way are wide-ranging. The Dewey Decimal Classification scheme used by libraries, Telephone directories (with lookups in both directions), Gracenote’s music database, Legal case numbering schemes, could all be published this way. The problem, of course, is that the data has to be distributed sufficiently that no individual host can be seen as infringing. For common data this will happen naturally, but the co-ordination overhead for a group trying to do this pro-actively would be significant; though that might be solved easily by someone else thinking about how to do this.
As I see it a small group of unfunded individuals would have difficulty achieving the level of distribution necessary to be defensible. Though could 1% of a database be considered infringing? Could/Would 100 people use their existing domains to publish 1% of the PAF each? Would 200 join in for ½% each? Then, of course, there are the usual problems of trust, timeliness and accuracy associated with non-authoritative publication.
These problems not withstanding, Linked Data has the potential to provide a global database at web-scale. Ways of querying that web of data will be invented, what I sketch out above is just one very basic approach. The change the web of data brings has huge implications for data owners and intellectual property rights in data.