RDF, Big Data and The Semantic Web

I’ve been meaning to write this post for a little while, but things have been busy. So, with this afternoon free I figured I’d write it now.

I’ve spent the last 7 years working intensively with data. Mostly not with RDBMSs, but with different Big Data and Linked Data tools. Over the past year things have changed enormously.

The Semantic Web

The Semantic Web has been talked about for a long time now. In fact, I often advise people to search for Linked Data rather than Semantic Web, as the results are vastly more useful in a practical context. The Semantic Web has been a rather unfortunately academic endeavour that has been very hard for many developers to get into. In contrast, Linked Data has seen explosive growth over the past five years. It hasn’t gone mainstream, though.

What does show signs of going mainstream is the schema.org initiative. This creates a positive feedback loop between sites putting structured data into their pages and search engines giving those sites more and better leads as a result.

Much has been said about Microdata killing RDF, blah blah, but that’s not important. What is important is that publishing machine-understandable data on the web is going mainstream.

As an aside, as Microdata extends to solve the problems it currently has (global identifiers and meaningful links) it becomes just another way to write down the RDF model anyway. RDF is an abstract model, not a data format, and at the moment Microdata is a simplified subset of that model.

Big Data and NoSQL

In the meantime another data meme has also grown enormously. In fact, it has dwarfed Linked Data in the attention it has captured. That trend is Big Data and NoSQL.

In Planning for Big Data, Edd talks about the three Vs:

To clarify matters, the three Vs of volume, velocity and variety are commonly used to characterize different aspects of big data. They’re a helpful lens through which to view and understand the nature of the data and the software platforms available to exploit them. Most probably you will contend with each of the Vs to one degree or another.

Most Big Data projects are really focussed on volume: they have large quantities, terabytes or petabytes, of uniform data. Often this data is very simple in structure, such as tweets. Fewer projects are focussed on velocity, being able to handle data coming in quickly, and even fewer on variety, having unknown or widely varied data.

You can see how the Hadoop toolset is tuned to this, and also how the NoSQL communities focus mostly on denormalisation of data. This is a good way to focus resources if you have large volumes of relatively simple, highly uniform data and a specific use-case or set of queries.

Apart from Neo4J, which is the odd one out in the Big Data community, this is the approach.

RDF

So, while we wait for the semantic web to evolve, what is RDF good for today?

That third V of the Big Data puzzle is where I’ve been helping people use graphs of data (and that’s what RDF is, a graph model). Graphs are great where you have a variety of data that you want to link up. Especially if you want to extend the data often and if you want to extend the data programmatically — i.e. you don’t want to commit to a complete, constraining schema up-front.
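
To make that concrete, here’s a minimal sketch using Python’s rdflib, with a made-up vocabulary and identifiers (nothing here comes from a real dataset): new facts are simply new triples, with no schema migration required.

from rdflib import Graph, Literal, Namespace, URIRef

# Hypothetical namespace and resource, purely for illustration
EX = Namespace("http://data.example.com/terms/")
thing = URIRef("http://data.example.com/things/42")

g = Graph()
g.add((thing, EX.name, Literal("Widget")))  # the data we started with

# Later, new facts arrive from another source; there is no schema to
# migrate, we just assert more triples about the same resource
g.add((thing, EX.weight, Literal(1.2)))
g.add((thing, EX.suppliedBy, URIRef("http://data.example.com/orgs/acme")))

print(g.serialize(format="turtle"))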

The other aspect of that variety in data that graphs help with is querying. As Jem Rayfield (BBC News & Sport) explains, using a graph makes the model simpler to develop and query.

Graph data models can reach higher levels of variety in the data before they become unwieldy. This allows more data to be mixed and queried together. Mixing in more data adds more context, and more context allows for more insight. Insight is what we’re ultimately trying to get at with any data analysis. That’s why the intelligence communities have been using graphs for many years now.
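
Querying is where that flexibility pays off. A hedged sketch, continuing the hypothetical rdflib graph from above: a SPARQL OPTIONAL clause lets newly mixed-in data enrich the results without breaking the query for resources that lack it.

# Continuing with the graph g from the sketch above
results = g.query("""
    PREFIX ex: <http://data.example.com/terms/>
    SELECT ?thing ?name ?supplier
    WHERE {
        ?thing ex:name ?name .
        OPTIONAL { ?thing ex:suppliedBy ?supplier }
    }
""")
for row in results:
    print(row.thing, row.name, row.supplier)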

What we’re seeing now, with the combination of Big Data and graph technologies, is the ability to add value inside the enterprise. Graphs are useful for data analysis even if you don’t intend to publish the data on the semantic web. Maybe even especially then.

Microsoft, Oracle and IBM are all playing in the Big Data space and have been for some time. What’s less well known, and less ready for the mainstream, is that they all have projects in the graph database space: IBM’s DB2 NoSQL Graph Store, Oracle Database Semantic Technologies and Microsoft’s Connected Services Framework 3.0.

Behind the scenes, in the enterprise, is probably the biggest space where graphs will add real value over the next few years: solving the variety problem in Big Data.

Getting over-excited about Dinosaurs…

I had the great pleasure, a few weeks ago, of working with Tom Scott and Michael Smethurst at the BBC on extensions to the Wildlife Ontology that sits behind Wildlife Finder.

In case you hadn’t spotted it (and if you’re reading this I can’t believe you haven’t) Wildlife Finder provides its information in HTML and RDF — Linked Data, providing a machine-readable version of the documents for those who want to extend or build on top of it. Readers of this blog will have seen Wildlife Finder showcased in many, many Linked Data presentations.

The initial data modelling work was a joint venture between Tom Scott of the BBC and Leigh Dodds of Talis, and they built an ontology that is simple, elegant and extensible. So, when I got a call asking if I could help them add dinosaurs into the mix I was chuffed — getting paid to talk about dinosaurs!

Like most children, and we’re all children really, I got over-excited and rushed up to London to find out more. Tom and I spent some time working through changes and he, being far more knowledgeable than I on these matters, let me down gently.

Dinosaurs, of course, are no different to other animals in Wildlife Finder — other than being dead for a while longer…

This realisation made me feel a little below average in the biology department, I can tell you. It’s one of those things you stumble across that is so obvious once someone says it to you, and yet may well not have occurred to you without a lot of thought.


You're not the one and only…

The chorus of Chesney Hawkes’ song, a huge pop hit with teenage girls in the 1990s, goes “I am the one and only”. But what does that have to do with SemTech 2010?

I was in the exhibit space yesterday evening and there was so much really interesting stuff. I had some really great conversations: talking about storage implementations with Franz and Revelytix (and drinking their excellent margaritas), looking at vertical search with Semantifi and having a great discussion about scaling with the guys from Oracle.

A really useful exhibition of some great technology companies in the semweb space.

So why the Chesney reference? Well, several of the exhibitors started out with

we’re the only end-user semantic web application available today

and

we have the first foo bar baz server that does blah blah blah

and

we are the first and only semantic search widget linker

and all I could hear in my head every time it was said was Chesney… “You are the one and only”. Only they’re not.

For all of the exhibitors who said they were first or only I had serious doubts, having seen very similar things elsewhere. Maybe their ‘first’ was very specific — I was the first blogger at SemTech to write a summary of the first two days that included a reference to Colibri…

The problem with these statements is that they are damaging; how much depends on the listener. If the listener is new to the semweb and believes the claim, it makes our market look niche, immature and specialist. If the listener is informed and does not believe the claim, it makes your business look like marketeers who will lie to impress people. Either way it’s not a positive outcome. Please stop.

Semtech 2010, San Francisco

San Francisco is such a beautiful city: the blue sky, the clean streets and the cable cars. A short walk and you’re on the coast, with the bridges and islands.

I’ve been to San Francisco before, for less than 24 hours, and I only got to see the bridge from the plane window as I flew out again, so it’s especially nice to be here for a week.

I’m here with colleagues from Talis for SemTech 2010.

We’ve had some great sessions so far. I sat in on the first day of OWLED 2010 and, having seen a few bio-informatics solutions using OWL, found it an interesting session. First up was Michel Dumontier talking about relational patterns in OWL and their application to OBO. Michel talked about integrating OWL with OBO so that OWL can be generated from OBO, adding OWL definitions to the OBO flat file format, which doesn’t currently allow for all of the statements you want to be able to make in OWL. In summary, they’ve put together what looks like a macro expansion language so that short names in OBO can be expanded into the correct class definitions in OWL. This kind of ongoing integration with existing syntaxes and formats is really interesting, as it opens up more options than simply replacing systems.

The session went on to talk about water spectroscopy, quantum mechanics and chemicals, all described using OWL techniques. This is heavy-weight ontology modelling and very interesting to see description logic applied and delivering real value to these datasets. You can get the full papers online linked from the OWLED 2010 Schedule.

On Monday evening we had the opening sessions for SemTech, the first being Eric A. Franzon of Semantic Universe and Brian Sletten of Bosatsu Consulting, Inc. giving a presentation entitled Semantics for the Rest of Us. This started out with one of the best explanations by analogy I’ve ever heard – so obvious once you’ve seen it done. Eric and Brian compared mashing up data with mashing up music: mixing tracks with appropriate tempos and pitches to create new, interesting and exciting pieces of music, such wonders as The Prodigy and Enya, or Billy Idol vs Pink. Such a wonderfully simple way to explain it. The music analogy continued with Linked Data being compared to the harmonica: “Easy to play; takes work to master”. From there, though, we left the business/non-technical track and delved into code examples and other technical aspects of the Semantic Web – a shame, as it blemished what was otherwise an inspiring talk.

There was the Chairman’s presentation, “What Will We Be Saying About Semantics This Year?”, but having partaken of the free wine I’m afraid we ducked out for some dinner. Colibri is a little Mexican restaurant near the Hilton, Union Square.

That was Monday, and I’ve now spent all of Tuesday in the SemTech tutorial sessions. This morning David Wood and Bernadette Hyland of Semantic Web consultancy Zepheira did an excellent session on Linked Enterprise Data. The talk comes ahead of a soon-to-be-published book, Linked Enterprise Data, which is full of case studies authored by those directly involved in real-world enterprise linked data projects. It should be a good book.

One of the things I liked most about the session was the mythbusting. This happened throughout, but Bernadette put up, and busted, three myths explicitly. These three myths apply to many aspects of the way enterprises work, and having them show up clearly in the case studies is very useful.

Myth: One authoritative, centralized system for data is necessary to ensure quality and proper usage.

Reality: In many cases there is no “one right way” to curate and view the data. What is right for one department can limit or block another.

Myth: If we centralize control, no one will be able to use the data in the wrong way.

Reality: If you limit users, they will find a way to take the data elsewhere → decentralization

Myth: We can have one group who will provide reporting to meet everyone’s data analysis needs.

Reality: One group cannot keep up with all the changing ways in which people need to use data and it is very expensive.

Next up, I was really interested to hear Duane Degler talk on interfaces for the Semantic Web. Unfortunately I misunderstood the pitch for the session and it was far more introductory than I was looking for, with a whole host of examples of interfaces and visualisations for structured data – all of which I’d seen (and studied) before.

With a conference as full as SemTech there’s far more going on than you can get into; the conference is many tracks wide at times. I considered the New Business and Marketing Models with Semantics and Linked Data panel featuring Ian Davis (from Talis) alongside Scott Brinker of ion interactive, inc., Michael F. Uschold and Rachel Lovinger of Razorfish. It looked from Twitter to be an interesting session.

I decided instead to attend the lightning sessions, a dozen presenters in the usual strict 5 minutes each format. Here are a few of my highlights:

Could SemTech Run Entirely on Excel? Lee Feigenbaum, Cambridge Semantics Inc — Lee demonstrated how data in Microsoft Excel could be published as Linked Data using Anzo for Excel. I have to say his rapid demo was very impressive, taking a typical multi-sheet workbook, generating an ontology from it automagically and syncing the data back and forth to Anzo; he then created a simple HTML view from the data using a browser-based point-and-click tool. All in 5 minutes, just.

My colleague Leigh Dodds presented fanhu.bz in 4 minutes 50 seconds. It was great to see a warm reception for it on Twitter. Fanhu.bz tries to surface existing communities around BBC programmes, giving a place to see what people are saying, and how people are feeling, about their favourite TV shows.

My final highlight would be jute, presented by Sean McDonald. Jute is a network visualisation tool with some nice features that let you pick properties of the data and configure them as visual attributes instead of having the relationship on the graph. One example shown was a graph of US politicians in which Democrat or Republican membership was initially shown as a relationship to each party, making the graph hard to read; jute makes it possible to reconfigure that property as a colour attribute on the node, changing the politicians into red and blue nodes and removing the visual complexity of the party membership. A very nice tool for viewing graphs.

Then out for dinner at Puccini and Pinetti — not cheap, but the food was very good. The wine was expensive but also very good, with great recommendations from the staff.

Great day.

Distributed, Linked Data has significant implications for Intellectual Property Rights in Data.

What P2P networks have done for the distribution of digital media is phenomenal. It is possible, easy even, to get almost any TV show, movie, track or album you can think of by searching one of the many torrent sites. As fast as the media industry takes down one site through legal action, another appears to take its place.

I don’t want to discuss the legal, moral or social implications of this, but rather how the internet changes the nature of our relationship with media – and data. The internet is a great big copying machine, true enough, but it’s also a fabric that allows mass co-operation. It’s that mass peer-to-peer co-operation that makes so much content available for free; content that is published freely by its creator as well as infringing content.

Sharing of copyrighted content is always likely to be infringing on p2p networks, regardless of any tricks employed, but for data the situation may be different and the Linked Data web has real implications in this space.

Taking the Royal Mail’s Postcode Address File as my working example, because it’s been in the news recently as a result of the work done by ErnestMarples.com, I’ll attempt to show how the Linked Data web changes the nature of data publishing and intellectual property.

First, in case you’re not familiar, a quick introduction to Linked Data. In Linked Data we use http web addresses (which we call URIs) not only to refer to documents containing data but also to refer to real-world things and abstract concepts. We then combine those URIs with properties and values to make statements about the things the URIs represent. So, I might say that my postcode is referred to by the URI http://someorg.example.com/myaddress/postcode. Asking for that URI in the browser would then redirect you to a document containing data about my postcode, maybe sending you to http://someorg.example.com/myaddress/postcode.rdf if you asked for data and http://someorg.example.com/myaddress/postcode.html if you asked for a web page (that’s called content negotiation). All of that works today and organisations like the UK Government, BBC, New York Times and others are publishing data this way.
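
As a minimal sketch of how a client might exercise that content negotiation, assuming the illustrative URI above (a real publisher’s redirect targets will vary), using Python’s requests library:

import requests

uri = "http://someorg.example.com/myaddress/postcode"  # illustrative URI from the text

# Asking for data: a Linked Data server typically 303-redirects this
# request to the RDF document describing the postcode
as_data = requests.get(uri, headers={"Accept": "application/rdf+xml"})

# Asking for a web page: the same URI redirects to the HTML document
as_page = requests.get(uri, headers={"Accept": "text/html"})

print(as_data.url)  # e.g. http://someorg.example.com/myaddress/postcode.rdf
print(as_page.url)  # e.g. http://someorg.example.com/myaddress/postcode.html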

Also worth noting is the distinction between Linked Data (the technique described above) and Linked Open Data, the output of the W3C’s Linking Open Data project. An important distinction as I’m talking about how commercially owned and protected databases may be disrupted by Linked Data, whereas Linked Open Data is data that is already published on the web under an Open license.

Now, Royal Mail own the Postcode Address File, and other postcode data such as geo co-ordinates. These are covered in the UK by Copyright and Database Right (which applies to which bits is a different story), so we assume the data is “owned”. The database contains more than 28 million postcodes, so publishing my own postcode could not be considered an infringement in any meaningful way, and publishing the data for all the addresses within a single postcode would also be unlikely to infringe, as it’s such a small fraction of the total data.

So I might publish some data like this (the format is Turtle, a way to write down Linked Data):

@prefix paf: <http://paf.example.org/terms#> .  # illustrative namespace, not a real Royal Mail vocabulary
@prefix geo: <http://www.w3.org/2003/01/geo/wgs84_pos#> .

<http://someorg.example.com/myaddress/postcode>
  a paf:Postcode;
  paf:correctForm "B37 7YB";
  paf:normalisedForm "b377yb";
  geo:long -1.717336;
  geo:lat 52.467971;
  paf:ordnanceSurveyCode "SP1930085600";
  paf:gridRefEast 41930;
  paf:gridRefNorth 28560;
  paf:containsAddress <http://someorg.example.com/myaddress/postcode#1>;
  paf:googleMaps <http://maps.google.co.uk/maps?hl=en&source=hp&q=B377YB&ie=UTF8&hq=&hnear=Birmingham,+West+Midlands+B377YB,+United+Kingdom&gl=uk&ei=Zs8HS_KVNNOe4QbIpITTCw&ved=0CAgQ8gEwAA&t=h&z=16>.

<http://someorg.example.com/myaddress/postcode#1>
  a paf:Address;
  paf:organisationName "Talis Group Ltd";
  paf:dependentThoroughfareName "Knight's Court";
  paf:thoroughfareName "Solihull Parkway";
  paf:dependentLocality "Birmingham Business Park";
  paf:postTown "Birmingham";
  paf:postcode <http://someorg.example.com/myaddress/postcode>.

I’ve probably made some mistakes in terms of the PAF properties, as it’s a long time since I worked with PAF, but it’s clear enough to make my point. So, I publish this file on my own website as a way of describing the office where I work. That’s not an infringement of any rights in the data, and it’s a perfectly legitimate thing to do with the address.

As the web of Linked Data takes off, and the same schemas become commonly used for this kind of thing, we start to build a substantial copy of the original database. This time, however, the database is not on a single server as ErnestMarples.com was, but spread across the web of Linked Data. There is no single infringing organisation who can be made to take the data down again. If I were responsible for the revenue brought in from sales of PAF licenses this would be a concern, but not a major one, as the distributed nature means the data can’t be queried.

The distributed nature of the web means the web itself can’t be queried, but we already know how to address that technically – we build large aggregations of the web, index them and call them search engines. That is also already happening for the Linked Data web. As with the web of documents, some people are attempting to create general-purpose search engines over the data, and others specialised search engines for specific areas of interest. It’s easy to see that areas of value, such as address data, are likely to attract specialist attention.

Here, though, while the individual documents do not infringe, an aggregate of many of them would start to. The defence of crowd-sourcing used in other contexts (such as Open Street Map) does not apply here, as the PAF is not factual data – the connection between a postcode and an address can only have come from one place, PAF, and is owned by Royal Mail however it got into the database.

So, with the aggregate now infringing, it can be taken down through request, negotiation or due process. The obvious answer to that might be for the aggregate to hold only the URIs, not the actual values of the data. This would leave it without a useful search mechanism, however, which could be addressed by having a well-known URI structure, as I used in the example data. We might have:

<http://addresssearch.example.net/postcodes/B37_7YB>
  owl:sameAs <http://someorg.example.com/myaddress/postcode> .

This gets around the data issue, but the full list of postcodes itself may be considered infringing, and the postcodes are clearly visible in the URIs. Taking them out of the URIs would leave no mechanism to go from a known postcode to data about it and the addresses within it, the main use case for PAF. It doesn’t take long to make the link with other existing technology, though: an area where we want to match a short string we know against an original short string, but cannot make the original available in clear text… passwords.

Password storage uses one-way hashes so that the password is not available in its original form once stored, but login attempts can be matched by putting the attempted password through the same hash. Hashes are commonplace in the P2P world for a variety of functions, so they are well known and could be applied by an aggregator, or co-operative group, to solve this problem.

If I push the correct form of “B37 7YB” through MD5, I get “bdd2a7bf68119d001ebfd7f86c13a4c7”, but there is no way to get back from that to the original postcode. So a service that uses hashed URIs would not be publishing the postcode list in a directly usable form, but could be searched easily by anyone knowing how the URIs were structured and hashed.

<http://addresssearch.example.net/postcodes/bdd2a7bf68119d001ebfd7f86c13a4c7>
  owl:sameAs <http://someorg.example.com/myaddress/postcode> .
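
A minimal sketch, in Python, of how an aggregator might mint the hashed URI shown above; the domain is the illustrative one from these examples, and the convention (MD5 over the correct form of the postcode) is just one that the publisher and its users would need to share:

import hashlib

def hashed_postcode_uri(correct_form: str) -> str:
    """Build a lookup URI from a postcode's correct form without
    exposing the postcode itself anywhere in the URI."""
    digest = hashlib.md5(correct_form.encode("utf-8")).hexdigest()
    return "http://addresssearch.example.net/postcodes/" + digest

# Anyone who knows a postcode (and the convention) can reconstruct the
# URI, but the postcode list cannot be recovered from the URIs alone
print(hashed_postcode_uri("B37 7YB"))  # the hash quoted above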

Of course, a specialist address service, advertising address lookups and making money, could still be considered infringing by the courts regardless of the technical mechanisms employed, but what of more general aggregations or informal co-operative sites? sameAs, a site for sharing sameAs statements, already provides the infrastructure that would be needed for a service like this, and the ease with which such sites can be set up and mirrored would make them hard to defend against using the law, in the same way that torrent listing sites are difficult for the film and music industries to stop. Regardless of the technical approach and the degree to which it provides legal and practical defence, this is still publishing data in a way that is against the spirit of Copyright and Database Right.

The situation I describe above is one where many, many organisations and individuals publish data in a consistent form, and that is likely to happen over the next few years for common data like business addresses and phone numbers, but much less likely for less mainstream data. With addresses it is clear there is a reason to publish your individual data other than to be part of the whole; in more contrived cases, where the only reason to publish is to contribute to a larger aggregate, the notion of fair use for a small amount of the data may not stand up. That is, over the longer term, address data will not be crowd-sourced – people deliberately creating a dataset – but web-sourced – the data will be on the web anyway.

We can see from this small example that the kinds of data that may be vulnerable to distributed publishing in this way are wide-ranging: the Dewey Decimal Classification scheme used by libraries, telephone directories (with lookups in both directions), Gracenote’s music database and legal case numbering schemes could all be published this way. The problem, of course, is that the data has to be distributed sufficiently that no individual host can be seen as infringing. For common data this will happen naturally, but the co-ordination overhead for a group trying to do this pro-actively would be significant; though that might be solved easily by someone else thinking about how to do it.

As I see it, a small group of unfunded individuals would have difficulty achieving the level of distribution necessary to be defensible. Though could 1% of a database be considered infringing? Could, or would, 100 people use their existing domains to publish 1% of the PAF each? Would 200 join in for ½% each? Then, of course, there are the usual problems of trust, timeliness and accuracy associated with non-authoritative publication.

These problems notwithstanding, Linked Data has the potential to provide a global database at web scale. Ways of querying that web of data will be invented; what I sketch out above is just one very basic approach. The change the web of data brings has huge implications for data owners and intellectual property rights in data.

Government Data, Openness and Making Money

Over on the UK Government Data Developers group there’s been a great discussion about openness, innovation and how Government makes money from its data; and, of course, whether it should make money at all. I can’t link to the discussion as the group is closed – sign up, it’s a great group.

Philosophically there’s always the stance that Government data has already been paid for by the public through general taxation.

Tim Berners-Lee even says so in his guest piece for Times Online.

This is data that has already been collected and paid for by the taxpayer, and the internet allows it to be distributed much more cheaply than before. Governments can unlock its value by simply letting people use it.

While that’s true, the role of Government is to maximise the return we get on our taxes, so if more money can be made from the assets we have then surely we should.

This is where the discussion breaks off into various arguments as to where on the spectrum the licensing of Government data should sit, and how open to re-use it should be.

The discussion covers notions of Copyleft licensing, attribution, commercial and non-commercial use as well as models of innovation.

What I always come back to is the notion that to make money you have to have something that is not “open”, a scarce resource. I have a blog post talking about that in the context of software and the web that’s been drafted but not finished for some time, so I’m coming at this from a point of existing thinking.

To make money something has to be closed.

In the case of creative works, the thing that is closed is the right to produce copies (closed through Copyright law). An author makes money by selling that right (or a limited subset of it) to a publisher who makes money from exploiting the right to copy. The publisher has exclusivity.

In the case of open source software companies the dominant model is support and consultancy. They make money by exploiting the specialist knowledge they have in their heads – a careful balance exists for companies doing this between making the product great and needing the support revenue. This balance leads to other monetization strategies, like using the closed nature of being the only place to go for that software to sell default slots in the software (think search boxes), or advertising.

In the case of closed-source commercial software it is the code, the product itself, that remains closed.

Commercial organisations with data assets have to keep the data closed in order to make money. The Government, however, does not. The Government can give the data away for free because it has something else that is closed – the UK economy. To be a part of the (legitimate) UK economy you have to pay taxes, giving the UK a 20% to 40% share of all profits.

If people find ways to make money using Government data, those taxes dwarf any potential licensing fee – can you imagine a commercial data provider asking for up to 40% of a company’s profit as the cost of a data license?

This is why it makes sense for the Government to make data available with as few restrictions as possible – ultimately that means Public Domain.

That seems to be the direction the mailing list is heading thanks to some great contributors. If open data, government data and innovation interest you then sign up and join in.

ShelterIt – My digital think-tank: On identity

Did you notice what just happened? I used an URI as an identifier for a subject. If you popped that URI into your browser, it will take you to WikiPedia’s article on the book and provide a lot of info there in human prose about this book, and this would make it rather easy for Bob to say that, yes indeed, that’s the same book I’ve got. So now we’ve got me and Bob agreeing that we have the same book.

from ShelterIt – My digital think-tank: On identity.

Great piece by Alexander Johannesen about the future of library data, semantic web and the difficulties of getting from here to there.

Ito World: Visualising Transport Data for Data.gov.uk

It can be hard to make meaningful information from huge amounts of data, a graph and a table doesn’t always communicate all it should do. We have been working hard on technology to visualise big datasets into compelling stories that humans can understand. We were really pleased with what we came up with in just one and a half days, see for yourself

from Ito World: Visualising Transport Data for Data.gov.uk.

Nice work on visualizing traffic data.

If you use numbers…


Still not sure what’s for the best, but the idea that opaque URIs are better because they’re language-independent doesn’t ring true for me. A word is just as opaque as a GUID if you don’t speak the language, but for those who can read it, it may be far clearer and easier to work with.