Semantic Web

You’re not the one and only…

Thursday, June 24th, 2010 | Semantic Web, Software Business | 2 Comments

The chorus of Chesney Hawkes’ song goes “I am the one and only”, a huge pop hit with teenage girls in the 1990s, but what does that have to do with SemTech 2010?

I was in the exhibit space yesterday evening and there was so much really interesting stuff. I had some really great conversations: talking about storage implementations with Franz and Revelytix (and drinking their excellent margaritas), looking at vertical search with Semantifi and having a great discussion about scaling with the guys from Oracle.

A really useful exhibition of some great technology companies in the semweb space.

So why the Chesney reference? Well, several of the exhibitors started out with

we’re the only end-user semantic web application available today

and

we have the first foo bar baz server that does blah blah blah

and

we are the first and only semantic search widget linker

and all I could hear in my head every time it was said was Chesney… “You are the one and only” only they’re not.

For all of the exhibitors that said they were first or only I had serious doubts, having seen other things very similar. Maybe their ‘first’ was very specific — I was the first blogger at SemTech to write a summary of the first two days that included a reference to Colibri…

The problem with these statements is that they are damaging; how much depends on the listener. If the listener is new to the semweb and believes the claim then it makes our market look niche, immature and specialist. If the listener is informed and does not believe the claim it makes your business look like marketeers who will lie to impress people. Either way it’s not a positive outcome. Please stop.

Semtech 2010, San Francisco

Tuesday, June 22nd, 2010 | Linked Data, Semantic Web | 3 Comments

San Francisco is such a very beautiful city. The blue sky, clean streets and the cable cars. A short walk and you’re on the coast, with the bridges and islands.

I’ve been to San Francisco before, for less than 24 hours and I only got to see the bridge from the plane window as I flew out again so it’s especially nice to be here for a week.

I’m here with colleagues from Talis for SemTech 2010.

We’ve had some great sessions so far. I sat in on the first day of OWLED 2010 and having seen a few bio-informatics solutions using OWL this was an interesting session. First up was Michel Dumontier talking about Relational patterns in OWL and their application to OBO. Michel talked about the integration of OWL with OBO so that OWL can be generated from OBO. He talked about adding OWL definitions to the OBO flat file format as OBO’s flat file format doesn’t currently allow for all of the statements you want to be able to make in OWL. In summary, they’ve put together what looks like a macro expansion language so that short names on OBO can be expanded into the correct class definitions in OWL. This kind of ongoing integration with existing syntaxes and formats is really interesting as it opens up more options than simply replacing systems.

The session went on to talk about water spectroscopy, quantum mechanics and chemicals, all described using OWL techniques. This is heavy-weight ontology modelling and very interesting to see description logic applied and delivering real value to these datasets. You can get the full papers online linked from the OWLED 2010 Schedule.

On Monday evening we had the opening sessions for Semtech, the first being Eric A. Franzon, Semantic Universe and Brian Sletten, Bosatsu Consulting, Inc. giving a presentation entitled Semantics for the Rest of Us. Now, this started out with one of the best analogous explanations I’ve ever heard, so obvious once you’ve seen it done. Eric and Brian compared the idea of mashing up data with mashing up music, mixing tracks with appropriate tempos and pitches to create new, interesting and exciting pieces of music; such wonders as The Prodigy and Enya, or Billy Idol vs Pink. Such a wonderfully simple way to explain. The music analogy continued with Linked Data being compared with the Harmonica, “Easy to play; takes work to master”. From here, though, we left the business/non-technical track and started to delve into code examples and other technical aspects of Semantic Web, a shame as it blemished what was otherwise an inspiring talk.

There was the Chairman’s presentation, “What Will We Be Saying About Semantics This Year?”. Having partaken of the free wine I’m afraid we ducked out for some dinner. Colibri is a little Mexican restaurant near the Hilton, Union Square.

That was Monday, and I’ve now spent all of Tuesday in the SemTech tutorial sessions. This morning David Wood and Bernadette Hyland of Semantic Web consultancy Zepheira did an excellent session on Linked Enterprise Data. The talk comes ahead of a soon-to-be-published book, Linked Enterprise Data, which is full of case studies authored by those directly involved with real-world enterprise linked data projects. Should be a good book.

One of the things I liked most about the session was the mythbusting. This happened throughout, but Bernadette put up, and busted, three myths explicitly. These three myths apply to many aspects of the way enterprises work, and seeing them show up clearly in the case studies is very useful.

Myth: One authoritative, centralized system for data is necessary to ensure quality and proper usage.

Reality: In many cases there is no “one right way” to curate and view the data. What is right for one department can limit or block another.

Myth: If we centralize control, no one will be able to use the data in the wrong way.

Reality: If you limit users, they will find a way to take the data elsewhere –> decentralization

Myth: We can have one group who will provide reporting to meet everyone’s data analysis needs.

Reality: One group cannot keep up with all the changing ways in which people need to use data and it is very expensive.

Next up I was really interested to hear Duane Degler talk on interfaces for the Semantic Web. Unfortunately I misunderstood the pitch for the session and it was far more introductory than I was looking for, with a whole host of examples of interfaces and visualisations for structured data, all of which I’d seen (and studied) before.

With a conference as full as SemTech there’s far more going on than you can get into, the conference is many tracks wide at times. I considered the New Business and Marketing Models with Semantics and Linked Data panel featuring Ian Davis (from Talis) alongside Scott Brinker, ion interactive, inc., Michael F. Uschold and Rachel Lovinger, Razorfish. It looked from Twitter to be an interesting session.

I decided instead to attend the lightning sessions, a dozen presenters in the usual strict 5 minutes each format. Here are a few of my highlights:

Could SemTech Run Entirely on Excel? Lee Feigenbaum, Cambridge Semantics Inc — Lee demonstrated how data in Microsoft Excel could be published as Linked Data using Anzo for Excel. I have to say his rapid demo was very impressive, taking a typical multi-sheet workbook, generating an ontology from it automagically and syncing the data back and forth to Anzo; he then created a simple HTML view from the data using a browser-based point-and-click tool. All in 5 minutes, just.

My colleague Leigh Dodds presented fanhu.bz in 4 minutes 50 seconds. It was great to see a warm reception for it on twitter. Fanhu.bz tries to surface existing communities around BBC programmes, giving a place to see what people are saying, and how people are feeling, about their favourite TV shows.

My final highlight would be jute, presented by Sean McDonald. Jute is a network visualisation tool with some nice features allowing you to pick properties of the data and configure them as visual attributes instead of having the relationship on the graph. One example shown was a graph of US politicians in which their Democrat or Republican membership was initially shown as a relationship to each party. This makes the graph hard to read, but jute makes it possible to reconfigure that property as a color attribute on the node, changing the politicians into red and blue nodes and removing the visual complexity of the party membership. A very nice tool for viewing graphs.

Then out for dinner at Puccini and Pinetti — not cheap, but the food was very good. The wine was expensive, but very good with great recommendations from the staff.

Great day.

Distributed, Linked Data has significant implications for Intellectual Property Rights in Data.

What P2P networks have done for distribution of digital media is phenomenal. It is possible, easy even, to get almost any TV show, movie, track or album you can think of by searching one of the many torrent sites. As fast as the media industry take down one site through legal action another has appeared to take its place.

I don’t want to discuss the legal, moral or social implications of this, but discuss how the internet changes the nature of our relationship with media – and data. The internet is a great big copying machine, true enough, but it’s also a fabric that allows mass co-operation. It’s that mass peer-to-peer co-operation that makes so much content available for free; content that is published freely by its creator as well as infringing content.

Sharing of copyrighted content is always likely to be infringing on p2p networks, regardless of any tricks employed, but for data the situation may be different and the Linked Data web has real implications in this space.

Taking the Royal Mail’s Postcode Address File as my working example, because it’s been in the news recently as a result of the work done by ErnestMarples.com, I’ll attempt to show how the Linked Data web changes the nature of data publishing and intellectual property.

First, in case you’re not familiar, a quick introduction to Linked Data. In Linked Data we use http web addresses (which we call URIs) not only to refer to documents containing data but also to refer to real-world things and abstract concepts. We then combine those URIs with properties and values to make statements about the things the URIs represent. So, I might say that my postcode is referred to by the URI http://someorg.example.com/myaddress/postcode. Asking for that URI in the browser would then redirect you to a document containing data about my postcode, maybe sending you to http://someorg.example.com/myaddress/postcode.rdf if you asked for data and http://someorg.example.com/myaddress/postcode.html if you asked for a web page (that’s called content negotiation). All of that works today and organisations like the UK Government, BBC, New York Times and others are publishing data this way.
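The redirect step described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `negotiate`, the use of a 303 status code, the single data media type and the simplified parsing of the Accept header are all mine, not a description of how any of the publishers mentioned actually do it.

```python
# Hypothetical sketch of content negotiation for the example postcode URI.
# The URIs come from the text; everything else is an illustrative assumption.

THING_URI = "http://someorg.example.com/myaddress/postcode"

# Media types treated as "asking for data" (assumed, deliberately minimal).
DATA_TYPES = {"application/rdf+xml"}


def negotiate(accept_header):
    """Return (status, location) for a request to THING_URI."""
    # Split the Accept header into bare media types, ignoring q-values
    # and ordering by preference (a real server would honour both).
    requested = [part.split(";")[0].strip().lower()
                 for part in accept_header.split(",")]
    for media_type in requested:
        if media_type in DATA_TYPES:
            # The client asked for data: send it to the RDF document.
            return 303, THING_URI + ".rdf"
    # Anything else falls back to the web page.
    return 303, THING_URI + ".html"


print(negotiate("application/rdf+xml"))
print(negotiate("text/html,application/xhtml+xml"))
```

The point is simply that one URI names the thing, and the server picks a document about it based on what the client says it can accept.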

Also worth noting is the distinction between Linked Data (the technique described above) and Linked Open Data, the output of the W3C’s Linking Open Data project. An important distinction as I’m talking about how commercially owned and protected databases may be disrupted by Linked Data, whereas Linked Open Data is data that is already published on the web under an Open license.

Now, Royal Mail own the Postcode Address File, and other postcode data such as geo co-ordinates. These are covered in the UK by Copyright and Database Right (which right covers which bits is a different story) so we assume the data is “owned”. The database contains more than 28 million addresses, so publishing my own postcode could not be considered an infringement in any meaningful way. Publishing the data for all the addresses within a single postcode would also be unlikely to infringe as it’s such a small fraction of the total data.

So I might publish some data like this (the format is Turtle, a way to write down Linked Data)

<http://someorg.example.com/myaddress/postcode>
  a paf:Postcode;
  paf:correctForm "B37 7YB";
  paf:normalisedForm "b377yb";
  geo:long -1.717336;
  geo:lat 52.467971;
  paf:ordnanceSurveyCode "SP1930085600";
  paf:gridRefEast 41930;
  paf:gridRefNorth 28560;
  paf:containsAddress <http://someorg.example.com/myaddress/postcode#1>;
  paf:googleMaps <http://maps.google.co.uk/maps?hl=en&source=hp&q=B377YB&ie=UTF8&hq=&hnear=Birmingham,+West+Midlands+B377YB,+United+Kingdom&gl=uk&ei=Zs8HS_KVNNOe4QbIpITTCw&ved=0CAgQ8gEwAA&t=h&z=16>.

<http://someorg.example.com/myaddress/postcode#1>
  a paf:Address;
  paf:organisationName "Talis Group Ltd";
  paf:dependentThoroughfareName "Knight's Court";
  paf:thoroughfareName "Solihull Parkway";
  paf:dependentLocality "Birmingham Business Park";
  paf:postTown "Birmingham";
  paf:postcode <http://someorg.example.com/myaddress/postcode>.

I’ve probably made some mistakes in terms of the PAF properties as it’s a long time since I worked with PAF, but it’s clear enough to make my point with. So, I publish this file on my own website as a way of describing the office where I work. That’s not an infringement of any rights in the data and a perfectly legitimate thing to do with the address.

As the web of Linked Data takes off, and the same schemas become commonly used for this kind of thing, we start to build a substantial copy of the original database. This time, however, the database is not on a single server as ErnestMarples.com was, but spread across the web of Linked Data. There is no single infringing organisation who can be made to take the data down again. If I were responsible for the revenue brought in from sales of PAF licenses this would be a concern, but not a major one as the distributed nature means it can’t be queried.

The distributed nature of the web means the web itself can’t be queried, but we already know how to address that technically: we build large aggregations of the web, index them and call them search engines. That is also already happening for the Linked Data web. As with the web of documents, some people are attempting to create general purpose search engines over the data and others specialised search engines for specific areas of interest. It’s easy to see that areas of value, such as address data, are likely to attract specialist attention.

Here though, while the individual documents do not infringe, an aggregate of many of them would start to infringe. The defence of crowd-sourcing used in other contexts (such as Open Street Map) does not apply here as the PAF is not factual data – the connection between a postcode and address can only have come from one place, PAF, and is owned by Royal Mail however it got into the database.

So, with the aggregate now infringing it can be taken down through request, negotiation or due process. The obvious answer to that might be for the aggregate to hold the URIs only, not the actual values of the data. This would leave it without a useful search mechanism, however. This could be addressed by having a well-known URI structure as I used in the example data. We might have

<http://addresssearch.example.net/postcodes/B37_7YB>
  owl:sameAs <http://someorg.example.com/myaddress/postcode>.

This gets around the data issue, but the full list of postcodes itself may be considered infringing and they are clearly visible in the URIs. Taking them out of the URIs would leave no mechanism to go from a known postcode to data about it and addresses within it, the main use case for PAF. It doesn’t take long to make the link with other existing technology though, an area where we want to match a short string we know with an original short string, but cannot make the original short string available in clear text… Passwords.

Password storage uses one-way hashes so that the password is not available in its original form once stored, but requests to login can be matched by putting the attempted password through the same hash. Hashes are commonplace in the P2P world for a variety of functions, so are well-known and could be applied by an aggregator, or co-operative group, to solve this problem.

If I push the correct form of “B37 7YB” through MD5, I get “bdd2a7bf68119d001ebfd7f86c13a4c7”, but there is no direct way to get back from that to the original postcode. So a service that uses hashed URIs would not be publishing the postcode list in a directly useable form, but could be searched easily by anyone knowing how the URIs were structured and hashed.

<http://addresssearch.example.net/postcodes/bdd2a7bf68119d001ebfd7f86c13a4c7>
  owl:sameAs <http://someorg.example.com/myaddress/postcode>.
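The hashing step is easy to sketch in Python with the standard library. Treat this as an illustration under assumptions: the function names are hypothetical, and I assume the postcode is hashed in its correct form exactly as written, encoded as UTF-8 with no trailing newline.

```python
import hashlib

# Base of the hypothetical aggregator's URIs, taken from the example above.
BASE = "http://addresssearch.example.net/postcodes/"


def hashed_postcode_uri(postcode):
    """Build the lookup URI for a postcode by hashing its correct form."""
    # Assumption: hash the UTF-8 bytes of the string exactly as given.
    digest = hashlib.md5(postcode.encode("utf-8")).hexdigest()
    return BASE + digest


def same_as_statement(postcode, source_uri):
    """Render a Turtle owl:sameAs statement pointing at the publisher's URI."""
    return "<%s>\n  owl:sameAs <%s>." % (hashed_postcode_uri(postcode), source_uri)


print(same_as_statement("B37 7YB",
                        "http://someorg.example.com/myaddress/postcode"))
```

Anyone who knows a postcode can rebuild the same URI and look it up. The digest depends on the input byte for byte (case, spacing and encoding), so publishers and searchers would have to agree on a single form of the postcode before hashing.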

Of course, a specialist address service, advertising address lookups and making money, could still be considered as infringing by the courts regardless of the technical mechanisms employed, but what of more general aggregations or informal co-operative sites? sameAs, a site for sharing sameAs statements, already provides the infrastructure that would be needed for a service like this. The ease with which sites that do this can be set up and mirrored would make it hard to defend against using the law, in the same way that torrent listing sites are difficult for the film and music industries to stop. Regardless of the technical approach and the degree to which that provides legal and practical defence, this is still publishing data in a way that is against the spirit of Copyright and Database Right.

The situation I describe above is one where many, many organisations and individuals are publishing data in a consistent form and that is likely to happen over the next few years for common data like business addresses and phone numbers, but much less likely for less mainstream data. The situation with addresses is one where it is clear there is a reason to publish your individual data other than to be part of the whole. In more contrived cases, where the only reason to publish is to contribute to a larger aggregate, the notion of fair-use for a small amount of the data may not stand up. That is, over the longer term, address data will not be crowd-sourced (people deliberately creating a dataset) but web-sourced: the data will be on the web anyway.

We can see from this small example that the kinds of data that may be vulnerable to distributed publishing in this way are wide-ranging. The Dewey Decimal Classification scheme used by libraries, Telephone directories (with lookups in both directions), Gracenote’s music database, Legal case numbering schemes, could all be published this way. The problem, of course, is that the data has to be distributed sufficiently that no individual host can be seen as infringing. For common data this will happen naturally, but the co-ordination overhead for a group trying to do this pro-actively would be significant; though that might be solved easily by someone else thinking about how to do this.

As I see it a small group of unfunded individuals would have difficulty achieving the level of distribution necessary to be defensible. Though could 1% of a database be considered infringing? Could/Would 100 people use their existing domains to publish 1% of the PAF each? Would 200 join in for ½% each? Then, of course, there are the usual problems of trust, timeliness and accuracy associated with non-authoritative publication.

These problems notwithstanding, Linked Data has the potential to provide a global database at web-scale. Ways of querying that web of data will be invented; what I sketch out above is just one very basic approach. The change the web of data brings has huge implications for data owners and intellectual property rights in data.

Government Data, Openness and Making Money

Friday, November 20th, 2009 | Internet Social Impact, Open Data, Semantic Web | 2 Comments

Over on the UK Government Data Developers group there’s been a great discussion about openness, innovation and how Government makes money from its data; and of course if it should make money. I can’t link to the discussion as the group is closed – sign up, it’s a great group.

Philosophically there’s always the stance that Government data has already been paid for by the public through general taxation.

Tim Berners-Lee even says so in his guest piece for Times Online.

This is data that has already been collected and paid for by the taxpayer, and the internet allows it to be distributed much more cheaply than before. Governments can unlock its value by simply letting people use it.

While that’s true, the role of Government is to maximise the return we get on our taxes so if more money can be made from the assets we have then surely we should.

This is where discussion breaks off into various arguments as to where on the spectrum licensing of Government data should sit, and how open to re-use it should be.

The discussion covers notions of Copyleft licensing, attribution, commercial and non-commercial use as well as models of innovation.

What I always come back to is the notion that to make money you have to have something that is not “open”, a scarce resource. I have a blog post talking about that in the context of software and the web that’s been drafted but not finished for some time, so I’m coming at this from a point of existing thinking.

To make money something has to be closed.

In the case of creative works, the thing that is closed is the right to produce copies (closed through Copyright law). An author makes money by selling that right (or a limited subset of it) to a publisher who makes money from exploiting the right to copy. The publisher has exclusivity.

In the case of open source software companies the dominant model is support and consultancy. They make money by exploiting the specialist knowledge they have in their heads – a careful balance exists for companies doing this between making the product great and needing the support revenue. This balance leads to other monetization strategies, like using the closed nature of being the only place to go for that software to sell default slots in the software (think search boxes), or advertising.

In the case of closed-source commercial software it is the code, the product itself, that remains closed.

Commercial organisations with data assets have to keep the data closed in order to make money. The Government, however, does not. The Government can give the data away for free because it has something else that is closed – the UK economy. To be a part of the (legitimate) UK economy you have to pay taxes, giving the UK a 20% to 40% share of all profits.

If people find ways to make money using Government data those taxes dwarf any potential licensing fee – can you imagine a commercial data provider asking for up to 40% of a company’s profit as the cost of a data license?

This is why it makes sense for the Government to make data available with as few restrictions as possible – ultimately that means Public Domain.

That seems to be the direction the mailing list is heading thanks to some great contributors. If open data, government data and innovation interest you then sign up and join in.

ShelterIt – My digital think-tank: On identity

Wednesday, October 28th, 2009 | Library Tech, Semantic Web | 1 Comment

Did you notice what just happened? I used used an URI as an identifier for a subject. If you popped that URI into your browser, it will take you to WikiPedia’s article on the book and provide a lot of info there in human prose about this book, and this would make it rather easy for Bob to say that, yes indeed, that’s the same book I’ve got. So now we’ve got me and Bob agreeing that we have the same book.

from ShelterIt – My digital think-tank: On identity.

Great piece by Alexander Johannesen about the future of library data, semantic web and the difficulties of getting from here to there.

Ito World: Visualising Transport Data for Data.gov.uk

Tuesday, October 27th, 2009 | Open Data, Semantic Web | 1 Comment

It can be hard to make meaningful information from huge amounts of data, a graph and a table doesn’t always communicate all it should do. We have been working hard on technology to visualise big datasets into compelling stories that humans can understand. We were really pleased with what we came up with in just one and a half days, see for yourself

from Ito World: Visualising Transport Data for Data.gov.uk.

Nice work on visualizing traffic data.

Intensional and Extensional Sets

Wednesday, August 19th, 2009 | Ontologies, Semantic Web | No Comments

One of my collegaues called the other day and asked if we still relied on the distinction between intensional and extensional sets (really intensionally and extensionally defined sets). Yes, even more so now.

from Intensional and Extensional Sets.

If you don’t know the difference (I didn’t) then it’s worth reading.

If you use numbers…

Thursday, August 13th, 2009 | Random Thought, Semantic Web | 2 Comments

If you use numbers...

Still not sure what’s for the best, but the idea that opaque URIs are better because they’re language independent doesn’t ring true for me. A word is just as opaque as a GUID if you don’t speak the language, but for those who can read it may be far clearer and easier to work with.

Conversation with Bruce D’Arcus on Motivation for MODS Ontology « Musings

Tuesday, August 11th, 2009 | Library Tech, Ontologies, Semantic Web | No Comments

The problem from my standpoint is that MODS has some really odd, library-specific, design choices that I don’t think map very well to the wider world. A central concept like mods:name, with mods:role as a child of that, really makes no sense, and conflicts with more common modeling you see in DC, FRBR ,etc.

It’s semantics are also really loose.

So you have to ask yourself, just how linked could a MODS view in RDF really be?

from Conversation with Bruce D’Arcus on Motivation for MODS Ontology.

Excel RDF

Wednesday, July 22nd, 2009 | Semantic Web | 4 Comments

Introduction

When worlds collide sometimes things happen that can be useful. This is not one of those useful things, but a collision of two worlds nonetheless…

ExcelRDF is a proposed serialisation for RDF using the Microsoft Excel Spreadsheet format. This work was inspired by the discussions in the semantic web community about Linked Data and whether or not it mandates the use of RDF. This document is not trying to prove a point, insult anyone or come down on either side of the argument. I just noticed that it hadn’t been done and it didn’t seem too difficult. Of course, that it hadn’t been done should have been enough of a warning to me that it is not, in any sense, desirable.

Conventions used in this document

The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in RFC 2119.

Overview

When a server receives an HTTP or HTTPS request for a resource that is described in RDF and the client indicates that it is willing to accept content of type application/vnd.ms-excel the server MAY respond with a Microsoft Excel spreadsheet meeting the following conventions.

  • The spreadsheet MUST contain one or more sheets that meet the following conventions.
  • A sheet SHOULD contain zero or more rows of RDF data.
  • Column A of each non-empty row of a sheet SHOULD contain a URI indicating the Subject of a statement.
  • Column B of each non-empty row of a sheet SHOULD contain a URI indicating the Property of a statement.
  • Column C MAY contain either a URI or a literal value as the Object of a statement.
  • If Column C contains a literal value then Column D MAY contain a language identifier in accordance with IETF BCP 47.
  • If Column C contains a literal value then Column E MAY contain a type specifier indicating the type of the literal value.
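To make the column conventions concrete, here is a rough Python sketch that turns rows laid out as above into N-Triples lines. It works on plain tuples standing in for spreadsheet rows rather than reading a real Excel file, and it rests on assumptions the conventions do not settle: a value in Column C counts as a URI if it starts with http:// or https://, the type specifier in Column E is a datatype URI, and a language tag takes precedence when both D and E are filled in. All function names are hypothetical.

```python
def is_uri(value):
    """Assumed rule: Column C holds a URI if it looks like an http(s) address."""
    return value.startswith(("http://", "https://"))


def escape_literal(text):
    """Escape the characters N-Triples does not allow raw inside a literal."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def row_to_ntriple(row):
    """Convert one row (columns A to E) to an N-Triples line, or None if empty."""
    # Pad short rows so the optional columns D and E can be left out.
    cells = ["" if cell is None else str(cell) for cell in row] + [""] * 5
    subject, predicate, obj, lang, datatype = cells[:5]
    if not (subject or predicate or obj):
        return None  # empty row, nothing to emit
    if is_uri(obj):
        obj_term = "<%s>" % obj
    else:
        obj_term = '"%s"' % escape_literal(obj)
        if lang:
            obj_term += "@" + lang            # Column D: language identifier
        elif datatype:
            obj_term += "^^<%s>" % datatype   # Column E: type specifier
    return "<%s> <%s> %s ." % (subject, predicate, obj_term)


def sheet_to_ntriples(rows):
    """Convert every non-empty row of a sheet to an N-Triples line."""
    lines = (row_to_ntriple(row) for row in rows)
    return [line for line in lines if line is not None]
```

Feeding it the rows of each sheet in turn, however you choose to read them, gives back ordinary triples that any RDF tool can load.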

Example

The attached example Microsoft Excel spreadsheet contains RDF from the dbpedia project describing Annette Island Airport using the conventions described above.

ExcelRDF Example File

Use Cases

ExcelRDF may be useful where it is desirable to produce charts showing characteristics of a dataset, such as the relative distribution of types within a dataset. Perhaps analysing the count of particular properties. I can think of no obvious way to assess graph characteristics such as linkiness, but you could do things like word counts in literals, or working out how much of the literal data is in French.

ExcelRDF may be useful where a specific contract, policy or agreement means that the data must be delivered as an Excel spreadsheet while the underlying data is more useful in RDF.

ExcelRDF may be useful if you wish to be deliberately obtuse.
