Choosing URIs, not a five-minute task.

This post originally appeared on the Talis Consulting Blog.

Chris Keene at Sussex is having a tough time making a decision on his URIs so I thought I’d wade in and muddy the waters a little.

He’s following advice from the venerable Designing URI Sets for the UK Public Sector, an eleven-page document from the heady days of October 2009.

Chris discusses the choice between data.lib.sussex.ac.uk and www.sussex.ac.uk/library/ in terms of elegance, data merging and running infrastructure. He’s leaning toward data.lib.sussex.ac.uk on the basis that data.organisation.tld is the prevailing wind.

There are many more aspects worth considering, and while data.organisation.tld may be a way to get up and running quickly, you might get longer-term benefit from more consideration; after all, we don’t want these URIs to change.

The key requirements are outlined well in ‘Designing URI Sets’ as follows:

3. In particular, the domain will:

  • Expect to be maintained in perpetuity
  • Not contain the name of the department or agency currently defining and naming a concept, as that may be re-assigned
  • Support a direct response, or redirect to department/agency servers
  • Ensure that concepts do not collide
  • Require the minimum of central administration and infrastructure costs
  • Be scalable for throughput, performance, resilience

These are all key points, but one in particular stands out for me in terms of choosing the hostname part of a URI:

  • Not contain the name of the department or agency currently defining and naming a concept, as that may be re-assigned

That simple sentence contains a lot more than is apparent at first reading, and suggests that any or all of the concepts defined in the data may become someone else’s responsibility in time. I think over time we will see this become key to the longevity of URIs, along with much better redirect maintenance.

The approach data.gov.uk has taken is to break the data into broad subject areas within which many different types of data might sit – education.data.gov.uk, transport.data.gov.uk, crime.data.gov.uk, health.data.gov.uk and so on. This is one example of breaking up the hosts and while right now they all point to one cluster of web servers they can be moved around to allow hosting in different places.

This is good, yet I can’t help thinking that those subject-matter areas are really rather broad. Then there are others that seem to work on a different axis, such as statistics.data.gov.uk and research.data.gov.uk, leaving me confused at first glance as to where the responsibility for publishing crime research would lie. Then there is patents.data.gov.uk: not “innovation” or “invention” but “patents”, the things listed.

Data.gov.uk has done a great job trailblazing, making and publishing their decisions and allowing others to learn from them, develop on them and contribute back. I think we can push their thinking on hostnames still further. If we consider Linked Data to be descriptions of things, rather than publishing data, then directories of those things would be useful.

For example, we could give somebody the responsibility of publishing a list of all schools in the UK at schools.example.gov and that would be one part of the puzzle. A different group may have the responsibility of publishing the list of all universities and yet another the list of all companies at companies.example.gov.

Of course, we would expect all of these to interlink: patents.example.gov would have links to companies.example.gov and universities.example.gov to document the ownership of patents. We’d expect to see links in schools.example.gov to inspections.schools.example.gov and so on.

Notice that I’ve dropped the word “data” from those examples, as much of this is about making machine- (and human-) readable descriptions of things. It’s only because we describe lots of things at the same time, and describe them uniformly, that we call it data.

I’d still expect health.example.gov to appear as well, but its responsibility would be one of aggregating what could be considered health data in order to support querying; it would aggregate doctors.example.gov, hospitals.example.gov and more. I would expect as many of these aggregates to pop up as are useful.

Of course, in this approach, as in the current data.gov.uk approach, everyone who wants to say something about a particular doctor, school or patent has to be able to get access to that host to say it, and, perhaps, conflicting things said by different people get mixed up.

At this point you’re probably thinking well, we might as well just use data.organisation.tld and be done with it then. Unfortunately that simply moves the same design decisions from the hostname to the resource part of the URI, the bit after the hostname. You still have to make decisions, and with only one hostname your hosting options are drastically reduced.

Data.gov.uk places the type of thing in the resource part of the URI using what they call concept/reference pairs:

2. Examples of concept/reference pairs:
• road/M5
• school/123
3. The concept/reference construct may be repeated as necessary, for example:
• road/M5/junction/24
• school/123/class/5

I tend to do this slightly differently, using container/reference pairs so I would use “roads” rather than “road” as this lends itself better to then putting listings at those URIs.
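So, as a sketch (these paths are made up for illustration), the container/reference pattern gives URIs like:

/roads for a listing of all the roads
/roads/M5 for the M5 itself
/roads/M5/junctions/24 for junction 24 of the M5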

The antithesis

We can often learn something by turning an approach on its head. In this case I wonder what would happen if we embraced the idea that many people will have different world-views about the same thing, their own two-penneth so to speak, none of them necessarily authoritative.

In that case we end up with me publishing data on data.my.domain and you publishing data about the same things on data.your.domain, just as happens all over the web today. If I choose my domains carefully then maybe I can hand bits on as I find someone else to run them better, as above, but there is always more than one world-view.

There are two common ways to make this work and stay interconnected. The first is to use owl:sameAs to indicate that data.my.domain/Winston_Churchill and data.your.domain/Winston_Churchill are describing the very same thing. The OWL community is not entirely supportive of that use.

The other approach is to use the annotation pattern and rdfs:seeAlso; in which case documents describing a resource live in many places, but they agree on a single, canonical, URI.
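As a rough Turtle sketch of the two patterns (the URIs are the made-up ones from above; the document URI in the second example is invented for illustration):

@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# owl:sameAs: we each mint our own URI and assert that they identify the same thing
<http://data.my.domain/Winston_Churchill>
    owl:sameAs <http://data.your.domain/Winston_Churchill> .

# annotation pattern: we agree on one canonical URI for the man himself and
# each point from it to our own descriptive documents with rdfs:seeAlso
<http://data.my.domain/Winston_Churchill>
    rdfs:seeAlso <http://data.your.domain/docs/winston-churchill> .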

So what would that mean for Sussex?

Well, I’m not sure.

Fortunately, Chris has a limited decision to make right now, choosing a URI, or URIs, for the Mass Observation Archive. It is for this he is considering data.lib.sussex.ac.uk and www.sussex.ac.uk/library/.

Thinking about changing responsibilities over time, I have to say I would choose neither. It is perfectly conceivable that the Mass Observation Archive may at some time move and not be under the remit of the University of Sussex Library, or even the university.

I would choose a hostname that can travel with the archive wherever it may live. Fortunately it already has one, http://www.massobs.org.uk/. Ideally the catalogue would live at something like catalogue.massobs.org.uk, or maybe massobs.org.uk/archive, or something like that.

My leaning on this is really because this web of data isn’t something separate from the web of documents; it’s “as well as” and “part of” the web as one whole thing. data.anything makes it seem somehow different, which in essence it’s not.

Postscript

Oh, just one more thing…

URI type, for example one of:
• id – Identifier URI
• doc – Document URI, Representation URI
• def – Ontology URI
• set – Set URI

Personally, I really dislike this URI pattern. It puts the distinguishing piece early in the URI, making it harder to spot the change as the server redirects, and harder to select or change when working with the URIs.

I much prefer the pattern

/container/reference to mean the resource
/container/reference.rdf for the RDF/XML
/container/reference.html for the HTML

and expanding to

/container/reference.json, /container/reference.nt, /container/reference.xml and on and on.

My reasoning is simple: I can copy and paste the document URI from the address bar, paste it into curl on the command line, and simply backspace a few characters to trim off the extension. Also, in the browser or wget, this pattern gives us files named something.html and something.rdf by default, which is much easier to work with in most tools.
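For example, with a made-up document URI following that pattern:

# document URI copied straight from the browser address bar
curl http://example.gov/schools/123.html

# backspace five characters to get the resource URI; -L follows the
# redirect to whichever representation the server offers
curl -L http://example.gov/schools/123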

In summary, I don't like writing more code than I have to…

* This post first appeared on the Talis Consulting blog.

I opened my mailbox the other morning to a question from David Norris at the BBC. They’ve been doing a lot of Linked Data work and we’ve been helping them on projects for a good while now.

The question concerns an ongoing debate within their development community and is a very fine question indeed:

We are looking at our architecture for the Olympics. Currently, we have:

1. a data layer comprised of our Triple Store and Content store.
2. a service layer exposing a set of APIs returning RDF.
3. a presentation layer (PHP) to render the RDF into the HTML.

All fairly conventional – but we have two schools of thought:

Do the presentation developers take the RDF and walk the graph (say using something like easyRDF) and pull out the properties they need?

Or:

Do we add a domain model in PHP on top of the easyRDF objects such that developers are abstracted from the RDF and can work with higher-level domain objects instead, like athlete, race etc.?

One group is adamant that we should only work with the RDF, because that *is* the domain model, and that it’s a performance hit (especially in PHP) and just not the “Semantic Web way” to add another domain model.

Others advocate that adding a domain model is a standard OO approach and is the ‘M’ in ‘MVC’: the fact that the data is RDF is irrelevant.

My opinion is that it comes down to the RDF data, and therefore the ontology: if the RDF coming through to the presentation layer is large and generic, it may benefit from having a model on top to provide more high-level, relevant domain objects. But if the RDF is already fairly specific, like an athlete, then walking through the RDF that describes that athlete is probably easy enough and wouldn’t require another model on top of it. So I think it depends on the ontology being modelled closely enough to what the presentation layer needs.

What do you think? I’d be really interested in your view.

Having received it, I figured a public answer would be really useful for people to consider and chime in on in the comments; David kindly agreed.

First up, the architecture in use here is nicely conventional: simple and effective. The triple store is storing metadata and the XML content store is storing documents. We would tend to put everything into the triple store, by either re-modelling the XML in RDF or using XML literals, but this group needs very fast document querying using XPath and the like, so keeping their existing XML content store is a very sensible move. Keep PHP, or replace it with a web scripting language of your choice, and you have a typical setup for building webapps based on RDF.

The question is entirely about what code, and how much code, to write in that presentation layer, and why; the data storage and data access layers are sorted, giving back RDF. Having built a number of substantial applications on top of RDF stores, I do have some experience in this space, and I’ve taken both of the approaches discussed above – converting incoming RDF to objects, and simply working with the RDF.

Let’s get one thing out of the way – RDF, when modelled well, is domain-modelled data. With SQL databases there are a number of compromises required to fit within tables that create friction between a domain model and the underlying SQL schema (think many-to-many). Attempting to hide this is the life’s work of frameworks like Hibernate and much of Rails. If we model RDF as we would a SQL schema then we’ll have the same problems, but the IAs and developers in this group know how to model RDF well, so that shouldn’t be a problem.

With RDF being domain-modelled data, and a graph, it can be far simpler to map incoming RDF to objects in your code than it is with SQL databases. That makes the approach seem attractive. There are, however, some differences too. By looking at the differences we can get a feel for the problem.

Cardinality & Type

When consuming RDF we generally don’t want to make any assumptions about cardinality – how many values of a given property there will be. With properties in our data we can cope with this by making every member variable an array, or by keeping only the first value we find if we only ever expect one. Neither is ideal, but both approaches work to map the RDF into object properties.
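A rough PHP sketch of those two options, working against an RDF/JSON-style nested array (similar in shape to the indexes libraries like Moriarty work with); the Athlete class and the team property here are made up for illustration:

<?php
// $index is keyed by subject URI, then predicate URI, each entry holding an
// array of value structures like array('type' => 'literal', 'value' => '...')
class Athlete
{
    public $names = array(); // option 1: keep every value as an array
    public $team;            // option 2: keep only the first value found

    public function __construct(array $index, $uri)
    {
        $foafName = 'http://xmlns.com/foaf/0.1/name';
        $teamProp = 'http://example.org/ontology/team'; // hypothetical property

        if (isset($index[$uri][$foafName])) {
            foreach ($index[$uri][$foafName] as $value) {
                $this->names[] = $value['value'];
            }
        }
        if (isset($index[$uri][$teamProp])) {
            $this->team = $index[$uri][$teamProp][0]['value'];
        }
    }
}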

When we come to look at types, classes of things, we have a harder problem, though. It’s common, and useful, in RDF to make type statements about resources and very often a resource will have several types. Types are not a special case in RDF, just as with other properties there can be many of them. This presents a problem in mapping to an OOP object model where an object is of one type (with supertypes, admittedly). You can specify multiple types in many OOP languages, often through the use of interfaces, but you do this at the class level and it is consistent across all instances. In RDF we make type statements at the instance level, so a resource can be of many types. Mapping this, and maintaining a mapping in your OOP code will either a) be really hard or b) constrain what you can say in the data. Option b is not ideal as it can prevent others from doing stuff in the data and making more use of it.
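For instance, a single resource might quite reasonably carry several type statements, something a class-per-type OOP mapping struggles to reflect (the URIs and the ex: ontology here are made up):

@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix ex:   <http://example.org/ontology/> .

# one resource, three types, all stated at the instance level
<http://example.org/people/123>
    a foaf:Person , ex:Athlete , ex:TeamCaptain .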

Part of this mismatch on type systems comes from the OOP approach of combining data and behaviour into objects together. Over time this has been constrained and adapted in a number of ways (no multiple inheritance, introduction of interfaces) in order to make a codebase more manageable and hopefully prevent coders getting themselves too tied up in knots. RDF carries no behaviour, it’s a description of something, so the same constraints aren’t necessary. This is the main issue you face mapping RDF to an OOP object model.

Programming Style

What we have ended up with, in libraries like Moriarty, are tools that help us work with the graph quickly and easily. SimpleGraph has functions like get_subjects_of_type($t) which returns a simple array of all the resource URIs of that type. You can then use those in get_subject_subgraph($s) to extract part of the graph to hand off to something else, say a render function.

Moriarty’s SimpleGraph has quite a number of these kinds of functions for making sense of the graph without ever having to work with the underlying nested arrays directly. This pairs up very nicely with functions to do whatever it is you want to do.

$events = $graph->get_subjects_of_type(Ontologies::Sport . 'Event');
foreach ($events as $event) {
    render_sporting_event($event);
}

Of course, functions in PHP and other scripting languages are global, and that’s really not nice, so we often want to scope them, and that’s where objects tend to come back into play.

Say we’re rendering information about a sporting event the pseudocode might look something like this:

$events = $graph->get_subjects_of_type(Ontologies::Sport . 'Event');
foreach ($events as $event) {
    SportingEvent::render($event);
}

This approach differs from an MVC approach because the graph isn’t routinely and completely converted into domain model objects, as that approach is very constraining. What it does is combine graph handling using SimpleGraph with objects for code scoping; by late-binding the graph parts to the objects used to present them, the graph is not constrained by the OOP approach.

If you’re using a more templated approach, so you don’t want a render() function, then simple objects that give access to the values for display are a good option; they can make the code more readable than using graph-centric functions throughout, and they also offer components that can be easily unit-tested.
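A minimal sketch of that kind of presentation object, again over an RDF/JSON-style nested array; the class name here is made up for illustration:

<?php
// a thin wrapper exposing just the values a template needs,
// hiding the graph walking from the template author
class SportingEventView
{
    private $index;
    private $uri;

    public function __construct(array $index, $uri)
    {
        $this->index = $index;
        $this->uri = $uri;
    }

    // first rdfs:label of the event, or an empty string if there isn't one
    public function label()
    {
        $p = 'http://www.w3.org/2000/01/rdf-schema#label';
        return isset($this->index[$this->uri][$p][0]['value'])
            ? $this->index[$this->uri][$p][0]['value']
            : '';
    }
}

In a template you would then just echo htmlspecialchars($view->label()) rather than walking the graph inline.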

Conclusion

Going back to the question, I would work mostly with the graph using simple tools that make accessing the data within it easier and I would group functionality for rendering different kinds of things into classes to provide scope. That’s not MVC, not really, but it’s close enough to it that you get the benefits of OOP and MVC without the overhead of keeping OOP and RDF models totally aligned.