Pairwise Comparisons of Large Datasets

It’s been a while since I last posted. Work’s been busy, interesting, challenging :) But now it’s the holidays and I have some time to write.

At work we’ve been building a small team around Big Data technologies; specifically Hadoop and Elasticsearch right now though those choices may change. Unlike many Big Data projects we’re not de-normalising our data for a specific application. We have several different applications and thoughts in mind so we want to keep options open. We’re working with graph-based data structures; Linked Data, essentially.

The first product we’re building is niche and the community of users are quite private about how they do business so as I’ve said before I won’t be talking much about that. That sounded kinda creepy 8-| they’re not the mafia, they’re really nice people!

What I can share with you is a little technique we’ve developed for doing pairwise comparisons in map/reduce.

We all know map/reduce is a great way to solve some kinds of problems, and Hadoop is a great implementation that allows us to scale map/reduce solutions across many machines. One class of problem that is hard to do in map/reduce is the pairwise comparison. Let me first describe what I mean by a pairwise comparison…

Imagine you have a collection of documents. You want to know which ones are similar to which others. One way to do this is to compare every document with every other document and give the connection between them a similarity score. That is hard to do with a large collection of documents because of the number of comparisons – the problem is O(n²). Specifically, if we assume we don’t compare documents with themselves and that ɑ compared with β is the same as β compared with ɑ, then the number of comparisons is (n²-n)/2.
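Just to make those numbers concrete with the 26-word example used below, here is a trivial check of the formula (the class name is purely illustrative):

public class ComparisonCount {
  public static void main(String[] args) {
    int n = 26;                      // one 'document' per word of the phonetic alphabet
    int fullGrid = n * n;            // every cell, including self-comparisons and mirrored pairs
    int needed = (fullGrid - n) / 2; // drop the diagonal, then halve: (n² - n) / 2
    System.out.println(needed);      // 325, rather than 676
  }
}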

If you want to scale this out across a cluster, the specific difficulty is knowing what to compare next and what’s already been done. Most approaches I’ve seen use some central coordinator and require that every box in the cluster can access some central document store. Both of those cause more problems for very large sets.

Other approaches rely on re-defining the problem. One approach is to create some kind of initial grouping based on an attribute such as a subject classification and then only compare within those groupings. That’s a great approach and is often very suitable. Another approach is to generate some kind of compound key describing the document and then connect all documents with the same key. That also works well and means each document can have its key generated independently of the others; it scales really well but is not always possible.

What if we really do want to compare everything with everything else? That’s the situation I’ve been looking at.

Let’s simplify the example a little. We’ll use the words of the phonetic alphabet, alpha to zulu, to represent our set of documents:

Alpha Bravo Charlie Delta Echo Foxtrot Golf Hotel India Juliet Kilo Lima Mike November Oscar Papa Quebec Romeo Sierra Tango Uniform Victor Whiskey X-ray Yankee Zulu

A pairwise comparison can be viewed as a table with the same terms heading both rows and columns. This gives us a way of thinking about the workload. The smallest unit we can package as a piece of work is a cell in the table; the similarity score for which would be the comparison of the row and column headings.

        | Alpha | Bravo | Charlie |  …  | Yankee | Zulu
Alpha   |       |   ✓   |    ✓    |  ✓  |   ✓    |  ✓
Bravo   |       |       |    ✓    |  ✓  |   ✓    |  ✓
Charlie |       |       |         |  ✓  |   ✓    |  ✓
…       |       |       |         |  …  |   ✓    |  ✓
Yankee  |       |       |         |     |        |  ✓
Zulu    |       |       |         |     |        |

The cells we need to calculate are the ticked ones, above the diagonal. Using the cell as the unit of work is nice and simple – compare the similarity of two things – so being able to work at this level would be great. Thinking about map/reduce, the pair and their similarity score is the final result we’re looking for, so it could be the output of the reducer code. That leaves the mapper to create the pairs.

A simplistic approach to creating pairs in the mapper would be to iterate over all of the values:

Receiving ‘Alpha’ as input:
1) read ‘Alpha’ and ignore it
2) read ‘Bravo’ and output ‘Alpha, Bravo’
3) read ‘Charlie’ and output ‘Alpha, Charlie’

25) read ‘Yankee’ and output ‘Alpha, Yankee’
26) read ‘Zulu’ and output ‘Alpha, Zulu’

This is not a good approach: it means the mapper has to read all of the values for each input value. Remember that we can’t assume the set will fit in memory, so we can’t keep a full set inside each mapper to iterate quickly. The reading of values is then O(n²). The mapper has to do this in order to generate the pairs that will then be compared by the reducer. With this approach the mapper requires access to the full set of input values each time it processes a value. So, we’ve managed to remove the need for a central coordinator, but not for a centrally accessible store.

What we need to find is a way of generating pairs without having to iterate the full input set multiple times. Our mental model of a table gives us a possible solution for that — coordinates. If we could generate pairs of values using coordinates as keys then the sort that occurs between the map and reduce will bring together pairs of values at the same coordinate — a coordinate identifying a cell:

           |   1   |   2   |    3    |  …  |   25   |   26
           | Alpha | Bravo | Charlie |  …  | Yankee | Zulu
1  Alpha   |       |  1,2  |   1,3   |  …  |  1,25  |  1,26
2  Bravo   |       |       |   2,3   |  …  |  2,25  |  2,26
3  Charlie |       |       |         |  …  |  3,25  |  3,26
…          |       |       |         |  …  |   …    |   …
25 Yankee  |       |       |         |     |        | 25,26
26 Zulu    |       |       |         |     |        |

This changes what our mapper needs to know. Rather than having to know every other value, we only need to know our own position and every other coordinate. If we use sequential, incrementing values for the coordinates then we don’t need to query for them; we can simply calculate them. To do that, the mapper needs to know the row/column number of the current value it’s been given and the total number of rows/columns in the square. The total can be passed in as part of the job configuration.

Getting the position of our value within the input sequence is a little tricky. The TextInputFormat reads input files line by line and passes each line to the mapper. If the key it passed to the mapper were the line number that would make this problem very easy to solve. Unfortunately it passes the byte offset within the file. One way to know the position, then, would be to use fixed-lengths for the values. That way the byte offset divided by the fixed length would calculate the position. Alternatively we could pre-process the file and create a file of the form ’1 [tab] Alpha’ to provide the position explicitly. This requires that we perform a single-threaded pass over the entire input set to generate an incrementing position number — not ideal.

It also means that if your comparison takes less time than creating a position-indexed file then this approach won’t be useful to you. In our case it is useful.
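For what it’s worth, the pre-processing pass itself is small. Here’s a minimal sketch, assuming a plain local file with one value per line and writing out the ‘1 [tab] Alpha’ form described above; the class name and file handling are purely illustrative:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.PrintWriter;

public class PositionIndexer {
  public static void main(String[] args) throws Exception {
    // Single-threaded pass: number each value so the mapper can recover
    // its position directly from the key.
    try (BufferedReader in = new BufferedReader(new FileReader(args[0]));
         PrintWriter out = new PrintWriter(new FileWriter(args[1]))) {
      String value;
      int position = 1;
      while ((value = in.readLine()) != null) {
        out.println(position + "\t" + value.trim());
        position++;
      }
      System.out.println("Indexed " + (position - 1) + " values");
    }
  }
}

A handy side effect is that the same pass tells you the total number of values, which the job configuration will need anyway.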

The mapper logic for a coordinate approach becomes:

1) read ‘Alpha’
2) output ‘Alpha’ to the coordinates of cells where it should be compared.

A naive implementation of this would output ‘Alpha’ to cells 1,1 to 1,26 for the top row and 1,1 to 26,1 for the left-most column. That would create an n² grid, but we know we can optimise that to (n²-n)/2, in which case Alpha would be output to cells 1,2 to 1,26 only; the ticked cells in our example. A middle-position value, Lima, would be output to cells 1,12 to 11,12 and 12,13 to 12,26. This means the mappers only have to pass over the input values a single time – O(n).

in code:

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Expects the position-indexed input described above ("1 [tab] Alpha"),
// read with KeyValueTextInputFormat so the key is the word's position.
// Coordinates are emitted as "row,col" with row > col: the mirror image
// of the grid drawn above, covering the same (n²-n)/2 pairs.
public class PairwiseMap extends Mapper<Text, Text, Text, Text> {

  // Emit this word along its own row: (row,1) .. (row,row-1).
  private static void outputRow(int row, Text name, Context context)
      throws IOException, InterruptedException {
    for (int col = 1; col < row; col++) {
      context.write(new Text(String.format("%d,%d", row, col)), name);
    }
  }

  // Emit this word down its own column: (pos+1,pos) .. (total,pos).
  private static void outputColumn(int wordPosition, Text name, int total, Context context)
      throws IOException, InterruptedException {
    for (int row = wordPosition + 1; row <= total; row++) {
      context.write(new Text(String.format("%d,%d", row, wordPosition)), name);
    }
  }

  @Override
  protected void map(Text key, Text value, Context context)
      throws IOException, InterruptedException {
    // The size of the input set is passed in via the job configuration.
    int total = Integer.parseInt(context.getConfiguration().get("TotalInputValues"));
    // The key is the position written during pre-processing. Text.getBytes()
    // can include stale bytes beyond getLength(), so parse via toString().
    int line = Integer.parseInt(key.toString().trim());
    outputRow(line, value, context);
    outputColumn(line, value, total, context);
  }
}
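The post only covers the mapper, so for completeness here is a minimal sketch of how the rest of the job might be wired up: a reducer that receives the two words sharing a coordinate and scores them, and a driver that sets TotalInputValues and uses KeyValueTextInputFormat so the position from the pre-processed file arrives as the mapper’s key. None of this is prescriptive, and similarity() is just a placeholder for whatever comparison you actually need.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PairwiseReduce extends Reducer<Text, Text, Text, Text> {

  @Override
  protected void reduce(Text coordinate, Iterable<Text> pair, Context context)
      throws IOException, InterruptedException {
    // Each coordinate should receive exactly two words from the mapper.
    String first = null;
    String second = null;
    for (Text word : pair) {
      if (first == null) {
        first = word.toString();
      } else {
        second = word.toString();
      }
    }
    if (first == null || second == null) {
      return; // incomplete pair; nothing to compare
    }
    double score = similarity(first, second);
    context.write(new Text(first + "," + second), new Text(String.valueOf(score)));
  }

  // Placeholder comparison: substitute the real document similarity here.
  private static double similarity(String a, String b) {
    return a.equalsIgnoreCase(b) ? 1.0 : 0.0;
  }
}

class PairwiseJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("TotalInputValues", "26"); // the mapper needs the size of the set up front

    Job job = Job.getInstance(conf, "pairwise comparison");
    job.setJarByClass(PairwiseJob.class);
    job.setInputFormatClass(KeyValueTextInputFormat.class); // key = position, value = word
    job.setMapperClass(PairwiseMap.class);
    job.setReducerClass(PairwiseReduce.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Run over the 26-word example, this would emit the 325 scored pairs.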

This solution is effective but the pre-processing and the need to know the total are both frustrating limitations.

I can’t think of a better way to get the position, either with input files in HDFS or with rows in an HBase table. If you have a super-clever way to know the position of a value in a sequence, that would help a lot. Maybe a custom HBase input format is a possibility.

Any suggestions for improvements would be gratefully received :)

 

Room Not Found

I was staying in a hotel last week. I arrived and they gave me room 404. I went to the fourth floor and looked and looked but couldn’t find it.

I went back down to reception and they were very apologetic; they’d recently combined it with 403 and it was no longer there. They checked me into room 200. I said ‘OK’.

#this_really_happened.

Big Data, Large Batches and My Mistake

This is week 9 for me in my new challenge at Callcredit. I wrote a bit about what we’re doing last time and can’t write much about the detail right now as the product we’re building is secret. Credit bureaus are a secretive bunch, culturally. Probably not a bad thing given what they know about us all.

Don’t expect a Linked Data tool or product. What we’re building is firmly in Callcredit’s existing domain.

As well as the new job, I’ve been reading Eric Ries’ The Lean Startup, tracking Big Data news and developing this app. This weekend the combination of these things became a perfect storm that led me to a D’Oh! moment.

One of the many key points in Lean Startup is to maximise learning by getting stuff out as quickly as possible. The main aspect of getting stuff out is to work in small batches. There are strong parallels here with Agile development practices and the need to get a single end-to-end piece of functionality working quickly.

This GigaOm piece on Hadoop’s days being numbered describes the need for faster, smaller batches too; in the context of data analysis responses and incremental changes to data. It introduces a number of tools, some of which I’ve looked at and some I haven’t.

The essence of moving to tools like Percolator, Dremel and Giraph is to reduce the time to learning; to shorten the time it takes to get round the data processing loop.

So, knowing all of this, why have I been working in large batches? I’ve spent the last few weeks building out quite detailed data conversions, but without a UI on the front to make any value from it! Why, given everything I know and all that I’ve experienced, didn’t I build a narrow end-to-end system that could be quickly broadened out?

A mixture of reasons, none of which are really valid; just tricks of the mind.

Yesterday I started to fix this and built a small-batch, end-to-end run that I can release soon for internal review.
:)

Linked Data, Big Data and Your Data

Week five of my new challenge and I figured I really should get around to scribbling down some thoughts. I talked in my last post about RDF and graphs being useful inside the enterprise; and here I am, inside an enterprise.

Callcredit is a data business. A mixture of credit reference agency (CRA) and consumer data specialists. As the youngest of the UK’s CRAs, 12 years old, it has built an enviable position and is one of few businesses growing strongly even in the current climate. I’ve worked with CRAs from the outside, during my time at Internet bank Egg. From inside there’s a lot to learn and some interesting opportunities.

Being a CRA, we hold a lot of data about the UK population – you. Some of this comes from the electoral roll, much of it from the banks. Banks share their data with the three CRAs in order to help prevent fraud and lower the risk of lending. We know quite a lot about you.

Actually, if you want to see what we know, check out your free credit report from Noddle – part of the group.

Given the kind of data we hold, you’d hope that we’re pretty strict about security and access. I was pleased to see that everyone is. Even the data that is classed as public record is well looked after; there’s a very healthy respect for privacy and security here.

The flip side to that is wanting those who should have access to be able to do their job the best way possible; and that’s where big data tools come in.

As in my previous post, variety in the data is a key component here. Data comes from lots of different places and folks here are already expert at correcting, matching and making it consistent. Volume also plays a part. Current RDBMS systems here have in excess of 100 tables tracking not only data about you but also provenance data, so we know where information came from, and audit data, so we know who’s been using it.

Over the past few weeks I’ve been working with the team here to design and start building a new product using a mix of Hadoop and Big Data® for the data tiers and ASP.net for the web UI, using Rob Vesse’s dotNetRDF. The product is commercially sensitive so I can’t tell you much about that yet, but I’ll be blogging some stuff about the technology and approaches we’re using as I can.

Fun :)

RDF, Big Data and The Semantic Web

I’ve been meaning to write this post for a little while, but things have been busy. So, with this afternoon free I figured I’d write it now.

I’ve spent the last 7 years working intensively with data. Mostly not with RDBMSs, but with different Big Data and Linked Data tools. Over the past year things have changed enormously.

The Semantic Web

A lot has been talked about the Semantic Web for a long time now. In fact, I often advise people to search for Linked Data rather than Semantic Web, as the results are vastly more useful in a practical context. The Semantic Web has been a rather unfortunately academic endeavour that has been very hard for many developers to get into. In contrast, Linked Data has seen explosive growth over the past five years. It hasn’t gone mainstream though.

What does show signs of going mainstream is the schema.org initiative. This creates a positive feedback loop between sites putting structured data into their pages and search engines giving those sites more and better leads as a result.

Much has been said about Microdata killing RDF blah, blah but that’s not important. What is important is that publishing machine-understandable data on the web is going mainstream.

As an aside, as Microdata extends to solve the problems it currently has (global identifiers and meaningful links) it becomes just another way to write down the RDF model anyway. RDF is an abstract model, not a data format, and at the moment Microdata is a simplified subset of that model.

Big Data and NoSQL

In the meantime another data meme has also grown enormously. In fact, it has dwarfed Linked Data in the attention it has captured. That trend is Big Data and NoSQL.

In Planning for Big Data, Edd talks about the three Vs:

To clarify matters, the three Vs of volume, velocity and variety are commonly used to characterize different aspects of big data. They’re a helpful lens through which to view and understand the nature of the data and the software platforms available to exploit them. Most probably you will contend with each of the Vs to one degree or another.

Most Big Data projects are really focussed on volume. They have large quantities, terabytes or petabytes, of uniform data. Often this data is very simple in structure, such as tweets. Fewer projects are focussed on velocity, being able to handle data coming in quickly and even fewer on variety, having unknown or widely varied data.

You can see how the Hadoop toolset is tuned to this and also how the NoSQL communities focus mostly on denormalisation of data. This is a good way to focus resources if you have large volumes of relatively simple, highly uniform data and a specific use-case or queries.

Apart from Neo4J, which is the odd one out in the Big Data community, this is the approach.

RDF

So, while we wait for the semantic web to evolve, what is RDF good for today?

That third V of the Big Data puzzle is where I’ve been helping people use graphs of data (and that’s what RDF is, a graph model). Graphs are great where you have a variety of data that you want to link up. Especially if you want to extend the data often and if you want to extend the data programmatically — i.e. you don’t want to commit to a complete, constraining schema up-front.

The other aspect of that variety in data that graphs help with is querying. As Jem Rayfield (BBC News & Sport) explains, using a graph makes the model simpler to develop and query.

Graph data models can reach higher levels of variety in the data before they become unwieldy. This allows more data to be mixed and queried together. Mixing in more data adds more context, and more context allows for more insight. Insight is what we’re ultimately trying to get at with any data analysis. That’s why the intelligence communities have been using graphs for many years now.

What we’re seeing now, with the combination of Big Data and graph technologies, is the ability to add value inside the enterprise. Graphs are useful for data analysis even if you don’t intend to publish the data on the semantic web. Maybe even especially then.

Microsoft, Oracle and IBM are all playing in the Big Data space and have been for some time. What’s less well-known and less ready for mainstream is that they all have projects in the graph database space: DB2 NoSQL Graph Store, Oracle Database Semantic Technologies and Connected Services Framework 3.0.

Behind the scenes, in the enterprise, is probably the biggest space where graphs will be adding real value over the next few years; solving the variety problem in Big Data.

What next?

I’m leaving Talis.

For the past seven years I have had the great fortune to learn a huge amount from awesome people. That has put me in the position of having some great conversations about what I’ll be doing next; and those conversations are exciting. More on that in a later post. First, how can someone be happy to be leaving a great company?

Back in late 2004 I joined a small library software vendor with some interesting challenges, Talis. Since then we have become one of the best known Linked Data and Semantic Web brands in the world. On that journey I have learnt so much. I’ve learnt everything from an obscure late 1960s data format (MARC) to Big Data conversions using Hadoop. The technology has been the least of it though.

I’ve been rewarded with the opportunity to hear some of the smartest people in the world speak in some amazing places. I’ve pair-programmed with Ian Davis and had breakfast with Tim Berners-Lee; I’ve seen the Rockies in Banff and walked the Great Wall of China. As a result of our brand and the work we’ve done I’ve been invited to help write the first license for open data; train government departments on how to publish their data and talk about dinosaurs with Tom Scott at the BBC.

Talis has always been about the people. People in Talis (Talisians); people outside we’ve worked with and bounced ideas off; customers who have allowed us to help with exciting projects. I have made some great friends and been taught some humbling lessons.

Amongst the sharpest highlights has been an enormously rewarding day job. At the start re-imagining Talis Base and then Talis Prism; seeding an education-focussed business and recently building an expert, international consultancy.

I joined Talis expecting to stay for a few years and found the journey so rewarding it has kept me for so much longer. It’s now time for my journey and Talis to diverge as I think about doing something different.

I still have a couple of consulting engagements to finalise, so if you’re one of those then please don’t panic; we’ll be talking soon.

There is no “metadata”

For a while I’ve been avoiding using the term metadata for a few reasons. I’ve had a few conversations with people about why and so I thought I’d jot the thoughts down here.

First of all, the main reason I stopped using the term is because it means too many different things. Wikipedia recognises metadata as an ambiguous term

The term metadata is an ambiguous term which is used for two fundamentally different concepts (types). Although the expression “data about data” is often used, it does not apply to both in the same way. Structural metadata, the design and specification of data structures, cannot be about data, because at design time the application contains no data. In this case the correct description would be “data about the containers of data”. Descriptive metadata, on the other hand, is about individual instances of application data, the data content. In this case, a useful description (resulting in a disambiguating neologism) would be “data about data content” or “content about content” thus metacontent. Descriptive, Guide and the National Information Standards Organization concept of administrative metadata are all subtypes of metacontent.

and even within the world of descriptive metadata the term is used in many different ways.

I have always been able to find a better, more accurate, complete and consistent term, such as catalogue, provenance, audit, licensing and so on. I haven’t come across a situation yet where a more specific term hasn’t helped everyone understand the data better.

Data is just descriptions of things and if you say what aspects of a thing you are describing then everyone gets a better sense of what they might do with that. Once we realise that data is just descriptions of things, written in a consistent form to allow for analysis, we can see the next couple of reasons to stop using metadata.

Meta is a relative term. Ralph Swick of W3C is quoted as saying

What’s metadata to you, is someone else’s fundamental data.

That is to say, whether you consider something meta or not depends totally on your context and the problem you’re trying to solve. Often several people in the room will consider this differently.

If we combine that thought with the more specific naming of our data then we get the ability to think about descriptions of descriptions of descriptions. Which brings me on to something else I observe. By thinking in terms of data and metadata we talk, and think, in a vocabulary limited to two layers. Working with Big Data and Graphs I’ve learnt that’s not enough.

Taking the example of data about TV programming from today’s RedBee post, we could say:

  1. The Mentalist is a TV Programme
  2. The Mentalist is licensed to Channel 5 for broadcast in the UK
  3. The Mentalist will be shown at 21.00 on Thursday 12 April 2012

Statement 2 in that list is licensing data, statement 3 is schedule data. This all comes under the heading of descriptive metadata. Now, RedBee are a commercial organisation who put constraints on the use of their data. So we also need to be able to say things like

  • Statements 1, 2 and 3 are licensed to BBC for competitor analysis

This statement is also licensing data, about the metadata… So what is it? Descriptive metametadata?

Data about data is not a special case. Data is just descriptions of things and remains so whether the things being described are people, places, TV programmes or other data.

That’s why I try to replace the term metadata with something more useful whenever I can.

Getting over-excited about Dinosaurs…

I had the great pleasure, a few weeks ago, of working with Tom Scott and Michael Smethurst at the BBC on extensions to the Wildlife Ontology that sits behind Wildlife Finder.

In case you hadn’t spotted it (and if you’re reading this I can’t believe you haven’t) Wildlife Finder provides its information in HTML and RDF — Linked Data, providing a machine-readable version of the documents for those who want to extend or build on top of it. Readers of this blog will have seen Wildlife Finder showcased in many, many Linked Data presentations.

The initial data modelling work was a joint venture between Tom Scott of BBC and Leigh Dodds of Talis and they built an ontology that is simple, elegant and extensible. So, when I got a call asking if I could help them add Dinosaurs into the mix I was chuffed — getting paid to talk about dinosaurs!

Like most children, and we’re all children really, I got over-excited and rushed up to London to find out more. Tom and I spent some time working through changes and he, being far more knowledgeable than I on these matters, let me down gently.

Dinosaurs, of course, are no different to other animals in Wildlife Finder — other than being dead for a while longer…

This realisation made me feel a little below average in the biology department I can tell you. It’s one of those things you stumble across that is so obvious once someone says it to you and yet may well not have occurred to you without a lot of thought.

 

Choosing URIs, not a five minute task.

This post originally appeared on the Talis Consulting Blog.

Chris Keene at Sussex is having a tough time making a decision on his URIs so I thought I’d wade in and muddy the waters a little.

He’s following advice from the venerable Designing URI Sets for the UK Public Sector. An eleven page document from the heady days of October 2009.

Chris discusses the choice between data.lib.sussex.ac.uk and www.sussex.ac.uk/library/ in terms of elegance, data merging and running infrastructure. He’s leaning toward data.lib.sussex.ac.uk on the basis that data.organisation.tld is the prevailing wind.

There are many more aspects worth considering, and while data.organisation.tld may be a way to get up and running quickly, you might get longer-term benefit from more consideration; after all, we don’t want these URIs to change.

The key requirements are outlined well in ‘Designing URI Sets’ as follows

3. In particular, the domain will:

  • Expect to be maintained in perpetuity
  • Not contain the name of the department or agency currently defining and naming a concept, as that may be re-assigned
  • Support a direct response, or redirect to department/agency servers
  • Ensure that concepts do not collide
  • Require the minimum of central administration and infrastructure costs
  • Be scalable for throughput, performance, resilience

These are all key points, but one in particular stands out for me in terms of choosing the hostname part of a URI

  • Not contain the name of the department or agency currently defining and naming a concept, as that may be re-assigned

That simple sentence contains a lot more than at first reading and suggests that any or all of the concepts defined in the data may become someone else’s responsibility in time. I think over time we will see this becoming key to the longevity of URIs, along with much better redirect maintenance.

The approach data.gov.uk has taken is to break the data into broad subject areas within which many different types of data might sit – education.data.gov.uk, transport.data.gov.uk, crime.data.gov.uk, health.data.gov.uk and so on. This is one example of breaking up the hosts and while right now they all point to one cluster of web servers they can be moved around to allow hosting in different places.

This is good, yet I can’t help thinking that those subject matter areas are really rather broad. Then there are others that seem to work on a different axis, statistics.data.gov.uk and research.data.gov.uk, leaving me confused at first glance as to where the responsibility for publishing crime research would lie. Then there is patents.data.gov.uk: not “innovation” or “invention” but “patents”, the things listed.

Data.gov.uk has done a great job trailblazing, making and publishing their decisions and allowing others to learn from them, develop on them and contribute back. I think we can push their thinking on hostnames still further. If we consider Linked Data to be descriptions of things, rather than publishing data, then directories of those things would be useful.

For example, we could give somebody the responsibility of publishing a list of all schools in the UK at schools.example.gov and that would be one part of the puzzle. A different group may have the responsibility of publishing the list of all universities and yet another the list of all companies at companies.example.gov.

Of course, we would expect all of these to interlink, patents.example.gov would have links to companies.example.gov and universities.example.gov to document the ownership of patents. We’d expect to see links in schools.example.gov to inspections.schools.example.gov and so on.

Notice that I’ve dropped the word data from those examples, as much of this is about making machine (and human) readable descriptions of things. It’s only because we describe lots of things at the same time, and describe them uniformly, that we call it data.

I’d still expect health.example.gov to appear as well, but the responsibility would be one of aggregating what could be considered health data in order to support querying; it would aggregate doctors.example.gov, hospitals.example.gov and more. I would expect as many of these aggregates to pop up as are useful.

Of course, in this approach, as in the current data.gov.uk approach, everyone who wants to say something about a particular doctor, school or patent has to be able to get access to that host to say it and, perhaps, conflicting things said by different people get mixed up.

At this point you’re probably thinking well, we might as well just use data.organisation.tld and be done with it then. Unfortunately that simply moves the same design decisions from the hostname to the resource part of the URI, the bit after the hostname. You still have to make decisions, and with only one hostname your hosting options are drastically reduced.

Data.gov.uk places the type of thing in the resource part of the URI using what they call concept/reference pairs:

2. Examples of concept/reference pairs:
• road/M5
• school/123
3. The concept/reference construct may be repeated as necessary, for example:
• road/M5/junction/24
• school/123/class/5

I tend to do this slightly differently, using container/reference pairs so I would use “roads” rather than “road” as this lends itself better to then putting listings at those URIs.

The antithesis

We can often learn something by turning an approach on its head. In this case I wonder what would happen if we embraced the idea that many people will have different world-views about the same thing, their own two-penneth so to speak. None of them necessarily authoritative.

In that case we end up with me publishing data on data.my.domain and you publishing data about the same things on data.your.domain. Just as happens all over the web today. If I choose my domains carefully then maybe I can hand bits on as I find someone else to run them better, as above, but always there is more than one world-view.

There are two common ways to make this work and be interconnected. One approach is to use owl:sameAs to indicate that data.my.domain/Winston_Churchill and data.your.domain/Winston_Churchill are describing the very same thing. The OWL community is not entirely supportive of that use.

The other approach is to use the annotation pattern and rdfs:seeAlso; in which case documents describing a resource live in many places, but they agree on a single, canonical, URI.

So what would that mean for Sussex?

Well, I’m not sure.

Fortunately, Chris has a limited decision to make right now, choosing a URI, or URIs, for the Mass Observation Archive. It is for this he is considering data.lib.sussex.ac.uk and www.sussex.ac.uk/library/.

Thinking about changing responsibilities over time, I have to say I would choose neither. It is perfectly conceivable that the Mass Observation Archive may at some time move and not be under the remit of the University of Sussex Library, or even the university.

I would choose a hostname that can travel with the archive wherever it may live. Fortunately it already has one, http://www.massobs.org.uk/. Ideally the catalogue would live at something like catalogue.massobs.org.uk, or maybe massobs.org.uk/archive, or something like that.

My leaning on this is really because this web of data isn’t something separate from the web of documents, it’s “as well as” and “part of” the web as one whole thing. data.anything makes it somehow different; which in essence it’s not.

Postscript

Oh, just one more thing…

URI type, for example one of:
• id – Identifier URI
• doc – Document URI, Representation URI
• def – Ontology URI
• set – Set URI

Personally, I really dislike this URI pattern. It leaves the distinguishing piece early in the URI, making it harder to spot the change as the server redirects and harder to select or change when working with the URIs.

I much prefer the pattern

/container/reference to mean the resource
/container/reference.rdf for the rdf/xml
/container/reference.html for the html

and expanding to

/container/reference.json, /container/reference.nt, /container/reference.xml and on and on.

My reasoning is simple, I can copy and paste the document URI from the address bar, paste it to curl on the command line and simply backspace a few to trim off the extension. Also, in the browser or wget, this pattern gives us files named something.html and something.rdf by default. Much easier to work with in most tools.

In summary, I don't like writing more code than I have to…

* This post first appeared on the Talis Consulting blog.

I opened my mailbox the other morning to a question from David Norris at BBC. They’ve been doing a lot of Linked Data work and we’ve been helping them on projects for a good while now.

The question surrounds an ongoing debate within their development community and is a very fine question indeed:

We are looking at our architecture for the Olympics. Currently, we have:

1. a data layer comprised of our Triple Store and Content store.
2. a service layer exposing a set of APIs returning RDF.
3. a presentation layer (PHP) to render the RDF into the HTML.

All fairly conventional – but we have two schools of thought:

Do the presentation developers take the RDF and walk the graph (say using something like easyRDF) and pull out the properties they need?

Or:

Do we add a domain model in PHP on top of the easyRDF objects, such that developers are abstracted from the RDF and can work with higher-level domain objects instead, like athlete, race etc.?

One group is adamant that we should only work with the RDF, because that *is* the domain model, and it’s a performance hit (especially in PHP) and is just not the “Semantic Web way” to add another domain model.

Others advocate that adding a domain model is a standard OO approach and is the ‘M’ in ‘MVC’: the fact that the data is RDF is irrelevant.

My opinion is that it comes down to the RDF data, and therefore the ontology: if the RDF coming through to the presentation layer is large and generic, it may benefit from having a model on top to provide more high-level, relevant domain objects. But if the RDF is already fairly specific, like an athlete, then walking through the RDF that describes that athlete is probably easy enough and wouldn’t require another model on top of it. So I think it depends on the ontology being modelled closely enough to what the presentation layer needs.

What do you think? I’d be really interested in your view.

Having received it, I figured a public answer would be really useful for people to consider and chime in on in the comments; David kindly agreed.

First up, the architecture in use here is nicely conventional; simple and effective. The triple store is storing metadata and the XML content store is storing documents. We would tend to put everything into the triple store by either re-modelling the XML in RDF or using XML Literals, but this group need very fast document querying using xpath and the like, so keeping their existing XML content store is a very sensible move. Keep PHP, or replace it with a web scripting language of your choice, and you have a typical setup for building webapps based on RDF.

The question is totally about what code and how much code to write in that presentation layer and why, the data storage and data access layers are sorted, giving back RDF. Having built a number of substantial applications on top of RDF stores, I do have some experience in this space and I’ve taken both of the approaches discussed above – converting incoming RDF to objects and simply working with the RDF.

Let’s get one thing out of the way – RDF, when modelled well, is domain-modelled data. With SQL databases there are a number of compromises required to fit within tables that create friction between a domain model and the underlying SQL schema (think many-to-many). Attempting to hide this is the life’s work of frameworks like Hibernate and much of Rails. If we model RDF as we would a SQL schema then we’ll have the same problems, but the IAs and developers in this group know how to model RDF well, so that shouldn’t be a problem.

With RDF being domain-modelled data, and a graph, it can be far simpler to map incoming RDF to objects in your code than it is with SQL databases. That makes the approach seem attractive. There are, however, some differences too. By looking at the differences we can get a feel for the problem.

Cardinality & Type

When consuming RDF we generally don’t want to make any assumptions about cardinality – how many of some property there will be. With properties in our data we can cope with this by making every member variable an array, or by keeping only the first value we find if we only ever expect one. Neither is ideal but both approaches work to map the RDF into object properties.

When we come to look at types, classes of things, we have a harder problem, though. It’s common, and useful, in RDF to make type statements about resources and very often a resource will have several types. Types are not a special case in RDF, just as with other properties there can be many of them. This presents a problem in mapping to an OOP object model where an object is of one type (with supertypes, admittedly). You can specify multiple types in many OOP languages, often through the use of interfaces, but you do this at the class level and it is consistent across all instances. In RDF we make type statements at the instance level, so a resource can be of many types. Mapping this, and maintaining a mapping in your OOP code will either a) be really hard or b) constrain what you can say in the data. Option b is not ideal as it can prevent others from doing stuff in the data and making more use of it.

Part of this mismatch on type systems comes from the OOP approach of combining data and behaviour into objects together. Over time this has been constrained and adapted in a number of ways (no multiple inheritance, introduction of interfaces) in order to make a codebase more manageable and hopefully prevent coders getting themselves too tied up in knots. RDF carries no behaviour, it’s a description of something, so the same constraints aren’t necessary. This is the main issue you face mapping RDF to an OOP object model.

Programming Style

What we have ended up with, in libraries like Moriarty, are tools that help us work with the graph quickly and easily. SimpleGraph has functions like get_subjects_of_type($t) which returns a simple array of all the resource URIs of that type. You can then use those in get_subject_subgraph($s) to extract part of the graph to hand off to something else, say a render function.

Moriarty’s SimpleGraph has quite a number of these kinds of functions for making sense of the graph without ever having to work with the underlying nested arrays directly. This pairs up very nicely with functions to do whatever it is you want to do.

$events = $graph->get_subjects_of_type(Ontologies::Sport . 'Event');
foreach ($events as $event) {
  render_sporting_event($event);
}

Of course, functions in PHP and other scripting languages are global, and that’s really not nice, so we often want to scope those and that’s where objects tend to come back into play.

Say we’re rendering information about a sporting event the pseudocode might look something like this:

$events = $graph->get_subjects_of_type(Ontologies::Sport . 'Event');
foreach ($events as $event) {
  SportingEvent::render($event);
}

This approach differs from an MVC approach because the graph isn’t routinely and completely converted into domain model objects, as that approach is very constraining. What it does is combine graph handling using SimpleGraph with objects for code scoping; by late-binding the graph parts and the objects used to present them, the graph is not constrained by the OOP approach.

If you’re using a more templated approach, so you don’t want a render() function, then simple objects that give access to the values for display are a good approach; they can make the code more readable than using graph-centric functions throughout, and they offer components that can be easily unit-tested.

Conclusion

Going back to the question, I would work mostly with the graph using simple tools that make accessing the data within it easier and I would group functionality for rendering different kinds of things into classes to provide scope. That’s not MVC, not really, but it’s close enough to it that you get the benefits of OOP and MVC without the overhead of keeping OOP and RDF models totally aligned.