Pairwise Comparisons of Large Datasets

It’s been a while since I last posted. Work’s been busy, interesting, challenging 🙂 But now it’s the holidays and I have some time to write.

At work we’ve been building a small team around Big Data technologies; specifically Hadoop and Elasticsearch right now though those choices may change. Unlike many Big Data projects we’re not de-normalising our data for a specific application. We have several different applications and thoughts in mind so we want to keep options open. We’re working with graph-based data structures; Linked Data, essentially.

The first product we’re building is niche and the community of users are quite private about how they do business so as I’ve said before I won’t be talking much about that. That sounded kinda creepy 8-| they’re not the mafia, they’re really nice people!

What I can share with you is a little technique we’ve developed for doing pairwise comparisons in map/reduce.

We all know map/reduce is a great way to solve some kinds of problems and Hadoop is a great implementation that allows us to scale map/reduce solutions across many machines. One of the class of problems that is hard to do is pairwise comparisons. Let me first describe what I mean by a pairwise comparison…

Imagine you have a collection of documents. You want to know which ones are similar to which others. One way to do this is to compare every document with every other document and give the connection between them a similarity score. That is hard to do With a large collection of documents because of the number of comparisons – the problem is O(n²). Specifically, if we assume we don’t compare documents with themselves and that ɑ compared with β is the same as β compared with ɑ then the number of comparisons is (n²-n)/2.

If you want to scale this out across a cluster the specific difficulty is knowing what to compare next and what’s already been done. Most approaches I’ve seen use some central coordinator and require that every box in the cluster can access some central document store. Those cause more problems for very large sets.

Other approaches rely on re-defining the problem. One approach is to create some kind of initial grouping based on an attribute such as a subject classification and then only compare within those those groupings. That’s a great approach and is often very suitable. Another approach is to generate some kind of compound key describing the document and then connect all documents with the same key. That’s a great approach and means each document can have a key generated independently of the others. That scales really well but is not always possible.

What if we really do want to compare everything with everything else? That’s the situation I’ve been looking at.

Let’s simplify the example a little. We’ll use the words of the phonetic alphabet, alpha to zulu, to represent our set of documents:

Alpha Bravo Charlie Delta Echo Foxtrot Golf Hotel India Juliet Kilo Lima Mike November Oscar Papa Quebec Romeo Sierra Tango Uniform Victor Whiskey X-ray Yankee Zulu

A pairwise comparison can be viewed as a table with the same terms heading both rows and columns. This gives us a way of thinking about the workload. The smallest unit we can package as a piece of work is a cell in the table; the similarity score for which would be the comparison of the row and column headings.

Alpha Bravo Charlie Yankee Zulu

The cells we need to calculate are highlighted in green. Using the cell as the unit of work is nice and simple – compare the similarity of two things – so being able to work at this level would be great. Thinking about map/reduce, the pair and their similarity score is the final result we’re looking for, so could be the output of the reducer code. That leaves us with the mapper to create pairs.

A simplistic approach to the mapper creating pairs would be to iterate all of the values:

Receiving ‘Alpha’ as input:
1) read ‘Alpha’ and ignore it
2) read ‘Bravo’ and output ‘Alpha, Bravo’
3) read ‘Charlie’ and output ‘Alpha, Charlie’

25) read ‘Yankee’ and output ‘Alpha, Yankee’
26) read ‘Zulu’ and output ‘Alpha, Zulu’

This is not a good approach it means the mapper will need to read all of the values for each input value. Remember that we can’t assume that the set will fit in memory, so can’t keep a full set inside each mapper to iterate quickly. The reading of values is then O(n²). The mapper has to do this in order to generate the pairs that will then be compared by the reducer. With this approach the mapper requires access to the full set of input values each time it processes. So, we’ve managed to remove the need for a central coordinator but not for a centrally accessible store.

What we need to find is a way of generating pairs without having to iterate the full input set multiple times. Our mental model of a table gives us a possible solution for that — coordinates. If we could generate pairs of values using coordinates as keys then the sort that occurs between the map and reduce will bring together pairs of values at the same coordinate — a coordinate identifying a cell:

1 2 3 25 26
Alpha Bravo Charlie Yankee Zulu
1 Alpha
2 Bravo
3 Charlie
25 Yankee
26 Zulu

This changes what our mapper needs to know. Rather than having to know every other value we need to know our position and every other coordinate. If we use sequential, incremented, values for the coords then we don’t need to query for those, we can simply calculate them. To do that, the mapper needs to know the row/column number of the current value it’s been given and the total number of rows/columns in the square. The total can be passed in as part of the job configuration.

Getting the position of our value within the input sequence is a little tricky. The TextInputFormat reads input files line by line and passes each line to the mapper. If the key it passed to the mapper were the line number that would make this problem very easy to solve. Unfortunately it passes the byte offset within the file. One way to know the position, then, would be to use fixed-lengths for the values. That way the byte offset divided by the fixed length would calculate the position. Alternatively we could pre-process the file and create a file of the form ‘1 [tab] Alpha’ to provide the position explicitly. This requires that we perform a single-threaded pass over the entire input set to generate an incrementing position number — not ideal.

It also means that if your comparison takes less time than creating a position-indexed file then this approach won’t be useful to you. In our case it is useful.

The mapper logic for a coordinate approach becomes:

1) read ‘Alpha’
2) output ‘Alpha’ to the coordinates of cells where it should be compared.

A naive implementation of this would output ‘Alpha’ to cells 1,1 to 26,1 for the top row and 1,1 to 1,26 for the left most column. That would create a grid n² but we know we can optimise that to (n²-n)/2 in which case Alpha would be output to cells 1,2 to 1,26 only; the green cells in our example. A middle-position value, Lima, would be output on 1,12 to 11,12 and 11,13 to 11,26. This means the mappers only have to pass over the input values a single time – O(n).

in code:

public class PairwiseMap extends Mapper<Text, Text, Text, Text> {

private static void output_rows(int row, Text name, Context context) throws IOException, InterruptedException {

for (int col = 1; col < row; col++) {
 String key = String.format("%d,%d", row, col);
 context.write(new Text(key), name);

private static void output_cols(int wordPosition, Text name, int total, Context context) throws IOException, InterruptedException {
 int column = wordPosition;
 for (int row = wordPosition + 1; row <= total; row++) {
  String key = String.format("%d,%d", row, column);
  context.write(new Text(key), name);
 protected void map(Text key, Text value, Context context) throws IOException, InterruptedException {
  int total = Integer.parseInt(context.getConfiguration().get("TotalInputValues"));
  int line = Integer.parseInt(new String(key.getBytes()));
  output_rows(line, value, context);
  output_cols(line, value, total, context);

This solution is effective but the pre-processing and the need to know the total are both frustrating limitations.

I can’t think of a better way to get the position, either with input files in HDFS or with rows in a HBase table. If you have a super-clever way to know the position of a value in a sequence that would help a lot. Maybe a custom HBase input format might be a possibility.

Any suggestions for improvements would be gratefully received 🙂


Linked Data, Big Data and Your Data

Week five of my new challenge and I figured I really should get around to scribbling down some thoughts. I talked in my last post about RDF and graphs being useful inside the enterprise; and here I am, inside an enterprise.

Callcredit is a data business. A mixture of credit reference agency (CRA) and consumer data specialists. As the youngest of the UK’s CRAs, 12 years old, it has built an enviable position and is one of few businesses growing strongly even in the current climate. I’ve worked with CRAs from the outside, during my time at Internet bank Egg. From inside there’s a lot to learn and some interesting opportunities.

Being a CRA, we hold a lot of data about the UK population – you. Some of this comes from the electoral roll, much of it from the banks. Banks share their data with the three CRAs in order to help prevent fraud and lower the risk of lending. We know quite a lot about you.

Actually, if you want to see what we know, check out your free credit report from Noddle – part of the group.

Given the kind of data we hold, you’d hope that we’re pretty strict about security and access. I was pleased to see that everyone is. Even the data that is classed as public record is well looked after; there’s a very healthy respect for privacy and security here.

The flip side to that is wanting those who should have access to be able to do their job the best way possible; and that’s where big data tools come in.

As in my previous post, variety in the data is a key component here. Data comes from lots of different places and folks here are already expert at correcting, matching and making consistent. Volume also plays a part. Current RDBMS systems here have in excess of 100 tables tracking not only data about you but also provenance data so we know where information came from and audit data so we know who’s been using it.

Over the past few weeks I’ve been working with the team here to design and start building a new product using a mix of Hadoop and Big Data® for the data tiers and for the web UI, using Rob Vesse’s dotNetRDF. The product is commercially sensitive so I can’t tell you much about that yet, but I’ll be blogging some stuff about the technology and approaches we’re using as I can.

Fun 🙂

There is no “metadata”

For a while I’ve been avoiding using the term metadata for a few reasons. I’ve had a few conversations with people about why and so I thought I’d jot the thoughts down here.

First of all, the main reason I stopped using the term is because it means too many different things. Wikipedia recognises metadata as an ambiguous term

The term metadata is an ambiguous term which is used for two fundamentally different concepts (types). Although the expression “data about data” is often used, it does not apply to both in the same way. Structural metadata, the design and specification of data structures, cannot be about data, because at design time the application contains no data. In this case the correct description would be “data about the containers of data”. Descriptive metadata, on the other hand, is about individual instances of application data, the data content. In this case, a useful description (resulting in a disambiguating neologism) would be “data about data content” or “content about content” thus metacontent. Descriptive, Guide and the National Information Standards Organization concept of administrative metadata are all subtypes of metacontent.

and even within the world of descriptive metadata the term is used in many different ways.

I have always found a better, more accurate, complete and consistent term. Such as catalogueprovenanceauditlicensing and so on. I haven’t come across a situation yet where a more specific term hasn’t helped everyone understand the data better.

Data is just descriptions of things and if you say what aspects of a thing you are describing then everyone gets a better sense of what they might do with that. Once we realise that data is just descriptions of things, written in a consistent form to allow for analysis, we can see the next couple of reasons to stop using metadata.

Meta is a relative term. Ralph Swick of W3C is quoted as saying

What’s metadata to you, is someone else’s fundamental data.

That is to say, wether you consider something meta or not depends totally on your context and the problem you’re trying to solve. Often several people in the room will consider this differently.

If we combine that thought with the more specific naming of our data then we get the ability to think about descriptions of descriptions of descriptions. Which brings me on to something else I observe. By thinking in terms of data and metadata we talk, and think, in a vocabulary limited to two layers. Working with Big Data and Graphs I’ve learnt that’s not enough.

Taking the example of data about TV programming from todays RedBee post we could say:

  1. The Mentalist is a TV Programme
  2. The Mentalist is licensed to Channel 5 for broadcast in the UK
  3. The Mentalist will be shown at 21.00 on Thursday 12 April 2012

Statement 2 in that list is licensing data, statement 3 is schedule data. This all comes under the heading of descriptive metadata. Now, RedBee are a commercial organisation who put constraints on the use of their data. So we also need to be able to say things like

  • Statements 1, 2 and 3 are licensed to BBC for competitor analysis

This statement is also licensing data, about the metadata… So what is it? Descriptive metametadata?

Data about data is not a special case. Data is just descriptions of things and remains so wether the things being described are people, places, TV programmes or other data.

That’s why I try to replace the term metadata with something more useful whenever I can.

Getting over-excited about Dinosaurs…

I had the great pleasure, a few weeks ago, of working with Tom Scott and Michael Smethurst at the BBC on extensions to the Wildlife Ontology that sits behind Wildlife Finder.

In case you hadn’t spotted it (and if you’re reading this I can’t believe you haven’t) Wildlife Finder provides its information in HTML and RDF — Linked Data, providing a machine-readable version of the documents for those who want to extend or build on top of it. Readers of this blog will have seen Wildlife Finder showcased in many, many Linked Data presentations.

The initial data modelling work was a joint venture between Tom Scott of BBC and Leigh Dodds of Talis and they built an ontology that is simple, elegant and extensible. So, when I got a call asking if I could help them add Dinosaurs into the mix I was chuffed — getting paid to talk about dinosaurs!

Like most children, and we’re all children really, I got over-excited and rushed up to London to find out more. Tom and I spent some time working through changes and he, being far more knowledgeable than I on these matters, let me down gently.

Dinosaurs, of course, are no different to other animals in Wildlife Finder — other than being dead for a while longer…

This realisation made me feel a little below average in the biology department I can tell you. It’s one of those things you stumble across that is so obvious once someone says it to you and yet may well not have occurred to you without a lot of thought.


In summary, I don't like writing more code than I have to…

* This post first appeared on the Talis Consulting blog.

I opened my mailbox the other morning to a question from David Norris at BBC. They’ve been doing a lot of Linked Data work and we’ve been helping them on projects for a good while now.

The question surrounds an ongoing debate within their development community and is a very fine question indeed:

We are looking at our architecture for the Olympics. Currently, we have:

1. a data layer comprised of our Triple Store and Content store.
2. a service layer exposing a set of API’s returning RDF.
3. a presentation layer (PHP) to render the RDF into the HTML.

All fairly conventional – but we have two schools of thought:

Do the presentation developers take the RDF and walk the graph (say
using something like easyRDF) and pull out the properties they need.


Do we add a domain model in PHP on top of the easyRDF objects such that
developers are extracted from the RDF and can work with higher-level
domain objects instead, like athlete, race etc.

One group is adamant that we should only work with the RDF, because that
*is* the domain model and it’s a performance hit (especially in PHP) and
is just not the “Symantec Web way” to add another domain model.

Others advocate that adding a domain model is a standard OO approach and
is the ‘M’ in ‘MVC’: the fact that the data is RDF is irrelevant.

My opinion is that it comes down to the RDF data, and therefore the
ontology: if the RDF coming through to the presentation layer is large
and generic, it may benefit from having a model on top to provide more
high-level relevant domain objects. But if the RDF is already fairly
specific, like an athlete, then walking through the RDF that describes
that athlete is probably easy enough and wouldn’t require another model
on top of it. So I think it depends on the ontology being modelled close
enough to what the presentation layer needs.

What do you think? I’d be really interested in your view.

Having received it I figured a public answer would be really useful for people to consider and chime in on in the comments, David kindly agreed.

First up, the architecture in use here is nicely conventional; simple and effective. The triple store is storing metadata and the XML content store is storing documents. We would tend to put everything into the triple store by either re-modelling the XML in RDF or using XML Literals, but this group need very fast document querying using xpath and the like, so keeping their existing XML content store is a very sensible move. Keep PHP, or replace it with a web scripting language of your choice, and you have a typical setup for building webapps based on RDF.

The question is totally about what code and how much code to write in that presentation layer and why, the data storage and data access layers are sorted, giving back RDF. Having built a number of substantial applications on top of RDF stores, I do have some experience in this space and I’ve taken both of the approaches discussed above – converting incoming RDF to objects and simply working with the RDF.

Let’s get one thing out of the way – RDF, when modelled well, is domain-modelled data. With SQL databases there are a number of compromises required to fit within tables that create friction between a domain model and the underlying SQL schema (think many-to-many). Attempting to hide this is the life’s work of frameworks like Hibernate and much of Rails. If we model RDF as we would a SQL schema then we’ll have the same problems, but the IAs and developers in this group know how to model RDF well, so that shouldn’t be a problem.

With RDF being domain-modelled data, and a graph, it can be far simpler to map incoming RDF to objects in your code than it is with SQL databases. That makes the approach seem attractive. There are, however, some differences too. By looking at the differences we can get a feel for the problem.

Cardinality & Type

When consuming RDF we generally don’t want to make any assumptions about cardinality – how many of some property there will be. With properties in our data we can cope with this by making every member variable an array, or by keeping only the first value we find if we only ever expect one. Neither is ideal but both approaches work to map the RDF into object properties.

When we come to look at types, classes of things, we have a harder problem, though. It’s common, and useful, in RDF to make type statements about resources and very often a resource will have several types. Types are not a special case in RDF, just as with other properties there can be many of them. This presents a problem in mapping to an OOP object model where an object is of one type (with supertypes, admittedly). You can specify multiple types in many OOP languages, often through the use of interfaces, but you do this at the class level and it is consistent across all instances. In RDF we make type statements at the instance level, so a resource can be of many types. Mapping this, and maintaining a mapping in your OOP code will either a) be really hard or b) constrain what you can say in the data. Option b is not ideal as it can prevent others from doing stuff in the data and making more use of it.

Part of this mismatch on type systems comes from the OOP approach of combining data and behaviour into objects together. Over time this has been constrained and adapted in a number of ways (no multiple inheritance, introduction of interfaces) in order to make a codebase more manageable and hopefully prevent coders getting themselves too tied up in knots. RDF carries no behaviour, it’s a description of something, so the same constraints aren’t necessary. This is the main issue you face mapping RDF to an OOP object model.

Programming Style

What we have ended up with, in libraries like Moriarty, are tools that help us work with the graph quickly and easily. SimpleGraph has functions like get_subjects_of_type($t) which returns a simple array of all the resource URIs of that type. You can then use those in get_subject_subgraph($s) to extract part of the graph to hand off to something else, say a render function.

Moriarty’s SimpleGraph has quite a number of these kinds of functions for making sense of the graph without ever having to work with the underlying nested arrays directly. This pairs up very nicely with functions to do whatever it is you want to do.

$events = $graph->get_subjects_of_type(Ontologies::Sport.’Event’);
foreach ($events and $event) {

Of course, functions in PHP and other scripting languages are global, and that’s really not nice, so we often want to scope those and that’s where objects tend to come back into play.

Say we’re rendering information about a sporting event the pseudocode might look something like this:

$events = $graph->get_subjects_of_type(Ontologies::Sport.’Event’);
foreach ($events and $event) {

This approach differs from a MVC approach because the graph isn’t routinely and completely converted into domain model objects, as that approach is very constraining. What it does is combine graph handling using SimpleGraph with objects for code scoping, but by late-binding of the graph parts and the objects used to present them, the graph is not constrained by the OOP approach.

If you’re using a more templated approach, so you don’t want a render() function, then simple objects that give access to the values for display is a good approach and can make the code more readable than using graph-centric functions throughout and also offer components that can be easily unit-tested.


Going back to the question, I would work mostly with the graph using simple tools that make accessing the data within it easier and I would group functionality for rendering different kinds of things into classes to provide scope. That’s not MVC, not really, but it’s close enough to it that you get the benefits of OOP and MVC without the overhead of keeping OOP and RDF models totally aligned.

What people find hard about Linked Data

This post originally appeared on Talis Consulting Blog.

Following on from the post I put up last talking about Linked Data training, I got asked what people find hard when learning about Linked Data for the first time. Delivering our training has given us a unique insight into that, across different roles, backgrounds and organisations — in several countries. We’ve taught hundreds of people in all.

It’s definitely true that people find Linked Data hard, but the learning curve is not really steep compared with other technologies. The main problem is there are a few steps along the way, certain things you have to grasp to be successful with this stuff.

I’ve broken those down into conceptual difficulties, the way we think, and practical problems. These are our perception, there are tasks in the course that are the specific what that people find difficult but I’m trying to surmise something beyond that and describe the why of these difficulties and how we might address them.

The main steps we find people have to climb (in no particular order) are Graph Thinking, URI/URL distinction, Open World Assumption, HTTP 303s, and Syntax…


Graph Thinking

The biggest conceptual problem learners seem to have is with what we call graph thinking. What I mean by graph thinking is the ability to think about data as a graph, a web, a network. We talk about it in the training material in terms of graphs, and start by explaining what a graph is (and that it’s not a chart!).

Non-programmers seem to struggle with this, not with understanding the concept, but with putting themselves above the data. It seems to me that most non-programmers we train find it very easy to think about the data from one point of view or another, but find it hard to think about the data in less specific use-cases.

Take the idea of a simple social network — friend-to-friend connections. Everyone can understand the list of someone’s friends, and on from there to friends-of-friends. The step-up seems to be in understanding the network as a whole, the graph. Thinking about the social graph, that your friends have friends and that your friends’ friends may also be your friends and it all forms an intertwined web, seems to be the thing to grasp. If you’re reading this, you may well be wondering what’s hard about that, but I can tell you that trying to think about Linked Data, this is a step up people have to take.

There’s no reason anyone should find this easy, in everyday life we’re always looking at information in a particular context, for a specific purpose and from an individual point-of-view.

For developers it can be even harder. Having worked with tables in the RDBMS for so long, many developers have adopted tables as their way of thinking about the problem. Even for those fluent in object-oriented design (a graph model) the practical implications of working with a graph of objects leads us to develop, predominantly, trees.

Don’t get me wrong, people understand the concept, however, even after experience we all seem to struggle to extract ourselves from our own specific view when modelling the data.

What can we do?

This will take time to change. As we see more and more data consumed in innovative ways we will start to grasp the importance of graph thinking and modelling outside of a single use-case. We can help this by really focussing on explaining the benefits of a graph model over trees and tables.

I hope we’ll see colleges and universities start to teach graph models more fully, putting less focus on the tables of the RDBMS and the trees of XML.

Examples like BBC Wildlife Finder, and other Linked Data sites, show the potential of graph thinking and the way it changes user experience.

For developers, tools such as the RDF mapping tools in Drupal 7 and emerging object/RDF persistence layers will help hugely.

Using URIs to name real things

In Linked Data we use URIs to name things, not just address documents, but as names to identify things that aren’t on the web, like people, places, concepts. When coming across Linked Data, knowing how to do this is another step people have to climb.

First they have to recognise that they need different URIs for the document and the thing the document describes. It’s a leap to understand:

  • that they can just make these up
  • that no meaning should be inferred from the words in it (and yet best practice is to make the readable)
  • that they can say things about other peoples’ URIs (though those statements won’t be de-referencable)
  • that they can choose their own URIs and URI patterns to work to

The information/non-information resource distinction forms part of this difficulty too. While for naive cases this is easy to understand, how a non-information resource gets de-referenced and you get back a description of it is difficult. The use of 303 redirects doesn’t help, and I’ll talk about that a little later in practical issues.

What can we do?

There are already resources discussing URI patterns and the trade-offs that we can point people to. These will help. What I find helps a lot is simply pointing out that they own their URIs, and that they should reclaim them from .Net or Java or PHP or whatever technology has subverted them. More on that below in supporting custom URIs.

As a community we could focus more on our own URIs, talking more about why we made the decisions we did; why natural keys, why GUIDs, why readable, why opaque?

Non-Constraining Nature (Open World Assumption)

Linked Data follows the open-world assumption — that something you don’t know may be said elsewhere. This is a sea-change for all developers and for most people working with data.

For developers, data storage os very often tied up with data validation. We use schema-validating parsers for XML and we put integrity constraints into our RDBMS schema. We do this with the intention of making our lives easier in the application code, protecting ourselves from invalid data. Within the single context of an application this makes sense, but on the open web, remixing data from different sources, expecting some data to be missing, wanting to use that same data in many different and unexpected ways this doesn’t make sense.

For non-developers often they are used to business rules, another way of describing constraints on what data is acceptable. Also common is that they have particular uses of the data in mind, and want to constrain for those uses — possibly preventing other uses.

What can we do?

Tooling and application development patterns will help here, moving constraints out of storage and into the application’s context. Jena Eyeball is one option here and we need others. We need to support developers better in finding, constraining, validating data that they can consume in their applications. Again, this will come with time.

We could also look for case-studies, where the relaxing of constraints in storage can allow different (possibly conflicting) applications to share data, removing duplication. This would be a good way to show how data independent of context has significant benefit.


HTTP, 303s and Supporting Custom URIs

Certainly for most data owners, curators, admins this stuff is an entirely different world; and a world one could argue they shouldn’t need to know about. With Linked Data, URI design comes into the domain of the data manager where historically it’s always been the domain of the web developer.

Even putting that aside, development tools and default server configurations mean that many of the web developers out there have a hard time with this stuff. The default for almost all server-side web languages routes requests to code using the filename in the URI — index.php, renderItem.aspx and so on. And when do we need to work with response codes? Most web devs today will have had no reason to experience more than 200, 404 and 302 — some will understand 401 if they’ve done some work with logins, but even then most of the framework will hide that for you.

So, the need to route requests to code using a mechanism other than filename in URL is something that, while simple, most people haven’t done before. Add into that the need to handle non-information resources, issue raw 303s and then handle the request for a very similar document URL and you have a bit of stuff that is out of the norm — and that looks complicated.

What can we do?

Working with different frameworks and technologies to make custom URLs the norm and filename based routing frowned-upon wouyld be good. This doesn’t need to be a Linked Data specific thing either, the notion of Cool URIs would also benefit.

We could help different tools build in support for 303s as well, or we could look to drop the need for 303s (which would be my preference). Either way, they need to get easier.


This is a tricky one. I nearly put this into the conceptual issues as part of the learning curve is grasping that RDF has multiple syntaxes and that they are equal. However, most people get that quite quickly; even if they do have problems with the implications of that.

Practically, though, people have quite a step with our two most prominent syntaxes — RDF/XML and Turtle. The specifics are slightly different for each, but the essence is common; identifying the statements.

Turtle is far easier to work with than RDF/XML in this regard, but even Turtle, when you apply all the semicolons and commas to arrive at a concise fragment, is still a step. The statements don’t really stand out.

What can we do?

There are already lots of validators around, and they help a lot. What would really help during the learning stages would be a simple data explorer that could be used locally to load, visualise and navigate a dataset. I don’t know of one yet — you?


None of the steps above are actually hard; taken individually they are all easy to understand and work through — especially with the help of someone who already knows what they’re doing. But, taken together, they add up to a perception that Linked Data is complex, esoteric and different to simply building a website and it is that (false) perception that we need to do more to address.

Semtech 2010, San Francisco

Powell Street, San FranciscoSan Francisco is such a very beautiful city. The blue sky, clean streets and the cable cars. A short walk and you’re on the coast, with the bridges and islands.

I’ve been to San Francisco before, for less than 24 hours and I only got to see the bridge from the plane window as I flew out again so it’s especially nice to be here for a week.

I’m here with colleagues from Talis for SemTech 2010.

We’ve had some great sessions so far. I sat in on the first day of OWLED 2010 and having seen a few bio-informatics solutions using OWL this was an interesting session. First up was Michel Dumontier talking about Relational patterns in OWL and their application to OBO. Michel talked about the integration of OWL with OBO so that OWL can be generated from OBO. He talked about adding OWL definitions to the OBO flat file format as OBO’s flat file format doesn’t currently allow for all of the statements you want to be able to make in OWL. In summary, they’ve put together what looks like a macro expansion language so that short names on OBO can be expanded into the correct class definitions in OWL. This kind of ongoing integration with existing syntaxes and formats is really interesting as it opens up more options than simply replacing systems.

The session went on to talk about water spectroscopy, quantum mechanics and chemicals, all described using OWL techniques. This is heavy-weight ontology modelling and very interesting to see description logic applied and delivering real value to these datasets. You can get the full papers online linked from the OWLED 2010 Schedule.

On Monday evening we had the opening sessions for Semtech, the first being Eric A. Franzon, Semantic Universe and Brian Sletten, Bosatsu Consulting, Inc. giving a presentation entitled Semantics for the Rest of Us. Now, this started out with one of the best analogous explanations I’ve ever heard – so obvious once you’re seen it done. Eric and Brian compared the idea of mashing up data with mashing up music, mixing tracks with appropriate tempos and pitches to create new, interesting and exciting pieces of music; such wonders as The Prodigy and Enya, or Billy Idol vs Pink. Such a wonderfully simple way to explain. The music analogy continued with Linked Data being compared with the Harmonica, “Easy to play; takes work to master”. From here, though, we left the business/non-technical track and started to delve into code examples and other technical aspects of Semantic Web – a shame as it blemished what was otherwise an inspiring talk.

There was the Chairman’s presentation, “What Will We Be Saying About Semantics This Year?”. Having partaken of the free wine I’m afraid we ducked out for some dinner. Colibri is a little mexican restaurant near the Hilton, Union Square.

Bernadette Hyland, Zepheira, at SemTech 2010That was Monday, and I’ve now spent all of Tuesday in the SemTech tutorial sessions. This morning David Wood and Bernadette of Semantic Web consultancy Zepheira did an excellent session on Linked Enterprise Data. The talk comes ahead of a soon-to-be-published book, Linked Enterprise Data which is full of case studies authored by those directly involved with real-world enterprise linked data projects. Should be a good book.

One of the things I liked most about the session was the mythbusting, this happened throughout, but Bernadette put up, and busted, three myths explicitly. These three myths apply to many aspects of the way enterprises work, but having them show up clearly from the case studies is very useful to know.

Myth: One authoritative, centralized system for data is necessary to ensure quality and proper usage.

Reality: In many cases there is no “one right way” to curate and view the data. What is right for one department can limit or block another.

Myth: If we centralize control, no one will be able to use the data in the wrong way.

Reality: If you limit users, they will find a way to take the data elsewhere –> decentralization

Myth: We can have one group who will provide reporting to meet everyone’s data analysis needs.

Reality: One group cannot keep up with all the changing ways in which people need to use data and it is very expensive.

Next up I was really interested to hear Duane Degler talk on interfaces for the Semantic Web, unfortunately I misunderstood the pitch for the session and it was far more introductory than I was looking for, with a whole host of examples of interfaces and visualisations for structured data – all of which I’d seen (and studied) before.

With a conference as full as SemTech there’s far more going on than you can get into, the conference is many tracks wide at times. I considered the New Business and Marketing Models with Semantics and Linked Data panel featuring Ian Davis (from Talis) alongside Scott Brinker, ion interactive, inc., Michael F. Uschold and Rachel Lovinger, Razorfish. It looked from Twitter to be an interesting session.

I decided instead to attend the lightning sessions, a dozen presenters in the usual strict 5 minutes each format. Here are a few of my highlights:

Could SemTech Run Entirely on Excel? Lee Feigenbaum, Cambridge Semantics Inc — Lee demonstrated how data in Microsoft Excel could be published as Linked Data using Anzo for Excel. I have to say his rapid demo was very impressive, taking a typical multi-sheet workbook, generating an ontology from it automagically and syncing the data back and forth to Anzo; he then created a simple HTML view from the data using a browser-based point-and-click tool. All in 5 minutes, just.

My colleague Leigh Dodds presented in 4 minutes 50 seconds. It was great to see a warm reception for it on twitter. tries to surface existing communities around BBC programmes, giving a place to see what people are saying, and how people are feeling, about their favourite TV shows.

My final highlight would jute, presented by Sean McDonald. Jute is a network visualisation tool with some nice features allowing you to pick properties of the data and configure them as visual attributes instead of having the relationship on the graph. One example shown was a graph of US politicians in which their Democrat or Republican membership was initially shown as a relationship to each party, this makes the graph hard to read, but jute makes it possible to reconfigure that property as a color attribute on the node, changing the politicians into red and blue nodes, removing the visual complexity of the party membership. A very nice tool for viewing graphs.

Then out for dinner at Puccini and Pinetti — not cheap, but the food was very good. The wine was expensive, but very good with great recommendations from the staff.

Great day.