What people find hard about Linked Data

This post originally appeared on Talis Consulting Blog.

Following on from my last post about Linked Data training, I got asked what people find hard when learning about Linked Data for the first time. Delivering our training has given us a unique insight into that, across different roles, backgrounds and organisations — in several countries. We’ve taught hundreds of people in all.

It’s definitely true that people find Linked Data hard, but the learning curve is not really steep compared with other technologies. The main problem is that there are a few steps along the way: certain things you have to grasp to be successful with this stuff.

I’ve broken those down into conceptual difficulties, the way we think, and practical problems. These are our perceptions; there are specific tasks in the course that people find difficult, but I’m trying to go beyond that and describe the why of these difficulties and how we might address them.

The main steps we find people have to climb (in no particular order) are Graph Thinking, URI/URL distinction, Open World Assumption, HTTP 303s, and Syntax…

Conceptual

Graph Thinking

The biggest conceptual problem learners seem to have is with what we call graph thinking. What I mean by graph thinking is the ability to think about data as a graph, a web, a network. We talk about it in the training material in terms of graphs, and start by explaining what a graph is (and that it’s not a chart!).

Non-programmers seem to struggle with this, not with understanding the concept, but with putting themselves above the data. It seems to me that most non-programmers we train find it very easy to think about the data from one point of view or another, but find it hard to think about the data in less specific use-cases.

Take the idea of a simple social network — friend-to-friend connections. Everyone can understand the list of someone’s friends, and on from there to friends-of-friends. The step up seems to be in understanding the network as a whole, the graph. Thinking about the social graph (that your friends have friends, that your friends’ friends may also be your friends, and that it all forms an intertwined web) seems to be the thing to grasp. If you’re reading this you may well be wondering what’s hard about that, but I can tell you that, when trying to think about Linked Data, this is a step up people have to take.
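To make that concrete, here’s a minimal sketch of a small social graph as triples. It uses Python and the rdflib library, and the example.org URIs are made up purely for illustration; none of this is from the training material itself.

```python
# A tiny social graph: each foaf:knows statement is one edge.
from rdflib import Graph, Namespace

FOAF = Namespace("http://xmlns.com/foaf/0.1/")
EX = Namespace("http://example.org/people/")   # hypothetical URIs

g = Graph()
g.add((EX.alice, FOAF.knows, EX.bob))
g.add((EX.bob, FOAF.knows, EX.carol))
g.add((EX.alice, FOAF.knows, EX.carol))        # a friend-of-a-friend who is also a friend
g.add((EX.carol, FOAF.knows, EX.dave))

# Walking the graph rather than reading a single list: friends, then friends-of-friends.
friends = set(g.objects(EX.alice, FOAF.knows))
friends_of_friends = {fof for f in friends for fof in g.objects(f, FOAF.knows)}
print(friends & friends_of_friends)            # people who are both, e.g. carol
```

The point is not the code but the shape of the data: there is no privileged starting point, and the same graph answers “who knows Alice?” just as easily as “who does Alice know?”.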

There’s no reason anyone should find this easy; in everyday life we’re always looking at information in a particular context, for a specific purpose and from an individual point of view.

For developers it can be even harder. Having worked with tables in the RDBMS for so long, many developers have adopted tables as their way of thinking about a problem. Even for those fluent in object-oriented design (a graph model), the practical implications of working with a graph of objects lead us to develop, predominantly, trees.

Don’t get me wrong: people understand the concept. However, even with experience we all seem to struggle to extract ourselves from our own specific view when modelling the data.

What can we do?

This will take time to change. As we see more and more data consumed in innovative ways we will start to grasp the importance of graph thinking and modelling outside of a single use-case. We can help this by really focussing on explaining the benefits of a graph model over trees and tables.

I hope we’ll see colleges and universities start to teach graph models more fully, putting less focus on the tables of the RDBMS and the trees of XML.

Examples like BBC Wildlife Finder, and other Linked Data sites, show the potential of graph thinking and the way it changes user experience.

For developers, tools such as the RDF mapping tools in Drupal 7 and emerging object/RDF persistence layers will help hugely.

Using URIs to name real things

In Linked Data we use URIs to name things: not just to address documents, but to identify things that aren’t on the web, like people, places and concepts. When coming across Linked Data, knowing how to do this is another step people have to climb.

First they have to recognise that they need different URIs for the document and the thing the document describes. It’s a leap to understand:

  • that they can just make these up
  • that no meaning should be inferred from the words in a URI (and yet best practice is to make them readable)
  • that they can say things about other people’s URIs (though those statements won’t be de-referenceable)
  • that they can choose their own URIs and URI patterns to work to

The information/non-information resource distinction forms part of this difficulty too. While in simple cases the distinction is easy to understand, it is harder to grasp how a non-information resource gets de-referenced so that you get back a description of it. The use of 303 redirects doesn’t help, and I’ll talk about that a little later in the practical issues.
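As a sketch of what naming both the thing and the document can look like (the URIs here are entirely made up, and the /id/ vs /doc/ pattern is just one common convention, not something mandated here): the person gets one URI, the document describing the person gets another, and the document points at the thing it is about.

```python
from rdflib import Graph

data = """
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

# The person (a non-information resource) and the document that describes her.
<http://example.org/id/alice>  foaf:name "Alice" .
<http://example.org/doc/alice> foaf:primaryTopic <http://example.org/id/alice> .
"""

g = Graph()
g.parse(data=data, format="turtle")
for s, p, o in g:
    print(s, p, o)
```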

What can we do?

There are already resources discussing URI patterns and the trade-offs that we can point people to. These will help. What I find helps a lot is simply pointing out that they own their URIs, and that they should reclaim them from .Net or Java or PHP or whatever technology has subverted them. More on that below in supporting custom URIs.

As a community we could focus more on our own URIs, talking more about why we made the decisions we did; why natural keys, why GUIDs, why readable, why opaque?

Non-Constraining Nature (Open World Assumption)

Linked Data follows the open-world assumption — that something you don’t know may be said elsewhere. This is a sea-change for all developers and for most people working with data.

For developers, data storage is very often tied up with data validation. We use schema-validating parsers for XML and we put integrity constraints into our RDBMS schemas. We do this with the intention of making our lives easier in the application code, protecting ourselves from invalid data. Within the single context of an application this makes sense, but it doesn’t make sense on the open web, where we remix data from different sources, expect some data to be missing, and want to use the same data in many different and unexpected ways.

Non-developers are often used to business rules, another way of describing constraints on what data is acceptable. It’s also common that they have particular uses of the data in mind, and want to constrain for those uses — possibly preventing other uses.
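A minimal sketch of the open-world idea, again with made-up data and rdflib: two sources each say something partial about the same resource, the merge simply accumulates statements, and any constraint checking belongs to the consuming application rather than the store.

```python
from rdflib import Graph, Namespace, URIRef

FOAF = Namespace("http://xmlns.com/foaf/0.1/")

source_a = """
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
<http://example.org/id/alice> foaf:name "Alice" .
"""

source_b = """
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
<http://example.org/id/alice> foaf:homepage <http://alice.example.org/> .
"""

g = Graph()
g.parse(data=source_a, format="turtle")
g.parse(data=source_b, format="turtle")   # merging: nothing to violate, nothing 'invalid'

# Constraints live in the application: check only what this application needs.
alice = URIRef("http://example.org/id/alice")
if (alice, FOAF.mbox, None) not in g:
    print("no mbox published anywhere (yet); handle it, don't reject the data")
```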

What can we do?

Tooling and application development patterns will help, moving constraints out of storage and into the application’s context. Jena Eyeball is one option and we need others. We need to support developers better in finding, constraining and validating the data they consume in their applications. Again, this will come with time.

We could also look for case studies where relaxing constraints in storage allows different (possibly conflicting) applications to share data, removing duplication. This would be a good way to show how data independent of context has significant benefit.

Practical

HTTP, 303s and Supporting Custom URIs

Certainly for most data owners, curators and admins this stuff is an entirely different world, and a world one could argue they shouldn’t need to know about. With Linked Data, URI design comes into the domain of the data manager, where historically it has always been the domain of the web developer.

Even putting that aside, development tools and default server configurations mean that many web developers have a hard time with this stuff. The default for almost all server-side web languages is to route requests to code using the filename in the URI — index.php, renderItem.aspx and so on. And when do we need to work with response codes? Most web devs today will have had no reason to experience more than 200, 404 and 302; some will understand 401 if they’ve done some work with logins, but even then most frameworks will hide that for you.

So, the need to route requests to code using a mechanism other than the filename in the URL is something that, while simple, most people haven’t done before. Add to that the need to handle non-information resources, issue raw 303s and then handle the request for a very similar document URL, and you have a bit of stuff that is out of the norm — and that looks complicated.
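For illustration only, here’s a minimal sketch of that routing pattern using Flask (my choice of framework, not something the post prescribes): requests for the thing’s URI get a raw 303 pointing at the document’s URI, and the document URI serves the description.

```python
from flask import Flask, redirect

app = Flask(__name__)

@app.route("/id/<name>")           # the non-information resource, e.g. a person
def thing(name):
    # 303 See Other: we can't send the person over HTTP, only a description of them.
    return redirect(f"/doc/{name}", code=303)

@app.route("/doc/<name>")          # the document describing the thing
def document(name):
    return f"A description of {name} would be served here (HTML, Turtle, ...)"

if __name__ == "__main__":
    app.run()
```

Note there is no index.php-style filename anywhere in those URLs; the framework routes on the path, which is exactly the habit most default configurations don’t encourage.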

What can we do?

Working with different frameworks and technologies to make custom URLs the norm, and filename-based routing frowned upon, would be good. This doesn’t need to be a Linked Data-specific thing either; the notion of Cool URIs would also benefit.

We could help different tools build in support for 303s as well, or we could look to drop the need for 303s (which would be my preference). Either way, they need to get easier.

Syntax

This is a tricky one. I nearly put this into the conceptual issues, as part of the learning curve is grasping that RDF has multiple syntaxes and that they are equal. However, most people get that quite quickly, even if they do have problems with its implications.

Practically, though, people have quite a step to climb with our two most prominent syntaxes — RDF/XML and Turtle. The specifics are slightly different for each, but the essence is common: identifying the statements.

Turtle is far easier to work with than RDF/XML in this regard, but even Turtle, when you apply all the semicolons and commas to arrive at a concise fragment, is still a step. The statements don’t really stand out.
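To see the step concretely, here’s a small sketch (rdflib again, made-up URIs) that parses a compact Turtle fragment and prints the same graph as RDF/XML and as one statement per line; the last form is the one where the statements finally stand out.

```python
from rdflib import Graph

turtle = """
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<http://example.org/id/alice>
    foaf:name  "Alice" ;
    foaf:knows <http://example.org/id/bob>, <http://example.org/id/carol> .
"""

g = Graph()
g.parse(data=turtle, format="turtle")

print(g.serialize(format="xml"))   # the same three statements as RDF/XML
for s, p, o in g:                  # ...and spelled out, one triple per line
    print(s, p, o)
```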

What can we do?

There are already lots of validators around, and they help a lot. What would really help during the learning stages would be a simple data explorer that could be used locally to load, visualise and navigate a dataset. I don’t know of one yet — you?

Summary

None of the steps above is actually hard; taken individually they are all easy to understand and work through — especially with the help of someone who already knows what they’re doing. But taken together they add up to a perception that Linked Data is complex, esoteric and different to simply building a website, and it is that (false) perception that we need to do more to address.

Introducing the Web of Data

This post originally appeared on Talis’ Platform Consulting Blog.

So, the blog is fairly new, but we’ve been here a while. Those of you who know us already may know that Talis is more than 40 years old!

During that time the company has seen many changes in the technology landscape and has been at the forefront of many of them.

Linked Data is not too much different. We’ve been doing Linked Data and Semantic Web stuff for several years now. We think we’ve learned some lessons along the way.

If you’ve been to one of our open days, or paid really close attention to our branding, you’ll have noticed the strapline shared innovation™. We like to share what we’re doing and have been a little lax at talking about our consulting work here — expect that to change. 🙂

In the meantime I wanted to point to something we’ve been sharing for a while: course materials for learning about Linked Data. We originally designed this course for government departments working with data.gov.uk, refined it based on our experience there, and went on to deliver it to many teams throughout the BBC.

It’s now been delivered dozens of times to interested groups and inside companies with no previous knowledge who want to get into this technology fast.

In the spirit of sharing, the materials are freely available on the web and licensed under the Creative Commons Attribution License (CC-By).

Take a look and let us know what you think:

http://bit.ly/intro-to-web-of-data

NESTA Birmingham

Friday afternoon was an interesting few hours; Simon Whitehouse of Digital Birmingham had organised an event for anyone interested in putting in a bid for the NESTA Make It Local competition.

I got to meet Hadley Beeman who has been putting together some really exciting ideas on crowdsourcing data conversion for the public data that’s been released recently — hoping to help get the data out of Excel and other tabular formats and into something more flexible.

The NESTA competition is focussed on bringing together local government and local digital media businesses; bids have to be led by a local authority, must use local firms for implementation and must use previously unreleased data. As Simon pointed out, that puts those who have already released data at a disadvantage compared to those who haven’t, though helping those who haven’t started releasing data to get going can’t be a bad thing.

Talking with others there brought out some great ideas.

You're not the one and only…

The chorus of Chesney Hawkes‘ song, a huge pop hit with teenage girls in the 1990s, goes “I am the one and only”, but what does that have to do with SemTech 2010?

I was in the exhibit space yesterday evening and there was so much interesting stuff. I had some really great conversations: talking about storage implementations with Franz and revelytix (and drinking their excellent margaritas), looking at vertical search with Semantifi and having a great discussion about scaling with the guys from Oracle.

A really useful exhibition of some great technology companies in the semweb space.

So why the Chesney reference? Well, several of the exhibitors started out with

we’re the only end-user semantic web application available today

and

we have the first foo bar baz server that does blah blah blah

and

we are the first and only semantic search widget linker

and all I could hear in my head every time it was said was Chesney… “You are the one and only”. Only they’re not.

For all of the exhibitors that said they were first or only, I had serious doubts, having seen very similar things elsewhere. Maybe their ‘first’ was very specific — I was the first blogger at SemTech to write a summary of the first two days that included a reference to Colibri…

The problem with these statements is that they are damaging; how much depends on the listener. If the listener is new to the semweb and believes the claim, then it makes our market look niche, immature and specialist. If the listener is informed and does not believe the claim, it makes your business look like marketeers who will lie to impress people. Either way it’s not a positive outcome. Please stop.

Semtech 2010, San Francisco

[Photo: Powell Street, San Francisco]

San Francisco is such a beautiful city. The blue sky, clean streets and the cable cars. A short walk and you’re on the coast, with the bridges and islands.

I’ve been to San Francisco before, but for less than 24 hours, and I only got to see the bridge from the plane window as I flew out again, so it’s especially nice to be here for a week.

I’m here with colleagues from Talis for SemTech 2010.

We’ve had some great sessions so far. I sat in on the first day of OWLED 2010; having seen a few bio-informatics solutions using OWL, I found this an interesting session. First up was Michel Dumontier talking about Relational patterns in OWL and their application to OBO. Michel talked about integrating OWL with OBO so that OWL can be generated from OBO, adding OWL definitions to the OBO flat file format, which doesn’t currently allow for all of the statements you want to be able to make in OWL. In summary, they’ve put together what looks like a macro expansion language so that short names in OBO can be expanded into the correct class definitions in OWL. This kind of ongoing integration with existing syntaxes and formats is really interesting, as it opens up more options than simply replacing systems.

The session went on to talk about water spectroscopy, quantum mechanics and chemicals, all described using OWL techniques. This is heavy-weight ontology modelling, and it was very interesting to see description logic applied and delivering real value to these datasets. You can get the full papers online, linked from the OWLED 2010 Schedule.

On Monday evening we had the opening sessions for SemTech, the first being Eric A. Franzon of Semantic Universe and Brian Sletten of Bosatsu Consulting, Inc. giving a presentation entitled Semantics for the Rest of Us. Now, this started out with one of the best analogous explanations I’ve ever heard – so obvious once you’ve seen it done. Eric and Brian compared the idea of mashing up data with mashing up music, mixing tracks with appropriate tempos and pitches to create new, interesting and exciting pieces of music; such wonders as The Prodigy and Enya, or Billy Idol vs Pink. Such a wonderfully simple way to explain it. The music analogy continued with Linked Data being compared to the harmonica: “Easy to play; takes work to master”. From here, though, we left the business/non-technical track and started to delve into code examples and other technical aspects of the Semantic Web – a shame, as it blemished what was otherwise an inspiring talk.

There was the Chairman’s presentation, “What Will We Be Saying About Semantics This Year?”. Having partaken of the free wine, I’m afraid we ducked out for some dinner. Colibri is a little Mexican restaurant near the Hilton, Union Square.

[Photo: Bernadette Hyland, Zepheira, at SemTech 2010]

That was Monday, and I’ve now spent all of Tuesday in the SemTech tutorial sessions. This morning David Wood and Bernadette Hyland of Semantic Web consultancy Zepheira did an excellent session on Linked Enterprise Data. The talk comes ahead of a soon-to-be-published book, Linked Enterprise Data, which is full of case studies authored by those directly involved with real-world enterprise linked data projects. It should be a good book.

One of the things I liked most about the session was the mythbusting; this happened throughout, but Bernadette put up, and busted, three myths explicitly. These three myths apply to many aspects of the way enterprises work, and having them show up clearly from the case studies is very useful.

Myth: One authoritative, centralized system for data is necessary to ensure quality and proper usage.

Reality: In many cases there is no “one right way” to curate and view the data. What is right for one department can limit or block another.

Myth: If we centralize control, no one will be able to use the data in the wrong way.

Reality: If you limit users, they will find a way to take the data elsewhere, leading to decentralization.

Myth: We can have one group who will provide reporting to meet everyone’s data analysis needs.

Reality: One group cannot keep up with all the changing ways in which people need to use data, and it is very expensive to try.

Next up, I was really interested to hear Duane Degler talk on interfaces for the Semantic Web. Unfortunately I misunderstood the pitch for the session and it was far more introductory than I was looking for, with a whole host of examples of interfaces and visualisations for structured data – all of which I’d seen (and studied) before.

With a conference as full as SemTech there’s far more going on than you can get into; the conference is many tracks wide at times. I considered the New Business and Marketing Models with Semantics and Linked Data panel featuring Ian Davis (from Talis) alongside Scott Brinker of ion interactive, inc., Michael F. Uschold, and Rachel Lovinger of Razorfish. It looked from Twitter to be an interesting session.

I decided instead to attend the lightning sessions, a dozen presenters in the usual strict 5 minutes each format. Here are a few of my highlights:

Could SemTech Run Entirely on Excel? Lee Feigenbaum, Cambridge Semantics Inc — Lee demonstrated how data in Microsoft Excel could be published as Linked Data using Anzo for Excel. I have to say his rapid demo was very impressive, taking a typical multi-sheet workbook, generating an ontology from it automagically and syncing the data back and forth to Anzo; he then created a simple HTML view from the data using a browser-based point-and-click tool. All in 5 minutes, just.

My colleague Leigh Dodds presented fanhu.bz in 4 minutes 50 seconds. It was great to see a warm reception for it on Twitter. Fanhu.bz tries to surface existing communities around BBC programmes, giving a place to see what people are saying, and how people are feeling, about their favourite TV shows.

My final highlight would be jute, presented by Sean McDonald. Jute is a network visualisation tool with some nice features allowing you to pick properties of the data and configure them as visual attributes instead of having the relationship on the graph. One example shown was a graph of US politicians in which their Democrat or Republican membership was initially shown as a relationship to each party; this makes the graph hard to read, but jute makes it possible to reconfigure that property as a color attribute on the node, changing the politicians into red and blue nodes and removing the visual complexity of the party membership. A very nice tool for viewing graphs.

Then out for dinner at Puccini and Pinetti — not cheap, but the food was very good. The wine was expensive, but very good with great recommendations from the staff.

Great day.

Building a simple HTTP-to-Z39.50 gateway using Yaz4j and Tomcat | Index Data

Yaz4J is a wrapper library over the client-specific parts of YAZ, a C-based Z39.50 toolkit, and allows you to use the ZOOM API directly from Java. Initial version of Yaz4j has been written by Rob Styles from Talis and the project is now developed and maintained at IndexData. ZOOM is a relatively straightforward API and with a few lines of code you can write a basic application that can establish connection to a Z39.50 server. Here we will try to build a very simple HTTP-to-Z3950 gateway using yaz4j and the Java Servlet technology.

from Building a simple HTTP-to-Z39.50 gateway using Yaz4j and Tomcat | Index Data.

I wrote Yaz4j a couple of years ago now and it’s great to see it getting some use outside of Talis.

left wondering…

He’s a nice chap, the man across from me on the train, jolly as we share a ‘What do you do?’ over the tops of our laptops. Mine a Mac with stickers on, his an old corporate HP struggling to boot.

His top button done up, tie pulled tight, pink pin-stripe running through the dark blue of his suit; me in my worn jeans.

“What do you do?” I ask. “I’m a head hunter” he replies. “Oh, what sector?” I ask. “Big industry; Power, Energy, Oil and Gas” he says, smiling.

“That must be interesting, do you do much in renewables?” I ask trying to turn the conversation to something I’d be very interested to hear about. “Oh no, there’s nothing in renewables, it’s just a distraction” he says dismissively. He goes on… “I just finished reading a report, renewables are fine to make us look good but they can’t provide anything like enough power for the needs of somewhere like the UK. For the big companies like Shell, BP, they’re just a distraction.”

“and all this suggestion that hydrocarbons are running out isn’t true, the oil companies are happy for people to think that as it keeps the prices high, but a project I recently hired for has found millions of barrels just off Brazil. There’s plenty of it out there.”

I sit back, wondering if he has kids; if he has noticed the chaotic weather or the news; if he watched The Age of Stupid. I resist asking.

I am left saddened and wondering: do we have any chance at all?

Good roundup of new eReaders at CES on CNN

Las Vegas, Nevada (CNN) — The first generation of electronic readers had little more than black-and-white text. The second generation had black-and-white text, simple graphics and Web connectivity.

Glimpses of the third generation are on display this week at the International Consumer Electronics Show, where manufacturers are previewing e-readers with color screens, interactive graphics and magazine-style layouts.

from Bold new e-readers grab attention at CES – CNN.com.