RDF, Big Data and The Semantic Web

I’ve been meaning to write this post for a little while, but things have been busy. So, with this afternoon free I figured I’d write it now.

I’ve spent the last 7 years working intensively with data. Mostly not with RDBMSs, but with different Big Data and Linked Data tools. Over the past year things have changed enormously.

The Semantic Web

The Semantic Web has been talked about for a long time now. In fact, I often advise people to search for Linked Data rather than Semantic Web, as the results are far more useful in a practical context. The Semantic Web has been a rather unfortunately academic endeavour that has been very hard for many developers to get into. In contrast, Linked Data has seen explosive growth over the past five years. It hasn’t gone mainstream though.

What does show signs of going mainstream is the schema.org initiative. This creates a positive feedback loop between sites putting structured data into their pages and search engines giving those sites more and better leads as a result.

Much has been said about Microdata killing RDF, blah blah, but that’s not important. What is important is that publishing machine-understandable data on the web is going mainstream.

As an aside, as Microdata extends to solve the problems it currently has (global identifiers and meaningful links) it becomes just another way to write down the RDF model anyway. RDF is an abstract model, not a data format, and at the moment Microdata is a simplified subset of that model.

Big Data and NoSQL

In the meantime another data meme has also grown enormously. In fact, it has dwarfed Linked Data in the attention it has captured. That trend is Big Data and NoSQL.

In Planning for Big Data, Edd talks about the three Vs:

To clarify matters, the three Vs of volume, velocity and variety are commonly used to characterize different aspects of big data. They’re a helpful lens through which to view and understand the nature of the data and the software platforms available to exploit them. Most probably you will contend with each of the Vs to one degree or another.

Most Big Data projects are really focussed on volume. They have large quantities, terabytes or petabytes, of uniform data. Often this data is very simple in structure, such as tweets. Fewer projects are focussed on velocity, being able to handle data coming in quickly, and even fewer on variety, having unknown or widely varied data.

You can see how the Hadoop toolset is tuned to this and also how the NoSQL communities focus mostly on denormalisation of data. This is a good way to focus resources if you have large volumes of relatively simple, highly uniform data and a specific use-case or set of queries.

Apart from Neo4J, which is the odd one out in the Big Data community, this is the approach.

RDF

So, while we wait for the semantic web to evolve, what is RDF good for today?

That third V of the Big Data puzzle is where I’ve been helping people use graphs of data (and that’s what RDF is, a graph model). Graphs are great where you have a variety of data that you want to link up. Especially if you want to extend the data often and if you want to extend the data programmatically — i.e. you don’t want to commit to a complete, constraining schema up-front.

The other aspect of that variety in data that graphs help with is querying. As Jem Rayfield (BBC News & Sport) explains, using a graph makes the model simpler to develop and query.

Graph data models can reach higher levels of variety in the data before they become unwieldy. This allows more data to be mixed and queried together. Mixing in more data adds more context, and more context allows for more insight. Insight is what we’re ultimately trying to get at with any data analysis. That’s why the intelligence communities have been using graphs for many years now.

What we’re seeing now, with the combination of Big Data and graph technologies, is the ability to add value inside the enterprise. Graphs are useful for data analysis even if you don’t intend to publish the data on the semantic web. Maybe even especially then.

Microsoft, Oracle and IBM are all playing in the Big Data space and have been for some time. What’s less well-known and less ready for mainstream is that they all have projects in the graph database space: DB2 NoSQL Graph Store, Oracle Database Semantic Technologies and Connected Services Framework 3.0.

Behind the scenes, in the enterprise, is probably the biggest space where graphs will be adding real value over the next few years: solving the variety problem in Big Data.

What people find hard about Linked Data

This post originally appeared on Talis Consulting Blog.

Following on from my last post about Linked Data training, I got asked what people find hard when learning about Linked Data for the first time. Delivering our training has given us a unique insight into that, across different roles, backgrounds and organisations — in several countries. We’ve taught hundreds of people in all.

It’s definitely true that people find Linked Data hard, but the learning curve is not really steep compared with other technologies. The main problem is there are a few steps along the way, certain things you have to grasp to be successful with this stuff.

I’ve broken those down into conceptual difficulties, the way we think, and practical problems. These are our perceptions; there are specific tasks in the course that people find difficult, but I’m trying to go beyond that and describe the why of these difficulties and how we might address them.

The main steps we find people have to climb (in no particular order) are Graph Thinking, URI/URL distinction, Open World Assumption, HTTP 303s, and Syntax…

Conceptual

Graph Thinking

The biggest conceptual problem learners seem to have is with what we call graph thinking. What I mean by graph thinking is the ability to think about data as a graph, a web, a network. We talk about it in the training material in terms of graphs, and start by explaining what a graph is (and that it’s not a chart!).

Non-programmers seem to struggle with this, not with understanding the concept, but with putting themselves above the data. It seems to me that most non-programmers we train find it very easy to think about the data from one point of view or another, but find it hard to think about the data in less specific use-cases.

Take the idea of a simple social network — friend-to-friend connections. Everyone can understand the list of someone’s friends, and on from there to friends-of-friends. The step up seems to be in understanding the network as a whole, the graph. Thinking about the social graph, that your friends have friends, that your friends’ friends may also be your friends, and that it all forms an intertwined web, seems to be the thing to grasp. If you’re reading this you may well be wondering what’s hard about that, but I can tell you that, when trying to think about Linked Data, this is a step up people have to take.

There’s no reason anyone should find this easy; in everyday life we’re always looking at information in a particular context, for a specific purpose and from an individual point of view.

For developers it can be even harder. Having worked with tables in the RDBMS for so long, many developers have adopted tables as their way of thinking about the problem. Even for those fluent in object-oriented design (a graph model), the practical implications of working with a graph of objects lead us to develop, predominantly, trees.
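
As a toy illustration (invented names, plain Ruby rather than any graph store), the friends-of-friends step from the example above is trivial code; the graph-thinking shift is treating the whole web of connections, not one traversal path or one table, as the data.

# A toy friend graph as an adjacency list; all names are invented.
friends = {
  'alice' => ['bob', 'carol'],
  'bob'   => ['alice', 'dave'],
  'carol' => ['alice'],
  'dave'  => ['bob']
}

# Alice's friends-of-friends who aren't already her friends (and aren't Alice).
fof = friends['alice'].map { |f| friends[f] }.flatten.uniq - friends['alice'] - ['alice']
puts fof.inspect   # => ["dave"]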

Don’t get me wrong, people understand the concept; however, even with experience we all seem to struggle to extract ourselves from our own specific view when modelling the data.

What can we do?

This will take time to change. As we see more and more data consumed in innovative ways we will start to grasp the importance of graph thinking and modelling outside of a single use-case. We can help this by really focussing on explaining the benefits of a graph model over trees and tables.

I hope we’ll see colleges and universities start to teach graph models more fully, putting less focus on the tables of the RDBMS and the trees of XML.

Examples like BBC Wildlife Finder, and other Linked Data sites, show the potential of graph thinking and the way it changes user experience.

For developers, tools such as the RDF mapping tools in Drupal 7 and emerging object/RDF persistence layers will help hugely.

Using URIs to name real things

In Linked Data we use URIs to name things, not just to address documents, but as names to identify things that aren’t on the web: people, places, concepts. When coming across Linked Data, learning how to do this is another step people have to climb.

First they have to recognise that they need different URIs for the document and the thing the document describes. It’s a leap to understand:

  • that they can just make these up
  • that no meaning should be inferred from the words in it (and yet best practice is to make them readable)
  • that they can say things about other people’s URIs (though those statements won’t be de-referenceable)
  • that they can choose their own URIs and URI patterns to work to

The information/non-information resource distinction forms part of this difficulty too. While in naive cases this is easy to understand, how a non-information resource gets de-referenced so that you get back a description of it is difficult. The use of 303 redirects doesn’t help, and I’ll talk about that a little later under practical issues.

What can we do?

There are already resources discussing URI patterns and the trade-offs that we can point people to. These will help. What I find helps a lot is simply pointing out that they own their URIs, and that they should reclaim them from .Net or Java or PHP or whatever technology has subverted them. More on that below in supporting custom URIs.

As a community we could focus more on our own URIs, talking more about why we made the decisions we did; why natural keys, why GUIDs, why readable, why opaque?

Non-Constraining Nature (Open World Assumption)

Linked Data follows the open-world assumption — that something you don’t know may be said elsewhere. This is a sea-change for all developers and for most people working with data.

For developers, data storage is very often tied up with data validation. We use schema-validating parsers for XML and we put integrity constraints into our RDBMS schema. We do this with the intention of making our lives easier in the application code, protecting ourselves from invalid data. Within the single context of an application this makes sense, but on the open web, remixing data from different sources, expecting some data to be missing and wanting to use that same data in many different and unexpected ways, it doesn’t make sense.

Non-developers are often used to business rules, another way of describing constraints on what data is acceptable. It’s also common that they have particular uses of the data in mind, and want to constrain for those uses — possibly preventing other uses.

What can we do?

Tooling and application development patterns will help here, moving constraints out of storage and into the application’s context. Jena Eyeball is one option here and we need others. We need to support developers better in finding, constraining, validating data that they can consume in their applications. Again, this will come with time.
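
As an illustration only (a sketch, not any particular tool’s API), the shift is for the application to state what it needs for a given task and ignore whatever else is present, rather than the store rejecting data it doesn’t recognise:

# Sketch only: open-world-friendly checking at the point of use. The property
# names, and the idea of a "contact card" task, are invented for illustration.
REQUIRED_FOR_CONTACT_CARD = ['name', 'email']

def usable_for_contact_card?(resource)
  # resource is a Hash of property name => array of values gathered from the web;
  # unknown extra properties are simply ignored rather than rejected.
  REQUIRED_FOR_CONTACT_CARD.all? { |p| resource[p] && !resource[p].empty? }
end

person = { 'name' => ['Jane Doe'], 'email' => ['jane@example.org'], 'knows' => ['...'] }
puts usable_for_contact_card?(person)   # => true, despite the extra 'knows' data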

We could also look for case-studies, where the relaxing of constraints in storage can allow different (possibly conflicting) applications to share data, removing duplication. This would be a good way to show how data independent of context has significant benefit.

Practical

HTTP, 303s and Supporting Custom URIs

Certainly for most data owners, curators and admins this stuff is an entirely different world; and a world one could argue they shouldn’t need to know about. With Linked Data, URI design comes into the domain of the data manager, where historically it has always been the domain of the web developer.

Even putting that aside, development tools and default server configurations mean that many of the web developers out there have a hard time with this stuff. The default for almost all server-side web languages routes requests to code using the filename in the URI — index.php, renderItem.aspx and so on. And when do we need to work with response codes? Most web devs today will have had no reason to experience more than 200, 404 and 302 — some will understand 401 if they’ve done some work with logins, but even then most frameworks will hide that for you.

So, the need to route requests to code using a mechanism other than filename in URL is something that, while simple, most people haven’t done before. Add into that the need to handle non-information resources, issue raw 303s and then handle the request for a very similar document URL and you have a bit of stuff that is out of the norm — and that looks complicated.
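
For illustration, here’s a minimal Rack sketch (in Ruby, with invented paths and data) of the pattern just described: a request for the thing itself gets a raw 303 pointing at a document about it, and a request for that document gets the description back.

# Sketch only: routing on the path rather than a filename, and issuing 303s
# for non-information resources. Paths and the returned triple are invented.
require 'rack'

app = lambda do |env|
  path = env['PATH_INFO']
  if (m = path.match(%r{\A/id/(.+)\z}))
    # A non-information resource: see-other redirect to the document about it.
    [303, { 'Location' => "/doc/#{m[1]}" }, []]
  elsif (m = path.match(%r{\A/doc/(.+)\z}))
    # The document describing the thing.
    [200, { 'Content-Type' => 'text/turtle' }, ["</id/#{m[1]}> a <http://example.org/Thing> ."]]
  else
    [404, { 'Content-Type' => 'text/plain' }, ['Not found']]
  end
end

# Rack::Handler::Thin.run(app, :Port => 4000) would serve this with Thin.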

What can we do?

Working with different frameworks and technologies to make custom URLs the norm and filename-based routing frowned upon would be good. This doesn’t need to be a Linked Data specific thing either; the notion of Cool URIs would also benefit.

We could help different tools build in support for 303s as well, or we could look to drop the need for 303s (which would be my preference). Either way, they need to get easier.

Syntax

This is a tricky one. I nearly put this into the conceptual issues, because part of the learning curve is grasping that RDF has multiple syntaxes and that they are equal. However, most people get that quite quickly, even if they do have problems with the implications of that.

Practically, though, people have quite a step to climb with our two most prominent syntaxes — RDF/XML and Turtle. The specifics are slightly different for each, but the essence is common: identifying the statements.

Turtle is far easier to work with than RDF/XML in this regard, but even Turtle, when you apply all the semicolons and commas to arrive at a concise fragment, is still a step. The statements don’t really stand out.

What can we do?

There are already lots of validators around, and they help a lot. What would really help during the learning stages would be a simple data explorer that could be used locally to load, visualise and navigate a dataset. I don’t know of one yet — you?

Summary

None of the steps above are actually hard; taken individually they are all easy to understand and work through — especially with the help of someone who already knows what they’re doing. But, taken together, they add up to a perception that Linked Data is complex, esoteric and different to simply building a website, and it is that (false) perception that we need to do more to address.

Distributed, Linked Data has significant implications for Intellectual Property Rights in Data.

What P2P networks have done for distribution of digital media is phenomenal. It is possible, easy even, to get almost any TV show, movie, track or album you can think of by searching one of the many torrent sites. As fast as the media industry take down one site through legal action another has appeared to take its place.

I don’t want to discuss the legal, moral or social implications of this, but discuss how the internet changes the nature of our relationship with media – and data. The internet is a great big copying machine, true enough, but it’s also a fabric that allows mass co-operation. It’s that mass peer-to-peer co-operation that makes so much content available for free; content that is published freely by its creator as well as infringing content.

Sharing of copyrighted content is always likely to be infringing on p2p networks, regardless of any tricks employed, but for data the situation may be different and the Linked Data web has real implications in this space.

Taking the Royal Mail’s Postcode Address File as my working example, because it’s been in the news recently as a result of the work done by ErnestMarples.com, I’ll attempt to show how the Linked Data web changes the nature of data publishing and intellectual property.

First, in case you’re not familiar, a quick introduction to Linked Data. In Linked Data we use http web addresses (which we call URIs) not only to refer to documents containing data but also to refer to real-world things and abstract concepts. We then combine those URIs with properties and values to make statements about the things the URIs represent. So, I might say that my postcode is referred to by the URI http://someorg.example.com/myaddress/postcode. Asking for that URI in the browser would then redirect you to a document containing data about my postcode, maybe sending you to http://someorg.example.com/myaddress/postcode.rdf if you asked for data and http://someorg.example.com/myaddress/postcode.html if you asked for a web page (that’s called content negotiation). All of that works today and organisations like the UK Government, BBC, New York Times and others are publishing data this way.

Also worth noting is the distinction between Linked Data (the technique described above) and Linked Open Data, the output of the W3C’s Linking Open Data project. An important distinction as I’m talking about how commercially owned and protected databases may be disrupted by Linked Data, whereas Linked Open Data is data that is already published on the web under an Open license.

Now, Royal Mail own the Postcode Address File, and other postcode data such as geo co-ordinates. These are covered in the UK by Copyright and Database Right (which applies to which bits is a different story), so we assume the data is “owned”. The database contains more than 28 million postcodes, so publishing my own postcode could not be considered an infringement in any meaningful way; publishing the data for all the addresses within a single postcode would also be unlikely to infringe, as it’s such a small fraction of the total data.

So I might publish some data like this (the format is Turtle, a way to write down Linked Data):

<http://someorg.example.com/myaddress/postcode>
  a paf:Postcode;
  paf:correctForm "B37 7YB";
  paf:normalisedForm "b377yb";
  geo:long -1.717336;
  geo:lat 52.467971;
  paf:ordnanceSurveyCode "SP1930085600";
  paf:gridRefEast 41930;
  paf:gridRefNorth 28560;
  paf:containsAddress <http://someorg.example.com/myaddress/postcode#1>;
  paf:googleMaps <http://maps.google.co.uk/maps?hl=en&source=hp&q=B377YB&ie=UTF8&hq=&hnear=Birmingham,+West+Midlands+B377YB,+United+Kingdom&gl=uk&ei=Zs8HS_KVNNOe4QbIpITTCw&ved=0CAgQ8gEwAA&t=h&z=16>.

<http://someorg.example.com/myaddress/postcode#1>
  a paf:Address;
  paf:organisationName "Talis Group Ltd";
  paf:dependentThoroughfareName "Knight's Court";
  paf:thoroughfareName "Solihull Parkway";
  paf:dependentLocality "Birmingham Business Park";
  paf:postTown "Birmingham";
  paf:postcode <http://someorg.example.com/myaddress/postcode>.

I’ve probably made some mistakes in terms of the PAF properties, as it’s a long time since I worked with PAF, but it’s clear enough to make my point. So, I publish this file on my own website as a way of describing the office where I work. That’s not an infringement of any rights in the data, and a perfectly legitimate thing to do with the address.

As the web of Linked Data takes off, and the same schemas become commonly used for this kind of thing, we start to build a substantial copy of the original database. This time, however, the database is not on a single server as ErnestMarples.com was, but spread across the web of Linked Data. There is no single infringing organisation who can be made to take the data down again. If I were responsible for the revenue brought in from sales of PAF licenses this would be a concern, but not a major one, as the distributed nature means the data can’t be queried.

The distributed nature of the web means the web itself can’t be queried, but we already know how to address that technically – we build large aggregations of the web, index them and call them search engines. That is also already happening for the Linked Data web. As with the web of documents, some people are attempting to create general purpose search engines over the data and others specialised search engines for specific areas of interest. It’s easy to see that areas of value, such as address data, are likely to attract specialist attention.

Here though, while the individual documents do not infringe, an aggregate of many of them would start to infringe. The defence of crowd-sourcing used in other contexts (such as Open Street Map) does not apply here as the PAF is not factual data – the connection between a postcode and address can only have come from one place, PAF, and is owned by Royal Mail however it got into the database.

So, with the aggregate now infringing, it can be taken down through request, negotiation or due process. The obvious answer to that might be for the aggregate to hold only the URIs, not the actual values of the data. This would leave it without a useful search mechanism, however. That could be addressed by having a well-known URI structure, as I used in the example data. We might have:

<http://addresssearch.example.net/postcodes/B37_7YB>
  owl:sameAs <http://someorg.example.com/myaddress/postcode> .

This gets around the data issue, but the full list of postcodes itself may be considered infringing and they are clearly visible in the URIs. Taking them out of the URIs would leave no mechanism to go from a known postcode to data about it and addresses within it, the main use case for PAF. It doesn’t take long to make the link with other existing technology though, an area where we want to match a short string we know with an original short string, but cannot make the original short string available in clear text… Passwords.

Password storage uses one-way hashes so that the password is not available in its original form once stored, but requests to login can be matched by putting the attempted password through the same hash. Hashes are commonplace in the P2P world for a variety of functions, so are well-known and could be applied by an aggregator, or co-operative group, to solve this problem.

If I push the correct form of “B37 7YB” through MD5, I get “bdd2a7bf68119d001ebfd7f86c13a4c7”, but there is no way to get back from that to the original postcode. So a service that uses hashed URIs would not be publishing the postcode list in a directly useable form, but could be searched easily by anyone knowing how the URIs were structured and hashed.

<http://addresssearch.example.net/postcodes/bdd2a7bf68119d001ebfd7f86c13a4c7>
  owl:sameAs <http://someorg.example.com/myaddress/postcode> .
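
For illustration, a short Ruby sketch of how an aggregator might derive such hashed URIs, using the URI pattern from the example above (the hash in the published data would be whatever MD5 actually produces for the postcode string):

# Sketch only: deriving a hashed postcode identifier.
require 'digest/md5'

postcode = 'B37 7YB'
hashed   = Digest::MD5.hexdigest(postcode)
puts "http://addresssearch.example.net/postcodes/#{hashed}"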

Of course, a specialist address service, advertising address lookups and making money, could still be considered infringing by the courts regardless of the technical mechanisms employed, but what of more general aggregations or informal co-operative sites? sameAs, a site for sharing sameAs statements, already provides the infrastructure that would be needed for a service like this, and the ease with which sites that do this can be set up and mirrored would make it hard to defend against using the law, in the same way that torrent listing sites are difficult for the film and music industries to stop. Regardless of the technical approach and the degree to which it provides legal and practical defence, this is still publishing data in a way that is against the spirit of Copyright and Database Right.

The situation I describe above is one where many, many organisations and individuals are publishing data in a consistent form, and that is likely to happen over the next few years for common data like business addresses and phone numbers, but much less likely for less mainstream data. The situation with addresses is one where there is clearly a reason to publish your individual data other than to be part of the whole; in more contrived cases, where the only reason to publish is to contribute to a larger aggregate, the notion of fair use for a small amount of the data may not stand up. That is, over the longer term, address data will not be crowd-sourced – people deliberately creating a dataset – but web-sourced – the data will be on the web anyway.

We can see from this small example that the kinds of data that may be vulnerable to distributed publishing in this way are wide-ranging. The Dewey Decimal Classification scheme used by libraries, telephone directories (with lookups in both directions), Gracenote’s music database and legal case numbering schemes could all be published this way. The problem, of course, is that the data has to be distributed sufficiently that no individual host can be seen as infringing. For common data this will happen naturally, but the co-ordination overhead for a group trying to do this pro-actively would be significant; though that might be solved easily by someone else thinking about how to do this.

As I see it a small group of unfunded individuals would have difficulty achieving the level of distribution necessary to be defensible. Though could 1% of a database be considered infringing? Could/Would 100 people use their existing domains to publish 1% of the PAF each? Would 200 join in for ½% each? Then, of course, there are the usual problems of trust, timeliness and accuracy associated with non-authoritative publication.

These problems notwithstanding, Linked Data has the potential to provide a global database at web-scale. Ways of querying that web of data will be invented; what I sketch out above is just one very basic approach. The change the web of data brings has huge implications for data owners and intellectual property rights in data.

Why hash tags are broken, and ideas for what to do instead.

I was at Moseley Bar Camp last Sunday and there were some great sessions. Andy Mabbett stood up to lead a discussion entitled Let’s Play Tag: recent developments and emerging issues in the use of tagging for added semantic richness.

Andy was looking for discussion on how to solve the problem of ambiguity in hash tags – a popular technique for categorising community tweets on twitter. His example is classic event tagging: the tag for the event was #mbcamp, which works fine for the duration of a Sunday afternoon event, but what if you want tags to be more enduring?

Andy took us step-by-step through the issue of ambiguity of usernames as tags on twitter and flickr and described some of the issues of differing tag normalisation rules.

Andy also asked why we tag?

  • To add semantic richness?
  • To help your friends find stuff?
  • To help machines in 100 years find stuff?
  • Don’t know

I tag for all of those reasons, but not on twitter. On twitter I use hash tags to contribute to an in-the-moment conversation that’s happening at a particular event or around something topical.

Andy’s issue, then, is with the value of these tags longer term and on more enduring stuff like blog posts, photos on flickr and so on. Perhaps 100 years might be pushing it, but it’s worth thinking about.

The problem with hash tags comes from the tension between finding something specific enough for the moment, something short enough not to use up too many of the 140 characters, and something easy to remember. That’s two forces pulling one way (shorter) and only one pulling the other.

The shorter the tag, the easier it is to remember and to type, and the fewer characters it uses up, but it also becomes more likely to clash with others. Perhaps some mainstream trends might get away with very short tags, I thought. #fb, for example, surely means facebook, but looking at its use, the references to facebook are far outweighed by the noise.

So, twitter’s 140 character limit and the profusion of clients mean we can only have short, easy-to-remember text tags, but the need for disambiguation and to be more specific means we need something longer.

We could solve the ambiguity problem by using something like a GUID, but that’s not easy to remember or type, and is generally quite long. The length issue could be solved by encoding it using unicode characters. Twitter counts multi-byte UTF8 characters as single characters, which is correct, and this opens up some interesting unique tags for those willing to forego easy typing.

By long I mean cf629dc3-d425-4707-8119-1f35d35d7687, which is a fairly typical GUID and is 36 characters long. That’s too long if you only have 140 characters to play with. The length comes from the need to encode it as ASCII. Twitter, where our length obsession comes from, doesn’t require characters to be ASCII. The 140 character limit is for 140 UTF8 characters, so we can use a much greater range of characters to represent the same degree of uniqueness in a shorter UTF8 string.

UTF8 isn’t ideal as a starting point, though, as the number of bytes per character varies. The unicode definition uses nice simple 2-byte indexes, so we map each group of four hex characters from the GUID to a single unicode character, then use the UTF8 encoding of those to write it down. By using unicode and UTF8 the GUID becomes just a handful of characters — just 8 in this case.

cf62 콢, 9dc3 鷃, d425 퐥, 4707 䜇, 8119 脙, 1f35 ἵ, d35d 퍝, 7687 皇

This gives us a tag of #콢鷃퐥䜇脙ἵ퍝皇 which is not easy to type, would be difficult for many to visually identify and could, for all I know, be extremely offensive to those who read CJK, Hangul or Greek. I may have got lucky with that GUID too, there may be GUIDs that don’t produce valid unicode pairs.
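
For the curious, a rough Ruby sketch of that encoding (it will fail for GUIDs containing four-hex-digit groups that aren’t valid code points, such as the surrogate range):

# Sketch only: each group of four hex digits from the GUID becomes one unicode
# code point, written out as UTF-8. Not every group is a valid code point, so
# this raises an error for some GUIDs.
guid = 'cf629dc3-d425-4707-8119-1f35d35d7687'
code_points = guid.delete('-').scan(/.{4}/).map { |hex| hex.to_i(16) }
puts '#' + code_points.pack('U*')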

But, as it’s a GUID, it gives a very high confidence that it is unique, it’s only 8 characters long and it works as a unicode tag on Flickr and a unicode tag on Hashtags. Just don’t look at the raw URLs in the source of the page…

What we lose with that approach is a good deal of ease-of-use. I certainly wouldn’t try this technique at an event.

If you’re prepared to lose a little usability, maybe giving people an easy place to grab a copy/paste version of the tag, then you could produce something more easily readable, if not easy to type: #dɯɐɔqɯ for example. I might be tempted to do that, or to add a graphic symbol or something.

There’s something else that nags at me about hash tags, though. They’re really not very webby. You rely on search and on hashtags.org and other specific tools to make sense of them. They can be easily abused, as Habitat showed recently.

So are there other ways to think about tagging? Ways that work with the web rather than just on the web. Examples from those applications where the 140 character limit does not apply? Blog posts, web pages, flickr images and so on?

What if we decided that our requirements for tagging were:

  1. A very high degree of uniqueness
  2. Anyone can get information about the tag easily
  3. Spam and content visible on the tag controlled by the tag owner
  4. That the tag can be enduring
  5. That the tag can be used anywhere on the web easily
  6. That content using the tag can be found with search
  7. That content using the tag can be found without search
  8. That no particular service or piece of software is necessary

In its essence, tagging is about saying that this comment, blog post or image is about this event, concept, product etc. In the blogging world it’s very common to say this post is about the content in this other post. We do that through trackbacks and through simple links. Many blogs accept trackbacks and look at the referring page information so that they can provide links, alongside comments, to other posts referring to them.

A similar thing happens with Google’s PageRank algorithm. Words used in links to a page, as well as the content of the referring page, contribute to the way a page is indexed.

The Semantic Web bases everything on URIs (the difference between URI and URL is not important here). If you want to give something a name you don’t pick a word, you use a URI.

I wonder if we could use URIs as tags? And how that would meet the needs above. Say we were to use http://wxwm.org.uk/moseleybarcamp/2009/June to mean the event that happened last weekend.

It has a very high degree of uniqueness, so it meets our first requirement. It can be put straight into a browser and can provide a page giving details of the event, so it’s easy for anyone to get information about the tag. The page at that address can be as clever, or as dumb, as it likes about showing things that link to it – so tag spam can be removed. The link is under the control of the domain owner, so can be as enduring as you want to make it. Almost everywhere on the web allows you to post links, so it’s easy to use. Links to a specific URL can be easily searched for in Google and other search engines, and in Flickr and Twitter. Most browsers will send referring page information when requesting the URL, so content can be tracked without search – this means you can find out about unindexed and intranet sites referencing the tag. The URL can be a static page or a script; it can monitor referrers and spam filter – or not. There is no centralised service needed, nor any specific software.
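
As a sketch of how simple the page behind such a tag URI could be (invented event details; the referrer logging is what makes 'found without search' work):

# Sketch only: a tag page that records the pages linking to it via the
# Referer header, so tagged content can be found without a search engine.
require 'rubygems'
require 'rack'
require 'thin'

tag_page = lambda do |env|
  referrer = env['HTTP_REFERER']
  File.open('referrers.log', 'a') { |f| f.puts(referrer) } if referrer
  [200, { 'Content-Type' => 'text/html' },
   ['<h1>Moseley Bar Camp, June 2009</h1><p>Pages using this tag show up in the referrer log.</p>']]
end

Rack::Handler::Thin.run(tag_page, :Port => 4000)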

Oh, and it could easily be made to work as Linked Data, the pattern for publishing data on the semantic web, to provide machine-readable information about the event and the conversation happening around it…

I think that only leaves the issue of URI length. I can’t get close to the 8 characters of the GUID, or the 6 of mbcamp, but using bit.ly I can make a memorable short URL such as http://bit.ly/utf8tag that redirects to a much longer one, and as bit.ly don’t re-use URLs the bit.ly link remains as unique and almost as enduring (subject to bit.ly’s survival) as your own.

Scripting and Development for the Semantic Web (SFSW2009)

The following papers have been accepted for SFSW2009:

from Scripting and Development for the Semantic Web (SFSW2009).

Looks like a great line-up. As neither Nad, Jeni nor I are able to attend, our paper will be presented (briefly) by Chris Clarke.

Panlibus » Blog Archive » Library of Congress launch Linked Data Subject Headings

Agree with this summary from Richard:

On the surface, to those not yet bought in to the potential of Linked Data, and especially Linked Open Data, this may seem like an interesting but not necessarily massive leap forward. I believe that what underpins the fairly simple functional user interface they provide will gradually become core to bibliographic data becoming a first-class citizen in the web of data.

Overnight this uri ‘http://id.loc.gov/authorities/sh85042531’ has now become the globally available, machine and human readable, reliable source for the description for the subject heading of ‘Elephants’ containing links to its related terms (in a way that both machines and humans can navigate). This means that system developers and integrators can rely upon that link to represent a concept, not necessarily the way they want to [locally] describe it. This should facilitate the ability for disparate systems and services to simply share concepts and therefore understanding – one of the basic principles behind the Semantic Web.

from Panlibus » Blog Archive » Library of Congress launch Linked Data Subject Headings.

Great to see LoC doing this stuff and getting it out there.

Domain Specific Editing Interface using RDFa and jQuery

I wrote back in January about Resource Lists, Semantic Web, RDFa and Editing Stuff. This was based on work we’d done in Talis Aspire.

Several people suggested this should be written up as a fuller paper, so Nad, Jeni and I wrote it up as a paper for the SFSW 2009 workshop. It’s been accepted and will be published there, but unfortunately due to work priorities that have come up we won’t be able to attend.

A draft of the paper is here: A Pattern for Domain Specific Editing Interfaces Using Embedded RDFa and HTML Manipulation Tools.

The camera ready copy will be published in the conference proceedings. Feedback welcomed.

Coghead closes for business

With the announcement that Coghead, a really very smart app development platform, is closing its doors, it’s worth thinking about how you can protect yourself from the inevitable disappearance of a service.

Of course, there are all the obvious business type due diligence activities like ensuring that the company has sufficient funds, understanding how your subscription covers the cost (or doesn’t) of what you’re using and so on, but all these can do is make you feel more comfortable – they can’t provide real protection. To be protected you need 4 key things – if you have these 4 things you can, if necessary, move to hosting it yourself.

  1. URLs within your own domain.

    Both you and your customers will bookmark parts of the app, email links, embed links in documents, build excel spreadsheets that download the data and so on and so on. You need to control the DNS for the host that is running your tenancy in the SaaS service. Without this you have no way to redirect your customers if you need to run the software somewhere else.

    This is, really, the most important thing. You can re-create the data and the content, you can even re-write the application if you have to, but if you lose all the links then you will simply disappear.

  2. Regular exports of your data.

    You may not get much notice of changes in a SaaS service. When you find they are having outages, going bust or simply disappearing is not the time to work out how to get your data back out. Automate a regular export of your data so you know you can’t lose too much. Coghead allowed for that and are giving people time to get their data out.

  3. Regular exports of your application.

    Having invested a lot in working out the right processes, rules and flows to make best use of your app, you want to be able to export that too. This needs to be exportable in a form that can be re-imported somewhere else. Coghead hasn’t allowed for this, meaning that Coghead customers will have to re-write their apps based on a human reading of the Coghead definitions. Which brings me on to my next point…

  4. The code.

    You want to be able to take the exact same code that was running SaaS and install it on your own servers, install the exported application and data and update your DNS. Without the code you simply can’t do that. Making the code open-source may be a problem as others could establish equivalent services very quickly, but the software industry has had ways to deal with this problem through escrow and licensing for several decades. The code in escrow would be my absolute minimum.

SaaS and PaaS (Platform as a Service) providers promote a business model based on economies of scale, lower cost of ownership, improved availability, support and community. These things are all true even if they meet the four needs above – but the priorities for these needs are with the customer, not with the provider. That’s because meeting these four needs makes the development of a SaaS product harder, and it also makes it harder for any individual customer to get set up. We certainly don’t meet all four with our SaaS and PaaS offerings at work yet, but I am confident that we’ll get there – and we’re not closing our doors any time soon 😉

Ruby Mock Web Server

I spent the afternoon today working with Sarndeep, our very smart automated test guy. He’s been working on extending what we can do with rspec to cover testing of some more interesting things.

Last week he and Elliot put together a great set of tests using MailTrap to confirm that we’re sending the right mails to the right addresses under the right conditions. Nice tests to have for a web app that generates email in a few cases.

This afternoon we were working on a mock web server. We use a lot of RESTful services in what we’re doing, and being able to test our app for its handling of error conditions is important. We’ve had a static web server set up for a while, with particular requests and responses configured in it, but we’ve not really liked it because the responses are all separate from the tests and the server is another apache vhost that has to be set up when you first check out the app.

So, we’d decided a while ago that we wanted a little Ruby-based web server that we could control from within the rspec tests, and that’s what we built a first cut of this afternoon.

require File.expand_path(File.dirname(__FILE__) + "/../Helper")
require 'rubygems'
require 'rack'
require 'thin'

# A tiny Rack application that replays canned responses. Expectations are
# registered as [env, response] pairs; an incoming request is compared against
# only the environment entries given in the expectation, and the first match
# is returned (then removed, so each expectation is used at most once).
class MockServer
  def initialize
    @expectations = []
  end

  # env is a partial Rack environment hash, e.g. { 'REQUEST_METHOD' => 'GET' };
  # response is a standard Rack triple of [status, headers, body].
  def register(env, response)
    @expectations << [env, response]
  end

  def clear
    @expectations = []
  end

  def call(env)
    @expectations.each_with_index do |(expectation_env, response), index|
      # Compare only the keys specified in the expectation; everything else
      # in the real request environment is ignored.
      if expectation_env.all? { |key, value| env[key] == value }
        @expectations.delete_at(index)
        return response
      end
    end
    # Nothing matched: return a 404 rather than nil, which Rack can't handle.
    [404, { 'Content-Type' => 'text/plain' }, ['No matching expectation registered']]
  end
end

mock_server = MockServer.new
mock_server.register({ 'REQUEST_METHOD' => 'GET' },
                     [200, { 'Content-Type' => 'text/plain', 'Content-Length' => '11' }, ['Hello World']])
mock_server.register({ 'REQUEST_METHOD' => 'GET' },
                     [200, { 'Content-Type' => 'text/plain', 'Content-Length' => '11' }, ['Hello Again']])
Rack::Handler::Thin.run(mock_server, :Port => 4000)

The MockServer implements the Rack interface so it can work within the Thin web server from inside the rspec tests. The expectations are registered with the MockServer and the first parameter is simply a hashtable in the same format as the Rack Environment. You only specify the entries that you care about; any that you don’t specify are not compared with the request. Expectations don’t have to occur in order (except where the environment you give is ambiguous, in which case they match first in, first matched).
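
As a hypothetical example of how this gets used from a spec (invented names and path; it assumes the mock_server above has already been started on port 4000 in a background thread), registering an error response to exercise the app’s error handling looks something like this:

# Hypothetical usage from a spec: have the mock return a 500 for one request
# so we can assert that the app under test degrades gracefully.
mock_server.register(
  { 'REQUEST_METHOD' => 'GET', 'PATH_INFO' => '/items/42' },
  [500, { 'Content-Type' => 'text/plain', 'Content-Length' => '21' }, ['Internal Server Error']]
)
# ...then exercise the code under test, which calls http://localhost:4000/items/42.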

As a first venture into writing more in Ruby than an rspec test, I have to say I found it pretty sweet – there was only one issue with getting at array indices that tripped me up, but Ross helped me out with that and it was pretty quickly sorted.

Plans for this include putting in a verify() and making it thread safe so that multiple requests can come in parallel. Any other suggestions (including improvements on my non-idiomatic code) very gratefully received.