Internet Technical
Distributed, Linked Data has significant implications for Intellectual Property Rights in Data.
What P2P networks have done for distribution of digital media is phenomenal. It is possible, easy even, to get almost any TV show, movie, track or album you can think of by searching one of the many torrent sites. As fast as the media industry take down one site through legal action another has appeared to take its place.
I don’t want to discuss the legal, moral or social implications of this, but discuss how the internet changes the nature of our relationship with media – and data. The internet is a great big copying machine, true enough, but it’s also a fabric that allows mass co-operation. It’s that mass peer-to-peer co-operation that makes so much content available for free; content that is published freely by its creator as well as infringing content.
Sharing of copyrighted content is always likely to be infringing on p2p networks, regardless of any tricks employed, but for data the situation may be different and the Linked Data web has real implications in this space.
Taking the Royal Mail’s Postcode Address File as my working example, because it’s been in the news recently as a result of the work done by ErnestMarples.com, I’ll attempt to show how the Linked Data web changes the nature of data publishing and intellectual property.
First, in case you’re not familiar, a quick introduction to Linked Data. In Linked Data we use http web addresses (which we call URIs) not only to refer to documents containing data but also to refer to real-world things and abstract concepts. We then combine those URIs with properties and values to make statements about the things the URIs represent. So, I might say that my postcode is referred to by the URI http://someorg.example.com/myaddress/postcode. Asking for that URI in the browser would then redirect you to a document containing data about my postcode, maybe sending you to http://someorg.example.com/myaddress/postcode.rdf if you asked for data and http://someorg.example.com/myaddress/postcode.html if you asked for a web page (that’s called content negotiation). All of that works today and organisations like the UK Government, BBC, New York Times and others are publishing data this way.
Also worth noting is the distinction between Linked Data (the technique described above) and Linked Open Data, the output of the W3C’s Linking Open Data project. An important distinction as I’m talking about how commercially owned and protected databases may be disrupted by Linked Data, whereas Linked Open Data is data that is already published on the web under an Open license.
Now, Royal Mail own the Postcode Address File, and other postcode data such as geo co-ordinates. They are covered in the UK under Copyright and Database Right (which for which bits is a different story) so we assume it is “owned”. The database contains more than 28 million postcodes, so publishing my own postcode could not be considered an infringement in any meaningful way, publishing the data for all the addresses within a single postcode would also be unlikely to infringe as it’s such a small fraction of the total data.
So I might publish some data like this (the format is Turtle, a way to write down Linked Data)
<http://someorg.example.com/myaddress/postcode> a paf:Postcode; paf:correctForm "B37 7YB"; paf:normalisedForm "b377yb"; geo:long -1.717336; geo:lat 52.467971; paf:ordnanceSurveyCode "SP1930085600"; paf:gridRefEast 41930; paf:gridRefNorth 28560; paf:containsAddress <http://someorg.example.com/myaddress/postcode#1>; paf:googleMaps <http://maps.google.co.uk/maps?hl=en&source=hp&q=B377YB&ie=UTF8&hq=&hnear=Birmingham,+West+Midlands+B377YB,+United+Kingdom&gl=uk&ei=Zs8HS_KVNNOe4QbIpITTCw&ved=0CAgQ8gEwAA&t=h&z=16>. <http://someorg.example.com/myaddress/postcode#1> a paf:Address; paf:organisationName "Talis Group Ltd"; paf:dependentThoroughfareName "Knight's Court"; paf:thoroughfareName "Solihull Parkway"; paf:dependentLocality "Birmingham Business Park"; paf:postTown "Birmingham"; paf:postcode <http://someorg.example.com/myaddress/postcode>.
I’ve probably made some mistakes in terms of the PAF properties as it’s a long time since I worked with PAF, but it’s clear enough to make my point with. So, I publish this file on my own website as a way of describing the office where I work. That’s not an infringement of any rights in the data and perfectly legitimate thing to do with the address.
As the web of Linked Data takes off, and the same schema become commonly used for this kind of thing, we start to build a substantial copy of the original database. This time, however, the database is not on a single server as ErnestMarples.com was, but spread across the web of Linked Data. There is no single infringing organisation who can be made to take the data down again. If I were responsible for the revenue brought in from sales of PAF licenses this would be a concern, but not major as the distributed nature means it can’t be queried.
The distributed nature of the web means the web itself can’t be queried, but we already know how to address that technically – we built large aggregations of the web, index them and call them search engines. That is also already happening for the Linked Data web. As with the web of documents, some people are attempting to create general purpose search engines over the data and others specialised search engines for specific areas of interest. It’s easy to see that areas of value, such as address data, are likely to attract specialist attention.
Here though, while the individual documents do not infringe, an aggregate of many of them would start to infringe. The defence of crowd-sourcing used in other contexts (such as Open Street Map) does not apply here as the PAF is not factual data – the connection between a postcode and address can only have come from one place, PAF, and is owned by Royal Mail however it got into the database.
So, with the aggregate now infringing it can be taken down through request, negotiation or due process. The obvious answer to that might be for the aggregate to hold the URIs only, not the actual values of the data. This would leave it without a useful search mechanism, however. This could be addressed by having a well-known URI structure as I used in the example data. We might have
<http://addresssearch.example.net/postcodes/B37_7YB> owl:sameAs <http://someorg.example.com/myaddress/postcode>
This gets around the data issue, but the full list of postcodes itself may be considered infringing and they are clearly visible in the URIs. Taking them out of the URIs would leave no mechanism to go from a known postcode to data about it and addresses within it, the main use case for PAF. It doesn’t take long to make the link with other existing technology though, an area where we want to match a short string we know with an original short string, but cannot make the original short string available in clear text… Passwords.
Password storage uses one-way hashes so that the password is not available in its original form once stored, but requests to login can be matched by putting the attempted password through the same hash. Hashes are commonplace in the P2P world for a variety of functions, so are well-known and could be applied by an aggregator, or co-operative group, to solve this problem.
If I push the correct form of “B37 7YB” through MD5, I get “bdd2a7bf68119d001ebfd7f86c13a4c7″, but there is no way to get back from that to the original postcode. So a service that uses hashed URIs would not be publishing the postcode list in a directly useable form, but could be searched easily by anyone knowing how the URIs were structured and hashed.
<http://addresssearch.example.net/postcodes/bdd2a7bf68119d001ebfd7f86c13a4c7> owl:sameAs <http://someorg.example.com/myaddress/postcode>
Of course, a specialist address service, advertising address lookups and making money could still be considered as infringing by the courts regardless of the technical mechanisms employed, but what of more general aggregations or informal co-operative sites? sameAs, a site for sharing sameAs statements, already provides the infrastructure that would be needed for a service like this and the ease with which sites that do this can be setup and mirrored would make it hard to defend against using the law in the same way that torrent listing sites are difficult for the film and music industries to stop. Regardless of the technical approach and the degree to which that provide legal and practical defence, this is still publishing data in a way that is against the spirit of Copyright and Database Right.
The situation I describe above is one where many, many organisations and individuals are publishing data in a consistent form and that is likely to happen over the next few years for common data like business addresses and phone numbers, but much less likely for less mainstream data. The situation with addresses is one where it is clear there is a reason to publish your individual data other than to be part of the whole, in more contrived cases where the only reason to publish is to contribute to a larger aggregate the notion of fair-use for a small amount of the data may not stand up. That is, over the longer term, address data will not be crowd-sourced – people deliberately creating a dataset – but web-sourced – the data will be on the web anyway.
We can see from this small example that the kinds of data that may be vulnerable to distributed publishing in this way are wide-ranging. The Dewey Decimal Classification scheme used by libraries, Telephone directories (with lookups in both directions), Gracenote’s music database, Legal case numbering schemes, could all be published this way. The problem, of course, is that the data has to be distributed sufficiently that no individual host can be seen as infringing. For common data this will happen naturally, but the co-ordination overhead for a group trying to do this pro-actively would be significant; though that might be solved easily by someone else thinking about how to do this.
As I see it a small group of unfunded individuals would have difficulty achieving the level of distribution necessary to be defensible. Though could 1% of a database be considered infringing? Could/Would 100 people use their existing domains to publish 1% of the PAF each? Would 200 join in for ½% each? Then, of course, there are the usual problems of trust, timeliness and accuracy associated with non-authoritative publication.
These problems not withstanding, Linked Data has the potential to provide a global database at web-scale. Ways of querying that web of data will be invented, what I sketch out above is just one very basic approach. The change the web of data brings has huge implications for data owners and intellectual property rights in data.
Why hash tags are broken, and ideas for what to do instead.
I was at Moseley Bar Camp last Sunday and there were some great sessions. Andy Mabbett stood up to lead a discussion entitled Let’s Play Tag: recent developments and emerging issues in the use of tagging for added semantic richness.
Andy was looking for discussion on how to solve the problem of ambiguity in hash tags – a popular technique for categorising community tweets on twitter. His example is classic event tagging, the tag for the event was #mbcamp which works fine for the duration of a Sunday afternoon event, but what if you want tags to be more enduring?
Andy took us step-by-step through the issue of ambiguity of usernames as tags on twitter and flickr and described some of the issues of differing tag normalisation rules.
Andy also asked why we tag?
- To add semantic richness?
- To help your friends find stuff?
- To help machines in 100 years find stuff
- Don’t know
I tag for all of those reasons, but not on twitter. On twitter I use hash tags to contribute to an in the moment conversation that’s happening at a particular event or on something topical.
Andy’s issue, then, is with the value of these tags longer term and on more enduring stuff like blog posts, photos on flickr and so on. Perhaps 100 years might be pushing it, but it’s worth thinking about.
The problem with hash tags comes from the tension between finding something specific enough for the moment, something short enough to not use up too many of the 140 characters and something easy to remember. That’s two forces pulling one-way (shorter) and only one pulling the other.
The shorter the tag goes the easier it is to remember and to type, and the fewer character it uses up, but it also becomes more likely to clash with others. Perhaps some mainstream trends might get away with very short tags, I thought. #fb for example means facebook, surely, but looking at the use of it apparently the references to facebook are far outweighed by the noise.
So, twitter’s 140 character limit and the profusion of clients means we can only have short, easy to remember text tags, but the need for disambiguation and to be more specific means we need something longer.
We could solve the ambiguity problem by using something like a guid, but that’s not easy to remember or type, and is generally quite long. The length issue could be solved by encoding it using unicode characters. Twitter counts multi-byte UTF8 characters as single characters, which is correct, and this opens up some interesting unique tags for those willing to forego the easy typing.
By long I mean cf629dc3-d425-4707-8119-1f35d35d7687 which is a fairly typical GUID and is 36 character long. That’s too long if you only have 140 characters to play with. The length comes from the need to encode it as ASCII. Twitter, where our length obsession comes from, doesn’t require characters to be ASCII. The 140 character limit is for 140 UTF8 characters, so we can use a much greater range of characters to represent the same degree of uniqueness in a shorter UTF8 string.
UTF8 isn’t ideal as a starting point, though, as the number of bytes per character varies. The unicode definition uses nice simple 2 byte indexes, so we match 4 ASCII characters from the GUID to a unicode character, then use the UTF8 encoding for those to write it down. By using unicode and UTF8 it becomes just a handful of characters, just 8 for this GUID.
cf62 콢, 9dc3 鷃, d425 퐥, 4707 䜇, 8119 脙, 1f35 ἵ, d35d 퍝, 7687 皇
This gives us a tag of #콢鷃퐥䜇脙ἵ퍝皇 which is not easy to type, would be difficult for many to visually identify and could, for all I know, be extremely offensive to those who read CJK, Hangul or Greek. I may have got lucky with that GUID too, there may be GUIDs that don’t produce valid unicode pairs.
But, as it’s a GUID it gives a very high confidence that is unique, it’s only 8 characters long and works as a unicode tag on Flickr and a unicode tag on Hashtags. Just don’t look at the raw URLs in the source of the page…
What we lose with that approach is a good deal of ease-of-use. I certainly wouldn’t try this technique at an event.
If you’re prepared to lose a little usability, maybe giving people an easy place to grab a copy/paste version of the tag then you could produce something more easily readable, if not easy to type: #dɯɐɔqɯ for example. I might be tempted to do that, or add a graphic symbol or something.
There’s something else that nags at me about hash tags, though. They’re really not very webby. You rely on search and on hashtags.org and other specific tools to make sense of them. They can be easily abused, as Habitat showed recently.
So are there other ways to think about tagging? Ways that work with the web rather than just on the web. Examples from those applications where the 140 character limit does not apply? Blog posts, web pages, flickr images and so on?
What if we decided that our requirements for tagging were:
- A very high degree of uniqueness
- Anyone can get information about the tag easily
- Spam and content visible on the tag controlled by the tag owner
- That the tag can be enduring
- That the tag can be used anywhere on the web easily
- That content using the tag can be found with search
- That content using the tag can be found without search
- That no particular service or piece of software is necessary
In it’s essence, tagging is about saying this comment, blog post or image is about this event, concept, product etc. In the blogging world it’s very common to say this post is about the content in this other post. We do that through trackbacks and through simple links. Many blogs accept trackbacks and look at the referring page information so that they can provide links, alongside comments, to other posts referring to them.
A similar things happens with Google’s PageRank algorithm. Words used in links to a page, as well as the content of the referring page, contribute to the way a page is indexed.
The Semantic Web bases everything on URIs (the difference between URI and URL is not important here). If you want to give something a name you don’t pick a word, you use a URI.
I wonder if we could use URIs as tags? And how that would meet the needs above. Say we were to use http://wxwm.org.uk/moseleybarcamp/2009/June to mean the event that happened last weekend.
It has a very high degree of uniqueness, so it meets our first requirement. It can be put straight into a browser and can provide a page giving details of the event, so it’s easy for anyone to get information about the tag. The page at that address can be as clever, or as dumb, as it likes about showing things that link to it – so tag spam can be removed. The link is under control of the domain owner, so can be as enduring as you want to make it. Almost everywhere on the web allows you to post links, so it’s easy to use. Links to a specific URL can be easily searched for in Google and other search engines, and in Flickr and Twitter. Most browsers will send referring page information when requesting the URL, so content can be tracked without search – this means you can find out about unindexed and intranet sites referencing the tag. The URL can be a static page, or a script, it can monitor referrers and spam filter – or not. There is not centralised service needed nor any specific software.
Oh, and it could easily be made to work as Linked Data, the pattern for publishing data on the semantic web, to provide machine-readable information about the event and the conversation happening around it…
I think that only leaves the issue of URI length. I can’t get close to the 8 characters of the guid, or the 6 of mbcamp, but using bit.ly I can make a memorable short URL such as http://bit.ly/utf8tag that redirects to a much longer one, and as bit.ly don’t re-use URLs the bit.ly link remains as unique and almost as enduring (subject to bit.ly’s survival) as your own.
data and anti-data
php -r “include ‘moriarty/moriarty.inc.php’; include ‘moriarty/changeset.class.php’; \$data=file_get_contents(‘megarecord.rdf.xml’); \$cs = new ChangeSet(array(‘before’=> \$data)) ; echo \$cs->to_rdfxml();” > removal_changeset.rdf.xml
Scripting and Development for the Semantic Web (SFSW2009)
The following papers have been accepted for SFSW2009:
- Christoph Lange : Krextor — An Extensible XML->RDF Extraction Framework
- Eugenio Tacchini, Andreas Schultz and Christian Bizer : Experiments with Wikipedia Cross-Language Data Fusion
- Jouni Tuominen, Tomi Kauppinen, Kim Viljanen and Eero Hyvönen : Ontology-Based Query Expansion Widget for Information Retrieval
- Laura Dragan, Knud Möller, Siegfried Handschuh, Oszkar Ambrus and Sebastian Trueg : Converging Web and Desktop Data with Konduit
- Mariano Rico, David Camacho and Oscar Corcho : Macros vs. scripting in VPOET
- Norman Gray, Tony Linde and Kona Andrews : SKUA — Retrofitting Semantics
- Pierre-Antoine Champin : Tal4Rdf: lightweight presentation for the Semantic Web
- Rob Styles, Nadeem Shabir and Jeni Tennison : A Pattern for Domain Specific Editing Interfaces Using Embedded RDFa and HTML Manipulation Tools
- Stéphane Corlosquet, Richard Cyganiak, Axel Polleres and Stefan Decker : RDFa in Drupal: Bringing Cheese to the Web of Data
from Scripting and Development for the Semantic Web (SFSW2009).
Looks like a great line-up. As neither Nad, Jeni nor I are able to attend our paper will be presented (briefly) by Chris Clarke.
Panlibus » Blog Archive » Library of Congress launch Linked Data Subject Headings
Agree with this summary from Richard
On the surface, to those not yet bought in to the potential of Linked Data, and especially Linked Open Data, this may seem like an interesting but not necessarily massive leap forward. I believe that what underpins the fairly simple functional user interface they provide will gradually become core to bibliographic data becoming a first-class citizen in the web of data.
Overnight this uri ‘http://id.loc.gov/authorities/sh85042531’ has now become the globally available, machine and human readable, reliable source for the description for the subject heading of ‘Elephants’ containing links to its related terms (in a way that both machines and humans can navigate). This means that system developers and integrators can rely upon that link to represent a concept, not necessarily the way they want to [locally] describe it. This should facilitate the ability for disparate systems and services to simply share concepts and therefore understanding – one of the basic principles behind the Semantic Web.
from Panlibus » Blog Archive » Library of Congress launch Linked Data Subject Headings.
Great to see LoC doing this stuff and getting it out there.
Domain Specific Editing Interface using RDFa and jQuery
I wrote back in January about Resource Lists, Semantic Web, RDFa and Editing Stuff. This was based on work we’d done in Talis Aspire.
Several people suggested this should be written up as a fuller paper, so Nad, Jeni and I wrote it up as a paper for the SFSW 2009 workshop. It’s been accepted and will be published there, but unfortunately due to work priorities that have come up we won’t be able to attend.
A draft of the paper is here: A Pattern for Domain Specific Editing Interfaces Using Embedded RDFa and HTML Manipulation Tools.
The camera ready copy will be published in the conference proceedings. Feedback welcomed.
Coghead closes for business
With the announcement that Coghead, a really very smart app development platform, is closing its doors it’s worth thinking about how you can protect yourself from the inevitable disappearance of a service.
Of course, there are all the obvious business type due diligence activities like ensuring that the company has sufficient funds, understanding how
your subscription covers the cost (or doesn’t) of what you’re using and so on, but all these can do is make you feel more comfortable – they can’t provide real protection. To be protected you need 4 key things – if you have these 4 things you can, if necessary, move to hosting it yourself.
- URLs within your own domain.
- Regular exports of your data.
- Regular exports of your application.
- The code.
Both you and your customers will bookmark parts of the app, email links, embed links in documents, build excel spreadsheets that download the data and so on and so on. You need to control the DNS for the host that is running your tenancy in the SaaS service. Without this you have no way to redirect your customers if you need to run the software somewhere else.
This is, really, the most important thing. You can re-create the data and the content, you can even re-write the application if you have to, but if you lose all the links then you will simple disappear.
You may not get much notice of changes in a SaaS service. When you find they are having outages, going bust or simply disappear is not the time to work out how to get your data back out. Automate a regular export of your data so you know you can’t lose too much. Coghead allowed for that and are giving people time to get their data out.
Having invested a lot in working out the write processes, rules and flows to make best use of your app you want to be able to export that too. This needs to be exportable in a form that can be re-imported somewhere else. Coghead hasn’t allowed for this, meaning that Coghead customers will have to re-write their apps based on a human reading of the Coghead definitions. Which brings me on to my next point…
You want to be able to take the exact same code that was running SaaS and install it on your own servers, install the exported code and data and update your DNS. Without the code you simply can’t do that. Making the code open-source may be a problem as others could establish equivalent services very quickly, but the software industry has had ways to deal with this problem through escrow and licensing for several decades. The code in escrow would be my absolute minimum.
SaaS and PaaS (Platform as a Service) providers promote a business model based on economies of scale, lower cost of ownership, improved availability, support and community. These things are all true even if they meet the four needs above – but the priorities for these needs are with the customer, not with the provider. That’s because meeting these four needs makes the development of a SaaS product harder and it also makes it harder for any individual customer to get setup. We certainly don’t meet all four with our SaaS and PaaS offerings at work yet, but I am confident that we’ll get there – and we’re not closing our doors any time soon ;-)
Ruby Mock Web Server
I spent the afternoon today working with Sarndeep, our very smart automated test guy. He’s been working on extending what we can do with rspec to cover testing of some more interesting things.
Last week he and Elliot put together a great set of tests using MailTrap to confirm that we’re sending the right mails to the right addresses under the right conditions. Nice tests to have for a web app that generates email in a few cases.
This afternoon we were working on a mock web server. We use a lot of RESTful services in what we’re doing and being able to test our app for its handling of error conditions is important. We’ve had a static web server set up for a while, this has particular requests and responses configured in it, but we’ve not really liked it because the responses are all separate from the tests and the server is another apache vhost that has to be setup when you first checkout the app.
So, we’d decided a while ago that we wanted to put in a little Ruby based web server that we could control from within the rspec tests and that’s what we built a first cut of this afternoon.
require File.expand_path(File.dirname(__FILE__) + "/../Helper")
require 'rubygems'
require 'rack'
require 'thin'
class MockServer
def initialize()
@expectations = []
end
def register(env, response)
@expectations << [env, response]
end
def clear()
@expectations = []
end
def call(env)
#puts "starting call\n"
@expectations.each_with_index do |expectation,index|
expectationEnv = expectation[0]
response = expectation[1]
matched = false
#puts "index #{index} is #{expectationEnv} contains #{response}\n\n"
expectationEnv.each do |envKey, value|
puts "trying to match #{envKey}, #{value}\n"
matched = true
if value != env[envKey]
matched = false
break
end
end
if matched
@expectations.delete_at(index)
return response
end
end
#puts "ending call\n"
end
end
mockServer = MockServer.new()
mockServer.register( { 'REQUEST_METHOD' => 'GET' }, [ 200, { 'Content-Type' => 'text/plain', 'Content-Length' => '11' }, [ 'Hello World' ]])
mockServer.register( { 'REQUEST_METHOD' => 'GET' }, [ 200, { 'Content-Type' => 'text/plain', 'Content-Length' => '11' }, [ 'Hello Again' ]])
Rack::Handler::Thin.run(mockServer, :Port => 4000)
The MockServer implements the Rack interface so it can work within the Thin web server from inside the rspec tests. The expectations are registered with the MockServer and the first parameter is simply a hashtable in the same format as the Rack Environment. You only specify the entries that you care about, any that you don’t specify are not compared with the request. Expectations don’t have to occur in order (expect where the environment you give is ambiguous, in which case they match first in first matched).
As a first venture into writing more in Ruby than an rspec test I have to say I found it pretty sweet – There was only one issue with getting at array indices that tripped me up, but Ross helped me out with that and it was pretty quickly sorted.
Plans for this include putting in a verify() and making it thread safe so that multiple requests can come in parallel. Any other suggestions (including improvements on my non-idiomatic code) very gratefully received.
Resource Lists, Semantic Web, RDFa and Editing Stuff
Some of the work I’ve been doing over the past few months has been on a resource lists product that helps lecturers and students make best use of the educational material for their courses.
One of the problems we hoped to address really well was the editing of lists. Historically products that do this have been deemed cumbersome and difficult by academic staff who will often produce lists as simple documents in Word or the like.
We wanted to make an editing interface that really worked for the academic community so they could keep the lists as accurate and current as they wanted.
Chris Clarke, our Programme Manager, and Fiona Grieg, one of our pilot customers, describe the work in a W3C case study. Ivan Hermann then picks up on one of the way we decided to implement editing using RDFa within the HTML DOM. In the case study Chris describes it like this:
The interface to build or edit lists uses a WYSIWYG metaphor implemented in Javascript operating over RDFa markup, allowing the user to drag and drop resources and edit data quickly, without the need to round trip back to the server on completion of each operation. The user’s actions of moving, adding, grouping or editing resources directly manipulate the RDFa model within the page. When the user has finished editing, they hit a save button which serialises the RDFa model in the page into an RDF/XML model which is submitted back to the server. The server then performs a delta on the incoming model with that in the persistent store. Any changes identified are applied to the store, and the next view of the list will reflect the user’s updates.
This approach has several advantages. First, as Andrew says
One thing I hadn’t used until recently was RDFa. We’ve used it on one of the main admin pages in our new product and it’s made what was initially quite a complex problem much simpler to implement.
The problem that’s made simpler is this – WYSIWYG editing of the page was best done using DOM manipulation techniques, and most easily using existing libraries such as prototype. But what was being edited isn’t really the visual document, it is the underlying RDF model. Trying to keep a version of the model in a JS array or something in synch with the changes happening in the DOM seemed to be a difficult (and potentially bug-ridden) option.
By using RDFa we can distribute the model through the DOM and have the model updated by virtue of having updated the DOM itself. Andrew describes this process nicely:
Currently using Jeni Tennison’s RDFQuery library to parse an RDF model out of an XHTML+RDFa page we can mix this with our own code and end up with something that allows complex WYSIWYG editing on a reading list. We use RDFQuery to parse an initial model out of the page with JavaScript and then the user can start modifying the page in a WYSIWYG style. They can drag new sections onto the list, drag items from their library of bookmarked resources onto the list and re-order sections and items on the list. All this is done in the browser with just a few AJAX calls behind the scenes to pull in data for newly added items where required. At the end of the process, when the Save button is pressed, we can submit the ‘before’ and ‘after’ models to our back-end logic which builds a Changeset from before and after models and persists this to a data store on the Talis Platform.
Building a Changeset from the two RDF models makes quite a complex problem relatively straightforward. The complexity now just being in the WYSIWYG interface and the dynamic updating of the RDFa in the page as new items are added or re-arranged.
As Andrew describes, the editing starts by extracting a copy of the model. This allows the browser to maintain before and after models. This is useful as when the before and after get posted to the server the before can be used to spot if there have been editing conflicts with someone else doing a concurrent edit – this is an improvement to how Chris described it in the case study.
There are some gotchas in this approach though. Firstly, some of the nodes have two-way links:
<http://example.com/lists/foo> <http://purl.org/vocab/resourcelist/schema#contains> <http://example.com/items/bar>
<http://example.com/items/bar> <http://purl.org/vocab/resourcelist/schema#list> <http://example.com/lists/foo>
So that the relationship from the list to the item gets removed when the item is deleted from the DOM we use the @rev attribute. This allows us to put the relationship from the list to the item with the item, rather than with the list.
The second issue is that we use rdf:Seq to maintain the ordering of the lists, so when the order changes in the DOM we have to do a quick traversal of the DOM changing the sequence predicates (_1, _2 etc) to match the new visual order.
Neither of these were difficult problems to solve :-)
My thanks go out to Jeni Tennison who helped me get the initial prototype of this approach working while we were at Swig back in Novemeber.
Exploring OpenLibrary Part Two
This post also appears on the n2 blog.
More than two weeks on from my last look at the OpenLibrary authors data and I’m finally finding some time to look a bit deeper. Last time I finished off thinking about the complete list of distinct dates within the authors file and how to model those.
Where I’ve got to today is tagged as day 2 of OpenLibrary in the n2 subversion.
First off, a correction – foaf:Name should have been foaf:name. Thanks to Leigh for pointing that out. I haven’t fixed in this tag, tagged before I realised I’d forgotten it, but next time, honestly.
It’s clear that there is some stuff in the data that simply shouldn’t be there, things that cannot possibly be a birth date such [from old catalog] and *. and simply ,. When I came across —oOo— I was somewhat dismayed. MARC data, where most of this data has come from, has a long and illustrious history, but one of the mistakes made early on was to put display data into the records in the form of ISBD punctuation. This, combined with the real inflexibility of most ILSs and web-based catalogs has forced libraries to hack there records with junk like —oOo— to fix display errors. This one comes from Antonio Ignacio Margariti.
In total there are only 6,156 unique birth date datums and 4,936 unique death dates. Of course there is some overlap, so in total there’s only 9,566 datums to worry about overall.
So what I plan to do is to set up the recognisable patterns in code and discard anything I don’t recognise as a date or date range. Doing that may mean I lose some date information, but I can add that back in later as more patterns get spotted. So far I’ve found several patterns (shown here using regex notation)…
“^[0-9]{1,4}$” – A straightforward number of 4 digits or fewer, no letters, punctuation or whitespace. These are simple years, last week I popped them in using bio:date . That’s not strictly within the rules of the bio schema as that really requires a date formatted in accordance with ISO8601. Ian had already implied his dis-pleasure with my use of bio:date and suggested I use the more relaxed dc elements date. However, on further chatting what we actually have is a date range within which the event occurred, so we need to show that the event happened somewhere within a date range. This can be solved using the W3C Time Ontology which allows for better description.
I spent some time getting hung up on exactly what is being said by these date assertions on a bio:Birth event. That is, are we saying that the birth took place somewhere within that period, or that the event happened over that period. This may seem a daft question to ask, but as others start modelling events in peoples’ bios this could easily become indistinguishable. Say I want to model my grandfather’s experience of the second world war. I’d very likely model that as an event occurring over a four year period. So, I feel the need to distinguish between an event happening over a period and an event happening at an unknown time within a period. I thought I was getting too pedantic about this, but Ian assured me I’m not and that the distinction matters.
The model we end up with is like this
@prefix bio: <http://vocab.org/bio/0.1/> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix mine: <http://example.com/mine/schema#> .
@prefix time: <http://www.w3.org/TR/owl-time/> .
<http://example.com/a/OL149323A>
foaf:Name "Schaller, Heinrich";
foaf:primaryTopicOf <http://openlibrary.org/a/OL149323A>;
bio:event <http://example.com/a/OL149323A#birth>;
a foaf:Person .
<http://example.com/a/OL149323A#birth>
dc:date <http://example.com/a/OL149323A#birthDate>;
a bio:Birth .
<http://example.com/names/schallerheinrich>
mine:name_of <http://example.com/a/OL149323A>;
a mine:Name .
<http://example.com/dates/gregorian/ad/years/1900>
time:unitType time:unitYear;
time:year "1900";
a time:DateTimeDescription .
<http://example.com/a/OL149323A#birthDate>
time:inDateTime <http://example.com/dates/gregorian/ad/years/1900>;
a time:Instant .
The simple year accounts for 731,304 of the 748,291 birth dates and for 13,151 of the 181,696 death dates, about 80% of the dates overall. Following the 80/20 rule almost perfectly, the remaining 20% is going to be painful. It has been suggested I should stop here, but it seems a shame to not have access to the rest if we can dig in, and I can, so…
First of the remaining correct entries are the approximate years, recorded as ca. 1753 or (ca.) 1753 and other variants of that. These all suffer from leading and trailing junk, but I’ll catch the clean ones of these with “^[(]?ca\.[)]? ([0-9]{1,4})$”. The difficulty with these is that you can’t really convert these into a single year or even a date range as what people consider as within the “circa” will vary widely in different contexts. So, the interval can be described in the same way as a simple year, but the relationship with the authors birth is not simply time:inDateTime. I haven’t found a sensible circa predicate, so for now I’ll drop into mine.
@prefix bio: <http://vocab.org/bio/0.1/> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix mine: <http://example.com/mine/schema#> .
@prefix time: <http://www.w3.org/TR/owl-time/> .
<http://example.com/a/OL151554A>
foaf:Name "Altdorfer, Albrecht";
foaf:primaryTopicOf <http://openlibrary.org/a/OL151554A>;
bio:event <http://example.com/a/OL151554A#birth>;
bio:event <http://example.com/a/OL151554A#death>;
a foaf:Person .
<http://example.com/a/OL151554A#birth>
dc:date <http://example.com/a/OL151554A#birthDate>;
a bio:Birth .
<http://example.com/a/OL151554A#death>
dc:date <http://example.com/a/OL151554A#deathDate>;
a bio:Death .
<http://example.com/names/altdorferalbrecht>
mine:name_of <http://example.com/a/OL151554A>;
a mine:Name .
<http://example.com/dates/gregorian/ad/years/1480>
time:unitType time:unitYear;
time:year "1480";
a time:DateTimeDescription .
<http://example.com/a/OL151554A#birthDate>
mine:circaDateTime <http://example.com/dates/gregorian/ad/years/1480>;
a time:Instant .
Ok, it’s time to stop there until next time. I have several remaining forms to look at and some issues of data cleanup.
Next time I’ll be looking at parsing out date ranges of a few years, shown in the data 1103 or 4. These will go in as longer date time descriptions so no new modelling needed.
Then we have centuries, 7th cent., again just a broader date time description required I hope. There are some entries for works from before the birth of Christ – 127 B.C.. I’ll have to take a look at how those get described. Then we have entries starting with an l like l854. I had thought that these may indicate a different calendaring system, but it appear not. Perhaps it’s bad OCRing as there are also entries like l8l4. Not sure what to do with those just yet.
In terms of data cleanup, there are dates in the birth_date field of the form d. 1823 which means that it’s actually a death date. There are also dates prefixed with fl. which means they are flourishing dates. These are used when a birth date is unknown but the period in which the creator was active is known. These need to be pulled out and handled separately.
Of course, I haven’t dealt with the leading and trailing punctuation yet or those that have names mixed in with the dates, so still much work to do in transforming this into a rich graph.
Search
What I'm Doing...
- @JingyeL I don't know yet, someone from here will be. in reply to JingyeL 2 days ago
- @JingyeL are you going to ISWC 2010 in November, Shanghai ? 2 days ago
- Back from work to find 666 unread emails waiting for me — must be a sign... 2 days ago
- More updates...
Recent Comments
- Puia on ISBN 10/13 Converter in Excel
- tee on Fixing a plasma TV
- computer doctor on Fixing a plasma TV
- Mars on left wondering…
- neeli on Pranav Mistry: The thrilling potential of SixthSense technology | Video on TED.com
- infopeep on You’re not the one and only…
- talisians on You’re not the one and only…
- olyerickson on Semtech 2010, San Francisco
- PaulMiller on Semtech 2010, San Francisco
- ldodds on Semtech 2010, San Francisco
Categories
- .Net Technical (8)
- Blog on Blog (6)
- commands I have issued (10)
- Enterprise Architecture (19)
- event (4)
- Fiction Book Review (2)
- Food (2)
- Intellectual Property (9)
- Interaction Design (27)
- Internet Social Impact (43)
- Internet Technical (16)
- IP Law (10)
- Library Tech (19)
- Linked Data (1)
- Music (2)
- New Toy (4)
- Non-Fiction Book Review (7)
- Ontologies (6)
- Open Data (7)
- Other Technical (20)
- Personal (36)
- Random Thought (16)
- Resourcing (4)
- Review (1)
- Security And Privacy (11)
- Semantic Web (32)
- Software Business (11)
- Software Engineering (37)
- Talis Technical (9)
- Uncategorized (44)
- Working at Talis (26)
- [grid::blogpaper] (8)
- [grid::fatherhood] (4)
Archives
- July 2010 (1)
- June 2010 (2)
- February 2010 (1)
- January 2010 (4)
- November 2009 (10)
- October 2009 (4)
- September 2009 (2)
- August 2009 (9)
- July 2009 (12)
- June 2009 (5)
- May 2009 (6)
- April 2009 (7)
- March 2009 (3)
- February 2009 (6)
- January 2009 (10)
- December 2008 (4)
- November 2008 (4)
- October 2008 (9)
- September 2008 (23)
- August 2008 (8)
- July 2008 (1)
- June 2008 (1)
- May 2008 (6)
- April 2008 (14)
- March 2008 (3)
- January 2008 (5)
- December 2007 (6)
- November 2007 (13)
- October 2007 (9)
- July 2007 (2)
- June 2007 (1)
- May 2007 (10)
- April 2007 (5)
- March 2007 (11)
- February 2007 (10)
- January 2007 (13)
- December 2006 (8)
- November 2006 (8)
- September 2006 (2)
- August 2006 (1)
- June 2006 (2)
- February 2006 (2)
- January 2006 (3)
- December 2005 (3)
- November 2005 (2)
- September 2005 (2)
- August 2005 (5)
- July 2005 (8)
- June 2005 (3)
- May 2005 (2)
- February 2005 (1)
- January 2005 (4)
- December 2004 (3)
- November 2004 (6)
- October 2004 (2)
- September 2004 (2)
- August 2004 (5)
- July 2004 (1)
- June 2004 (4)
- May 2004 (4)
- April 2004 (3)
- March 2004 (13)
- February 2004 (6)
- December 2003 (3)
- November 2003 (1)
- August 2003 (2)
- July 2003 (1)
- June 2003 (2)
- May 2003 (1)
- March 2003 (1)
- January 2003 (1)
- October 2002 (1)
- May 2002 (1)
- March 2002 (1)
- August 2001 (1)
- May 2001 (1)
- April 2001 (1)
- January 2001 (1)
- December 2000 (1)
- November 2000 (1)
- December 1999 (1)
- November 1999 (1)
- July 1999 (1)