Conversation with Bruce D’Arcus on Motivation for MODS Ontology « Musings

The problem from my standpoint is that MODS has some really odd, library-specific, design choices that I don’t think map very well to the wider world. A central concept like mods:name, with mods:role as a child of that, really makes no sense, and conflicts with more common modeling you see in DC, FRBR, etc.

Its semantics are also really loose.

So you have to ask yourself, just how linked could a MODS view in RDF really be?

from Conversation with Bruce D’Arcus on Motivation for MODS Ontology.

Excel RDF

Introduction

When worlds collide sometimes things happen that can be useful. This is not one of those useful things, but a collision of two worlds nonetheless…

ExcelRDF is a proposed serialisation for RDF using the Microsoft Excel Spreadsheet format. This work was inspired by the discussions in the semantic web community about Linked Data and whether or not it mandates the use of RDF. This document is not trying to prove a point, insult anyone or come down on either side of the argument. I just noticed that it hadn’t been done and it didn’t seem too difficult. Of course, that it hadn’t been done should have been enough of a warning to me that it is not, in any sense, desirable.

Conventions used in this document

The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in RFC 2119.

Overview

When a server receives an HTTP or HTTPS request for a resource that is described in RDF and the client indicates that it is willing to accept content of type application/vnd.ms-excel the server MAY respond with a Microsoft Excel spreadsheet meeting the following conventions.

  • The spreadsheet MUST contain one or more sheets that meet the following conventions.
  • A sheet SHOULD contain zero or more rows of RDF data.
  • Column A of each non-empty row of a sheet SHOULD contain a URI indicating the Subject of a statement.
  • Column B of each non-empty row of a sheet SHOULD contain a URI indicating the Property of a statement.
  • Column C MAY contain either a URI or a literal value as the Object of a statement.
  • If Column C contains a literal value then Column D MAY contain a language identifier in accordance with IETF BCP 47.
  • If Column C contains a literal value then Column E MAY contain a type specifier indicating the type of the literal value.
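The column conventions above can be sketched as a small Python converter that turns sheet rows into N-Triples lines. This is only an illustration under stated assumptions: rows are assumed to have already been read out of the spreadsheet as plain lists of cell values, the output format (N-Triples) is my choice rather than part of the proposal, and because the conventions do not say how to tell a URI from a literal in Column C, the sketch guesses that anything starting with http:// or https:// is a URI. The function names are hypothetical.

```python
def escape_literal(value: str) -> str:
    """Escape the characters that N-Triples requires to be escaped in literals."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def looks_like_uri(value: str) -> bool:
    # Assumption: the conventions give no rule for this, so guess from the scheme.
    return value.startswith(("http://", "https://"))


def row_to_ntriple(row):
    """Convert one row (columns A-E) to an N-Triples line, or None if incomplete."""
    cells = ["" if cell is None else str(cell) for cell in row]
    cells += [""] * (5 - len(cells))          # pad short rows out to column E
    subject, prop, obj, lang, datatype = cells[:5]
    if not (subject and prop and obj):
        return None                           # empty or incomplete row: skip it
    if looks_like_uri(obj):
        obj_term = f"<{obj}>"
    else:
        obj_term = f'"{escape_literal(obj)}"'
        if lang:                              # Column D: BCP 47 language tag
            obj_term += f"@{lang}"
        elif datatype:                        # Column E: datatype URI (assumed to be a URI)
            obj_term += f"^^<{datatype}>"
    return f"<{subject}> <{prop}> {obj_term} ."


def sheet_to_ntriples(rows):
    """Convert every usable row of a sheet, dropping rows that yield nothing."""
    return [line for line in map(row_to_ntriple, rows) if line is not None]
```

When a row carries both a language tag and a type specifier, the sketch keeps only the language tag, since an RDF literal cannot have both; the conventions themselves do not say which should win.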

Example

The attached example Microsoft Excel spreadsheet contains RDF from the dbpedia project describing Annette Island Airport using the conventions described above.

ExcelRDF Example File

Use Cases

ExcelRDF may be useful where it is desirable to produce charts showing characteristics of a dataset, such as the relative distribution of types within a dataset or the count of particular properties. I can think of no obvious way to assess graph characteristics such as linkiness, but you could do things like word counts in literals, or working out how much of the literal data is in French.
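The kind of counting described here would normally be done with formulas or a pivot table inside Excel, but the same idea can be sketched in Python over rows laid out as columns A to E. The function names are hypothetical, and the test for "is this a literal?" is an assumption (anything in Column C that does not start with http:// or https://).

```python
from collections import Counter


def property_counts(rows):
    """Count how often each property (Column B) appears."""
    return Counter(row[1] for row in rows if len(row) > 1 and row[1])


def is_literal(value) -> bool:
    # Assumption: values that are not http(s) URIs are treated as literals.
    return not str(value).startswith(("http://", "https://"))


def language_share(rows, lang="fr"):
    """Fraction of literal objects (Column C) tagged with `lang` in Column D."""
    literals = [row for row in rows
                if len(row) > 2 and row[2] not in (None, "") and is_literal(row[2])]
    if not literals:
        return 0.0
    tagged = sum(1 for row in literals if len(row) > 3 and row[3] == lang)
    return tagged / len(literals)
```

Calling language_share(rows, "fr") on a sheet's rows gives the proportion of literal values marked as French, which could then be charted alongside the property counts.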

ExcelRDF may be useful where a specific contract, policy or agreement means that the data must be delivered as an Excel spreadsheet while the underlying data is more useful in RDF.

ExcelRDF may be useful if you wish to be deliberately obtuse.

What else? « Web of Data

great explanation from Dan Brickley:

The non-RDF bits of the data Web are – roughly – going to be the leaves on the tree. The bit that links it all together will be, as you say, the typed links, loose structuring and so on that come with RDF. This is also roughly analogous to the HTML Web: you find JPEGs, WAVs, flash files and so on linked in from the HTML Web, but the thing that hangs it all together isn’t flash or audio files, it’s the linky extensible format: HTML. For data, we’ll see more RDF than HTML (or RDFa bridging the two). But we needn’t panic if people put non-RDF data up online…. it’s still better than nothing. And as the LOD scene has shown, it can often easily be processed and republished by others. People worry too much! :)

from What else? « Web of Data.

Paul Miller is right… and so is Ian Davis

Paul Miller, a good friend and ex-colleague, has been having a tough time arguing that perhaps Linked Data doesn’t need RDF. Don’t misunderstand that, he thinks RDF is a Good Thing and Best Practice for Linked Data. But he thinks a dogmatic stance is unhelpful.

The problem, I contend, comes when well-meaning and knowledgeable advocates of both Linked Data and RDF conflate the two and infer, imply or assert that ‘Linked Data’ can only be Linked Data if expressed in RDF.

This dogmatism makes me deeply uncomfortable, and I find myself unable to agree with the underlying premise.

In the twitter stream that Paul links to there is some comment reminding people that RDF can take many forms, not just RDF/XML.

kidehen: @andypowe11 re. #rdf, it’s the data model for #linkeddata based #metadata. Remember #rdf != RDF/XML, no escaping RDF model re. #linkeddata.

Ian Davis (my boss) took a strong stance saying that if things weren’t RDF then they weren’t linked data. Perhaps the very thing Paul sees as a dogmatic stance. Ironic as Ian is far from dogmatic. But Ian is defending the term Linked Data, not saying that’s the only way to publish data on the web…

TallTed: @iand “I think LD better for many cases, but there are times i’d rather hv a spreadsheet.” What? Can a spreadsheet not hold #LinkedData?

Well, it seems to me both Paul and Ian are right to a strong degree and are essentially arguing over only one thing – the meaning of the term Linked Data.

Paul quotes Tim Berners-Lee’s design note on Linked Data:

1. Use URIs as names for things

2. Use HTTP URIs so that people can look up those names

3. When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL)

4. Include links to other URIs. so that they can discover more things.

The emphasis is Paul’s. I would emphasise a different point:

4. Include links to other URIs. so that they can discover more things.

And in point four lies the reason that Ian is saying a spreadsheet isn’t Linked Data, even if it’s on the web and even if it’s linked to. The only standard for describing how one resource relates to others using URIs is RDF. Sure, you can put URIs into a spreadsheet, but there is no standard interpretation of what the sheets, rows and columns mean. Sure, you can put URIs into a CSV file, but again, there is no standard interpretation of what the fields mean.

The end result of that is data published on the web that can be linked to but not from.

At this early time, though, Paul argues that what we really want is to get more and more data published and open. We all agree on that, I know. Ian does for sure, he runs Data Incubator for exactly that reason – well, that and helping show those publishing spreadsheets and CSV why they should move to RDF and Linked Data.

In the comments on Paul’s post Justin (another senior manager at Talis) says:

Yes the same mistake was made with the rise of the web.

Once you had URIs and HTTP you already had plain text which is a perfectly good way to encode content. By adopting the STANDARD convention of HTML, all sort of existing text based formats with their various mark ups were locked out. That locked out a lot of content that already existed and required anyone who wanted to play to convert existing content into a html format.

Of course it did have the small side effect that to consume web content you only needed a browser that understood one convention i.e. html.

The same is true of RDF. XML is the equivalent of ascii in this regard.

And that’s the point. XML is the equivalent of ASCII, as is a spreadsheet or a CSV file, not because they’re simple, but because they have no mechanism for embedding the relationships and links necessary to link out from your data. Yes, they can contain URIs and clients can decide to make those into links, but there is no way to describe the meaning.

I agree with both sides of this argument – If it isn’t RDF then it isn’t Linked Data, but I wouldn’t keep pushing that point if someone was willing to publish data yet unable or unwilling to publish RDF (in any of its many forms).

Why hash tags are broken, and ideas for what to do instead.

I was at Moseley Bar Camp last Sunday and there were some great sessions. Andy Mabbett stood up to lead a discussion entitled Let’s Play Tag: recent developments and emerging issues in the use of tagging for added semantic richness.

Andy was looking for discussion on how to solve the problem of ambiguity in hash tags – a popular technique for categorising community tweets on twitter. His example is classic event tagging: the tag for the event was #mbcamp, which works fine for the duration of a Sunday afternoon event, but what if you want tags to be more enduring?

Andy took us step-by-step through the issue of ambiguity of usernames as tags on twitter and flickr and described some of the issues of differing tag normalisation rules.

Andy also asked why we tag:

  • To add semantic richness?
  • To help your friends find stuff?
  • To help machines in 100 years find stuff?
  • Don’t know

I tag for all of those reasons, but not on twitter. On twitter I use hash tags to contribute to an in-the-moment conversation that’s happening at a particular event or on something topical.

Andy’s issue, then, is with the value of these tags longer term and on more enduring stuff like blog posts, photos on flickr and so on. Perhaps 100 years might be pushing it, but it’s worth thinking about.

The problem with hash tags comes from the tension between finding something specific enough for the moment, something short enough to not use up too many of the 140 characters and something easy to remember. That’s two forces pulling one way (shorter) and only one pulling the other.

The shorter the tag goes the easier it is to remember and to type, and the fewer characters it uses up, but it also becomes more likely to clash with others. Perhaps some mainstream trends might get away with very short tags, I thought. #fb for example means facebook, surely, but looking at the use of it apparently the references to facebook are far outweighed by the noise.

So, twitter’s 140 character limit and the profusion of clients means we can only have short, easy to remember text tags, but the need for disambiguation and to be more specific means we need something longer.

We could solve the ambiguity problem by using something like a guid, but that’s not easy to remember or type, and is generally quite long. The length issue could be solved by encoding it using unicode characters. Twitter counts multi-byte UTF8 characters as single characters, which is correct, and this opens up some interesting unique tags for those willing to forego the easy typing.

By long I mean cf629dc3-d425-4707-8119-1f35d35d7687 which is a fairly typical GUID and is 36 characters long. That’s too long if you only have 140 characters to play with. The length comes from the need to encode it as ASCII. Twitter, where our length obsession comes from, doesn’t require characters to be ASCII. The 140 character limit is for 140 UTF8 characters, so we can use a much greater range of characters to represent the same degree of uniqueness in a shorter UTF8 string.

UTF8 isn’t ideal as a starting point, though, as the number of bytes per character varies. Unicode’s Basic Multilingual Plane uses nice simple 16-bit code points, so we map each group of 4 hex characters from the GUID to one unicode character, then use the UTF8 encoding of those characters to write it down. That turns the 32 hex digits of a GUID into just 8 characters.

cf62 콢, 9dc3 鷃, d425 퐥, 4707 䜇, 8119 脙, 1f35 ἵ, d35d 퍝, 7687 皇

This gives us a tag of #콢鷃퐥䜇脙ἵ퍝皇 which is not easy to type, would be difficult for many to visually identify and could, for all I know, be extremely offensive to those who read CJK, Hangul or Greek. I may have got lucky with that GUID too, there may be GUIDs that don’t produce valid unicode pairs.
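The mapping above can be sketched in Python as follows. This is a minimal illustration, not a recommendation: the function names are hypothetical, and the only invalid case it checks for is the surrogate range U+D800 to U+DFFF, which cannot stand alone as characters. Other awkward results, such as control characters or unassigned code points, are not handled.

```python
def guid_to_tag(guid: str) -> str:
    """Pack a GUID's 32 hex digits into 8 unicode characters, 4 digits each."""
    hex_digits = guid.replace("-", "")
    if len(hex_digits) != 32:
        raise ValueError("expected a GUID with 32 hexadecimal digits")
    chars = []
    for i in range(0, 32, 4):
        code_point = int(hex_digits[i:i + 4], 16)   # e.g. "cf62" -> 0xCF62
        if 0xD800 <= code_point <= 0xDFFF:
            # Lone surrogates are not valid characters and cannot be UTF-8 encoded.
            raise ValueError(f"{hex_digits[i:i + 4]} falls in the surrogate range")
        chars.append(chr(code_point))
    return "#" + "".join(chars)


def tag_to_guid(tag: str) -> str:
    """Reverse the packing: 8 characters back to the usual hyphenated GUID."""
    hex_digits = "".join(f"{ord(ch):04x}" for ch in tag.lstrip("#"))
    parts = (hex_digits[0:8], hex_digits[8:12], hex_digits[12:16],
             hex_digits[16:20], hex_digits[20:32])
    return "-".join(parts)
```

Calling guid_to_tag("cf629dc3-d425-4707-8119-1f35d35d7687") reproduces the tag shown above, and tag_to_guid turns it back into the original GUID. The tag is 8 characters but 24 bytes in UTF8, since each of these characters takes 3 bytes.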

But, as it’s a GUID it gives a very high confidence that it is unique, it’s only 8 characters long and works as a unicode tag on Flickr and a unicode tag on Hashtags. Just don’t look at the raw URLs in the source of the page…

What we lose with that approach is a good deal of ease-of-use. I certainly wouldn’t try this technique at an event.

If you’re prepared to lose a little usability, maybe giving people an easy place to grab a copy/paste version of the tag then you could produce something more easily readable, if not easy to type: #dɯɐɔqɯ for example. I might be tempted to do that, or add a graphic symbol or something.

There’s something else that nags at me about hash tags, though. They’re really not very webby. You rely on search and on hashtags.org and other specific tools to make sense of them. They can be easily abused, as Habitat showed recently.

So are there other ways to think about tagging? Ways that work with the web rather than just on the web. Examples from those applications where the 140 character limit does not apply? Blog posts, web pages, flickr images and so on?

What if we decided that our requirements for tagging were:

  1. A very high degree of uniqueness
  2. Anyone can get information about the tag easily
  3. Spam and content visible on the tag controlled by the tag owner
  4. That the tag can be enduring
  5. That the tag can be used anywhere on the web easily
  6. That content using the tag can be found with search
  7. That content using the tag can be found without search
  8. That no particular service or piece of software is necessary

In its essence, tagging is about saying this comment, blog post or image is about this event, concept, product etc. In the blogging world it’s very common to say this post is about the content in this other post. We do that through trackbacks and through simple links. Many blogs accept trackbacks and look at the referring page information so that they can provide links, alongside comments, to other posts referring to them.

A similar thing happens with Google’s PageRank algorithm. Words used in links to a page, as well as the content of the referring page, contribute to the way a page is indexed.

The Semantic Web bases everything on URIs (the difference between URI and URL is not important here). If you want to give something a name you don’t pick a word, you use a URI.

I wonder if we could use URIs as tags? And how that would meet the needs above. Say we were to use http://wxwm.org.uk/moseleybarcamp/2009/June to mean the event that happened last weekend.

It has a very high degree of uniqueness, so it meets our first requirement. It can be put straight into a browser and can provide a page giving details of the event, so it’s easy for anyone to get information about the tag. The page at that address can be as clever, or as dumb, as it likes about showing things that link to it – so tag spam can be removed. The link is under control of the domain owner, so can be as enduring as you want to make it. Almost everywhere on the web allows you to post links, so it’s easy to use. Links to a specific URL can be easily searched for in Google and other search engines, and in Flickr and Twitter. Most browsers will send referring page information when requesting the URL, so content can be tracked without search – this means you can find out about unindexed and intranet sites referencing the tag. The URL can be a static page, or a script, it can monitor referrers and spam filter – or not. No centralised service is needed, nor any specific software.
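As a rough sketch of the "script that monitors referrers and spam filters" idea, here is what a tag page might look like using only Python's standard library. Everything named here is hypothetical: the blocked host list, the in-memory counter and the port are placeholders, and a real page would need persistent storage and rather better spam filtering.

```python
from collections import Counter
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse

BLOCKED_HOSTS = {"spam.example"}   # hypothetical spam list kept by the tag owner
referrers = Counter()              # in-memory only; lost when the script stops


def record_referrer(referer, counts=referrers, blocked=BLOCKED_HOSTS):
    """Count a referring page unless it is missing, malformed or blocked."""
    if not referer:
        return False
    host = urlparse(referer).hostname
    if host is None or host in blocked:
        return False
    counts[referer] += 1
    return True


class TagPageHandler(BaseHTTPRequestHandler):
    """Serves the tag URI and lists the pages that have linked to it."""

    def do_GET(self):
        # Browsers usually send the linking page in the Referer header.
        record_referrer(self.headers.get("Referer"))
        lines = [f"{count} {url}" for url, count in referrers.most_common()]
        body = "\n".join(lines).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Run the tag page locally; call this yourself to try it out."""
    HTTPServer(("localhost", port), TagPageHandler).serve_forever()
```

Calling serve() and following a link to http://localhost:8000/ from another page would add that page to the list, while requests with no referrer or from a blocked host are served but not recorded.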

Oh, and it could easily be made to work as Linked Data, the pattern for publishing data on the semantic web, to provide machine-readable information about the event and the conversation happening around it…

I think that only leaves the issue of URI length. I can’t get close to the 8 characters of the guid, or the 6 of mbcamp, but using bit.ly I can make a memorable short URL such as http://bit.ly/utf8tag that redirects to a much longer one, and as bit.ly don’t re-use URLs the bit.ly link remains as unique and almost as enduring (subject to bit.ly’s survival) as your own.

Putting Government Data online – Design Issues

Government data is being put online to increase accountability, contribute valuable information about the world, and to enable government, the country, and the world to function more efficiently. All of these purposes are served by putting the information on the Web as Linked Data. Start with the “low-hanging fruit”. Whatever else, the raw data should be made available as soon as possible. Preferably, it should be put up as Linked Data. As a third priority, it should be linked to other sources. As a lower priority, nice user interfaces should be made to it, if interested communities outside government have not already done it. The Linked Data technology, unlike any other technology, allows any data communication to be composed of many mixed vocabularies. Each vocabulary is from a community, be it international, national, state or local; or specific to an industry sector. This optimizes the usual trade-off between the expense and difficulty of getting wide agreement, and the practicality of working in a smaller community. Effort toward interoperability can be spent where most needed, making the evolution with time smoother and more productive.

from Tim Berners-Lee Putting Government Data online – Design Issues.

Sir Tim Berners-Lee to advise the Government on public information delivery – PublicTechnology.net

From: Sir Tim Berners-Lee to advise the Government on public information delivery – PublicTechnology.net

The Prime Minister has announced the appointment of the man credited with inventing the World Wide Web, Sir Tim Berners-Lee as expert adviser on public information delivery. The announcement was part of a statement on constitutional reform made in the House of Commons this afternoon.

Sir Tim Berners-Lee, who is currently director of the World Wide Web Consortium, which oversees the web’s continued development, will head a panel of experts who will advise the Minister for the Cabinet Office on how government can best use the internet to make non-personal public data as widely available as possible.

He will oversee the work to create a single online point of access for government held public data and develop proposals to extend access to data from the wider public sector, including selecting and implementing common standards. He will also help drive the use of the internet to improve government consultation processes.

TimBL talked about this at TED2009.

This is fantastic news, of course. Ambitious timescales, following the lead of the Obama administration, opening up government data for re-use as well as public oversight. All very good things.

The technical challenges in doing this will be very interesting. First off, the service will undoubtedly be Linked Data – the pattern of the Semantic Web or Web of Data. TimBL has been describing the efforts of the Linked Open Data community as “the web done right” for some time now. Linked data is also the approach taken by the US administration and is really starting to gather pace just like the early days of the document web. That will be interesting to see as it’s a different discipline to developing a basic html site with a different set of balances and trade-offs in the data modeling, granularity, URI design and so on.

Second up will be scaling to meet the traffic demand. As both a high profile linked data service and UK government data it will be highly in demand from day one. Coping with peak traffic loads is not technically difficult as long as someone has their eye on that ball from the start. It’s likely that demand for this data will be global, at least from those exploring what has been published, so traffic could get very high indeed. One of the aspects that might make this easier is that it will almost certainly be read-only for the foreseeable future, and that allows far more flexibility (and simplicity) in the approach to scaling.

Talking of it being read-only… Being a high profile data-source there will need to be a focus on securing it, not to prevent access, but to prevent unauthorised changes. Given the current atmosphere surrounding MPs’ expense claims and the level of voting in the recent European parliament elections it seems obvious that this will be a target for disgruntled and technically adept individuals both here and abroad. The read-only nature of the service helps make this easier, as does the linked data approach as that is the same in many security respects as the web of documents we have today – that is, securing it is well understood.

Definitely a project to watch closely.

[Disclosure - I work for Talis, a software company that offers a semantic web platform for doing this kind of publishing]

Scripting and Development for the Semantic Web (SFSW2009)

The following papers have been accepted for SFSW2009:

from Scripting and Development for the Semantic Web (SFSW2009).

Looks like a great line-up. As neither Nad, Jeni nor I are able to attend, our paper will be presented (briefly) by Chris Clarke.