Government Data, Openness and Making Money

Friday, November 20th, 2009 | Internet Social Impact, Open Data, Semantic Web | 2 Comments

Over on the UK Government Data Developers group there’s been a great discussion about openness, innovation and how Government makes money from its data; and, of course, whether it should make money from it at all. I can’t link to the discussion as the group is closed – sign up, it’s a great group.

Philosophically there’s always the stance that Government data has already been paid for by the public through general taxation.

Tim Berners-Lee even says so in his guest piece for Times Online.

This is data that has already been collected and paid for by the taxpayer, and the internet allows it to be distributed much more cheaply than before. Governments can unlock its value by simply letting people use it.

While that’s true, the role of Government is to maximise the return we get on our taxes, so if more money can be made from the assets we have then surely we should make it.

This is where discussion breaks off into various arguments as to where on the spectrum licensing of Government data should sit, and how open to re-use it should be.

The discussion covers notions of Copyleft licensing, attribution, commercial and non-commercial use as well as models of innovation.

What I always come back to is the notion that to make money you have to have something that is not “open”, a scarce resource. I have a blog post talking about that in the context of software and the web that’s been drafted but not finished for some time, so I’m coming at this from a point of existing thinking.

To make money something has to be closed.

In the case of creative works, the thing that is closed is the right to produce copies (closed through Copyright law). An author makes money by selling that right (or a limited subset of it) to a publisher who makes money from exploiting the right to copy. The publisher has exclusivity.

In the case of open source software companies the dominant model is support and consultancy. They make money by exploiting the specialist knowledge they have in their heads – a careful balance exists for companies doing this between making the product great and needing the support revenue. This balance leads to other monetization strategies, like using the closed nature of being the only place to go for that software to sell default slots in the software (think search boxes), or advertising.

In the case of closed-source commercial software it is the code, the product itself, that remains closed.

Commercial organisations with data assets have to keep the data closed in order to make money. The Government, however, does not. The Government can give the data away for free because it has something else that is closed – the UK economy. To be a part of the (legitimate) UK economy you have to pay taxes, giving the UK a 20% to 40% share of all profits.

If people find ways to make money using Government data those taxes dwarf any potential licensing fee – can you imagine a commercial data provider asking for up to 40% of a company’s profit as the cost of a data license?
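To put rough numbers on that comparison, here is a minimal sketch. Only the 20% and 40% rates come from the argument above; the profit figure and the licence fee are invented purely for illustration and are not real data.

```python
def tax_take(profit_pounds: int, rate_percent: int) -> int:
    """Return the tax collected on a profit, in whole pounds."""
    return profit_pounds * rate_percent // 100


# Hypothetical figures, chosen only to make the arithmetic visible.
HYPOTHETICAL_PROFIT = 1_000_000      # profit a company makes using the data
HYPOTHETICAL_LICENCE_FEE = 50_000    # what a data licence might have cost

low = tax_take(HYPOTHETICAL_PROFIT, 20)   # lower rate quoted above
high = tax_take(HYPOTHETICAL_PROFIT, 40)  # upper rate quoted above

print(f"Tax at 20%: £{low:,}")
print(f"Tax at 40%: £{high:,}")
print(f"Licence fee: £{HYPOTHETICAL_LICENCE_FEE:,}")
```

The point survives whatever numbers you plug in: the tax take scales with the profit, whereas a flat licence fee does not.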

This is why it makes sense for the Government to make data available with as few restrictions as possible – ultimately that means Public Domain.

That seems to be the direction the mailing list is heading thanks to some great contributors. If open data, government data and innovation interest you then sign up and join in.

Schneier on Security: A Taxonomy of Social Networking Data

A Taxonomy of Social Networking Data

At the Internet Governance Forum in Sharm El Sheikh this week, there was a conversation on social networking data. Someone made the point that there are several different types of data, and it would be useful to separate them. This is my taxonomy of social networking data.

from Schneier on Security: A Taxonomy of Social Networking Data.

Follow the link for a useful breakdown of data in any community site or service.

Bringing FRBR Down to Earth…

Wednesday, November 11th, 2009 | Uncategorized | 30 Comments

I’ve been looking at FRBR for some time. I’ve written about it and spoken about it. Overall I’ve found it difficult to work with and not really useful in solving the problems of resource discovery.

One of the recurring themes I see when looking at library data in 2009 is that it is centred far too often on the record – a MARC21 record usually. This record-centric view of the world pervades much of what is possible, but often it even restricts our very thinking about what might be possible. We are constrained.

I’ve also seen many conversations about FRBR go along a similar route, discussing what exactly classifies as a work or an expression. Is the movie of the book a new work or just a different expression? The answer never being the same. According to Karen Coyle (who has taught me so much about library data) the abstract concept of Work has reached the point of being a fluid and malleable set of all the things that claim to be part of the work. Reading that I got really confused. Then, a few weeks ago, reading through several mailing lists and some more old blog posts, it hit me. The answer was right there in the discussion.

Nobody talks about works, expressions and manifestations, so why describe our data that way?

We talk about books and the stories they tell, we talk about how West Side Story is a re-telling of Romeo and Juliet. We talk about DVDs, Blu-Ray Discs and VHS Videos (OK, not so much anymore) and the movies they contain and we talk about the stories the movies tell.

Let’s look at an example and try to reconcile what we see with FRBR.

In FRBR speak (which is probably a squeaky, slightly digital noise) we would say that Wuthering Heights is a Work produced by Emily Bronte. We might have a copy of it in our hands, maybe the Penguin Classics edition (978-0141439556). We’d call the thing in our hands an Item. Then in-between Work and Item we have two levels of abstractness, the Expression which would be the story as written down in English (nobody’s quite sure where translations fit) and the Manifestation which would be that particular paperback version from Penguin.

If we add in the terms for the relationships it gets rather prosaic.

Wuthering Heights is a work by Emily Bronte, realized in a written expression of the same name. The written expression is embodied in several different manifestations each of which is exemplified by many items, one of which I hold in my hand.

I’m being deliberately extreme, I know. Comment below if you think I’m being too harsh or if you understand the FRBR/WEMI model differently.

Here it is in diagrammatic form:

FRBR 01
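For anyone who prefers code to diagrams, the same chain can be sketched as four Python classes. This is only an illustration of the description above, not an official FRBR binding; the field names (`realizes`, `embodies`, `exemplifies`, `holder`) are my own choices, pointing from the concrete end back up to the abstract one.

```python
from dataclasses import dataclass


@dataclass
class Work:
    title: str
    creator: str


@dataclass
class Expression:
    realizes: Work
    form: str        # e.g. "text"
    language: str


@dataclass
class Manifestation:
    embodies: Expression
    publisher: str
    isbn: str


@dataclass
class Item:
    exemplifies: Manifestation
    holder: str      # hypothetical field: who has this copy


work = Work("Wuthering Heights", "Emily Bronte")
expression = Expression(work, form="text", language="English")
manifestation = Manifestation(expression, "Penguin Classics", "978-0141439556")
item = Item(manifestation, holder="me")

# Three hops from the copy in my hand back to the abstract Work.
print(item.exemplifies.embodies.realizes.title)
```

Nothing in the class names tells you that you are dealing with a novel or a paperback; that detail has to live in strings.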

The difficulty I, and I suspect many others, have is that I don’t ever use any of those words. They’re too abstract to be useful. FRBR generalises its model and in that generalisation loses a great deal. Let’s talk about it using more natural language.

Wuthering Heights is a story by Emily Bronte. It was originally published as a novel in 1847 and has subsequently been made into a movie (several times) and re-published in many languages beyond its original English. It has been republished in many editions and as a part of many collections. It features several fictitious people including Catherine Earnshaw and Heathcliff. The author, Emily Bronte, had sisters who authored several other novels, though she authored only this one. Emily Bronte is also the subject of several biographies. I have the paperback in my hand right now.

No works, expressions and manifestations. No items. No abstraction. We can model this more clearly now, at least in my opinion.

Real 01

The structure of the model remains broadly the same, but the language allows us to see how it works and classify things more obviously. This has strong similarities to the way Bibliontology is modelled and Bibliontology is very easy to use for its intended purpose – citations.
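Sketched in code, the natural-language version might look something like this. The class and field names are assumptions made for illustration, taken from the words used in the description above rather than from Bibliontology or any published ontology.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Person:
    name: str
    fictitious: bool = False


@dataclass
class Story:
    title: str
    author: Person
    characters: List[Person] = field(default_factory=list)


@dataclass
class Novel:
    tells: Story
    first_published: int
    language: str


@dataclass
class Paperback:
    edition_of: Novel
    publisher: str
    isbn: str


emily = Person("Emily Bronte")
story = Story(
    "Wuthering Heights",
    author=emily,
    characters=[
        Person("Catherine Earnshaw", fictitious=True),
        Person("Heathcliff", fictitious=True),
    ],
)
novel = Novel(tells=story, first_published=1847, language="English")
in_my_hand = Paperback(novel, "Penguin Classics", "978-0141439556")

print(f"{in_my_hand.edition_of.tells.title} is a story by {story.author.name}")
```

The shape is much the same as the four-level chain, but every class now says what the thing actually is.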

The more specific nature of the language goes on to pay dividends when we start to add in more data. Wuthering Heights has been made into a movie (several times) and one of the problems often discussed in FRBR circles is whether or not a movie based on a book is a new work or a new expression. Of course, the argument is false as a movie that faithfully reproduces a novel is both an expression of the story told in the novel and a creative work in its own right. While the movie could not exist without the novel it is based on, the art of film-making is a creative act as well. This is a hard thing to model with the four abstract levels defined in FRBR.

Here is the FRBR model showing the movie as an expression of the original work:

FRBR 02

This now seems to imply that the movie is somehow a lesser creative work than the original novel and I’m uncomfortable about that, but we do have the relationship between the book and the movie modelled.

The alternative is to recognise the movie as a creative work in its own right in which case the model looks like this:

FRBR 03

Now we’ve recognised the movie as a creative work in its own right, but lost the detail that it shares something with the novel. That makes the model less useful.

Using less abstract terms, and more of them, we can model in a way that describes the real-life situation – and hopefully avoid some of the argument, though I’m sure other issues will arise. Adding in the movie using the less abstract terms gives us this:

Real 02

Now we have the movie recognised as what it is and we have the relationship with the original novel.
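As a rough sketch of that idea in code, the movie gets its own class and keeps two links: one to the story it tells and one to the novel it is based on. The names `tells`, `based_on` and `tell_the_same_story` are illustrative assumptions, not terms from any standard.

```python
from dataclasses import dataclass


@dataclass
class Story:
    title: str


@dataclass
class Novel:
    tells: Story
    first_published: int


@dataclass
class Movie:
    tells: Story       # shares the story with the novel
    based_on: Novel    # and records where it came from


def tell_the_same_story(a, b) -> bool:
    """True when both things point at the very same Story object."""
    return a.tells is b.tells


story = Story("Wuthering Heights")
novel = Novel(tells=story, first_published=1847)
movie = Movie(tells=story, based_on=novel)

print(tell_the_same_story(novel, movie))
print(isinstance(movie, Novel))
```

The movie is neither a kind of novel nor hidden underneath it, yet the relationship between the two is still there to query.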

I’ve applied the same logic to the physical items. It doesn’t help me to know that something is simply an item – I want to know what it is. So classes of Hardback, Paperback, CD-ROM, Blu-Ray Disc and Vinyl LP would be useful, where currently RDA provides a complex combination of Encoding Format and Carrier Type. This level of detail is more than likely required for archive and preservation purposes, but for the mainstream use of the data a top-level type would be very useful.

We can add more stuff than movies, though. We can add recordings. Showing my strange taste in music I’ll start with Wuthering Heights by Kate Bush (and the title nicely gives away where this is going). I shan’t try and model this using FRBR for comparison because I can’t see how to. If you feel you can then please sketch it out and add it in the comments or email it to me.

I don’t see a practical way in which making Wuthering Heights (the song) an expression of Wuthering Heights (story) is useful; yet there still exists a relationship between them. The song tells the same story (albeit abridged to 4:29).

Real 03

Modelling with real world terminology also allowed us to separate the song from the recording and the recording from the album it features on. Perhaps not something we can get to from the data we have today, but a useful feature to have in the model.
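A minimal sketch of that separation, assuming hypothetical class and field names and leaving the album title as a placeholder because it isn't given here:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Story:
    title: str


@dataclass
class Song:
    title: str
    tells: Story


@dataclass
class Recording:
    of: Song
    artist: str
    duration_seconds: int


@dataclass
class Album:
    title: str
    tracks: List[Recording]


story = Story("Wuthering Heights")
song = Song("Wuthering Heights", tells=story)
recording = Recording(of=song, artist="Kate Bush", duration_seconds=4 * 60 + 29)
album = Album(title="(album title placeholder)", tracks=[recording])

minutes, seconds = divmod(recording.duration_seconds, 60)
print(f"{recording.of.title} ({minutes}:{seconds:02d}) tells the story {song.tells.title}")
```

The running time belongs to the recording, the story belongs to the song, and the album is just a list of recordings.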

The richness and utility of modelling comes from giving more detail, not less and from using more specific terms, not more general terms.

The introduction of more specific terms also leads us to write more specific data conversion routines; looking to identify novels, albums, tracks, stories and more. Much of the data will not be mined from our MARC records, but by looking at the specifics we get past much of the variation that is difficult when we try to treat all works, expressions and manifestations the same across all mediums and forms of artistic endeavour.

One of the potential downsides of this approach is an ontology that may explode to contain many classes. While this seems like it is adding detail it is actually just moving detail. RDA documents this as ‘Form of Work’ – ‘A class or genre to which a work belongs.’

If the work belongs to that class, why not model it as that class?
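Here is a small sketch of the difference, under the assumption that 'Form of Work' ends up stored as a free-text label. All names are illustrative, and the inconsistent capitalisation is deliberate.

```python
from dataclasses import dataclass


# Option 1: one generic class, with the form carried as a label.
@dataclass
class Work:
    title: str
    form_of_work: str


# Option 2: the form is the class.
@dataclass
class CreativeWork:
    title: str


class Novel(CreativeWork):
    pass


class Song(CreativeWork):
    pass


records = [Work("Wuthering Heights", "novel"), Work("Wuthering Heights", "Song")]
# A label is only as consistent as whoever typed it: "Song" != "song".
songs_by_label = [r for r in records if r.form_of_work == "song"]

things = [Novel("Wuthering Heights"), Song("Wuthering Heights")]
songs_by_class = [t for t in things if isinstance(t, Song)]

print(len(songs_by_label), len(songs_by_class))
```

The detail is the same in both options; it has simply moved from a value into the type, where tools can check it.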

I know several folks out there have been having a hard time applying FRBR to serials and other things, if you fancy having a go at modelling it with real-world language instead I’d love to talk to you – comment below.

Six bottles of wine later and I can tell you, it’s pretty good stuff.

Thursday, November 5th, 2009 | Review | No Comments

Thanks to Documentally tipping us off on Twitter and his blog my wife and I received a gratuitous couple of FreshCases, one red, one white.

These claim to be the next generation of winebox and they are rather nicely designed. The floppy cardboard normally surrounding the tap on a wine box (and the digging around in the box for the tap with just two fingers) is replaced by a smart plastic moulding that, once pressed, releases the tap into position.

The tap for the red, a rather nice Nottage Hill Cabernet Shiraz, is where you’d expect it, front of the case down low. The white, a very crisp Nottage Hill Chardonnay, rather cleverly has the tap on the base, making it a perfect fit lying down in the fridge. With a little extra thought they could have made the handle asymmetric and had it hold the box at a slight angle, but they haven’t.

These cases only landed in the shop on November 1st, so it’s nice to get my hands on these so soon, and free is a great price. I’m told they hit the shelves at £19.99, making them £6.66 a bottle equivalent. That price seems high to me with so many bottles on half-price or 3 for a tenner offers. At £9.99 these would be too good to be true though – the Nottage Hill is definitely more a £7 bottle than a £3 one.

The cases themselves are a really nice design; they remind me, in proportion and style, of the boxes whisky bottles come in. That’s a bonus if you’re putting these out for a party – unlike most wine boxes these don’t look cheapskate. Of course, stripping down the box to squeeze the last glass out of the foil bag will still make you look cheap, or desperate.

Documentally went as far as to record a video showing you the FreshCase in some detail so I figured I’d just share that with you.

Hardy’s Nottage Hill FreshCase from Documentally on Vimeo.

The most obvious downside is that the size of a bottle acts as a limit to what we drink. Take away that barrier of opening (or rather not opening) the second bottle and the wine seems to run out very quickly indeed. That alone will probably keep me buying the more limiting 75cl bottles.

Where is all my subversioned code kept?

Wednesday, November 4th, 2009 | commands I have issued | No Comments

find . -type d | grep -v "/\.svn/" | grep -v "/\.svn$" | xargs svn info | grep "^URL"
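For anyone who would rather read that as a script, here is a rough Python equivalent. It is a sketch that assumes the `svn` client is on your PATH; the function names are my own, and directories that aren't working copies are simply skipped.

```python
import os
import subprocess
from typing import Iterator, List


def working_dirs(root: str) -> Iterator[str]:
    """Yield every directory under root, never descending into .svn."""
    for dirpath, dirnames, _filenames in os.walk(root):
        # Pruning in place stops os.walk from entering .svn directories.
        dirnames[:] = [d for d in dirnames if d != ".svn"]
        yield dirpath


def url_lines(svn_info_output: str) -> List[str]:
    """Keep only the lines of `svn info` output that start with URL."""
    return [line for line in svn_info_output.splitlines() if line.startswith("URL")]


def repository_urls(root: str) -> List[str]:
    """Run `svn info` on each directory and collect the URL lines."""
    urls: List[str] = []
    for path in working_dirs(root):
        result = subprocess.run(
            ["svn", "info", path], capture_output=True, text=True
        )
        if result.returncode == 0:  # not a working copy otherwise
            urls.extend(url_lines(result.stdout))
    return urls
```

Calling `repository_urls(".")` from the top of a source tree should give much the same list as the one-liner, with the bonus that paths containing spaces are passed to `svn` intact.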

Interactive storefront display | ARvertising news

Wednesday, October 28th, 2009 | Interaction Design, Other Technical | 1 Comment

As you walk down the street you are approached by a dog. He is on his guard trying to discern your intentions. He will follow you and interpret your gestures as friendly or aggressive. He will try to engage you in a relationship and get you to pay attention to him.

from Interactive storefront display | ARvertising news.

Computer generated dog, reacts to real-world passers-by.

ShelterIt – My digital think-tank: On identity

Wednesday, October 28th, 2009 | Library Tech, Semantic Web | 1 Comment

Did you notice what just happened? I used used an URI as an identifier for a subject. If you popped that URI into your browser, it will take you to WikiPedia’s article on the book and provide a lot of info there in human prose about this book, and this would make it rather easy for Bob to say that, yes indeed, that’s the same book I’ve got. So now we’ve got me and Bob agreeing that we have the same book.

from ShelterIt – My digital think-tank: On identity.

Great piece by Alexander Johannesen about the future of library data, semantic web and the difficulties of getting from here to there.

Ito World: Visualising Transport Data for Data.gov.uk

Tuesday, October 27th, 2009 | Open Data, Semantic Web | 1 Comment

It can be hard to make meaningful information from huge amounts of data, a graph and a table doesn’t always communicate all it should do. We have been working hard on technology to visualise big datasets into compelling stories that humans can understand. We were really pleased with what we came up with in just one and a half days, see for yourself

from Ito World: Visualising Transport Data for Data.gov.uk.

Nice work on visualizing traffic data.

Interview with the Twitter DJ Traktor App’s Co-Creator at djtechtools.com

Friday, October 16th, 2009 | Internet Social Impact | 1 Comment

On the surface, Twitter DJ seems like a gracious gesture from a DJ to solve the age-old problem of fans not knowing what the amazing track they’re hearing is called and who made it, as well as a boon for often small-time music producers to get some well-deserved props.

from Interview with the Twitter DJ Traktor App’s Co-Creator at djtechtools.com.

Nice integration of Traktor DJ software and Twitter – part of a growing trend that makes apps more native to the web.

yaz4j | Index Data

Tuesday, September 22nd, 2009 | Library Tech | No Comments

yaz4j is a toolkit for Java which includes a wrapper for the ZOOM API of YAZ. This allows developers to write Z39.50/SRU clients in Java. yaz4j supports both search and scan. See the javadoc for details.

from yaz4j | Index Data.

I wrote Yaz4J a couple of years ago when I needed a robust Z39.50 client. The underlying work is done by Index Data’s Yaz library, wrapped for use in Java using JNI (and yes, JNI does work fine and yes it does work cross-platform, we have it running on Linux, Windows and OS X). I hadn’t ever found the time to properly structure and mavenise the code or release it properly so it’s very pleasing that Adam Dickmeiss and Mike Taylor from Index Data along with Juan Cayetano have tidied it all up and published it under a home on Index Data’s site.

:-)
