Pairwise Comparisons of Large Datasets

It’s been a while since I last posted. Work’s been busy, interesting, challenging 🙂 But now it’s the holidays and I have some time to write.

At work we’ve been building a small team around Big Data technologies; specifically Hadoop and Elasticsearch right now though those choices may change. Unlike many Big Data projects we’re not de-normalising our data for a specific application. We have several different applications and thoughts in mind so we want to keep options open. We’re working with graph-based data structures; Linked Data, essentially.

The first product we’re building is niche and the community of users are quite private about how they do business so as I’ve said before I won’t be talking much about that. That sounded kinda creepy 8-| they’re not the mafia, they’re really nice people!

What I can share with you is a little technique we’ve developed for doing pairwise comparisons in map/reduce.

We all know map/reduce is a great way to solve some kinds of problems, and Hadoop is a great implementation that lets us scale map/reduce solutions across many machines. One class of problem that is hard to tackle this way is pairwise comparison. Let me first describe what I mean by a pairwise comparison…

Imagine you have a collection of documents. You want to know which ones are similar to which others. One way to do this is to compare every document with every other document and give each connection between them a similarity score. That is hard to do with a large collection of documents because of the number of comparisons – the problem is O(n²). Specifically, if we assume we don’t compare documents with themselves and that ɑ compared with β is the same as β compared with ɑ, then the number of comparisons is (n²-n)/2. For the 26-value example below that’s (676-26)/2 = 325 comparisons.

If you want to scale this out across a cluster, the specific difficulty is knowing what to compare next and what’s already been done. Most approaches I’ve seen use some central coordinator and require that every box in the cluster can access a central document store. Both of those become problems for very large sets.

Other approaches rely on re-defining the problem. One is to create some kind of initial grouping based on an attribute, such as a subject classification, and then only compare within those groupings. That’s a great approach and is often very suitable. Another is to generate some kind of compound key describing the document and then connect all documents with the same key. That means each document can have a key generated independently of the others, which scales really well, but it’s not always possible.

What if we really do want to compare everything with everything else? That’s the situation I’ve been looking at.

Let’s simplify the example a little. We’ll use the words of the phonetic alphabet, alpha to zulu, to represent our set of documents:

Alpha Bravo Charlie Delta Echo Foxtrot Golf Hotel India Juliet Kilo Lima Mike November Oscar Papa Quebec Romeo Sierra Tango Uniform Victor Whiskey X-ray Yankee Zulu

A pairwise comparison can be viewed as a table with the same terms heading both rows and columns. This gives us a way of thinking about the workload. The smallest unit we can package as a piece of work is a cell in the table; the similarity score for which would be the comparison of the row and column headings.

          Alpha  Bravo  Charlie  …  Yankee  Zulu
Alpha       ·      ·       ·     …     ·      ·
Bravo       ✓      ·       ·     …     ·      ·
Charlie     ✓      ✓       ·     …     ·      ·
…
Yankee      ✓      ✓       ✓     …     ·      ·
Zulu        ✓      ✓       ✓     …     ✓      ·

The cells we need to calculate are the ones marked with a ✓ – one cell for each pair, ignoring the diagonal and the mirror-image half of the table. Using the cell as the unit of work is nice and simple – compare the similarity of two things – so being able to work at this level would be great. Thinking about map/reduce, the pair and their similarity score is the final result we’re looking for, so that could be the output of the reducer code. That leaves the mapper to create the pairs.

A simplistic approach to the mapper creating pairs would be to iterate all of the values:

Receiving ‘Alpha’ as input:
1) read ‘Alpha’ and ignore it
2) read ‘Bravo’ and output ‘Alpha, Bravo’
3) read ‘Charlie’ and output ‘Alpha, Charlie’

25) read ‘Yankee’ and output ‘Alpha, Yankee’
26) read ‘Zulu’ and output ‘Alpha, Zulu’

This is not a good approach: it means the mapper has to read all of the values for every input value. Remember that we can’t assume the set will fit in memory, so we can’t keep the full set inside each mapper to iterate quickly. The reading of values is then O(n²). The mapper has to do this in order to generate the pairs that will then be compared by the reducer, so with this approach the mapper needs access to the full set of input values every time it processes a value. We’ve managed to remove the need for a central coordinator, but not for a centrally accessible store.

What we need to find is a way of generating pairs without having to iterate the full input set multiple times. Our mental model of a table gives us a possible solution for that — coordinates. If we could generate pairs of values using coordinates as keys then the sort that occurs between the map and reduce will bring together pairs of values at the same coordinate — a coordinate identifying a cell:

              1      2       3     …    25      26
            Alpha  Bravo  Charlie   …  Yankee   Zulu
 1  Alpha
 2  Bravo
 3  Charlie
 …
25  Yankee
26  Zulu

This changes what our mapper needs to know. Rather than having to know every other value, it only needs to know its own position and every other coordinate. If we use sequential, incrementing values for the coordinates then we don’t need to query for those; we can simply calculate them. To do that, the mapper needs to know the row/column number of the value it’s been given and the total number of rows/columns in the square. The total can be passed in as part of the job configuration.
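
By way of illustration, a driver along these lines could pass that total in. This is only a sketch using the newer Hadoop mapreduce API; PairwiseMap is the mapper shown below, PairwiseReduce is a reducer sketched further down, and everything apart from the TotalInputValues setting is incidental wiring:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PairwiseJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Tell every mapper how many values are in the input set.
        conf.set("TotalInputValues", args[2]);

        Job job = Job.getInstance(conf, "pairwise comparison");
        job.setJarByClass(PairwiseJob.class);
        job.setMapperClass(PairwiseMap.class);
        job.setReducerClass(PairwiseReduce.class);
        // KeyValueTextInputFormat splits each line at the first tab, so the
        // position number becomes the key and the word becomes the value.
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}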

Getting the position of our value within the input sequence is a little tricky. The TextInputFormat reads input files line by line and passes each line to the mapper. If the key it passed to the mapper were the line number, this problem would be very easy to solve. Unfortunately it passes the byte offset within the file. One way to know the position, then, would be to use fixed-length values; the byte offset divided by the fixed length would then give the position. Alternatively we could pre-process the file and create a file of the form ‘1 [tab] Alpha’ to provide the position explicitly. This requires a single-threaded pass over the entire input set to generate an incrementing position number — not ideal.

It also means that if your comparison takes less time than creating a position-indexed file then this approach won’t be useful to you. In our case it is useful.
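
For what it’s worth, the pre-processing pass itself is trivial. Something along these lines would do it – a minimal, single-threaded sketch run before the Hadoop job, with the input and output file names taken from the command line:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.PrintWriter;

// Reads one value per line and writes 'position<TAB>value' so that the
// mapper can recover each value's row/column number from its key.
public class PositionIndexer {
    public static void main(String[] args) throws Exception {
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]));
             PrintWriter out = new PrintWriter(args[1])) {
            int position = 1;
            String value;
            while ((value = in.readLine()) != null) {
                out.println(position + "\t" + value);
                position++;
            }
        }
    }
}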

The mapper logic for a coordinate approach becomes:

1) read ‘Alpha’
2) output ‘Alpha’ to the coordinates of cells where it should be compared.

A naive implementation of this would output ‘Alpha’ to cells 1,1 to 26,1 for the top row and 1,1 to 1,26 for the left-most column. That would fill the full n² grid, but we know we can optimise it to (n²-n)/2, in which case Alpha would be output to cells 1,2 to 1,26 only; the marked cells in our example. A middle-position value, Lima, would be output to cells 1,12 to 11,12 and 12,13 to 12,26. This means the mappers only have to pass over the input values once – O(n).

In code:

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PairwiseMap extends Mapper<Text, Text, Text, Text> {

    // Emit this value to every cell to the left of the diagonal in its own row.
    private static void output_rows(int row, Text name, Context context)
            throws IOException, InterruptedException {
        for (int col = 1; col < row; col++) {
            String key = String.format("%d,%d", row, col);
            context.write(new Text(key), name);
        }
    }

    // Emit this value to every cell below the diagonal in its own column.
    private static void output_cols(int wordPosition, Text name, int total, Context context)
            throws IOException, InterruptedException {
        int column = wordPosition;
        for (int row = wordPosition + 1; row <= total; row++) {
            String key = String.format("%d,%d", row, column);
            context.write(new Text(key), name);
        }
    }

    @Override
    protected void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        // Total number of values, passed in via the job configuration.
        int total = Integer.parseInt(context.getConfiguration().get("TotalInputValues"));
        // The key is the position number from the pre-processed 'position [tab] value' file.
        int line = Integer.parseInt(key.toString().trim());
        output_rows(line, value, context);
        output_cols(line, value, total, context);
    }
}
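
For completeness, the reducer then just has to compare the two values that arrive at each coordinate. Something like the sketch below would do it; the similarity method here is a stand-in for whatever comparison you actually need between two documents, and everything else is plumbing:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class PairwiseReduce extends Reducer<Text, Text, Text, Text> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Each coordinate key should receive exactly two values: one emitted for
        // the row's word and one emitted for the column's word.
        List<String> pair = new ArrayList<String>();
        for (Text value : values) {
            pair.add(value.toString());
        }
        if (pair.size() != 2) {
            return; // nothing to compare at this coordinate
        }
        double score = similarity(pair.get(0), pair.get(1));
        context.write(new Text(pair.get(0) + "," + pair.get(1)),
                new Text(String.valueOf(score)));
    }

    // Placeholder: a real implementation would score the similarity of two documents.
    private double similarity(String a, String b) {
        return a.equals(b) ? 1.0 : 0.0;
    }
}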

This solution is effective but the pre-processing and the need to know the total are both frustrating limitations.

I can’t think of a better way to get the position, either with input files in HDFS or with rows in an HBase table. If you have a super-clever way to know the position of a value in a sequence, that would help a lot. Maybe a custom HBase input format is a possibility.

Any suggestions for improvements would be gratefully received 🙂

 

Room Not Found

I was staying in a hotel last week. I arrived and they gave me room 404. I went to the fourth floor and looked and looked but couldn’t find it.

I went back down to reception and they were very apologetic; they’d recently combined it with 403 and it was no longer there. They checked me into room 200. I said ‘OK’.

#this_really_happened.

Big Data, Large Batches and My Mistake

This is week 9 for me in my new challenge at Callcredit. I wrote a bit about what we’re doing last time and can’t write much about the detail right now as the product we’re building is secret. Credit bureaus are a secretive bunch, culturally. Probably not a bad thing given what they know about us all.

Don’t expect a Linked Data tool or product. What we’re building is firmly in Callcredit’s existing domain.

As well as the new job, I’ve been reading Eric Ries’ The Lean Startup, tracking Big Data news and developing this app. This weekend the combination of these things became a perfect storm that led me to a D’Oh! moment.

One of the many key points in Lean Startup is to maximise learning by getting stuff out as quickly as possible. The main aspect of getting stuff out is to work in small batches. There are strong parallels here with Agile development practices and the need to get a single end-to-end piece of functionality working quickly.

This GigaOm piece on Hadoop’s days being numbered describes the need for faster, smaller batches too; in the context of data analysis responses and incremental changes to data. It introduces a number of tools, some of which I’ve looked at and some I haven’t.

The essence of moving to tools like Percolator, Dremel and Giraph is to reduce the time to learning; to shorten the time it takes to get round the data processing loop.

So, knowing all of this, why have I been working in large batches? I’ve spent the last few weeks building out quite detailed data conversions, but without a UI on the front to get any value from them! Why, given everything I know and all that I’ve experienced, didn’t I build a narrow end-to-end system that could be quickly broadened out?

A mixture of reasons, none of which are really valid; just tricks of the mind.

Yesterday I started to fix this and built a small batch, end-to-end, run that I can release soon for internal review.

🙂

Getting over-excited about Dinosaurs…

I had the great pleasure, a few weeks ago, of working with Tom Scott and Michael Smethurst at the BBC on extensions to the Wildlife Ontology that sits behind Wildlife Finder.

In case you hadn’t spotted it (and if you’re reading this I can’t believe you haven’t) Wildlife Finder provides its information in HTML and RDF — Linked Data, providing a machine-readable version of the documents for those who want to extend or build on top of it. Readers of this blog will have seen Wildlife Finder showcased in many, many Linked Data presentations.

The initial data modelling work was a joint venture between Tom Scott of the BBC and Leigh Dodds of Talis, and they built an ontology that is simple, elegant and extensible. So, when I got a call asking if I could help them add Dinosaurs into the mix I was chuffed — getting paid to talk about dinosaurs!

Like most children, and we’re all children really, I got over-excited and rushed up to London to find out more. Tom and I spent some time working through changes and he, being far more knowledgeable than I on these matters, let me down gently.

Dinosaurs, of course, are no different to other animals in Wildlife Finder — other than being dead for a while longer…

This realisation made me feel a little below average in the biology department I can tell you. It’s one of those things you stumble across that is so obvious once someone says it to you and yet may well not have occurred to you without a lot of thought.

 

Choosing URIs, not a five minute task.

This post originally appeared on the Talis Consulting Blog.

Chris Keene at Sussex is having a tough time making a decision on his URIs so I thought I’d wade in and muddy the waters a little.

He’s following advice from the venerable Designing URI Sets for the UK Public Sector. An eleven page document from the heady days of October 2009.

Chris discusses the choice between data.lib.sussex.ac.uk and www.sussex.ac.uk/library/ in terms of elegance, data merging and running infrastructure. He’s leaning toward data.lib.sussex.ac.uk on the basis that data.organisation.tld is the prevailing wind.

There are many more aspects worth considering and, while data.organisation.tld may be a way to get up and running quickly, you might get longer-term benefit from more consideration; after all, we don’t want these URIs to change.

The key requirements are outlined well in ‘Designing URI Sets’ as follows:

3. In particular, the domain will:

  • Expect to be maintained in perpetuity
  • Not contain the name of the department or agency currently defining and naming a concept, as that may be re-assigned
  • Support a direct response, or redirect to department/agency servers
  • Ensure that concepts do not collide
  • Require the minimum of central administration and infrastructure costs
  • Be scalable for throughput, performance, resilience

These are all key points, but one in particular stands out for me in terms of choosing the hostname part of a URI:

  • Not contain the name of the department or agency currently defining and naming a concept, as that may be re-assigned

That simple sentence contains a lot more than at first reading and suggests that any or all of the concepts defined in the data may become someone else’s responsibility in time. I think over time we will see this becoming key to the longevity of URIs, along with much better redirect maintenance.

The approach data.gov.uk has taken is to break the data into broad subject areas within which many different types of data might sit – education.data.gov.uk, transport.data.gov.uk, crime.data.gov.uk, health.data.gov.uk and so on. This is one example of breaking up the hosts and while right now they all point to one cluster of web servers they can be moved around to allow hosting in different places.

This is good, yet I can’t help thinking that those subject matter areas are really rather broad. Then there are others that seem to work on a different axis, statistics.data.gov.uk and research.data.gov.uk, leaving me confused at first glance as to where the responsibility for publishing crime research would lie. Then there is patents.data.gov.uk: not “innovation” or “invention” but “patents”, the things listed.

Data.gov.uk has done a great job trailblazing, making and publishing their decisions and allowing others to learn from them, develop on them and contribute back. I think we can push their thinking on hostnames still further. If we consider Linked Data to be descriptions of things, rather than publishing data, then directories of those things would be useful.

For example, we could give somebody the responsibility of publishing a list of all schools in the UK at schools.example.gov and that would be one part of the puzzle. A different group may have the responsibility of publishing the list of all universities and yet another the list of all companies at companies.example.gov.

Of course, we would expect all of these to interlink, patents.example.gov would have links to companies.example.gov and universities.example.gov to document the ownership of patents. We’d expect to see links in schools.example.gov to inspections.schools.example.gov and so on.

Notice that I’ve dropped the word data from those examples, as much of this is about making machine (and human) readable descriptions of things. It’s only because we describe lots of things at the same time, and describe them uniformly, that we call it data.

I’d still expect health.example.gov to appear as well, but the responsibility would be one of aggregating what could be considered health data in order to support querying; it would aggregate doctors.example.gov, hospitals.example.gov and more. I would expect as many of these aggregates to pop up as are useful.

Of course, in this approach, as in the current data.gov.uk approach, everyone who wants to say something about a particular doctor, school or patent has to be able to get access to that host to say it and, perhaps, conflicting things said by different people get mixed up.

At this point you’re probably thinking, well, we might as well just use data.organisation.tld and be done with it then. Unfortunately that simply moves the same design decisions from the hostname to the resource part of the URI, the bit after the hostname. You still have to make decisions, and with only one hostname your hosting options are drastically reduced.

Data.gov.uk places the type of thing in the resource part of the URI using what they call concept/reference pairs:

2. Examples of concept/reference pairs:
• road/M5
• school/123
3. The concept/reference construct may be repeated as necessary, for example:
• road/M5/junction/24
• school/123/class/5

I tend to do this slightly differently, using container/reference pairs so I would use “roads” rather than “road” as this lends itself better to then putting listings at those URIs.

The antithesis

We can often learn something by turning an approach on its head. In this case I wonder what would happen if we embraced the idea that many people will have different world-views about the same thing, their own two-penneth so to speak. None of them necessarily authoritative.

In that case we end up with me publishing data on data.my.domain and you publishing data about the same things on data.your.domain, just as happens all over the web today. If I choose my domains carefully then maybe I can hand bits on as I find someone else to run them better, as above, but always there is more than one world-view.

There are two common ways to make this work and be interconnected. A common approach is to use owl:sameAs to indicate that data.my.domain/Winston_Churchill and data.your.domain/Winston_Churchill are describing the very same thing. The OWL community is not entirely supportive of that use.

The other approach is to use the annotation pattern and rdfs:seeAlso; in which case documents describing a resource live in many places, but they agree on a single, canonical, URI.

So what would that mean for Sussex?

Well, I’m not sure.

Fortunately, Chris has a limited decision to make right now, choosing a URI, or URIs, for the Mass Observation Archive. It is for this he is considering data.lib.sussex.ac.uk and www.sussex.ac.uk/library/.

Thinking about changing responsibilities over time, I have to say I would choose neither. It is perfectly conceivable that the Mass Observation Archive may at some time move and no longer be under the remit of the University of Sussex Library, or even the university.

I would choose a hostname that can travel with the archive wherever it may live. Fortunately it already has one, http://www.massobs.org.uk/. Ideally the catalogue would live at something like catalogue.massobs.org.uk, or maybe massobs.org.uk/archive, or something like that.

My leaning on this is really because this web of data isn’t something separate from the web of documents; it’s “as well as” and “part of” the web as one whole thing. data.anything makes it seem somehow different, which in essence it’s not.

Postscript

Oh, just one more thing…

URI type, for example one of:
• id – Identifier URI
• doc – Document URI, Representation URI
• def – Ontology URI
• set – Set URI

Personally, I really dislike this URI pattern. It leaves the distinguishing piece early in the URI, making it harder to spot the change as the server redirects and harder to select or change when working with the URIs.

I much prefer the pattern

/container/reference to mean the resource
/container/reference.rdf for the rdf/xml
/container/reference.html for the html

and expanding to

/container/reference.json, /container/reference.nt, /container/reference.xml and on and on.

My reasoning is simple: I can copy and paste the document URI from the address bar, paste it to curl on the command line and simply backspace a few characters to trim off the extension. Also, in the browser or wget, this pattern gives us files named something.html and something.rdf by default, which is much easier to work with in most tools.

Bringing FRBR Down to Earth…

I’ve been looking at FRBR for some time. I’ve written about it and spoken about it. Overall I’ve found it difficult to work with and not really useful in solving the problems of resource discovery.

One of the recurring themes I see when looking at library data in 2009 is that it is centred far too often on the record – a MARC21 record usually. This record-centric view of the world pervades much of what is possible, but often it even restricts our very thinking about what might be possible. We are constrained.

I’ve also seen many conversations about FRBR go along a similar route, discussing what exactly classifies as a work or an expression. Is the movie of the book a new work or just a different expression? The answer never being the same. According to Karen Coyle (who has taught me so much about library data) the abstract concept of Work has reached the point of being a fluid and malleable set of all the things that claim to be part of the work. Reading that I got really confused. Then, a few weeks ago, reading through several mailing lists and some more old blog posts, it hit me. The answer was right there in the discussion.

Nobody talks about works, expressions and manifestations, so why describe our data that way?

We talk about books and the stories they tell, we talk about how West Side Story is a re-telling of Romeo and Juliet. We talk about DVDs, Blu-Ray Discs and VHS Videos (OK, not so much anymore) and the movies they contain and we talk about the stories the movies tell.

Let’s look at an example and try to reconcile what we see with FRBR.

In FRBR speak (which is probably a squeaky, slightly digital noise) we would say that Wuthering Heights is a Work produced by Emily Bronte. We might have a copy of it in our hands, maybe the Penguin Classics edition (978-0141439556). We’d call the thing in our hands an Item. Then in-between Work and Item we have two levels of abstractness, the Expression which would be the story as written down in English (nobody’s quite sure where translations fit) and the Manifestation which would be that particular paperback version from Penguin.

If we add in the terms for the relationships it gets rather prosaic.

Wuthering Heights is a work by Emily Bronte, realized in a written expression of the same name. The written expression is embodied in several different manifestations each of which is exemplified by many items, one of which I hold in my hand.

I’m being deliberately extreme, I know. Comment below if you think I’m being too harsh or if you understand the FRBR/WEMI model differently.

Here it is in diagrammatic form:

[Diagram: FRBR 01]

The difficulty I, and I suspect many others, have is that I don’t ever use any of those words. They’re too abstract to be useful. FRBR generalises its model and in that generalisation loses a great deal. Let’s talk about it using more natural language.

Wuthering Heights is a story by Emily Bronte. It was originally published as a novel in 1847 and has subsequently been made into a movie (several times) and re-published in many languages beyond its original English. It has been republished in many editions and as a part of many collections. It features several fictitious people including Catherine Earnshaw and Heathcliff. The author, Emily Bronte, had sisters who authored several other novels, though she authored only this one. Emily Bronte is also the subject of several biographies. I have the paperback in my hand right now.

No works, expressions and manifestations. No items. No abstraction. We can model this more clearly now, at least in my opinion.

[Diagram: Real 01]

The structure of the model remains broadly the same, but the language allows us to see how it works and classify things more obviously. This has strong similarities to the way Bibliontology is modelled and Bibliontology is very easy to use for its intended purpose – citations.
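
To make that concrete, here is a rough sketch of the less abstract model as RDF, built with Apache Jena. The class and property names are purely illustrative assumptions rather than an existing vocabulary – something like Bibliontology would be the obvious real choice:

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.vocabulary.RDF;
import org.apache.jena.vocabulary.RDFS;

public class WutheringHeightsSketch {
    public static void main(String[] args) {
        String ns = "http://example.org/vocab/"; // made-up vocabulary namespace
        Model m = ModelFactory.createDefaultModel();

        Property tellsStory = m.createProperty(ns, "tellsStory");
        Property author = m.createProperty(ns, "author");
        Property editionOf = m.createProperty(ns, "editionOf");

        // The story itself, independent of any particular telling of it.
        Resource story = m.createResource("http://example.org/stories/wuthering-heights")
                .addProperty(RDF.type, m.createResource(ns + "Story"))
                .addProperty(RDFS.label, "Wuthering Heights");

        Resource bronte = m.createResource("http://example.org/people/emily-bronte")
                .addProperty(RDFS.label, "Emily Bronte");

        // The novel tells the story and has an author.
        Resource novel = m.createResource("http://example.org/novels/wuthering-heights")
                .addProperty(RDF.type, m.createResource(ns + "Novel"))
                .addProperty(tellsStory, story)
                .addProperty(author, bronte);

        // The Penguin Classics paperback is an edition of the novel.
        m.createResource("http://example.org/books/9780141439556")
                .addProperty(RDF.type, m.createResource(ns + "Paperback"))
                .addProperty(editionOf, novel);

        m.write(System.out, "TURTLE");
    }
}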

The more specific nature of the language goes on to pay dividends when we start to add in more data. Wuthering Heights has been made into a movie (several times) and one of the problems often discussed in FRBR circles is whether or not a movie based on a book is a new work or a new expression. Of course, the argument is false as a movie that faithfully reproduces a novel is both an expression of the story told in the novel and a creative work in its own right. While the movie could not exist without the novel it is based on, the art of film-making is a creative act as well. This is a hard thing to model with the four abstract levels defined in FRBR.

Here is the FRBR model showing the movie as an expression of the original work:

[Diagram: FRBR 02]

This now seems to imply that the movie is somehow a lesser creative work than the original novel and I’m uncomfortable about that, but we do have the relationship between the book and the movie modelled.

The alternative is to recognise the movie as a creative work in its own right in which case the model looks like this:

[Diagram: FRBR 03]

Now we’ve recognised the movie as a creative work in its own right, but lost the detail that it shares something with the novel. That makes the model less useful.

Using less abstract terms, and more of them, we can model in a way that describes the real-life situation – and hopefully avoid some of the argument, though I’m sure other issues will arise. Adding in the movie using the less abstract terms gives us this:

[Diagram: Real 02]

Now we have the movie recognised as what it is and we have the relationship with the original novel.

I’ve applied the same logic to the physical items. It doesn’t help me to know that something is simply an item – I want to know what it is. So classes of Hardback, Paperback, CD-ROM, Blu-Ray Disc and Vinyl LP would be useful, where currently RDA provides a complex combination of Encoding Format and Carrier Type. This level of detail is more than likely required for archive and preservation purposes, but for the mainstream use of the data a top-level type would be very useful.

We can add more stuff than movies, though. We can add recordings. Showing my strange taste in music, I’ll start with Wuthering Heights by Kate Bush (and the title nicely gives away where this is going). I shan’t try to model this using FRBR for comparison because I can’t see how to. If you feel you can then please sketch it out and add it in the comments or email it to me.

I don’t see a practical way in which making Wuthering Heights (the song) an expression of Wuthering Heights (the story) is useful; yet there still exists a relationship between them. The song tells the same story (albeit abridged to 4:29).

[Diagram: Real 03]

Modelling with real world terminology also allowed us to separate the song from the recording and the recording from the album it features on. Perhaps not something we can get to from the data we have today, but a useful feature to have in the model.

The richness and utility of modelling comes from giving more detail, not less and from using more specific terms, not more general terms.

The introduction of more specific terms also leads us to write more specific data conversion routines; looking to identify novels, albums, tracks, stories and more. Much of the data will not be mined from our MARC records, but by looking at the specifics we get past much of the variation that is difficult when we try to treat all works, expressions and manifestations the same across all mediums and forms of artistic endeavour.

One of the potential downsides of this approach is an ontology that may explode to contain many classes. While this seems like it is adding detail it is actually just moving detail. RDA documents this as ‘Form of Work’ – ‘A class or genre to which a work belongs.’

If the work belongs to that class, why not model it as that class?

I know several folks out there have been having a hard time applying FRBR to serials and other things, if you fancy having a go at modelling it with real-world language instead I’d love to talk to you – comment below.

Chase Jarvis Blog: Uber-Cool Video Projections On Buildings

Projection on buildings – Live performance from NuFormer Digital Media on Vimeo.

There’s been a lot of hot fuss lately about what’s possible with new projection media, especially in urban environments, onto building facades, etc. Last time I was in Paris there was similar stuff emerging on building walls in the Marais, but this seems to be evolving quickly and really taking off. Impressive live performance here in this video from NuFormer.

from Chase Jarvis Blog: Uber-Cool Video Projections On Buildings.

The interaction of a 3D-modelled world, projected onto the same building as in the model, allows awesome effects as the building is brought to life. It reminded me of the opening of the Atlantis Hotel, Dubai, which made similar attempts to use projection and lighting to change the building but failed to really make use of the building itself, IMHO.

What will make eBooks as readily available as MP3s?

Printing Press by Thomas Hawk, Licensed cc-nc

I was talking to a colleague recently about ebooks and the lack of access to course text books electronically. I asked why he thought that was, and he suggested that we were waiting for digital rights management to be sorted out – he meant that in his view we were waiting for DRM technology to be strong enough to protect publishers’ intellectual property rights.

This struck me as interesting, as that certainly wasn’t the case with music where DRM has been struggling (and failing) to catch up for some time. Then, last week, came the news that Amazon had recalled a book it had previously sold, at the publisher’s behest, deleting it from everyone’s Kindle and refunding them. James Grimmelmann reminds us he warned of Amazon’s terms and suggests we need new laws around digital property rights.

We’d also been discussing this at work, in the context of how digital music has disrupted things and attempting to predict how and what ebooks will disrupt and when. Then, Roy Tennant pops up saying Print is SO Not Dead. All of these got me thinking more about it.

The major trigger for digital music was the MP3 player – some cheap, some cool, both hardware and software. People bought MP3 players instead of CD, Mini-disc and cassette players because they were smaller, could hold more music and had better battery life. Initially you put music onto them by ripping the CDs you already owned; i.e. there was a cheap, easy way to get digital music onto them from your existing media. We’ll leave the legality of ripping CDs to others.

It was this ability to get music into digital file form that led to online music sharing, and subsequently to the publishing of non-DRM MP3 files from major record labels. The ease with which music could be made available in digital form for anyone to use is what changed the recording industries business and gave consumers what they wanted – cheap, DRM-free music from their favourite artists.

DRM didn’t work for music for many reasons, not least the ease with which people could get hold of DRM-free copies. Other contributing factors included the profusion of cheap MP3 players: these players meant people didn’t just want DRM-free music, but needed it, because their cheap players wouldn’t play DRM. Those cheap players wouldn’t implement DRM because of the increased hardware cost of supporting it as well as the licensing cost of many of the schemes. Remember, we’re not talking about a £200 iPod here, we’re talking about a £5 USB stick with a headphone socket and 4 buttons.

The per unit manufacturing cost of an ebook reader is much higher, they’ll sell in smaller volumes, they have more parts including a good screen and use newer technologies rather than off-the-shelf components. The proportion of cost that DRM would impose is a much smaller part of the total unit cost than it was for MP3 players.

A good few ebook readers have come out over the past year or so, including the first and second generation Kindles, the BeBook and, recently, the Samsung Papyrus. All very nice and very capable. Plastic Logic are on the brink of launching a nice new, very lightweight plastic reader.

But there’s still something missing – the books.

That’s not to say nothing’s progressing – a great number of books are available and I’ve not heard anyone complaining that they couldn’t get anything – but it’s still a tiny drop in the ocean: Barnes & Noble launching a store with 700,000 books, compared to Kindle’s “Over 300,000 eBooks, Newspapers, Magazines, and Blogs”. To put that in context, the Library of Congress alone has 141,847,810 items in its catalogue.

And I have a stack of books on my desk, real ones with paper pages. And no way to easily get these onto my laptop.

This is due to several asymmetries. In music, the music is recorded and a player has always been required to reproduce the sound, whether analogue or digital. Books have had many advances in the production side, but not on the consumer side – books have never needed a player.

The second asymmetry is of the display and input of computers. Bill Buxton talks about this a lot when explaining why computers are still on the periphery of life, rather than integrated through it. Essentially this comes down to the issue that the display on my laptop can’t also see – there is no easy way to put a physical book into the computer.

So where does that leave the take-up of ebooks? The publishers seem to be in the same position the record industry was in some time ago, but without the driver to change. With music, consumers were able to say “if you won’t do this, we’ll do it ourselves” but with books that isn’t as easy. There aren’t students out there copying text books to give to their fellow students.

So without an obvious source of DRM-free ebooks – ones that people really want to read – and with DRM a much smaller part of the manufacturing cost, it seems unlikely that we’re going to see cheap, non-DRM ebook readers being taken up by lots of people.

So, in the absence of consumer-led digitisation of everyone’s existing collections, and assuming Google’s book scans don’t become freely available, what reason do publishers have to really support open and flexible digital publishing? None that I can see.

So this is where DRM may actually come in useful – in providing the mechanism that allows publishers to release those precious digital copies into the marketplace.