Sir Tim Berners-Lee to advise the Government on public information delivery –

The Prime Minister has announced the appointment of the man credited with inventing the World Wide Web, Sir Tim Berners-Lee as expert adviser on public information delivery. The announcement was part of a statement on constitutional reform made in the House of Commons this afternoon.

Sir Tim Berners-Lee, who is currently director of the World Wide Web Consortium, which oversees the web’s continued development, will head a panel of experts who will advise the Minister for the Cabinet Office on how government can best use the internet to make non-personal public data as widely available as possible.

He will oversee the work to create a single online point of access for government held public data and develop proposals to extend access to data from the wider public sector, including selecting and implementing common standards. He will also help drive the use of the internet to improve government consultation processes.

TimBL talked about this at TED2009.

This is fantastic news, of course. Ambitious timescales, following the lead of the Obama administration, opening up government data for re-use as well as public oversight. All very good things.

The technical challenges in doing this will be very interesting. First off, the service will undoubtedly be Linked Data – the pattern of the Semantic Web or Web of Data. TimBL has been describing the efforts of the Linked Open Data community as “the web done right” for some time now. Linked data is also the approach taken by the US administration and is really starting to gather pace, just like the early days of the document web. That will be interesting to see, as it’s a different discipline to developing a basic HTML site, with a different set of balances and trade-offs in the data modeling, granularity, URI design and so on.

Second up will be scaling to meet the traffic demand. As both a high profile linked data service and UK government data it will be highly in demand from day one. Coping with peak traffic loads is not technically difficult as long as someone has their eye on that ball from the start. It’s likely that demand for this data will be global, at least from those exploring what has been published, so traffic could get very high indeed. One of the aspects that might make this easier is that it will almost certainly be read-only for the foreseeable future, and that allows far more flexibility (and simplicity) in the approach to scaling.

Talking of it being read-only… Being a high profile data-source there will need to be a focus on securing it, not to prevent access, but to prevent unauthorised changes. Given the current atmosphere surrounding MPs’ expense claims and the level of voting in the recent European parliament elections it seems obvious that this will be a target for disgruntled and technically adept individuals both here and abroad. The read-only nature of the service helps make this easier, as does the linked data approach, which in many security respects is the same as the web of documents we have today – that is, securing it is well understood.

Definitely a project to watch closely.

[Disclosure - I work for Talis, a software company that offers a semantic web platform for doing this kind of publishing]

Legally Speaking: The Dead Souls of the Google Booksearch Settlement – O'Reilly Radar

In the short run, the Google Book Search settlement will unquestionably bring about greater access to books collected by major research libraries over the years. But it is very worrisome that this agreement, which was negotiated in secret by Google and a few lawyers working for the Authors Guild and AAP (who will, by the way, get up to $45.5 million in fees for their work on the settlement—more than all of the authors combined!), will create two complementary monopolies with exclusive rights over a research corpus of this magnitude. Monopolies are prone to engage in many abuses.

The Book Search agreement is not really a settlement of a dispute over whether scanning books to index them is fair use. It is a major restructuring of the book industry’s future without meaningful government oversight. The market for digitized orphan books could be competitive, but will not be if this settlement is approved as is.

from Legally Speaking: The Dead Souls of the Google Booksearch Settlement – O’Reilly Radar.

Coghead closes for business

With the announcement that Coghead, a really very smart app development platform, is closing its doors it’s worth thinking about how you can protect yourself from the inevitable disappearance of a service.

Of course, there are all the obvious business type due diligence activities like ensuring that the company has sufficient funds, understanding how your subscription covers the cost (or doesn’t) of what you’re using and so on, but all these can do is make you feel more comfortable – they can’t provide real protection. To be protected you need 4 key things – if you have these 4 things you can, if necessary, move to hosting it yourself.

  1. URLs within your own domain.

    Both you and your customers will bookmark parts of the app, email links, embed links in documents, build Excel spreadsheets that download the data and so on and so on. You need to control the DNS for the host that is running your tenancy in the SaaS service. Without this you have no way to redirect your customers if you need to run the software somewhere else.

    This is, really, the most important thing. You can re-create the data and the content, you can even re-write the application if you have to, but if you lose all the links then you will simply disappear.

  2. Regular exports of your data.

    You may not get much notice of changes in a SaaS service. The point at which you find they are having outages, going bust or have simply disappeared is not the time to work out how to get your data back out. Automate a regular export of your data so you know you can’t lose too much. Coghead allowed for that and are giving people time to get their data out.

  3. Regular exports of your application.

    Having invested a lot in working out the right processes, rules and flows to make best use of your app you want to be able to export that too. This needs to be exportable in a form that can be re-imported somewhere else. Coghead hasn’t allowed for this, meaning that Coghead customers will have to re-write their apps based on a human reading of the Coghead definitions. Which brings me on to my next point…

  4. The code.

    You want to be able to take the exact same code that was running as SaaS, install it on your own servers, load your exported application and data and update your DNS. Without the code you simply can’t do that. Making the code open-source may be a problem as others could establish equivalent services very quickly, but the software industry has had ways to deal with this problem through escrow and licensing for several decades. The code in escrow would be my absolute minimum.
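As a rough illustration of the "automate a regular export" advice, here is a minimal Python sketch. Everything in it is an assumption made for the example: `fetch_records` is a hypothetical stand-in for whatever export API or download your provider actually offers, and the file naming is just one reasonable choice. It is not how Coghead or any other service works.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def export_snapshot(records, export_dir, now=None):
    """Write records to a timestamped JSON file and return its path."""
    now = now or datetime.now(timezone.utc)
    export_dir = Path(export_dir)
    export_dir.mkdir(parents=True, exist_ok=True)
    # One file per run, so an older snapshot is never overwritten.
    path = export_dir / f"export-{now:%Y%m%dT%H%M%SZ}.json"
    path.write_text(json.dumps(records, indent=2, sort_keys=True), encoding="utf-8")
    return path


def fetch_records():
    # Hypothetical placeholder: replace with a call to your provider's
    # real export mechanism (API, CSV download, database dump...).
    return [{"id": 1, "name": "example customer"}]


if __name__ == "__main__":
    out = export_snapshot(fetch_records(), tempfile.mkdtemp())
    print(f"wrote {out}")
```

Run from cron or any other scheduler, something like this gives you a dated trail of snapshots held somewhere you control, which is the whole point.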

SaaS and PaaS (Platform as a Service) providers promote a business model based on economies of scale, lower cost of ownership, improved availability, support and community. These things are all true even if they meet the four needs above – but the priorities for these needs are with the customer, not with the provider. That’s because meeting these four needs makes the development of a SaaS product harder and it also makes it harder for any individual customer to get set up. We certainly don’t meet all four with our SaaS and PaaS offerings at work yet, but I am confident that we’ll get there – and we’re not closing our doors any time soon ;-)

OCLC, Record Usage, Copyright, Contracts and the Law

FUD truck by John Markos on Flickr

NB: This is my own blog. The opinions I publish do not necessarily reflect those of my employer. I am not a lawyer, but I did ask James Grimmelmann for his thoughts.

Over on Metalogue, Karen Calhoun has been clarifying OCLC’s thinking behind its intention to change the usage policy for records sourced from WorldCat. It’s great to see OCLC communicating this stuff, albeit a tad late given the furore that had already ensued. The question still remains though, are they right to be doing what they are?

Firstly, in the interest of full disclosure, let me make it perfectly clear that I work for Talis. I enjoy working for Talis and I agree with Talis’s vision. I have to say that because Karen is clearly not happy with us:

OCLC has been severely criticized for its WorldCat data sharing policies and practices. Some of these criticisms have come from people or organizations that would benefit economically if they could freely replicate WorldCat.

OCLC believe that Talis is one of those organisations, and we are. There are others too, LibraryThing, Reddit, OpenLibrary, Amazon, Google. Potentially many libraries could benefit too.

This isn’t the first time I’ve talked about OCLC’s business model. I wrote an open letter to Karen Calhoun some time ago, talking about the issues of centralised control. The same concerns raise themselves again now. I feel there are several mis-conceptions in what Karen writes that I would like to offer a different perspective on.

First off, OCLC has no right to do this. That sounds all moral and indignant. I don’t mean it that way. What I mean is, they have literally no right in law – or at least only a very limited one.

Karen talks a lot about Creative Commons in her note, and it’s apparent that they even considered using a Creative Commons license:

And yes, while we considered simply adopting a Creative Commons license, we chose to retain an OCLC-specific policy to help us re-express well-established community practice from the Guidelines.

There is an important thing to know about CC. Applying a Creative Commons License to data is utterly worthless. It may indicate the intent of the publisher, but has absolutely no legal standing. This is because CC is a license scheme based on Copyright. Data is not protected by Copyright. The courts settled this in Feist Publications v. Rural Telephone Service.

This means that when Karen Coombs asks for several rights for the data:

1. Perpetual use – once I’ve downloaded something from OCLC I’ve got the right to use it forever period end of story. This promotes a bunch of things including the LOCKSS principle in the event something happens to OCLC
2. Right to share – records I’ve downloaded I’ve got the right to share with others
This means share in any fashion which the library sees fit, be it Z39.50 access, SRU/W, OAI, or transmission of records via other means
3. Right to migrate format – Eventually, libraries may stop using MARC or need to move records into a non-MARC system. So libraries need the right to transform their records

it is simply a matter of the members telling OCLC that’s how it’s gonna be. For those not under contract with OCLC – you have these rights already!

Therein lies the nub of OCLC’s problem. In Europe the database would be afforded legal protection simply by virtue of having taken effort or investment to create, the so called sui-generis right. US law does not have any such protection for databases. I know this because I was heavily involved in the development of the Open Data Commons PDDL and a real-life lawyer told me.

So, other legal remedies that might be used to enforce the policy could include a claim for misappropriation – reaping where one has not sown. This would be under state, rather than federal, law, though NBA v. Motorola suggests that misappropriation may only apply if for some reason OCLC were unable to continue their service as a result. James Grimmelmann tells me:

RS: If I understand correctly that would mean the only option left for enforcing restrictions on the use of the data would be contractual. Have I missed something obvious?

JG: I could see a claim for misappropriation under state law — OCLC has invested effort in creating WorldCat, and unauthorized use would amount to “reaping where one has not sown,” in the classic phrase from INS  v. AP.  I doubt, however, that such a claim would succeed, since misappropriation law is almost completely preempted by copyright.  Recent statements of misappropriation doctrine (e.g., NBA v. Motorola) suggest that it might remain available only where the plaintiff’s service couldn’t be provided at all if the defendant were allowed to do what it’s doing.  I don’t think that applies here.  So you’re right, it’s only contractual.

Without any solid legal basis on which to build a license directly, the policy falls back to being simply a contract – and with any contract you can decide if you wish to accept it or not. That, I suspect, is why OCLC wish to turn the existing guidelines into a binding contract.

So, OCLC members have the choice as to whether or not they accept the terms of the contract, but what about OpenLibrary? Some have suggested that this change could scupper that effort due to the viral nature of the reference to the usage policy in the records ultimately derived from WorldCat.

Nonsense. This is a truck load of FUD created around the new OCLC policy. Those talking about this possibility are right to be concerned, of course, as that may well be OCLC’s intent, but it doesn’t hold water. Given that the only enforcement of the policy is as a contract, it is only binding on those who are party to the contract. If OpenLibrary gets records from OCLC member libraries the presence of the policy statement does not create a contract, so OpenLibrary would not be considered party to the contract and not subject to enforcement of it. That is, if they haven’t signed a contract with OCLC this policy means nothing to them. They are under no legal obligation to adhere to it.

This is why OCLC are insisting that everyone has an upfront agreement with them. They know they need a contract. James Grimmelmann, who confirmed my interpretations of US law for me, said this in his reply this morning:

JG: Let me add that it is possible for entities that get records from entities that get records from OCLC to be parties to OCLC’s contracts; it just requires that everyone involved be meticulous about making everyone else they deal with agree to the contract before giving them records. But as soon as some entities start passing along records without insisting on a signature up front, there are players in the system who aren’t bound, and OCLC has no contractual control over the records they get.

Jonathan Rochkind also concludes that OCLC’s focus on Copyright is bogus:

All this is to say, the law has changed quite a bit since 1982. If OCLC is counting on a copyright, they should probably have their legal counsel investigate. I’m not a lawyer, it doesn’t seem good to me–and even if they did have copyright, I can’t see how this would prevent people from taking sets of records anyway, as long as they didn’t take the whole database. But I’m still not a lawyer.

This is OCLC’s fear, that the WorldCat will get out of the bag.

The comparisons with other projects that use licenses such as CC or GFDL, and even open-source licenses are also entirely without merit.

To understand why we have to understand the philosophy behind the use of licenses. In OCLC’s case the intention is to restrict the usage of the data in order to prevent competing services from appearing. In the case of wikipedia and open-source projects the use of licenses is there to allow the community to fork the project in order to prevent monopoly ownership – i.e. to allow competing versions to appear. There are many versions of Linux, the community is better for that, the good ones thrive and the bad ones die. When a good one goes bad others rise up to take its place, starting from a point just before things went bad. If this is what OCLC want they must allow anyone to take the data, all of it, easily and create a competing service – under the same constraints, that the competing service must also make its data freely available. That’s what the ODC PDDL was designed for.

The reason this works in practice is that these are digital goods, in economic terms that means they are non-rival – if I give you a copy I still have my own copy, unlike a rival good where giving it to you would mean giving it up myself. OCLC has built a business model based on the notion that its data is a rival good, but the internet, cheap computing and a more mature understanding shows that to be broken.

Jonathan Rochkind also talks about a difference in intent in criticising OCLC’s comparison with Creative Commons:

But there remains one very big difference between the CC-BY-NC license you used as a model, and the actual policy. Your actual policy requires some recipients of sharing to enter into an agreement with OCLC (which OCLC can refuse to offer to a particular entity). The CC-BY-NC very explicitly and intentionally does NOT require this, and even _removes_ the ability of any sharers to require this.

This is a very big difference, as the entire purpose of the CC licenses is to avoid the possibility of someone requiring such a thing. So your policy may be like CC-BY-NC, while removing its very purpose.

Striving to prevent the creation of an alternative database is anti-competitive, reduces innovation and damages the member libraries in order to protect OCLC corp.

Their [OCLC's record usage guidelines] stated rationale for imposing conditions on libraries’ record sharing is that “member libraries have made a major investment in the OCLC Online Union Catalog and expect other member libraries, member networks and OCLC to take appropriate steps to protect the database.”

This makes no sense. The investment has been made now. The money is gone. What matters now is how much it costs libraries to continue to do business. Those costs would be reduced by making the data a commodity. Several centralised efforts have the potential to do just that, but the internet itself has that potential too, a potential OCLC has been working against for a long time. Their fight has taken the form of asking member libraries and software authors like Terry Reese not to upset the status quo by facilitating easy access to the Z39.50 network and now this change to the policy.

What underlies this is a lack of trust in the members. OCLC know that if an alternative emerged its member libraries would move based on merit, and OCLC clearly doesn’t believe it could compete on that level playing field. They are saying that they require a monopoly position in order to be viable.

However, what’s good for members and what’s good for OCLC are not one and the same thing. Members’ investment would be better protected by ensuring that the data is as promiscuously copied as possible. If members were to force OCLC to release the entire database under terms that ensure anyone who takes a copy must also make that copy available to others under the same terms then competition and market would be created. Competition and market are what drive innovation both in features and in cost reduction. In fact, it would create exactly the kind of market that has caused US legislators to refuse a database right, repeatedly. Think about it.

Above all, don’t be fooled that this data is anything but yours. The database is yours. All of yours.

If WorldCat were being made available in its entirety like this, it would be entirely reasonable to put clauses in to ensure any union catalogs taking the WorldCat data had to also publish their data reciprocally. That route leads us to a point where a truly global set of data becomes possible – where World(Cat) means world rather than predominantly affluent American libraries.

Surely OCLC, with its expertise in service provision, its understanding of how to analyse this kind of data, its standing in the community and not to forget its substantial existing network of libraries and librarians would continue to carve out a substantial and prestigious role for itself?

I’ve met plenty of folks from OCLC and they’re smart. They’ll come up with plenty of stuff worth the membership fee – it just shouldn’t be the data you already own.

Cryptography Challenge…

Cory Doctorow asked Bruce Schneier to give him a hand designing wedding rings. Not an obvious combination until you realise these are crypto rings…

There are two great discussions going on over at both blogs. Cory has asked his crowd to help design a cipher for his crypto wedding rings, while Bruce simply titled his post Contest: Cory Doctorow’s Cipher Wheel Rings.

The discussion on both posts is worth reading. A mixture of things popping up about the similarity between the three rings and the Enigma machine as well as comments about Jefferson’s Wheel Cipher.

Like most things Cory does (or says) there’s an element of the slightly bizarre. The prize: a not-to-be-sniffed-at signed copy of Little Brother.

The full set of photos is on Cory’s Flickr account, tagged weddingring.

Comparisons with the Enigma machine, I suspect, are bogus. While there is a visual similarity with the Enigma’s wheels, the Enigma’s cipher was implemented in the electrical wiring within the machine; the letters on the rotors simply enabled the correct starting positions to be selected. The Enigma machines perform a substitution cipher, but with the additional complexity that the substitution pattern changes for each letter through the message. I don’t see a way to do that with these rings. There may be rotor ciphers that could be implemented – I don’t know.

Jefferson’s cipher is a much closer match, a fully manual system consisting of 36 wheels with the alphabet scrambled differently on each one. Similar to the Enigma machine, sender and receiver had to have the order of the wheels synchronised and each letter would use a different substitution scheme, though Jefferson’s is not as thorough as the Enigma.

As the rings cannot be altered and the alphabet is in order on all three wheels, any attempt that results in one character of cipher text for each character of plain text will be a simple substitution cipher. While it may take several complex steps to arrive at the cipher character it will only take an attacker one step to go back.
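To make that last point concrete, here is a small Python sketch. The three-step `elaborate_encrypt` procedure and its shift amounts are invented purely for illustration and have nothing to do with the actual rings; the point is only that any fixed, per-letter procedure over an in-order alphabet collapses into a single lookup table, which an attacker can invert in one step.

```python
import string

ALPHABET = string.ascii_uppercase


def shift(letter, amount):
    """Move a letter along the in-order alphabet, wrapping at Z."""
    return ALPHABET[(ALPHABET.index(letter) + amount) % 26]


def elaborate_encrypt(letter):
    # Hypothetical multi-step procedure: three separate ring moves.
    step1 = shift(letter, 3)
    step2 = shift(step1, 11)
    return shift(step2, 19)


# However many steps the sender takes, the result is one fixed table...
TABLE = {p: elaborate_encrypt(p) for p in ALPHABET}
# ...and the attacker only needs its inverse.
INVERSE = {c: p for p, c in TABLE.items()}


def encrypt(text):
    return "".join(TABLE.get(ch, ch) for ch in text.upper())


def decrypt(text):
    return "".join(INVERSE.get(ch, ch) for ch in text.upper())


if __name__ == "__main__":
    secret = encrypt("HELLO")
    print(secret, decrypt(secret))
```

Here the three moves add up to 33 places, which is just a shift of 7, so all the sender's extra effort buys nothing over a plain Caesar shift.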

So, if you’re thinking about this problem seriously there are some things you have to decide on first…

  1. Is the ring considered secret or not?

    This isn’t an unreasonable assumption (putting aside that the details have been published online). It’s not that long ago that messages were transferred in plain text relying only on the emperor’s seal – made in wax with a ring only he carried.

  2. Can you include another secret?

    There are suggestions on the blogs of using most recent blog posts, first pages of known books and other items as keys to drive the cipher. This then involves taking the character from the key and the character from the plaintext and some form of mathematical computation (shifting rings up or down, finding the next dot above or below, that kind of thing) to arrive at the cipher text character.

  3. Is the algorithm secret?

    Knowing Bruce’s views on secrecy and security, even suggesting it is pure heresy. Considering the ring to be secret may be part of this, or may not. Some of the ideas I’ve had fall outside being encryption and really fall into the realm of a ‘secret encoding’. But hey, something has to be secret and if it can’t be the ring, or the key, then maybe it has to be the algorithm.
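The "include another secret" idea, where a character from a shared key text is combined with each plaintext character, can be sketched as follows. This is a plain Vigenère-style keyed shift chosen for illustration, not any of the schemes actually proposed on the blogs; the function name `keyed_shift` and its `direction` parameter are my own.

```python
import string

ALPHABET = string.ascii_uppercase


def keyed_shift(text, key, direction=1):
    """Shift each letter of text by the matching letter of key.

    direction=1 encrypts, direction=-1 decrypts. Non-letters pass
    through unchanged and do not consume a key letter.
    """
    key_letters = [c for c in key.upper() if c in ALPHABET]
    if not key_letters:
        raise ValueError("key must contain at least one letter")
    out = []
    used = 0  # number of key letters consumed so far
    for ch in text.upper():
        if ch in ALPHABET:
            k = ALPHABET.index(key_letters[used % len(key_letters)])
            out.append(ALPHABET[(ALPHABET.index(ch) + direction * k) % 26])
            used += 1
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    cipher = keyed_shift("ATTACK AT DAWN", "LEMON")
    print(cipher)
    print(keyed_shift(cipher, "LEMON", direction=-1))
```

Because the shift now varies from letter to letter, this is no longer a simple substitution, and with a long key such as the first page of an agreed book the key letters need never repeat within a message.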

Then, of course, you have to decide what to do with the rings. Any cryptographic algorithm fulfils one of four basic purposes:

  1. Symmetric Encryption

    These algorithms use the same key to encrypt and decrypt the text. They may use a single algorithm, like ROT13, or they may use a matched pair of algorithms, like many other substitution ciphers.

  2. Asymmetric Encryption

    These algorithms use one key to encrypt and another to decrypt. The keys in this case are paired and are usually termed public and private keys. Typically you would use the recipient’s public key to encrypt and they would use their own private key to decrypt.

  3. Non-Decryptable Hashes

    Used mostly for storing passwords (I can’t think of another use), these algorithms enable you to reliably convert plain text into a hash with little possibility of reversing the process. For passwords this means you store the hash of the password, then compare the hashed version of any sign-in attempt with the stored hash.

  4. Signing

    Signing means adding some kind of addendum to the message that confirms you wrote it. Again this is done using public/private key pairs. You use your private key to create a hashed version of the message which others can then verify using your public key.
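Two of these four purposes can be shown with nothing but the Python standard library, so here is a hedged sketch of them: ROT13 as a symmetric, single-algorithm cipher, and a salted password hash that is compared rather than reversed. The helper names and the iteration count are my own choices for the example. Asymmetric encryption and signing are left out because the standard library has no public-key primitives.

```python
import codecs
import hashlib
import hmac
import os


def rot13(text):
    """Symmetric: the same operation both encrypts and decrypts."""
    return codecs.encode(text, "rot_13")


def hash_password(password, salt=None, iterations=200_000):
    """Return (salt, digest); the password itself is never stored."""
    salt = salt if salt is not None else os.urandom(16)
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations
    )
    return salt, digest


def check_password(attempt, salt, stored, iterations=200_000):
    """Hash the sign-in attempt and compare it with the stored hash."""
    _, digest = hash_password(attempt, salt, iterations)
    return hmac.compare_digest(digest, stored)


if __name__ == "__main__":
    print(rot13("Hello"), rot13(rot13("Hello")))
    salt, stored = hash_password("correct horse")
    print(check_password("correct horse", salt, stored))
    print(check_password("wrong horse", salt, stored))
```

The password check never decrypts anything: it repeats the one-way hash on the attempt and compares the results, exactly as described in the hashing item above.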

As well as thinking about all of that good stuff it might be worth looking for clues in the design of the rings. Bruce must have had something in mind when designing the rings.

Here are the obvious things to notice:

  1. All three rings feature the alphabet in order.
  2. The dot patterns are not random.
  3. The dot pattern follows a 1, 2, 3 pattern.
  4. The dot pattern is not unique (it repeats) when looking across the three rings.

Less obvious:

  1. The S across three rings, looking at the dots above, makes dot, dot, dot, while the O across the dots on top is three blanks (dash, dash, dash?). This made me go and look at Morse Code again.

Yep, that’s all I spotted :-(

I’ll be chatting with a couple of colleagues to see if we can put our heads together and also watching to see what the winner comes up with.


Lookybook | Home

A friend emailed me a link to Lookybook a little while ago and I’ve been meaning to blog it for a while.

It’s an interesting experiment as there are a good number of brilliant picture books to read with young children online in all their glory. Not low-res scans or just the first two pages; the full books in great big hi-res, page-turning Flash.

You can, and I do, sit at the computer and read these with the kids. I suspect, though, that most parents will buy more of these books as they find more and more great books full of big pictures of lorries for the 3 year old boy in their life.