Building a simple HTTP-to-Z39.50 gateway using Yaz4j and Tomcat | Index Data

Monday, February 22nd, 2010 | Library Tech | No Comments

Yaz4J is a wrapper library over the client-specific parts of YAZ, a C-based Z39.50 toolkit, and allows you to use the ZOOM API directly from Java. Initial version of Yaz4j has been written by Rob Styles from Talis and the project is now developed and maintained at IndexData. ZOOM is a relatively straightforward API and with a few lines of code you can write a basic application that can establish connection to a Z39.50 server. Here we will try to build a very simple HTTP-to-Z3950 gateway using yaz4j and the Java Servlet technology.

from Building a simple HTTP-to-Z39.50 gateway using Yaz4j and Tomcat | Index Data.

I wrote Yaz4J a couple of years ago now and it’s great to see it getting some use outside of Talis.

left wondering…

Friday, January 29th, 2010 | Personal | No Comments

He’s a nice chap, the man across from me on the train, jolly as we share a ‘What do you do?’ over the tops of our laptops. Mine a Mac with stickers on, his an old corporate HP struggling to boot.

His top button done up, tie pulled tight, pink pin-stripe running through the dark blue of his suit; me in my worn jeans.

“What do you do?” I ask. “I’m a head hunter” he replies. “Oh, what sector?” I ask. “Big industry; Power, Energy, Oil and Gas” he says, smiling.

“That must be interesting, do you do much in renewables?” I ask trying to turn the conversation to something I’d be very interested to hear about. “Oh no, there’s nothing in renewables, it’s just a distraction” he says dismissively. He goes on… “I just finished reading a report, renewables are fine to make us look good but they can’t provide anything like enough power for the needs of somewhere like the UK. For the big companies like Shell, BP, they’re just a distraction.”

“and all this suggestion that hydrocarbons are running out isn’t true, the oil companies are happy for people to think that as it keeps the prices high, but a project I recently hired for has found millions of barrels just off Brazil. There’s plenty of it out there.”

I sit back, wondering if he has kids; if he has noticed the chaotic weather or the news; if he watched The Age of Stupid. I resist asking.

I am left saddened and wondering: do we have any chance at all?

Good roundup of new eReaders at CES on CNN

Wednesday, January 27th, 2010 | Library Tech | 1 Comment

Las Vegas, Nevada (CNN) — The first generation of electronic readers had little more than black-and-white text. The second generation had black-and-white text, simple graphics and Web connectivity.

Glimpses of the third generation are on display this week at the International Consumer Electronics Show, where manufacturers are previewing e-readers with color screens, interactive graphics and magazine-style layouts.

from Bold new e-readers grab attention at CES – CNN.com.

Hacking Into Your Account is as Easy as 123456

Saturday, January 23rd, 2010 | Security And Privacy | 2 Comments

in reality, it’s as easy as “123456”. And if that doesn’t work, we’d suggest trying “12345”, next.

from Hacking Into Your Account is as Easy as 123456.

The 18 Mistakes That Kill Startups

Monday, January 18th, 2010 | Software Business | 3 Comments

when I think about what killed most of the startups in the e-commerce business back in the 90s, it was bad programmers. A lot of those companies were started by business guys who thought the way startups worked was that you had some clever idea and then hired programmers to implement it. That’s actually much harder than it sounds—almost impossibly hard in fact—because business guys can’t tell which are the good programmers. They don’t even get a shot at the best ones, because no one really good wants a job implementing the vision of a business guy.

from The 18 Mistakes That Kill Startups by Paul Graham.

When you really want empty files…

Friday, November 27th, 2009 | commands I have issued | 2 Comments

ls | sed -e 's%^%cat /dev/null > %' | bash

* NB: Dangerous.

QOOQ – Le premier coach culinaire tactile

Friday, November 27th, 2009 | Interaction Design | 1 Comment

Tablets and multi-touch hardware are becoming more mainstream, and the release of Windows 7 will drive yet more. There hasn’t been much in the way of product design going into the tablets I’ve seen so far, which is why so many people keep hoping for an Apple tablet.

French company Unowhy are taking a different approach though, releasing a tablet targeted at the kitchen, with the sale driven by content – recipes and training videos from top French chefs.

QOOQ – Le premier coach culinaire tactile.

The physical design looks really good; if the price is right I could see these selling well and possibly prompting a targeted Linux distro for it.

Pranav Mistry: The thrilling potential of SixthSense technology | Video on TED.com

Wednesday, November 25th, 2009 | Interaction Design | 3 Comments

At TEDIndia, Pranav Mistry demos several tools that help the physical world interact with the world of data — including a deep look at his SixthSense device and a new, paradigm-shifting paper “laptop.” In an onstage Q&A, Mistry says he’ll open-source the software behind SixthSense, to open its possibilities to all.

from Pranav Mistry: The thrilling potential of SixthSense technology | Video on TED.com.

101 Things I Learned in Interaction Design School

Wednesday, November 25th, 2009 | Interaction Design | 7 Comments

A set of short, easily digested learnings from the world of Interaction Design, inspired by “101 Things I Learned in Architecture School”, by Matthew Frederick

from 101 Things I Learned in Interaction Design School.

Distributed, Linked Data has significant implications for Intellectual Property Rights in Data.

What P2P networks have done for distribution of digital media is phenomenal. It is possible, easy even, to get almost any TV show, movie, track or album you can think of by searching one of the many torrent sites. As fast as the media industry takes down one site through legal action, another appears to take its place.

I don’t want to discuss the legal, moral or social implications of this, but discuss how the internet changes the nature of our relationship with media – and data. The internet is a great big copying machine, true enough, but it’s also a fabric that allows mass co-operation. It’s that mass peer-to-peer co-operation that makes so much content available for free; content that is published freely by its creator as well as infringing content.

Sharing of copyrighted content is always likely to be infringing on p2p networks, regardless of any tricks employed, but for data the situation may be different and the Linked Data web has real implications in this space.

Taking the Royal Mail’s Postcode Address File as my working example, because it’s been in the news recently as a result of the work done by ErnestMarples.com, I’ll attempt to show how the Linked Data web changes the nature of data publishing and intellectual property.

First, in case you’re not familiar, a quick introduction to Linked Data. In Linked Data we use http web addresses (which we call URIs) not only to refer to documents containing data but also to refer to real-world things and abstract concepts. We then combine those URIs with properties and values to make statements about the things the URIs represent. So, I might say that my postcode is referred to by the URI http://someorg.example.com/myaddress/postcode. Asking for that URI in the browser would then redirect you to a document containing data about my postcode, maybe sending you to http://someorg.example.com/myaddress/postcode.rdf if you asked for data and http://someorg.example.com/myaddress/postcode.html if you asked for a web page (that’s called content negotiation). All of that works today and organisations like the UK Government, BBC, New York Times and others are publishing data this way.
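The content negotiation step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name negotiate is hypothetical, the choice of application/rdf+xml as the "data" media type is an assumption, and a real server would issue an HTTP redirect rather than return a string.

```python
def negotiate(resource_uri: str, accept_header: str) -> str:
    """Pick the document URI a server might redirect to for a given Accept header.

    Hypothetical sketch: '.rdf' for clients asking for RDF data,
    '.html' for everything else.
    """
    preferences = []
    for part in accept_header.split(","):
        fields = part.strip().split(";")
        media_type = fields[0].strip().lower()
        quality = 1.0  # default weight when no q-value is given
        for param in fields[1:]:
            name, _, value = param.strip().partition("=")
            if name == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        preferences.append((quality, media_type))

    # Highest q-value first; the sort is stable, so ties keep header order.
    preferences.sort(key=lambda pref: -pref[0])

    for quality, media_type in preferences:
        if quality <= 0:
            continue
        if media_type == "application/rdf+xml":
            return resource_uri + ".rdf"
        if media_type in ("text/html", "*/*"):
            return resource_uri + ".html"
    # Fall back to the web page when nothing recognisable was requested.
    return resource_uri + ".html"


postcode_uri = "http://someorg.example.com/myaddress/postcode"
data_doc = negotiate(postcode_uri, "application/rdf+xml")
page_doc = negotiate(postcode_uri, "text/html,application/xhtml+xml")
```

A browser sending text/html ends up at the .html document, while a Linked Data client asking for RDF ends up at the .rdf document, both starting from the same URI for the postcode itself.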

Also worth noting is the distinction between Linked Data (the technique described above) and Linked Open Data, the output of the W3C’s Linking Open Data project. The distinction matters because I’m talking about how commercially owned and protected databases may be disrupted by Linked Data, whereas Linked Open Data is data that is already published on the web under an Open license.

Now, Royal Mail own the Postcode Address File, and other postcode data such as geo co-ordinates. These are covered in the UK under Copyright and Database Right (which right covers which bits is a different story), so we assume the data is “owned”. The database contains more than 28 million postcodes, so publishing my own postcode could not be considered an infringement in any meaningful way. Publishing the data for all the addresses within a single postcode would also be unlikely to infringe, as it’s such a small fraction of the total data.

So I might publish some data like this (the format is Turtle, a way to write down Linked Data):

<http://someorg.example.com/myaddress/postcode>
  a paf:Postcode;
  paf:correctForm "B37 7YB";
  paf:normalisedForm "b377yb";
  geo:long -1.717336;
  geo:lat 52.467971;
  paf:ordnanceSurveyCode "SP1930085600";
  paf:gridRefEast 41930;
  paf:gridRefNorth 28560;
  paf:containsAddress <http://someorg.example.com/myaddress/postcode#1>;
  paf:googleMaps <http://maps.google.co.uk/maps?hl=en&source=hp&q=B377YB&ie=UTF8&hq=&hnear=Birmingham,+West+Midlands+B377YB,+United+Kingdom&gl=uk&ei=Zs8HS_KVNNOe4QbIpITTCw&ved=0CAgQ8gEwAA&t=h&z=16>.

<http://someorg.example.com/myaddress/postcode#1>
  a paf:Address;
  paf:organisationName "Talis Group Ltd";
  paf:dependentThoroughfareName "Knight's Court";
  paf:thoroughfareName "Solihull Parkway";
  paf:dependentLocality "Birmingham Business Park";
  paf:postTown "Birmingham";
  paf:postcode <http://someorg.example.com/myaddress/postcode>.

I’ve probably made some mistakes in terms of the PAF properties as it’s a long time since I worked with PAF, but it’s clear enough to make my point with. So, I publish this file on my own website as a way of describing the office where I work. That’s not an infringement of any rights in the data and is a perfectly legitimate thing to do with the address.

As the web of Linked Data takes off, and the same schemas become commonly used for this kind of thing, we start to build a substantial copy of the original database. This time, however, the database is not on a single server as ErnestMarples.com was, but spread across the web of Linked Data. There is no single infringing organisation who can be made to take the data down again. If I were responsible for the revenue brought in from sales of PAF licenses this would be a concern, but not a major one, as the distributed nature means the data can’t be queried.

The distributed nature of the web means the web itself can’t be queried, but we already know how to address that technically – we build large aggregations of the web, index them and call them search engines. That is also already happening for the Linked Data web. As with the web of documents, some people are attempting to create general purpose search engines over the data and others specialised search engines for specific areas of interest. It’s easy to see that areas of value, such as address data, are likely to attract specialist attention.
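To make the aggregation idea concrete, here is a minimal Python sketch of how an aggregator might index many small, independently published documents. It assumes each harvested document has already been parsed into (subject, predicate, object) triples; the function name build_index and the second example document are hypothetical, and a real aggregator would crawl and parse the documents itself.

```python
from collections import defaultdict


def build_index(documents):
    """Map each normalised postcode string to the URIs that describe it.

    `documents` is an iterable of triple lists, one list per harvested
    document. Only paf:normalisedForm statements are indexed here.
    """
    index = defaultdict(set)
    for triples in documents:
        for subject, predicate, obj in triples:
            if predicate == "paf:normalisedForm":
                index[obj].add(subject)
    return index


# Two small documents, each published separately on its owner's own site.
harvested = [
    [
        ("http://someorg.example.com/myaddress/postcode", "paf:correctForm", "B37 7YB"),
        ("http://someorg.example.com/myaddress/postcode", "paf:normalisedForm", "b377yb"),
    ],
    [
        # Hypothetical second publisher describing the same postcode.
        ("http://otherorg.example.org/office/postcode", "paf:normalisedForm", "b377yb"),
    ],
]

index = build_index(harvested)
matches = index["b377yb"]
```

Neither document is much on its own, but once indexed together a single lookup returns every publisher describing that postcode.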

Here though, while the individual documents do not infringe, an aggregate of many of them would start to infringe. The defence of crowd-sourcing used in other contexts (such as Open Street Map) does not apply here as the PAF is not factual data – the connection between a postcode and address can only have come from one place, PAF, and is owned by Royal Mail however it got into the database.

So, with the aggregate now infringing it can be taken down through request, negotiation or due process. The obvious answer to that might be for the aggregate to hold the URIs only, not the actual values of the data. This would leave it without a useful search mechanism, however. This could be addressed by having a well-known URI structure as I used in the example data. We might have

<http://addresssearch.example.net/postcodes/B37_7YB>
  owl:sameAs <http://someorg.example.com/myaddress/postcode> .

This gets around the data issue, but the full list of postcodes itself may be considered infringing and they are clearly visible in the URIs. Taking them out of the URIs would leave no mechanism to go from a known postcode to data about it and addresses within it, the main use case for PAF. It doesn’t take long to make the link with other existing technology though, an area where we want to match a short string we know with an original short string, but cannot make the original short string available in clear text… Passwords.

Password storage uses one-way hashes so that the password is not available in its original form once stored, but requests to log in can be matched by putting the attempted password through the same hash. Hashes are commonplace in the P2P world for a variety of functions, so are well-known and could be applied by an aggregator, or co-operative group, to solve this problem.

If I push the correct form of “B37 7YB” through MD5, I get “bdd2a7bf68119d001ebfd7f86c13a4c7”, but there is no direct way to get back from that to the original postcode. So a service that uses hashed URIs would not be publishing the postcode list in a directly useable form, but could be searched easily by anyone knowing how the URIs were structured and hashed.

<http://addresssearch.example.net/postcodes/bdd2a7bf68119d001ebfd7f86c13a4c7>
  owl:sameAs <http://someorg.example.com/myaddress/postcode> .
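A lookup against hashed URIs like the one above might be sketched as follows in Python. The function names and the in-memory dictionary are hypothetical stand-ins for a real service, and the normalisation used here (upper case, single spaces) is an assumption; whatever normalisation the publishers agree on would have to be applied identically by everyone, and this one is not guaranteed to reproduce the digest quoted above.

```python
import hashlib

BASE = "http://addresssearch.example.net/postcodes/"


def hashed_postcode_uri(postcode: str) -> str:
    """Build the lookup URI for a postcode without exposing it in clear text."""
    # Assumed normalisation: upper case, runs of whitespace collapsed to one space.
    normalised = " ".join(postcode.upper().split())
    digest = hashlib.md5(normalised.encode("utf-8")).hexdigest()
    return BASE + digest


# Stand-in for the aggregator's store of owl:sameAs statements.
same_as = {
    hashed_postcode_uri("B37 7YB"): "http://someorg.example.com/myaddress/postcode",
}


def lookup(postcode: str):
    """Return the URI describing a known postcode, or None if it is not listed."""
    return same_as.get(hashed_postcode_uri(postcode))


found = lookup("b37  7yb")   # differently typed, same postcode
missing = lookup("ZZ99 9ZZ")
```

Someone who already knows a postcode can hash it and find the data, while the stored keys themselves do not show the postcodes in readable form.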

Of course, a specialist address service, advertising address lookups and making money, could still be considered as infringing by the courts regardless of the technical mechanisms employed, but what of more general aggregations or informal co-operative sites? sameAs, a site for sharing sameAs statements, already provides the infrastructure that would be needed for a service like this. The ease with which such sites can be set up and mirrored would make them hard to stop using the law, in the same way that torrent listing sites are difficult for the film and music industries to stop. Regardless of the technical approach and the degree to which it provides a legal and practical defence, this is still publishing data in a way that is against the spirit of Copyright and Database Right.

The situation I describe above is one where many, many organisations and individuals are publishing data in a consistent form. That is likely to happen over the next few years for common data like business addresses and phone numbers, but much less likely for less mainstream data. The situation with addresses is one where it is clear there is a reason to publish your individual data other than to be part of the whole. In more contrived cases, where the only reason to publish is to contribute to a larger aggregate, the notion of fair use for a small amount of the data may not stand up. That is, over the longer term, address data will not be crowd-sourced – people deliberately creating a dataset – but web-sourced – the data will be on the web anyway.

We can see from this small example that the kinds of data that may be vulnerable to distributed publishing in this way are wide-ranging. The Dewey Decimal Classification scheme used by libraries, telephone directories (with lookups in both directions), Gracenote’s music database and legal case numbering schemes could all be published this way. The problem, of course, is that the data has to be distributed sufficiently that no individual host can be seen as infringing. For common data this will happen naturally, but the co-ordination overhead for a group trying to do this pro-actively would be significant; though that might be solved easily by someone else thinking about how to do this.

As I see it a small group of unfunded individuals would have difficulty achieving the level of distribution necessary to be defensible. Though could 1% of a database be considered infringing? Could/Would 100 people use their existing domains to publish 1% of the PAF each? Would 200 join in for ½% each? Then, of course, there are the usual problems of trust, timeliness and accuracy associated with non-authoritative publication.

These problems notwithstanding, Linked Data has the potential to provide a global database at web-scale. Ways of querying that web of data will be invented; what I sketch out above is just one very basic approach. The change the web of data brings has huge implications for data owners and intellectual property rights in data.
