Building a simple HTTP-to-Z39.50 gateway using Yaz4j and Tomcat | Index Data
Yaz4J is a wrapper library over the client-specific parts of YAZ, a C-based Z39.50 toolkit, and allows you to use the ZOOM API directly from Java. Initial version of Yaz4j has been written by Rob Styles from Talis and the project is now developed and maintained at IndexData. ZOOM is a relatively straightforward API and with a few lines of code you can write a basic application that can establish connection to a Z39.50 server. Here we will try to build a very simple HTTP-to-Z3950 gateway using yaz4j and the Java Servlet technology.
from Building a simple HTTP-to-Z39.50 gateway using Yaz4j and Tomcat | Index Data.
I wrote Yaz4J a couple of years ago now and it’s great to see it getting some use outside of Talis.
left wondering…
He’s a nice chap, the man across from me on the train, jolly as we share a ‘What do you do?’ over the tops of our laptops. Mine a Mac with stickers on, his an old corporate HP struggling to boot.
His top button done up, tie pulled tight, pink pin-stripe running through the dark blue of his suit; me in my worn jeans.
“What do you do?” I ask. “I’m a head hunter” he replies. “Oh, what sector?” I ask. “Big industry; Power, Energy, Oil and Gas” he says, smiling.
“That must be interesting, do you do much in renewables?” I ask, trying to turn the conversation to something I’d be very interested to hear about. “Oh no, there’s nothing in renewables, it’s just a distraction” he says dismissively. He goes on… “I just finished reading a report, renewables are fine to make us look good but they can’t provide anything like enough power for the needs of somewhere like the UK. For the big companies like Shell, BP, they’re just a distraction.”
“and all this suggestion that hydrocarbons are running out isn’t true, the oil companies are happy for people to think that as it keeps the prices high, but a project I recently hired for has found millions of barrels just off Brazil. There’s plenty of it out there.”
I sit back, wondering if he has kids; if he has noticed the chaotic weather or the news; if he watched The Age of Stupid. I resist asking.
I am left saddened and wondering: do we have any chance at all?
Ground roundup of new eReaders at CES on CNN
Las Vegas, Nevada (CNN) — The first generation of electronic readers had little more than black-and-white text. The second generation had black-and-white text, simple graphics and Web connectivity.
Glimpses of the third generation are on display this week at the International Consumer Electronics Show, where manufacturers are previewing e-readers with color screens, interactive graphics and magazine-style layouts.
Hacking Into Your Account is as Easy as 123456
in reality, it’s as easy as “123456”. And if that doesn’t work, we’d suggest trying “12345”, next.
The 18 Mistakes That Kill Startups
when I think about what killed most of the startups in the e-commerce business back in the 90s, it was bad programmers. A lot of those companies were started by business guys who thought the way startups worked was that you had some clever idea and then hired programmers to implement it. That’s actually much harder than it sounds—almost impossibly hard in fact—because business guys can’t tell which are the good programmers. They don’t even get a shot at the best ones, because no one really good wants a job implementing the vision of a business guy.
from The 18 Mistakes That Kill Startups by Paul Graham.
QOOQ – Le premier coach culinaire tactile
Tablets and multi-touch hardware are becoming more mainstream, and the release of Windows 7 will drive yet more. There hasn’t been much in the way of product design going into the tablets I’ve seen so far, which is why so many people keep hoping for an Apple tablet.
French company Unowhy are taking a different approach though, releasing a tablet targeted at the kitchen, with the sale driven by content – recipes and training videos from top French chefs.
QOOQ – Le premier coach culinaire tactile.
The physical design looks really good. If the price is right I could see these selling well and possibly prompting a targeted Linux distro for it.
Pranav Mistry: The thrilling potential of SixthSense technology | Video on TED.com
At TEDIndia, Pranav Mistry demos several tools that help the physical world interact with the world of data — including a deep look at his SixthSense device and a new, paradigm-shifting paper “laptop.” In an onstage Q&A, Mistry says he’ll open-source the software behind SixthSense, to open its possibilities to all.
from Pranav Mistry: The thrilling potential of SixthSense technology | Video on TED.com.
101 Things I Learned in Interaction Design School
A set of short, easily digested learnings from the world of Interaction Design, inspired by “101 Things I Learned in Architecture School”, by Matthew Frederick
Distributed, Linked Data has significant implications for Intellectual Property Rights in Data.
What P2P networks have done for distribution of digital media is phenomenal. It is possible, easy even, to get almost any TV show, movie, track or album you can think of by searching one of the many torrent sites. As fast as the media industry take down one site through legal action another has appeared to take its place.
I don’t want to discuss the legal, moral or social implications of this, but discuss how the internet changes the nature of our relationship with media – and data. The internet is a great big copying machine, true enough, but it’s also a fabric that allows mass co-operation. It’s that mass peer-to-peer co-operation that makes so much content available for free; content that is published freely by its creator as well as infringing content.
Sharing of copyrighted content is always likely to be infringing on p2p networks, regardless of any tricks employed, but for data the situation may be different and the Linked Data web has real implications in this space.
Taking the Royal Mail’s Postcode Address File as my working example, because it’s been in the news recently as a result of the work done by ErnestMarples.com, I’ll attempt to show how the Linked Data web changes the nature of data publishing and intellectual property.
First, in case you’re not familiar, a quick introduction to Linked Data. In Linked Data we use http web addresses (which we call URIs) not only to refer to documents containing data but also to refer to real-world things and abstract concepts. We then combine those URIs with properties and values to make statements about the things the URIs represent. So, I might say that my postcode is referred to by the URI http://someorg.example.com/myaddress/postcode. Asking for that URI in the browser would then redirect you to a document containing data about my postcode, maybe sending you to http://someorg.example.com/myaddress/postcode.rdf if you asked for data and http://someorg.example.com/myaddress/postcode.html if you asked for a web page (that’s called content negotiation). All of that works today and organisations like the UK Government, BBC, New York Times and others are publishing data this way.
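The redirect step described above can be sketched in a few lines. This is only an illustration: the mapping of `application/rdf+xml` to the `.rdf` document, the fallback to HTML, and the function names are my assumptions, and a real server would normally answer with an HTTP redirect (commonly a 303) rather than just return a string.

```python
# Hypothetical sketch of content negotiation for a "thing" URI.
# The media-type mapping and the HTML fallback are assumptions.

THING_URI = "http://someorg.example.com/myaddress/postcode"

# Media type -> suffix of the document that describes the thing.
REPRESENTATIONS = {
    "application/rdf+xml": ".rdf",
    "text/html": ".html",
}


def parse_accept(header):
    """Return the media types in an Accept header, most preferred first."""
    entries = []
    for position, part in enumerate(header.split(",")):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        quality = 1.0
        for field in fields[1:]:
            name, _, value = field.partition("=")
            if name.strip() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        if quality > 0:
            # Sort by descending quality, then by order of appearance.
            entries.append((-quality, position, media_type))
    return [media_type for _, _, media_type in sorted(entries)]


def redirect_target(accept_header, default_suffix=".html"):
    """Pick the document URI a server might redirect the client to."""
    for media_type in parse_accept(accept_header):
        if media_type in REPRESENTATIONS:
            return THING_URI + REPRESENTATIONS[media_type]
    return THING_URI + default_suffix
```

Asking with `Accept: application/rdf+xml` would give the `.rdf` document, while a typical browser header that leads with `text/html` would give the `.html` page.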
Also worth noting is the distinction between Linked Data (the technique described above) and Linked Open Data, the output of the W3C’s Linking Open Data project. An important distinction as I’m talking about how commercially owned and protected databases may be disrupted by Linked Data, whereas Linked Open Data is data that is already published on the web under an Open license.
Now, Royal Mail own the Postcode Address File, and other postcode data such as geo co-ordinates. These are covered in the UK by Copyright and Database Right (which right covers which bits is a different story), so we assume the data is “owned”. The database contains more than 28 million addresses, so publishing my own postcode could not be considered an infringement in any meaningful way. Publishing the data for all the addresses within a single postcode would also be unlikely to infringe, as it’s such a small fraction of the total data.
So I might publish some data like this (the format is Turtle, a way to write down Linked Data):

<http://someorg.example.com/myaddress/postcode>
    a paf:Postcode;
    paf:correctForm "B37 7YB";
    paf:normalisedForm "b377yb";
    geo:long -1.717336;
    geo:lat 52.467971;
    paf:ordnanceSurveyCode "SP1930085600";
    paf:gridRefEast 41930;
    paf:gridRefNorth 28560;
    paf:containsAddress <http://someorg.example.com/myaddress/postcode#1>;
    paf:googleMaps <http://maps.google.co.uk/maps?hl=en&source=hp&q=B377YB&ie=UTF8&hq=&hnear=Birmingham,+West+Midlands+B377YB,+United+Kingdom&gl=uk&ei=Zs8HS_KVNNOe4QbIpITTCw&ved=0CAgQ8gEwAA&t=h&z=16>.

<http://someorg.example.com/myaddress/postcode#1>
    a paf:Address;
    paf:organisationName "Talis Group Ltd";
    paf:dependentThoroughfareName "Knight's Court";
    paf:thoroughfareName "Solihull Parkway";
    paf:dependentLocality "Birmingham Business Park";
    paf:postTown "Birmingham";
    paf:postcode <http://someorg.example.com/myaddress/postcode>.
I’ve probably made some mistakes in terms of the PAF properties as it’s a long time since I worked with PAF, but it’s clear enough to make my point with. So, I publish this file on my own website as a way of describing the office where I work. That’s not an infringement of any rights in the data and is a perfectly legitimate thing to do with the address.
As the web of Linked Data takes off, and the same schemas become commonly used for this kind of thing, we start to build a substantial copy of the original database. This time, however, the database is not on a single server as ErnestMarples.com was, but spread across the web of Linked Data. There is no single infringing organisation who can be made to take the data down again. If I were responsible for the revenue brought in from sales of PAF licenses this would be a concern, but not a major one, as the distributed nature means the data can’t be queried.
The distributed nature of the web means the web itself can’t be queried, but we already know how to address that technically – we build large aggregations of the web, index them and call them search engines. That is also already happening for the Linked Data web. As with the web of documents, some people are attempting to create general purpose search engines over the data and others specialised search engines for specific areas of interest. It’s easy to see that areas of value, such as address data, are likely to attract specialist attention.
Here though, while the individual documents do not infringe, an aggregate of many of them would start to infringe. The defence of crowd-sourcing used in other contexts (such as Open Street Map) does not apply here as the PAF is not factual data – the connection between a postcode and address can only have come from one place, PAF, and is owned by Royal Mail however it got into the database.
So, with the aggregate now infringing it can be taken down through request, negotiation or due process. The obvious answer to that might be for the aggregate to hold the URIs only, not the actual values of the data. This would leave it without a useful search mechanism, however. This could be addressed by having a well-known URI structure as I used in the example data. We might have
<http://addresssearch.example.net/postcodes/B37_7YB> owl:sameAs <http://someorg.example.com/myaddress/postcode>
This gets around the data issue, but the full list of postcodes itself may be considered infringing and they are clearly visible in the URIs. Taking them out of the URIs would leave no mechanism to go from a known postcode to data about it and addresses within it, the main use case for PAF. It doesn’t take long to make the link with other existing technology though, an area where we want to match a short string we know with an original short string, but cannot make the original short string available in clear text… Passwords.
Password storage uses one-way hashes so that the password is not available in its original form once stored, but requests to login can be matched by putting the attempted password through the same hash. Hashes are commonplace in the P2P world for a variety of functions, so are well-known and could be applied by an aggregator, or co-operative group, to solve this problem.
If I push the correct form of “B37 7YB” through MD5, I get “bdd2a7bf68119d001ebfd7f86c13a4c7”, and there is no direct way to get back from that to the original postcode. So a service that uses hashed URIs would not be publishing the postcode list in a directly useable form, but could be searched easily by anyone knowing how the URIs were structured and hashed.
<http://addresssearch.example.net/postcodes/bdd2a7bf68119d001ebfd7f86c13a4c7> owl:sameAs <http://someorg.example.com/myaddress/postcode>
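As a rough sketch of how such an aggregate might work, the snippet below builds both the clear-text and the hashed URI for a postcode and looks a postcode up in a tiny in-memory set of sameAs links. The function names, the normalisation rule (upper case, single space, UTF-8) and the dictionary index are all assumptions of mine, so the digest it produces is not guaranteed to match the one quoted above.

```python
import hashlib

# Base URI taken from the example statements; everything else here
# (function names, the in-memory index) is hypothetical.
BASE = "http://addresssearch.example.net/postcodes/"


def plain_uri(postcode):
    """Well-known URI with the postcode visible, e.g. .../postcodes/B37_7YB."""
    correct_form = " ".join(postcode.upper().split())
    return BASE + correct_form.replace(" ", "_")


def hashed_uri(postcode):
    """Same scheme, but with the MD5 hex digest in place of the postcode.

    Assumption: the "correct form" is upper case with a single space and is
    hashed as UTF-8 bytes. An aggregator would have to publish its own rule.
    """
    correct_form = " ".join(postcode.upper().split())
    digest = hashlib.md5(correct_form.encode("utf-8")).hexdigest()
    return BASE + digest


# The aggregate holds only owl:sameAs links, keyed by hashed URI.
same_as = {
    hashed_uri("B37 7YB"): "http://someorg.example.com/myaddress/postcode",
}


def lookup(postcode):
    """Return the publisher's URI for a known postcode, or None."""
    return same_as.get(hashed_uri(postcode))


print(plain_uri("B37 7YB"))
print(lookup("b37 7yb"))
```

Anyone who knows the rule can go from a postcode to the publisher's data, but the index itself lists no postcodes in clear text. Note that this only obscures the list: the set of valid postcode patterns is small enough that someone could hash every candidate and compare.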
Of course, a specialist address service, advertising address lookups and making money, could still be considered as infringing by the courts regardless of the technical mechanisms employed, but what of more general aggregations or informal co-operative sites? sameAs, a site for sharing sameAs statements, already provides the infrastructure that would be needed for a service like this. The ease with which such sites can be set up and mirrored would make them hard to defend against using the law, in the same way that torrent listing sites are difficult for the film and music industries to stop. Regardless of the technical approach and the degree to which it provides legal and practical defence, this is still publishing data in a way that is against the spirit of Copyright and Database Right.
The situation I describe above is one where many, many organisations and individuals are publishing data in a consistent form. That is likely to happen over the next few years for common data like business addresses and phone numbers, but much less likely for less mainstream data. With addresses it is clear there is a reason to publish your individual data other than to be part of the whole; in more contrived cases, where the only reason to publish is to contribute to a larger aggregate, the notion of fair use for a small amount of the data may not stand up. That is, over the longer term, address data will not be crowd-sourced – people deliberately creating a dataset – but web-sourced – the data will be on the web anyway.
We can see from this small example that the kinds of data that may be vulnerable to distributed publishing in this way are wide-ranging. The Dewey Decimal Classification scheme used by libraries, telephone directories (with lookups in both directions), Gracenote’s music database and legal case numbering schemes could all be published this way. The problem, of course, is that the data has to be distributed sufficiently that no individual host can be seen as infringing. For common data this will happen naturally, but the co-ordination overhead for a group trying to do this pro-actively would be significant; though that might be solved easily by someone else thinking about how to do this.
As I see it a small group of unfunded individuals would have difficulty achieving the level of distribution necessary to be defensible. Though could 1% of a database be considered infringing? Could/Would 100 people use their existing domains to publish 1% of the PAF each? Would 200 join in for ½% each? Then, of course, there are the usual problems of trust, timeliness and accuracy associated with non-authoritative publication.
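To put rough numbers on those questions, here is a small calculation. It assumes a round 28 million entries, based on the "more than 28 million" figure quoted earlier, and an even split between publishers; both are simplifications.

```python
# Assumed size, rounded from the "more than 28 million" figure quoted earlier.
TOTAL_ENTRIES = 28_000_000


def entries_per_publisher(publishers, total=TOTAL_ENTRIES):
    """Entries each publisher hosts if the file is split evenly (rounded up)."""
    return -(-total // publishers)


for publishers in (100, 200):
    share = 100 / publishers
    count = entries_per_publisher(publishers)
    print(f"{publishers} publishers: {share:g}% each, about {count:,} entries")
```

Under those assumptions, 100 publishers would each host about 280,000 entries and 200 publishers about 140,000 each, which is still a large dataset for any one host.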
These problems notwithstanding, Linked Data has the potential to provide a global database at web-scale. Ways of querying that web of data will be invented; what I sketch out above is just one very basic approach. The change the web of data brings has huge implications for data owners and intellectual property rights in data.