Exploring OpenLibrary Part One
This post also appears on the n2 blog.
I thought it was about time I got around to taking a better look at what might be possible with the OpenLibrary data.
My plan is to try and convert it into meaningful RDF and see what we can find out about things along the way. The project is an own-time project mostly, so progress isn’t likely to be very rapid. Let’s see how it goes. I’ll diary here as stuff gets done.
To save me typing loads of stuff out here, today’s source code is tagged and in the n2 subversion as day 1 of OpenLibrary.
Day one, 3rd October 2008, I downloaded the authors data from OpenLibrary and unzipped it. I’m also downloading the editions data from OpenLibrary, but that’s bigger (1.8Gb) so I’m playing with the author data while that comes down the tubes.
The data has been exported by OpenLibrary as JSON, so is pretty easy to work with. I’m going to write some PHP scripts on the command line to mess with it and it looks great for doing that.
Each line of the JSON in the authors file represents a single author, although some authors will have more than one entry. Taking a look at Iain Banks (aka Iain M Banks) we have the following entries:
{"name": "Banks, Iain", "personal_name": "Banks, Iain", "key": "\/a\/OL32312A", "birth_date": "1954", "type": {"key": "\/type\/type"}, "id": 81616}
{"name": "Banks, Iain.", "type": {"key": "\/type\/type"}, "id": 3011389, "key": "\/a\/OL954586A", "personal_name": "Banks, Iain."}
{"type": {"key": "\/type\/type"}, "id": 9897124, "key": "\/a\/OL2623466A", "name": "Iain Banks"}
{"type": {"key": "\/type\/type"}, "id": 9975649, "key": "\/a\/OL2645303A", "name": "Iain Banks "}
{"type": {"key": "\/type\/type"}, "id": 10565263, "key": "\/a\/OL2774908A", "name": "IAIN M. BANKS"}
{"type": {"key": "\/type\/type"}, "id": 10626661, "key": "\/a\/OL2787336A", "name": "Iain M. Banks"}
{"type": {"key": "\/type\/type"}, "id": 12035518, "key": "\/a\/OL3127859A", "name": "Iain M Banks"}
{"type": {"key": "\/type\/type"}, "id": 12078804, "key": "\/a\/OL3137983A", "name": "Iain M Banks "}
{"type": {"key": "\/type\/type"}, "id": 12177832, "key": "\/a\/OL3160648A", "name": "IAIN M.BANKS"}
In total the file contains 4,174,245 entries. First job is to get a more manageable set of data to work with. So, I wrote a short script to extract 1 line in every 10 from a file. The resulting sample author data file contains 417,424 entries. This is more manageable for quick testing of what I’m doing.
So now we can start writing some code to produce some RDF. Given the size of these files, I need to stream the data in and out again in chunks. The easiest format I find for that is turtle which has the added benefit of being human readable. YMMV. Previously I’ve streamed stuff out using n-triples. That has some great benefits too, like being able to generate different parts of the graph, for the same subject, in different parts of the file then being them together using a simple command line sort. It’s also a great format for chunking the resulting data into reasonable size files as breaking on whole lines doesn’t break the graph, whereas with rdf/xml and turtle it does.
So, I may end up dropping back to n-triples, but for now I’m going to use turtle.
I also like working on the command line and love the unix pipes model, so I’ll be writing the cli (command line) tools to read from STDIN and write to STDOUT so I can mess with the data using grep, sed, awk, sort, uniq and so on.
First things first, Let’s find out what’s really in the authors data. Reading the json line by line and converting each line into an associative array is simple in PHP, so let’s do that, keep track of all the keys we find in the arrays and recurse into the nested arrays to look at them - then dump the result out. The arrays contain this set of keys:
alternate_names alternate_names alternate_names\1 alternate_names\2 alternate_names\3 bio birth_date comment date death_date entity_type fuller_name id key location name numeration personal_name photograph title type type\key website
So, they have names, birth dates, death dates, alternate names and a few other bits and pieces. And they have a ‘key’ which turns out to be the resource part of the OpenLibrary url. That’s means we can link back into OpenLibrary nice and easy. Going back to our previous Iain Banks examples, we want to create something like this for each one:
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix bio: <http://vocab.org/bio/0.1/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
<http://example.com/a/OL32312A>
foaf:Name "Banks, Iain";
foaf:primaryTopicOf <http://openlibrary.org/a/OL32312A>;
bio:event <http://example.com/a/OL32312A#birth>;
a foaf:Person .
<http://example.com/a/OL32312A#birth>
bio:date "1954";
a bio:Birth .
This gives us a foaf:Person for the author and tracks his birth date using a bio:Birth event. While tracking the birth as a separate entity may seem odd it gives the opportunity to say things about the birth itself. We’ll model death dates the same way, for the same reason. I’ve written some basic code to generate foaf from the OpenLibrary authors.
Linking back to the OpenLibrary url has been done here using foaf:primaryTopicOf. I didn’t use owl:sameAs because the url at OpenLibrary is that of a web page, whereas the uri here (http://example.com/a/OL32312A) represents a person. Clearly a person is not the same as a web page that contains information about them.
The only thing worrying me is that the uris we’re using are constructed from OpenLibrary’s keys. This makes matching them up with other data sources hard. Matching with other data sources requires a natural key, but there’s not enough data in these author entries to create one. The best I can do is to create a natural key that will enable people to discover the group of authors that share a name.
@prefix mine: <http://example.com/mine/schema#> .
<http://example.com/names/banksiain>
mine:name_of <http://example.com/a/OL32312A>;
a mine:Name .
These uris will enable me to find authors that share the same name easily, either because they do share the same name or because they’re duplicates. The natural key is simply the author’s name with any casing, whitespace or punctuation stripped out. That might need to evolve as I start looking at the names in more detail later.
Next step is to look in more detail at the dates in here, we have some simple cases of trailing whitespace or trailing punctuation, but also some more interesting cases of approximate dates or possible ranges - these occur for historical authors mostly. The complete list of distinct dates within the authors file is in svn. If you know anything about dates, feel free to throw me some free advice on what to do with them…
4 Comments to Exploring OpenLibrary Part One
Rob,
This compelled me to knock up a movie showing the distillation of entities from OpenLibrary information resource URLs [1], en route to producing Linked Data using proxy URI :-)
Links:
1. http://www.vimeo.com/1878243
Kingsley
October 4, 2008
That’s really great Kingsley, thanks. There are lots of really obvious ways to get great value from the OpenLibrary data - it’s good to see some of them in action.
The appropriate URIs for things in OpenLibrary are the OpenLibrary URI plus #it; e.g. an author is http://openlibrary.org/a/OL32312A#it
I would have thought that birth and death should be b-nodes, but I’m fine with using http://openlibrary.org/a/OL32312A#birth
October 28, 2008
Hey Aaron, nice of you to drop-by.
WRT bnodes, I’m trying to give everything a uri to make it easier for others to link into and/or comment on any aspect.
With the OL #it URIs, that’s great, I didn’t know you guys were doing that - I’ll work that into what I’m doing.
rob
Leave a comment
Search
Right Now (ish)
- @malcolmlandon IT DIDN'T HAPPEN TO ME! in reply to malcolmlandon 19 hrs ago
- dear twitter, how many of you have a wood burning stove (http://tinyurl.com/a64bc5)? and what do you think of it? 19 hrs ago
- @ostephens I'm the only that understands me ;-) in reply to ostephens 20 hrs ago
- More updates...
Categories
- .Net Technical
- Blog on Blog
- commands I have issued
- Enterprise Architecture
- event
- Fiction Book Review
- Food
- Interaction Design
- Internet Social Impact
- Internet Technical
- IP Law
- Library Tech
- Music
- New Toy
- Non-Fiction Book Review
- Other Technical
- Personal
- Random Thought
- Resourcing
- Security And Privacy
- Semantic Web
- Software Business
- Software Engineering
- Talis Technical
- Uncategorized
- Working at Talis
- [grid::blogpaper]
- [grid::fatherhood]
Archive
- January 2009
- December 2008
- November 2008
- October 2008
- September 2008
- August 2008
- July 2008
- June 2008
- May 2008
- April 2008
- March 2008
- January 2008
- December 2007
- November 2007
- October 2007
- July 2007
- June 2007
- May 2007
- April 2007
- March 2007
- February 2007
- January 2007
- December 2006
- November 2006
- September 2006
- August 2006
- June 2006
- February 2006
- January 2006
- December 2005
- November 2005
- September 2005
- August 2005
- July 2005
- June 2005
- May 2005
- February 2005
- January 2005
- December 2004
- November 2004
- October 2004
- September 2004
- August 2004
- July 2004
- June 2004
- May 2004
- April 2004
- March 2004
- February 2004
- December 2003
- November 2003
- August 2003
- July 2003
- June 2003
- May 2003
- March 2003
- January 2003
- May 2002
- March 2002
- August 2001
- May 2001
- April 2001
- January 2001
- December 2000
- November 2000
- December 1999
- November 1999
- July 1999
October 4, 2008