Semantic Web

Exploring OpenLibrary Part Two

Wednesday, October 22nd, 2008 | Internet Technical, Semantic Web, Talis Technical | 1 Comment

This post also appears on the n2 blog.

More than two weeks on from my last look at the OpenLibrary authors data and I’m finally finding some time to look a bit deeper. Last time I finished off thinking about the complete list of distinct dates within the authors file and how to model those.

Where I’ve got to today is tagged as day 2 of OpenLibrary in the n2 subversion.

First off, a correction - foaf:Name should have been foaf:name. Thanks to Leigh for pointing that out. I haven’t fixed it in this tag, which I made before I realised I’d forgotten, but next time, honestly.

It’s clear that there is some stuff in the data that simply shouldn’t be there, things that cannot possibly be a birth date such as [from old catalog] and *. and simply ,. When I came across —oOo— I was somewhat dismayed. MARC data, where most of this data has come from, has a long and illustrious history, but one of the mistakes made early on was to put display data into the records in the form of ISBD punctuation. This, combined with the real inflexibility of most ILSs and web-based catalogs, has forced libraries to hack their records with junk like —oOo— to fix display errors. This one comes from Antonio Ignacio Margariti.

In total there are only 6,156 unique birth date datums and 4,936 unique death dates. Of course there is some overlap, so in total there are only 9,566 datums to worry about overall.
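Counts like these fall out of a couple of sets. A minimal sketch in Python (not the PHP used for the real scripts), assuming one JSON object per line with optional birth_date and death_date fields; the function name and the three sample records are made up for illustration:

```python
import json

def distinct_dates(lines):
    """Return (births, deaths, combined) sets of distinct raw date strings."""
    births, deaths = set(), set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        author = json.loads(line)
        if "birth_date" in author:
            births.add(author["birth_date"])
        if "death_date" in author:
            deaths.add(author["death_date"])
    # The union counts a value once even if it is used as both a birth and a death date.
    return births, deaths, births | deaths

# Made-up sample records, not taken from the OpenLibrary dump.
sample = [
    '{"name": "A", "birth_date": "1900", "death_date": "1954"}',
    '{"name": "B", "birth_date": "1954"}',
    '{"name": "C", "birth_date": "ca. 1753"}',
]
births, deaths, combined = distinct_dates(sample)
print(len(births), len(deaths), len(combined))
```

The combined set is smaller than the sum of the other two whenever the same string turns up as both a birth and a death date, which is the overlap mentioned above.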

So what I plan to do is to set up the recognisable patterns in code and discard anything I don’t recognise as a date or date range. Doing that may mean I lose some date information, but I can add that back in later as more patterns get spotted. So far I’ve found several patterns (shown here using regex notation)…

“^[0-9]{1,4}$” - A straightforward number of 4 digits or fewer, no letters, punctuation or whitespace. These are simple years; last time I popped them in using bio:date. That’s not strictly within the rules of the bio schema, as that really requires a date formatted in accordance with ISO8601. Ian had already implied his displeasure with my use of bio:date and suggested I use the more relaxed Dublin Core elements date (dc:date). However, on further chatting, what we actually have is a date range within which the event occurred, and the model needs to say so. This can be solved using the W3C Time Ontology, which allows for a better description.

I spent some time getting hung up on exactly what is being said by these date assertions on a bio:Birth event. That is, are we saying that the birth took place somewhere within that period, or that the event happened over that period? This may seem a daft question to ask, but as others start modelling events in people’s bios the two could easily become indistinguishable. Say I want to model my grandfather’s experience of the second world war. I’d very likely model that as an event occurring over a four-year period. So, I feel the need to distinguish between an event happening over a period and an event happening at an unknown time within a period. I thought I was getting too pedantic about this, but Ian assured me I’m not and that the distinction matters.

The model we end up with is like this:


@prefix bio: <http://vocab.org/bio/0.1/> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix mine: <http://example.com/mine/schema#> .
@prefix time: <http://www.w3.org/2006/time#> .

<http://example.com/a/OL149323A>
	foaf:Name "Schaller, Heinrich";
	foaf:primaryTopicOf <http://openlibrary.org/a/OL149323A>;
	bio:event <http://example.com/a/OL149323A#birth>;
	a foaf:Person .

<http://example.com/a/OL149323A#birth>
	dc:date <http://example.com/a/OL149323A#birthDate>;
	a bio:Birth .

<http://example.com/names/schallerheinrich>
	mine:name_of <http://example.com/a/OL149323A>;
	a mine:Name .

<http://example.com/dates/gregorian/ad/years/1900>
	time:unitType time:unitYear;
	time:year "1900";
	a time:DateTimeDescription .

<http://example.com/a/OL149323A#birthDate>
	time:inDateTime <http://example.com/dates/gregorian/ad/years/1900>;
	a time:Instant .

The simple year accounts for 731,304 of the 748,291 birth dates and for 13,151 of the 181,696 death dates, about 80% of the dates overall. Following the 80/20 rule almost perfectly, the remaining 20% is going to be painful. It has been suggested I should stop here, but it seems a shame to not have access to the rest if we can dig in, and I can, so…

First of the remaining correct entries are the approximate years, recorded as ca. 1753 or (ca.) 1753 and other variants of that. These all suffer from leading and trailing junk, but I’ll catch the clean ones with “^[(]?ca\.[)]? ([0-9]{1,4})$”. The difficulty is that you can’t really convert them into a single year or even a date range, as what people consider to be within the “circa” will vary widely in different contexts. So, the interval can be described in the same way as a simple year, but the relationship with the author’s birth is not simply time:inDateTime. I haven’t found a sensible circa predicate, so for now I’ll drop into mine.


@prefix bio: <http://vocab.org/bio/0.1/> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix mine: <http://example.com/mine/schema#> .
@prefix time: <http://www.w3.org/2006/time#> .

<http://example.com/a/OL151554A>
	foaf:Name "Altdorfer, Albrecht";
	foaf:primaryTopicOf <http://openlibrary.org/a/OL151554A>;
	bio:event <http://example.com/a/OL151554A#birth>;
	bio:event <http://example.com/a/OL151554A#death>;
	a foaf:Person .

<http://example.com/a/OL151554A#birth>
	dc:date <http://example.com/a/OL151554A#birthDate>;
	a bio:Birth .

<http://example.com/a/OL151554A#death>
	dc:date <http://example.com/a/OL151554A#deathDate>;
	a bio:Death .

<http://example.com/names/altdorferalbrecht>
	mine:name_of <http://example.com/a/OL151554A>;
	a mine:Name .

<http://example.com/dates/gregorian/ad/years/1480>
	time:unitType time:unitYear;
	time:year "1480";
	a time:DateTimeDescription .

<http://example.com/a/OL151554A#birthDate>
	mine:circaDateTime <http://example.com/dates/gregorian/ad/years/1480>;
	a time:Instant .
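
The two patterns covered so far can be tried out with a few lines of Python. This is only a sketch, not the tagged PHP code: the function name and the returned labels are hypothetical, and the regular expressions are the ones quoted above. It uses fullmatch so that a trailing newline doesn't sneak past the `$`.

```python
import re

# The two patterns quoted in the post; anything else is discarded for now.
SIMPLE_YEAR = re.compile(r"^[0-9]{1,4}$")
CIRCA_YEAR = re.compile(r"^[(]?ca\.[)]? ([0-9]{1,4})$")

def classify_date(raw):
    """Return ("year", n) or ("circa", n), or None for an unrecognised value."""
    if SIMPLE_YEAR.fullmatch(raw):
        return ("year", int(raw))
    match = CIRCA_YEAR.fullmatch(raw)
    if match:
        return ("circa", int(match.group(1)))
    return None

for raw in ["1900", "ca. 1753", "(ca.) 1753", "[from old catalog]", "1753."]:
    print(repr(raw), "->", classify_date(raw))
```

Values with leading or trailing junk, such as 1753. with its trailing full stop, come back as None here, which matches the plan of discarding anything unrecognised until more patterns are added.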

Ok, it’s time to stop there until next time. I have several remaining forms to look at and some issues of data cleanup.

Next time I’ll be looking at parsing out date ranges of a few years, shown in the data as 1103 or 4. These will go in as longer date time descriptions, so no new modelling is needed.

Then we have centuries, 7th cent., again just a broader date time description required I hope. There are some entries for works from before the birth of Christ - 127 B.C.. I’ll have to take a look at how those get described. Then we have entries starting with an l like l854. I had thought that these may indicate a different calendaring system, but it appears not. Perhaps it’s bad OCRing as there are also entries like l8l4. Not sure what to do with those just yet.

In terms of data cleanup, there are dates in the birth_date field of the form d. 1823 which means that it’s actually a death date. There are also dates prefixed with fl. which means they are flourishing dates. These are used when a birth date is unknown but the period in which the creator was active is known. These need to be pulled out and handled separately.
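
Pulling those prefixed values out could look something like the following Python sketch. It assumes the prefix is always followed by a single space, and the field label "flourished" is a made-up name, as is the fl. 1650 sample value.

```python
import re

# "d." marks a death date, "fl." a flourishing date; both sit in birth_date.
PREFIXED = re.compile(r"(d|fl)\. (.+)")

def reroute_birth_date(raw):
    """Return (field, value) saying where a birth_date value really belongs."""
    match = PREFIXED.fullmatch(raw)
    if match is None:
        return ("birth_date", raw)
    field = "death_date" if match.group(1) == "d" else "flourished"
    return (field, match.group(2))

for raw in ["d. 1823", "fl. 1650", "1900"]:
    print(repr(raw), "->", reroute_birth_date(raw))
```

The value that comes back still needs to go through the same date patterns as everything else; this step only decides which event it describes.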

Of course, I haven’t dealt with the leading and trailing punctuation yet or those that have names mixed in with the dates, so still much work to do in transforming this into a rich graph.

Exploring OpenLibrary Part One

Friday, October 3rd, 2008 | Internet Technical, Semantic Web, Talis Technical | 4 Comments

This post also appears on the n2 blog.

I thought it was about time I got around to taking a better look at what might be possible with the OpenLibrary data.

My plan is to try and convert it into meaningful RDF and see what we can find out about things along the way. The project is an own-time project mostly, so progress isn’t likely to be very rapid. Let’s see how it goes. I’ll diary here as stuff gets done.

To save me typing loads of stuff out here, today’s source code is tagged and in the n2 subversion as day 1 of OpenLibrary.

Day one, 3rd October 2008, I downloaded the authors data from OpenLibrary and unzipped it. I’m also downloading the editions data from OpenLibrary, but that’s bigger (1.8Gb) so I’m playing with the author data while that comes down the tubes.

The data has been exported by OpenLibrary as JSON, so is pretty easy to work with. I’m going to write some PHP scripts on the command line to mess with it and it looks great for doing that.

Each line of the JSON in the authors file represents a single author, although some authors will have more than one entry. Taking a look at Iain Banks (aka Iain M Banks) we have the following entries:


{"name": "Banks, Iain", "personal_name": "Banks, Iain", "key": "\/a\/OL32312A", "birth_date": "1954", "type": {"key": "\/type\/type"}, "id": 81616}
{"name": "Banks, Iain.", "type": {"key": "\/type\/type"}, "id": 3011389, "key": "\/a\/OL954586A", "personal_name": "Banks, Iain."}
{"type": {"key": "\/type\/type"}, "id": 9897124, "key": "\/a\/OL2623466A", "name": "Iain Banks"}
{"type": {"key": "\/type\/type"}, "id": 9975649, "key": "\/a\/OL2645303A", "name": "Iain Banks         "}
{"type": {"key": "\/type\/type"}, "id": 10565263, "key": "\/a\/OL2774908A", "name": "IAIN M. BANKS"}
{"type": {"key": "\/type\/type"}, "id": 10626661, "key": "\/a\/OL2787336A", "name": "Iain M. Banks"}
{"type": {"key": "\/type\/type"}, "id": 12035518, "key": "\/a\/OL3127859A", "name": "Iain M Banks"}
{"type": {"key": "\/type\/type"}, "id": 12078804, "key": "\/a\/OL3137983A", "name": "Iain M Banks         "}
{"type": {"key": "\/type\/type"}, "id": 12177832, "key": "\/a\/OL3160648A", "name": "IAIN M.BANKS"}

In total the file contains 4,174,245 entries. First job is to get a more manageable set of data to work with. So, I wrote a short script to extract 1 line in every 10 from a file. The resulting sample author data file contains 417,424 entries. This is more manageable for quick testing of what I’m doing.
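
A sampler like that is only a few lines. The sketch below is Python rather than the original script, and the function name is hypothetical. Which line out of every ten the original kept isn't stated; keeping the 10th, 20th and so on gives the whole-number part of 4,174,245 / 10, which is 417,424 and so consistent with the count above.

```python
def every_nth(lines, n=10):
    """Yield every n-th item (the n-th, 2n-th, ...) from an iterable of lines."""
    for number, line in enumerate(lines, start=1):
        if number % n == 0:
            yield line

# Small in-memory demonstration with 25 made-up lines.
sample = list(every_nth(f"line {i}\n" for i in range(1, 26)))
print(len(sample), sample)
```

Because it is a generator over any iterable, the same function can be pointed at sys.stdin and its output written to sys.stdout, which keeps memory use flat for a file of this size.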

So now we can start writing some code to produce some RDF. Given the size of these files, I need to stream the data in and out again in chunks. The easiest format I find for that is turtle which has the added benefit of being human readable. YMMV. Previously I’ve streamed stuff out using n-triples. That has some great benefits too, like being able to generate different parts of the graph, for the same subject, in different parts of the file then bringing them together using a simple command line sort. It’s also a great format for chunking the resulting data into reasonable size files as breaking on whole lines doesn’t break the graph, whereas with rdf/xml and turtle it does.

So, I may end up dropping back to n-triples, but for now I’m going to use turtle.

I also like working on the command line and love the unix pipes model, so I’ll be writing the cli (command line) tools to read from STDIN and write to STDOUT so I can mess with the data using grep, sed, awk, sort, uniq and so on.

First things first, let’s find out what’s really in the authors data. Reading the json line by line and converting each line into an associative array is simple in PHP, so let’s do that, keep track of all the keys we find in the arrays and recurse into the nested arrays to look at them - then dump the result out. The arrays contain this set of keys:

alternate_names
alternate_names
alternate_names\1
alternate_names\2
alternate_names\3
bio
birth_date
comment
date
death_date
entity_type
fuller_name
id
key
location
name
numeration
personal_name
photograph
title
type
type\key
website
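
For anyone who wants to repeat that key survey without PHP, here is a rough Python equivalent. It is a sketch, not the tagged script: the function name is hypothetical, nested keys are joined with a backslash to mirror the list above, and the second sample record is invented to show how a list of alternate names would be walked.

```python
import json

def collect_keys(value, prefix="", found=None):
    """Recursively record every key path in a decoded JSON value."""
    if found is None:
        found = set()
    if isinstance(value, dict):
        items = value.items()
    elif isinstance(value, list):
        items = enumerate(value)  # list positions count as keys
    else:
        return found
    for key, child in items:
        path = f"{prefix}\\{key}" if prefix else str(key)
        found.add(path)
        collect_keys(child, path, found)
    return found

lines = [
    '{"name": "Banks, Iain", "key": "/a/OL32312A", "birth_date": "1954", "type": {"key": "/type/type"}, "id": 81616}',
    # Invented record, for illustration only.
    '{"name": "Iain M. Banks", "alternate_names": ["Iain Banks", "Iain Menzies Banks"]}',
]
keys = set()
for line in lines:
    collect_keys(json.loads(line), found=keys)
print("\n".join(sorted(keys)))
```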

So, they have names, birth dates, death dates, alternate names and a few other bits and pieces. And they have a ‘key’ which turns out to be the resource part of the OpenLibrary url. That means we can link back into OpenLibrary nice and easy. Going back to our previous Iain Banks examples, we want to create something like this for each one:


@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix bio: <http://vocab.org/bio/0.1/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<http://example.com/a/OL32312A>
	foaf:Name "Banks, Iain";
	foaf:primaryTopicOf <http://openlibrary.org/a/OL32312A>;
	bio:event <http://example.com/a/OL32312A#birth>;
	a foaf:Person .

<http://example.com/a/OL32312A#birth>
	bio:date "1954";
	a bio:Birth .
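
Generating that shape from one line of the dump might look like the Python sketch below. It is not the tagged PHP code, and the function names are hypothetical. It follows the example above with two deliberate differences: it writes foaf:name, the correct casing of the FOAF property, and foaf:isPrimaryTopicOf, which is how the FOAF specification spells the link from a thing to a page about it. It assumes every record has name and key fields.

```python
import json

PREFIXES = """@prefix bio: <http://vocab.org/bio/0.1/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
"""

def turtle_literal(text):
    """Quote a string as a Turtle literal, escaping backslashes and double quotes."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'

def author_to_turtle(line, base="http://example.com"):
    """Turn one line of the authors dump into a Turtle fragment."""
    author = json.loads(line)
    uri = base + author["key"]  # the decoded key looks like "/a/OL32312A"
    person = [
        f"<{uri}>",
        f"\tfoaf:name {turtle_literal(author['name'])};",
        f"\tfoaf:isPrimaryTopicOf <http://openlibrary.org{author['key']}>;",
    ]
    birth = []
    if "birth_date" in author:
        person.append(f"\tbio:event <{uri}#birth>;")
        birth = [
            "",
            f"<{uri}#birth>",
            f"\tbio:date {turtle_literal(author['birth_date'])};",
            "\ta bio:Birth .",
        ]
    person.append("\ta foaf:Person .")
    return "\n".join(person + birth) + "\n"

line = r'{"name": "Banks, Iain", "personal_name": "Banks, Iain", "key": "\/a\/OL32312A", "birth_date": "1954", "type": {"key": "\/type\/type"}, "id": 81616}'
print(PREFIXES)
print(author_to_turtle(line))
```

Records without a birth_date simply get the foaf:Person description and no bio:Birth event.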

This gives us a foaf:Person for the author and tracks his birth date using a bio:Birth event. While tracking the birth as a separate entity may seem odd it gives the opportunity to say things about the birth itself. We’ll model death dates the same way, for the same reason. I’ve written some basic code to generate foaf from the OpenLibrary authors.

Linking back to the OpenLibrary url has been done here using foaf:primaryTopicOf. I didn’t use owl:sameAs because the url at OpenLibrary is that of a web page, whereas the uri here (http://example.com/a/OL32312A) represents a person. Clearly a person is not the same as a web page that contains information about them.

The only thing worrying me is that the uris we’re using are constructed from OpenLibrary’s keys. This makes matching them up with other data sources hard. Matching with other data sources requires a natural key, but there’s not enough data in these author entries to create one. The best I can do is to create a natural key that will enable people to discover the group of authors that share a name.


@prefix mine: <http://example.com/mine/schema#> .
<http://example.com/names/banksiain>
	mine:name_of <http://example.com/a/OL32312A>;
	a mine:Name .

These uris will enable me to find authors that share the same name easily, either because they do share the same name or because they’re duplicates. The natural key is simply the author’s name with any casing, whitespace or punctuation stripped out. That might need to evolve as I start looking at the names in more detail later.
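
As a rough illustration of that key, here is a Python sketch. The function names are hypothetical, and treating underscores as punctuation is an assumption on top of what the post describes.

```python
import re

def name_key(name):
    """Lower-case the name and drop whitespace and punctuation."""
    return re.sub(r"[\W_]+", "", name.lower())

def name_uri(name, base="http://example.com/names/"):
    """Build the URI of the mine:Name resource for an author name."""
    return base + name_key(name)

for name in ["Banks, Iain", "Banks, Iain.", "IAIN M. BANKS", "Iain M Banks         "]:
    print(repr(name), "->", name_uri(name))
```

Note that the inverted form "Banks, Iain" and the direct form "Iain Banks" still produce different keys (banksiain and iainbanks), so this groups duplicates that differ only in case, spacing or punctuation.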

Next step is to look in more detail at the dates in here. We have some simple cases of trailing whitespace or trailing punctuation, but also some more interesting cases of approximate dates or possible ranges - these occur for historical authors mostly. The complete list of distinct dates within the authors file is in svn. If you know anything about dates, feel free to throw me some free advice on what to do with them…

Vocamp 2008 Ontologies

Monday, September 29th, 2008 | Semantic Web | No Comments

Following on from vocamp I’ve got the ontologies I’m working with straightened out and about to be published.

First of all, I’ve made changes to aiiso (the academic institution internal structures ontology) to integrate it with foaf. Basically what I’ve done is deprecate the original aiiso:organisationalUnit and subclass all of the aiiso: classes for departments, faculties etc from foaf:Organization directly. This should allow them to work with other ontologies designed to work with foaf.

Next, the organisationalUnit and knowledgeGrouping properties have been deprecated. The organisationalUnit property has been replaced by organisation and joined by an inverse property of organizationWithin which hopefully makes the intended direction clear. These are both declared with a domain and range of foaf:Organization, allowing them to be used in other places foaf is being used.

Chris Wallace and I then spent some time modelling how people participate in institutions. We concluded that we could easily make the participation vocab general enough to use anywhere. Participation specifies the way a foaf:Agent (Organization, Person or Group) participates in a foaf:Group or foaf:Organization.

Having modelled it we discovered that Knud Moeller had arrived at the same model as we had when modelling participation in conferences. The only difference we had was that we were modelling different types of role (Chair, Speaker, Chancellor) as individuals and Knud was modelling them as subclasses of a Role class. After much debate I made the decision to spec Participation to use subclassing in the same way Knud had.

Chris then kindly pulled a set of roles out of one of his existing databases and we’re publishing those under the aiiso-roles companion schema. This model means that Participation contains the details of how different things relate while different domains are free to create lists of roles that others can then re-use. Aiiso-roles is a first draft of how to do it, rather than a considered list of role titles in academia. Please suggest edits when it gets put online.

I had hoped to spend some time on the Lifecycle ontology, but with everything else being discussed it just didn’t happen. Though I did do a run-through of it with Knud as his PhD is on lifecycle of data and documents online. I hope it can be of some use to him.

The changes aren’t published yet, but will be soon, I promise.

vocabs

Monday, August 18th, 2008 | Semantic Web | No Comments

Nadeem and I have been working on several ontologies (for RDF) over the past few months and are intending to publish all of them.

The first to get published is an initial cut of AIISO (pronounced ey-s-oh), the Academic Institution Internal Structure Ontology.

We put this together really quickly to cover an internal need to document the departmental, school and faculty structures of higher education institutions. As of writing we know of two issues with it…

We named two of the predicates knowledgeGrouping and organisationalUnit after the things they link to. An OrganisationalUnit is any kind of department, faculty etc. and a KnowledgeGrouping is a collection of knowledge that gets taught, things like modules and courses. The problem with the organisationalUnit and knowledgeGrouping properties is twofold: firstly they don’t actually describe the relationship, and secondly they don’t give any indication of direction. So, if we say:

<http://broadminster.org> aiiso:organisationalUnit <http://broadminster.org/faculty-of-science> .

it’s clear to us as people that the faculty of science is a part of Broadminster, but it’s not obvious from the ontology that’s what’s going on. We plan to change that to either use our own partOf property or possibly re-use dcterms:isPartOf within aiiso.

The other thing we failed to do was to make the appropriate links with FOAF. Aiiso’s OrganisationalUnit is a specialisation of foaf:Group, so that needs to go into the ontology.

The other thing that’s come up in conversation that we’re fairly sure we’ve got right (though others disagree) is that the descriptions are somewhat self-referential. A faculty, for example, is described as follows:

A Faculty is an OrganisationalUnit that represents a group of people recognised by an OrganisationalUnit as forming a cohesive group referred to by the organisation as a faculty.

This defines the semantics of being a faculty as nothing more than ‘because that’s what you say you are’. This caused some debate internally about whether or not the semantics of being a faculty were consistent across institutions - does the term faculty have the same meaning at MIT as it does at Harvard or Virginia?

Perhaps faculty is reasonably consistent, but college and school certainly vary substantially and are re-used in areas outside of higher education. My secondary school is certainly not the same kind of thing as the Winchester School of Art, a part of the University of Southampton.

I think the way to solve this is for aiiso to become clearer about its scope, its intention to describe higher education institutions and not other parts of the academic sector. That leaves others free to define ontologies for kindergarten, high school and the rest.

Photo: MIT’s Stata Center night shot by paul+photos=moody

Reification, Triples, Quads and not getting it…

Wednesday, April 2nd, 2008 | Semantic Web | 7 Comments

I’ve been working with RDF for almost 3 years now. There’s not much evidence of that here and I was recently challenged on why that is.

In large part it’s because I don’t get it. There are a lot of things I’m still struggling with in terms of how to think about solutions when using RDF and how best to work with it. Sure, I can write SPARQL with patterns several levels deep. Sure I can work with Turtle and RDF/XML in several programming languages (Java, XSLT, PHP and sed of course). I think I even understand how to think in an open-world way.

But one big thing has bugged the hell out of me for ages and ages…

I WANT QUADS

At least, I thought I did. And I thought I was alone, but then I got this in an email from Alan Dix:

One of the LBi attendees mentioned a community site they had designed for a client that allowed users to create linkages between things on the site (e.g. song/artists) … and then annotate the links. This led to short discussion (on one of my old hobby horses) on the way RDF privileges nodes over relationships because statements of triples are not labelled (do not have URIs). While the system described would have required everything to have been reified if done using RDF technology.

This sums up one of the things I’ve been struggling with so much - that there is no way to refer to the arc between two nodes. When we describe a node we use an instance URI, we say

<http://example.com/foo> a <http://example.com/schema#thing>

but standard practice when specifying predicates is to simply use the predicate, we simply use "a" rather than:

<http://…/foo> <http://…/relns/1234> <http://…/schema#thing> .
<http://…/relns/1234> means rdf:type .

This means that while all ‘things’ have unique URIs, all type relationships use the same URI, meaning you can’t refer to the instance of a relationship directly. A URI identifying the triple would act as a surrogate, allowing you to say "The predicate on statement 97824". This is appealing as it could also act as a surrogate for the object, where the object is a literal.

I was thinking about a problem involving incrementing a value, where I was thinking in a way that led me to want an update facility like "Increment the object of statement 87642".

Now that was just plain wrong-thinking! A statement only has identity by virtue of what it says, unlike a row in an rdbms table which has identity because of its position in the table. That is, saying "increment field 3 of row 87642" makes sense, but saying "Increment the object of statement 87642" does not. It doesn’t because as soon as the object is incremented it is a different statement. So, having triple identity to allow modification of the predicate or the object is not consistent with the way RDF is.

I was thinking about a problem involving how many times a statement had been made. So, imagine a very simple tagging statement like:

<http://…/something> tags:taggedWith "Interesting" .

I was wanting to know how many times a statement had been made, so with tagging it would give you relative sizes for a tag cloud, for example.

This is a desire for a way to refer to the statement as a whole, rather than my previous wrong-thinking which was a desire to address the parts of a statement. Other common problems that I’ve come across discussing this are around provenance or audit - who said what, when; how did that statement come to be.

Whenever I tried to discuss this I would get a blanket "REIFICATION" response. I’d read the reification spec and re-read it and it took me ages to get why I kept getting pointed that way.

If a triple only has identity by virtue of what it says, and giving it identity other than that leads to the kind of wrong-thinking I described earlier, then the only way to identify a statement is by virtue of what it says - that’s all reification is.

So, if I want to know about the tagging statement earlier

DESCRIBE ?statement WHERE {
?statement a rdf:Statement .
?statement rdf:subject <http://.../something> .
?statement rdf:predicate tags:taggedWith .
?statement rdf:object "Interesting" .
}

This allows us, simply, to identify a statement purely on the basis of what it says rather than any notion of identity other than that.
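
The same idea can be shown outside any RDF store. In the Python sketch below a statement is nothing but the (subject, predicate, object) it states, so counting how often it was made needs no extra identifier. The prefixed names and tag values are placeholders, and this is an illustration of the principle rather than how a triple store implements reification.

```python
from collections import Counter

def count_statements(assertions):
    """Count how often each (subject, predicate, object) triple was asserted."""
    return Counter(assertions)

# Placeholder data: four taggings, two of which say exactly the same thing.
assertions = [
    ("ex:something", "tags:taggedWith", "Interesting"),
    ("ex:something", "tags:taggedWith", "Interesting"),
    ("ex:something", "tags:taggedWith", "Boring"),
    ("ex:other", "tags:taggedWith", "Interesting"),
]
counts = count_statements(assertions)
for statement, times in counts.items():
    print(times, statement)
```

Changing the object of one of these tuples gives a different key in the counter, which is the same point as before: a modified statement is a different statement.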

So the conclusion is, I’m wrong to want a URI for each triple and I need to fix my wrong thinking and embrace reification; just as soon as stores have real good support for it ;-)

SKOS, Linked Data and LCSH!

Wednesday, April 2nd, 2008 | Semantic Web | 6 Comments

The inimitable Ed Summers has been working inside the Library of Congress, building examples and demonstrators of how LC could be getting themselves into the semantic web, the linked-data web.

It appears he’s got fed up of waiting for the support, permission and infrastructure he so richly deserves to get this data out there and he’s been and gone and done something smart outside.

lcsh.info is now a home where you can find a copy of the Library of Congress Subject Headings available in SKOS.

This is a great piece of work and fits in perfectly with the work I’ve been doing on Semantic Marc.

After much discussion with Ed he’s provided two URI schemes: the primary scheme is based on the LC Control Number, and the second is based on the natural language term of the heading.

So, the LCSH/SKOS URIs for Beer (a subject close to my heart) are:

http://lcsh.info/label/Beer which currently redirects to http://lcsh.info/sh85012832

The concept URIs then do content negotiation to return either RDF or HTML representations.

The URIs based on the natural language term are something I’ve bent Ed’s ear about constantly, mainly because of the way they make it possible to link bibliographic data into the LCSH data without the need for a lookup, so I’m chuffed to see them. However, what I badgered Ed for was wrong.

After a long discussion with Tom Heath about stuff I now understand why my suggestion to Ed to simply redirect from the term to the LCCN based URI was wrong - using a redirect basically hides the relationship between the term and its control number from the data layer, leaving the meaning implicit in the HTTP conversation.

What Tom suggested, and I hope edsu can do, is to provide a response to the term URI that explains its relationship with the LCCN URI.

Great work Ed.
