Orecchiette with Broccoli and Anchovies

I got a chain email forwarded a week or so ago that I came oh so close to actually joining in with… Then I remembered that no good can ever come of an email chain letter, no matter how good the idea.

So, I’ve decided to start an equivalent game of blog tag. The topic? One of my favourites – FOOD!

I’m going to share with you one of my all time favourite, quick, easy week-day recipes and then tag five others in the hope they’ll do the same.

Orecchiette with Broccoli and Anchovies (serves 2)

 

First things first – please don’t be afraid of the anchovies. The anchovies in here are in oil, not in salt, which means they’re not like the dry, salty crap you get on pizzas. In this recipe they work with the chilli to create a wonderfully warm and comforting meal. There’s a heat and depth to the flavours that just makes you feel better about a bad day from the first mouthful to the last.

Ingredients

  • 250g Orecchiette pasta (or another short pasta, but the Orecchiette is best)
  • 1 medium head of broccoli, with a good main stalk
  • 2 garlic cloves
  • 50g can of anchovies in oil
  • A large pinch of dried chilli flakes
  • A substantial quantity of Parmesan, grated
  • A knob of butter

Method

  • Put a large pan of water on to boil; you need a pan large enough for the pasta and the broccoli florets once chopped.
  • Drain a generous tablespoon of the oil from the anchovies into a large frying pan and set on a medium-low heat.
  • Chop the broccoli into small (one mouthful) florets and set aside.
  • Peel or wash the stalk, trim off any dry woody bits and finely chop.
  • Finely chop the garlic.
  • Roughly chop the anchovies (yes, all of them, honestly, trust me).
  • Set the pasta cooking, and set a timer for three minutes less than the cooking time.
  • Put the chopped broccoli stalks, garlic, anchovies and chilli in the frying pan.
  • Fry the anchovy and broccoli mix gently, stirring occasionally to stop it sticking. If it starts to dry out, turn the heat to low and cover.
  • Three minutes before the end of the pasta’s cooking time throw in the broccoli florets to cook.
  • Once cooked, drain the pasta and broccoli, holding back a little of the water, and stir them into the anchovy and broccoli mix. Stir through some of the grated parmesan, drop in the knob of butter and spoon over a few tablespoons of the pasta water (you did remember to catch some, didn’t you?).
  • Cover and leave for a minute.
  • Serve with more of the parmesan over the top and a good twist of black pepper.

Drink

A deep red wine goes well with this, something with some fruit to it. Maybe a Merlot, Cabernet Sauvignon or Pinotage.

Tagging (as in the children’s game, aka tig in the UK)

Would the following people please step up and deliver a recipe and tag five further food-interested people:

  • mauvedeity Because I bet the heathen eats some really tasty stuff.
  • nadeem.shabir Because he keeps promising to share family recipes (but failing to deliver).
  • Sarah B. Because she’s a real foodie (hasn’t got kids, so has time and money to cook proper).
  • Ross Because an American perspective would be nice, maybe a nice pumpkin pie ;-).
  • Zach Because he lives in the countryside and might suggest something good to do with wild rabbit.

Please remember to comment/trackback here so we can follow the thread and to tag a further five victims 🙂

Photo: Orecchiette with Broccoli and Anchovies by su-lin on Flickr, licensed under Creative Commons Attribution-Noncommercial-No Derivative Works 2.0 Generic

Betamax, VHS and RDF

I was chatting to a guy a few weeks ago, a Technical Account Manager at a reasonably good consultancy. We got chatting as we’re both “in IT”. I don’t actually consider myself to be “in IT” but that’s another story.

The conversation was somewhat one-sided, with this chap, let’s call him Harry, wanting to tell me all about what he does and his illustrious career with a wide range of technologies. He wasn’t interested in what I did, so I listened.

Harry explained how the consultancy he works for is doing pretty well, despite the economic situation. His group, a team of technology specialists, were not doing so well, however. Harry doesn’t understand why and we quickly moved on.

From not doing well Harry went on to detail his incredible career in technology: putting in DEC equipment in the mid 80s (when everyone else was putting in PCs), networking several companies with Token Ring (in the late 90s, when everyone else was putting in Ethernet), setting up large internal data centres based on Novell and/or IBM OS/2 (when everyone else was putting in Windows). Harry had even thrown out early copies of Microsoft Office in one company to put in Lotus 1-2-3 and AmiPro. Great decisions, choosing best-of-breed solutions from great suppliers.

The consistency of these “wrong” decisions seemed to have passed Harry by as he was saying how all of these technologies were “the best”, but were subsequently beaten in the marketplace by inferior products. I suspect Harry still has a Betamax video recorder tucked away somewhere.

What’s common across all of the products that succeeded is that they are superior in some way that the market defines, not in the way that Harry defined. They were successful in many respects simply because they were successful. That is, success begets success.

Many people are highly skeptical about the Semantic Web, and RDF in particular, but in large part it seems to be in roughly the state the web was in during the very early 90s. One of the browsers (Tabulator) is something that Tim Berners-Lee has written and is touting around as an example of what could be done, sites on the Semantic Web can still (just) be drawn on a single slide, and lots of people are still looking at RDF and saying “it won’t work”.

But all of that misses the point. It will be successful if we make it successful. That is, it lives and dies not by how it compares to other approaches of representing data, but by how many people publish stuff this way.

Exploring OpenLibrary Part Two

This post also appears on the n2 blog.

More than two weeks on from my last look at the OpenLibrary authors data and I’m finally finding some time to look a bit deeper. Last time I finished off thinking about the complete list of distinct dates within the authors file and how to model those.

Where I’ve got to today is tagged as day 2 of OpenLibrary in the n2 subversion.

First off, a correction – foaf:Name should have been foaf:name. Thanks to Leigh for pointing that out. I haven’t fixed it in this tag, which was tagged before I realised I’d forgotten it, but next time, honestly.

It’s clear that there is some stuff in the data that simply shouldn’t be there, things that cannot possibly be a birth date such as [from old catalog] and *. and simply ,. When I came across —oOo— I was somewhat dismayed. MARC data, where most of this data has come from, has a long and illustrious history, but one of the mistakes made early on was to put display data into the records in the form of ISBD punctuation. This, combined with the real inflexibility of most ILSs and web-based catalogs, has forced libraries to hack their records with junk like —oOo— to fix display errors. This one comes from Antonio Ignacio Margariti.

In total there are only 6,156 unique birth date values and 4,936 unique death date values. Of course there is some overlap, so there are only 9,566 distinct values to worry about overall.

So what I plan to do is to set up the recognisable patterns in code and discard anything I don’t recognise as a date or date range. Doing that may mean I lose some date information, but I can add that back in later as more patterns get spotted. So far I’ve found several patterns (shown here using regex notation)…

“^[0-9]{1,4}$” – A straightforward number of 4 digits or fewer, with no letters, punctuation or whitespace. These are simple years; last week I popped them in using bio:date. That’s not strictly within the rules of the bio schema, as that really requires a date formatted in accordance with ISO 8601. Ian had already implied his displeasure with my use of bio:date and suggested I use the more relaxed Dublin Core elements date. However, on further chatting, what we actually have is a date range within which the event occurred, so we need to show that the event happened somewhere within that range. This can be solved using the W3C Time Ontology, which allows for a better description.
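As a rough idea of what this pattern matching might look like in the command-line PHP scripts from part one, here’s a minimal, illustrative classifier; the function name and the shape of the return value are my own assumptions rather than the code that’s actually in the n2 subversion.

<?php
// Sketch (illustrative, not the n2 code): classify a raw date string,
// keeping only patterns we recognise; anything else is discarded for
// now and can be added back as more patterns get spotted.
function classify_date($raw) {
    $value = trim($raw);

    // A straightforward number of 4 digits or fewer: a simple year.
    if (preg_match('/^[0-9]{1,4}$/', $value)) {
        return array('type' => 'year', 'year' => (int) $value);
    }

    // Not a pattern we recognise yet.
    return null;
}

var_dump(classify_date('1900'));    // array('type' => 'year', 'year' => 1900)
var_dump(classify_date('—oOo—'));   // NULL (junk gets dropped)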

I spent some time getting hung up on exactly what is being said by these date assertions on a bio:Birth event. That is, are we saying that the birth took place somewhere within that period, or that the event happened over that period? This may seem a daft question to ask, but as others start modelling events in people’s bios the two could easily become indistinguishable. Say I want to model my grandfather’s experience of the Second World War. I’d very likely model that as an event occurring over a four-year period. So, I feel the need to distinguish between an event happening over a period and an event happening at an unknown time within a period. I thought I was getting too pedantic about this, but Ian assured me I’m not and that the distinction matters.

The model we end up with looks like this:


@prefix bio: <http://vocab.org/bio/0.1/> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix mine: <http://example.com/mine/schema#> .
@prefix time: <http://www.w3.org/TR/owl-time/> .

<http://example.com/a/OL149323A>
	foaf:Name "Schaller, Heinrich";
	foaf:primaryTopicOf <http://openlibrary.org/a/OL149323A>;
	bio:event <http://example.com/a/OL149323A#birth>;
	a foaf:Person .

<http://example.com/a/OL149323A#birth>
	dc:date <http://example.com/a/OL149323A#birthDate>;
	a bio:Birth .

<http://example.com/names/schallerheinrich>
	mine:name_of <http://example.com/a/OL149323A>;
	a mine:Name .

<http://example.com/dates/gregorian/ad/years/1900>
	time:unitType time:unitYear;
	time:year "1900";
	a time:DateTimeDescription .

<http://example.com/a/OL149323A#birthDate>
	time:inDateTime <http://example.com/dates/gregorian/ad/years/1900>;
	a time:Instant .

The simple year accounts for 731,304 of the 748,291 birth dates and for 13,151 of the 181,696 death dates, about 80% of the dates overall. Following the 80/20 rule almost perfectly, the remaining 20% is going to be painful. It has been suggested I should stop here, but it seems a shame to not have access to the rest if we can dig in, and I can, so…

First of the remaining correct entries are the approximate years, recorded as ca. 1753 or (ca.) 1753 and other variants of that. These all suffer from leading and trailing junk, but I’ll catch the clean ones with “^[(]?ca\.[)]? ([0-9]{1,4})$”. The difficulty is that you can’t really convert these into a single year or even a date range, as what people consider to be within the “circa” will vary widely in different contexts. So, the interval can be described in the same way as a simple year, but the relationship with the author’s birth is not simply time:inDateTime. I haven’t found a sensible circa predicate, so for now I’ll drop into the mine: namespace.


@prefix bio: <http://vocab.org/bio/0.1/> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix mine: <http://example.com/mine/schema#> .
@prefix time: <http://www.w3.org/TR/owl-time/> .

<http://example.com/a/OL151554A>
	foaf:Name "Altdorfer, Albrecht";
	foaf:primaryTopicOf <http://openlibrary.org/a/OL151554A>;
	bio:event <http://example.com/a/OL151554A#birth>;
	bio:event <http://example.com/a/OL151554A#death>;
	a foaf:Person .

<http://example.com/a/OL151554A#birth>
	dc:date <http://example.com/a/OL151554A#birthDate>;
	a bio:Birth .

<http://example.com/a/OL151554A#death>
	dc:date <http://example.com/a/OL151554A#deathDate>;
	a bio:Death .

<http://example.com/names/altdorferalbrecht>
	mine:name_of <http://example.com/a/OL151554A>;
	a mine:Name .

<http://example.com/dates/gregorian/ad/years/1480>
	time:unitType time:unitYear;
	time:year "1480";
	a time:DateTimeDescription .

<http://example.com/a/OL151554A#birthDate>
	mine:circaDateTime <http://example.com/dates/gregorian/ad/years/1480>;
	a time:Instant .
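To complement the circa pattern described above, here’s a similarly hedged sketch of pulling those values out in PHP; again the helper is illustrative only, and the ‘circa’ flag is simply something the later Turtle generation could use to choose mine:circaDateTime instead of time:inDateTime.

<?php
// Sketch (illustrative only): pick out "circa" years such as
// "ca. 1753" or "(ca.) 1753" using the pattern described above.
function classify_circa($raw) {
    $value = trim($raw);
    if (preg_match('/^[(]?ca\.[)]? ([0-9]{1,4})$/', $value, $matches)) {
        // Same year description as a simple year, but flagged so the
        // output can use mine:circaDateTime rather than time:inDateTime.
        return array('type' => 'circa', 'year' => (int) $matches[1]);
    }
    return null;
}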

Ok, it’s time to stop there until next time. I have several remaining forms to look at and some issues of data cleanup.

Next time I’ll be looking at parsing out date ranges of a few years, shown in the data as 1103 or 4. These will go in as longer date time descriptions, so no new modelling is needed.

Then we have centuries, 7th cent.; again, I hope, just a broader date time description is required. There are some entries for works from before the birth of Christ – 127 B.C. – and I’ll have to take a look at how those get described. Then we have entries starting with an l, like l854. I had thought that these might indicate a different calendaring system, but it appears not. Perhaps it’s bad OCRing, as there are also entries like l8l4. I’m not sure what to do with those just yet.

In terms of data cleanup, there are dates in the birth_date field of the form d. 1823, which means it’s actually a death date. There are also dates prefixed with fl., which means they are flourishing (floruit) dates. These are used when a birth date is unknown but the period in which the creator was active is known. These need to be pulled out and handled separately.
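As a hedged sketch of how that cleanup might be expressed in PHP; the field names in the return value are my own, for illustration only:

<?php
// Sketch (illustrative only): reclassify birth_date values that aren't
// really birth dates. "d. 1823" is actually a death date; "fl. ..." is
// a flourishing (floruit) date and needs handling separately.
function reclassify_birth_date($raw) {
    $value = trim($raw);

    if (preg_match('/^d\.\s*(.+)$/', $value, $m)) {
        return array('field' => 'death_date', 'value' => $m[1]);
    }

    if (preg_match('/^fl\.\s*(.+)$/', $value, $m)) {
        return array('field' => 'flourished', 'value' => $m[1]);
    }

    return array('field' => 'birth_date', 'value' => $value);
}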

Of course, I haven’t dealt with the leading and trailing punctuation yet or those that have names mixed in with the dates, so still much work to do in transforming this into a rich graph.

Is Effort a Myth?

I hadn’t bothered to read Seth’s Blog for a while, not sure why other than having 26,132 other unread posts in the feed reader.

But, scratching around for new and interesting things last night, I spotted his post asking Is effort a myth? and was intrigued. Of course, I was hoping, as you secretly are now as you click on the link, that it turns out it is a myth and that you can make loads of money, be hugely popular and find deep and long-lasting love without having to do anything at all.

Unfortunately, it turns out you can’t.

Arse.

In the absence of an effortless way to achieve your every desire, Seth suggests:

1. Delete 120 minutes a day of ‘spare time’ from your life. This can include TV, reading the newspaper, commuting, wasting time in social networks and meetings. Up to you.

2. Spend the 120 minutes doing this instead:

    * Exercise for thirty minutes.
    * Read relevant non-fiction (trade magazines, journals, business books, blogs, etc.)
    * Send three thank you notes.
    * Learn new digital techniques (spreadsheet macros, Firefox shortcuts, productivity tools, graphic design, html coding)
    * Volunteer.
    * Blog for five minutes about something you learned.
    * Give a speech once a month about something you don’t currently know a lot about.

3. Spend at least one weekend day doing absolutely nothing but being with people you love.

4. Only spend money, for one year, on things you absolutely need to get by. Save the rest, relentlessly.

These are great ideas and several folks at work do this kind of personal development routinely, others don’t. Others are just plain lucky 😉


Designing the Moment

Robert Hoekman Jr’s sequel to Designing the Obvious, Designing the Moment, presents more insight into the steps that Hoekman uses to evolve designs from something difficult and obtuse to something that is foolproof (poka-yoke) and a pleasure to use.

The examples used are different from those in Designing the Obvious, and so is the justification: in this case, creating a consistent sense of the application being pleasurable to use, complementing his previous observations that obvious systems are more productive and make more money.

Despite these differences we have two books that could easily have been one. The style of writing is the same, the production is identical and the techniques overlap substantially. Even so, both books are well worth reading, mainly because the value comes mostly from the examples and the practical application of techniques such as tabs, concertina interfaces and progressive disclosure.

This is firmly a practitioner’s book, as is its predecessor. The techniques are explained in ways that you could take and apply to what you’re doing right now. What helps is Hoekman’s clear, evidence-based engineering approach: while there is some creativity in the visual aspects, he applies the various interaction design techniques as a science, not an art. This is key to the approach being repeatably successful.

For those working in libraries, especially anyone developing public search interfaces, there’s a great chapter on advanced search in which Hoekman explains both what is wrong with current approaches to advanced search and the thinking he went through to produce an alternative. It takes all of four pages to explain, including several large pictures, so there’s really no excuse for still having poor advanced search functionality.

Don’t buy this book expecting something different from the first; it’s more of the same – but still well worth reading.

four people doesn't make a meme you soft gits

So, I wrote about a great keynote by Gary Vaynerchuk at Web 2.0 Expo in New York, saying that it was a greatly inspiring piece and well worth watching. It contains such great quotes as

There is no reason in 2008 to do shit you hate, ‘cos you can lose just as much money being happy as hell.

Nad decided to wade in with one of his usual mega-soft, wishy-washy, group-hug, let’s-all-live-in-a-fucking-commune posts (NB: I love the way Nad has the courage to open up about stuff and publish poetry on his blog – maybe I should lighten up a bit more too). Nad adds a great reference to Paul Graham’s How to Do What You Love to the discussion, and a couple of choice quotes.

Rhys then decided to announce that “Do what you love” had achieved meme status. Perhaps a little early. <sarcasm>While clearly I am an international thought leader on many topics</sarcasm>, if one swallow doesn’t make a summer then three posts certainly don’t constitute a meme.

But then… the venerable (yet completely unqualified, apparently) Danny Ayers joins in, also titling his post with the meme word. Danny leaves me unsure of whether he is agreeing or disagreeing with the premise, describing how he is simply compelled to do the things he does.

So on paper I’m not remarkably insane (at worst alcoholic with elective bipolarity, teensy bit Aspergian maybe), but I do things because they need to be done, IMHO. Might be slow at getting them done, but the compulsion’s there. No grand plan either. I can explain why I think the Semantic Web can help mankind save itself, but that’s not the motivation. I do this, I do that, almost hand to mouth but on projects…that last years. Love never really comes into it, just some arbitrary compulsion (it certainly don’t get you laid).

The alcoholism, bi-polar issues and Aspergers I can certainly relate to, but putting that aside…

This compulsion sounds a lot like love to me, not necessarily easy, not necessarily something you can explain, but definitely a driving force that leads you to do stuff – often stuff you wouldn’t otherwise have dreamed of doing.

Anyway, enough of that soft drivel. Stop navel gazing and get on with some work. Let’s not just do what we love, let’s build a business where people come to work because what the business is doing is what they’re passionate about. That’s what our library division has been when at its very best; that’s what Ian’s done with our platform division and that’s what we need to do with our new division too.

To achieve that requires a deep understanding of what it is we believe in. A few months back I read a great book that is intended to help people get to grips with just that.

Authentic Business by Neil Crofts describes how businesses can run more sustainably and provide far more satisfaction if run in an authentic way. Authentic for Crofts means having a purpose (beyond making money) and to pursue that ethically, honestly and sustainably. Crofts introduces the book like this:

Do you dream of stepping off the corporate treadmill? Do the politics and greed of corporate life leave you cold? It doesn’t have to be like that. Authentic Business shows that business can be positive, fun and meaningful as well as profitable.

He’s right, it doesn’t have to be like that. The more I think about it though, and the more I read through Crofts’ examples of Yeo Valley, Howies and Dove Farm, the more I think it’s about caring about what the business is doing rather than about the business necessarily being “noble” in its cause. No matter how hard I tried, I couldn’t get excited about bringing really great organic yoghurt to market.

Exploring OpenLibrary Part One

This post also appears on the n2 blog.

I thought it was about time I got around to taking a better look at what might be possible with the OpenLibrary data.

My plan is to try and convert it into meaningful RDF and see what we can find out about things along the way. The project is an own-time project mostly, so progress isn’t likely to be very rapid. Let’s see how it goes. I’ll diary here as stuff gets done.

To save me typing loads of stuff out here, today’s source code is tagged and in the n2 subversion as day 1 of OpenLibrary.

Day one, 3rd October 2008, I downloaded the authors data from OpenLibrary and unzipped it. I’m also downloading the editions data from OpenLibrary, but that’s bigger (1.8Gb) so I’m playing with the author data while that comes down the tubes.

The data has been exported by OpenLibrary as JSON, so is pretty easy to work with. I’m going to write some PHP scripts on the command line to mess with it and it looks great for doing that.

Each line of the JSON in the authors file represents a single author, although some authors will have more than one entry. Taking a look at Iain Banks (aka Iain M Banks) we have the following entries:


{"name": "Banks, Iain", "personal_name": "Banks, Iain", "key": "\/a\/OL32312A", "birth_date": "1954", "type": {"key": "\/type\/type"}, "id": 81616}
{"name": "Banks, Iain.", "type": {"key": "\/type\/type"}, "id": 3011389, "key": "\/a\/OL954586A", "personal_name": "Banks, Iain."}
{"type": {"key": "\/type\/type"}, "id": 9897124, "key": "\/a\/OL2623466A", "name": "Iain Banks"}
{"type": {"key": "\/type\/type"}, "id": 9975649, "key": "\/a\/OL2645303A", "name": "Iain Banks         "}
{"type": {"key": "\/type\/type"}, "id": 10565263, "key": "\/a\/OL2774908A", "name": "IAIN M. BANKS"}
{"type": {"key": "\/type\/type"}, "id": 10626661, "key": "\/a\/OL2787336A", "name": "Iain M. Banks"}
{"type": {"key": "\/type\/type"}, "id": 12035518, "key": "\/a\/OL3127859A", "name": "Iain M Banks"}
{"type": {"key": "\/type\/type"}, "id": 12078804, "key": "\/a\/OL3137983A", "name": "Iain M Banks         "}
{"type": {"key": "\/type\/type"}, "id": 12177832, "key": "\/a\/OL3160648A", "name": "IAIN M.BANKS"}

In total the file contains 4,174,245 entries. First job is to get a more manageable set of data to work with. So, I wrote a short script to extract 1 line in every 10 from a file. The resulting sample author data file contains 417,424 entries. This is more manageable for quick testing of what I’m doing.
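The actual sampling script is in the n2 subversion; as a rough idea of its shape, a sketch that reads STDIN and writes every tenth line to STDOUT might look like this (the code in svn may differ):

<?php
// Sketch: emit one line in every ten from STDIN to STDOUT, in keeping
// with the unix pipes model used throughout.
// Usage: php sample.php < authors.txt > authors-sample.txt
$count = 0;
while (($line = fgets(STDIN)) !== false) {
    if ($count % 10 === 0) {
        fwrite(STDOUT, $line);
    }
    $count++;
}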

So now we can start writing some code to produce some RDF. Given the size of these files, I need to stream the data in and out again in chunks. The easiest format I find for that is turtle, which has the added benefit of being human readable. YMMV. Previously I’ve streamed stuff out using n-triples. That has some great benefits too, like being able to generate different parts of the graph, for the same subject, in different parts of the file and then bring them together using a simple command line sort. It’s also a great format for chunking the resulting data into reasonable size files, as breaking on whole lines doesn’t break the graph, whereas with rdf/xml and turtle it does.

So, I may end up dropping back to n-triples, but for now I’m going to use turtle.

I also like working on the command line and love the unix pipes model, so I’ll be writing the cli (command line) tools to read from STDIN and write to STDOUT so I can mess with the data using grep, sed, awk, sort, uniq and so on.

First things first, let’s find out what’s really in the authors data. Reading the json line by line and converting each line into an associative array is simple in PHP, so let’s do that, keep track of all the keys we find in the arrays, recurse into the nested arrays to look at them and then dump the result out (a sketch of that script follows the key list below). The arrays contain this set of keys:

alternate_names
alternate_names
alternate_names\1
alternate_names\2
alternate_names\3
bio
birth_date
comment
date
death_date
entity_type
fuller_name
id
key
location
name
numeration
personal_name
photograph
title
type
type\key
website
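For illustration, here’s a minimal sketch of the kind of key-scanning script described above: it reads the JSON from STDIN, collects keys recursively and dumps the distinct set. The real tagged code in the n2 subversion may differ in detail.

<?php
// Sketch (illustrative only): collect every key seen in the authors
// JSON, recursing into nested arrays, then print the distinct set.
function collect_keys($array, $prefix, &$keys) {
    foreach ($array as $key => $value) {
        $name = ($prefix === '') ? $key : $prefix . '\\' . $key;
        $keys[$name] = true;
        if (is_array($value)) {
            collect_keys($value, $name, $keys);
        }
    }
}

$keys = array();
while (($line = fgets(STDIN)) !== false) {
    $author = json_decode($line, true);
    if (is_array($author)) {
        collect_keys($author, '', $keys);
    }
}

$names = array_keys($keys);
sort($names);
fwrite(STDOUT, implode("\n", $names) . "\n");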

So, they have names, birth dates, death dates, alternate names and a few other bits and pieces. And they have a ‘key’ which turns out to be the resource part of the OpenLibrary url. That means we can link back into OpenLibrary nice and easy. Going back to our previous Iain Banks examples, we want to create something like this for each one:


@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix bio: <http://vocab.org/bio/0.1/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<http://example.com/a/OL32312A>
	foaf:Name "Banks, Iain";
	foaf:primaryTopicOf <http://openlibrary.org/a/OL32312A>;
	bio:event <http://example.com/a/OL32312A#birth>;
	a foaf:Person .

<http://example.com/a/OL32312A#birth>
	bio:date "1954";
	a bio:Birth .

This gives us a foaf:Person for the author and tracks his birth date using a bio:Birth event. While tracking the birth as a separate entity may seem odd it gives the opportunity to say things about the birth itself. We’ll model death dates the same way, for the same reason. I’ve written some basic code to generate foaf from the OpenLibrary authors.
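As a hedged sketch of what that generation might look like, here’s an illustrative function; the URIs, the basename() trick for pulling the OpenLibrary key apart and the string building are assumptions rather than the actual code in the n2 subversion.

<?php
// Sketch (illustrative only): turn one decoded author entry into
// Turtle along the lines of the example above.
function author_to_turtle($author) {
    $id  = basename($author['key']);               // "OL32312A" from "/a/OL32312A"
    $uri = 'http://example.com/a/' . $id;

    $ttl  = "<$uri>\n";
    $ttl .= "\tfoaf:Name \"" . addslashes($author['name']) . "\";\n";
    $ttl .= "\tfoaf:primaryTopicOf <http://openlibrary.org/a/$id>;\n";
    if (isset($author['birth_date'])) {
        $ttl .= "\tbio:event <$uri#birth>;\n";
    }
    $ttl .= "\ta foaf:Person .\n\n";

    if (isset($author['birth_date'])) {
        $ttl .= "<$uri#birth>\n";
        $ttl .= "\tbio:date \"" . addslashes(trim($author['birth_date'])) . "\";\n";
        $ttl .= "\ta bio:Birth .\n\n";
    }
    return $ttl;
}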

Linking back to the OpenLibrary url has been done here using foaf:primaryTopicOf. I didn’t use owl:sameAs because the url at OpenLibrary is that of a web page, whereas the uri here (http://example.com/a/OL32312A) represents a person. Clearly a person is not the same as a web page that contains information about them.

The only thing worrying me is that the uris we’re using are constructed from OpenLibrary’s keys. This makes matching them up with other data sources hard. Matching with other data sources requires a natural key, but there’s not enough data in these author entries to create one. The best I can do is to create a natural key that will enable people to discover the group of authors that share a name.


@prefix mine: <http://example.com/mine/schema#> .
<http://example.com/names/banksiain>
	mine:name_of <http://example.com/a/OL32312A>;
	a mine:Name .

These uris will enable me to find authors that share the same name easily, either because they do share the same name or because they’re duplicates. The natural key is simply the author’s name with any casing, whitespace or punctuation stripped out. That might need to evolve as I start looking at the names in more detail later.
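A minimal sketch of that normalisation, assuming PHP as before; the function name is mine, and non-ASCII characters would simply be dropped by this simple version:

<?php
// Sketch (illustrative only): build the natural key by stripping
// casing, whitespace and punctuation from the author's name.
function natural_key($name) {
    return strtolower(preg_replace('/[^a-zA-Z0-9]/', '', $name));
}

echo natural_key('Banks, Iain');   // banksiain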

Next step is to look in more detail at the dates in here; we have some simple cases of trailing whitespace or trailing punctuation, but also some more interesting cases of approximate dates or possible ranges – these occur mostly for historical authors. The complete list of distinct dates within the authors file is in svn. If you know anything about dates, feel free to throw me some free advice on what to do with them…


Pages, Screens, MVC and not getting it…

[Diagram: MVC on the web, showing the model kept separate from the views and controllers]

About two years ago my colleague Ian Davis and I were talking about different approaches to building web applications. I was advocating that we use ASP.Net; the framework it provides for nesting controls within controls (server controls and user controls) is very powerful. I was describing it as a component-centric approach where we could build pages rapidly by plugging controls together.

Ian was describing a page-centric approach, and advocating XSLT (within PHP) as one of several possible solutions. He was suggesting that his approach was both simpler and that we could be more productive using it. Having spent two years working with ASP.Net I was not at all convinced.

Two years on and I think I finally get what he was saying. What can I say, I’m a slow learner. The difference in our opinions was based on two different underlying mental models.

The ASP.Net mental model is that of application software. It tries to bring the way we build Windows software to the web. ASP.Net has much of the same feature set that we have when building a Windows Forms app; it’s no coincidence that the two models are now called Windows Forms and Web Forms. In this model we think about the forms, or screens, that we have to build and consider the data on which they act as secondary – a parameter to the screen to tell it which user, or expense claim, or whatever to load for editing.

In this mental model we end up focussing on the verbs of the problem. We end up with pages called ‘edit.aspx’, ‘createFoo.aspx’ and ‘view.aspx’, where view is in the verb form, not the noun. ASP.Net is not unique in this; the same model exists in JSP and many people use PHP this way – it’s not specific to any technology, it’s a style of thinking and writing.

Ian’s mental model is different. Ian’s mental model is that of the web. The term URL means Uniform _Resource_ Locator. It doesn’t say Uniform _Function_ Locator. A URL is meant to refer to a noun, not a verb. This may seem like an esoteric or pedantic distinction to be making, but it affects the way we think about the structure of our applications and changing the way we think about solving a problem is always interesting.

If we think about URLs as being only nouns, no verbs, then we end up with a URL for every important thing in our site. Those URLs can then be bookmarked and linked easily. We can change code behind the scenes without changing the URLs as the URLs refer to objects that don’t change rather than functions that do.

So if URLs refer to nouns, how do we build any kind of functionality? That’s tied up in something else that Ian was saying a long time ago, when he asked me “What’s the difference between a website and a web API?”. My mental model, building web applications the way we build Windows apps, was leading me to consider the UI and the API as different things. Ian was seeing them as one and the same. When I was using URIs to refer to verbs I found this hard to conceptualise, but thinking about URIs as nouns it becomes clearer – that’s what REST is all about. URIs are nouns and the HTTP verbs give you your functionality.

That realisation and others from working on Linked Open Data means I now think they’re one and the same too.

At Talis we’ve done a few projects this way. Most notably our platform, but also Project Cenote some time ago and a few internal research projects more recently. The clearest of these so far is the product I’m working on right now to support reading lists (read course reserves in the US) in Higher Education. We’re currently in pilot with University of Plymouth, here’s one of their lists on Financial Accounting and Reporting. The app is built from the ground up as Linked Data and does all the usual content negotiation goodness. We still have work to do on putting in RDFa or micro-formats and cross references between the html and rdf views – so it’s not totally linked data yet.

What I’ve found is that this approach to building web apps beats anything else I’ve worked with (in roughly chronological order: Lotus Domino, Netscape Application Server, PHP3, Vignette StoryServer, ASP, PHP4, ASP.Net, JSP, PHP5).

The model is inherently object-oriented, with every object (at least those of meaning to the outside world) having a URI and every object responding to the standard HTTP verbs, GET, PUT, POST, DELETE. This is object-orientation at the level of the web, not at the level of a server-side language. That’s a very different thing to what JSP does, where internally the server-side code may be object-oriented, but the URIs refer to verbs, so look more procedural or perhaps functional.

It’s also inherently MVC, with GET requests asking for a view (GET should never cause a change on the server) and PUT, POST and DELETE being handled by controllers. With MVC, though, we typically think of that as happening in lots of classes in a single container, like ASP.Net or Tomcat or something like that. In my experience this comes from two factors: firstly, the friction between RDBMS models and object models; secondly, the relatively poor performance of most databases. These two things combine to drive people to draw the model into objects alongside the views and controllers.

The result of this is usually that it’s not clear how update behaviour should be divided between the model and the controllers and how display behaviour should be divided between the model and the views. As a result the whole thing becomes complex and confused. That doesn’t even start to take into account the need for some kind of persistence layer that handles the necessary translation between object model and storage.

We’ve not done that. We’ve left the model in a store, in this case a Talis Platform store, but it could be any triple store. That’s what the diagram at the top shows: the model staying separate from the views and controllers… and having no behaviour.

A simple example may help: how about tagging something within an application? We have the thing we’re tagging, which we’ll call http://example.com/resources/foo, and the collection of tags attached to foo, which we’ll call http://example.com/resources/foo/tags. An HTTP GET asking for /resources/foo would be routed to some view code which reads the model and renders a page showing information about foo, and would show the tags too of course. It would also render a form for adding a tag which simply posts the new tag to /resources/foo/tags.

The POST gets routed to some controller logic which is responsible for updating the model.

The UI response to the POST is to show /resources/foo again, which will now have the additional tag. Most web development approaches would simply return the HTML in response to the POST, but we can keep the controller code completely separate from the view code by responding to a successful POST with a 303 See Other with a location of /resources/foo, which will then re-display with the new tag added.

“The response to the request can be found under a different URI and SHOULD be retrieved using a GET method on that resource. This method exists primarily to allow the output of a POST-activated script to redirect the user agent to a selected resource.” – RFC 2616
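As a hedged illustration of that split, a controller for the tag POST might end like this; update_model() is a placeholder for whatever writes the new tag into the store, not the actual application code.

<?php
// Sketch (illustrative only): controller logic for POST /resources/foo/tags.
function update_model($resourcePath, $tag) {
    // Placeholder: write the new tag triple into the store here.
}

function handle_post_tag($resourcePath, $tag) {
    update_model($resourcePath, $tag);

    // 303 See Other sends the user agent back to GET the view,
    // keeping the controller code completely separate from the view code.
    header('HTTP/1.1 303 See Other');
    header('Location: ' . $resourcePath);   // e.g. /resources/foo
    exit;
}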

This model is working extremely well for us in keeping the code short and very clear.

The way we route requests to code is through the use of a dispatcher: .htaccess in Apache sends all requests (except those for which a file exists) to a dispatcher, which uses a set of patterns to match the URI to a view or controller depending on whether the request is a GET or a POST.
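A rough sketch of that kind of dispatcher, with illustrative patterns and handler names rather than the application’s real routing table:

<?php
// Sketch (illustrative only): dispatcher reached via an .htaccess
// rewrite. Match the request URI against noun patterns and hand off
// to view code for GET and controller code for POST.
$uri    = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH);
$method = $_SERVER['REQUEST_METHOD'];

$routes = array(
    // pattern => array(GET view, POST controller)
    '#^/resources/([^/]+)/tags$#' => array('view_tags', 'controller_add_tag'),
    '#^/resources/([^/]+)$#'      => array('view_resource', null),
);

foreach ($routes as $pattern => $handlers) {
    if (preg_match($pattern, $uri, $matches)) {
        $handler = ($method === 'POST') ? $handlers[1] : $handlers[0];
        if ($handler !== null && function_exists($handler)) {
            $handler($matches);
        } else {
            header('HTTP/1.1 405 Method Not Allowed');
        }
        exit;
    }
}

header('HTTP/1.1 404 Not Found');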

Ian has started formalising this approach into a framework he’s called Paget.

Passion, personal brand, and doing what you love…

Just found this great video of Gary Vaynerchuk keynoting at Web 2.0 Expo in New York. I found it via the Natural User Interface blog.

Vaynerchuk’s keynote is entitled “Building Personal Brand Within the Social Media Landscape” but the main thrust of it is that everyone should stop doing what they hate and do something they really love.

The timing of this is interesting as I’ve recently had a few conversations with folks at work about loving, or not, what they’re doing. At Talis it can be hard to find what it is you really want to do; we expect people to self-organize and have a high degree of self-awareness.

When I worked at Egg, we found very strongly that people joining the company either loved it and thrived or hated it and didn’t. This came down to one thing – whether what Egg was doing was something you were passionate about.

We’ve come round to a similar culture at Talis, one where we’re all actively encouraged to find out what it is that we love doing and supported in getting into that role. Of course, we’re not in just any business so if I decided I wanted to become a goat farmer then Talis and I would have to part company.

Knowing what it is that you love doing, and often that’s based on discovering and understanding your strengths, is crucial to understanding if an organizational culture is right for you and even if the organization’s business ambitions are right for you.

Having worked in companies that I’ve loved and in companies I loathed, and in roles of both kinds within both kinds of company, I’m starting to get clear on what I love doing. Right now my role involves bringing together really cool ideas like Linked Data and great interaction design with existing problems in higher education. We’re building lightweight stuff quickly and easily to test out ideas and constantly looking for the very essence of solutions. And more importantly than all of that, I’m working with people who are smarter than me and who inspire me with great ideas.

That’s what I love, but the very same culture can be experienced by others very differently.

Right now I’ve found something pretty close to what I love doing – you should do that too 🙂