Marc, RDF and FRBR
I’ve been playing with some ideas over the past six months on how we can really move bibliographic data forwards into a structure that could have huge benefits.
The impetus to describe some of that for everyone finally came in the form of a conference, with a submission deadline just a few days away. The conference is WWW2008 and the workshop is entitled Linked Data on the Web. There are a whole load of reasons to go – I got to go to last year's and learnt a huge amount, as well as getting to speak about data licensing.
This year I’ve submitted a substantial paper about the work I’ve done on finding relationships in MARC data, something I’m already scheduled to present on at Code4Lib late next month. I’ve had a lot of help thinking about these problems from Nad, and he’s helped out enormously in getting the paper finished. Thanks also have to go to Danny; he’s been a huge help in understanding how to think about RDF and how to describe it – he wrote chunks of the paper too.
Please, grab the paper, have a read and let me know what you think.
Semantic Marc, MARC21 and The Semantic Web. (PDF, 440Kb)
Update: Thanks to Damian for pointing out the error in the example turtle – don’t know how we missed that!
4 Comments to Marc, RDF and FRBR
Still working my way through, but it looks interesting (maybe I’ll finally understand MARC21).
One major (yet trivial) issue needs fixing in your syntax examples: <marc21:recordStatus>. You don’t want the > < there.
It’s an interesting paper and approach.
A few small things:
o When matching personal names I believe the subfields needed are abcdq
o I’d recommend you NACO-normalize the names. That’s not quite the same as ‘paying no attention to case, whitespace, punctuation…’ see http://outgoing.typepad.com/outgoing/2006/03/naco_normalizat.html
The approach of just doing some normalization on the fields and hoping the resulting strings will match across records works fairly well, but not to the level that people expect from their library systems, and not as well as can be done. With a bit more effort you can put controlled terms in the records and have unambiguous URIs.
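To make the "normalize and hope the strings match" approach concrete, here is a minimal Python sketch. It is a simplified stand-in and not the real NACO rules (those are described at the link above); the function name `simple_name_key` and the `(code, value)` input shape are assumptions made for illustration.

```python
import re
import unicodedata

# Subfields suggested in the comment above for matching personal names.
NAME_SUBFIELDS = frozenset("abcdq")


def simple_name_key(subfields):
    """Build a rough match key from (code, value) pairs of a MARC name field.

    Simplified illustration only: this is NOT NACO normalization.
    """
    # Keep only the subfields that take part in matching.
    parts = [value for code, value in subfields if code in NAME_SUBFIELDS]
    text = " ".join(parts)
    # Decompose accented characters and drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Upper-case, turn runs of punctuation into a space, collapse whitespace.
    text = re.sub(r"[^A-Z0-9]+", " ", text.upper())
    return " ".join(text.split())
```

Two records whose name fields differ only in case, spacing, punctuation or diacritics produce the same key; anything subtler (initials versus full forenames, for instance) still needs authority control.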
Before building WorldCat Identities, the first thing we do is make sure we have all the names we can linked to an appropriate authority. For example, J. K. Rowling’s Identity page is: http://worldcat.org/identities/lccn-n97-108433, which should be a fairly stable URI. We do the same thing when linking to the authority file itself: http://errol.oclc.org/laf/n97-108433.html.
Of course for many names such a URI isn’t available and we fall back to something similar to what you are doing, although we use the standard NACO normalization rules; anything else risks ambiguity even with perfect MARC-21 records. These strings are useful, but are just not as stable as the LCCNs. This year we should have the Virtual International Authority File up, and that will be completely open for use, offering another larger set of names of interest to the library community.
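The "authority URI first, normalized string as a fallback" idea might be sketched like this. The Identities prefix is copied from the example URI above; the fallback base `http://example.org/names/` and the function `name_uri` are hypothetical, and the LCCN is assumed to arrive already in the hyphenated form shown in that example.

```python
from urllib.parse import quote

# Prefix taken from the example Identities URI quoted above.
IDENTITIES_BASE = "http://worldcat.org/identities/lccn-"
# Hypothetical namespace for names that have no authority record.
FALLBACK_BASE = "http://example.org/names/"


def name_uri(normalized_name, lccn=None):
    """Prefer a stable LCCN-based URI; otherwise mint one from the name key."""
    if lccn:
        # Assumes the LCCN is already formatted like "n97-108433".
        return IDENTITIES_BASE + lccn
    # Fallback: less stable, because it depends on the normalized string.
    return FALLBACK_BASE + quote(normalized_name.replace(" ", "_"))
```

For example, `name_uri("ROWLING J K", "n97-108433")` gives the Identities URI quoted above, while `name_uri("ROWLING J K")` falls back to a string-derived URI under the hypothetical namespace.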
–Th
Styles, Ayers, Shabir: Semantic MARC, MARC 21 and the Semantic Web
Last week Talisman Rob Styles posted MARC, RDF and FRBR, two initialisms and an acronym that probably get your heart racing like they do mine. In it, he points to a paper he wrote with fellow Talismen Danny Ayers and Nadeem Shabir: Semantic MARC, MARC…
Hi Rob,
Sorry – I’ve been meaning to write this for weeks, but only just got round to it.
Overall I think the paper is a good intro to using RDF to present bib data instead of MARC – I’m not sure I have a lot of wisdom to add, but here are my comments:
I’m surprised in section 8 you say that you still get a high degree of uniqueness even when you re-order the characters (NB – just surprised, not claiming you are wrong!). Have you done any work on this for non-Name fields?
In section 9 you mention using FOAF to represent people/organisations. Have you looked at whether FOAF can adequately represent the information from MARC name fields? I’m not incredibly familiar with FOAF, but it seems limited to me in comparison
In section 10 you mention the approach taken by Thom Hickey et al of using authority information to clean data – I’m not very clear as to why you don’t do this? This section loses me a bit. If you are committed to creating the URIs algorithmically then don’t we need to see a similar mechanism for creating a URI algorithmically from an authority record, and check that these match up – obviously logically they ought to, but in my (limited) experience UK libraries aren’t very uniform in terms of using the LC authority files, and are more interested in internal consistency in author usage rather than an authority file. I also don’t understand when you say
“The authority data may also contain additional information about relationships between authors’ names; one example being that Iain Banks also publishes under the name Iain M Banks, another common example being that Mark Twain was the pen name of Samuel Clemens. These relationships are between different resources rather than different URIs representing the same resource, so require a “see also” relationship rather than a “same as” relationship. ”
I would have thought ideally you’d want Mark Twain and Samuel Clemens to point to the same URI? You seem to be suggesting that it is correct for them to point at different URIs – I don’t understand why?
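For readers weighing the two options, the distinction drawn in the quoted passage can be sketched in Python as N-Triples-style statements. The `rdfs:seeAlso` and `owl:sameAs` property URIs are the standard ones; the helper names and any `example.org` name URIs passed in are hypothetical.

```python
RDFS_SEE_ALSO = "http://www.w3.org/2000/01/rdf-schema#seeAlso"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def link(subject, predicate, obj):
    """Format one statement in N-Triples style."""
    return f"<{subject}> <{predicate}> <{obj}> ."


def related_names(uri_a, uri_b):
    """'See also': two distinct name resources that point at each other."""
    return [link(uri_a, RDFS_SEE_ALSO, uri_b), link(uri_b, RDFS_SEE_ALSO, uri_a)]


def same_resource(uri_a, uri_b):
    """'Same as': two URIs asserted to identify one and the same resource."""
    return [link(uri_a, OWL_SAME_AS, uri_b)]
```

With "same as", everything said about one URI is also taken to be true of the other; with "see also", the two names stay separate resources that merely reference each other, which is the modelling the quoted passage describes for pen names.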
In section 11/14 I like the idea of disambiguation along the lines of wikipedia – it seems to work quite well
Section 13 – I think I like the idea of creating a URI for the work based on author/title information, as it goes along with my feeling that rather than saying explicitly ‘these are the same work’, it may be better to say ‘we will call these the same work if they share x attributes’. However, you don’t mention how this extends to manifestations and expressions – presumably the more information you feed to an algorithm creating a URI the more likely you are going to see data variation causing different URIs for things that are actually equivalent (I guess that at the manifestation level things are particularly messy as you are dealing with largely uncontrolled fields where consistency won’t have been a key concern during the cataloguing process)
Clearly you are dealing with data that is already in existence – which is something we clearly need to worry about – but I think another question is around how we expect to catalogue in the future. If we catalogue ‘into’ a FRBR environment that seems to point at different requirements to existing records than if we continue to essentially catalogue at an item level, and then present in a FRBR way.
A thought that has just occurred to me – why do you go back to the author/title when you do the URI for the Work? Why not say that all things that share the same URIs for author and title (based on the URIs already created by your algorithm) are the same work – and if you want a URI for the work, create it from those existing URIs rather than going back to the ‘raw’ author and title info again?
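That last suggestion could look something like the following Python sketch, which derives a work URI purely from author and title URIs that have already been minted. It is one possible reading of the idea, not the method used in the paper; the `http://example.org/works/` base, the `work_uri` name and the choice of a SHA-1 digest are all assumptions.

```python
import hashlib

# Hypothetical namespace for work URIs.
WORK_BASE = "http://example.org/works/"


def work_uri(author_uris, title_uri):
    """Derive a work URI from already-minted author and title URIs."""
    # De-duplicate and sort so author order does not change the result.
    parts = sorted(set(author_uris)) + [title_uri]
    digest = hashlib.sha1("\n".join(parts).encode("utf-8")).hexdigest()
    return WORK_BASE + digest
```

Any two records that resolve to the same author and title URIs then get the same work URI by construction, so the "same work" rule lives in one place instead of being re-derived from the raw strings.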
January 28, 2008