Linked Data, Big Data and Your Data

Week five of my new challenge and I figured I really should get around to scribbling down some thoughts. I talked in my last post about RDF and graphs being useful inside the enterprise; and here I am, inside an enterprise.

Callcredit is a data business. A mixture of credit reference agency (CRA) and consumer data specialists. As the youngest of the UK’s CRAs, 12 years old, it has built an enviable position and is one of few businesses growing strongly even in the current climate. I’ve worked with CRAs from the outside, during my time at Internet bank Egg. From inside there’s a lot to learn and some interesting opportunities.

Being a CRA, we hold a lot of data about the UK population – you. Some of this comes from the electoral roll, much of it from the banks. Banks share their data with the three CRAs in order to help prevent fraud and lower the risk of lending. We know quite a lot about you.

Actually, if you want to see what we know, check out your free credit report from Noddle – part of the group.

Given the kind of data we hold, you’d hope that we’re pretty strict about security and access. I was pleased to see that everyone is. Even the data that is classed as public record is well looked after; there’s a very healthy respect for privacy and security here.

The flip side to that is wanting those who should have access to be able to do their job the best way possible; and that’s where big data tools come in.

As in my previous post, variety in the data is a key component here. Data comes from lots of different places and folks here are already expert at correcting, matching and making consistent. Volume also plays a part. Current RDBMS systems here have in excess of 100 tables tracking not only data about you but also provenance data so we know where information came from and audit data so we know who’s been using it.

Over the past few weeks I’ve been working with the team here to design and start building a new product using a mix of Hadoop and Big Data® for the data tiers and ASP.net for the web UI, using Rob Vesse’s dotNetRDF. The product is commercially sensitive so I can’t tell you much about that yet, but I’ll be blogging some stuff about the technology and approaches we’re using as I can.

Fun 🙂

RDF, Big Data and The Semantic Web

I’ve been meaning to write this post for a little while, but things have been busy. So, with this afternoon free I figured I’d write it now.

I’ve spent the last 7 years working intensively with data. Mostly not with RDBMSs, but with different Big Data and Linked Data tools. Over the past year things have changed enormously.

The Semantic Web

A lot has been said about the Semantic Web for a long time now. In fact, I often advise people to search for Linked Data rather than Semantic Web, as the results are far more useful in a practical context. The semantic web has been a rather unfortunately academic endeavour that has been very hard for many developers to get into. In contrast, Linked Data has seen explosive growth over the past five years. It hasn’t gone mainstream though.

What does show signs of going mainstream is the schema.org initiative. This creates a positive feedback loop between sites putting structured data into their pages and search engines giving those sites more and better leads as a result.

Much has been said about Microdata killing RDF blah, blah but that’s not important. What is important is that publishing machine-understandable data on the web is going mainstream.

As an aside, as Microdata extends to solve the problems it currently has (global identifiers and meaningful links) it becomes just another way to write down the RDF model anyway. RDF is an abstract model, not a data format, and at the moment Microdata is a simplified subset of that model.
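To make the "model, not a format" point concrete, here is a minimal sketch in plain Python. It is not real Microdata or any official RDF serialisation: the URIs, the two toy formats and the function names are all assumptions made up for illustration. The same set of subject, predicate, object statements is written down in two different shapes and read back to an identical set.

```python
import json

# Hypothetical statements: (subject, predicate, object), all plain strings.
TRIPLES = {
    ("http://example.org/post/1", "http://example.org/title", "RDF, Big Data and The Semantic Web"),
    ("http://example.org/post/1", "http://example.org/author", "http://example.org/people/alice"),
    ("http://example.org/people/alice", "http://example.org/name", "Alice"),
}


def to_lines(triples):
    """Toy format 1: one statement per line, fields separated by tabs."""
    return "\n".join("\t".join(t) for t in sorted(triples))


def from_lines(text):
    """Read toy format 1 back into a set of triples."""
    return {tuple(line.split("\t")) for line in text.splitlines() if line}


def to_nested_json(triples):
    """Toy format 2: JSON nested as subject -> predicate -> list of objects."""
    nested = {}
    for s, p, o in sorted(triples):
        nested.setdefault(s, {}).setdefault(p, []).append(o)
    return json.dumps(nested, indent=2)


def from_nested_json(text):
    """Read toy format 2 back into a set of triples."""
    nested = json.loads(text)
    return {
        (s, p, o)
        for s, preds in nested.items()
        for p, objs in preds.items()
        for o in objs
    }
```

The two outputs look nothing alike on the page, yet both carry exactly the same statements, which is the sense in which a syntax is just one way of writing down the model.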

Big Data and NoSQL

In the meantime another data meme has also grown enormously. In fact, it has dwarfed Linked Data in the attention it has captured. That trend is Big Data and NoSQL.

In Planning for Big Data, Edd talks about the three Vs:

To clarify matters, the three Vs of volume, velocity and variety are commonly used to characterize different aspects of big data. They’re a helpful lens through which to view and understand the nature of the data and the software platforms available to exploit them. Most probably you will contend with each of the Vs to one degree or another.

Most Big Data projects are really focussed on volume. They have large quantities, terabytes or petabytes, of uniform data. Often this data is very simple in structure, such as tweets. Fewer projects are focussed on velocity, being able to handle data coming in quickly, and even fewer on variety, having unknown or widely varied data.

You can see how the Hadoop toolset is tuned to this and also how the NoSQL communities focus mostly on denormalisation of data. This is a good way to focus resources if you have large volumes of relatively simple, highly uniform data and a specific use-case or queries.

Apart from Neo4J, which is the odd one out in the Big Data community, this is the approach.
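As a rough illustration of what denormalisation means here, the sketch below uses plain Python structures rather than any particular NoSQL store. The users, tweets and function names are invented for the example. Each tweet is turned into a self-contained document with the author's name copied in, so the one query we care about needs no join.

```python
# Hypothetical normalised data: users and tweets kept apart, linked by user_id.
USERS = {1: {"name": "alice"}, 2: {"name": "bob"}}
TWEETS = [
    {"id": 10, "user_id": 1, "text": "hello"},
    {"id": 11, "user_id": 2, "text": "hi"},
    {"id": 12, "user_id": 1, "text": "again"},
]


def denormalise(users, tweets):
    """Copy the user name into every tweet so each document stands alone."""
    return [
        {"id": t["id"], "text": t["text"], "user_name": users[t["user_id"]]["name"]}
        for t in tweets
    ]


def tweets_by_user(docs):
    """The one query this layout is tuned for: tweet texts grouped by author."""
    index = {}
    for doc in docs:
        index.setdefault(doc["user_name"], []).append(doc["text"])
    return index
```

The trade-off is visible even at this size: the name is duplicated in every document, which is cheap to read for this one question and awkward the moment the questions change.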

RDF

So, while we wait for the semantic web to evolve, what is RDF good for today?

That third V of the Big Data puzzle is where I’ve been helping people use graphs of data (and that’s what RDF is, a graph model). Graphs are great where you have a variety of data that you want to link up. Especially if you want to extend the data often and if you want to extend the data programmatically — i.e. you don’t want to commit to a complete, constraining schema up-front.
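A minimal sketch of that "no schema up-front" property, using a plain Python set of triples rather than a real RDF store. The `ex:` identifiers and helper names are hypothetical. New kinds of facts are added by the program as it goes, with no table definition to alter first.

```python
def add_facts(graph, subject, facts):
    """Add one (subject, predicate, value) statement per entry in facts."""
    for predicate, value in facts.items():
        graph.add((subject, predicate, value))


def predicates(graph):
    """Every kind of statement currently present in the graph."""
    return {p for _, p, _ in graph}


def describe(graph, subject):
    """All predicate/value pairs known about one subject."""
    return {(p, o) for s, p, o in graph if s == subject}


graph = set()

# Day one: all we know is a name.
add_facts(graph, "ex:alice", {"ex:name": "Alice"})

# Later: another source contributes new kinds of facts, and links to a new node.
add_facts(graph, "ex:alice", {"ex:worksFor": "ex:acme"})
add_facts(graph, "ex:acme", {"ex:name": "Acme Ltd"})
```

Nothing about the first statement had to anticipate the later ones, which is what makes programmatic extension cheap.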

The other aspect of that variety in data that graphs help with is querying. As Jem Rayfield (BBC News & Sport) explains, using a graph makes the model simpler to develop and query.

Graph data models can reach higher levels of variety in the data before they become unwieldy. This allows more data to be mixed and queried together. Mixing in more data adds more context, and more context allows for more insight. Insight is what we’re ultimately trying to get at with any data analysis. That’s why the intelligence communities have been using graphs for many years now.
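Mixing and querying data together can be sketched in a few lines of plain Python. This is only an illustration, not SPARQL or any product's API, and the two sources and their identifiers are made up. Merging is a set union, and a question that spans both sources is just two pattern matches chained together.

```python
# Two hypothetical sources that were never designed to be used together.
HR = {
    ("ex:alice", "ex:worksFor", "ex:acme"),
    ("ex:bob", "ex:worksFor", "ex:initech"),
}
REGISTRY = {
    ("ex:acme", "ex:basedIn", "ex:leeds"),
    ("ex:initech", "ex:basedIn", "ex:austin"),
}


def match(graph, s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        (ts, tp, to)
        for (ts, tp, to) in graph
        if (s is None or ts == s) and (p is None or tp == p) and (o is None or to == o)
    ]


def people_based_in(graph, place):
    """People who work for any company based in the given place."""
    companies = {s for s, _, _ in match(graph, p="ex:basedIn", o=place)}
    return sorted(s for s, _, o in match(graph, p="ex:worksFor") if o in companies)


merged = HR | REGISTRY  # mixing the data is a plain set union
```

Asked of either source alone, the question has no answer; asked of the merged graph, `people_based_in(merged, "ex:leeds")` finds the person. That is the "more context, more insight" step in miniature.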

What we’re seeing now, with the combination of Big Data and graph technologies, is the ability to add value inside the enterprise. Graphs are useful for data analysis even if you don’t intend to publish the data on the semantic web. Maybe even especially then.

Microsoft, Oracle and IBM are all playing in the Big Data space and have been for some time. What’s less well-known and less ready for mainstream is that they all have projects in the graph database space: DB2 NoSQL Graph Store, Oracle Database Semantic Technologies and Connected Services Framework 3.0.

Behind the scenes, in the enterprise, is probably the biggest space where graphs will be adding real value over the next few years; solving the variety problem in Big Data.

Where does the time go…

Busy times: child number three is on the way, due at the end of next month, and that, along with trying to buy and sell houses and work, has been great fun. I want to blog about some of the technicalities of what we’ve been building, and will try to get around to that, but in the meantime we’re talking about some of it at:

http://silkworm.talis.com

Essentially, we’re working on services out in the cloud that provide content discovery, interoperability and access services for content providers. I can highly recommend the white paper by our CTO Justin, available for download from above.

More coming soon.

More Security or Different Security

We’ve just been having a discussion at work about the benefits of Impersonation and Delegation in .Net. That is, the ability of an application to perform actions using the identity of the human user driving them.

For web applications, web services, database access, etc., this can be very useful, giving a trail throughout a multi-tier application showing which user performed an action.
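The audit-trail idea can be sketched without any .NET at all. The Python below is not the .NET impersonation API, only an illustration of the concept, and every name in it (the tiers, the log, the default service account) is an assumption. The web tier sets the current user, and the data tier records whoever that is when it touches a record.

```python
import contextvars
from contextlib import contextmanager

# Identity the code is currently acting as; falls back to the app's own account.
current_user = contextvars.ContextVar("current_user", default="service-account")

# Hypothetical audit trail: (user, action, record_id) tuples.
AUDIT_LOG = []


@contextmanager
def acting_as(user):
    """Run the enclosed block under the given user's identity, then restore."""
    token = current_user.set(user)
    try:
        yield
    finally:
        current_user.reset(token)


def data_tier_read(record_id):
    """Lowest tier: does the work and records who it was done for."""
    AUDIT_LOG.append((current_user.get(), "read", record_id))
    return {"id": record_id}


def web_tier_handler(user, record_id):
    """Top tier: adopts the human user's identity for the whole call chain."""
    with acting_as(user):
        return data_tier_read(record_id)
```

A call through `web_tier_handler("alice", 42)` is logged against alice, while a direct `data_tier_read(7)` is logged against the service account.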

The obvious perception is that this is, of course, more secure. But that’s just not true…


Subversive Patterns

Well, we all do a jolly good job of making the right technology decisions and having a good go at building great systems for our customers, but from time to time you get one of those situations where some functionality has been chosen or bought and mandated despite its lack of suitability.

You know the situation: a flashy sales demo is followed by a purchase, then the CTO/Strategy Team/Architecture Group/Project Sponsor hands you the box and says “we’re going to use this for our foobits logic processing”.

Well, here are two subversive patterns I’ve used to tackle the problem…
