Big Data, Large Batches and My Mistake

This is week 9 for me in my new challenge at Callcredit. I wrote a bit about what we’re doing last time and can’t write much about the detail right now as the product we’re building is secret. Credit bureaus are a secretive bunch, culturally. Probably not a bad thing given what they know about us all.

Don’t expect a Linked Data tool or product. What we’re building is firmly in Callcredit’s existing domain.

As well as the new job, I’ve been reading Eric Ries’ The Lean Startup, tracking Big Data news and developing this app. This weekend the combination of these things became a perfect storm that led me to a D’Oh! moment.

One of the key points in Lean Startup is to maximise learning by getting stuff out as quickly as possible, and the main way to do that is to work in small batches. There are strong parallels here with Agile development practices and the need to get a single end-to-end piece of functionality working quickly.

This GigaOm piece on Hadoop’s days being numbered describes the need for faster, smaller batches too, in the context of data analysis responses and incremental changes to data. It introduces a number of tools, some of which I’ve looked at and some I haven’t.

The essence of moving to tools like Percolator, Dremel and Giraph is to reduce the time to learning; to shorten the time it takes to get round the data processing loop.

So, knowing all of this, why have I been working in large batches? I’ve spent the last few weeks building out quite detailed data conversions, but without a UI on the front to get any value from them! Why, given everything I know and all that I’ve experienced, didn’t I build a narrow end-to-end system that could be quickly broadened out?

A mixture of reasons, none of which really holds up; just tricks of the mind.

Yesterday I started to fix this and built a small batch, end-to-end, run that I can release soon for internal review.


3 thoughts on “Big Data, Large Batches and My Mistake”

  1. Isn’t that the nature sometimes? The ‘Map’ needs all, or as much data as possible, to add accuracy/detail to the ‘Reduce’?

    If the process is a relatively simple conversion, where certain chunks of data can be treated independently of the rest, then all good.

    However, even in the first case it is typical to select a reasonably representative sample fraction of the whole. In fact, that is the case even when the domain data set is not insignificant but not ‘big data’, perhaps anywhere over 1e6 facts, and the compute is much worse than O(n).

  2. Interesting to read you’re using BigData. How are you getting along with it? We tried it out, but found that the federated setup was quite complex to get working.

  3. Are you trying to shorten development cycles? Pervasive Big Data offers a visual end-to-end platform, called RushAnalyzer, that allows data scientists to quickly drag and drop workflows for the full data life cycle that would otherwise be challenging to code in parallel.
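The first comment’s sampling point can be sketched in a few lines. This is a toy illustration only; the function name, the 1% fraction and the fixed seed are my assumptions, not anything from the post or the comments:

```python
import random

def sample_fraction(records, fraction, seed=42):
    """Keep a reproducible random ~fraction of records.

    When the downstream computation is much worse than O(n),
    running it on a representative sample first shortens the
    learning loop instead of waiting on the full data set.
    """
    rng = random.Random(seed)
    return [r for r in records if rng.random() < fraction]

# Toy run: a million "facts", keep roughly 1% for the expensive step.
facts = range(1_000_000)
sample = sample_fraction(facts, 0.01)
print(len(sample))  # roughly 10,000
```

Fixing the seed makes the sample repeatable between runs, so an expensive analysis can be re-run on the same subset while the pipeline around it changes.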
