Big data is the production of data on a large scale – like searches in a search engine or purchases at a national retailer. Its mind-blowing potential to understand consumers, employees and infrastructure could provide a tool for making fundamental changes in the way we do business, from PR to HR.
However, there’s concern that the advantages of big data for today’s businesses have been overblown, and that methods for processing data need to mature before we start throwing resources at them. According to Kapowsoftware, 23 per cent of business leaders said big data hadn’t lived up to its ‘promises’, even though investment is relatively large: IDG recently published a report estimating that $8m (£4.67m) of investment went into big data research from enterprises in 2012.
A recent case study of ‘Google Flu Trends’ illustrates this potential for error all too well.
Google Flu Trends is a project by Google which uses an algorithm to predict outbreaks of flu and flu-like illness (ILI) around the world. It sounds like a perfect use of big data: people are always searching on Google (it is far and away the most visited site online), and indeed it was called ‘remarkable’ in peer studies when it first began generating data about the outbreak of the H1N1 virus in the autumn of 2009.
But soon the veneer – unusually for a company the size of Google – began to crack. As a recent study reviewing the GFT project put it: “Studies demonstrating modest associations with [field] outcomes have been presented as accurate, without further model validation.”
In plainer terms: GFT was good enough to create a rough idea of what was happening in the real world, yes, and that’s probably why people started calling it remarkable. Not to be facetious, but I could make a fair summary of ILI outbreaks by watching the news.
By the time of this study, it was becoming clear that GFT was not a good way to monitor a flu epidemic. Studies began to pop up finding that it had missed the 2009 outbreak, and that an updated GFT algorithm introduced later that year went on to overestimate flu outbreaks in 2013 and 2014. GFT was simply not a precise tool for estimating flu.
From GFT, what can we learn about using data in businesses?
The first thing we have to talk about is what big data is. Some people think it’s just like small data, but more of it. That isn’t the case. Big data is a large data set but, importantly, one with a huge number of variables – something small data sets generated in a lab typically don’t have.
Imagine ‘traditional’ data generated in a lab, with scientists deciding which factors combine to create a data point (heat, time, environment). Big data, in contrast, typically comes from real life – with nothing controlling it – so it becomes very difficult to establish what is affecting the result that comes out at the end.
In other words, the biggest strength and the biggest weakness of big data are the same thing: it’s generated in the real world, which makes it accurate but difficult to understand.
This makes traditional statistical tests unsuitable for measuring large data sets. Clifford Lam, an associate professor in the Department of Statistics at the London School of Economics, says: “In analysing really big data, it is attractive to use those traditional techniques like we did with ‘small data’ without thinking too much.”
However, these techniques – such as linear regression, which amateur statisticians might be familiar with – rely on a limited number of variables, and big data has loads. Lam continues: “There are many traditional statistical techniques that needed to be updated with new statistics research in order to be utilised for Big Data. The problems for linear regression are just the tip of an iceberg.”
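To make that point concrete, here is a minimal sketch (using made-up random numbers, not any real data set) of what goes wrong when ordinary linear regression meets more variables than observations. With 100 variables and only 50 data points, the model can ‘explain’ pure noise perfectly – and then falls apart on fresh data from the same source:

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples, n_features = 50, 100  # more variables than observations
X = rng.normal(size=(n_samples, n_features))
y = rng.normal(size=n_samples)   # pure noise: no real relationship at all

# Fit ordinary least squares (the classic 'small data' technique)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# In-sample fit looks flawless: R^2 is essentially 1.0
pred = X @ coef
r2 = 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)
print(f"In-sample R^2: {r2:.3f}")

# Fresh data from the same (signal-free) source exposes the overfit:
# R^2 collapses, typically going negative
X_new = rng.normal(size=(n_samples, n_features))
y_new = rng.normal(size=n_samples)
pred_new = X_new @ coef
r2_new = 1 - np.sum((y_new - pred_new) ** 2) / np.sum((y_new - y_new.mean()) ** 2)
print(f"Out-of-sample R^2: {r2_new:.3f}")
```

The more variables a data set has, the more room a naive model has to find flukes like this – which is one plausible reading of how GFT could look ‘remarkable’ in back-fitting and still miss real outbreaks.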
He advocates understanding the tests being used on the data before you extrapolate from it, as well as cross-referencing it with historical data – the kind that unravelled the GFT algorithm.
So are businesses being sold down the river by big data enthusiasts? Words like “consumable,” “actionable” and “user-centric” are applied to these data points, and while they may be accurate they are not necessarily helpful. We need to show some big data maturity by using phrases like “corroborated by…” or “built upon experiments by…” – phrases which show everyone that this data isn’t just coming out of the ether.
The big question is: can managers make use of this information for effective results? Yes – of course – but very, very carefully. Lam says: “It is better to evaluate the results whenever one has the chance [and] always use historical data to test if the methods you are using makes sensible estimation/predictions.”
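What might Lam’s advice look like in practice? A very simple back-test: before trusting a model’s predictions about the future, score its past predictions against what actually happened. The figures below are hypothetical, not real GFT numbers:

```python
# Hypothetical weekly figures for illustration only
historical_cases = [120, 135, 160, 210, 190, 150]   # what was actually observed
model_predictions = [118, 140, 220, 310, 260, 170]  # what the model said at the time

def mean_absolute_pct_error(actual, predicted):
    """Average absolute error as a fraction of the observed value."""
    errors = [abs(a - p) / a for a, p in zip(actual, predicted)]
    return sum(errors) / len(errors)

mape = mean_absolute_pct_error(historical_cases, model_predictions)
print(f"Average error against history: {mape:.0%}")

# A sanity threshold (20 per cent here is an arbitrary example): if the model
# misses this badly on data we already know the answer to, don't trust it
# on data we don't.
if mape > 0.20:
    print("Fails the historical check - revalidate before relying on it")
```

Nothing here is sophisticated – which is rather the point. A check this cheap is exactly the kind of validation the GFT critics found missing.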
This is undoubtedly an exciting time. Big Data is like a newly exposed gold seam: there’s so much we can grab, but right now we lack the technology to mine it effectively. Data scientists are deep in research in order to sharpen up these technologies; soon we might be in a new era of understanding people.