From where we’ve been with our phones to what type of coffee we drink at a cafe, the digital trail we leave every day carries the sort of data that companies and governments seem eerily keen to collect these days.
Walk into a Big Data seminar, as I did at an Intel event this week in Vietnam, and you’re often told how important it is to get geared up with the latest in data centre technologies to prepare for the deluge of data coming in the years ahead.
Petabytes of data, for sure, will be collected, stored and analysed. Faster chips are needed to compute. Cloud computing provides the flexibility and scalability, while greener data centres keep behemoth data warehouses running more cheaply by cutting power usage.
Yet, while geeks are fascinated by the possibilities of the technology, you wonder if the people making decisions based on what the Big Data machine churns out have really put enough thought into the magic before their eyes.
There is a scary assumption that you can throw a bunch of data – usually, the more data sets you have the better, like historical records going back decades – into a machine and it will churn out answers.
How can a telecom operator win back customers who have “churned” or left for a rival? Can you detect which user logging in is a potential hacker? Big Data provides some answers.
Or, more curiously, can you predict how likely a person is to commit a crime? Or whether he’s a criminal? Apparently, you can do that too, with varying accuracy.
By analysing data as innocuous as your skin and eye colour and whether you have tattoos or traffic tickets, a background check system in the United States can now “reasonably” determine if you have previously committed a serious crime.
Sounds like the predictive policing in Minority Report? Certainly, it rings alarm bells. Putting aside the controversy of policemen targeting a specific racial group, or even privacy concerns with the collection of such data, the bigger issue is how accurate these results are.
Interestingly, according to the Bloomberg report that detailed the technology, the accuracy can be tweaked. Depending on how many false positives you can tolerate, you can identify all the felons in a state through this sort of Big Data profiling.
So, to pick up all 51,246 felons in Kentucky, you would have to accept that you would wrongly identify 2,220 non-felons. If you slide the scale in the opposite direction, you could identify only 37,842 felons, but misidentify a smaller number of non-felons – 152.
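To make the trade-off concrete, here is a rough Python sketch that works out what each end of the scale means, using only the four figures quoted above. The class and variable names are made up for illustration, and the sketch assumes 51,246 is the total number of felons in the data. It does not try to compute a false-positive rate, because the report's total number of non-felons isn't given here.

```python
from dataclasses import dataclass

# Figure quoted in the article; assumed here to be every felon in the data set.
TOTAL_FELONS = 51_246


@dataclass(frozen=True)
class OperatingPoint:
    """One setting of the sliding scale (hypothetical name, for illustration)."""
    label: str
    felons_flagged: int      # felons correctly identified
    non_felons_flagged: int  # non-felons wrongly identified

    @property
    def recall(self) -> float:
        # Share of all felons that this setting catches.
        return self.felons_flagged / TOTAL_FELONS

    @property
    def precision(self) -> float:
        # Share of flagged people who really are felons.
        return self.felons_flagged / (self.felons_flagged + self.non_felons_flagged)

    @property
    def felons_missed(self) -> int:
        return TOTAL_FELONS - self.felons_flagged


catch_all = OperatingPoint("catch every felon", 51_246, 2_220)
cautious = OperatingPoint("fewer false alarms", 37_842, 152)

for point in (catch_all, cautious):
    print(
        f"{point.label}: recall {point.recall:.1%}, "
        f"precision {point.precision:.1%}, "
        f"felons missed {point.felons_missed:,}"
    )
```

On these figures, the cautious setting wrongly flags 2,068 fewer people but lets 13,404 felons slip through, which is roughly a quarter of them. Neither setting is "the accurate one"; somebody has to choose which kind of mistake is more tolerable.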
When accuracy is on a sliding scale, you have to ask how the result is actually derived. More worryingly, who will be sliding this scale and deciding how much of a tyrant he wants to be with the data?
This brings up a huge issue for Big Data. Many of its proponents don’t mention that results are often based on correlation, not causation.
In other words, there is some sort of link between, say, a person who has light skin and hazel eyes and the chances of him being a felon. However, those characteristics certainly don’t make a person commit a crime.
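A toy example in Python shows how a link like this can appear even when the trait does nothing. The counts below are invented purely for illustration and come from no real data set; "hidden factor", "trait" and "outcome" are placeholder names. The assumption is that some unmeasured factor makes both the trait and the outcome more common.

```python
# Invented counts, for illustration only (not from any real data set).
# Key: (hidden_factor, trait, outcome) -> number of people.
counts = {
    (0, 1, 1): 10,  (0, 1, 0): 90,
    (0, 0, 1): 90,  (0, 0, 0): 810,
    (1, 1, 1): 810, (1, 1, 0): 90,
    (1, 0, 1): 90,  (1, 0, 0): 10,
}


def outcome_rate(trait, hidden=None):
    """Share of people with this trait value who have the outcome.

    If `hidden` is given, only people in that hidden-factor group are counted.
    """
    selected = {
        key: n
        for key, n in counts.items()
        if key[1] == trait and (hidden is None or key[0] == hidden)
    }
    total = sum(selected.values())
    with_outcome = sum(n for key, n in selected.items() if key[2] == 1)
    return with_outcome / total


# Looking at everyone together, the trait seems to matter a great deal.
print("all:", outcome_rate(trait=1), "vs", outcome_rate(trait=0))

# Within each hidden-factor group, the trait makes no difference at all.
for group in (0, 1):
    print(
        f"hidden factor {group}:",
        outcome_rate(trait=1, hidden=group),
        "vs",
        outcome_rate(trait=0, hidden=group),
    )
```

Across everyone, people with the trait have the outcome 82 per cent of the time against 18 per cent for those without it. Split by the hidden factor, the rates are identical (10 per cent in one group, 90 per cent in the other). A machine that never sees the hidden factor will happily report the trait as a strong predictor.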
The difference is obvious to any trained statistician or social science practitioner.
But what of the police chief of a small town looking at data suggesting the probability of a suspect being a criminal? Or the CEO placing his biggest bet in a market based on what his newfangled Big Data machine is telling him?
Many decisions in the years ahead, you fear, will be made on a loose correlation rather than a strong connection among various sets of data.
Indeed, with the hype that surrounds Big Data now, organisations are even pressurised to quickly make use of all the information they have collected over the years.
That’s not the only issue. Often, the other unknown is the quality of data. There are automated tools, for example, to “scrub” or remove duplicate entries in a database, but there aren’t many to stop an inexperienced data scientist or analyst from dumping irrelevant data into the mix and coming up with a wildly off-target result.
If Nokia were to look at its historical data on customer buying patterns, would it be able to sell its future phones better? Possibly, by identifying what people liked, for example, the industrial design.
But you have to remember the data might come from a time before the iPhone turned up and turned the market upside down. Apple’s touch-screen smartphone marked the start of an abrupt downturn in Nokia’s fortunes, and data from before that black swan-like arrival of the iPhone may not give the Finnish giant useful insights into future trends.
This is a simplistic example, but it shows that data analysts have to find the right data to look for answers. Garbage in, garbage out, as they say. That holds true for Big Data as well.
To be fair, there have been some early successes in recent years. According to Bloomberg, police in various American states have used historical data to identify where crime hotspots are and better allocate resources.
That’s the kind of boring but straightforward stuff that previously wasn’t possible without better hardware and new analysis tools. There just wasn’t a way to process and store all that data, which includes video these days.
Lots more cool but practical stuff is done with Big Data, of course.
For example, Formula One drivers have huge amounts of data fed live to their teams, who then analyse it in real time and advise on the best course of action in the cockpit.
Big Data is also about going deeper, not just wider. Did you know you can drill down to all the detailed player stats on the NBA’s website, going back decades? That sort of power is only possible with new technology – fast in-memory processing of data, for one – that you’ll hear about increasingly at Big Data industry forums.
The technology is indeed providing the magic. What is not so convincing is the belief that Big Data is the answer to so many questions we seek, from predicting future trends to identifying felons.
So far, the industry has been beating the drum with success stories. What we need to hear more of are the false positives, the horror stories.
In social science, experts take the data, analyse it and put forward a theory to be critiqued and scrutinised. Surely, some sort of peer review is due for many of the big promises from Big Data.