Big data technologies are helping organisations make sense of vast troves of data in industries ranging from healthcare and transportation to defence and security, but more big-data applications are needed to drive wider adoption of the technology.
Speaking to Techgoondu on the sidelines of the Strata + Hadoop World conference in Singapore earlier this month, Doug Cutting, creator of Hadoop, an open source software framework for crunching big data sets, noted that while many organisations are now more data-driven, there aren’t many big-data applications aimed at specific industries.
“The conversations that are taking place within industries are more about technology rather than industry-specific applications,” said Cutting, who also serves as chief architect at Cloudera.
“But I think (big-data applications) are something we’re going to see unfolding, as we switch from an older model of proprietary systems that handle data in narrow areas within an organisation, to an open platform that stores data on just about any aspect of a business.”
For now, such applications are typically developed in-house by organisations, though Cutting expects independent software vendors to lead the way over the next few years, particularly for common use cases such as fraud and risk management.
Gaps in the Hadoop stack
Much like the Linux operating system, which comprises the Linux kernel and a slew of open source projects, the Hadoop platform is made up of multiple open source components that process large data sets. But because the components are developed by independent open source development groups, integration has remained a key challenge.
“In the Hadoop stack today, there’s a lack of integration so it’s hard to seamlessly go from one tool to the next, and have your data remain secure and in compatible formats,” Cutting said. “But it also offers an opportunity for a company like Cloudera to address this gap with a seamless platform for users.”
He noted that addressing the integration issue is a mammoth task, with security often becoming an afterthought because the technologies are constantly evolving. That said, he credited the Hadoop community for continuing to plug the technical gaps in the big data platform.
“Initially, we had a platform that was good for batch processing, but over the past few years, there have been more online processing tools,” he said, citing examples such as HBase, a storage engine typically used for online, real-time reads and writes, and Solr, an enterprise search platform.
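To illustrate the difference between those online tools and the platform’s batch-processing roots, here is a minimal sketch of a single-row write and read against HBase, assuming a local HBase Thrift gateway and the third-party happybase Python client; the table and column names are made up for the example.

    import happybase

    # Connect to a local HBase Thrift gateway (hypothetical host and port).
    connection = happybase.Connection('localhost', port=9090)
    table = connection.table('user_profiles')  # hypothetical table

    # Online-style access: write and read back a single row with low latency,
    # rather than scanning the whole data set as a batch job would.
    table.put(b'user-42', {b'info:name': b'Alice', b'info:country': b'SG'})
    print(table.row(b'user-42'))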
However, Cutting pointed out that there’s still a lack of a full transaction engine, not only because such engines are complicated and difficult to build, but also because most big data processing tasks today do not involve transactional data.
“Most of the value that people are getting from big data is from time-series data and log data where you don’t just want to know the current value, you also want to analyse the past values, so there hasn’t been great demand for transaction processing,” he said.
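As a rough illustration of the kind of log and time-series analysis Cutting describes, the sketch below uses PySpark to aggregate a hypothetical web-server log by day, so that past error counts can be compared against the current one; the file path and column names are assumptions for the example rather than anything discussed in the interview.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("log-trend").getOrCreate()

    # Hypothetical access log with a timestamp and an HTTP status column.
    logs = spark.read.json("hdfs:///logs/access/*.json")

    # Time-series view: daily server-error counts, not just the latest value.
    daily_errors = (
        logs.filter(F.col("status") >= 500)
            .groupBy(F.to_date("timestamp").alias("day"))
            .count()
            .orderBy("day")
    )

    daily_errors.show()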
Cutting also touched on the evolution of the Hadoop platform, which he said would not suffer the same forking issues that have plagued Linux. “There’s been a lot less fragmentation for Hadoop, as the compatibility between components in different Hadoop distributions is very high,” he said.
This compatibility will also drive greater innovation on the Hadoop platform based on the evolving needs of the market.
“The Hadoop ecosystem has shown that it can get better over time, where we’ve already seen components being swapped out, such as MapReduce (a processing engine) being replaced to a large degree by Spark,” Cutting said.
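To give a sense of why Spark has displaced MapReduce for many workloads, the classic word count, which requires a full Java program with separate mapper and reducer classes in MapReduce, fits in a few lines of PySpark; the input path below is a hypothetical placeholder.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("wordcount").getOrCreate()

    # The same job that needs mapper and reducer classes in MapReduce
    # is a short chain of transformations in Spark.
    counts = (
        spark.sparkContext.textFile("hdfs:///data/sample.txt")  # hypothetical path
             .flatMap(lambda line: line.split())
             .map(lambda word: (word, 1))
             .reduceByKey(lambda a, b: a + b)
    )

    print(counts.take(10))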