The IT world is already struggling to handle the exponentially increasing data volumes becoming available as part of big data applications. And now an even bigger data challenge, in every sense of the word, is hurtling toward us: the internet of things.
The combination of the internet of things (IoT) and big data is significantly expanding the scope of data management initiatives. In 2003 -- still not so very long ago -- the biggest data warehouse in the world was 30 terabytes (TB) in size. Now, the engines in airplanes such as the Boeing 787 produce half a terabyte of data -- if not more -- in a single flight. By some estimates, a self-driving car could churn out more than 1 TB per hour. Collecting operational data multiple times per second from thousands of manufacturing devices on a factory floor can also quickly add up to terabytes of information -- or even petabytes for large manufacturers.
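To see how quickly factory-floor readings add up, here is a rough back-of-the-envelope sketch. The device count, sampling rate and record size are hypothetical figures chosen for illustration, not numbers from any particular manufacturer, and the function name is likewise made up for this example.

```python
# Back-of-the-envelope estimate of raw sensor data volume.
# All inputs below are hypothetical, chosen only for illustration.

SECONDS_PER_DAY = 24 * 60 * 60


def daily_volume_bytes(devices: int, samples_per_second: float, bytes_per_sample: int) -> float:
    """Raw bytes produced per day, before any compression or filtering."""
    return devices * samples_per_second * bytes_per_sample * SECONDS_PER_DAY


if __name__ == "__main__":
    # Assumed scenario: 5,000 devices, 10 readings per second, 200 bytes per reading.
    per_day = daily_volume_bytes(devices=5_000, samples_per_second=10, bytes_per_sample=200)
    print(f"{per_day / 1e12:.2f} TB per day")
    print(f"{per_day * 365 / 1e12:.0f} TB per year")
```

Under those assumptions, a single plant generates a little under 1 TB a day and a few hundred terabytes a year, so a manufacturer with many such plants, or with richer sensor payloads, would plausibly reach petabyte scale.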
Gartner forecasts that nearly 21 billion devices will be connected to the IoT by 2020; IDC is even more bullish, predicting 30 billion connected devices by then. Putting sensors in fridges, cars, wearable health monitors, smart meters and various kinds of industrial equipment -- even on farm animals -- creates a raft of business possibilities. A 2015 McKinsey Global Institute study reckoned that within a decade, the IoT could generate as much as $11 trillion a year in overall economic value.
While numbers like those may be based only on educated speculation, they're big enough to get the attention of CEOs around the world. However, the amounts of data being generated by IoT and big data systems bring with them considerable challenges for IT teams, and not just the basic ones of data storage and processing.
Architectural upgrades required for IoT
Many organizations looking to take advantage of IoT data need to start by bulking up their IT architecture through the addition of Hadoop clusters and related big data technologies. That isn't a simple matter: We've seen plenty of issues with the Hadoop systems being used in big data applications today, often in place of mainstream relational databases. Such systems still require the use of traditional tools and processes to manage data and help data scientists and other analysts make sense of it.
Early adopters of Hadoop had to resort to tricky programming to carry out tasks that were automated in databases decades ago. The IT industry has responded, and many existing technologies that have relational database roots are now being retrofitted with support for Hadoop. For example, some data quality tools now include Hadoop connectors. But there's a long way to go to build up a functionally equivalent set of data management capabilities for IoT and big data environments.
The big data community has also adapted by coming up with technical advancements such as YARN, the cluster resource manager released in late 2013 to decouple resource management from the MapReduce batch processing framework, so that processing engines other than MapReduce can run against data stored in the Hadoop Distributed File System (HDFS). YARN lets Hadoop clusters be more easily used in real-time and interactive applications, including processing and analysis of IoT data streams.
The Spark processing engine is another example. It uses in-memory techniques to dramatically speed up many batch workloads; it also supports data streaming and machine learning, while giving IT teams a choice between using HDFS or other data stores. But Spark, YARN and other big data tools create a multitude of technology choices for IT managers to consider, and keeping up with the pace of development on all of them can be a tall order.
A question of data ownership
Not all IoT data management challenges are technical in nature. One issue that has barely been addressed is who actually owns all the data being produced by cars, fitness trackers, electricity meters and other devices -- consumers or the companies collecting and processing the data? Some of it will be quite sensitive, revealing aspects of people's lives, including their whereabouts during the day and some types of medical information. The potential consequences of, say, personal health data being made available to insurers or the human resources department of a prospective employer are serious indeed.
A more immediate and practical issue is the security of IoT data. We have enough trouble as it is securing the data on computers and phones from malware and hackers, but at least there's a healthy selection of antivirus, firewall and other security software to help us out. Data protection measures on sensors and machinery are frequently rudimentary. Such devices often have a simple keypad interface, limiting them to short numeric passcodes that are eminently hackable.
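A quick calculation shows why short numeric passcodes offer so little protection. The sketch below assumes an attacker can try one code per second and that the device has no lockout or rate limiting; both are illustrative assumptions, not measurements of any real device, and the function names are hypothetical.

```python
# Rough estimate of how long an exhaustive search of a numeric passcode takes.
# The guess rate and the absence of any lockout are assumptions for illustration.


def keyspace(digits: int) -> int:
    """Number of distinct passcodes with the given number of decimal digits."""
    return 10 ** digits


def worst_case_hours(digits: int, guesses_per_second: float) -> float:
    """Hours needed to try every possible passcode at a steady guess rate."""
    return keyspace(digits) / guesses_per_second / 3600


if __name__ == "__main__":
    for digits in (4, 6):
        hours = worst_case_hours(digits, guesses_per_second=1)
        print(f"{digits}-digit code: {keyspace(digits):,} combinations, "
              f"about {hours:.1f} hours to exhaust")
```

At that rate, every possible four-digit code can be tried in under three hours, and an automated attacker guessing faster than once a second would need far less time.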
A series of vulnerabilities has been demonstrated in automobiles with internet connectivity -- one incident in 2015, in which researchers remotely took control of a Jeep Cherokee, resulted in the recall of 1.4 million vehicles. That same year, a group of University of South Alabama medical students showed how they hacked into a wireless-enabled "patient simulator" and took control of a pacemaker installed in it. Given that IoT devices control items with such serious safety implications, the possibility of malevolent attacks is particularly worrying.
Every new technology faces challenges, and there will doubtless be no shortage of creative efforts to address these problems and others yet to be identified. However, the danger is that the business opportunities presented by the IoT and big data are so vast that there will be a temptation to rush onward and hope that all the data management issues will somehow get fixed later. IT history suggests that trying to retrofit solutions to such challenges rather than designing them in from the beginning is messy at best.
Closing the barn door after the horse has bolted may be a useful metaphor in this context. Indeed, given that IoT-connected devices are already being fitted on horses, cows and other barnyard animals, it may turn out to be all too literal an expression.
About the author
Andy Hayler is co-founder and CEO of The Information Difference Ltd. and a frequent keynote speaker at conferences on master data management, data governance and data quality.