After Red Hat Summit, I traveled up to Maine for a vacation with my family. The area was cool, quiet and beautiful, and the rugged coastline was remarkable. I found myself particularly entranced by the natural glacial lakes of New England, so different from the man-made creations we have in Georgia. You see, lakes in Georgia are engineered by constructing dams and diverting water from ponds, rivers and swamps. The water released downstream is regulated according to water use, including power generation, navigation, recreation and agricultural irrigation. It got me thinking about the difference between siloed data stores — sometimes referred to as data swamps — and the natural data lakes that seem better suited for IoT implementations.
Natural versus engineered
If you think about it, data warehouses are similar to Georgia’s ponds, rivers and swamps: separate but important ecosystems that can be architected to work as one, if you apply enough engineering. They are great for giving you a curated, structured picture of the state of a business. But remember that they only provide subsets of data, and that data can be biased by the selection criteria of your search. Like our southern reservoirs, they are heavily regulated, and you pretty much know exactly what kind of fish you’re going to get if you decide to throw in a line.
Data lakes are a different type of data store, one that grows organically. When you go fishing for information here, you’re more likely to catch a keeper: casting off in these waters will often reveal correlations and patterns you may not have seen before, which can feed directly into your decision-making processes.
The basic differences between a data warehouse and a data lake are summarized in Tamara Dull’s blog post, “Data Lake vs. Data Warehouse: Key Differences.”
Finding your prize in a lake full of data
In an IoT architecture, we are faced with thousands of sensors collecting masses of unstructured data, from sound to video to clicks, all of it streaming quickly into the raw data pool. Yet much of this data isn’t being used. A 2015 McKinsey & Company study of the energy industry found that less than 1% of the data gathered by sensors on offshore oil rigs was used for decision-making. And a recent Cisco survey indicated that up to 60% of IoT projects either stall or fail, both because businesses are not well enough prepared to reap the benefits of their data and because of the poor quality of the data collected.
It seems to me that we need better fishing tools. What if I went fishing in this data lake with the right application as bait? I could start to see patterns from what I was pulling in. Imagine a business setting where I wanted to identify the conditions for failure on my equipment. Looking at the data, I can note that when the equipment hits a particular temperature, the humidity is over a certain percentage and the machinery has been in use for a specific number of hours, there’s a failure. Once I’ve identified a replicable pattern, I can take appropriate action: work to reduce the temperature and humidity, and possibly modify the maintenance schedule. That’s one type of fish I could catch.
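A pattern like this can be sketched as a simple filter over sensor records. To be clear, the field names, thresholds and sample readings below are hypothetical, purely to illustrate the idea of mining a failure condition out of a lake of raw readings:

```python
# Hypothetical sketch: flag sensor readings matching a failure pattern
# (high temperature AND high humidity AND long service time).
# Field names, thresholds and data are illustrative, not from a real deployment.
readings = [
    {"temp_c": 85, "humidity_pct": 70, "runtime_hrs": 1200},
    {"temp_c": 60, "humidity_pct": 40, "runtime_hrs": 300},
    {"temp_c": 88, "humidity_pct": 75, "runtime_hrs": 1500},
]

def matches_failure_pattern(r, temp=80, humidity=65, hours=1000):
    # All three conditions must hold for the reading to be flagged.
    return (r["temp_c"] > temp
            and r["humidity_pct"] > humidity
            and r["runtime_hrs"] > hours)

at_risk = [r for r in readings if matches_failure_pattern(r)]
print(f"{len(at_risk)} of {len(readings)} readings match the failure pattern")
```

In a real data lake you would run this kind of filter at scale with an engine like Spark or Hadoop, but the logic — a handful of thresholds discovered in the raw data — is exactly this simple.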
At the same time, I’ve noticed that the machinery deployed before 2012 is having repeated failures not related to these conditions. Since that older machinery comes from several different manufacturers, age rather than vendor is the likely culprit, and it should probably be replaced. Once again, I’ve hauled in another prize catch.
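This second catch is a cohort analysis: group the failure records by deployment year and check whether the older cohort spans multiple manufacturers. Again, the asset IDs, years and vendor names below are made up for illustration:

```python
from collections import Counter

# Illustrative sketch: group failures by deployment era to spot an
# age-related problem. All records here are hypothetical.
failures = [
    {"asset_id": "A1", "deployed": 2010, "manufacturer": "Acme"},
    {"asset_id": "A2", "deployed": 2011, "manufacturer": "Borg"},
    {"asset_id": "A3", "deployed": 2011, "manufacturer": "Cyber"},
    {"asset_id": "A4", "deployed": 2015, "manufacturer": "Acme"},
]

# Count failures in each era.
by_era = Counter("pre-2012" if f["deployed"] < 2012 else "2012+"
                 for f in failures)

# If the pre-2012 failures span several vendors, age (not manufacturer)
# is the likelier cause.
vendors_pre2012 = {f["manufacturer"] for f in failures if f["deployed"] < 2012}

print(dict(by_era))
print(f"pre-2012 failures span {len(vendors_pre2012)} manufacturers")
```

Failures clustering in the older cohort across multiple vendors is the signal that replacement, not a vendor change, is the right action.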
Use the right fishing pole and avoid the swamps
Georgia’s reservoirs remind me more of a siloed data store: several swamps of databases, file systems and Hadoop clusters pulled together, with no efficient way to find, prepare and analyze the data unless the engineers manage to build something connecting it all. It takes a whole lot of engineering to create those lakes.
With analytics applications like those provided by our open source partner, Cloudera, you can use its Hadoop engine and machine learning to sift through patterns and correlations in data lakes. Its application in IoT environments, with the scalability, security and flexibility of its Hadoop platform, is a natural fit, as natural as the glacial lakes I found in Maine. Analytics solutions like these can help you take advantage of complete lakes of data to optimize your IoT implementation and find the nuggets hidden within. If you use a data store architecture better suited to IoT implementations, you can show a greater ROI on your IoT investment and start fishing where they’re biting.
All IoT Agenda network contributors are responsible for the content and accuracy of their posts. Opinions are those of the writers and do not necessarily convey the thoughts of IoT Agenda.