In part one of our examination of the distributed computing lexicon, we discussed the importance of ensuring data availability and usability in distributed internet of things architectures. In order that data remain always available for our applications, we must construct distributed infrastructures so that if one node or network component fails, the data it houses is still accessible from other nodes of the same system. Usability refers to limiting latency as data volume grows and demand from dispersed geographies and disparate networks expands. In this second part we’ll look at what it means to achieve data accuracy and ease of operational support while maintaining cost effectiveness.
It’s available and usable, but how do you make sure your distributed data’s accurate?
A traditional database holds data for transactional systems. Transactional systems ensure data accuracy by serializing updates to the data. This ensures data accuracy in a scenario with multiple potential writes to a data record, because the record is locked until the write is complete. This works when architectures are centralized, but a centralized infrastructure is likely to prove ineffective in most large IoT deployments.
Large IoT deployments look to distributed databases like NoSQL, which are built to ensure data reliability and usability and the ability to effectively deal with geographically distributed data. However, write conflicts are inevitable when using a highly available, distributed database architected for availability. The most common ways this can happen are as a result of the inevitable network partition or hardware/node failure. Additionally, the latency inherent in distributed systems — where data centers and clients may be on opposite sides of the world — can impact the order in which your system sees updates. This, in turn, can disrupt the system’s ability to understand which data value is right. The way your distributed database resolves conflicts that arise from these issues determines how accurate your data is.
It’s not a simple problem to solve, and there are tradeoffs (see Dr. Eric Brewer’s CAP theorem) that need to be made between the consistency and availability of data when looking to scale an IoT deployment. A distributed architecture — designed to ensure availability — will inevitably create a situation in which multiple writes occur that conflict with one another. When deploying any distributed database, the availability of the information in that database to your application is critical, and the control a database gives you over resolving these conflicts is a key part of this deployment.
There are many examples of situations that can generate these conflicts in IoT deployments. A sensor(s) may disconnect due to a malfunction, node failure or network partition. When the condition heals, multiple updates to the same value may conflict with each other. Another example is if an application is keeping count of how many errors an IoT deployment sees from devices over a certain time period and a network error occurs, then different nodes will provide different values for the application’s counters. Another situation that can occur when multiple copies of data are being kept is data rot on disk. If over time one copy of the data gets corrupted, how do we know which copy of the data is accurate?
To take advantage of the scale and resiliency of distributed databases, techniques have been developed to address the data accuracy challenges. Look for systems that use logical clocks vs. system or network clocks to determine the order of writes and updates, and look for systems that offer automatic read repair for corrupt or inconsistent data. These types of capabilities, once words thrown around the halls of academia, have become relevant again as enterprises look for ways to ensure their distributed IoT systems function correctly. Now that the data is accurate, it’s time to scale the system for the growth along the horizon.
How do you build a distributed computing system without massive overhead costs?
The distributed computing world of old revolved around vertical architectures. When one database machine became overwhelmed, an organization added resources to the machine. In some instances, scaling vertically could mean moving from a machine with fewer resources to one with a larger resource pool. These transitions involved downtime and purchasing new components for each upgrade. Scaling vertically today would mean inducing downtime and constantly purchasing new resources to handle growing data volumes. The overhead costs of operating a vertical architecture would balloon out of proportion to the utility the system offered. Cost isn’t the only issue with traditional relational database management systems; we meet companies who can’t find hardware at any cost that can’t possibly scale to meet the demands of their applications.
If a weather company has only 10 sensors in the field, it’s likely that one server will be able to handle ingesting, storing and analyzing measurements from those sensors. As the deployment grows to hundreds, even thousands of sensors, the company will have to add more servers to handle the ingesting of data and look to start regionalizing this data collection. Also, as the sensor volume grows, the company will want to perform data analysis closer to where the devices and analysis. These are challenges traditional databases cannot handle without administrators having to manually segment the data across the constantly growing number of servers as well as maintaining application logic that is aware of which server has the required dataset.
Instead, IoT needs horizontal scaling to mitigate costs. Rather than update one server, horizontal architectures link new servers to existing servers in clusters. This helps disperse the resource load and add redundancy, and allows organizations to upgrade and scale without introducing planned downtime. As well as adding scale, reliability and accuracy, modern databases are often more cost effective too as they are often available as open source software and run on commodity hardware. This gives corporations the ability to trial solutions and therefore experience very little production risk.
How do you take distributed data from concept to reality?
We are in an exciting era for IoT big data. Tools exist that finally allow us to do something with all of the information we generate and pull into our pipelines. Distributed, horizontal architectures offer perhaps the best way to elicit value from the data in those pipelines. Distributed architectures enable us to move away from unwieldy data lakes and move us toward powerful edge architectures where we can incorporate data into our business processes. Yet, distributed data offers its own set of issues we must contend with.
A lot of solutions exist that offer to fulfill the promise of distributed data. Not all of these solutions offer the availability and resiliency, the conflict-resolution standards, or the bang-for-buck necessary to really generate value from IoT big data. Executives charged with making multimillion-dollar distributed data decisions must understand how to achieve data availability, usability and accuracy without incurring high overhead. Otherwise, they risk implementing a data strategy that doesn’t live up to its potential. If you see the potential that IoT applications have for helping you grow your business, I suggest you start developing distributed systems expertise in your business now.
All IoT Agenda network contributors are responsible for the content and accuracy of their posts. Opinions are of the writers and do not necessarily convey the thoughts of IoT Agenda.