Get started Bring yourself up to speed with our introductory content.

The internet of things, database systems and data distribution, part two

In part one of this two-part series, I covered where, in the internet of things, data needs to be collected: on edge devices, gateways and servers in public or private clouds. And I discussed the characteristics of these systems, as well as the implications for choosing appropriate database management system technology.

In this installment, I’ll talk about legacy data distribution patterns and how the internet of things introduces a new pattern that you, or your database system vendor of choice, need to accommodate.

Historically, there are some common patterns of data distribution in the context of database systems: high availability, database clusters and sharding.

I touched on sharding in part one. It is the distribution of a logical database’s contents across two or more physical databases. Logically, it is still one database and it is the responsibility of the database system to maintain the integrity and consistency of the logical database as a unit. Exactly how the data is distributed varies widely between database systems. Some systems delegate (or at least allow delegation) of the responsibility to the application (for example, “put this data on shard three”). Other systems are at the other end of the spectrum, deploying intelligent agents that monitor how the data is queried and by which clients, and moving the data between shards to colocate data that is queried together, and/or to move data to a shard that is closer to the client(s) most frequently using that data. The database system should isolate applications from the physical implementation. See Figure 2.

data distribution, logical database, physical databaseThe purpose of high availability is as its name implies: to create redundancy and provide resilience against the loss of a system that stores a database. In a high availability system, there is a master/primary database and one or more replica/standby database to which the overall system can failover to in the event of a failure of the master system. A database system providing high availability as a service needs to replicate changes (for example, insert, update and delete operations) from the master to the replica(s). In the event of the master’s failure, a means is provided to promote a replica to master. The specific mechanisms by which these features are carried out are different for every database system. See Figures 3a and 3b.

database system mechanisms

The purpose of database clusters is to facilitate distributed computing and/or scalability. Clusters don’t use the concept of master and standby; each instance of a database system in a cluster is a peer to every other instance and works cooperatively on the content of the database. There are two main architectures of database system clusters. See Figure 4a and 4b.

database system clusters architecture

database system clusters architecture

As illustrated in Figure 4a, when clusters are used, and unlike sharding, where each shard contains a fraction of the entire logical database, each database system instance in a cluster maintains a copy of the entire database. Local reads are extremely fast. Insert, update and delete operations must be replicated to every other node in the cluster, which is a drag on performance, but overall system performance (i.e., the performance of the cluster in the aggregate) is better because there are N nodes executing those operations. Nevertheless, this architecture is best suited to read-heavy usage patterns versus write-heavy usage patterns.

High availability is sometimes combined with sharding, wherein each shard has a master and a standby database and database system. Because all of the shards represent a single logical database, the failure of a node hosting a shard would make the entire logical database unavailable. Adding high availability to a sharded database improves availability of the logical database, insulating it from the failure of nodes.

Replication of databases on edge devices adds a new wrinkle to data distribution. Recall that with high availability, and depending on the architecture, cluster, the contents of the entire database are replicated between nodes and each database is a mirror image of the other. However, IoT cloud server database systems need to receive replicated data from multiple edge devices. A single logical cloud database needs to hold the content of multiple edge device databases. In other words, the cloud database server’s database is not a mirror image of any one edge device database, it is an aggregation of many. See Figure 5.

Cloud database server aggregationFurther, for replication in a high availability context, the sending node is always the master and the receiving node is always the replica. In the IoT context, the receiver is most definitely not a replica of the edge device.

Also, for replication in a cluster environment, as depicted in Figure 4a, the database system must ensure the consistency of every database in the cluster. This implies a two-phase commit and synchronous replication. In other words, a guarantee that a transaction succeeds on every node in the cluster or on none of the nodes in the cluster. However, synchronous replication is neither desirable nor necessary for replication of data from edge to cloud.

So, the relationship between the sender and receiver of replicated data in the IoT system is different than the relationship between primary and standby, and between node peers in a database cluster, so the database system must support this unique relationship.

In part one of this series, I led with the assertion: “If you’re going to collect data, you need to collect it somewhere.” But, it’s not always necessary to actually collect data at the edge. Sometimes, an edge device is just a producer of data and there is no requirement for local storage of the data, so a database is unnecessary. In this case, you can consider another alternative for moving data around: data distribution service (DDS). This is a standard defined by the Object Management Group to “enable scalable, real-time, dependable, high-performance and interoperable data exchanges.” There are commercial and open source implementations of DDS available. In layman’s terms, DDS is publish-subscribe middleware that does the heavy lifting of transporting data from publishers (for example, an edge IoT device) to subscribers (for example, gateways and/or servers).

DDS isn’t only limited to use cases in which there is no local storage at an edge device. Another use case would be to replicate data between two dissimilar database systems. For example, one embedded database used with edge devices and SAP HANA. Or, between a NoSQL database and a conventional relational database management system.

In conclusion, designing the architecture for an internet of things technology includes consideration of the characteristics and capabilities of the various hardware components and the implications for selecting appropriate database system technology. A designer also needs to decide where to collect data, where data needs to be processed (e.g., to provide command and control of an industrial environment or to conduct analytics to find actionable information and so on), and what data needs to be moved around and when. These considerations will, in turn, inform the choice of both database systems and replication/distribution solutions.

All IoT Agenda network contributors are responsible for the content and accuracy of their posts. Opinions are of the writers and do not necessarily convey the thoughts of IoT Agenda.