What is a distributed database system?

McObject LLC

I almost want to cringe when I receive a telephone call or email from a potential customer who says something to this effect: “I need a distributed database for a [fill in system name here] that I’m developing, and it looks like eXtremeDB might be a good fit. Can you share licensing terms with me?” I know it’s going to require a few exchanges to get to the bottom of what the person really needs, and I fear that the process will frustrate them. Frustrating an interested party is not in the “Sales 101” handbook.

The problem is, “distributed database” is a severely overloaded term. Let’s see if we can get it sorted.

Wikipedia authors have taken a collective stab at defining a distributed database: “A distributed database is a database in which storage devices are not all attached to a common processor. It may be stored in multiple computers, located in the same physical location, or may be dispersed over a network of interconnected computers. Unlike parallel systems, in which the processors are tightly coupled and constitute a single database system, a distributed database system consists of loosely coupled sites that share no physical components.” That definition, itself, derives partly from the Institute for Telecommunications Sciences associated with the U.S. Department of Commerce.

That definition is actually pretty narrow. There are at least three other use cases that I get asked about under the general heading of “distributed database:” high availability, cluster database, and blockchain. The Wikipedia definition, high availability and cluster all have applicability to the internet of things.

High availability: For a database system to achieve high availability (HA), it needs to maintain, in real time, an identical copy of the physical database in a separate hardware instance. By maintain, I mean keep the copy consistent with the master. In this scenario, there are (at least) two copies of the database that we call the master and the slave(s) (sometimes called replicas). Actions applied to the master database (i.e., insert, update, delete operations) must be replicated on the slave, and the slave must be ready to change its role to master at any time. This is called failover. The master and replica are normally attached to different physical systems, though in telecommunications a common HA setup is multiple boards within a chassis: a master controller board, a standby controller board and some number of line cards that each serves some protocol (BGP, OSPF, etc.). Here, the master database is maintained by processes on the master controller board. The database system replicates changes to a slave database on the standby controller board, which has identical processes waiting to take over processing in the event that the master controller board fails. In the internet of things, HA is desirable for mission-critical industrial systems, to maintain availability of gateways and in the cloud to ensure that real-time analytics can continue to execute even in the face of hardware failures.

Cluster database: A cluster database is one in which there are multiple physical copies of the entire database which are kept synchronized. The difference with HA is that any physical instance of the database can be modified and will replicate its modifications to the other database instances within the cluster. This is where the similarities between database cluster implementations end.

Broadly speaking, there are two implementation models: ACID and eventual consistency. In ACID implementations, modifications are replicated synchronously in a two-phase commit protocol to assure that once they are committed, the changes are immediately reflected in every physical instance of the database. In other words, all the database instances are consistent, all the time. With eventual consistency, changes are replicated asynchronously, possibly long after the originating node committed the change to the database. This implies some sort of reconciliation process to resolve conflicting changes originated by two or more nodes. With eventual consistency, applications must be written to contend with the possibility of having stale data in the physical instance of the database to which they’re attached. For example, consider a worldwide online bookseller. There may be one copy of a certain book in stock. Buyers in New York and Sydney will both see that the book is available, and both can put the book in their shopping cart and check out. The system will have to sort out who really gets the book and whose order is backordered. Users have come to accept this. But this model would never work for a cellular telephone network needing to verify that a subscriber has a certain service or has sufficient funds. This type of system requires a consistent database view. Because of the nature of the synchronous replication required for the ACID implementation, horizontal scalability is limited, but the implementation is straightforward (no conflict resolution needed). Scalability of eventual consistency implementations is quite high, but so is the complexity.

Cluster implementations abound in the internet of things. For example, IoT gateways can be clustered for improved scalability and reliability. (See Figure 1.) The number of nodes in each gateway cluster is modest, so both the immediate and eventual consistency model are suitable. The cluster can handle more traffic from edge devices than a single gateway would be able to, and reliability/availability is improved (the inherent limits to scalability with the immediate consistency model don’t come in to play in small clusters).

Blockchain. The term distributed database is often associated with blockchain technologies, bitcoin being the most well-known. It is used synonymously with distributed ledger, which is more apt (in this author’s opinion). My problem with using the term distributed database in the context of blockchain technologies is that distributed database implies a distributed database management system. But there is rarely a database management system involved in blockchain. Not to belabor the point, but it is important to draw a distinction between a database and a database management system. A blockchain is, in fact, a distributed database. But, as previously mentioned, there is rarely a database management system involved in creating or maintaining the blockchain distributed ledger.

The distributed database topology implied by the Wikipedia definition — “…stored in multiple computers, located in the same physical location…” — is what is colloquially called sharding a database. The key difference between sharding versus HA and cluster distributed databases is that each physical database instance (shard) houses just a fraction of all the data. All the shards together represent a single logical database, which is manifested in many physical shards. I part company with the Wikipedia definition in that I don’t differentiate between “distributed” and “parallel” database systems. Logically, the purpose is the same: scalability. Whether the shards are distributed across servers, CPUs or CPU cores is immaterial. Further, in all cases, processing is parallel. How the shards are physically distributed is an unimportant artifact. For example, in STAC-M3 published benchmarks we’ve conducted since 2012, we’ve utilized single servers with 24 cores, creating 72 shards, and we’ve utilized four to six servers, each with 16 to 22 cores, creating 64 to 128 shards. In all cases, our goal is to saturate the I/O channels to get data into the CPU cores for processing. While STAC-M3 is a capital markets (tick database) benchmark, the principles apply equally to IoT’s big data analytics. IoT data is overwhelmingly time-series data (for example, sensor measurements), just like a tick database is time-series data.

Sharding a database implies support for distributed query processing. Each shard is managed by its own instance of a database server. Since each shard/server represents some fraction of the whole logical database, the potential exists that a query result returned by any shard is just a partial result set and needs to be merged with the partial result sets of every other shard/server and only then be presented to the client application as a complete result set. If the data is distributed among the shards in the most optimal way, then all of the data for a given query can be found on a single shard and the query can be distributed to the specific server instance that manages that shard. Often, both approaches must be supported. For example, consider a large smart building IoT deployment that spans multiple campuses, each with multiple buildings. We might choose to distribute (shard) information about each campus across multiple physical databases. If we want to calculate some metric for a specific building, for example, power consumption in 15-minute windows, we only need to query the shard that contains the data for that building. But if we want to calculate the same metric for multiple buildings across campuses, then we need to distribute that query to many shards/servers, and this is where the parallelism comes in to play. Each server instance works on its portion of the problem in parallel with every other server instance.

Database sharding also supports vertical scalability (i.e., being able to store 10s or 100s of terabytes, petabytes and beyond). To create a single 100 terabyte logical database, I can create 50 instances of 2 terabyte physical databases. Distributed databases systems often support elastic scalability, allowing me to add shards, which could also mean adding servers to the distributed system so that the system is scalable in both the vertical and horizontal dimensions. Vertical and horizontal scalability is essential for large IoT systems that generate very high volumes of data. You need vertical scalability to handle the ever-growing volume of data, and you need horizontal scalability to maintain the ability for timely processing and analytics of the data as it grows from 1T B to 10 TB to 100 TB to petabytes and beyond.

Figure 1 illustrates all these concepts at play. Clusters of gateways collect data from edge devices. Cluster database implementations differ, of course, but generally speaking as long as a quorum of gateways are present in any one cluster, then the cluster remains operational and the edge devices can continue to send their data. If the number of edge nodes increases, we have the option of adding a new cluster or adding gateways to existing clusters (scalability), and we’ve improved reliability because the failure of any single gateway does not stop the operation of the cluster.

Figure 1 also illustrates sharding (multiple physical databases representing a single logical database), parallel query processing and high availability. Each of the gateway clusters ship their collected data upstream to a shard. Each shard consists of a master database and replica database that can be promoted to master in the event of a problem with the original master. The system is horizontally and vertically scalable by virtue of being able to add shards. And since each shard has a hot standby, the system can withstand the failure of a shard master and continue to provide real-time analytics to the enterprise.

Figure 1. Source: McObject

In summary, the term distributed database encompasses three different database system arrangements for three distinct purposes. High-availability database systems distribute a master database to one or more replicas for the express purpose of preserving the availability of the system in the face of failure. Cluster database systems distribute a database for massive/global scalability (eventual consistency) or for cooperative computing among a relatively small number of nodes (ACID). Finally, sharding partitions a logical database into multiple shards to facilitate parallel processing and horizontal scalability. All capabilities are integral to the deployment of scalable and reliable IoT systems.

All IoT Agenda network contributors are responsible for the content and accuracy of their posts. Opinions are of the writers and do not necessarily convey the thoughts of IoT Agenda.