BACKGROUND IMAGE: iSTOCK/GETTY IMAGES
IoT presents a new set of challenges for database management systems, including ingesting data in real time, processing events as they stream in and securing larger volumes of IoT devices and data than previously encountered in enterprise applications.
At the same time, IoT imposes fewer data quality and integrity constraints. For example, an application that collects data from vehicles in a fleet can tolerate the loss of data for a few minutes and yet still be able to monitor the overall function capabilities of the vehicles. Although IoT sensors generate data rapidly, they do not entail the same kinds of transactions found in traditional enterprise business applications. This reduces the need for atomicity, consistency, isolation and durability transactions.
In order to find the right IoT database to properly handle IoT data and its requirements, organizations must put aside preconceptions about designing database applications for traditional business operations.
Identify IoT database requirements
Organizations must keep four main considerations in mind when choosing a database:
- Scalability. A database for IoT applications must be scalable. Ideally, IoT databases are linearly scalable so adding one more server to a 10-node cluster increases throughput by 10%. IoT databases will usually be distributed unless the application collects only a small amount of data that will not grow substantially. Distributed databases can run on commodity hardware and scale by adding new servers instead of swapping out a server for a larger one. Distributed databases are especially well-suited for IaaSclouds, since it is relatively easy to add and remove servers from the database cluster as needed.
- Fault tolerance. An IoT database should also be fault tolerant and highly available. If a node in the database cluster is down, it should still be able to accept read and write requests. Distributed databases make copies of data and write them to multiple servers. If one of the servers storing a particular data set fails, then one of the other servers storing a replica of the data set can respond to the read query. Write requests can be handled in a couple of ways. If the server that would normally accept a write request is down, another node in the server can accept the write request and forward it to the target server when it is back online.
- High availability. Ensure high availability with regards to writes by using a distributed messaging system such as Apache Kafka or Amazon Kinesis, which is based on Apache Kafka. These systems can accept writes at high volumes and store them persistently in a publish-and-subscribe system. If a server is down or the volume of writes is too high for the distributed database to ingest in real time, data can be stored in the messaging system until the database processes the backlog of data or additional nodes are added to the database cluster.
- Flexibility. IoT databases should be as flexible as required by the application. NoSQL databases -- especially key-value, document and column family databases -- easily accommodate different data types and structures without the need for predefined, fixed schemas. NoSQL databases are good options when an organization has multiple data types and those data types will likely change over time. In other cases, applications that collect a fixed set of data -- such as data on weather conditions -- may work better on a relational database model. In-memory SQL databases, such as MemSQL or PostgreSQL, offer this option.
Is a managed or in-house IoT database the right fit?
Organizations can keep their database in-house to better control the equipment, software, security and data. In-house means that organizations can change their equipment based on current needs without waiting for a service provider to make the changes. Organizations are responsible for their own IT pros to maintain and properly secure IoT databases onsite. The right IT pros, equipment and software can add up to a greater total cost compared to a managed database.
Database-as-a-service could cost less than purchasing equipment; in-house admins don't need to maintain the database and services typically provide at least basic security. Service providers may offer around-the-clock support. Security is also a drawback of database services because it increases the attack surface. Organizations' options in a managed IoT database are limited to what the service offers.
In-house IoT databases to consider
For organizations managing their own databases, they have a selection to choose from. DataStax Cassandra, a highly scalable distributed database, supports a flexible big table schema and fast writes and scales to large volumes of data. It was built on Apache Cassandra, the open source NoSQL database. Riak IoT is a distributed, highly scalable key-value data store which integrates with Apache Spark, a big data platform with real-time analytics. Cassandra also integrates with Spark and big data analytics platforms, such as Hadoop MapReduce.
Open source time series database OpenTSDB can run on Hadoop and HBase. The database is made up of command line interfaces and a Time Series Daemon (TSD). TSDs, which are responsible for processing all database requests, run independently of one another. Even though TSDs use HBase to store time-series data, TSD users have little to no contact with HBase itself.
MemSQL is a relational database tuned for real-time data streaming. With MemSQL, streamed data, transactions and historical data can be kept within the same database. The database also has the capacity to work well with geospatial data out of the box, which could be useful for location-based IoT applications. MemSQL supports integration with data warehouse products, including Hadoop Distributed File System and Apache Spark.
How to choose between managed IoT database options
Enterprises and developers have many managed database services with different advantages to choose from. They should consider using AWS offerings such as Kinesis and DynamoDB, Azure Cosmos DB, MongoDB or Google Firebase. Organizations can use Kinesis to capture data in real time and send that data to NoSQL document database DynamoDB, which integrates with the Elastic MapReduce service that provides Hadoop and Spark-based analytic services.
If an organization does choose to work with a database-as-a-service provider, they should carefully consider the cost structure. AWS Kinesis Streams pricing is primarily based on shard hours; a shard hour allots 1 MB per second of input capacity and 2 MB per second of output. Customers pay $0.015 per shard hour. As for the data itself, Kinesis Streams uses PUT Payload Units to calculate total usage. Each unit comprises between one and 25 KB of data, so, for example, a 10 KB record is billed as one unit and a 30 KB record is billed as two. Customers are charged $0.014 per 1,000,000 units.
With AWS DynamoDB, the first 25 GB every month is free. Every GB after that is charged at $0.25 per month. As for operations, write throughput is charged at $0.0065 per hour per 10 units of write capacity. Read throughput is charged at the same rate for every 50 units. Note that data consumption in the cloud will exceed the raw size of your data and prices vary depending on region.
Azure Cosmos DB bases price on request units -- for example, the cost to read a 1 KB item -- whether the database operation is read, write or query. Provisioned throughput of 100 request units from a single-region account costs $0.008 per hour.