The Internet of Things presents a new set of challenges to database management systems: ingesting large volumes of data in real time or near real time, processing events as they stream in rather than at fixed intervals, and storing and querying volumes of data significantly larger than those typically encountered in enterprise applications.
At the same time, IoT imposes fewer data quality and integrity constraints. For example, an application that collects data from vehicles in a fleet can tolerate the loss of data for a few minutes and still monitor the overall functioning of the vehicles. And although IoT sensors generate data rapidly, they do not entail the same kinds of transactions found in traditional enterprise business applications, which reduces the need for ACID transactions.
Handling IoT data properly starts with finding the right IoT database. Choosing one, however, means setting aside many of the lessons we have learned designing database applications for business operations.
IoT database requirements
There are many factors to keep in mind when choosing a database for an IoT application, and they do not always align with the needs of more traditional enterprise databases. Some of the most important considerations are scalability, the ability to ingest data at sufficient rates, schema flexibility, integration with analytics tools, and cost.
A database for IoT applications must be scalable. Ideally, IoT databases are linearly scalable, so that adding one more server to a 10-node cluster increases throughput by roughly 10%. IoT databases will usually be distributed unless the application collects only a small amount of data that will not grow substantially. Distributed databases can run on commodity hardware and scale by adding new servers instead of swapping out a server for a larger one. They are especially well suited to IaaS clouds, since it is relatively easy to add and remove servers from the database cluster as needed.
An IoT database should also be fault tolerant and highly available. If a node in the database cluster is down, the cluster should still be able to accept read and write requests. Distributed databases make copies, or replicas, of data and write them to multiple servers. If one of the servers storing a particular data set fails, another server storing a replica of that data set can respond to the read query. Write requests can be handled in a couple of ways. If the server that would normally accept a write request is down, another node in the cluster can accept the write request and forward it to the target server when it is back online.
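The replica-based read path described above can be sketched in a few lines. This is an illustration only, with hypothetical node and key names; real distributed databases such as Cassandra handle replica selection internally.

```python
# Minimal sketch of replica-based fault tolerance: every live replica
# stores a copy, and reads fall through to the first live replica.

class Node:
    def __init__(self, name):
        self.name = name
        self.up = True     # simulated availability flag
        self.store = {}    # this node's local copy of the data

def replicated_write(nodes, key, value):
    """Write the value to every replica that is currently up."""
    for node in nodes:
        if node.up:
            node.store[key] = value

def fault_tolerant_read(nodes, key):
    """Return the value from the first live replica holding the key."""
    for node in nodes:
        if node.up and key in node.store:
            return node.store[key]
    raise RuntimeError("no live replica holds the key")

replicas = [Node("a"), Node("b"), Node("c")]
replicated_write(replicas, "sensor-42", 21.5)
replicas[0].up = False   # simulate a node failure
value = fault_tolerant_read(replicas, "sensor-42")  # still readable
```

The point of the sketch is that availability comes from redundancy: as long as one replica survives, reads keep succeeding.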
Another approach to ensuring high availability for writes is to use a distributed messaging system such as Apache Kafka or Amazon Kinesis. These systems can accept writes at high volumes and store them persistently in a publish-and-subscribe system. If a server is down, or the volume of writes is too high for the distributed database to ingest in real time, data can be held in the messaging system until the database processes the backlog or additional nodes are added to the database cluster.
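The buffering pattern above can be sketched without a real broker. In this illustration an in-memory deque stands in for Kafka or Kinesis; the names (`MessageBuffer`, `drain`) are hypothetical, not part of any library API.

```python
from collections import deque

class MessageBuffer:
    """Stand-in for a durable log: always accepts writes, delivers later."""

    def __init__(self):
        self.log = deque()

    def publish(self, record):
        self.log.append(record)  # the buffer never refuses a write

    def drain(self, db_write, batch_size):
        """Deliver up to batch_size buffered records to the database."""
        delivered = 0
        while self.log and delivered < batch_size:
            db_write(self.log.popleft())
            delivered += 1
        return delivered

database = []              # stand-in for the database's ingest path
buf = MessageBuffer()
for i in range(1000):      # a burst of sensor readings
    buf.publish({"reading": i})
buf.drain(database.append, 100)  # the database ingests at its own pace
```

After the burst, 100 records have reached the database and 900 remain safely buffered; repeated `drain` calls work off the backlog, which is exactly the decoupling the messaging layer provides.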
IoT databases should be as flexible as required by the application. NoSQL databases -- especially key-value, document and column family databases -- easily accommodate different data types and structures without the need for predefined, fixed schemas. NoSQL databases are good options when an organization has multiple data types and those data types will likely change over time. In other cases, applications that collect a fixed set of data -- such as data on weather conditions -- may benefit from a relational model. In-memory SQL databases, such as MemSQL, offer this benefit.
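The schema flexibility described above is easy to show concretely. In this sketch (all field names are hypothetical), two devices report readings with entirely different shapes, and a document-style store accepts both without any schema migration.

```python
# Two sensor readings with different structures.
thermostat = {"device_id": "t-100", "ts": 1700000000, "temp_c": 21.5}
vehicle = {"device_id": "v-7", "ts": 1700000005,
           "gps": {"lat": 45.52, "lon": -122.68}, "fuel_pct": 63}

def ingest(store, reading):
    """Append-only ingest keyed by device; no fixed column set required."""
    store.setdefault(reading["device_id"], []).append(reading)

store = {}
ingest(store, thermostat)
ingest(store, vehicle)
```

A relational schema would need either sparse nullable columns or separate tables per device type to hold both records; a document or key-value model absorbs the variation directly.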
Managing a database for IoT applications in-house
For those organizations choosing to manage their own databases, Apache Cassandra (available with commercial support from DataStax) is a highly scalable distributed database that supports a flexible, Bigtable-style schema, fast writes and large data volumes. Riak TS is a distributed, highly scalable key-value data store designed for time series and IoT workloads; it integrates with Apache Spark, a big data analytics platform that enables stream processing. Cassandra also integrates with Spark, as well as with other big data analytics platforms such as Hadoop MapReduce.
OpenTSDB is an open source time-series database that runs on top of HBase, and therefore on Hadoop. The database consists of a set of command-line interfaces and Time Series Daemons (TSDs). The TSDs, which handle all database requests, run independently of one another. Even though TSDs use HBase to store time-series data, users interact only with the TSDs and have little to no contact with HBase itself.
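TSDs accept data points in a simple line format, `put <metric> <unix_timestamp> <value> <tag>=<value> ...`, over their TCP port (commonly 4242). A small sketch of formatting such a line (the metric and tag names are hypothetical):

```python
def opentsdb_put_line(metric, timestamp, value, **tags):
    """Format one data point in OpenTSDB's telnet-style put protocol:
    put <metric> <unix_ts> <value> tag1=v1 tag2=v2 ..."""
    tag_str = " ".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return f"put {metric} {timestamp} {value} {tag_str}"

line = opentsdb_put_line("engine.temp", 1700000000, 88.5,
                         vehicle="v-7", fleet="west")
# A running TSD would accept this line over its TCP socket;
# OpenTSDB also offers an HTTP endpoint (/api/put) for JSON payloads.
```

The tags are what make the model suit IoT fleets: the same metric can be sliced later per vehicle, per fleet or per region.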
MemSQL is a relational database tuned for real-time data streaming. With MemSQL, streamed data, transactions and historical data can be kept in the same database. It also supports geospatial data out of the box, which is useful for location-based IoT applications. MemSQL integrates with the Hadoop Distributed File System and Apache Spark, as well as with other data warehousing solutions.
Managed IoT database options
Enterprises and developers that prefer a managed database service should consider AWS offerings such as Kinesis and DynamoDB. Kinesis captures data as it is generated and sends it to DynamoDB, a managed NoSQL database, which in turn integrates with the Elastic MapReduce service for Hadoop- and Spark-based analytics.
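The producer side of that pipeline can be sketched with boto3's Kinesis client. The stream name, field names and helper below are assumptions for illustration; the `put_record` call is shown but commented out so the sketch runs without AWS credentials.

```python
import json

def make_kinesis_record(reading, stream="iot-telemetry"):
    """Build the keyword arguments for kinesis.put_record; partitioning
    by device id keeps each device's readings in a consistent shard."""
    return {
        "StreamName": stream,
        "Data": json.dumps(reading).encode("utf-8"),
        "PartitionKey": reading["device_id"],
    }

record = make_kinesis_record({"device_id": "v-7", "temp_c": 88.5})

# With credentials configured, the record would be shipped like this:
# import boto3
# boto3.client("kinesis").put_record(**record)
```

A consumer (for example, an AWS Lambda function subscribed to the stream) would then decode each record and write it to DynamoDB.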
If an organization does choose to work with a database-as-a-service provider, it should carefully consider the cost structure. AWS Kinesis Streams pricing is primarily based on shard hours; a shard hour allots 1 MB per second of input capacity and 2 MB per second of output. Customers pay $0.015 per shard hour. As for the data itself, Kinesis Streams uses PUT Payload Units to calculate total usage. Each unit covers up to 25 KB of payload, so, for example, a 10 KB record is billed as one unit while a 30 KB record is billed as two. Customers are charged $0.014 per 1,000,000 units.
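Those rates lend themselves to a back-of-envelope estimate. The helper names below are hypothetical, and 730 hours is used as an average month:

```python
import math

PRICE_PER_SHARD_HOUR = 0.015       # USD, as quoted above
PRICE_PER_MILLION_UNITS = 0.014    # USD per 1,000,000 PUT payload units

def put_payload_units(record_size_kb):
    """One unit covers up to 25 KB, so a 30 KB record costs two units."""
    return math.ceil(record_size_kb / 25)

def monthly_kinesis_cost(shards, records_per_month, record_size_kb,
                         hours=730):
    shard_cost = shards * hours * PRICE_PER_SHARD_HOUR
    unit_cost = (records_per_month * put_payload_units(record_size_kb)
                 / 1_000_000) * PRICE_PER_MILLION_UNITS
    return shard_cost + unit_cost

# e.g. two shards ingesting 10 million 10 KB records a month
estimate = monthly_kinesis_cost(2, 10_000_000, 10)
```

Note that for a steady IoT workload, shard hours dominate: in the example, the two shards cost about $21.90 a month while 10 million records add only $0.14.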
With AWS DynamoDB, the first 25 GB of storage each month is free, and every GB after that is charged at $0.25 per month. As for operations, write throughput is charged at $0.0065 per hour per 10 units of write capacity, and read throughput at the same rate per 50 units of read capacity. Note that data consumption in the cloud will exceed the raw size of your data, and prices vary by region.
About the authors:
Dan Sullivan is an author, systems architect and consultant with over 20 years of IT experience with engagements in advanced analytics, systems architecture, database design, enterprise security and business intelligence.
James Sullivan is a technology writer with concentrations in cloud database services, IoT and security. He is based out of Portland, OR.