Distributed and Non-Relational Databases
Mark Feltner
Department of Computer Science
University of Wisconsin – Platteville
Platteville, WI 53818
Abstract
The data storage climate was much different when SQL – the standard data storage model – was conceived in the 1970s. Today’s data is adaptive, moves quickly, can be astronomical in size, and may not even be located in a single geographic spot. Consider: the New York Stock Exchange generates about one terabyte of trade data per day, Facebook hosts approximately ten billion photos which take up about one petabyte of storage, and the Large Hadron Collider in Geneva, Switzerland will produce about fifteen petabytes of data per year [1]. New data storage solutions have been, and are being, conceived which specialize in handling these sorts of problems. These non-relational models (so called because they stray from the foundations of relational, SQL databases) are flexible to design and develop with, can easily be scaled horizontally, and are specialized for dealing with Big Data.
Theory
This section deals broadly with the theory of databases and distributed systems. The foundational concepts of databases and both the relational and non-relational models are described. A few important concepts relevant to data storage with distributed systems are also featured.
Relational Databases
The concept of a database management system was first put forward in 1970 by Edgar Codd at the IBM Almaden Research Center. Codd proposed a relational model, based on set theory, to structure stored data [2].
The relational model defines a relation as a set of tuples that share the same attributes. Usually one tuple represents an object and its associated information: each tuple is a row in the table and each attribute is a column [3]. Both the rows and the attributes are generally treated as unordered sets. A relation requires a primary key, that is, a set of attributes whose values are guaranteed to be distinct among all tuples in the relation [4].
Relations are referenced via foreign keys. This allows a single tuple to be referenced by any number of other tuples (i.e., a one-to-many relationship).
The relational model also allows one to define precompiled indexes, which can locate tuples much more quickly than checking every single tuple for a match.
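To make these terms concrete, the following sketch models a relation in Ruby (the language used in the key-value example later in this paper). The Relation class and the sample attributes are purely illustrative and not part of any particular DBMS:

# Minimal sketch of a relation: a set of tuples (hashes) indexed by a primary key.
class Relation
  def initialize(primary_key)
    @primary_key = primary_key
    @tuples = {}                      # primary-key index: key value => tuple
  end

  def insert(tuple)
    key = tuple[@primary_key]
    raise "duplicate primary key #{key}" if @tuples.key?(key)
    @tuples[key] = tuple
  end

  def find(key)                       # indexed lookup: no need to scan every tuple
    @tuples[key]
  end
end

authors = Relation.new(:author_id)
authors.insert(author_id: 1, name: "Edgar Codd")

# A foreign key is simply an attribute holding another relation's primary key,
# so any number of papers can reference the same author (one-to-many).
papers = Relation.new(:paper_id)
papers.insert(paper_id: 10, title: "A Relational Model of Data", author_id: 1)
puts authors.find(papers.find(10)[:author_id])[:name]   # => Edgar Codd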
Data in the relational model is often normalized to retain integrity and remove redundancy [1].
Database transactions are the standard units of work for a relational database management system (DBMS). Transactions are small, simplified units of work such as INSERT, UPDATE, SELECT, and DELETE operations. A good DBMS is usually able to take a complex query and divide it into efficient, small, discrete, and reliable transactions. This gives the DBMS the ability to keep the database in a consistent state even if a transaction fails. Transaction-based systems also benefit from being able to ROLLBACK changes by simply reversing the transactions that were executed.
Besides querying data with the SELECT statement, users can modify relations using INSERT, DELETE, UPDATE, and, of course, other DBMS provided procedures.
ACID
ACID is one of the core characteristics of a "reliable" database. ACID is an acronym for Atomicity, Consistency, Isolation, and Durability. This set of properties guarantees that reliable database transactions may take place, and has been a mainstay in modern relational database systems.
Atomicity simply means that each transaction is all-or-nothing. If any operation in the transaction group fails, then the entire transaction must be undone. If one were to insert a million rows into a table, and the 999,999th one failed before a COMMIT was issued, the database would revert to its previous state.
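A minimal sketch of this all-or-nothing behavior is shown below, in Ruby and with an entirely hypothetical TinyDB class: changes are applied to a copy of the data and only swapped in if every operation succeeds, so a failure leaves the previous state intact.

# Hypothetical sketch of transactional ROLLBACK: work on a deep copy of the data
# and swap it in only if every operation in the transaction succeeds.
class TinyDB
  attr_reader :rows

  def initialize
    @rows = []
  end

  def transaction
    snapshot = Marshal.load(Marshal.dump(@rows))   # deep copy of the current state
    yield snapshot
    @rows = snapshot                               # COMMIT: swap in the new state
  rescue => e
    puts "transaction failed (#{e.message}), rolling back"
    # @rows is untouched, so the database remains in its previous consistent state
  end
end

db = TinyDB.new
db.transaction { |rows| rows << { id: 1, name: "first" } }       # commits
db.transaction { |rows| rows << { id: 2 }; raise "disk error" }  # rolled back
puts db.rows.inspect   # => [{:id=>1, :name=>"first"}]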
For a database to be consistent, the data must always be in a valid, or consistent, state. Relational databases are designed with rules in mind. Strict schemas are defined and specify maximum lengths, value types, relations, and other rules. An ACID-compliant database would never break the rules it is defined with. Of course, this consistency does not guarantee programmer correctness, just that the database will follow the rules laid out for it.
Isolation is the DBMS property that enables concurrent execution of transactions: transactions executed concurrently result in the same system state as if they had been executed serially.
Durability means that once a COMMIT has been issued, the transaction's effects are permanent and visible to all clients.
All of these checks and rules, of course, add overhead to each transaction that occurs in an ACID-compliant DBMS. Systems in the banking or government sectors, which do require a strong degree of data reliability, benefit from founding their models on ACID.
The relational model is backed by decades of use, strong foundations in well-studied mathematics, and a variety of ready-to-use implementations. That being said, it is not the end-all-be-all of data storage. It is inflexible, tedious to work with, and the benefits of ACID do not always outweigh the costs [4]. Relational databases lend themselves to systems which require a large degree of reliability and availability, and are generally better suited for single-node systems. The rise of distributed systems and web services, and the difficulty found when scaling traditional DBMSs, have led to new paradigms in data storage.
Non-Relational Databases
In 2009, the landscape of databases had changed vastly from that conceived by Codd in the 1970s. Nowadays, we are dealing with systems made to handle thousands of concurrent connections per second, store petabytes of data, and sync with similar servers and services around the globe simultaneously.
NoSQL (so named because of its focus on data storage techniques sans SQL) describes non-relational, distributed data stores that often do not attempt to provide atomicity, consistency, isolation, and durability – the components of ACID. As consumers became more and more connected, data needed to be delivered faster. Two methods of solving this problem were conceived. First, spread the data stores out geographically. Second, specialize the data stores for the types of data they hold and process [4].
NoSQL systems store semi-structured or unstructured data. Semi-structured data is much like a spreadsheet: you know that the data is organized in cells, rows, and columns, but you only have a loose idea of what kind of data is stored. Unstructured data has no particular internal structure (e.g., plain text or image data).
Recently, there has been a surge of NoSQL and NoSQL-esque database management systems. Most, if not all, of these systems lack fixed schemas, avoid joins, and scale horizontally with ease.
Different systems satisfy different needs. Document-oriented stores, for instance, offer immense ease of use, while key-value or column-oriented systems sacrifice some of this in order to scale data over clusters more efficiently [4].
Key-Value
Key-value stores are basically “big, distributed, persistent, fault-tolerant hash tables”. Built on an extremely simple concept, the hash table, key-value stores are prized for their simplicity and speed [4]. Key-value stores allow the application developer to store schema-less data, which is beneficial when a database is used for persistent storage of object-oriented data.
The data in a key-value store is usually just a string that represents a key, and a value which is some type of primitive of whatever programming language is being used (string, integer, array) [8].
There are a variety of ways to implement this. The most common key-value store is the file system: the key, for example, would be the path to a file, and the value would be the binary data associated with that file. This simplicity means that key-value stores are not only easy to use, but the code produced is simple and readable. An example of accessing a key-value store in Ruby [8]:
require "pstore" # the key-value store interface
store = PStore.new("data-file.pstore")
store.transaction do # begin transaction
# load some data into the store...
store[:single_object] = "Lorem ipsum dolor sit amet..."
store[:obj_heirarchy] = { "Marc Seeger" => ["ruby", "nosql"],
"Rainer Wahnsinn" => ["php", "mysql"] }
end # commit changes to data store file [8].
Some of the most popular key-value storage solutions include: Apache Cassandra, Redis, memcached, and Google’s BigTable.
Document-oriented
Document-oriented databases are semi-structured data stores implemented without tables and relations [4]. Document stores are highly flexible and have no fixed schema. The idea is to allow the client application to add and remove attributes on any single tuple without wasting space by creating empty attributes for all the other tuples.
Document-oriented databases are well suited to dynamic data because of their lack of fixed schemas and their developer flexibility. Because of this, though, document-oriented databases sacrifice some safety (much like using a duck-typed language versus a statically typed one) and incur a relative performance hit.
Many popular document-oriented databases provide layers over already standardized semi-structured file formats such as JSON, YAML, or XML. This allows them to map easily to web services and other distributed systems which operate by passing these small, structured files around.
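As a small illustration of this schema-less, JSON-friendly style, the Ruby snippet below keeps two “documents” with different attributes in the same collection and serializes one of them to JSON; the collection here is just an in-memory array, not a real document store:

require "json"   # Ruby's standard JSON library

# A "collection" of documents: each document carries its own set of attributes,
# so adding a field to one record wastes no space in the others.
articles = []
articles << { "title" => "NoSQL overview", "tags" => ["nosql", "databases"] }
articles << { "title" => "CAP theorem", "author" => "Eric Brewer", "year" => 2000 }

# Documents map directly onto the semi-structured formats used by web services.
puts JSON.pretty_generate(articles.last)
# {
#   "title": "CAP theorem",
#   "author": "Eric Brewer",
#   "year": 2000
# }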
Some of the most popular document-oriented databases include MongoDB, CouchDB, and SimpleDB.
Graph
Graph-based databases use graph-like structures, with nodes, edges, and properties, to represent data. Graph databases are powerful tools, especially for graph-like queries such as shortest path computation and community analysis.
These databases provide another advantage: index-free adjacency. This basically means no index lookups are necessary because each element contains a direct pointer to its adjacent element.
Graph databases scale more naturally to large datasets and are often faster than object-oriented applications for associative data sets. Graphs typically do not require expensive JOIN operations, and their flexible schemas are suitable for ad-hoc and changing data.
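The sketch below illustrates index-free adjacency in Ruby: each node holds direct references to its neighbors, so a shortest-path query is a plain breadth-first walk over pointers with no index lookups (the Node structure and the tiny social graph are made up for illustration):

# Each node keeps direct pointers to its adjacent nodes (index-free adjacency).
Node = Struct.new(:name, :neighbors)

alice = Node.new("alice", [])
bob   = Node.new("bob",   [])
carol = Node.new("carol", [])
alice.neighbors << bob
bob.neighbors   << carol

# Shortest path by breadth-first search, following pointers instead of indexes.
def shortest_path(start, goal)
  queue = [[start]]
  seen  = [start]
  until queue.empty?
    path = queue.shift
    return path.map(&:name) if path.last == goal
    path.last.neighbors.each do |n|
      next if seen.include?(n)
      seen << n
      queue << (path + [n])
    end
  end
  nil
end

puts shortest_path(alice, carol).inspect  # => ["alice", "bob", "carol"]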
Some popular graph-based solutions include AllegroGraph, InfiniteGraph, and FlockDB.
Other
Of course, there are many other types of distributed databases including tuple stores, tabular databases, RDF databases, object databases, multivalue databases, RAM stores, and many other sub-categories of previously listed categories. All-in-all, there exist quite a number of solutions for almost any data storage need.
Distributed Systems
Distributed systems, also commonly referred to as “grids”, are rising in popularity. There has been a growing need for distributed systems in both commercial and scientific domains. Grid systems can be classified into computational, data, and service types. For the purposes of this paper, we will focus on the data grid. The data grid is defined as “systems that provide a hardware and software infrastructure for synthesizing new information from data repositories that are distributed in a wide area network” [9].
For almost any application that relies on a database housing a large volume of data, the bottleneck will generally be the database itself. I/O operations are expensive, disk reads are slow, writes are slower, and since many databases live in remote data centers, network latency must be taken into account as well. A simple solution is to add more hardware to the database system, but this is inherently unsustainable.
Another solution is to parallelize and distribute the data processing. A parallel database management system can achieve both high throughput, through interquery parallelism, and low response time, through intraoperation parallelism [10].
Parallelism does add architectural complexity, and creates a lot of system administration overhead. The biggest disadvantage in parallel computing is the lack of standards and applicable examples [10].
The structure of distributed systems grants them different performance metrics and different goals. These systems are measured on their efficiency, or utilization rate of resources; dependability; adaptability, or ability to support billions of job requests of massive data sets and virtualized cloud servers under various workloads and service models; and flexibility, or the ability of the system to run in both high-performance and high-throughput computing environments [11].
CAP Theorem
In 2000, Eric Brewer laid out the CAP Theorem in a keynote at the ACM Symposium on Principles of Distributed Computing, and Seth Gilbert and Nancy Lynch formalized his conjecture into a theorem in 2002. The CAP Theorem states that it is impossible for a distributed system to simultaneously guarantee consistency, availability, and partition tolerance [4]. This idea has vast implications, especially today when almost every system is distributed in one way or another.
A service that is consistent operates fully or not at all. That is, a system is consistent if an update is applied to all relevant nodes at the same logical time [5]. One way to drop consistency in a service is to simply live with inconsistencies: some services accept that things are only eventually consistent.
Eventual consistency is a consistency model used in the parallel programming domain. It means that given a sufficiently long period of time over which no changes are sent, all updates can be expected to propagate throughout the system and replicas will be consistent [6].
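A toy model of eventual consistency is sketched below (the Replica class is hypothetical): a write is queued for every replica and applied asynchronously, and once the pending updates drain, all replicas agree on the value.

# Toy model of eventual consistency: updates are queued per replica and applied
# later; once the queues drain, every replica converges to the same state.
class Replica
  attr_reader :data

  def initialize
    @data    = {}
    @pending = []
  end

  def enqueue(key, value)
    @pending << [key, value]
  end

  # Called "eventually", e.g. by an anti-entropy or gossip process.
  def apply_pending!
    @pending.each { |k, v| @data[k] = v }
    @pending.clear
  end
end

replicas = [Replica.new, Replica.new, Replica.new]

# A write is accepted immediately and propagated to each replica asynchronously.
replicas.each { |r| r.enqueue(:greeting, "hello") }

# Before propagation the replicas may disagree; afterwards they are consistent.
replicas.each(&:apply_pending!)
puts replicas.all? { |r| r.data[:greeting] == "hello" }  # => true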
“For a distributed system to be continuously available, every request received by a non-failing node in the system must result in a response” [7]. Availability is one of the hardest things to test and plan for because it tends to desert you when you need it most. It is all but inevitable that services go down at busy times precisely because they are busy. Implementing a system that deals with availability gracefully can be non-trivial [4].
“In order to model partition tolerance, the network will be allowed to lose arbitrarily many messages sent from one node to another. When a network is partitioned, all messages sent from nodes in one component of the partition to nodes in another component are lost” [7]. Once a system has become distributed, there is a high chance that it will be in a partitioned state: perhaps the data has not yet traversed the wires to reach the entire data store, or perhaps some machines have crashed and are no longer online. To handle issues of consistency and availability, we can scale horizontally, but as the number of partitions increases, so does the number of atomic entities that must be kept synchronized. You can drop partition tolerance by placing everything on a single machine, but this severely limits your ability to scale [4].
To have a consistent system, you must sacrifice some level of availability or partition tolerance; a system with good availability will have to sacrifice some consistency or partition tolerance; and a system that is partitioned will need to sacrifice some degree of consistency or availability.
Choosing which part of CAP to drop has been argued over and over again since CAP’s inception. Hale believes that you cannot drop partition tolerance because the probability that one or more nodes in your system will fail increases exponentially as you add nodes. Instead, Hale says that you should be asking yourself: “In the event of failures, which will this system sacrifice? Consistency or availability?”
Algorithms
Distributed Systems and Data Parallelization Techniques
MapReduce
MapReduce is a popular programming model used for scalable data processing on large clusters over large data sets [4]. MapReduce is suited to applications where data is written once and read many times [1]. The user provides two functions: a map function, which generates a list of intermediate key/value pairs, and a reduce function, which merges all the intermediate values associated with the same intermediate key. Being purely functional, MapReduce operates well on semi-structured or unstructured data.
The map function serves to process a subset of the original problem and returns the result as a list of (A,B)-tuples.
The reduce function, given an A value and a list of B values, produces a list of results that is the final answer for all A-parts. The beauty of these two operations is that they are purely functional and have no side effects. What does this mean? It means they tend to scale well, are highly parallelizable, and are extremely flexible [4].
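The canonical word-count example makes the two functions concrete. The sketch below simulates the model in plain Ruby on a single machine, with no cluster or MapReduce framework involved: map emits (word, 1) pairs, the pairs are grouped by their intermediate key, and reduce sums the counts for each word.

# Word count in the MapReduce style, simulated on a single machine.

# map: take one input record and emit a list of (key, value) pairs.
def map_fn(line)
  line.split.map { |word| [word.downcase, 1] }
end

# reduce: take one key and all of its intermediate values, emit the final value.
def reduce_fn(word, counts)
  [word, counts.sum]
end

input = ["the quick brown fox", "the lazy dog", "the fox"]

# "Map" phase: apply map_fn to every record, then group the pairs by key (shuffle).
pairs   = input.flat_map { |line| map_fn(line) }
grouped = pairs.group_by { |word, _| word }

# "Reduce" phase: merge all values that share the same intermediate key.
result = grouped.map { |word, kvs| reduce_fn(word, kvs.map { |_, n| n }) }.to_h

puts result.inspect
# => {"the"=>3, "quick"=>1, "brown"=>1, "fox"=>2, "lazy"=>1, "dog"=>1}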
Figure 1: MapReduce Visual Representation
This operation is linearly scalable and massively parallelizable [1]: if you double the amount of data to operate on, the operation takes twice as much time, but if you also double the size of the cluster, the operation takes the original amount of time [1]. It has been proven to work on terabytes of data across hundreds of thousands of client machines with hundreds of simultaneous MapReduce programs. In fact, thousands of these operations are estimated to run on Google’s clusters every day [11].