BigDAWG is an open source project from researchers within the Intel Science and Technology Center for Big Data (ISTC). Everything we do at the ISTC is open intellectual property so anyone is free to use whatever we produce.
The ISTC is based at MIT but includes researchers from Brown University, the University of Chicago, Northwestern University, the University of Washington, Portland State University, Carnegie Mellon University, the University of Tennessee, and, of course, Intel.
The slogan is now famous in the database community: “One size does not fit all.” If data storage engines match the data, the performance of data-intensive applications is greatly enhanced. We’ve done significant performance analysis and have found that using the right storage engine for the job can yield orders-of-magnitude performance advantages. Beyond performance, however, there's an opportunity to improve efficiency for organizations that already have their data spread across a number of storage engines. Writing connectors across N different systems can create a lot of work for developers and make the cost of adding a new system very high.
This has led us to develop database technologies we call “polystore systems.” A polystore system is any database management system (DBMS) that is built on top of multiple, heterogeneous, integrated storage engines. Each of these three terms is important in distinguishing a polystore from a conventional federated DBMS.
Obviously, a polystore must consist of multiple data stores. However, a polystore should not be confused with a distributed DBMS, which consists of replicated instances of a storage engine sitting behind a single query engine. The key to a polystore is that the multiple storage engines are distinct and accessed separately through their own query engines via a common interface.
Therefore, the storage engines in a polystore system must be heterogeneous. If they were all the same, it would defeat the whole point of a polystore system, i.e., mapping each component of a complex data set onto a distinct storage engine well suited to that component's features.
Finally, the storage engines must be integrated. In a federated DBMS, the individual storage engines are independent. In most cases, they are not managed by a single administration team. In a polystore system, the storage engines are managed together as an integrated set. This is key since it means that in a polystore system, you can modify the engines or the middleware managing them such that “the whole is greater than the sum of its parts.”
The challenge in designing a polystore system is to balance two often conflicting forces.
The BigDAWG project described in this document is our reference implementation of this polystore concept. As we will see in the next section, BigDAWG uses the concept of “islands” to balance these forces.
The above figure describes the overall BigDAWG architecture. It shows the BigDAWG polystore system integrated with higher-level components to support end-user applications. At the bottom, we have a collection of disparate storage engines (we make no assumption about the data model, programming model, etc., of each of these engines). These are organized into a number of islands. An island is composed of a data model, a set of operations, and a set of candidate storage engines. An island provides location independence among its associated storage engines.
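To make the island definition concrete, here is a minimal Python sketch of an island as a data structure. It is not BigDAWG code: the class name Island, its fields, the catalog mapping, and the engine and object names are all hypothetical, chosen only to illustrate the three parts of an island and what location independence might look like to a caller.

```python
from dataclasses import dataclass, field


@dataclass
class Island:
    """Hypothetical model of an island: a data model, a set of
    operations, and a set of candidate storage engines."""
    name: str
    data_model: str
    operations: frozenset
    engines: tuple
    # Assumed bookkeeping: which engine currently holds each data object.
    catalog: dict = field(default_factory=dict)

    def register(self, obj: str, engine: str) -> None:
        """Record that a data object lives on one of the island's engines."""
        if engine not in self.engines:
            raise ValueError(f"{engine!r} is not an engine of island {self.name!r}")
        self.catalog[obj] = engine

    def locate(self, obj: str) -> str:
        """Location independence: the caller names the object, and the
        island works out which engine holds it."""
        return self.catalog[obj]


# Illustrative island and engine names, not taken from BigDAWG.
relational = Island(
    name="relational",
    data_model="tables",
    operations=frozenset({"select", "project", "join"}),
    engines=("engine_a", "engine_b"),
)
relational.register("patients", "engine_b")
```

A caller that asks relational.locate("patients") never has to name an engine in its own code, which is the property the island is meant to provide.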
A shim connects an island to one or more storage engines. The shim is basically a translator that maps queries expressed in terms of the operations defined by an island into the native query language of a particular storage engine.
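The translation role of a shim can be sketched as follows. This is an illustration under stated assumptions, not the BigDAWG shim interface: FilterOp, SqlShim, and ArrayShim are hypothetical names, the island is assumed to define a single filter operation, and the array-style output is only loosely modeled on functional array query languages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FilterOp:
    """Hypothetical island-level operation: keep items where
    `field comparator value` holds."""
    source: str
    field: str
    comparator: str
    value: int


class SqlShim:
    """Translates the island operation into a SQL string."""

    def translate(self, op: FilterOp) -> str:
        # String formatting keeps the sketch short; real code should
        # use parameterized queries rather than interpolating values.
        return f"SELECT * FROM {op.source} WHERE {op.field} {op.comparator} {op.value}"


class ArrayShim:
    """Translates the same operation into a functional, array-style query."""

    def translate(self, op: FilterOp) -> str:
        return f"filter({op.source}, {op.field} {op.comparator} {op.value})"


op = FilterOp(source="patients", field="age", comparator=">", value=60)
sql_query = SqlShim().translate(op)
array_query = ArrayShim().translate(op)
```

The same island-level operation yields two different native queries, one per shim, so the island's operations stay independent of any one engine's query language.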
A key goal of a polystore system is for the processing to occur on the storage engine best suited to the features of the data. We expect in typical workloads that queries will produce results best suited to particular storage engines. Hence, BigDAWG needs a capability to move data directly between storage engines. We do this with software components we call casts.
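As a rough illustration of what a cast does, the sketch below copies a table from a toy row store into a toy key-value store, reshaping the data for the destination along the way. Everything here is hypothetical (the two in-memory store classes, the function name cast_rows_to_key_value, and the sample data); it says nothing about how BigDAWG implements casts between real engines.

```python
class RowStore:
    """Toy stand-in for a relational engine: table name -> list of row dicts."""

    def __init__(self):
        self.tables = {}


class KeyValueStore:
    """Toy stand-in for a key-value engine: bucket name -> {key: value}."""

    def __init__(self):
        self.buckets = {}


def cast_rows_to_key_value(source, table, dest, bucket, key_column):
    """Copy `table` from the row store into `bucket` on the key-value store.

    Each row becomes one entry keyed by `key_column`; the remaining
    columns become the value. The source table is left in place.
    Returns the number of rows copied.
    """
    rows = source.tables[table]
    dest.buckets[bucket] = {
        row[key_column]: {k: v for k, v in row.items() if k != key_column}
        for row in rows
    }
    return len(rows)


source = RowStore()
source.tables["patients"] = [{"id": 1, "age": 64}, {"id": 2, "age": 41}]
dest = KeyValueStore()
moved = cast_rows_to_key_value(source, "patients", dest, "patients_kv", key_column="id")
```

After the call, dest.buckets["patients_kv"] holds the same records in the destination's native shape, so later processing can run on the engine better suited to it.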
BigDAWG is at its core middleware that supports a common application programming interface (API) to a collection of storage engines. The middleware contains a number of key elements: