How MemSQL Works

Warning

This documentation does not reflect the latest version of MemSQL.
This page refers to MemSQL 4.1.

For the latest MemSQL product documentation, visit http://docs.memsql.com/latest.

MemSQL is a distributed, relational database for simultaneous transactions and analytics at scale. Applications query it through standard SQL drivers and syntax, so it works with a broad ecosystem of existing drivers and applications.

Two-tiered Architecture

MemSQL has a two-tiered, clustered architecture. Each instance of the MemSQL program is called a “node”. Every node runs identical software; nodes differ only in the role they are configured to play.

  • Aggregator nodes provide a single interface to database clients and applications. Aggregators broker SQL queries to the cluster and aggregate results.
  • Leaf nodes store and process data. Leaves and aggregators communicate entirely over standard SQL.
[Figure: aggregator and leaf nodes (aggs-leaves.png)]
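
The division of labor between the two tiers can be illustrated with a toy scatter/gather model. This is a minimal sketch, not MemSQL code: the `Leaf` and `Aggregator` classes and their methods are hypothetical names, and real nodes exchange SQL rather than Python calls. It shows each leaf computing a partial result over its own rows and the aggregator merging those partials into one answer.

```python
from typing import Dict, List


class Leaf:
    """Hypothetical leaf: holds a slice of the rows and computes partial results."""

    def __init__(self, rows: List[dict]) -> None:
        self.rows = rows

    def partial_sum_count(self, column: str) -> Dict[str, float]:
        # Each leaf only looks at the rows it stores locally.
        values = [row[column] for row in self.rows]
        return {"sum": sum(values), "count": len(values)}


class Aggregator:
    """Hypothetical aggregator: fans a query out to every leaf and merges results."""

    def __init__(self, leaves: List[Leaf]) -> None:
        self.leaves = leaves

    def average(self, column: str) -> float:
        # Scatter: ask every leaf for its partial sum and row count.
        partials = [leaf.partial_sum_count(column) for leaf in self.leaves]
        # Gather: combine the partials into the final answer.
        total = sum(p["sum"] for p in partials)
        count = sum(p["count"] for p in partials)
        return total / count


leaves = [
    Leaf([{"amount": 10}, {"amount": 20}]),
    Leaf([{"amount": 30}]),
    Leaf([{"amount": 40}, {"amount": 50}, {"amount": 60}]),
]
aggregator = Aggregator(leaves)
print(aggregator.average("amount"))  # 35.0
```

The client only ever talks to the aggregator. An average cannot be merged from per-leaf averages, which is why the sketch ships sums and counts instead.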

High Throughput

MemSQL is designed for high throughput on concurrent workloads. A distributed query optimizer evenly divides the processing workload to make efficient use of CPU. Queries are compiled to machine code, and the compiled plans are cached to speed up subsequent executions. MemSQL caches the compiled query plan rather than the query results. The plan does not fix the parameter values, so MemSQL substitutes them at execution time, and later queries with the same structure run quickly even with different parameter values. Moreover, because MemSQL uses MVCC and lock-free data structures, data remains highly accessible even under a high volume of concurrent reads and writes.
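
The idea of caching a plan per query shape, rather than per query text or per result, can be sketched as follows. This is an illustration under stated assumptions, not MemSQL's implementation: the `parameterize` function, the `PlanCache` class, and the regular expression used to find literals are all hypothetical, and the "compiled plan" is a plain Python closure standing in for generated machine code.

```python
import re
from typing import Callable, Dict, List, Tuple

# Matches single-quoted strings and standalone numeric literals (a simplification).
_LITERAL = re.compile(r"'(?:[^']|'')*'|\b\d+(?:\.\d+)?\b")


def parameterize(sql: str) -> Tuple[str, List[str]]:
    """Replace literals with '?' and return the template plus the extracted values."""
    params: List[str] = []

    def _swap(match: "re.Match[str]") -> str:
        params.append(match.group(0))
        return "?"

    return _LITERAL.sub(_swap, sql), params


class PlanCache:
    """Hypothetical cache keyed by query shape, not by query text or result."""

    def __init__(self) -> None:
        self.plans: Dict[str, Callable[[List[str]], tuple]] = {}
        self.compilations = 0

    def _compile(self, template: str) -> Callable[[List[str]], tuple]:
        # Stand-in for code generation: the "plan" just binds parameters.
        def plan(params: List[str]) -> tuple:
            return (template, tuple(params))

        return plan

    def get_plan(self, sql: str) -> Tuple[Callable[[List[str]], tuple], List[str]]:
        template, params = parameterize(sql)
        if template not in self.plans:
            self.compilations += 1  # compilation happens once per shape
            self.plans[template] = self._compile(template)
        return self.plans[template], params


cache = PlanCache()
plan_a, params_a = cache.get_plan("SELECT * FROM users WHERE id = 42")
plan_b, params_b = cache.get_plan("SELECT * FROM users WHERE id = 7")
print(plan_a is plan_b, cache.compilations)  # True 1
```

Both queries reduce to the template `SELECT * FROM users WHERE id = ?`, so the second call reuses the plan built for the first and only the parameter value differs.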

Highly Scalable

MemSQL is designed to be highly scalable. The cluster can be scaled out at any time to provide increased storage capacity and processing power. Sharding is done automatically, and the cluster re-balances data and workload distribution. Because the data is stored in memory, queries run at full speed on clusters built from commodity hardware.

In addition to being fast, consistent, and scalable, MemSQL is also durable. Data is replicated across shards, and a node can go down with negligible effect on performance. Leaf nodes regularly commit transactions to disk as logs, and full backups are periodically committed as compressed snapshots of the entire database. If any node goes down, it can restart using one of these snapshots.
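
Automatic sharding and re-balancing can be pictured with a small hash-partitioning sketch. Everything here is an assumption made for illustration: the fixed partition count, the CRC32 hash, the round-robin placement, and the function names are hypothetical and are not taken from MemSQL. The point is only that rows map to partitions by hashing a shard key, and that scaling out moves whole partitions between leaves rather than re-hashing rows.

```python
import zlib
from collections import Counter
from typing import Dict, List

NUM_PARTITIONS = 8  # hypothetical fixed partition count for the database


def partition_for(shard_key: str) -> int:
    """Map a shard key to a partition with a deterministic hash."""
    return zlib.crc32(shard_key.encode("utf-8")) % NUM_PARTITIONS


def assign_partitions(leaves: List[str]) -> Dict[int, str]:
    """Spread partitions over the available leaves round-robin."""
    return {p: leaves[p % len(leaves)] for p in range(NUM_PARTITIONS)}


def leaf_for(shard_key: str, assignment: Dict[int, str]) -> str:
    """Find the leaf that currently hosts the row's partition."""
    return assignment[partition_for(shard_key)]


two_leaves = assign_partitions(["leaf-1", "leaf-2"])
four_leaves = assign_partitions(["leaf-1", "leaf-2", "leaf-3", "leaf-4"])

print(Counter(two_leaves.values()))   # 4 partitions on each of 2 leaves
print(Counter(four_leaves.values()))  # 2 partitions on each of 4 leaves
print(leaf_for("user:1001", two_leaves), leaf_for("user:1001", four_leaves))
```

A row's partition never changes in this sketch. Adding leaves only changes which leaf hosts that partition, which keeps the amount of data moved during a scale-out bounded.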

Highly Compatible

MemSQL is an ODBC-compatible database. It is also wire-protocol compatible with MySQL, which means that applications that use a MySQL driver can connect to and use MemSQL transparently. MemSQL supports a subset of the MySQL syntax, plus extensions for advanced features not in MySQL, such as distributed joins, geospatial data, and JSON.
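
Because the wire protocol matches MySQL, an ordinary MySQL client library should be able to talk to an aggregator. The sketch below uses PyMySQL, a third-party MySQL driver for Python, as one example. The host, port, user, and empty password are placeholder assumptions for a local test cluster, and the helper names `memsql_connection_settings` and `fetch_one` are hypothetical.

```python
from typing import Any, Dict, Optional, Tuple


def memsql_connection_settings(host: str = "127.0.0.1", port: int = 3306) -> Dict[str, Any]:
    """Build keyword arguments for a MySQL driver (placeholder values, adjust as needed)."""
    return {"host": host, "port": port, "user": "root", "password": ""}


def fetch_one(sql: str, settings: Dict[str, Any]) -> Optional[Tuple[Any, ...]]:
    """Run one statement through a stock MySQL driver and return the first row."""
    import pymysql  # third-party driver, imported lazily so the module loads without it

    connection = pymysql.connect(**settings)
    try:
        with connection.cursor() as cursor:
            cursor.execute(sql)
            return cursor.fetchone()
    finally:
        connection.close()


settings = memsql_connection_settings()
print(settings["host"], settings["port"])
```

With a cluster running, `fetch_one("SELECT 1", settings)` would send the statement to the aggregator exactly as it would to a MySQL server. Nothing in the client code is specific to MemSQL.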

In-Memory and On-Disk Storage

MemSQL supports storing and processing data using a completely in-memory rowstore or a disk-backed columnstore. The in-memory rowstore gives the best performance for transactional workloads. The columnstore is best for cost-effective storage of large amounts of historical data. Combining the rowstore and columnstore engines allows real-time and historical data to be merged in a single query.
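
The difference between the two layouts can be shown with a toy example. This is a conceptual sketch only and says nothing about MemSQL's actual storage structures: the dictionaries, sample rows, and function names are hypothetical. A row-oriented layout keeps each record together, which suits point lookups and updates, while a column-oriented layout keeps each column together, which suits scans and aggregates over many rows.

```python
from typing import Any, Dict, List

rows: List[Dict[str, Any]] = [
    {"id": 1, "region": "east", "amount": 10.0},
    {"id": 2, "region": "west", "amount": 25.0},
    {"id": 3, "region": "east", "amount": 5.0},
]

# Row-oriented layout: one entry per record, indexed by primary key.
rowstore: Dict[int, Dict[str, Any]] = {row["id"]: row for row in rows}

# Column-oriented layout: one list per column, values in the same row order.
columnstore: Dict[str, List[Any]] = {
    name: [row[name] for row in rows] for name in rows[0]
}


def lookup(row_id: int) -> Dict[str, Any]:
    """Transactional access pattern: fetch one whole record by key."""
    return rowstore[row_id]


def total(column: str) -> float:
    """Analytical access pattern: scan a single column across all records."""
    return sum(columnstore[column])


print(lookup(2)["region"])  # west
print(total("amount"))      # 40.0
```

The aggregate touches only the `amount` list and never reads `id` or `region`, which is the property that makes columnar storage attractive for large historical data sets.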

Tight Spark Integration for Real-Time Data Streaming

MemSQL has tight Apache Spark integration, giving MemSQL users a simple way to create and manage real-time data pipelines. Through MemSQL Ops, users can install Apache Spark with one click and create custom data extractors and transformers for streaming real-time data into MemSQL.

Where to Go from Here