This page addresses some frequently asked questions about MemSQL.
Is MemSQL a storage engine for MySQL?
No. MemSQL is a standalone database that is compatible with the MySQL client. MemSQL includes its own storage engine and SQL-based execution engine built around lock-free data structures and machine code generation.
Is MemSQL a row-based or column-based store?
MemSQL provides both in-memory row-based and on-disk column-based stores.
The in-memory row store:
- Works best for mixed transactional and analytical workloads
- Provides low latency and highly concurrent reads and writes of individual rows as well as sophisticated analytical SQL queries
- Supports PRIMARY and UNIQUE keys
- Supports geospatial indexes
- Has longer recovery times (as the entire table needs to be loaded into memory)
The on-disk column store:
- Works best for analytical workloads
- Allows tables larger than the amount of available RAM in the cluster
- Uses compression (which lowers disk usage and accelerates replication)
- Provides fast and efficient scans of large datasets
- Provides sorted columnstore indexes
- Is optimized for batch
- Requires more expensive query compilation (as compared to the row store)
How does MemSQL’s in-memory lock-free storage engine work?
MemSQL’s storage engine uses multi-version concurrency control with lock-free skip lists and lock-free hash tables which allow highly concurrent reads and writes at very high throughput. Reads in MemSQL are never blocked, but updates to the same row can conflict with logical locks.
What is the advantage of MemSQL over traditional databases like Oracle, SQL Server or MySQL with data in a “ramdisk” or large buffer pools?
Two common techniques for leveraging large amounts of memory in traditional databases is storing data files on a “ramdisk” or running a disk-based storage engine with a large buffer pool. For example, MySQL performs much better with InnoDB configured to use a large buffer pool.
While running an existing storage engine like InnoDB in memory can alleviate some of the bottlenecks involved with disk, MemSQL has four distinguishing memory-optimized features that enable it to perform significantly better than disk-based storage engines running in memory:
- MemSQL is a distributed scale-out system. MemSQL scales to thousands of machines on commodity hardware.
- No buffer pool. Traditional databases manage a global buffer pool since they assume that the dataset can’t fit into memory. The buffer pool is a resource shared across all databases and all tables, which creates significant contention.
- Lock-free data structures. MemSQL uses memory-optimized, lock-free skip lists and hash tables as its indexes. Unlike B+ Trees, these data structures are designed from the ground up to be fast in memory.
- Code generation. Lock-free data structures are so fast that dynamic SQL interpretation quickly becomes the limiting factor for query execution. With code generation, MemSQL compiles SQL down to native code for maximum performance.
What is the advantage of MemSQL over other distributed databases?
- Full SQL. MemSQL support full SQL and transactional semantics.
- Storage pyramid. MemSQL combines row and column store engines tuned for memory and flash storage.
- Scales on commodity hardware. MemSQL doesn’t require exotic hardware and can run on premises or in the cloud.
- Enterprise ready. MemSQL supports a large number of enterprise security and manageability features.
What is MemSQL not for?
MemSQL excels at real-time and high throughput query use cases. It is a great general purpose database for running both transactional and analytic workloads. However, there are use cases which MemSQL is not designed to run. Some of these are listed below:
Object store. MemSQL is not designed to be a blob store or “data lake”. It is designed for high value data that is structured or semi-structured and ready to query. MemSQL has open-source connectors for integrating with a variety of great object stores, including Amazon S3 and Hadoop File System (HDFS). See Data Loading for more information.
Running on low hardware. MemSQL is not designed to run on “micro instances”, mobile phones or other low-powered computers. It is designed to run on machines with at least 4 cores and 8GB of RAM. The easiest way to run MemSQL for development is to use the MemSQL Quick Start Docker Container; see Quick Start with Docker.
In-process database. MemSQL is not run as a library or in-process with an application. MemSQL is a distributed database which runs in separate processes from the application, and applications connect to MemSQL via a client driver.
Serializable transactions. MemSQL supports extremely fast, distributed “READ-COMMITTED” transactions, but it is not suitable for applications which require “SERIALIZABLE” transactions.
Full-text search. MemSQL does not have built-in full-text search capabilities. MemSQL supports basic search queries with the
REGEXPoperations and is compatible with full-text search technologies like Sphinx and ElasticSearch/Lucene/Solr, which can connect to MySQL-compatible databases.
Why does CREATE TABLE take so long in MemSQL?
In order to speed up the compilation of queries in MemSQL,
CREATE TABLE will precompile code used to access and update table indexes. This is a one-time operation for each unique table schema, and compiled table code will be cached on disk for future uses.
Why do MemSQL queries typically run faster the second time they are executed?
Traditional relational database management systems interpret SQL queries the same way interpreters for languages like Python and Ruby run programs. The first time a MemSQL server encounters a given query shape, it will optimize and compile the query for future invocations. This incurs overhead which does not depend on the amount of data to be processed, but rather the complexity of the query. The process of code generation involves extracting parameters from the query then transforming the normalized query into a MemSQL-specific intermediate representation tailored to the system. Subsequent requests with the same shape can reuse this plan to complete both quickly and consistently. Starting with MemSQL 5, MemSQL embeds an industrial compiler (LLVM) for code generation, leading to fast query performance for even the first time queries are run.
How much can you change a query before it needs to be recompiled?
If you only change an integer or string constant in a query, it will not require recompilation.
MemSQL strips out numeric and string parameters from a query and attaches the resulting string to a compiled plan. This string is referred to as a parameterized query. For example,
SELECT * FROM foo WHERE id=22 AND name='bar' is parameterized to
SELECT * FROM foo WHERE id=@ AND name=^.
You can list the distinct parameterized queries corresponding to all executed queries by running
The one exception to this rule is constants in the projection clause without an alias. These constants are compiled directly into the plan’s assembly code for performance reasons. For example,
SELECT id + 1, id + 2 AS z FROM foo is converted to
SELECT id + 1, id + @ AS z FROM foo.
What is the durability guaranteed by MemSQL?
MemSQL provides several options which control tradeoffs between performance and durability (see Durability and Recovery). In its most durable state, MemSQL will not lose any transactions which have been acknowledged. Many users, however, find it useful to risk a bounded amount of data loss to for greatly improved latencies.
Can I configure MemSQL to be fully durable?
Yes. You can get full durability at the cost of increased query latency by setting
transaction_buffer=0 in memsql.cnf.
Does being in-memory mean that MemSQL will lose all data upon system failure or restart?
No. Unlike traditional relational database management systems, MemSQL uses RAM as the primary storage for data. However, MemSQL continuously backs up data to disk with transaction logs and periodic snapshots. These features can be tuned all the way from synchronous durability (every write transaction is recorded on disk before the transaction completes) to purely in-memory durability (maximum sustained throughput on writes).
On restart, MemSQL uses the snapshot and log files to recover its state to what it was before shutting down. Because the recovery process is parallelized across CPUs, the bottleneck in this process is the sequential hard drive speed.
See Using Durability and Recovery for more information.
If MemSQL writes data to disk, how can it be faster than disk-based databases?
Traditional relational database management systems use disk as the primary storage for data and memory as a cache. Managing this caching layer adds bookkeeping overhead and contention thus reducing throughput and concurrency. These constraints result in random read and write I/O, which puts significant pressure on both rotational and solid state disks.
On the other hand, MemSQL stores data primarily in memory and backs it up to disk in a compact format. As a result, MemSQL uses only sequential I/O and the transaction log size is significantly smaller. This I/O pattern is optimized for both rotational and solid state disks. Furthermore, reads in MemSQL can use memory-optimized lock-free skip lists and hash tables that cannot be managed in a buffer pool.
What isolation levels does MemSQL provide?
MemSQL provides the “READ COMMITTED” isolation level. This guarantees that no transaction will read any uncommitted data from another transaction. Furthermore, once a change is observed in one transaction, it will be visible to all future transactions.
Unlike the “REPEATABLE READ” or “SNAPSHOT” isolation level, “READ COMMITTED” isolation level does not guarantee that a row will remain the same for every read query in a given transaction. Applications that use MemSQL should take this into account.
Even though regular transactions use “READ COMMITTED” isolation level, backups created using
BACKUP command use “SNAPSHOT” isolation level.
Deployment and Management
How can I get a copy of MemSQL?
Visit the download page to obtain a download link for MemSQL. MemSQL offers a free community edition and an enterprise edition available for 30-day trial and purchase.
Does MemSQL run in the cloud?
Yes. The easiest way to get up and running with MemSQL is the MemSQL CloudFormation template for Amazon. Once you sign up for a license key, visit Quick Start with Amazon Webservices or Quick Start with Microsoft Azure for more instructions.
Does MemSQL run on Windows?
Yes, the MemSQL Quick Start “cluster-in-a-box” packaged via Docker can run on Windows. A MemSQL “cluster-in-a-box” is a MemSQL master aggregator and leaf node collocated on a single host. This configuration is intended to provide familiarity with MemSQL but is not recommended for production. Note that Docker Toolbox is required to configure Docker on Windows environments. See the Quick Start with Docker section for more information.
MemSQL production environments run on modern 64-bit Linux versions, as described in the System Requirements section.
You can also run MemSQL on Microsoft Azure, using Azure Linux instances.
What hardware is recommended for MemSQL?
See System Requirements.
On which Linux distribution does MemSQL run best?
MemSQL is developed and tested most extensively on Ubuntu 14.04 and CentOS 6.4. See System Requirements for the full list of Linux distributions that are officially supported.
How much disk space should I allocate for MemSQL?
MemSQL uses disk for three types of storage:
- Snapshot and log files that backup row store data. You should allocate about as much space on disk for this purpose as memory on your machine.
- Compressed columnstore data files that contain column store data in MemSQL.
- Object files that are the result of code generation. This includes for Data Definition Language (DDL) queries like
ALTER TABLEand for Data Manipulation Language (DML) queries like
SELECT. On average, these usually require about 0.1 MB per unique plan.
Therefore, you should allocate roughly the amount of memory on your machine + space for compressed column store data + 0.1 MB for each plan. Note that the exact disk requirements will vary with the application, so it is advisable (and usually cheap) to allocate some extra disk space.
What happens if I run out of memory?
If the amount of memory used by row store tables (
SHOW STATUS EXTENDED) is greater than the
maximum_table_memory global variable (from
SHOW GLOBAL VARIABLES), MemSQL will refuse to start new write queries (
LOAD DATA). Note that
DELETE queries are not affected by this limit.
If a currently running query runs out of memory, it will rollback and notify the client of the error. See Memory Management for more information.
What happens if I run out of disk space?
If the amount of available disk space (in the
<MEMSQL HOME>/data directory) is less than the
minimal_disk_space global variable (from
SHOW GLOBAL VARIABLES), MemSQL will refuse to start new write queries (
LOAD DATA). Note that
DELETE queries are not affected by this limit, and the database will remain online for reads.
If a currently running write query exhausts the available disk space before making its changes durable, it will wait until more disk space becomes available before continuing. Queries may appear to “hang” when this happens. To determine how many queries and background threads are
waiting for disk space run
SHOW STATUS EXTENDED LIKE 'Threads_waiting_for_disk_space'.
How does MemSQL shard tables?
Every distributed table (except reference tables, which are replicated in whole on each “leaf” node) has a
SHARD KEY that specifies which columns of a row to hash to determine what partition a row should reside in. When rows are inserted into a sharded table, they are hashed by the table’s shard key and sent to the leaf carrying the corresponding partition. This technique is commonly referred to as hash-based partitioning. You can choose how to shard each table by specifying its
SHARD KEY as part of the
CREATE TABLE statement. See Distributed SQL for more details.
What are aggregator and leaf nodes?
MemSQL stores and computes data on leaf nodes. You can linearly scale both storage and computational power by adding more leaf nodes. Clients query an aggregator node, which in turn queries one or more leaf nodes to collect the rows required to execute the query. Multiple aggregators nodes perform the same functions with respect to executing Data Manipulation Language (DML) queries and allow clients to load-balance queries across the aggregators. Leaf nodes should not be queried directly except for maintenance purposes in exceptional situations.
What is a “master aggregator”?
The Master Aggregator is an aggregator responsible for executing DDL and clustering operations (e.g.
ADD LEAF ... or
What happens if the master aggregator crashes?
If the Master Aggregator becomes unresponsive, clients can continue to execute DML queries (e.g.
SELECT) against the other aggregators, but DDL and clustering operations can not be performed until the master aggregator is revived or another aggregator is “promoted” to be the master aggregator.
How do I add nodes to MemSQL?
From within MemSQL Ops, select the “Add Host” button in the top right to register a physical or virtual machine with MemSQL Ops. Then, provision it as an aggregator or leaf by selecting the adjacent button “Add Node”. To increase the capacity of the cluster, provision new leaf nodes. To increase the data loading bandwidth or load balancing bandwidth to the cluster, provision new aggregator nodes.
If you are not using MemSQL Ops, you can use the clustering commands detailed in the Cluster Administration section.
How many aggregator and leaf nodes do I need?
MemSQL stores data in leaf nodes, so you need enough leaf nodes to store all your data in memory. If you are replicating data (redundancy level 2), you need twice as many leaf nodes.
The recommended number of aggregators depends on your use case. If, for instance, your cluster is being used for more than one type of workload (for example, it is the backend for a web application and also being queried by analysts), it is probably best to have multiple aggregators, or pools of aggregators, for these separate workloads. Aside from distribution of workload, the most significant factor to consider is network bandwidth. As a rule of thumb, clusters with 50 nodes or fewer should have about a 5:1 leaf to aggregator ratio. Clusters with more than 50 nodes can have closer to a 10:1 leaf to aggregator ratio. Note that you can also add nodes to a cluster to tune performance after it is up and running.
The appropriate ratio of aggregators to leaves also depends on the type of workload running. Transactional workloads that run many small queries or queries that involve only a single partition require more aggregators, since those queries interact with one aggregator and one leaf. Analytical workloads,especially those involving distributed joins, require fewer aggregators because almost all the work is performed on the leaves.
Can I query the leaf nodes individually?
Yes, but this is not recommended. It only should be done in troubleshooting scenarios.
After I ran a single host install, why are there 2 MemSQL nodes (or 4 MemSQL processes) running?
This is the “single host cluster” setup. Your machine is running both a master aggregator process and a leaf process, and hence “2 MemSQL Nodes” is shown (each MemSQL instance or node contains two processes). Make sure to send your queries to the aggregator.
Can I JOIN multiple sharded tables in a query?
Yes. MemSQL supports advanced join capabilities and will automatically redistribute data as necessary to complete a query. MemSQL can also take advantage of collocated data across shard keys and reference tables to reduce data movement. See Distributed Joins.
Can I optimize a distributed join involving a small, static table?
Yes, a small table which does not change frequently can be made into a reference table, which is replicated to all the leaf nodes. This ensures that the table does not need to be moved when joined against, at the cost of using more memory. See Distributed SQL.
Why do I get errors about UNIQUE KEYs?
MemSQL does not support unique keys unless the rows in the unique key are a superset of the rows in the shard key. For more information about the shard key, see Distributed SQL.
Import and Backup
How can I backup a MemSQL database?
MemSQL supports consistent, online, cluster-wide
RESTORE operations that do not require blocking write operations on the database like
mysqldump does. See Backing up and Restoring MemSQL section for more information.
How can I import data from MySQL, Postgres, MS-SQL etc?
How can I easily copy a table?
You can use
CREATE TABLE dest AS SELECT * FROM source. See CREATE TABLE.
Or, create a new empty table using the schema of the original table from
SHOW CREATE TABLE source and copy data from source table into the new table using
INSERT INTO dest SELECT * FROM source.
How can I easily copy a database?
There are two options:
- Use replication. Run
REPLICATE DATABASE dest_db FROM user@host:port/src_db, and after it fully synchronizes run
STOP REPLICATING dest_db. See Replication .
- BACKUP the database and RESTORE it on the same (or a different) cluster under a different name.
Where are the important data files (recovery log, binary logs, snapshots, data files etc)?
SHOW STATUS EXTENDED LIKE ‘%_directory’ to get the full paths.
How are MemSQL and Apache Spark related?
MemSQL and Apache Spark are both distributed, in-memory technologies. MemSQL is a SQL database, while Spark is a general computation framework. MemSQL has tight integration with Apache Spark through its integrated MemSQL Spark solution offerings. Specifically, within MemSQL Ops, users can deploy a Spark cluster colocated with a MemSQL cluster, and leverage MemSQL and Spark functionality together. For instance, with MemSQL and Spark clusters deployed, users can use MemSQL Streamliner to quickly configure, run, and manage real-time data pipelines - extracting data from real-time sources such as Kafka, transforming the data within Spark, and finally loading the data into MemSQL.
What are the differences between MemSQL and Spark SQL?
- Spark SQL treats datasets (RDDs) as immutable - there is currently no concept of an INSERT, UPDATE, or DELETE. You could express these concepts as a transformation, but this operation returns a new RDD rather than updating the dataset in place. In contrast, MemSQL is an operational database with full transactional semantics.
- MemSQL supports updatable relational database indexes. The closest analogue in Spark is IndexRDD, which is currently under development, and provides updateable key/value indexes.
How do MemSQL and Spark software interact with each other?
Manually through the MemSQL Spark Connector: The MemSQL Spark Connector is an open source library that can be added as a dependency for any Spark application. Under the hood, it creates a mapping between MemSQL database partitions and Spark RDD partitions. It takes advantage of both systems’ distributed architectures to load data in parallel. The connector includes the
MemSQLRDDclass - allowing the user to create a Spark RDD from the result of a SQL query in MemSQL, and the
saveToMemSQLfunction which makes it easy to write data into MemSQL after processing in Spark.
Through MemSQL Ops: MemSQL Ops can deploy a Spark cluster and link it to MemSQL by leveraging the MemSQL Spark Connector and the MemSQL Spark Interface under the hood. The MemSQL Spark Interface is a Spark application that serves as the interface for MemSQL Ops to create and manage real-time data pipelines within Spark.
Does MemSQL Streamliner support “exactly once” semantics when consuming data from Kafka?
Yes. Streamliner has very precise “at least once” semantics when consuming data from Kafka. MemSQL leverages its transactional nature to provide stronger semantics than what Kafka and Spark can offer out of the box, with much higher performance and precision (i.e. the “at least once” window of repeated values is minimal). That said, if a user really wants “exactly-once” semantics, they can achieve it by loading into the the row store with a unique key.
SQL push down
What happens if SQL push down fails?
The MemSQL Connector takes a best effort approach towards query push down. While Spark is preparing the query for execution, the MemSQL push down strategy attempts to push down every subtree starting with the entire query. If anything fails, we simply leave the tree as is and Spark handles executing the unsupported section of the tree.
How can I check to see if a query is pushed down?
Every DataFrame has a method called
.explain which will print the final plan before execution. If the first element in that plan is a
MemSQLPhysicalRDD then the DataFrame has been fully pushed down.
What SQL push downs are not supported?
We are constantly improving push down, so the best thing to do is just try your query and then use
.explain to check to see if it got pushed down. If you find a query which is not pushed down, please raise an Github issue on the Connector repo.
Does MemSQL support compression?
Yes, column store tables are compressed by default.
Does MemSQL perform random IO?
No, only sequential IO is used by MemSQL.
What are the index types MemSQL supports?
The in-memory row store supports skip lists, hashtables and geospatial indexes. The on-disk column store supports ordered columnstore indexes.
What is ERROR 2002, which appears when connecting to MemSQL with the stock MySQL client using default arguments?
ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (2)
When the MySQL client connects to
localhost, it attempts to use a socket file instead of TCP/IP. The socket file used is specified in /etc/mysql/my.cnf when the MySQL client is installed on the system. This is a MySQL socket file, which MemSQL does not use by default. Therefore, connecting with
localhost attempts to connect to MySQL and not MemSQL.
There are two solutions:
Specify “127.0.0.1” as the host instead of localhost, i.e.
mysql -h 127.0.0.1 -u rootinstead of
mysql -h localhost -u root. Note that if you omit the host (
mysql -u root) the MySQL client will implicitly use
/etc/mysql/my.cnfyou should see this near the top of the file:
[client] port = 3306 socket = /var/run/mysqld/mysqld.sock
Change socket to the location of your MemSQL socket file. By default, this is
My hosts have less than minimum MemSQL system requirements. How can I make MemSQL run on those hosts?
MemSQL is designed to run on machines with at least 4 CPU cores and 8 GB RAM. MemSQL can run on machines with lower resources, but users who choose to run MemSQL this way should understand that this could lead to a less ideal experience.
The only users who should deploy MemSQL this way are those who are:
- Running MemSQL on a development environment with low system resources, fully understanding that their test and production systems should have at least 4 CPU cores and 8 GB RAM to function optimally.
- Configuring MemSQL on aggregator nodes after having run performance benchmarks on their MemSQL cluster and determining it is acceptable for those machines to run with low resources during peak production workloads.
To run MemSQL on systems below the minimum system requirements, do the following:
- Install MemSQL Ops using the
--ignore-min-requirementsflag when running the
- Go through the standard process of installing MemSQL using MemSQL Ops.
Add the following lines to the
memsql.cnffile for each specific MemSQL node you would like to lower the requirements for:
minimum_core_count = <NUM_CORES> minimum_memory_mb = <NUM_MB>
Restart the MemSQL nodes for which you have modified the
memsql.cnffile. You can do so using the