Product Solutions Support
Try MemSQL

MemSQL FAQ

This page addresses some frequently asked questions about MemSQL.

General

Is MemSQL a storage engine for MySQL?

No. MemSQL is a standalone database that is compatible with the MySQL client. MemSQL includes its own storage engine and SQL-based execution engine built around lock-free data structures and machine code generation.

Is MemSQL a row-based or column-based store?

MemSQL provides both in-memory row-based and on-disk column-based stores.

The in-memory row store:

The on-disk column store:

How does MemSQL’s in-memory lock-free storage engine work?

MemSQL’s storage engine uses multi-version concurrency control with lock-free skip lists and lock-free hash tables which allow highly concurrent reads and writes at very high throughput. Reads in MemSQL are never blocked, but updates to the same row can conflict with logical locks.

What is the advantage of MemSQL over traditional databases like Oracle, SQL Server or MySQL with data in a “ramdisk” or large buffer pools?

Two common techniques for leveraging large amounts of memory in traditional databases is storing data files on a “ramdisk” or running a disk-based storage engine with a large buffer pool. For example, MySQL performs much better with InnoDB configured to use a large buffer pool.

While running an existing storage engine like InnoDB in memory can alleviate some of the bottlenecks involved with disk, MemSQL has four distinguishing memory-optimized features that enable it to perform significantly better than disk-based storage engines running in memory:

What is the advantage of MemSQL over other distributed databases?

What is MemSQL not for?

MemSQL excels at real-time and high throughput query use cases. It is a great general purpose database for running both transactional and analytic workloads. However, there are use cases which MemSQL is not designed to run. Some of these are listed below:

Query Compilation

Why does CREATE TABLE take so long in MemSQL?

In order to speed up the compilation of queries in MemSQL, CREATE TABLE will precompile code used to access and update table indexes. This is a one-time operation for each unique table schema, and compiled table code will be cached on disk for future uses.

Why do MemSQL queries typically run faster the second time they are executed?

Traditional relational database management systems interpret SQL queries the same way interpreters for languages like Python and Ruby run programs. The first time a MemSQL server encounters a given query shape, it will optimize and compile the query for future invocations. This incurs overhead which does not depend on the amount of data to be processed, but rather the complexity of the query. The process of code generation involves extracting parameters from the query then transforming the normalized query into a MemSQL-specific intermediate representation tailored to the system. Subsequent requests with the same shape can reuse this plan to complete both quickly and consistently. Starting with MemSQL 5, MemSQL embeds an industrial compiler (LLVM) for code generation, leading to fast query performance for even the first time queries are run.

How much can you change a query before it needs to be recompiled?

If you only change an integer or string constant in a query, it will not require recompilation.

MemSQL strips out numeric and string parameters from a query and attaches the resulting string to a compiled plan. This string is referred to as a parameterized query. For example, SELECT * FROM foo WHERE id=22 AND name='bar' is parameterized to SELECT * FROM foo WHERE id=@ AND name=^.

You can list the distinct parameterized queries corresponding to all executed queries by running SHOW PLANCACHE.

The one exception to this rule is constants in the projection clause without an alias. These constants are compiled directly into the plan’s assembly code for performance reasons. For example, SELECT id + 1, id + 2 AS z FROM foo is converted to SELECT id + 1, id + @ AS z FROM foo.

Durability

What is the durability guaranteed by MemSQL?

MemSQL provides several options which control tradeoffs between performance and durability (see Durability and Recovery). In its most durable state, MemSQL will not lose any transactions which have been acknowledged. Many users, however, find it useful to risk a bounded amount of data loss to for greatly improved latencies.

Can I configure MemSQL to be fully durable?

Yes. You can get full durability at the cost of increased query latency by setting transaction_buffer=0 in memsql.cnf.

Does being in-memory mean that MemSQL will lose all data upon system failure or restart?

No. Unlike traditional relational database management systems, MemSQL uses RAM as the primary storage for data. However, MemSQL continuously backs up data to disk with transaction logs and periodic snapshots. These features can be tuned all the way from synchronous durability (every write transaction is recorded on disk before the transaction completes) to purely in-memory durability (maximum sustained throughput on writes).

On restart, MemSQL uses the snapshot and log files to recover its state to what it was before shutting down. Because the recovery process is parallelized across CPUs, the bottleneck in this process is the sequential hard drive speed.

See Using Durability and Recovery for more information.

If MemSQL writes data to disk, how can it be faster than disk-based databases?

Traditional relational database management systems use disk as the primary storage for data and memory as a cache. Managing this caching layer adds bookkeeping overhead and contention thus reducing throughput and concurrency. These constraints result in random read and write I/O, which puts significant pressure on both rotational and solid state disks.

On the other hand, MemSQL stores data primarily in memory and backs it up to disk in a compact format. As a result, MemSQL uses only sequential I/O and the transaction log size is significantly smaller. This I/O pattern is optimized for both rotational and solid state disks. Furthermore, reads in MemSQL can use memory-optimized lock-free skip lists and hash tables that cannot be managed in a buffer pool.

What isolation levels does MemSQL provide?

MemSQL provides the “READ COMMITTED” isolation level. This guarantees that no transaction will read any uncommitted data from another transaction. Furthermore, once a change is observed in one transaction, it will be visible to all future transactions.

Unlike the “REPEATABLE READ” or “SNAPSHOT” isolation level, “READ COMMITTED” isolation level does not guarantee that a row will remain the same for every read query in a given transaction. Applications that use MemSQL should take this into account.

Even though regular transactions use “READ COMMITTED” isolation level, backups created using BACKUP command use “SNAPSHOT” isolation level.

Deployment and Management

How can I get a copy of MemSQL?

Visit the download page to obtain a download link for MemSQL. We offer:

Does MemSQL run in the cloud?

Yes. The easiest way to get up and running with MemSQL is the MemSQL CloudFormation template for Amazon. Once you sign up for a license key, visit Quick Start with Amazon Webservices or Quick Start with Microsoft Azure for more instructions.

Does MemSQL run on Windows?

Yes, the MemSQL Quick Start “cluster-in-a-box” packaged via Docker can run on Windows. A MemSQL “cluster-in-a-box” is a MemSQL master aggregator and leaf node collocated on a single host. This configuration is intended to provide familiarity with MemSQL but is not recommended for production. Note that Docker Toolbox is required to configure Docker on Windows environments. See the Quick Start with Docker section for more information.

MemSQL production environments run on modern 64-bit Linux versions, as described in the System Requirements section.

You can also run MemSQL on Microsoft Azure, using Azure Linux instances.

See System Requirements.

On which Linux distribution does MemSQL run best?

MemSQL is developed and tested most extensively on Ubuntu 14.04 and CentOS 6.4. See System Requirements for the full list of Linux distributions that are officially supported.

How much disk space should I allocate for MemSQL?

MemSQL uses disk for three types of storage:

Therefore, you should allocate roughly the amount of memory on your machine + space for compressed column store data + 0.1 MB for each plan. Note that the exact disk requirements will vary with the application, so it is advisable (and usually cheap) to allocate some extra disk space.

What happens if I run out of memory?

If the amount of memory used by row store tables (Alloc_table_memory from SHOW STATUS EXTENDED) is greater than the maximum_table_memory global variable (from SHOW GLOBAL VARIABLES), MemSQL will refuse to start new write queries (INSERT, UPDATE and LOAD DATA). Note that DELETE queries are not affected by this limit.

If a currently running query runs out of memory, it will rollback and notify the client of the error. See Memory Management for more information.

What happens if I run out of disk space?

If the amount of available disk space (in the <MEMSQL HOME>/data directory) is less than the minimal_disk_space global variable (from SHOW GLOBAL VARIABLES), MemSQL will refuse to start new write queries (INSERT, UPDATE and LOAD DATA). Note that DELETE queries are not affected by this limit, and the database will remain online for reads.

If a currently running write query exhausts the available disk space before making its changes durable, it will wait until more disk space becomes available before continuing. Queries may appear to “hang” when this happens. To determine how many queries and background threads are waiting for disk space run SHOW STATUS EXTENDED LIKE 'Threads_waiting_for_disk_space'.

Clustering

How does MemSQL shard tables?

Every distributed table (except reference tables, which are replicated in whole on each “leaf” node) has a SHARD KEY that specifies which columns of a row to hash to determine what partition a row should reside in. When rows are inserted into a sharded table, they are hashed by the table’s shard key and sent to the leaf carrying the corresponding partition. This technique is commonly referred to as hash-based partitioning. You can choose how to shard each table by specifying its SHARD KEY as part of the CREATE TABLE statement. See Distributed SQL for more details.

What are aggregator and leaf nodes?

MemSQL stores and computes data on leaf nodes. You can linearly scale both storage and computational power by adding more leaf nodes. Clients query an aggregator node, which in turn queries one or more leaf nodes to collect the rows required to execute the query. Multiple aggregators nodes perform the same functions with respect to executing Data Manipulation Language (DML) queries and allow clients to load-balance queries across the aggregators. Leaf nodes should not be queried directly except for maintenance purposes in exceptional situations.

What is a “master aggregator”?

The Master Aggregator is an aggregator responsible for executing DDL and clustering operations (e.g. ADD LEAF ... or CREATE TABLE...).

What happens if the master aggregator crashes?

If the Master Aggregator becomes unresponsive, clients can continue to execute DML queries (e.g. INSERT and SELECT) against the other aggregators, but DDL and clustering operations can not be performed until the master aggregator is revived or another aggregator is “promoted” to be the master aggregator.

How do I add nodes to MemSQL?

From within MemSQL Ops, select the “Add Host” button in the top right to register a physical or virtual machine with MemSQL Ops. Then, provision it as an aggregator or leaf by selecting the adjacent button “Add Node”. To increase the capacity of the cluster, provision new leaf nodes. To increase the data loading bandwidth or load balancing bandwidth to the cluster, provision new aggregator nodes.

If you are not using MemSQL Ops, you can use the clustering commands detailed in the Cluster Administration section.

How many aggregator and leaf nodes do I need?

MemSQL stores data in leaf nodes, so you need enough leaf nodes to store all your data in memory. If you are replicating data (redundancy level 2), you need twice as many leaf nodes.

The recommended number of aggregators depends on your use case. If, for instance, your cluster is being used for more than one type of workload (for example, it is the backend for a web application and also being queried by analysts), it is probably best to have multiple aggregators, or pools of aggregators, for these separate workloads. Aside from distribution of workload, the most significant factor to consider is network bandwidth. As a rule of thumb, clusters with 50 nodes or fewer should have about a 5:1 leaf to aggregator ratio. Clusters with more than 50 nodes can have closer to a 10:1 leaf to aggregator ratio. Note that you can also add nodes to a cluster to tune performance after it is up and running.

The appropriate ratio of aggregators to leaves also depends on the type of workload running. Transactional workloads that run many small queries or queries that involve only a single partition require more aggregators, since those queries interact with one aggregator and one leaf. Analytical workloads,especially those involving distributed joins, require fewer aggregators because almost all the work is performed on the leaves.

Can I query the leaf nodes individually?

Yes, but this is not recommended. It only should be done in troubleshooting scenarios.

After I ran a single host install, why are there 2 MemSQL nodes (or 4 MemSQL processes) running?

This is the “single host cluster” setup. Your machine is running both a master aggregator process and a leaf process, and hence “2 MemSQL Nodes” is shown (each MemSQL instance or node contains two processes). Make sure to send your queries to the aggregator.

Query Execution

Can I JOIN multiple sharded tables in a query?

Yes. MemSQL supports advanced join capabilities and will automatically redistribute data as necessary to complete a query. MemSQL can also take advantage of collocated data across shard keys and reference tables to reduce data movement. See Distributed Joins.

Can I optimize a distributed join involving a small, static table?

Yes, a small table which does not change frequently can be made into a reference table, which is replicated to all the leaf nodes. This ensures that the table does not need to be moved when joined against, at the cost of using more memory. See Distributed SQL.

Why do I get errors about UNIQUE KEYs?

MemSQL does not support unique keys unless the rows in the unique key are a superset of the rows in the shard key. For more information about the shard key, see Distributed SQL.

Import and Backup

How can I backup a MemSQL database?

MemSQL supports consistent, online, cluster-wide BACKUP and RESTORE operations that do not require blocking write operations on the database like mysqldump does. See Backing up and Restoring MemSQL section for more information.

How can I import data from MySQL, Postgres, MS-SQL etc?

See How To Load Data Into MemSQL

How can I easily copy a table?

You can use CREATE TABLE dest AS SELECT * FROM source. See CREATE TABLE.

Or, create a new empty table using the schema of the original table from SHOW CREATE TABLE source and copy data from source table into the new table using INSERT INTO dest SELECT * FROM source.

How can I easily copy a database?

There are two options:

Where are the important data files (recovery log, binary logs, snapshots, data files etc)?

Run SHOW STATUS EXTENDED LIKE ‘%_directory’ to get the full paths.

MemSQL and Spark

MemSQL and Apache Spark are both distributed, in-memory technologies. MemSQL is a SQL database, while Spark is a general computation framework. MemSQL has tight integration with Apache Spark through its MemSQL Spark Connector offering. For instance, with MemSQL and Spark clusters deployed, users can extract data from real-time sources such as Kafka, run the data through a Spark machine learning library model, and store the model result into MemSQL to be persisted and queryable.

What are the differences between MemSQL and Spark SQL?

How do MemSQL and Spark software interact with each other?

Manually through the MemSQL Spark Connector:

The MemSQL Spark Connector is an open source library that can be added as a dependency for any Spark application. Under the hood, it creates a mapping between MemSQL database partitions and Spark RDD partitions. It takes advantage of both systems’ distributed architectures to load data in parallel. The connector includes the MemSQLRDD class - allowing the user to create a Spark RDD from the result of a SQL query in MemSQL, and the saveToMemSQL function which makes it easy to write data into MemSQL after processing in Spark.

SQL push down

What happens if SQL push down fails?

The MemSQL Connector takes a best effort approach towards query push down. While Spark is preparing the query for execution, the MemSQL push down strategy attempts to push down every subtree starting with the entire query. If anything fails, we simply leave the tree as is and Spark handles executing the unsupported section of the tree.

How can I check to see if a query is pushed down?

Every DataFrame has a method called .explain which will print the final plan before execution. If the first element in that plan is a MemSQLPhysicalRDD then the DataFrame has been fully pushed down.

What SQL push downs are not supported?

We are constantly improving push down, so the best thing to do is just try your query and then use .explain to check to see if it got pushed down. If you find a query which is not pushed down, please raise an Github issue on the Connector repo.

Miscellaneous

Does MemSQL support compression?

Yes, column store tables are compressed by default.

Does MemSQL perform random IO?

No, only sequential IO is used by MemSQL.

What are the index types MemSQL supports?

The in-memory row store supports skip lists, hashtables and geospatial indexes. The on-disk column store supports ordered columnstore indexes.

Troubleshooting

What is ERROR 2002, which appears when connecting to MemSQL with the stock MySQL client using default arguments?

ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (2)

When the MySQL client connects to localhost, it attempts to use a socket file instead of TCP/IP. The socket file used is specified in /etc/mysql/my.cnf when the MySQL client is installed on the system. This is a MySQL socket file, which MemSQL does not use by default. Therefore, connecting with localhost attempts to connect to MySQL and not MemSQL.

There are two solutions:

  1. Specify “127.0.0.1” as the host instead of localhost, i.e. mysql -h 127.0.0.1 -u root instead of mysql -h localhost -u root. Note that if you omit the host (mysql -u root) the MySQL client will implicitly use localhost.

  2. In /etc/mysql/my.cnf you should see this near the top of the file:

    [client]
    port          = 3306
    socket        = /var/run/mysqld/mysqld.sock
    

    Change socket to the location of your MemSQL socket file. By default, this is /var/lib/memsql/data/memsql.sock.

My hosts have less than minimum MemSQL system requirements. How can I make MemSQL run on those hosts?

MemSQL is designed to run on machines with at least 4 CPU cores and 8 GB RAM. MemSQL can run on machines with lower resources, but users who choose to run MemSQL this way should understand that this could lead to a less ideal experience.

The only users who should deploy MemSQL this way are those who are:

  1. Running MemSQL on a development environment with low system resources, fully understanding that their test and production systems should have at least 4 CPU cores and 8 GB RAM to function optimally.
  2. Configuring MemSQL on aggregator nodes after having run performance benchmarks on their MemSQL cluster and determining it is acceptable for those machines to run with low resources during peak production workloads.

To run MemSQL on systems below the minimum system requirements, do the following:

  1. Install MemSQL Ops using the --ignore-min-requirements flag when running the install.sh script.
  2. Go through the standard process of installing MemSQL using MemSQL Ops.
  3. Add the following lines to the memsql.cnf file for each specific MemSQL node you would like to lower the requirements for:

    minimum_core_count = <NUM_CORES>
    minimum_memory_mb = <NUM_MB>
    
  4. Restart the MemSQL nodes for which you have modified the memsql.cnf file. You can do so using the memsql-ops memsql-restart command.

Was this article useful?