Distributed Architecture

MemSQL’s distributed architecture is designed to be simple and fast. This section provides an overview of MemSQL clusters, including how the various components interact and what happens in the MemSQL environment when you run queries or perform administrative operations.

Design Principles

MemSQL’s distributed system is designed around a few key principles. This topic explains the high-level concepts behind how MemSQL accomplishes its design goals.

Cluster Components

A MemSQL cluster consists of two tiers: aggregators, which route queries and aggregate results, and leaves, which store data and execute queries against it.

Aggregators and leaves share the same MemSQL binary, so you can deploy the same build to every machine in the cluster.

The minimal setup for a MemSQL cluster is one aggregator (the master aggregator) and one leaf. You can add more aggregators, which read metadata from the master aggregator and can run DML commands on the leaves.
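For example, you can register additional nodes from the master aggregator with the ADD LEAF and ADD AGGREGATOR commands. The user, host, and port below are placeholders, and the exact syntax can vary by MemSQL version:

memsql> ADD LEAF 'root'@'leaf-host':3306;
memsql> ADD AGGREGATOR 'root'@'agg-host':3306;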

The number of deployed aggregator and leaf nodes determines the storage capacity and performance of a cluster. Typical deployments have a 5:1 ratio of leaf to aggregator nodes.
You can list all the aggregators and leaves in a cluster with the SHOW AGGREGATORS and SHOW LEAVES commands.

memsql> SHOW AGGREGATORS;
+------+-------+--------+--------------------+---------------------------+-------------------+
| Host | Port  | State  | Opened_Connections | Average_Roundtrip_Latency | Master_Aggregator |
+------+-------+--------+--------------------+---------------------------+-------------------+
|      |  3306 | online |                  0 |                      NULL |                 1 |
|      |  3306 | online |                  1 |                     0.200 |                 0 |
+------+-------+--------+--------------------+---------------------------+-------------------+
2 rows in set (0.00 sec)

memsql> SHOW LEAVES;
+------+-------+--------------------+-----------+-----------+--------+--------------------+---------------------------+
| Host | Port  | Availability_Group | Pair_Host | Pair_Port | State  | Opened_Connections | Average_Roundtrip_Latency |
+------+-------+--------------------+-----------+-----------+--------+--------------------+---------------------------+
|      |  3306 |                  1 |           |      3306 | online |                  5 |                     0.299 |
|      |  3306 |                  1 |           |      3306 | online |                  5 |                     0.293 |
|      |  3306 |                  2 |           |      3306 | online |                  1 |                     0.271 |
|      |  3306 |                  2 |           |      3306 | online |                  1 |                     0.264 |
+------+-------+--------------------+-----------+-----------+--------+--------------------+---------------------------+
4 rows in set (0.00 sec)

Cluster Communication

Except for the transfer of data for replication and heartbeats, all communication between MemSQL nodes is implemented with SQL commands.

Querying MemSQL

The first time a query is executed against MemSQL, it is compiled and cached in memory (see Code Generation).

User queries are always directed to an aggregator. Except for DDL operations and writes to reference tables, which must go through the master aggregator, any query can be run against any aggregator.

Queries that involve only reference tables are executed directly on the aggregator. The aggregator does not send these queries to the leaves because reference tables are fully replicated to every aggregator and leaf.
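Reference tables are declared with the REFERENCE keyword at creation time. A minimal example, with illustrative table and column names:

memsql> CREATE REFERENCE TABLE countries (
    ->   code CHAR(2) PRIMARY KEY,
    ->   name VARCHAR(64)
    -> );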

Queries that involve sharded tables are more involved: the aggregator rewrites the query into queries against the individual partitions, sends them to the leaves, and merges the partial results before returning them to the client.

Some queries involve significantly more query transformation and aggregation logic, but they follow the same general flow. Note that running EXPLAIN on a query will show the planned division of labor between aggregator and leaves, including the rewritten queries that will be sent to the leaves.
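For example, to inspect how a fan-out aggregation will be executed (table and column names are illustrative):

memsql> EXPLAIN SELECT user_id, COUNT(*) FROM events GROUP BY user_id;

The plan shows which parts of the query the leaves run and which aggregation steps the aggregator performs on the combined results.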

Data Distribution

MemSQL automatically shards data on distributed tables by hashing each row’s primary key (hash partitioning). Because each primary key is unique and the hash function is roughly uniform, the cluster is able to enforce a relatively even data distribution and minimize data skew.

At CREATE DATABASE time, MemSQL splits the database into a number of partitions. Each partition owns an equal portion of the hash range. You can specify the number of partitions explicitly with the PARTITIONS=X option to CREATE DATABASE. By default the total number of partitions is 8 times the number of leaves. MemSQL distributes partitions evenly among available leaves.
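For example, on a cluster with 4 leaves the default is 32 partitions; you can override that explicitly at creation time (database name is illustrative):

memsql> CREATE DATABASE analytics PARTITIONS = 32;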

Each partition is implemented as a database on a leaf. When a sharded table is created, it is split according to the number of partitions of its encapsulating database, and each partition holds its slice of the table’s data. Secondary indexes are managed within each partition, and MemSQL currently requires that unique indexes be prefixed by the primary key. If you run a query that seeks on a secondary index, the aggregator fans the query out across the cluster and, in parallel, each partition employs its local secondary index.
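For example, the following sharded table uses its primary key as the shard key and carries a secondary index that is maintained within each partition (table and column names are illustrative):

memsql> CREATE TABLE events (
    ->   id BIGINT NOT NULL,
    ->   user_id BIGINT,
    ->   PRIMARY KEY (id),
    ->   KEY (user_id)
    -> );

Per the restriction above, a UNIQUE KEY on this table would have to be prefixed by id.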

Queries that match the shard key exactly (INSERT queries and selective UPDATE, DELETE, and SELECT queries) are routed to a single leaf. Otherwise, the aggregator sends the query across the cluster and aggregates results. You can use EXPLAIN on the aggregator to examine the generated query for the leaves and the query distribution strategy.
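To illustrate, assume a sharded table t whose primary key id is also its shard key (names are illustrative):

memsql> SELECT * FROM t WHERE id = 42;           -- matches the shard key: routed to a single leaf
memsql> SELECT COUNT(*) FROM t WHERE name = 'x'; -- no shard-key match: fans out to every partition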

Availability Groups

An availability group is a set of leaves that store data redundantly to ensure high availability. Each availability group contains a copy of every partition in the system, some as masters and some as slaves. Currently, MemSQL supports up to two availability groups. You can set the number of availability groups via the redundancy_level variable on the master aggregator. The rest of this section discusses the redundancy-2 case.
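For example, to enable redundancy-2 you would set the variable on the master aggregator; depending on your MemSQL version, this may instead need to be set in the node’s configuration file before startup:

memsql> SET @@GLOBAL.redundancy_level = 2;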

Each leaf in an availability group has a corresponding pair node in the other availability group. A leaf and its pair share the same set of partitions, but the masters are split evenly between them. In the event of a failure, MemSQL will automatically promote slave partitions on a leaf’s pair.

By default, the ADD LEAF command will add a leaf into the smaller of the two groups. However, if you know your cluster’s topology in advance, you can specify the group explicitly with the INTO GROUP N suffix. By grouping together machines that share resources like a network switch or power supply, you can isolate common hardware failures to a single group and dramatically improve the cluster’s uptime.
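For example, to place a new leaf into a specific group (user, host, and port are placeholders):

memsql> ADD LEAF 'root'@'10.0.2.15':3306 INTO GROUP 2;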

The availability group each leaf belongs to is displayed in the output of the SHOW LEAVES command.
