Detecting and Resolving Data Skew min read


An important property of a distributed database is that the data gets distributed more or less evenly. In rare cases the data may be “skewed” out of balance. This topic discusses how skew can happen, how to detect it, and how to resolve it.

“Skew” is a condition in which a table’s data is unevenly balanced among partitions in the cluster. Probability being what it is, small amounts of skew are inevitable and harmless. By default, MemSQL distributes data based on a hash of the primary key. Since the hash values are unique and evenly distributed, this keeps skew to a minimum.

You can tell MemSQL to shard a table by a set of columns instead of the primary key. For example, consider a table of URLs from web domains. It might be defined as follows:

CREATE TABLE urls (
    id BIGINT,
    domain_id BIGINT,
    path VARCHAR(8192),
    first_seen INT UNSIGNED NOT NULL,
    crawl count INT UNSIGNED NOT NULL,
    ...
    SHARD KEY (domain_id)
);

As you can see, we are sharding on domain_id. Queries made against this table will compile and run just fine. However, it’s probable that some domains will have many more times the number of URLs. For example, the partition that contains links for “youtube.com” will almost certainly have more rows than average, throwing off the balance of the cluster. That unlucky partition containing “youtube.com” will not only have to store more data, it will probably be asked to do more work during SELECT, UPDATE, and DELETE queries. For best performance, you should choose a shard key that minimizes data skew.

Measuring and Detecting Skew

Info

While bulk-loading data, recovering from a restart, or after deleting lots of data, it’s possible that one partition may temporarily show more memory use and memory skew than others, as memory allocation and garbage collection operations may not take effect immediately across partitions. This is normal, and corrects itself over time as the system reaches a new steady state.

You can use the following methods to measure and detect data skew.

Method 1: Inspect memory settings

With any version of MemSQL, you can manually compare maximum_table_memory and Alloc_table_memory (memory used by tables) on each node:

SHOW VARIABLES LIKE "maximum_%";
****
+----------------------+-------+
| Variable_name        | Value |
+----------------------+-------+
| maximum_memory       | 48291 |
| maximum_table_memory | 43462 |
+----------------------+-------+
SHOW STATUS EXTENDED LIKE "Alloc_table_memory";
****
+--------------------+----------+
| Variable_name      | Value    |
+--------------------+----------+
| Alloc_table_memory | 43462 MB |
+--------------------+----------+
1 row in set (0.01 sec)

If memory consumption is significantly higher on specific nodes, skew is present. If memory consumption is fairly uniform across leaves, no skew is present.

Method 2: Find the number of rows and memory use per partition

The built-in INFORMATION_SCHEMA database contains metadata about tables, columns, indexes, and partitions. In particular, the TABLE_STATISTICS table contains one row per table partition, and can be used to see if a given partition has a lot more rows than average:

SELECT
    DATABASE_NAME,
    TABLE_NAME,
    ORDINAL AS PARTITION_ID,
    ROWS,
    MEMORY_USE
FROM INFORMATION_SCHEMA.TABLE_STATISTICS
WHERE TABLE_NAME = 'domains';
****
+---------------+------------+--------------+-------+------------+
| DATABASE_NAME | TABLE_NAME | PARTITION_ID | ROWS  | MEMORY_USE |
+---------------+------------+--------------+-------+------------+
| web           | domains    | 0            | 85916 |   11527936 |
| web           | domains    | 1            | 85925 |   11528512 |
| web           | domains    | 2            | 85427 |   11627712 |
| web           | domains    | 3            | 86185 |   11545152 |
| web           | domains    | 4            | 85592 |   11376128 |
| web           | domains    | 5            | 85549 |   11504448 |
...

Method 3: Calculate the memory skew for all tables, per database

Use the relative standard deviation (RSD) to calculate the memory skew across all partitions and tables in your database. The RSD is a decent way to measure “skewness”.

SELECT
    DATABASE_NAME,
    TABLE_NAME,
    FLOOR(AVG(ROWS)) AS avg_rows,
    ROUND(STDDEV(ROWS)/AVG(ROWS),3) * 100 AS row_skew,
    FLOOR(AVG(MEMORY_USE)) AS avg_memory,
    ROUND(STDDEV(MEMORY_USE)/AVG(MEMORY_USE),3) * 100 AS memory_skew
FROM INFORMATION_SCHEMA.TABLE_STATISTICS
GROUP BY 1, 2
HAVING SUM(ROWS) > 10000
ORDER BY row_skew DESC;
****
+---------------+----------------------+----------+----------+--------------+-------------+
| DATABASE_NAME | TABLE_NAME           | avg_rows | row_skew | avg_memory   | memory_skew |
+---------------+----------------------+----------+----------+--------------+-------------+
| web           | urls                 | 28881401 |    9.520 |  25733328291 |       3.200 |
| web           | domains              |   902817 |    0.100 |    111046491 |       0.200 |
...

A skew score of N means that there is about N% variation in the number of rows (or memory allocated) in a partition. As a general rule of thumb, skew less than 10% should not greatly affect system stability or performance, and is not a cause for worry. If you do see persistent skew and observe other issues as well, such as a server running particularly “hot”, then you should further investigate the distribution of the values in your chosen shard keys.

Method 4: Calculate the skew per partition for the columns in a table

Use the relative standard deviation (RSD) to calculate the memory skew across all partitions and tables in your database. The RSD is a decent way to measure “skewness”.

If the skew factor is high, that indicates that some groups have very high cardinality and some have very low cardinality and can lead to long single-threaded operations.

SELECT   ROUND(STDDEV(c)/AVG(c),3)*100 AS group_skew,
            PARTITION_ID()
FROM     (
                    SELECT  col1,
                            col2,
                            ...,
                            coln,
                            count(*) c
                    FROM    <table_name>
                    GROUP BY 1, 2) sub
GROUP BY PARTITION_ID();

Method 5: Show the data distribution for a table if you were to reshard on another column or set of columns

SELECT   WITH(leaf_pushdown=true) SUM(c) rows,
        PARTITION_ID()                  partition_id
FROM     (
                    SELECT   count(*) c
                    FROM     <table_name>
                    GROUP BY <potential_shard_key>) reshuffle
GROUP BY PARTITION_ID();

Resolving Data Skew

MemSQL distributes data based on the SHARD KEY specified in a table’s schema. If no SHARD KEY is explicitly specified, the PRIMARY KEY is used as the default SHARD KEY. If the SHARD KEY shards on a low cardinality column, data will accumulate on certain nodes. To resolve this, you should export your data, modify your schema, and then reload your data. To export your data, see Exporting Data From MemSQL. After you backup your data, you can DROP TABLE low_cardinality_shard_key_table, and then CREATE TABLE with a higher cardinality SHARD KEY. When you reload the data, MemSQL will automatically shard your data based on the new SHARD KEY.

Example:

$ mysqldump -h 127.0.0.1 -u root my_db low_cardinality_shard_key_table --no-create-info > databackup.sql

memsql> DROP TABLE low_cardinality_shard_key_table;

memsql> source updated_shard_key_schema.sql;