Data Skew

An important property of a distributed database is that the data gets distributed more or less evenly. In rare cases the data may be “skewed” out of balance. This topic discusses how skew can happen, how to detect it, and what to do about it.

Sharding and Skew

“Skew” is a condition in which a table’s data is unevenly balanced among partitions in the cluster. Probability being what it is, small amounts of skew are inevitable and harmless. By default, MemSQL distributes data based on a hash of the primary key. Since the hash values are unique and evenly distributed, this keeps skew to a minimum.

You can tell MemSQL to shard a table by a set of columns instead of the primary key. For example, consider a table of URLs from web domains. It might be defined as follows:

CREATE TABLE urls (
    id BIGINT,
    domain_id BIGINT,
    path VARCHAR(8192),
    first_seen INT UNSIGNED NOT NULL,
    crawl count INT UNSIGNED NOT NULL,
    ...
    SHARD KEY (domain_id)
);

As you can see, we are sharding on domain_id. Queries made against this table will compile and run just fine. However, it’s probable that some domains will have many more times the number of URLs. For example, the partition that contains links for “youtube.com” will almost certainly have more rows than average, throwing off the balance of the cluster. That unlucky partition containing “youtube.com” will not only have to store more data, it will probably be asked to do more work during SELECT, UPDATE, and DELETE queries. For best performance, you should choose a shard key that minimizes data skew.

Measuring Skew

Measuring the skew of a table is straightforward in MemSQL. The built-in INFORMATION_SCHEMA database contains metadata about tables, columns, indexes, and partitions. In particular, the TABLE_STATISTICS table contains one row per table partition, and can be used to see if a given partition has a lot more rows than average:

SELECT
    DATABASE_NAME,
    TABLE_NAME,
    ORDINAL AS PARTITION_ID,
    ROWS,
    MEMORY_USE
FROM INFORMATION_SCHEMA.TABLE_STATISTICS
WHERE TABLE_NAME = 'domains';

+---------------+------------+--------------+-------+------------+
| DATABASE_NAME | TABLE_NAME | PARTITION_ID | ROWS  | MEMORY_USE |
+---------------+------------+--------------+-------+------------+
| web           | domains    | 0            | 85916 |   11527936 |
| web           | domains    | 1            | 85925 |   11528512 |
| web           | domains    | 2            | 85427 |   11627712 |
| web           | domains    | 3            | 86185 |   11545152 |
| web           | domains    | 4            | 85592 |   11376128 |
| web           | domains    | 5            | 85549 |   11504448 |
...

Just looking over the raw values, it’s plain there is only a tiny amount of variation in the domains table, as expected. You can use this raw TABLE_STATISTICS data to compute the relative standard deviation (RSD) across all partitions and tables in your database. The RSD is a decent way to measure “skewness”. Unlike a simple STDDEV(), it allows you to compare the amount of skew between tables.

SELECT
    DATABASE_NAME,
    TABLE_NAME,
    FLOOR(AVG(ROWS)) AS avg_rows,
    ROUND(STDDEV(ROWS)/AVG(ROWS),3) * 100 AS row_skew,
    FLOOR(AVG(MEMORY_USE)) AS avg_memory,
    ROUND(STDDEV(MEMORY_USE)/AVG(MEMORY_USE),3) * 100 AS memory_skew
FROM INFORMATION_SCHEMA.TABLE_STATISTICS
GROUP BY 1, 2
HAVING SUM(ROWS) > 10000
ORDER BY row_skew DESC;

+---------------+----------------------+----------+----------+--------------+-------------+
| DATABASE_NAME | TABLE_NAME           | avg_rows | row_skew | avg_memory   | memory_skew |
+---------------+----------------------+----------+----------+--------------+-------------+
| web           | urls                 | 28881401 |    9.520 |  25733328291 |       3.200 |
| web           | domains              |   902817 |    0.100 |    111046491 |       0.200 |
...

A skew score of N means that there is about N% variation in the number of rows (or memory allocated) in a partition. As a general rule of thumb, skew less than 10% should not greatly affect system stability or performance, and is not a cause for worry. If you do see persistent skew and observe other issues as well, such as a server running particularly “hot”, then you should further investigate the distribution of the values in your chosen shard keys.

Detecting Skew

While bulk-loading data, recovering from a restart, or after deleting lots of data, it’s possible that one partition may temporarily show more memory use than others. Memory allocation and garbage collection are not instant across the machines in a cluster. During that period, the query above may produce a large memory_skew number. This is normal, and corrects itself over time as the system reaches a new steady state.

Was this article useful?