Product Solutions Support
Try MemSQL

ANALYZE

The ANALYZE command enables collecting value-level statistics on a table to facilitate accurate query optimization.

The ANALYZE Command

The main forms of the ANALYZE command are:

ANALYZE TABLE table_name;
ANALYZE TABLE table_name COLUMNS column_name [, ...] {ENABLE | DISABLE};

ANALYZE collects two types of statistics: column statistics on every column, and additionally range statistics on designated columns (see the Types of Statistics section). Columns are designated or un-designated with the ANALYZE TABLE table_name COLUMNS ... {ENABLE | DISABLE} commands, as described below.

Warning
  • ANALYZE may invalidate plans, i.e. cause them to be reoptimized and recompiled the next time they are run, when data statistics change and a new plan could be faster. See the Plan Invalidation section for more details.

  • Statistics are collected and updated only when you run the ANALYZE command.

  • The ANALYZE command can be expensive for large tables, especially when collecting range statistics. Because statistics are collected only when you run ANALYZE, you can choose when to run it to best manage performance.

  • ANALYZE does not block concurrent queries and can be safely used while running concurrent queries. Any query compilations will use the present statistics until the ANALYZE completes, at which point query compilations atomically switch to using the new statistics.

  • You may need to collect statistics for optimal performance. See the Missing Statistics Warnings section.

Info

Statistics are tied to a base table. Statistics are persisted and remain after restarts. Statistics are not changed when the table is modified through DML statements, including INSERT, UPDATE, DELETE, LOAD DATA, and TRUNCATE. Statistics are only collected and updated when the ANALYZE command is called, which completely overwrites any previous statistics collected on the table.

Notes

All forms of ANALYZE are subject to the following restrictions:

Forms of the ANALYZE command

ANALYZE TABLE table_name;

Collect column statistics on all columns in table_name and range statistics over previously designated columns. Invalidate all plans on this table which were compiled with significantly different statistics than the newly observed statistics.

ANALYZE TABLE table_name COLUMNS column_name [, ...] ENABLE;

Designate the specified columns to have range-statistics collected, in addition to any previously designated columns. Collect column statistics on all colums in table_name and range statistics over all designated columns, and invalidate all plans on this table which were compiled with significantly different statistics than the newly observed statistics (the same effects as ANALYZE TABLE table_name).

ANALYZE TABLE table_name COLUMNS column_name [, ...] DISABLE;

Delete range statistics on specified columns if present, and designate the specified columns to not have range statistics collected. No range or column statistics are collected.

ANALYZE TABLE table_name DROP;

Delete all stored statistics for a table if present. This includes disabling histograms on any columns for which they have been enabled.

ANALYZE statistics export and import

It is also possible to export and import statistics, which can be useful for testing purposes.

ANALYZE TABLE table_name INTO OUTFILE 'path/file.json';

Save column and range statistics recorded on table_name in a JSON representation into the file path/file.json. Relative paths are rooted in the data directory of the MemSQL instance. Note that this exports the statistics that are currently saved; it does not collect new statistics.

ANALYZE TABLE table_name INFILE 'path/file.json';

Import range and column statistics for table_name from the file path/file.json. Relative paths are rooted in the data directory of the MemSQL instance.

Configuration parameters export and import

In addition, you can export and import configuration parameters used by the optimizer.

ANALYZE OPTIMIZER_STATE INTO OUTFILE 'file.json';
ANALYZE OPTIMIZER_STATE INFILE 'file.json';

Types of Statistics

Column Statistics

Column statistics are collected on every column in a table when the table is processed by the ANALYZE command. They include the number of distinct values in each column, the number of NULLs in each column, and the number of rows in the table. When collected, column statistics can be queried through the INFORMATION_SCHEMA.OPTIMIZER_STATISTICS table:

memsql> describe INFORMATION_SCHEMA.OPTIMIZER_STATISTICS;
+----------------------+--------------+------+------+---------+-------+
| Field                | Type         | Null | Key  | Default | Extra |
+----------------------+--------------+------+------+---------+-------+
| DATABASE_NAME        | varchar(512) | NO   |      | NULL    |       |
| TABLE_NAME           | varchar(512) | NO   |      | NULL    |       |
| COLUMN_NAME          | varchar(512) | NO   |      | NULL    |       |
| ROW_COUNT            | bigint(21)   | NO   |      | NULL    |       |
| MIN_VALUE_SERIALIZED | varchar(255) | YES  |      | NULL    |       |
| MAX_VALUE_SERIALIZED | varchar(255) | YES  |      | NULL    |       |
| NULL_COUNT           | bigint(21)   | NO   |      | NULL    |       |
| CARDINALITY          | bigint(21)   | NO   |      | NULL    |       |
| DENSITY              | double       | NO   |      | NULL    |       |
| RANGE_STATS          | bigint(21)   | NO   |      | NULL    |       |
| SAMPLE_SIZE          | bigint(21)   | NO   |      | NULL    |       |
+----------------------+--------------+------+------+---------+-------+
11 rows in set (0.00 sec)

The RANGE_STATS column indicates whether range statistics are stored for this column. Some of the columns including MIN_VALUE_SERIALIZED, MAX_VALUE_SERIALIZED, and DENSITY are currently unused.

Range Statistics

The ANALYZE command can additionally collect histograms over designated columns. When histograms are enabled for a column, ANALYZE will sample the rows in that column and create an equi-depth histogram from the samples, such that the frequency of each bucket is the same, but their widths vary. Since the frequencies are equal, only the endpoints of the recorded buckets are stored. When collected, range histograms can be queried through the INFORMATION_SCHEMA.RANGE_STATISTICS table:

memsql> describe INFORMATION_SCHEMA.RANGE_STATISTICS
+-------------------+--------------+------+------+---------+-------+
| Field             | Type         | Null | Key  | Default | Extra |
+-------------------+--------------+------+------+---------+-------+
| DATABASE_NAME     | varchar(512) | NO   |      | NULL    |       |
| TABLE_NAME        | varchar(512) | NO   |      | NULL    |       |
| COLUMN_NAME       | varchar(512) | NO   |      | NULL    |       |
| BUCKET_ID         | bigint(21)   | NO   |      | NULL    |       |
| MIN_VALUE         | varchar(255) | NO   |      | NULL    |       |
| MAX_VALUE         | varchar(255) | YES  |      | NULL    |       |
| TO_LEFT           | bigint(21)   | NO   |      | NULL    |       |
| TO_RIGHT          | bigint(21)   | NO   |      | NULL    |       |
| DISTINCT_INTERIOR | bigint(21)   | NO   |      | NULL    |       |
| EXCLUSIVE_DENSITY | double       | NO   |      | NULL    |       |
+-------------------+--------------+------+------+---------+-------+
10 rows in set (0.00 sec)

The values for the endpoints are stored in an order-preserving mapping from their original type to bytestrings ordered lexicographically and shown in the table in a BASE64-encoded representation in the columns MIN_VALUE and MAX_VALUE. BUCKET_ID stores the index of the bucket in the histogram.

Missing Statistics Warnings

Identifying which columns a query could use range statistics on can be tedious and error-prone to do manually and is complicated by query transformations like predicate transitivity and predicate pushdown. When you run EXPLAIN on a query, MemSQL will show information about any missing statistics that the optimizer could take advantage of, and the ANALYZE commands that you can run to collect those statistics.

For example, for the following query, the explain shows that the optimizer is missing column statistics on both tables t and t2 as well as range statistics on t.x.

memsql> create table t (id bigint primary key, x double);
Query OK, 0 rows affected (0.17 sec)

memsql> create table t2 (id bigint primary key, x double);
Query OK, 0 rows affected (0.17 sec)

memsql> explain select * from t, t2 where t.id = t2.id and t.x > 1;
+---------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                     |
+---------------------------------------------------------------------------------------------+
| WARNING: Missing statistics on some tables, consider running the following commands:        |
|     ANALYZE TABLE db.`t2`;                                                                  |
|     ANALYZE TABLE db.`t` COLUMNS `x` ENABLE;                                                |
| See http://docs.memsql.com/v5.0/docs/analyze for more information on statistics collection. |
|                                                                                             |
| Project [t.id, t.id2, t.x, t2.id, t2.x]                                                     |
| Gather partitions:all                                                                       |
| Project [t.id, t.id2, t.x, t2.id, t2.x]                                                     |
| NestedLoopJoin                                                                              |
| |---IndexSeek db.t2, PRIMARY KEY (id) scan:[id = t.id]                                      |
| Filter [t.x > 1]                                                                            |
| TableScan db.t, PRIMARY KEY (id)                                                            |
+---------------------------------------------------------------------------------------------+
12 rows in set (0.14 sec)

Range Statistics vs. Sampling

Column statistics are used to estimate group-by and join cardinalities and are almost always beneficial to collect. Range statistics (histograms) are used to estimate filter selectivity, but the optimizer will resort to sampling when they are absent. Range statistics can make query optimization much faster and more accurate:

However, in some cases sampling can yield more accurate results:

In summary, range statistics are inaccurate in the presence of correlated predicates, whereas sampling is inaccurate in the presence of data skew across partitions.

Range statistics are more expensive to collect than column statistics and are only useful on columns that are involved in single-table equality or range predicates. In addition, the storage cost of range statistics is proportional to the size of the values in the column they’re collected over, unlike column statistics which take constant memory to store per column.

Plan Invalidation

When column statistics are collected on a table, the optimizer will record the row count associated with the statistics on the table into all plans compiled using those statistics. When ANALYZE TABLE is run on the table, all plans which were optimized with significantly different statistics than the current statistics are invalidated, i.e. they will be reoptimized and recompiled the next time they are run, because a new plan may be significantly better after a change in data statistics. Significantly different statistics is defined as having a recorded rowcount less than 50% or greater than 200% of the new rowcount. (1) Plans optimized without column statistics on a table are always invalidated after ANALYZE TABLE is run on the table (with the exception of when the table is empty).

(1): More precisely, when ANALYZE TABLE is run on the table, all in-memory plans with a recorded rowcount less than 50% or greater than 200% of the new rowcount are immediately invalidated. Plans in the persistent plancache are discarded when loaded if their recorded rowcount is less than 50% or greater than 200% of the rowcount of the table as recorded by the most recent statistics available for it.

Was this article useful?