A large contributor to MemSQL’s superior query execution performance is its system of code generation. In contrast to a traditional interpreter-based execution model for databases, MemSQL embeds an industrial compiler to produce highly efficient machine code before executing queries issued against it. This type of compilation enables low level optimizations that are not possible when executing via interpretation.
The first time a MemSQL server encounters a given query shape, it will optimize and compile the query for future invocations. This incurs minimal overhead which does not depend on the amount of data to be processed, but rather the complexity of the query. The process of code generation involves extracting parameters from the query then transforming the normalized query into a MemSQL-specific intermediate representation tailored to the system. Subsequent requests with the same shape can reuse this plan to complete both quickly and consistently.
Code generation applies to all Data Manipulation Language (DML) queries. In addition, MemSQL generates code during CREATE TABLE and ALTER TABLE statements. These Data Definition Language (DDL) queries generate code in order to reduce the compilation time of subsequent DML queries against the table.
To observe the effects of code generation, one could run the following queries against MemSQL, for example:
memsql> SELECT * FROM t WHERE col = 1; Empty set (0.13 sec) memsql> SELECT * FROM t WHERE col = 1; Empty set (0.00 sec) memsql> SELECT * FROM t WHERE col = 100000; Empty set (0.00 sec)
The first query took a slightly longer time to run even though it returned no rows. The second and third queries executed quickly because they reuse the plan compiled for the first query. An important thing to note is that even though the constant in the
WHERE clause changed in the third execution of the query, its shape was identical so the same plan was selected.
After code generation, compiled queries are cached for later use on disk in the
@@global.plancachedir directory. To see the DML queries for which MemSQL has cached, run SHOW PLANCACHE or access the
PLANCACHE table in the Information Schema database. These will show information for each plan, including execution counts and statistics.
Codegen Architecture in MemSQL 5
Code generation produces a specialized programming language, the MemSQL Plan Language. MPL is simple, tightly managed and designed specifically for MemSQL rather than for general purpose applications. SQL operator trees are converted directly to MPL abstract syntax trees, eliminating the need for expensive parsing and semantic resolution at query compilation time.
In addition to the simplicity of the language, MPL benefits from exposing as primitives many relational operations such as “index seek” and scalar operations such as “
BIGINT addition”. These operators are compiled statically along with the server and can easily by combined to form a query plan. In this way, much of the work of compilation can be performed ahead of time.
When compiling a query, MemSQL first flattens MPL into a compact format known as MemSQL Bytecode (MBC). Plans in this form can easily be serialized, interpreted or transformed to LLVM bitcode for compilation to
x86_64 machine code. Interpreting the bytecode directly nearly eliminates the first-time cost of running a query, but can hamper execution performance:
memsql> select * from t where j > 0; Empty set (0.05 sec) memsql> select * from t where j > 0; Empty set (0.05 sec)
In this example, there is no additional latency on the first request because the query is interpreted.
memsql> select * from t where j > 0; -- With code generation Empty set (0.08 sec) memsql> select * from t where j > 0; -- Using the cached plan Empty set (0.02 sec)
Here, the additional latency on the first request is the result of code generation. However, the subsequent request is more than twice as fast as the interpreted execution above. This goes to show the power of MemSQL code generation.