Highly Selective Joins
Highly Selective Joins
Starting in version 7.0, MemSQL introduced support for columnstore hash indexes to broaden the support for OLTP-type queries. However, a common join pattern in OLTP is to have a very selective filter on one table, which produces a few rows from the source table, and then join those rows with another table. Databases for OLTP normally use a nested loop join for this. For each row from the outer table, an index seek is done on the inner table.
MemSQL supports these highly selective joins using an adaptive hash join algorithm. As a result, there’s no need to run a full columnstore scan when there’s a matching hash index. First, it builds a hash table for the table with the highly-selective filter. Then, depending on the number of rows in the hash table, the adaptive algorithm will switch strategies internally.
- If there are only a few rows in the hash table, it switches to use a nested loop join strategy, seeking into the larger table (on the probe side) via the index on the join column of the table on the probe side.
- If the hash build side produces a lot of rows, then it performs a normal hash join.
The following example takes advantage of this strategy for selective joins.
CREATE TABLE orders( oid INT, d DATETIME, KEY(d) USING CLUSTERED COLUMNSTORE, SHARD(oid), KEY(oid) USING hash); CREATE TABLE lineitems( id INT, oid INT, item INT, KEY(oid) USING CLUSTERED COLUMNSTORE, SHARD(oid) USING HASH);
Now, add some sample data to
INSERT INTO orders VALUES(1, NOW());
Execute the following command repeatedly, until the table has around 33.5 million rows in it.
INSERT INTO orders SELECT oid+(SELECT MAX(oid) FROM orders), NOW() FROM orders;
Add 67.1 million rows of data to the
lineitems table, such that each line item belongs to an order, and each order has exactly two line items.
INSERT INTO lineitems SELECT oid, oid, 1 FROM orders; INSERT INTO lineitems SELECT oid + 1000*1000*1000, oid, 2 FROM orders;
Find a selective
DATETIME value for
d to search on:
SELECT d, COUNT(*) FROM orders GROUP BY d;
The result shows that a number of
DATETIME values only appear in one row in
orders. For this example, let’s say that one such value is
The following query uses this date to produce a join result with exactly two rows:
SELECT * FROM orders o JOIN lineitems l ON o.oid = l.oid WHERE o.d = "2020-03-30 16:47:05";
This query filters rows on
o.d to find a single row of orders, then joins to
lineitems via the hash index on
lineitems.oid, using the new selective join algorithm.
In the JSON profile plan, you can see if a plan may be able to perform a join via a hash index. If this optimization strategy is available in your plan, you will see
join index in descriptions of columnstore filter operators. For example:
"executor":"ColumnStoreFilter", "keyId":4294968023, "condition":[ "o.oid = l.oid bloom AND l.oid = o.oid join index" ],
In the MemSQL Studio, you may see a condition that mentions
join index in the
ColumnStoreFilter operator in the properties pane:
o.oid = l.oid bloom AND l.oid = o.oid join index
join index filter can also be viewed in the
See MemSQL SingleStore – And Then There Was One for additional information about (1) using columnstores with hash indexes, sub-segment access, and fine-grain locking to enable OLTP operations on data bigger than will fit in RAM, and (2) using SPARSE rowstore compression to reduce TCO for rowstore tables with many NULL values.