Discover effective optimization strategies for MariaDB Server queries. This section provides a variety of techniques and approaches to enhance query performance and overall database efficiency.
DuplicateWeedout
is an execution strategy for Semi-join subqueries.
The idea is to run the semi-join (a query with uses WHERE X IN (SELECT Y FROM ...)
) as if it were a regular inner join, and then eliminate the duplicate record combinations using a temporary table.
Suppose, you have a query where you're looking for countries which have more than 33% percent of their population in one big city:
SELECT *
FROM Country
WHERE
Country.code IN (SELECT City.Country
FROM City
WHERE
City.Population > 0.33 * Country.Population AND
City.Population > 1*1000*1000);
First, we run a regular inner join between the City
and Country
tables:
The Inner join produces duplicates. We have Germany three times, because it has three big cities.
Now, lets put DuplicateWeedout
into the picture:
Here one can see that a temporary table with a primary key was used to avoid producing multiple records with 'Germany'.
The Start temporary
and End temporary
from the last diagram are shown in the EXPLAIN
output:
EXPLAIN SELECT * FROM Country WHERE Country.code IN
(select City.Country from City where City.Population > 0.33 * Country.Population
AND City.Population > 1*1000*1000)\G
*************************** 1. row ***************************
id: 1
select_type: PRIMARY
TABLE: City
type: RANGE
possible_keys: Population,Country
KEY: Population
key_len: 4
ref: NULL
ROWS: 238
Extra: USING INDEX CONDITION; Start temporary
*************************** 2. row ***************************
id: 1
select_type: PRIMARY
TABLE: Country
type: eq_ref
possible_keys: PRIMARY
KEY: PRIMARY
key_len: 3
ref: world.City.Country
ROWS: 1
Extra: USING WHERE; End temporary
2 rows in set (0.00 sec)
This query will read 238 rows from the City
table, and for each of them will make a primary key lookup in the Country
table, which gives another 238 rows. This gives a total of 476 rows, and you need to add 238 lookups in the temporary table (which are typically much cheaper since the temporary table is in-memory).
If we run the same query with semi-join optimizations disabled, we'll get:
EXPLAIN SELECT * FROM Country WHERE Country.code IN
(select City.Country from City where City.Population > 0.33 * Country.Population
AND City.Population > 1*1000*1000)\G
*************************** 1. row ***************************
id: 1
select_type: PRIMARY
TABLE: Country
type: ALL
possible_keys: NULL
KEY: NULL
key_len: NULL
ref: NULL
ROWS: 239
Extra: USING WHERE
*************************** 2. row ***************************
id: 2
select_type: DEPENDENT SUBQUERY
TABLE: City
type: index_subquery
possible_keys: Population,Country
KEY: Country
key_len: 3
ref: func
ROWS: 18
Extra: USING WHERE
2 rows in set (0.00 sec)
This plan will read (239 + 239*18) = 4541
rows, which is much slower.
DuplicateWeedout
is shown as "Start temporary/End temporary" in EXPLAIN
.
The strategy can handle correlated subqueries.
But it cannot be applied if the subquery has meaningful GROUP BY
and/or aggregate functions.
DuplicateWeedout
allows the optimizer to freely mix a subquery's tables and the parent select's tables.
There is no separate @@optimizer_switch flag for DuplicateWeedout
. The strategy can be disabled by switching off all semi-join optimizations with SET @@optimizer_switch='optimizer_semijoin=off'
command.
This page is licensed: CC BY-SA / Gnu FDL
FirstMatch
is an execution strategy for Semi-join subqueries.
It is very similar to how IN/EXISTS
subqueries were executed in MySQL 5.x.
Let's take the usual example of a search for countries with big cities:
SELECT * FROM Country
WHERE Country.code IN (SELECT City.Country
FROM City
WHERE City.Population > 1*1000*1000)
AND Country.continent='Europe'
Suppose, our execution plan is to find countries in Europe, and then, for each found country, check if it has any big cities. Regular inner join execution will look as follows:
Since Germany has two big cities (in this diagram), it will be put into the query output twice. This is not correct, SELECT ... FROM Country
should not produce the same country record twice. The FirstMatch
strategy avoids the production of duplicates by short-cutting execution as soon as the first genuine match is found:
Note that the short-cutting has to take place after "Using where" has been applied. It would have been wrong to short-cut after we found Trier.
The EXPLAIN
for the above query will look as follows:
MariaDB [world]> EXPLAIN SELECT * FROM Country WHERE Country.code IN
(select City.Country from City where City.Population > 1*1000*1000)
AND Country.continent='Europe';
+----+-------------+---------+------+--------------------+-----------+---------+--------------------+------+----------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------+------+--------------------+-----------+---------+--------------------+------+----------------------------------+
| 1 | PRIMARY | Country | ref | PRIMARY,continent | continent | 17 | const | 60 | Using index condition |
| 1 | PRIMARY | City | ref | Population,Country | Country | 3 | world.Country.Code | 18 | Using where; FirstMatch(Country) |
+----+-------------+---------+------+--------------------+-----------+---------+--------------------+------+----------------------------------+
2 rows in set (0.00 sec)
FirstMatch(Country)
in the Extra column means that as soon as we have produced one matching record combination, short-cut the execution and jump back to the Country table.
FirstMatch
's query plan is very similar to one you would get in MySQL:
MySQL [world]> EXPLAIN SELECT * FROM Country WHERE Country.code IN
(select City.Country from City where City.Population > 1*1000*1000)
AND Country.continent='Europe';
+----+--------------------+---------+----------------+--------------------+-----------+---------+-------+------+------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+---------+----------------+--------------------+-----------+---------+-------+------+------------------------------------+
| 1 | PRIMARY | Country | ref | continent | continent | 17 | const | 60 | Using index condition; Using where |
| 2 | DEPENDENT SUBQUERY | City | index_subquery | Population,Country | Country | 3 | func | 18 | Using where |
+----+--------------------+---------+----------------+--------------------+-----------+---------+-------+------+------------------------------------+
2 rows in set (0.01 sec)
and these two particular query plans will execute in the same time.
The general idea behind the FirstMatch
strategy is the same as the one behind the IN->EXISTS
transformation, however, FirstMatch
has several advantages:
Equality propagation works across semi-join bounds, but not subquery bounds. Therefore, converting a subquery to semi-join and using FirstMatch
can still give a better execution plan. (TODO example)
There is only one way to apply the IN->EXISTS
strategy and MySQL will do it unconditionally. With FirstMatch
, the optimizer can make a choice between whether it should run the FirstMatch
strategy as soon as all tables used in the subquery are in the join prefix, or at some later point in time. (TODO: example)
The FirstMatch
strategy works by executing the subquery and short-cutting its execution as soon as the first match is found.
This means, subquery tables must be after all of the parent select's tables that are referred from the subquery predicate.
EXPLAIN
shows FirstMatch
as "FirstMatch(tableN)
".
The strategy can handle correlated subqueries.
But it cannot be applied if the subquery has meaningful GROUP BY
and/or aggregate functions.
Use of the FirstMatch
strategy is controlled with the firstmatch=on|off
flag in the optimizer_switch variable.
In-depth material:
This page is licensed: CC BY-SA / Gnu FDL
In 2024, fix for MDEV-34720 has added optimizer_join_limit_pref_ratio optimization into MariaDB starting from 10.6. It allows one to enable extra optimization for ORDER BY with small LIMIT.
MariaDB brought several improvements to the ORDER BY optimizer.
The fixes were made as a response to complaints by MariaDB customers, so they fix real-world optimization problems. The fixes are a bit hard to describe (as the ORDER BY
optimizer is complicated), but here's a short description:
The ORDER BY optimizer:
Doesn’t make stupid choices when several multi-part keys and potential range accesses are present (MDEV-6402).
This also fixes MySQL Bug#12113.
Always uses “range” and (not full “index” scan) when it switches to an index to satisfy ORDER BY … LIMIT
(MDEV-6657).
Tries hard to be smart and use cost/number of records estimates from other parts of the optimizer (MDEV-6384, MDEV-465).
This change also fixes MySQL Bug#36817.
Takes full advantage of InnoDB’s Extended Keys feature when checking if filesort() can be skipped (MDEV-6796).
In MySQL 5.7 changelog, one can find this passage:
Make switching of index due to small limit cost-based (WL#6986) : We have made the decision in make_join_select() of whether to switch to a new index in order to support "ORDER BY ... LIMIT N" cost-based. This work fixes Bug#73837.
MariaDB is not using Oracle's fix (we believe make_join_select
is not the right place to do ORDER BY optimization), but the effect is the same.
This page is licensed: CC BY-SA / Gnu FDL
LooseScan is an execution strategy for Semi-join subqueries.
We will demonstrate the LooseScan
strategy by example. Suppose, we're looking for countries that have satellites. We can get them using the following query (for the sake of simplicity we ignore satellites that are owned by consortiums of multiple countries):
SELECT * FROM Country
WHERE
Country.code IN (SELECT country_code FROM Satellite)
Suppose, there is an index on Satellite.country_code
. If we use that index, we will get satellites in the order of their owner country:
The LooseScan
strategy doesn't really need ordering, what it needs is grouping. In the above figure, satellites are grouped by country. For instance, all satellites owned by Australia come together, without being mixed with satellites of other countries. This makes it easy to select just one satellite from each group, which you can join with its country and get a list of countries without duplicates:
The EXPLAIN
output for the above query looks as follows:
MariaDB [world]> EXPLAIN SELECT * FROM Country WHERE Country.code IN
(select country_code from Satellite);
+----+-------------+-----------+--------+---------------+--------------+---------+------------------------------+------+-------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+--------+---------------+--------------+---------+------------------------------+------+-------------------------------------+
| 1 | PRIMARY | Satellite | index | country_code | country_code | 9 | NULL | 932 | Using where; Using index; LooseScan |
| 1 | PRIMARY | Country | eq_ref | PRIMARY | PRIMARY | 3 | world.Satellite.country_code | 1 | Using index condition |
+----+-------------+-----------+--------+---------------+--------------+---------+------------------------------+------+-------------------------------------+
LooseScan avoids the production of duplicate record combinations by putting the subquery table first and using its index to select one record from multiple duplicates
Hence, in order for LooseScan to be applicable, the subquery should look like:
expr IN (SELECT tbl.keypart1 FROM tbl ...)
or
expr IN (SELECT tbl.keypart2 FROM tbl WHERE tbl.keypart1=const AND ...)
LooseScan can handle correlated subqueries
LooseScan can be switched off by setting the loosescan=off
flag in the optimizer_switch variable.
This page is licensed: CC BY-SA / Gnu FDL
Semi-join Materialization is a special kind of subquery materialization used for Semi-join subqueries. It actually includes two strategies:
Materialization/lookup
Materialization/scan
Consider a query that finds countries in Europe which have big cities:
SELECT * FROM Country
WHERE Country.code IN (SELECT City.Country
FROM City
WHERE City.Population > 7*1000*1000)
AND Country.continent='Europe'
The subquery is uncorrelated, that is, we can run it independently of the upper query. The idea of semi-join materialization is to do just that, and fill a temporary table with possible values of the City.country field of big cities, and then do a join with countries in Europe:
The join can be done in two directions:
From the materialized table to countries in Europe
From countries in Europe to the materialized table
The first way involves doing a full scan on the materialized table, so we call it "Materialization-scan".
If you run a join from Countries to the materialized table, the cheapest way to find a match in the materialized table is to make a lookup on its primary key (it has one: we used it to remove duplicates). Because of that, we call the strategy "Materialization-lookup".
If we chose to look for cities with a population greater than 7 million, the optimizer will use Materialization-Scan and EXPLAIN
will show this:
MariaDB [world]> EXPLAIN SELECT * FROM Country WHERE Country.code IN
(select City.Country from City where City.Population > 7*1000*1000);
+----+--------------+-------------+--------+--------------------+------------+---------+--------------------+------+-----------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------+-------------+--------+--------------------+------------+---------+--------------------+------+-----------------------+
| 1 | PRIMARY | <subquery2> | ALL | distinct_key | NULL | NULL | NULL | 15 | |
| 1 | PRIMARY | Country | eq_ref | PRIMARY | PRIMARY | 3 | world.City.Country | 1 | |
| 2 | MATERIALIZED | City | range | Population,Country | Population | 4 | NULL | 15 | Using index condition |
+----+--------------+-------------+--------+--------------------+------------+---------+--------------------+------+-----------------------+
3 rows in set (0.01 sec)
Here, you can see:
There are still two SELECT
s (look for columns with id=1
and id=2
)
The second select (with id=2
) has select_type=MATERIALIZED
. This means it will be executed and its results will be stored in a temporary table with a unique key over all columns. The unique key is there to prevent the table from containing any duplicate records.
The first select received the table name subquery2
. This is the table that we got as a result of the materialization of the select with id=2
.
The optimizer chose to do a full scan over the materialized table, so this is an example of a use of the Materialization-Scan strategy.
As for execution costs, we're going to read 15 rows from table City, write 15 rows to materialized table, read them back (the optimizer assumes there won't be any duplicates), and then do 15 eq_ref accesses to table Country. In total, we'll do 45 reads and 15 writes.
By comparison, if you run the EXPLAIN
with semi-join optimizations disabled, you'll get this:
MariaDB [world]> EXPLAIN SELECT * FROM Country WHERE Country.code IN
(select City.Country from City where City.Population > 7*1000*1000);
+----+--------------------+---------+-------+--------------------+------------+---------+------+------+------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+---------+-------+--------------------+------------+---------+------+------+------------------------------------+
| 1 | PRIMARY | Country | ALL | NULL | NULL | NULL | NULL | 239 | Using where |
| 2 | DEPENDENT SUBQUERY | City | range | Population,Country | Population | 4 | NULL | 15 | Using index condition; Using where |
+----+--------------------+---------+-------+--------------------+------------+---------+------+------+------------------------------------+
...which is a plan to do (239 + 239*15) = 3824
table reads.
Let's modify the query slightly and look for countries which have cities with a population over one millon (instead of seven):
MariaDB [world]> EXPLAIN SELECT * FROM Country WHERE Country.code IN
(select City.Country from City where City.Population > 1*1000*1000) ;
+----+--------------+-------------+--------+--------------------+--------------+---------+------+------+-----------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------+-------------+--------+--------------------+--------------+---------+------+------+-----------------------+
| 1 | PRIMARY | Country | ALL | PRIMARY | NULL | NULL | NULL | 239 | |
| 1 | PRIMARY | <subquery2> | eq_ref | distinct_key | distinct_key | 3 | func | 1 | |
| 2 | MATERIALIZED | City | range | Population,Country | Population | 4 | NULL | 238 | Using index condition |
+----+--------------+-------------+--------+--------------------+--------------+---------+------+------+-----------------------+
3 rows in set (0.00 sec)
The EXPLAIN
output is similar to the one which used Materialization-scan, except that:
the <subquery2>
table is accessed with the eq_ref
access method
the access uses an index named distinct_key
This means that the optimizer is planning to do index lookups into the materialized table. In other words, we're going to use the Materialization-lookup strategy.
With optimizer_switch='semijoin=off,materialization=off'
), one will get this EXPLAIN
:
MariaDB [world]> EXPLAIN SELECT * FROM Country WHERE Country.code IN
(select City.Country from City where City.Population > 1*1000*1000) ;
+----+--------------------+---------+----------------+--------------------+---------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+---------+----------------+--------------------+---------+---------+------+------+-------------+
| 1 | PRIMARY | Country | ALL | NULL | NULL | NULL | NULL | 239 | Using where |
| 2 | DEPENDENT SUBQUERY | City | index_subquery | Population,Country | Country | 3 | func | 18 | Using where |
+----+--------------------+---------+----------------+--------------------+---------+---------+------+------+-------------+
One can see that both plans will do a full scan on the Country
table. For the second step, MariaDB will fill the materialized table (238 rows read from table City and written to the temporary table) and then do a unique key lookup for each record in table Country
, which works out to 238 unique key lookups. In total, the second step will cost (239+238) = 477
reads and 238
temp.table writes.
Execution of the latter (DEPENDENT SUBQUERY
) plan reads 18 rows using an index on City.Country
for each record it receives for table Country
. This works out to a cost of (18*239) = 4302
reads. Had there been fewer subquery invocations, this plan would have been better than the one with Materialization. By the way, MariaDB has an option to use such a query plan, too (see FirstMatch Strategy), but it did not choose it.
MariaDB is able to use Semi-join materialization strategy when the subquery has grouping (other semi-join strategies are not applicable in this case).
This allows for efficient execution of queries that search for the best/last element in a certain group.
For example, let's find cities that have the biggest population on their continent:
EXPLAIN
SELECT * FROM City
WHERE City.Population IN (SELECT max(City.Population) FROM City, Country
WHERE City.Country=Country.Code
GROUP BY Continent)
+------+--------------+-------------+------+---------------+------------+---------+----------------------------------+------+-----------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+--------------+-------------+------+---------------+------------+---------+----------------------------------+------+-----------------+
| 1 | PRIMARY | <subquery2> | ALL | distinct_key | NULL | NULL | NULL | 239 | |
| 1 | PRIMARY | City | ref | Population | Population | 4 | <subquery2>.max(City.Population) | 1 | |
| 2 | MATERIALIZED | Country | ALL | PRIMARY | NULL | NULL | NULL | 239 | Using temporary |
| 2 | MATERIALIZED | City | ref | Country | Country | 3 | world.Country.Code | 18 | |
+------+--------------+-------------+------+---------------+------------+---------+----------------------------------+------+-----------------+
4 rows in set (0.00 sec)
the cities are:
+------+-------------------+---------+------------+
| ID | Name | Country | Population |
+------+-------------------+---------+------------+
| 1024 | Mumbai (Bombay) | IND | 10500000 |
| 3580 | Moscow | RUS | 8389200 |
| 2454 | Macao | MAC | 437500 |
| 608 | Cairo | EGY | 6789479 |
| 2515 | Ciudad de México | MEX | 8591309 |
| 206 | São Paulo | BRA | 9968485 |
| 130 | Sydney | AUS | 3276207 |
+------+-------------------+---------+------------+
Semi-join materialization
Can be used for uncorrelated IN-subqueries. The subselect may use grouping and/or aggregate functions.
Is shown in EXPLAIN
as type=MATERIALIZED
for the subquery, and a line withtable=<subqueryN>
in the parent subquery.
Is enabled when one has both materialization=on
and semijoin=on
in the optimizer_switch variable.
The materialization=on|off
flag is shared with Non-semijoin materialization.
This page is licensed: CC BY-SA / Gnu FDL