Optimize MariaDB Server queries with indexes. This section covers index types, creation, and best practices for leveraging them to significantly improve query performance and data retrieval speed.
You have a SELECT and you want to build the best INDEX for it. This blog is a "cookbook" on how to do that task.
A short algorithm that works for many simpler SELECTs and helps in complex queries.
Examples of the algorithm, plus digressions into exceptions and variants
Finally a long list of "other cases".
The hope is that a newbie can quickly get up to speed, and his/her INDEXes will no longer smack of "newbie".
Many edge cases are explained, so even an expert may find something useful here.
Here's the way to approach creating an INDEX, given a SELECT. Follow the steps below, gathering columns to put in the INDEX in order. When the steps give out, you usually have the 'perfect' index.
Given a WHERE with a bunch of expressions connected by AND: Include the columns (if any), in any order, that are compared to a constant and not hidden in a function.
You get one more chance to add to the INDEX; do the first of these that applies:
2a. One column used in a 'range' -- BETWEEN, '>', LIKE w/o leading wildcard, etc.
2b. All columns, in order, of the GROUP BY.
2c. All columns, in order, of the ORDER BY if there is no mixing of ASC and DESC.
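To make the steps concrete, here is a sketch on a hypothetical orders table (all names invented for illustration):

SELECT *
    FROM orders
    WHERE status = 'shipped'          -- Step 1: '=' against a constant
      AND customer_id = 123           -- Step 1: another '='
      AND shipped_at >= '2016-01-01'  -- Step 2a: a 'range'
    ORDER BY shipped_at;

-- Step 1 gives (status, customer_id) in either order;
-- Step 2a appends the range column, then we stop:
ALTER TABLE orders ADD INDEX cookbook (status, customer_id, shipped_at);

As a bonus (discussed below), the ORDER BY on the range column can also use this index.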
This blog assumes you know the basic idea behind having an INDEX. Here is a refresher on some of the key points.
Virtually all INDEXes in MySQL are structured as BTrees. BTrees are very efficient for:
Given a key, find the corresponding row(s);
"Range scans" -- That is start at one value for the key and repeatedly find the "next" (or "previous") row.
A PRIMARY KEY is a UNIQUE KEY; a UNIQUE KEY is an INDEX. ("KEY" == "INDEX".)
InnoDB "clusters" the PRIMARY KEY with the data. Hence, given the value of the PK ("PRIMARY KEY"), after drilling down the BTree to find the index entry, you have all the columns of the row when you get there. A "secondary key" (any UNIQUE or INDEX other than the PK) in InnoDB first drills down the BTree for the secondary index, where it finds a copy of the PK. Then it drills down the PK to find the row.
Every InnoDB table has a PRIMARY KEY. While there is a default if you do not specify one, it is best to explicitly provide a PK.
For completeness: MyISAM works differently. All indexes (including the PK) are in separate BTrees. The leaf nodes of such BTrees have a pointer (usually a byte offset) into the data file.
All discussion here assumes InnoDB tables; however, most statements apply to other Engines.
Think of a list of names, sorted by last_name, then first_name. You have undoubtedly seen such lists, and they often have other information such as address and phone number. Suppose you wanted to look me up. If you remember my full name ('James' and 'Rick'), it is easy to find my entry. If you remembered only my last name ('James') and first initial ('R'), you would quickly zoom in on the Jameses and find the Rs among them. There, you might remember 'Rick' and ignore 'Ronald'. But, suppose you remembered my first name ('Rick') and only my last initial ('J'). Now you are in trouble. You would be scanning all the Js -- Jones, Rick; Johnson, Rick; Jamison, Rick; etc, etc. That's much less efficient.
Those equate to
INDEX(last_name, first_name) -- the order of the list.
WHERE last_name = 'James' AND first_name = 'Rick' -- best case
WHERE last_name = 'James' AND first_name LIKE 'R%' -- pretty good
WHERE last_name LIKE 'J%' AND first_name = 'Rick' -- pretty bad
Think about this example as I talk about "=" versus "range" in the Algorithm, below.
WHERE aaa = 123 AND ...
: an INDEX starting with aaa is good.
WHERE aaa = 123 AND bbb = 456 AND ...
: an INDEX starting with aaa and bbb is good. In this case, it does not matter whether aaa or bbb comes first in the INDEX.
xxx IS NULL
: this acts like "= const" for this discussion.
WHERE t1.aa = 123 AND t2.bb = 456
-- You must only consider columns in the current table.
Note that the expression must be of the form column_name = (constant). These do not apply to this step in the Algorithm: DATE(dt) = '...', LOWER(s) = '...', CAST(s ...) = '...', x = '...' COLLATE...
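For example, a test that hides the column in a function can often be rewritten as a range so that the column becomes usable (hypothetical table t with an indexed column dt):

-- Hides dt in a function; INDEX(dt) cannot be used:
SELECT id FROM t WHERE DATE(dt) = '2016-01-01';

-- Rewritten as a range; INDEX(dt) now applies (via step 2a):
SELECT id FROM t
    WHERE dt >= '2016-01-01'
      AND dt <  '2016-01-01' + INTERVAL 1 DAY;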
(If there are no "=" parts AND'd in the WHERE clause, move on to step 2 without any columns in your putative INDEX.)
Find the first of 2a / 2b / 2c that applies; use it; then quit. If none apply, then you are through gathering columns for the index.
In some cases it is optimal to do step 1 (all equals) plus step 2c (ORDER BY).
A "range" shows up as
aaa >= 123
-- any of <, <=, >=, >; but not <>, !=
aaa BETWEEN 22 AND 44
sss LIKE 'blah%'
-- but not sss LIKE '%blah'
xxx IS NOT NULL
Add the column in the range to your putative INDEX.
If there are more parts to the WHERE clause, you must stop now.
Complete examples (assume nothing else comes after the snippet)
WHERE aaa >= 123 AND bbb = 1 ⇒ INDEX(bbb, aaa)
(WHERE order does not matter; INDEX order does)
WHERE aaa >= 123 ⇒ INDEX(aaa)
WHERE aaa >= 123 AND ccc > 'xyz' ⇒ INDEX(aaa) or INDEX(ccc)
(only one range)
WHERE aaa >= 123 ORDER BY aaa ⇒ INDEX(aaa)
-- Bonus: The ORDER BY will use the INDEX.
WHERE aaa >= 123 ORDER BY aaa DESC ⇒ INDEX(aaa)
-- Same Bonus.
If there is a GROUP BY, all the columns of the GROUP BY should now be added, in the specified order, to the INDEX you are building. (I do not know what happens if one of the columns is already in the INDEX.)
If you are GROUPing BY an expression (including function calls), you cannot use the GROUP BY; stop.
Complete examples (assume nothing else comes after the snippet)
WHERE aaa = 123 AND bbb = 1 GROUP BY ccc ⇒ INDEX(bbb, aaa, ccc) or INDEX(aaa, bbb, ccc)
(='s first, in any order; then the GROUP BY)
WHERE aaa >= 123 GROUP BY xxx ⇒ INDEX(aaa)
(You should have stopped with Step 2a)
GROUP BY x,y ⇒ INDEX(x,y)
(no WHERE)
WHERE aaa = 123 GROUP BY xxx, (a+b) ⇒ INDEX(aaa)
-- expression in GROUP BY, so no use including even xxx.
If there is an ORDER BY, all the columns of the ORDER BY should now be added, in the specified order, to the INDEX you are building.
If there are multiple columns in the ORDER BY, and there is a mixture of ASC and DESC, do not add the ORDER BY columns; they won't help; stop.
If you are ORDERing BY an expression (including function calls), you cannot use the ORDER BY; stop.
Complete examples (assume nothing else comes after the snippet)
WHERE aaa = 123 GROUP BY ccc ORDER BY ddd ⇒ INDEX(aaa, ccc)
-- should have stopped with Step 2b
WHERE aaa = 123 GROUP BY ccc ORDER BY ccc ⇒ INDEX(aaa, ccc)
-- the ccc will be used for both GROUP BY and ORDER BY
WHERE aaa = 123 ORDER BY xxx ASC, yyy DESC ⇒ INDEX(aaa)
-- mixture of ASC and DESC.
The following are especially good. Normally a LIMIT cannot be applied until after lots of rows are gathered and then sorted according to the ORDER BY. But, if the INDEX gets all the way through the ORDER BY, only (OFFSET + LIMIT) rows need to be gathered. So, in these cases, you win the lottery with your new index:
WHERE aaa = 123 GROUP BY ccc ORDER BY ccc LIMIT 10 ⇒ INDEX(aaa, ccc)
WHERE aaa = 123 ORDER BY ccc LIMIT 10 ⇒ INDEX(aaa, ccc)
ORDER BY ccc LIMIT 10 ⇒ INDEX(ccc)
WHERE ccc > 432 ORDER BY ccc LIMIT 10 ⇒ INDEX(ccc)
-- This "range" is compatible with ORDER BY
(It does not make much sense to have a LIMIT without an ORDER BY, so I do not discuss that case.)
You have collected a few columns; put them in an INDEX and ADD it to the table. That will often produce a "good" index for the SELECT you have. Below are some other suggestions that may be relevant.
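For instance, if the Algorithm yielded (bbb, aaa) for a hypothetical table t, the index is added like this:

ALTER TABLE t ADD INDEX bbb_aaa (bbb, aaa);
-- or, equivalently:
CREATE INDEX bbb_aaa ON t (bbb, aaa);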
An example of the Algorithm being 'wrong':
SELECT ... FROM t WHERE flag = true;
This would (according to the Algorithm) call for INDEX(flag). However, indexing a column that has two (or a small number of) values is almost always useless. This is called 'low cardinality'. The Optimizer would prefer to do a table scan than to bounce between the index BTree and the data.
On the other hand, the Algorithm is 'right' with
SELECT ... FROM t WHERE flag = true AND date >= '2015-01-01';
That would call for a compound index starting with flag: INDEX(flag, date). Such an index is likely to be very beneficial. And it is likely to be more beneficial than INDEX(date).
If your resulting INDEX includes column(s) that are likely to be UPDATEd, note that the UPDATE will have extra work to remove a 'row' from one place in the INDEX's BTree and insert a 'row' back into the BTree. For example:
INDEX(x)
UPDATE t SET x = ... WHERE ...;
There are too many variables to say whether it is better to keep the index or to toss it.
In this case, shortening the index may be beneficial:
INDEX(z, x)
UPDATE t SET x = ... WHERE ...;
Changing to INDEX(z) would make for less work for the UPDATE, but might hurt some SELECT. It depends on the frequency of each, plus many more factors.
(There are exceptions to some of these.)
You may not create an index bigger than 3KB.
You may not include a column whose index entry can be bigger than some limit (767 bytes -- e.g., VARCHAR(255) CHARACTER SET utf8).
You can deal with big fields using "prefix" indexing; but see below.
You should not have more than 5 columns in an index. (This is just a Rule of Thumb; nothing prevents having more.)
You should not have redundant indexes. (See below.)
INDEX(flag) is almost never useful if flag has very few values. More specifically, when you say WHERE flag = 1 and "1" occurs more than 20% of the time, such an index will be shunned. The Optimizer would prefer to scan the table instead of bouncing back and forth between the index and the data for more than 20% of the rows.
("20%" is really somewhere between 10% and 30%, depending on the phase of the moon.)
A "Covering" index is an index that contains all the columns in the SELECT. It is special in that the SELECT can be completed by looking only at the INDEX BTree. (Since InnoDB's PRIMARY KEY is clustered with the data, "covering" is of no benefit when considering at the PRIMARY KEY.)
Mini-cookbook:
Gather the list of column(s) according to the "Algorithm", above.
Add to the end of the list the rest of the columns seen in the SELECT, in any order.
Examples:
SELECT x FROM t WHERE y = 5; ⇒ INDEX(y,x)
-- The algorithm said just INDEX(y)
SELECT x,z FROM t WHERE y = 5 AND q = 7; ⇒ INDEX(y,q,x,z)
-- y and q in either order (Algorithm), then x and z in either order (covering).
SELECT x FROM t WHERE y > 5 AND q > 7; ⇒ INDEX(y,q,x)
-- y or q first (that's as far as the Algorithm goes), then the other two fields afterwards.
The speedup you get might be minor, or it might be spectacular; it is hard to predict.
But...
It is not wise to build an index with lots of columns. Let's cut it off at 5 (Rule of Thumb).
Prefix indexes cannot 'cover', so don't use them anywhere in a 'covering' index.
There are limits (3KB?) on how 'wide' an index can be, so "covering" may not be possible.
INDEX(a,b) can find anything that INDEX(a) could find. So you don't need both. Get rid of the shorter one.
If you have lots of SELECTs and they generate lots of INDEXes, this may cause a different problem. Each index must be updated (sooner or later) for each INSERT. More indexes ⇒ slower INSERTs. Limit the number of indexes on a table to about 6 (Rule of Thumb).
Notice in the cookbook how it says "in any order" in a few places. If, for example, you have both of these (in different SELECTs):
WHERE a=1 AND b=2
begs for either INDEX(a,b) or INDEX(b,a)
WHERE a>1 AND b=2
begs only for INDEX(b,a)
Include only INDEX(b,a) since it handles both cases with only one INDEX.
Suppose you have a lot of indexes, including (a,b,c,dd) and (a,b,c,ee). Those are getting rather long. Consider either picking one of them, or having simply (a,b,c). Sometimes the selectivity of (a,b,c) is so good that tacking on 'dd' or 'ee' does not make enough difference to matter.
The main cookbook skips over an important optimization that is sometimes used. The optimizer will sometimes ignore the WHERE and, instead, use an INDEX that matches the ORDER BY. This, of course, needs to be a perfect match -- all columns, in the same order. And all ASC or all DESC.
This becomes especially beneficial if there is a LIMIT.
But there is a problem. There could be two situations, and the Optimizer is sometimes not smart enough to see which case applies:
If the WHERE does very little filtering, fetching the rows in ORDER BY order avoids a sort and has little wasted effort (because of 'little filtering'). Using the INDEX matching the ORDER BY is better in this case.
If the WHERE does a lot of filtering, the ORDER BY is wasting a lot of time fetching rows only to filter them out. Using an INDEX matching the WHERE clause is better.
What should you do? If you think the "little filtering" is likely, then create an index with the ORDER BY columns in order and hope that the Optimizer uses it when it should.
Cases...
WHERE a=1 OR a=2
-- This is turned into WHERE a IN (1,2) and optimized that way.
WHERE a=1 OR b=2
usually cannot be optimized.
WHERE x.a=1 OR y.b=2
This is even worse because of using two different tables.
A workaround is to use UNION. Each part of the UNION is optimized separately. For the second case:
( SELECT ... WHERE a=1 ) -- and have INDEX(a)
UNION DISTINCT -- "DISTINCT" is assuming you need to get rid of dups
( SELECT ... WHERE b=2 ) -- and have INDEX(b)
GROUP BY ... ORDER BY ... -- whatever you had at the end of the original query
Now the query can take good advantage of two different indexes. Note: "Index merge" might kick in on the original query, but it is not necessarily any faster than the UNION. See the sister blog on compound indexes, including 'Index Merge'.
The third case (OR across 2 tables) is similar to the second.
If you originally had a LIMIT, UNION gets complicated. If you started with ORDER BY z LIMIT 190, 10, then the UNION needs to be
( SELECT ... LIMIT 200 ) -- Note: OFFSET 0, LIMIT 190+10
UNION DISTINCT -- (or ALL)
( SELECT ... LIMIT 200 )
LIMIT 190, 10 -- Same as originally
You cannot directly index a TEXT or BLOB or large VARCHAR or large BINARY column. However, you can use a "prefix" index: INDEX(foo(20)). This says to index the first 20 characters of foo. But... it is rarely useful.
Example of a prefix index:
INDEX(last_name(2), first_name)
The index for me would contain 'Ja', 'Rick'. That's not useful for distinguishing between 'Jamison', 'Jackson', 'James', etc., so the index is so close to useless that the optimizer often ignores it.
Probably never do UNIQUE(foo(20)) because this applies a uniqueness constraint on the first 20 characters of the column, not the whole column!
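A sketch of that gotcha (hypothetical table):

CREATE TABLE u (
    foo VARCHAR(100) NOT NULL,
    UNIQUE KEY (foo(20))   -- uniqueness is enforced on the first 20 characters only
) ENGINE=InnoDB;

INSERT INTO u VALUES ('12345678901234567890-alpha');  -- OK
INSERT INTO u VALUES ('12345678901234567890-beta');   -- Duplicate-key error: same first 20 chars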
DATE, DATETIME, etc. are tricky to compare against.
Some tempting, but inefficient, techniques:
date_col LIKE '2016-01%' -- must convert date_col to a string, so it acts like a function
LEFT(date_col, 7) = '2016-01' -- hiding the column in a function
DATE(date_col) = 2016 -- hiding the column in a function
All must do a full scan. (On the other hand, it can be handy to use GROUP BY LEFT(date_col, 7) for monthly grouping, but that is not an INDEX issue.)
This is efficient, and can use an index:
date_col >= '2016-01-01'
AND date_col < '2016-01-01' + INTERVAL 3 MONTH
This case works because both right-hand values are converted to constants, then it is a "range". I like the design pattern with INTERVAL because it avoids computing the last day of the month. And it avoids tacking on '23:59:59', which is wrong if you have microsecond times. (And other cases.)
Perform EXPLAIN SELECT... (and EXPLAIN FORMAT=JSON SELECT... if you have 5.6.5 or later). Look at the Key that it chose, and the Key_len. From those you can deduce how many columns of the index are being used for filtering. (JSON makes it easier to get the answer.) From that you can decide whether it is using as much of the INDEX as you thought. Caveat: Key_len only covers the WHERE part of the action; the non-JSON output won't easily say whether GROUP BY or ORDER BY was handled by the index.
IN (1,99,3) is sometimes optimized as efficiently as "=", but not always. Older versions of MySQL did not optimize it as well as newer versions. (5.6 is possibly the main turning point.)
IN ( SELECT ... )
From version 4.1 through 5.5, IN ( SELECT ... ) was very poorly optimized. The SELECT was effectively re-evaluated every time. Often it can be transformed into a JOIN, which works much faster. Here is a pattern to follow:
SELECT ...
FROM a
WHERE test_a
AND x IN (
SELECT x
FROM b
WHERE test_b
);
⇒
SELECT ...
FROM a
JOIN b USING(x)
WHERE test_a
AND test_b;
The SELECT expressions will need "a." prefixing the column names.
Alas, there are cases where the pattern is hard to follow.
5.6 does some optimizing, but probably not as well as the JOIN.
If there is a JOIN or GROUP BY or ORDER BY LIMIT in the subquery, that complicates the transformed JOIN. So, it might be better to use this pattern:
SELECT ...
FROM a
WHERE test_a
AND x IN ( SELECT x FROM ... );
⇒
SELECT ...
FROM a
JOIN ( SELECT x FROM ... ) b
USING(x)
WHERE test_a;
Caveat: If you end up with two subqueries JOINed together, note that neither has any indexes, hence performance can be very bad. (5.6 improves on it by dynamically creating indexes for subqueries.)
There is work going on in MariaDB and Oracle 5.7, in relation to "NOT IN", "NOT EXISTS", and "LEFT JOIN..IS NULL"; here is an old discussion on the topic. So, what I say here may not be the final word.
When you have a JOIN and a GROUP BY, you may have the situation where the JOIN exploded more rows than the original query (due to many:many), but you wanted only one row from the original table, so you added the GROUP BY to implode back to the desired set of rows.
This explode + implode, itself, is costly. It would be better to avoid them if possible.
Sometimes the following will work.
Using DISTINCT or GROUP BY to counteract the explosion
SELECT DISTINCT
a.*,
b.y
FROM a
JOIN b ON b.x = a.x
⇒
SELECT a.*,
( SELECT GROUP_CONCAT(b.y) FROM b WHERE b.x = a.x ) AS ys
FROM a
When using the second table just to check for existence:
SELECT a.*
FROM a
JOIN b ON b.x = a.x
GROUP BY a.id
⇒
SELECT a.*
FROM a
WHERE EXISTS ( SELECT * FROM b WHERE b.x = a.x )
For a many:many mapping table, do it this way.
CREATE TABLE XtoY (
# No surrogate id for this table
x_id MEDIUMINT UNSIGNED NOT NULL, -- For JOINing to one table
y_id MEDIUMINT UNSIGNED NOT NULL, -- For JOINing to the other table
# Include other fields specific to the 'relation'
PRIMARY KEY(x_id, y_id), -- When starting with X
INDEX (y_id, x_id) -- When starting with Y
) ENGINE=InnoDB;
Notes:
Lack of an AUTO_INCREMENT id for this table -- The PK given is the 'natural' PK; there is no good reason for a surrogate.
"MEDIUMINT" -- This is a reminder that all INTs should be made as small as is safe (smaller ⇒ faster). Of course the declaration here must match the definition in the table being linked to.
"UNSIGNED" -- Nearly all INTs may as well be declared non-negative
"NOT NULL" -- Well, that's true, isn't it?
"InnoDB" -- More effecient than MyISAM because of the way the PRIMARY KEY is clustered with the data in InnoDB.
"INDEX(y_id, x_id)" -- The PRIMARY KEY makes it efficient to go one direction; this index makes the other direction efficient. No need to say UNIQUE; that would be extra effort on INSERTs.
In the secondary index, saying just INDEX(y_id) would work because it would implicitly include x_id. But I would rather make it more obvious that I am hoping for a 'covering' index.
To conditionally INSERT new links, use IODKU (INSERT ... ON DUPLICATE KEY UPDATE).
Note that if you had an AUTO_INCREMENT in this table, IODKU would "burn" ids quite rapidly.
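A sketch of such a conditional insert, using the XtoY table above:

-- Add the link only if it does not already exist; existing links are untouched:
INSERT INTO XtoY (x_id, y_id)
    VALUES (12, 34)
    ON DUPLICATE KEY UPDATE x_id = x_id;  -- deliberate no-op when the link exists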
Each subquery SELECT and each SELECT in a UNION can be considered separately for finding the optimal INDEX.
Exception: In a "correlated" ("dependent") subquery, the part of the WHERE that depends on the outside table is not easily factored into the INDEX generation. (Cop out!)
The first step is to decide what order the optimizer will go through the tables. If you cannot figure it out, then you may need to be pessimistic and create two indexes for each table -- one assuming the table will be used first, one assuming that it will come later in the table order.
The optimizer usually starts with one table and extracts the data needed from it. As it finds a useful (that is, matches the WHERE clause, if any) row, it reaches into the 'next' table. This is called NLJ ("Nested Loop Join"). The process of filtering and reaching to the next table continues through the rest of the tables.
The optimizer usually picks the "first" table based on these hints:
STRAIGHT_JOIN forces the table order.
The WHERE clause limits which rows needed (whether indexed or not).
The table to the "left" in a LEFT JOIN usually comes before the "right" table. (By looking at the table definitions, the optimizer may decide that "LEFT" is irrelevant.)
The current INDEXes will encourage an order.
etc.
Running EXPLAIN tells you the table order that the Optimizer is very likely to use today. After adding a new INDEX, the optimizer may pick a different table order. You should anticipate the order changing, guess at what order makes the most sense, and build the INDEXes accordingly. Then rerun EXPLAIN to see if the Optimizer's brain was on the same wavelength you were on.
You should build the INDEX for the "first" table based on any parts of the WHERE, GROUP BY, and ORDER BY clauses that are relevant to it. If a GROUP/ORDER BY mentions a different table, you should ignore that clause.
The second (and subsequent) table will be reached into based on the ON clause. (Instead of using commajoin, please write JOINs with the JOIN keyword and ON clause!) In addition, there could be parts of the WHERE clause that are relevant. GROUP/ORDER BY are not to be considered in writing the optimal INDEX for subsequent tables.
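A sketch with two hypothetical tables, assuming the optimizer starts with a:

SELECT a.x, b.y
    FROM a
    JOIN b ON b.a_id = a.id
    WHERE a.type = 3
    ORDER BY a.created;

-- 'First' table: '=' part of the WHERE, then the ORDER BY:
ALTER TABLE a ADD INDEX (type, created);
-- Subsequent table: reached via the ON clause:
ALTER TABLE b ADD INDEX (a_id);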
PARTITIONing is rarely a substitute for a good INDEX.
PARTITION BY RANGE is a technique that is sometimes useful when indexing fails to be good enough. In a two-dimensional situation such as nearness in a geographical sense, one dimension can partially be handled by partition pruning; then the other dimension can be handled by a regular index (preferably the PRIMARY KEY). This goes into more detail: Find nearest 10 pizza parlors.
FULLTEXT is now implemented in InnoDB as well as MyISAM. It provides a way to search for "words" in TEXT columns. This is much faster (when it is applicable) than col LIKE '%word%'.
WHERE x = 1
AND MATCH (...) AGAINST (...)
always(?) uses the FULLTEXT index first. That is, the whole Algorithm is invalidated when one of the ANDs is a MATCH.
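A sketch (hypothetical articles table; the FULLTEXT index handles the MATCH, so the Algorithm is not applied to x):

ALTER TABLE articles ADD FULLTEXT INDEX ft_body (body);

SELECT id
    FROM articles
    WHERE x = 1
      AND MATCH(body) AGAINST('pizza' IN NATURAL LANGUAGE MODE);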
No "compound" (aka "composite") indexes
No PRIMARY KEY
Redundant indexes (especially blatant is PRIMARY KEY(id), KEY(id))
Most or all columns individually indexed ("But I indexed everything")
"Commajoin" -- That is FROM a, b WHERE a.x=b.x
instead of FROM a JOIN b ON a.x=b.x
The published table (see Wikipedia) is
CREATE TABLE wp_postmeta (
meta_id BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT,
post_id BIGINT(20) UNSIGNED NOT NULL DEFAULT '0',
meta_key VARCHAR(255) DEFAULT NULL,
meta_value LONGTEXT,
PRIMARY KEY (meta_id),
KEY post_id (post_id),
KEY meta_key (meta_key)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
The problems:
The AUTO_INCREMENT provides no benefit; in fact it slows down most queries and clutters disk.
Much better is PRIMARY KEY(post_id, meta_key) -- clustered, handles both parts of usual JOIN.
BIGINT is overkill, but that can't be fixed without changing other tables.
VARCHAR(255) can be a problem in 5.6 with utf8mb4; see workarounds below.
When would meta_key or meta_value ever be NULL?
The solutions:
CREATE TABLE wp_postmeta (
post_id BIGINT UNSIGNED NOT NULL,
meta_key VARCHAR(255) NOT NULL,
meta_value LONGTEXT NOT NULL,
PRIMARY KEY(post_id, meta_key),
INDEX(meta_key)
) ENGINE=InnoDB;
Initial posting: March, 2015; Refreshed Feb, 2016; Add DATE June, 2016; Add WP example May, 2017.
The tips in this document apply to MySQL, MariaDB, and Percona.
Some info in the MySQL manual: ORDER BY Optimization
A short, but complicated, example
Indexing 101: Optimizing MySQL queries on a single table (Stephane Combaudon - Percona)
This blog is the consolidation of a Percona tutorial I gave in 2013, plus many years of experience in fixing thousands of slow queries on hundreds of systems. I apologize that this does not tell you how create INDEXes for all SELECTs. Some are just too complex.
Rick James graciously allowed us to use this article in the documentation.
Rick James' site has other useful tips, how-tos, optimizations, and debugging tips.
Original source: random
This page is licensed: CC BY-SA / Gnu FDL
This document starts out trivial and perhaps boring, but builds up to more interesting information, perhaps things you did not realize about how MariaDB and MySQL indexing works.
This also explains EXPLAIN (to some extent).
(Most of this applies to other databases, too.)
The question is "When was Andrew Johnson president of the US?".
The available table Presidents looks like:
+-----+------------+----------------+-----------+
| seq | last_name | first_name | term |
+-----+------------+----------------+-----------+
| 1 | Washington | George | 1789-1797 |
| 2 | Adams | John | 1797-1801 |
...
| 7 | Jackson | Andrew | 1829-1837 |
...
| 17 | Johnson | Andrew | 1865-1869 |
...
| 36 | Johnson | Lyndon B. | 1963-1969 |
...
("Andrew Johnson" was picked for this lesson because of the duplicates.)
What index(es) would be best for that question? More specifically, what would be best for
SELECT term
FROM Presidents
WHERE last_name = 'Johnson'
AND first_name = 'Andrew';
Some INDEXes to try...
No indexes
INDEX(first_name), INDEX(last_name) (two separate indexes)
"Index Merge Intersect"
INDEX(last_name, first_name) (a "compound" index)
INDEX(last_name, first_name, term) (a "covering" index)
Variants
Well, I am fudging a little here. I have a PRIMARY KEY on seq, but that has no advantage on the query we are studying.
SHOW CREATE TABLE Presidents \G
CREATE TABLE `presidents` (
`seq` TINYINT(3) UNSIGNED NOT NULL AUTO_INCREMENT,
`last_name` VARCHAR(30) NOT NULL,
`first_name` VARCHAR(30) NOT NULL,
`term` VARCHAR(9) NOT NULL,
PRIMARY KEY (`seq`)
) ENGINE=InnoDB AUTO_INCREMENT=45 DEFAULT CHARSET=utf8
EXPLAIN SELECT term
FROM Presidents
WHERE last_name = 'Johnson'
AND first_name = 'Andrew';
+----+-------------+------------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+------+---------------+------+---------+------+------+-------------+
| 1 | SIMPLE | Presidents | ALL | NULL | NULL | NULL | NULL | 44 | Using where |
+----+-------------+------------+------+---------------+------+---------+------+------+-------------+
# Or, using the other form of display: EXPLAIN ... \G
id: 1
select_type: SIMPLE
table: Presidents
type: ALL <-- Implies table scan
possible_keys: NULL
key: NULL <-- Implies that no index is useful, hence table scan
key_len: NULL
ref: NULL
rows: 44 <-- That's about how many rows in the table, so table scan
Extra: Using where
First, let's describe how InnoDB stores and uses indexes.
The data and the PRIMARY KEY are "clustered" together in one BTree.
A BTree lookup is quite fast and efficient. For a million-row table there might be 3 levels of BTree, and the top two levels are probably cached.
Each secondary index is in another BTree, with the PRIMARY KEY at the leaf.
Fetching 'consecutive' (according to the index) items from a BTree is very efficient because they are stored consecutively.
For the sake of simplicity, we can count each BTree lookup as 1 unit of work, and ignore scans for consecutive items. This approximates the number of disk hits for a large table in a busy system.
For MyISAM, the PRIMARY KEY is not stored with the data, so think of it as being a secondary key (over-simplified).
The novice, once he learns about indexing, decides to index lots of columns, one at a time. But...
MariaDB rarely uses more than one index at a time in a query. So, it will analyze the possible indexes.
first_name -- there are 2 possible rows (one BTree lookup, then scan consecutively)
last_name -- there are 2 possible rows
Let's say it picks last_name. Here are the steps for doing the SELECT:
Using INDEX(last_name), find 2 index entries with last_name = 'Johnson'.
Get the PRIMARY KEY (implicitly added to each secondary index in InnoDB); get (17, 36).
Reach into the data using seq = (17, 36) to get the rows for Andrew Johnson and Lyndon B. Johnson.
Use the rest of the WHERE clause to filter out all but the desired row.
Deliver the answer (1865-1869).
EXPLAIN SELECT term
FROM Presidents
WHERE last_name = 'Johnson'
AND first_name = 'Andrew' \G
select_type: SIMPLE
table: Presidents
type: ref
possible_keys: last_name, first_name
key: last_name
key_len: 92 <-- VARCHAR(30) utf8 may need 2+3*30 bytes
ref: const
rows: 2 <-- Two 'Johnson's
Extra: Using where
OK, so you get really smart and decide that MariaDB should be smart enough to use both name indexes to get the answer. This is called "Intersect".
Using INDEX(last_name), find 2 index entries with last_name = 'Johnson'; get (7, 17)
Using INDEX(first_name), find 2 index entries with first_name = 'Andrew'; get (17, 36)
"And" the two lists together (7,17) & (17,36) = (17)
Reach into the data using seq = (17) to get the row for Andrew Johnson.
Deliver the answer (1865-1869).
id: 1
select_type: SIMPLE
table: Presidents
type: index_merge
possible_keys: first_name,last_name
key: first_name,last_name
key_len: 92,92
ref: NULL
rows: 1
Extra: Using intersect(first_name,last_name); Using where
The EXPLAIN fails to give the gory details of how many rows collected from each index, etc.
This is called a "compound" or "composite" index since it has more than one column.
Drill down the BTree for the index to get to exactly the index row for Johnson+Andrew; get seq = (17).
Reach into the data using seq = (17) to get the row for Andrew Johnson.
Deliver the answer (1865-1869). This is much better. In fact this is usually the "best".
ALTER TABLE Presidents
-- (DROP the old indexes, and...)
ADD INDEX compound(last_name, first_name);
id: 1
select_type: SIMPLE
table: Presidents
type: ref
possible_keys: compound
key: compound
key_len: 184 <-- The length of both fields
ref: const,const <-- The WHERE clause gave constants for both
rows: 1 <-- Goodie! It homed in on the one row.
Extra: Using where
Surprise! We can actually do a little better. A "Covering" index is one in which all of the fields of the SELECT are found in the index. It has the added bonus of not having to reach into the "data" to finish the task.
Drill down the BTree for the index to get to exactly the index row for Johnson+Andrew; get seq = (17).
Deliver the answer (1865-1869). The "data" BTree is not touched; this is an improvement over "compound".
... ADD INDEX covering(last_name, first_name, term);
id: 1
select_type: SIMPLE
table: Presidents
type: ref
possible_keys: covering
key: covering
key_len: 184
ref: const,const
rows: 1
Extra: Using where; Using index <-- Note
Everything is similar to using "compound", except for the addition of "Using index".
What would happen if you shuffled the fields in the WHERE clause? Answer: The order of ANDed things does not matter.
What would happen if you shuffled the fields in the INDEX? Answer: It may make a huge difference. More in a minute.
What if there are extra fields on the end? Answer: Minimal harm; possibly a lot of good (eg, 'covering').
Redundancy? That is, what if you have both of these: INDEX(a), INDEX(a,b)? Answer: Redundancy costs something on INSERTs; it is rarely useful for SELECTs.
Prefix? That is, INDEX(last_name(5), first_name(5))? Answer: Don't bother; it rarely helps, and often hurts. (The details are another topic.)
INDEX(last, first)
... WHERE last = '...' -- good (even though `first` is unused)
... WHERE first = '...' -- index is useless
INDEX(first, last), INDEX(last, first)
... WHERE first = '...' -- 1st index is used
... WHERE last = '...' -- 2nd index is used
... WHERE first = '...' AND last = '...' -- either could be used equally well
INDEX(last, first)
Both of these are handled by that one INDEX:
... WHERE last = '...'
... WHERE last = '...' AND first = '...'
INDEX(last), INDEX(last, first)
In light of the above example, don't bother including INDEX(last).
Refreshed -- Oct, 2012; more links -- Nov 2016
Rick James graciously allowed us to use this article in the documentation.
Rick James' site has other useful tips, how-tos, optimizations, and debugging tips.
Original source: index1
This page is licensed: CC BY-SA / Gnu FDL
A foreign key is a constraint which can be used to enforce data integrity. It is composed of a column (or a set of columns) in a table called the child table, which references a column (or a set of columns) in a table called the parent table. If foreign keys are used, MariaDB performs some checks to ensure that the integrity rules are always enforced. For a more exhaustive explanation, see Relational databases: Foreign Keys.
Foreign keys can only be used with storage engines that support them. The default InnoDB supports foreign keys.
Partitioned tables cannot contain foreign keys, and cannot be referenced by a foreign key.
Note: Until MariaDB 10.4, MariaDB accepts the shortcut format with a REFERENCES clause only in ALTER TABLE and CREATE TABLE statements, but that syntax does nothing. For example:
CREATE TABLE b(for_key INT REFERENCES a(not_key));
MariaDB simply parses it without returning any error or warning, for compatibility with other DBMS's. However, only the syntax described below creates foreign keys. From MariaDB 10.5, MariaDB will attempt to apply the constraint. See the Examples below.
Foreign keys are created with CREATE TABLE or ALTER TABLE. The definition must follow this syntax:
[CONSTRAINT [symbol]] FOREIGN KEY
[index_name] (index_col_name, ...)
REFERENCES tbl_name (index_col_name,...)
[ON DELETE reference_option]
[ON UPDATE reference_option]
reference_option:
RESTRICT | CASCADE | SET NULL | NO ACTION | SET DEFAULT
The symbol clause, if specified, is used in error messages and must be unique in the database.
The columns in the child table must be a BTREE (not HASH, RTREE, or FULLTEXT — see SHOW INDEX) index, or the leftmost part of a BTREE index. Index prefixes are not supported (thus, TEXT and BLOB columns cannot be used as foreign keys). If MariaDB automatically creates an index for the foreign key (because it does not exist and is not explicitly created), its name will be index_name.
The referenced columns in the parent table must be an index or a prefix of an index.
The foreign key columns and the referenced columns must be of the same type, or similar types. For integer types, the size and sign must also be the same.
Both the foreign key columns and the referenced columns can be PERSISTENT columns. However, the ON UPDATE CASCADE, ON UPDATE SET NULL, ON DELETE SET NULL clauses are not allowed in this case.
The parent and the child table must use the same storage engine, and must not be TEMPORARY or partitioned tables. They can be the same table.
If a foreign key exists, each row in the child table must match a row in the parent table. Multiple child rows can match the same parent row. A child row matches a parent row if all its foreign key values are identical to a parent row's values in the parent table. However, if at least one of the foreign key values is NULL, the row has no parent, but it is still allowed.
MariaDB performs certain checks to guarantee that the data integrity is enforced:
Trying to insert non-matching rows (or update matching rows in a way that makes them non-matching rows) in the child table produces a 1452 error (SQLSTATE '23000').
When a row in the parent table is deleted and at least one child row exists, MariaDB performs an action which depends on the ON DELETE clause of the foreign key.
When a value in the column referenced by a foreign key changes and at least one child row exists, MariaDB performs an action which depends on the ON UPDATE clause of the foreign key.
Trying to drop a table that is referenced by a foreign key produces a 1217 error (SQLSTATE '23000').
A TRUNCATE TABLE against a table containing one or more foreign keys is executed as a DELETE without WHERE, so that the foreign keys are enforced for each row.
The allowed actions for ON DELETE and ON UPDATE are:
RESTRICT
: The change on the parent table is prevented. The statement terminates with a 1451 error (SQLSTATE '23000'). This is the default behavior for both ON DELETE and ON UPDATE.
NO ACTION
: Synonym for RESTRICT.
CASCADE
: The change is allowed and propagates on the child table. For example, if a parent row is deleted, the child row is also deleted; if a parent row's ID changes, the child row's ID will also change.
SET NULL
: The change is allowed, and the child row's foreign key columns are set to NULL.
SET DEFAULT
: Only worked with PBXT. Similar to SET NULL, but the foreign key columns were set to their default values. If default values do not exist, an error is produced.
The delete or update operations triggered by foreign keys do not activate triggers and are not counted in the Com_delete and Com_update status variables.
Foreign key constraints can be disabled by setting the foreign_key_checks server system variable to 0. This speeds up the insertion of large quantities of data.
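For example, around a bulk load of data that is already known to be consistent (a sketch; rows inserted while checks are off are not verified):

SET SESSION foreign_key_checks = 0;  -- disable checks for this session
-- ... bulk-load the data ...
SET SESSION foreign_key_checks = 1;  -- re-enable checks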
The Information Schema REFERENTIAL_CONSTRAINTS table contains information about foreign keys. The individual columns are listed in the KEY_COLUMN_USAGE table.
The InnoDB-specific Information Schema tables also contain information about the InnoDB foreign keys. The foreign key information is stored in the INNODB_SYS_FOREIGN. Data about the individual columns are stored in INNODB_SYS_FOREIGN_COLS.
The most human-readable way to get information about a table's foreign keys sometimes is the SHOW CREATE TABLE statement.
Foreign keys have the following limitations in MariaDB:
Currently, foreign keys are only supported by InnoDB.
Cannot be used with views.
The SET DEFAULT action is not supported.
Foreign key actions do not activate triggers.
If ON UPDATE CASCADE recurses to update the same table it has previously updated during the cascade, it acts like RESTRICT.
Indexed generated columns (both VIRTUAL and PERSISTENT) are not supported as InnoDB foreign key indexes.
Let's see an example. We will create an author table and a book table. Both tables have a primary key called id. book also has a foreign key composed of a field called author_id, which refers to the author primary key. The foreign key constraint name is optional, but we'll specify it because we want it to appear in error messages: fk_book_author.
CREATE TABLE author (
id SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
name VARCHAR(100) NOT NULL
) ENGINE = InnoDB;
CREATE TABLE book (
id MEDIUMINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
title VARCHAR(200) NOT NULL,
author_id SMALLINT UNSIGNED NOT NULL,
CONSTRAINT `fk_book_author`
FOREIGN KEY (author_id) REFERENCES author (id)
ON DELETE CASCADE
ON UPDATE RESTRICT
) ENGINE = InnoDB;
Now, if we try to insert a book with a non-existing author, we will get an error:
INSERT INTO book (title, author_id) VALUES ('Necronomicon', 1);
ERROR 1452 (23000): Cannot add or update a child row: a foreign key constraint fails
(`test`.`book`, CONSTRAINT `fk_book_author` FOREIGN KEY (`author_id`)
REFERENCES `author` (`id`) ON DELETE CASCADE)
The error is very descriptive.
Now, let's try to properly insert two authors and their books:
INSERT INTO author (name) VALUES ('Abdul Alhazred');
INSERT INTO book (title, author_id) VALUES ('Necronomicon', LAST_INSERT_ID());
INSERT INTO author (name) VALUES ('H.P. Lovecraft');
INSERT INTO book (title, author_id) VALUES
('The call of Cthulhu', LAST_INSERT_ID()),
('The colour out of space', LAST_INSERT_ID());
It worked!
Now, let's delete the second author. When we created the foreign key, we specified ON DELETE CASCADE. This should propagate the deletion, and make the deleted author's books disappear:
DELETE FROM author WHERE name = 'H.P. Lovecraft';
SELECT * FROM book;
+----+--------------+-----------+
| id | title | author_id |
+----+--------------+-----------+
| 3 | Necronomicon | 1 |
+----+--------------+-----------+
We also specified ON UPDATE RESTRICT. This should prevent us from modifying an author's id (the column referenced by the foreign key) if a child row exists:
UPDATE author SET id = 10 WHERE id = 1;
ERROR 1451 (23000): Cannot delete or update a parent row: a foreign key constraint fails
(`test`.`book`, CONSTRAINT `fk_book_author` FOREIGN KEY (`author_id`)
REFERENCES `author` (`id`) ON DELETE CASCADE)
Until MariaDB 10.4
CREATE TABLE a(a_key INT PRIMARY KEY, not_key INT);
CREATE TABLE b(for_key INT REFERENCES a(not_key));
SHOW CREATE TABLE b;
+-------+----------------------------------------------------------------------------------+
| Table | Create Table |
+-------+----------------------------------------------------------------------------------+
| b | CREATE TABLE `b` (
`for_key` int(11) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
+-------+----------------------------------------------------------------------------------+
INSERT INTO a VALUES (1,10);
Query OK, 1 row affected (0.005 sec)
INSERT INTO b VALUES (10);
Query OK, 1 row affected (0.004 sec)
INSERT INTO b VALUES (1);
Query OK, 1 row affected (0.004 sec)
SELECT * FROM b;
+---------+
| for_key |
+---------+
| 10 |
| 1 |
+---------+
From MariaDB 10.5
CREATE TABLE a(a_key INT PRIMARY KEY, not_key INT);
CREATE TABLE b(for_key INT REFERENCES a(not_key));
ERROR 1005 (HY000): Can't create table `test`.`b`
(errno: 150 "Foreign key constraint is incorrectly formed")
CREATE TABLE c(for_key INT REFERENCES a(a_key));
SHOW CREATE TABLE c;
+-------+----------------------------------------------------------------------------------+
| Table | Create Table |
+-------+----------------------------------------------------------------------------------+
| c | CREATE TABLE `c` (
`for_key` INT(11) DEFAULT NULL,
KEY `for_key` (`for_key`),
CONSTRAINT `c_ibfk_1` FOREIGN KEY (`for_key`) REFERENCES `a` (`a_key`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
+-------+----------------------------------------------------------------------------------+
INSERT INTO a VALUES (1,10);
Query OK, 1 row affected (0.004 sec)
INSERT INTO c VALUES (10);
ERROR 1452 (23000): Cannot add or update a child row: a foreign key constraint fails
(`test`.`c`, CONSTRAINT `c_ibfk_1` FOREIGN KEY (`for_key`) REFERENCES `a` (`a_key`))
INSERT INTO c VALUES (1);
Query OK, 1 row affected (0.004 sec)
SELECT * FROM c;
+---------+
| for_key |
+---------+
| 1 |
+---------+
This page is licensed: CC BY-SA / Gnu FDL
MariaDB starting with 10.6.0
Ignored indexes were added in MariaDB 10.6.
Ignored indexes are indexes that are visible and maintained, but which are not used by the optimizer. MySQL 8 has a similar feature which they call "invisible indexes".
By default, an index is not ignored. One can mark an existing index as ignored (or not ignored) with an ALTER TABLE statement:
ALTER TABLE table_name ALTER {KEY|INDEX} [IF EXISTS] key_name [NOT] IGNORED;
It is also possible to specify the IGNORED attribute when creating an index with a CREATE TABLE or CREATE INDEX statement:
CREATE TABLE table_name (
...
INDEX index_name ( ...) [NOT] IGNORED
...
CREATE INDEX index_name ON tbl_name (...) [NOT] IGNORED;
A table's primary key cannot be ignored. This applies both to an explicitly defined primary key and to an implicit primary key - if there is no explicit primary key defined but the table has a unique key containing only NOT NULL columns, the first of such keys becomes the implicitly defined primary key.
The optimizer treats ignored indexes as if they didn't exist. They will not be used in query plans, or as a source of statistical information.
Also, an attempt to use an ignored index in a USE INDEX, FORCE INDEX, or IGNORE INDEX hint will result in an error - the same as one would get by using the name of a non-existent index.
Information about whether or not indexes are ignored can be viewed in the IGNORED column in the Information Schema STATISTICS table or the SHOW INDEX statement.
The primary use case is as follows: a DBA sees an index that seems to have little or no usage and considers whether to remove it. Dropping the index is a risk as it may still be needed in a few cases. For example, the optimizer may rely on the estimates provided by the index without using the index in query plans. If dropping an index causes an issue, it will take a while to re-create the index. On the other hand, marking the index as ignored (or not ignored) is instant, so the suggested workflow is:
Mark the index as ignored
Check if everything continues to work
If not, mark the index as not ignored.
If everything continues to work, one can safely drop the index.
CREATE TABLE t1 (id INT PRIMARY KEY, b INT, KEY k1(b) IGNORED);
CREATE OR REPLACE TABLE t1 (id INT PRIMARY KEY, b INT, KEY k1(b));
ALTER TABLE t1 ALTER INDEX k1 IGNORED;
CREATE OR REPLACE TABLE t1 (id INT PRIMARY KEY, b INT);
CREATE INDEX k1 ON t1(b) IGNORED;
SELECT * FROM INFORMATION_SCHEMA.STATISTICS WHERE TABLE_NAME = 't1'\G
*************************** 1. row ***************************
TABLE_CATALOG: def
TABLE_SCHEMA: test
TABLE_NAME: t1
NON_UNIQUE: 0
INDEX_SCHEMA: test
INDEX_NAME: PRIMARY
SEQ_IN_INDEX: 1
COLUMN_NAME: id
COLLATION: A
CARDINALITY: 0
SUB_PART: NULL
PACKED: NULL
NULLABLE:
INDEX_TYPE: BTREE
COMMENT:
INDEX_COMMENT:
IGNORED: NO
*************************** 2. row ***************************
TABLE_CATALOG: def
TABLE_SCHEMA: test
TABLE_NAME: t1
NON_UNIQUE: 1
INDEX_SCHEMA: test
INDEX_NAME: k1
SEQ_IN_INDEX: 1
COLUMN_NAME: b
COLLATION: A
CARDINALITY: 0
SUB_PART: NULL
PACKED: NULL
NULLABLE: YES
INDEX_TYPE: BTREE
COMMENT:
INDEX_COMMENT:
IGNORED: YES
SHOW INDEXES FROM t1\G
*************************** 1. row ***************************
Table: t1
Non_unique: 0
Key_name: PRIMARY
Seq_in_index: 1
Column_name: id
Collation: A
Cardinality: 0
Sub_part: NULL
Packed: NULL
Null:
Index_type: BTREE
Comment:
Index_comment:
Ignored: NO
*************************** 2. row ***************************
Table: t1
Non_unique: 1
Key_name: k1
Seq_in_index: 1
Column_name: b
Collation: A
Cardinality: 0
Sub_part: NULL
Packed: NULL
Null: YES
Index_type: BTREE
Comment:
Index_comment:
Ignored: YES
The optimizer does not make use of an index when it is ignored, while if the index is not ignored (the default), the optimizer will consider it in the query plan, as shown in the EXPLAIN output.
CREATE OR REPLACE TABLE t1 (id INT PRIMARY KEY, b INT, KEY k1(b) IGNORED);
EXPLAIN SELECT * FROM t1 ORDER BY b;
+------+-------------+-------+------+---------------+------+---------+------+------+----------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+------+---------------+------+---------+------+------+----------------+
| 1 | SIMPLE | t1 | ALL | NULL | NULL | NULL | NULL | 1 | Using filesort |
+------+-------------+-------+------+---------------+------+---------+------+------+----------------+
ALTER TABLE t1 ALTER INDEX k1 NOT IGNORED;
EXPLAIN SELECT * FROM t1 ORDER BY b;
+------+-------------+-------+-------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+-------+---------------+------+---------+------+------+-------------+
| 1 | SIMPLE | t1 | index | NULL | k1 | 5 | NULL | 1 | Using index |
+------+-------------+-------+-------+---------------+------+---------+------+------+-------------+
This page is licensed: CC BY-SA / Gnu FDL
The MariaDB query optimizer decides how best to execute each query based largely on the details of the indexes involved.
The index statistics help inform these decisions. Imagine yourself choosing whether to look up a number in a phone book, or in your personal address book. You'd choose the personal phone book if at all possible, as it would (usually!) contain far fewer records and be quicker to search.
Now imagine getting to your personal address book and finding it has twice the number of entries as the phone book. Your search would be slower. The same process applies to the query optimizer, so having access to up-to-date and accurate statistics is critical.
The statistics are mainly based on groups of index elements of the same value. In a primary key, every value is unique, so every group size is one. In a non-unique index, you may have multiple keys with the same value. A worst-case example would be having large groups with the same value, for example an index on a boolean field.
MariaDB makes heavy use of the average group size statistic. For example, if there are 100 rows, and twenty groups with the same index values, the average group size would be five.
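As a sketch, for a hypothetical table t with an indexed column c, the average group size can be computed directly, and the optimizer's own estimate appears as Cardinality in SHOW INDEX:

SELECT COUNT(*) / COUNT(DISTINCT c) AS avg_group_size FROM t;
SHOW INDEX FROM t;  -- rows / Cardinality approximates the average group size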
However, averages can be skewed by extremes, and the usual culprit is NULL values. The 100 rows may have 19 groups with an average size of one, while the other 81 values are all NULL. MariaDB may think five is a good average size and choose to use that index, and then end up having to read through 81 rows with identical keys, taking longer than an alternative.
There are three main approaches to the problem of NULLs. NULL index values can be treated as a single group (nulls_equal). This is usually fine, but if you have large numbers of NULLs the average group size is slanted higher, and the optimizer may miss using the index for ref accesses when it would be useful. This is the default used by XtraDB/InnoDB and MyISAM. nulls_unequal is the opposite approach, with each NULL forming its own group of one. Conversely, the average group size is slanted lower, and the optimizer may use the index for ref accesses when not suitable. This is the default used by the Aria storage engine. A third option, nulls_ignored, leaves NULLs out of the index group calculations altogether.
The default approaches can be changed by setting the aria_stats_method, myisam_stats_method and innodb_stats_method server variables.
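For example (the value names are the documented ones for these variables):

-- Treat each NULL as its own group when collecting InnoDB index statistics:
SET GLOBAL innodb_stats_method = 'nulls_unequal';
-- Valid values: nulls_equal (the default), nulls_unequal, nulls_ignored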
The comparison operator used plays an important role. If two values are compared with <=> (see the null-safe-equal comparison operator), and both are null, 1 is returned. If the same values are compared with = (see the equal comparison operator) null is returned. For example:
SELECT 1 <=> 1, NULL <=> NULL, 1 <=> NULL;
+---------+---------------+------------+
| 1 <=> 1 | NULL <=> NULL | 1 <=> NULL |
+---------+---------------+------------+
| 1 | 1 | 0 |
+---------+---------------+------------+
SELECT 1 = 1, NULL = NULL, 1 = NULL;
+-------+-------------+----------+
| 1 = 1 | NULL = NULL | 1 = NULL |
+-------+-------------+----------+
| 1 | NULL | NULL |
+-------+-------------+----------+
MariaDB 10.0.1 introduced a way to gather statistics independently of the storage engine. See Engine-independent table statistics.
Histogram-Based Statistics were introduced in MariaDB 10.0.2, and are collected by default from MariaDB 10.4.3.
User Statistics. This plugin provides user, client, table and index usage statistics.
This page is licensed: CC BY-SA / Gnu FDL
You want to find the nearest 10 pizza parlors, but you cannot figure out how to do it efficiently in your huge database. Database indexes are good at one-dimensional indexing, but poor at two-dimensions.
You might have tried
INDEX(lat), INDEX(lon) -- but the optimizer used only one
INDEX(lat,lon) -- but it still had to work too hard
Sometimes you ended up with a full table scan -- Yuck.
WHERE SQRT(...)< ... -- No chance of using any index.
WHERE lat BETWEEN ... AND lng BETWEEN... -- This has some chance of using such indexes.
The goal is to look only at records "close", in both directions, to the target lat/lng.
PARTITIONs in MariaDB and MySQL sort of give you a way to have two clustered indexes. So, if we could slice up (partition) the globe in one dimension and use ordinary indexing in the other dimension, maybe we can get something approximating a 2D index. This 2D approach keeps the number of disk hits significantly lower than 1D approaches, thereby speeding up "find nearest" queries.
It works. Not perfectly, but better than the alternatives.
What to PARTITION on? It seems like latitude or longitude would be a good idea. Note that a degree of longitude varies in ground width, from 69 miles (111 km) at the equator to 0 at the poles. So, latitude seems like a better choice.
How many PARTITIONs? It does not matter a lot. Some thoughts:
90 partitions - 2 degrees each. (I don't like tables with too many partitions; 90 seems like plenty.)
50-100 - evenly populated. (This requires code. For 2.7M placenames, 85 partitions varied from 0.5 degrees to very wide partitions at the poles.)
Don't have more than 100 partitions; there are inefficiencies in the partition implementation.
How to PARTITION? Well, MariaDB and MySQL are very picky. So FLOAT/DOUBLE are out. DECIMAL is out. So, we are stuck with some kludge. Essentially, we need to convert Lat/Lng to some size of INT and use PARTITION BY RANGE.
To get to a datatype that can be used in PARTITION, you need to "scale" the latitude and longitude. (Consider only the *INTs; the other datatypes are included for comparison)
Datatype Bytes resolution
------------------ ----- --------------------------------
Deg*100 (SMALLINT) 4 1570 m 1.0 mi Cities
DECIMAL(4,2)/(5,2) 5 1570 m 1.0 mi Cities
SMALLINT scaled 4 682 m 0.4 mi Cities
Deg*10000 (MEDIUMINT) 6 16 m 52 ft Houses/Businesses
DECIMAL(6,4)/(7,4) 7 16 m 52 ft Houses/Businesses
MEDIUMINT scaled 6 2.7 m 8.8 ft
FLOAT 8 1.7 m 5.6 ft
DECIMAL(8,6)/(9,6) 9 16cm 1/2 ft Friends in a mall
Deg*10000000 (INT) 8 16mm 5/8 in Marbles
DOUBLE 16 3.5nm ... Fleas on a dog
(Sorted by resolution)
What these mean...
Deg*100 (SMALLINT) -- you take the lat/lng, multiply by 100, round, and store into a SMALLINT. That will take 2 bytes for each dimension, for a total of 4 bytes. Two items might be 1570 meters apart, but register as having the same latitude and longitude.
DECIMAL(4,2) for latitude and DECIMAL(5,2) for longitude will take 2+3 bytes and have no better resolution than Deg*100.
SMALLINT scaled -- Convert latitude into a SMALLINT SIGNED by doing (degrees / 90 * 32767) and rounding; longitude by (degrees / 180 * 32767).
FLOAT has 24 significant bits; DOUBLE has 53. (They don't work with PARTITIONing but are included for completeness. Often people use DOUBLE without realizing how much an overkill it is, and how much space it takes.)
Sure, you could do Deg*1000 and other "in between" cases, but there is no advantage. Deg*1000 takes as much space as Deg*10000, but has less resolution.
So, go down the list to see how much resolution you need, then pick an encoding you are comfortable with. However, since we are about to use latitude as a "partition key", it must be limited to one of the INTs. For the sample code, I will use Deg*10000 (MEDIUMINT).
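A sketch of that scaling (the sample coordinates are invented):

SET @lat := ROUND( 35.1234567 * 10000);   -- 351235, fits in a MEDIUMINT
SET @lon := ROUND(-120.7654321 * 10000);  -- -1207654, fits in a MEDIUMINT
-- Unscale on the way out:
SELECT @lat / 10000 AS lat_deg, @lon / 10000 AS lon_deg;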
GCDist is a helper FUNCTION that correctly computes the distance between two points on the globe.
The code has been benchmarked at about 20 microseconds per call on a 2011-vintage PC. If you had to check a million points, that would take 20 seconds -- far too much for a web application. So, one goal of the Procedure that uses it will be to minimize the usage of this function. With the code presented here, the function need be called only a few dozen or few hundred times, except in pathological cases.
Sure, you could use the Pythagorean formula. And it would work for most applications. But it does not take much extra effort to do the GC. Furthermore, GC works across a pole and across the dateline. And, a Pythagorean function is not that much faster.
For efficiency, GCDist understands the scaling you picked and has that stuff hardcoded. I am picking "Deg*10000", so the function expects 350000 for representing 35 degrees. If you choose a different scaling, you will need to change the code.
GCDist() takes 4 scaled DOUBLEs -- lat1, lon1, lat2, lon2 -- and returns a scaled number of "degrees" representing the distance.
The table of representation choices says 52 feet of resolution for Deg*10000 and DECIMAL(x,4). Here is how it was calculated: measuring a diagonal between lat/lng (0,0) and (0.0001, 0.0001) (one 'unit in the last place'): GCDist(0,0,1,1) * 69.172 / 10000 * 5280 = 51.65, where
69.172 miles/degree of latitude
10000 units per degree for the scaling chosen
5280 feet / mile.
(No, this function does not compensate for the Earth being an oblate spheroid, etc.)
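The exact body of GCDist is not reproduced here. As a sketch (not the author's actual code), a great-circle function with the same contract -- Deg*10000 in, scaled 'degrees' out -- could look like this:

DELIMITER //
CREATE FUNCTION GCDist_sketch(lat1 DOUBLE, lon1 DOUBLE,
                              lat2 DOUBLE, lon2 DOUBLE)
    RETURNS DOUBLE
    DETERMINISTIC
BEGIN
    -- Unscale from Deg*10000 and convert to radians
    DECLARE rlat1 DOUBLE DEFAULT RADIANS(lat1 / 10000);
    DECLARE rlat2 DOUBLE DEFAULT RADIANS(lat2 / 10000);
    DECLARE dlon  DOUBLE DEFAULT RADIANS((lon2 - lon1) / 10000);
    DECLARE cosang DOUBLE DEFAULT
        SIN(rlat1) * SIN(rlat2) + COS(rlat1) * COS(rlat2) * COS(dlon);
    -- Clamp against floating-point rounding, then return scaled 'degrees'
    RETURN DEGREES(ACOS(LEAST(1, GREATEST(-1, cosang)))) * 10000;
END //
DELIMITER ;

With this sketch, GCDist_sketch(0,0,1,1) returns about 1.414, which matches the resolution arithmetic above.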
There will be one table (plus normalization tables as needed). The one table must be partitioned and indexed as indicated below.
Fields and indexes
PARTITION BY RANGE(lat)
lat -- scaled latitude (see above)
lon -- scaled longitude
PRIMARY KEY(lon, lat, ...) -- lon must be first; something must be added to make it UNIQUE
id -- (optional) you may need to identify the rows for your purposes; AUTO_INCREMENT if you like
INDEX(id) -- if id is AUTO_INCREMENT, then this plain INDEX (not UNIQUE, not PRIMARY KEY) is necessary
ENGINE=InnoDB -- so the PRIMARY KEY will be "clustered"
Other indexes -- keep to a minimum (this is a general performance rule for large tables)
For most of this discussion, lat is assumed to be MEDIUMINT -- scaled from -90 to +90 by multiplying by 10000. Similarly for lon and -180 to +180.
The PRIMARY KEY must
start with lon, since the algorithm needs the "clustering" that InnoDB will provide, and
include lat somewhere, since it is the PARTITION key, and
contain something to make the key UNIQUE (lon+lat is unlikely to be sufficient).
A sketch of such a table definition follows.
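Here is a minimal sketch of such a table, using the Deg*10000 scaling and the table name (Locations) assumed by the procedure below; the extra column, the number of partitions, and the boundary values are only illustrative:
CREATE TABLE Locations (
    id   INT UNSIGNED NOT NULL AUTO_INCREMENT,
    lat  MEDIUMINT NOT NULL,     -- scaled: degrees * 10000
    lon  MEDIUMINT NOT NULL,     -- scaled: degrees * 10000
    name VARCHAR(99) NOT NULL,   -- plus whatever else you need
    PRIMARY KEY (lon, lat, id),  -- lon first; id added to guarantee UNIQUE
    INDEX (id)                   -- required because id is AUTO_INCREMENT
) ENGINE=InnoDB
PARTITION BY RANGE (lat) (
    PARTITION p0 VALUES LESS THAN (-600000),   -- south of -60 degrees
    PARTITION p1 VALUES LESS THAN (-300000),
    PARTITION p2 VALUES LESS THAN (      0),
    PARTITION p3 VALUES LESS THAN ( 300000),
    PARTITION p4 VALUES LESS THAN ( 600000),
    PARTITION p5 VALUES LESS THAN MAXVALUE     -- north of +60 degrees
);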
The FindNearest PROCEDURE will do multiple SELECTs something like this:
WHERE lat BETWEEN @my_lat - @dlat
AND @my_lat + @dlat -- PARTITION Pruning and bounding box
AND lon BETWEEN @my_lon - @dlon
AND @my_lon + @dlon -- first part of PK
AND condition -- filter out non-pizza parlors
The query planner will
Do PARTITION "pruning" based on the latitude; then
Within a PARTITION (which is effectively a table), use lon to do a 'clustered' range scan; then
Use the "condition" to filter down to the rows you desire, plus recheck lat. This design leads to very few disk blocks needing to be read, which is the main goal of the design.
Note that this does not even call GCDist. That comes in the last pass when the ORDER BY and LIMIT are used.
The stored procedure has a loop. At least two SELECTs will be executed but, with proper tuning, usually no more than about 6. Because the search uses the PRIMARY KEY, each SELECT usually hits only one block of the table, sometimes a few more. Counting the number of blocks hit is a crude, but effective, way of comparing the performance of multiple designs. By comparison, a full table scan will probably touch thousands of blocks. A simple INDEX(lat) probably leads to hitting hundreds of blocks.
Filtering... An argument to the FindNearest procedure is a boolean expression ("condition") to include in the WHERE clause. If you don't need any filtering, pass in "1". To avoid SQL injection, do not let web users supply arbitrary expressions; instead, construct the "condition" from the inputs they provide, thereby making sure it is safe.
The algorithm is embodied in a stored procedure because of its complexity.
You feed it a starting width for a "square" and a number of items to find.
It builds a "square" around where you are.
A SELECT is performed to see how many items are in the square.
Loop, doubling the width of the square, until enough items are found.
Now, a 'last' SELECT is performed to get the exact distances, sort them (ORDER BY) and LIMIT to the desired number.
If spanning a pole or the dateline, a more complex SELECT is used.
The next section ("Performance") should make this a bit clearer as it walks through some examples.
Because of all the variations, it is hard to get a meaningful benchmark. So, here is some hand-waving instead.
Each SELECT is constrained by a "square" defined by a latitude range and a longitude range. (See the WHERE clause mentioned above, or in the sample code below.) Because of the way longitude lines converge toward the poles, the longitude range of the "square" will span more degrees than the latitude range. Let's say the latitude partitioning is 3 degrees wide in the area where you are searching. That is over 200 miles (over 300 km), so you are very likely to have a latitude range smaller than the partition width. Still, if you are searching from near the edge of a latitude stripe, the square could span two partitions. After partition pruning down to one (sometimes more) partition, the query is then constrained by a longitude range. (Remember, the PRIMARY KEY starts with lon.) If an InnoDB data block contains 100 rows (a handy Rule of Thumb), the SELECT will touch one (or a few) blocks. If the square spans two (or more) partitions, then the same logic applies to each partition.
So, scanning the square will involve as little as one block; rarely more than a few blocks. The number of blocks is mostly independent of the dataset size.
The primary use case for this algorithm is when the data is significantly larger than will fit into cache (the buffer_pool). Hence, the main goal is to minimize the number of disk hits.
Now let's look at some edge cases, and argue that the number of blocks is still better (usually) than with traditional indexing techniques.
What if you are looking for Starbucks in a dense city? There would be dozens, maybe hundreds per square mile. If you start the guess at 100 miles, the SELECTs would be hitting lots of blocks -- not efficient. In this case, the "starting distance" should be small, say, 2 miles. Let's say your app wants the closest 10 stores. In this example, you would probably find more than 10 Starbucks within 2 miles in 1 InnoDB block in one partition. Even though there is a second SELECT to finish off the query, it would be hitting the same block. Total: One block hit == cheap.
Let's say you start with a 5 mile square. Since there are upwards of 200 Starbucks within a 5-mile radius in some dense cities of the world, that might imply 300 in our "square". That maps to about 4 disk blocks, and a modest amount of CPU to chew through the 300 records. Still not bad.
Now, suppose you are on an ocean liner somewhere in the Pacific. And there is one Starbucks onboard, but you are looking for the nearest 10. If you again start with 2 miles, it will take several iterations to find 10 sites. But let's walk through it anyway. The first probe will hit one partition (maybe 2), and find just one hit. The second probe doubles the width of the square; 4 miles will still give you one hit -- the same hit in the same block, which is now cached, so we won't count it as a second disk I/O. Eventually the square will be wide enough to span multiple partitions. Each extra partition will be one new disk hit to discover no sites in the square. Finally, the square will hit Chile or Hawaii or Fiji and find some more sites, perhaps enough to stop the iteration. Since the main criterion determining the number of disk hits is the number of partitions hit, we do not want to split the world into too many partitions. If there are, say, 40 partitions, then I have just described a case where there might be 20 disk hits.
2-degree partitions might be good for a global table of stores or restaurants. A 5-mile starting distance might be good when filtering for Starbucks. 20 miles might be better for a department store.
Now, let's discuss the 'last' SELECT, wherein the square is expanded by SQRT(2) and the Great Circle formula is used to precisely order the N results. The SQRT(2) covers the case where the N items found were all at the corners of the 'square'; growing the square by this much allows us to catch any other sites that were just outside the old square.
First, note that this 'last' SELECT is hitting the same block(s) that the iteration hit, plus possibly hitting some more blocks. It is hard to predict how many extra blocks might be hit. Here's a pathological case. You are in the middle of a desert; the square grows and grows. Eventually it finds N sites. There is a big city just outside the final square from the iterating. Now the 'last' SELECT kicks in, and it includes lots of sites in this big city. "Lots of sites" --> lots of blocks --> lots of disk hits.
Here's the gist of the stored procedure FindNearest().
Make a guess at how close to "me" to look.
See how many items are in a 'square' around me, after filtering.
If not enough, repeat, doubling the width of the square.
After finding enough, or giving up because we are looking "too far", make one last pass to get all the data, ORDERed and LIMITed.
Note that the loop merely uses 'squares' of lat/lng ranges. This is crude, but it works well with the partitioning and indexing, and avoids calling GCDist (until the last step). In the sample code, I picked 15 miles as the starting value. Adjusting this will have some impact on the Procedure's performance, but the impact will vary with the use case. A rough way to set the starting distance is to guess a value that will find the desired LIMIT about half the time. (This value is hardcoded in the PROCEDURE.)
Parameters passed into FindNearest():
your Latitude -- -90..90 (not scaled -- see hardcoded conversion in PROCEDURE)
your Longitude -- -180..180 (not scaled)
Start distance -- (miles or km) -- see discussion below
Max distance -- in miles or km -- see hardcoded conversion in PROCEDURE
Limit -- maximum number of items to return
Condition -- something to put after 'AND' (more discussion above)
The procedure will find the nearest items (up to Limit) that meet the Condition, but it will give up at Max distance. (If you are at the South Pole, why bother searching very far for the tenth pizza parlor?)
Because of the "scaling", "hardcoding", "Condition", the table name, etc, this PROCEDURE is not truly generic; the code must be modified for each application. Yes, I could have designed it to pass all that stuff in. But what a mess.
The "_start_dist" gives some control over the performance. Making this too small leads to extra iterations; too big leads to more rows being checked. If you choose to tune the Stored Procedure, do the following. "SELECT @iterations" after calling the SP for a number of typical values. If the value is usually 1, then decrease _start_dist. If it is usually 2 or more, then increase it.
Timing: under 10ms for "typical" usage, for any dataset size. Slower for pathological cases (low min distance, high max distance, crossing the dateline, bad filtering, cold cache, etc).
End-cases:
By using GC distance, not Pythagoras, distances are 'correct' even near poles.
Poles -- Even if the "nearest" is almost 360 degrees away (longitude), it can find it.
Dateline -- There is a small, 'contained', piece of code for crossing the Dateline. Example: you are at +179 deg longitude, and the nearest item is at -179.
The procedure returns one resultset: the table's columns (SELECT *), plus dist.
Only rows that meet your Condition and are within Max distance are returned.
At most Limit rows are returned
The rows will be ordered, "closest" first.
"dist" will be in miles or km (based on a hardcoded constant in the SP)
This version is based on scaling "Deg*10000 (MEDIUMINT)".
DELIMITER //
DROP function IF EXISTS GCDist //
CREATE FUNCTION GCDist (
_lat1 DOUBLE, -- Scaled Degrees north for one point
_lon1 DOUBLE, -- Scaled Degrees west for one point
_lat2 DOUBLE, -- other point
_lon2 DOUBLE
) RETURNS DOUBLE
DETERMINISTIC
CONTAINS SQL -- SQL but does not read or write
SQL SECURITY INVOKER -- No special privileges granted
-- Input is a pair of latitudes/longitudes multiplied by 10000.
-- For example, the south pole has latitude -900000.
-- Multiply output by .0069172 to get miles between the two points
-- or by .0111325 to get kilometers
BEGIN
-- Hardcoded constant:
DECLARE _deg2rad DOUBLE DEFAULT PI()/1800000; -- For scaled by 1e4 to MEDIUMINT
DECLARE _rlat1 DOUBLE DEFAULT _deg2rad * _lat1;
DECLARE _rlat2 DOUBLE DEFAULT _deg2rad * _lat2;
-- compute as if earth's radius = 1.0
DECLARE _rlond DOUBLE DEFAULT _deg2rad * (_lon1 - _lon2);
DECLARE _m DOUBLE DEFAULT COS(_rlat2);
DECLARE _x DOUBLE DEFAULT COS(_rlat1) - _m * COS(_rlond);
DECLARE _y DOUBLE DEFAULT _m * SIN(_rlond);
DECLARE _z DOUBLE DEFAULT SIN(_rlat1) - SIN(_rlat2);
DECLARE _n DOUBLE DEFAULT SQRT(
_x * _x +
_y * _y +
_z * _z );
RETURN 2 * ASIN(_n / 2) / _deg2rad; -- again--scaled degrees
END;
//
DELIMITER ;
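As a quick sanity check, here is a call using the scaled coordinates of Memphis and West Memphis from the sample output below; expect roughly 7.6 miles:
SELECT GCDist(351494, -900489, 351464, -901844) * 0.0069172 AS miles;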
DELIMITER //
-- FindNearest (about my 6th approach)
DROP PROCEDURE IF EXISTS FindNearest //
CREATE
PROCEDURE FindNearest (
IN _my_lat DOUBLE, -- Latitude of me [-90..90] (not scaled)
IN _my_lon DOUBLE, -- Longitude [-180..180]
IN _START_dist DOUBLE, -- Starting estimate of how far to search: miles or km
IN _max_dist DOUBLE, -- Limit how far to search: miles or km
IN _limit INT, -- How many items to try to get
IN _condition VARCHAR(1111) -- will be ANDed in a WHERE clause
)
DETERMINISTIC
BEGIN
-- lat and lng are in degrees -90..+90 and -180..+180
-- All computations done in Latitude degrees.
-- Thing to tailor
-- *Locations* -- the table
-- Scaling of lat, lon; here using *10000 in MEDIUMINT
-- Table name
-- miles versus km.
-- Hardcoded constant:
DECLARE _deg2rad DOUBLE DEFAULT PI()/1800000; -- For scaled by 1e4 to MEDIUMINT
-- Cannot use params in PREPARE, so switch to @variables:
-- Hardcoded constant:
SET @my_lat := _my_lat * 10000,
@my_lon := _my_lon * 10000,
@deg2dist := 0.0069172, -- = 69.172/10000 for miles; use 0.0111325 for km *** (mi vs km)
@start_deg := _start_dist / @deg2dist, -- Start with this radius first (eg, 15 miles)
@max_deg := _max_dist / @deg2dist,
@cutoff := @max_deg / SQRT(2), -- (slightly pessimistic)
@dlat := @start_deg, -- note: must stay positive
@lon2lat := COS(_deg2rad * @my_lat),
@iterations := 0; -- just debugging
-- Loop through, expanding search
-- Search a 'square', repeat with bigger square until find enough rows
-- If the initial probe found _limit rows, then probably the first
-- iteration here will find the desired data.
-- Hardcoded table name:
-- This is the "first SELECT":
SET @sql = CONCAT(
"SELECT COUNT(*) INTO @near_ct
FROM Locations
WHERE lat BETWEEN @my_lat - @dlat
AND @my_lat + @dlat -- PARTITION Pruning and bounding box
AND lon BETWEEN @my_lon - @dlon
AND @my_lon + @dlon -- first part of PK
AND ", _condition);
PREPARE _sql FROM @sql;
MainLoop: LOOP
SET @iterations := @iterations + 1;
-- The main probe: Search a 'square'
SET @dlon := ABS(@dlat / @lon2lat); -- good enough for now -- note: must stay positive
-- Hardcoded constants:
SET @dlon := IF(ABS(@my_lat) + @dlat >= 900000, 3600001, @dlon); -- near a Pole
EXECUTE _sql;
IF ( @near_ct >= _limit OR -- Found enough
@dlat >= @cutoff ) THEN -- Give up (too far)
LEAVE MainLoop;
END IF;
-- Expand 'square':
SET @dlat := LEAST(2 * @dlat, @cutoff); -- Double the radius to search
END LOOP MainLoop;
DEALLOCATE PREPARE _sql;
-- Out of loop because found _limit items, or going too far.
-- Expand range by about 1.4 (but not past _max_dist),
-- then fetch details on nearest 10.
-- Hardcoded constant:
SET @dlat := IF( @dlat >= @max_deg OR @dlon >= 1800000,
@max_deg,
GCDist(ABS(@my_lat), @my_lon,
ABS(@my_lat) - @dlat, @my_lon - @dlon) );
-- ABS: go toward equator to find farthest corner (also avoids poles)
-- Dateline: not a problem (see GCDist code)
-- Reach for longitude line at right angle:
-- sin(dlon)*cos(lat) = sin(dlat)
-- Hardcoded constant:
SET @dlon := IFNULL(ASIN(SIN(_deg2rad * @dlat) /
COS(_deg2rad * @my_lat))
/ _deg2rad -- precise
, 3600001); -- must be too near a pole
-- This is the "last SELECT":
-- Hardcoded constants:
IF (ABS(@my_lon) + @dlon < 1800000 OR -- Usual case - not crossing dateline
ABS(@my_lat) + @dlat < 900000) THEN -- crossing pole, so dateline not an issue
-- Hardcoded table name:
SET @sql = CONCAT(
"SELECT *,
@deg2dist * GCDist(@my_lat, @my_lon, lat, lon) AS dist
FROM Locations
WHERE lat BETWEEN @my_lat - @dlat
AND @my_lat + @dlat -- PARTITION Pruning and bounding box
AND lon BETWEEN @my_lon - @dlon
AND @my_lon + @dlon -- first part of PK
AND ", _condition, "
HAVING dist <= ", _max_dist, "
ORDER BY dist
LIMIT ", _limit
);
ELSE
-- Hardcoded constants and table name:
-- Circle crosses dateline, do two SELECTs, one for each side
SET @west_lon := IF(@my_lon < 0, @my_lon, @my_lon - 3600000);
SET @east_lon := @west_lon + 3600000;
-- One of those will be beyond +/- 180; this gets points beyond the dateline
SET @sql = CONCAT(
"( SELECT *,
@deg2dist * GCDist(@my_lat, @west_lon, lat, lon) AS dist
FROM Locations
WHERE lat BETWEEN @my_lat - @dlat
AND @my_lat + @dlat -- PARTITION Pruning and bounding box
AND lon BETWEEN @west_lon - @dlon
AND @west_lon + @dlon -- first part of PK
AND ", _condition, "
HAVING dist <= ", _max_dist, " )
UNION ALL
( SELECT *,
@deg2dist * GCDist(@my_lat, @east_lon, lat, lon) AS dist
FROM Locations
WHERE lat BETWEEN @my_lat - @dlat
AND @my_lat + @dlat -- PARTITION Pruning and bounding box
AND lon BETWEEN @east_lon - @dlon
AND @east_lon + @dlon -- first part of PK
AND ", _condition, "
HAVING dist <= ", _max_dist, " )
ORDER BY dist
LIMIT ", _limit
);
END IF;
PREPARE _sql FROM @sql;
EXECUTE _sql;
DEALLOCATE PREPARE _sql;
END;
//
DELIMITER ;
Sample
Find the 5 cities with non-zero population (out of 3 million) nearest to (+35.15, -90.05). Start with a 10-mile bounding box and give up at 100 miles.
CALL FindNearest(35.15, -90.05, 10, 100, 5, 'population > 0');
+---------+--------+---------+---------+--------------+--------------+-------+------------+--------------+---------------------+------------------------+
| id | lat | lon | country | ascii_city | city | state | population | @gcd_ct := 0 | dist | @gcd_ct := @gcd_ct + 1 |
+---------+--------+---------+---------+--------------+--------------+-------+------------+--------------+---------------------+------------------------+
| 3023545 | 351494 | -900489 | us | memphis | Memphis | TN | 641608 | 0 | 0.07478733189367963 | 3 |
| 2917711 | 351464 | -901844 | us | west memphis | West Memphis | AR | 28065 | 0 | 7.605683607627499 | 2 |
| 2916457 | 352144 | -901964 | us | marion | Marion | AR | 9227 | 0 | 9.3994963998986 | 1 |
| 3020923 | 352044 | -898739 | us | bartlett | Bartlett | TN | 43264 | 0 | 10.643941157860604 | 7 |
| 2974644 | 349889 | -900125 | us | southaven | Southaven | MS | 38578 | 0 | 11.344042217329935 | 5 |
+---------+--------+---------+---------+--------------+--------------+-------+------------+--------------+---------------------+------------------------+
5 rows in set (0.00 sec)
Query OK, 0 rows affected (0.04 sec)
SELECT COUNT(*) FROM ll_table;
+----------+
| COUNT(*) |
+----------+
| 3173958 |
+----------+
1 row in set (5.04 sec)
FLUSH STATUS;
CALL...
SHOW SESSION STATUS LIKE 'Handler%';
+----------------------------+-------+
| Variable_name | Value |
+----------------------------+-------+
| Handler_read_first | 1 |
| Handler_read_key | 3 |
| Handler_read_next | 1307 | -- some index, some tmp, but far less than 3 million.
| Handler_read_rnd | 5 |
| Handler_read_rnd_next | 13 |
| Handler_write | 12 | -- it needed a tmp
+----------------------------+-------+
There is a "Haversine" algorithm that is twice as fast as the GCDist function here. But it has a fatal flaw of sometimes returning NULL for the distance between a point and itself. (This is because of computing a number slightly bigger than 1.0, then trying to take the ACOS of it.)
Rick James graciously allowed us to use this article in the documentation.
Rick James' site has other useful tips, how-tos, optimizations, and debugging tips.
Original source: latlng
This page is licensed: CC BY-SA / Gnu FDL
MariaDB deals with primary keys over nullable columns according to the SQL standards.
Take the following table structure:
CREATE TABLE t1(
c1 INT NOT NULL AUTO_INCREMENT,
c2 INT NULL DEFAULT NULL,
PRIMARY KEY(c1,c2)
);
Column c2 is part of a primary key, and thus it cannot be NULL.
Before MariaDB 10.1.7, MariaDB (as well as versions of MySQL before MySQL 5.7) would silently convert it into a NOT NULL column with a default value of 0.
Since MariaDB 10.1.7, the column is converted to NOT NULL, but without a default value. If we then attempt to insert a record without explicitly setting c2, a warning (or, in strict mode, an error) will be thrown, for example:
INSERT INTO t1() VALUES();
Query OK, 1 row affected, 1 warning (0.00 sec)
Warning (Code 1364): Field 'c2' doesn't have a default value
SELECT * FROM t1;
+----+----+
| c1 | c2 |
+----+----+
| 1 | 0 |
+----+----+
MySQL, since 5.7, will abort such a CREATE TABLE with an error.
The MariaDB 10.1.7 behavior adheres to the SQL 2003 standard.
SQL-2003, Part II, “Foundation” says:
11.7 <unique constraint definition>, Syntax Rules
…
5) If the <unique constraint definition> specifies PRIMARY KEY, then for each <column name> in the explicit or implicit <unique column list> for which NOT NULL is not specified, NOT NULL is implicit in the <column definition>.
Essentially this means that all PRIMARY KEY columns are automatically converted to NOT NULL. Furthermore:
11.5 <default clause>, General Rules
…
3) When a site S is set to its default value,
…
b) If the data descriptor for the site includes a <default option>, then S is set to the value specified by that <default option>.
…
e) Otherwise, S is set to the null value.
There is no concept of “no default value” in the standard. Instead, a column always has an implicit default value of NULL. On insertion it might however fail the NOT NULL constraint. MariaDB and MySQL instead mark such a column as “not having a default value”. The end result is the same — a value must be specified explicitly or an INSERT will fail.
MariaDB since 10.1.7 behaves in a standard-compatible manner: being part of a PRIMARY KEY, the nullable column gets an automatic NOT NULL constraint, and on insertion one must specify a value for such a column. MariaDB before 10.1.7 automatically assigned a default value of 0, which was non-standard. Issuing an error at CREATE TABLE time, as MySQL 5.7 does, is also non-standard.
MDEV-12248 describes an edge-case that may result in replication problems when replicating from a master server before this change to a slave server after this change.
This page is licensed: CC BY-SA / Gnu FDL
This refers to the index_type definition when creating an index, i.e. BTREE, HASH or RTREE.
For more information on general types of indexes, such as primary keys, unique indexes etc, go to Getting Started with Indexes.
BTREE is generally the default index type; for MEMORY tables, HASH is the default. TokuDB uses a particular data structure called fractal trees, which is optimized for data that does not entirely fit in memory.
Understanding the B-tree and hash data structures can help predict how different queries perform on different storage engines that use them in their indexes, particularly for the MEMORY storage engine, which lets you choose B-tree or hash indexes.
B-Tree Index Characteristics
B-tree indexes are used for column comparisons using the >, >=, =, <=, < or BETWEEN operators, as well as for LIKE comparisons that begin with a constant.
For example, the query
SELECT * FROM Employees WHERE First_Name LIKE 'Maria%';
can make use of a B-tree index, while
SELECT * FROM Employees WHERE First_Name LIKE '%aria';
cannot.
B-tree indexes also permit leftmost prefixing for searching of rows.
Hash Index Characteristics
Hash indexes, in contrast, can only be used for equality comparisons, that is, those using the = or <=> operators. They cannot be used for ordering, and they provide no information to the optimizer on how many rows exist between two values.
Hash indexes do not permit leftmost prefixing; only the whole index can be used.
If the number of rows doesn't change, hash indexes occupy a fixed amount of memory, which is lower than the memory occupied by BTREE indexes.
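For example, with the MEMORY storage engine you can request either structure explicitly:
CREATE TABLE lookup_hash (
    id INT NOT NULL,
    INDEX USING HASH (id)     -- good for WHERE id = ?
) ENGINE=MEMORY;
CREATE TABLE lookup_btree (
    id INT NOT NULL,
    INDEX USING BTREE (id)    -- also handles ranges and ordering
) ENGINE=MEMORY;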
RTREE is the index type used for spatial indexes; see SPATIAL for more information.
This page is licensed: CC BY-SA / Gnu FDL
Implement full-text indexes in MariaDB Server for efficient text search. This section guides you through creating and utilizing these indexes to optimize queries on large text datasets.
MariaDB has support for full-text indexing and searching:
A full-text index in MariaDB is an index of type FULLTEXT, and it allows more options when searching for portions of text from a field.
Partitioned tables cannot contain fulltext indexes, even if the storage engine supports them.
A FULLTEXT index definition can be given in the CREATE TABLE statement when a table is created, or added later using ALTER TABLE or CREATE INDEX.
For large data sets, it is much faster to load your data into a table that has no FULLTEXT index and then create the index after that, than to load data into a table that has an existing FULLTEXT index.
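For example, assuming a hypothetical articles table with a body column, either statement adds the index after the data is loaded:
ALTER TABLE articles ADD FULLTEXT INDEX ft_body (body);
-- or, equivalently:
CREATE FULLTEXT INDEX ft_body ON articles (body);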
Full-text searching is performed using MATCH() ... AGAINST syntax. MATCH() takes a comma-separated list that names the columns to be searched. AGAINST takes a string to search for, and an optional modifier that indicates what type of search to perform. The search string must be a literal string, not a variable or a column name.
MATCH (col1,col2,...) AGAINST (expr [search_modifier])
Partial words are excluded.
Words less than 4 (MyISAM) or 3 (InnoDB) characters in length will not be stored in the fulltext index. This value can be adjusted by changing the ft_min_word_length system variable (or, for InnoDB, innodb_ft_min_token_size).
Words longer than 84 characters in length will also not be stored in the fulltext index. This value can be adjusted by changing the ft_max_word_length system variable (or, for InnoDB, innodb_ft_max_token_size).
Stopwords are a list of common words such as "once" or "then" that do not reflect in the search results unless IN BOOLEAN MODE is used. The stopword list for MyISAM/Aria tables and InnoDB tables can differ. See stopwords for details and a full list, as well as for details on how to change the default list.
For MyISAM/Aria fulltext indexes only, if a word appears in more than half the rows, it is also excluded from the results of a fulltext search.
For InnoDB indexes, only committed rows appear - modifications from the current transaction do not apply.
MariaDB calculates a relevance for each result, based on a number of factors, including the number of words in the index, the number of unique words in a row, the total number of words in both the index and the result, and the weight of the word. In English, 'cool' will be weighted less than 'dandy', at least at present! The relevance can be returned as part of a query simply by using the MATCH function in the field list.
IN NATURAL LANGUAGE MODE is the default type of full-text search, and the keywords can be omitted. There are no special operators, and searches consist of one or more comma-separated keywords.
Searches are returned in descending order of relevance.
Boolean search permits the use of a number of special operators:
+ -- The word is mandatory in all rows returned.
- -- The word cannot appear in any row returned.
< -- The word that follows has a lower relevance than other words, although rows containing it will still match.
> -- The word that follows has a higher relevance than other words.
() -- Used to group words into subexpressions.
~ -- The word that follows contributes negatively to the relevance of the row (this is different from the '-' operator, which excludes the word entirely, and from the '<' operator, which still causes the word to contribute positively to the relevance of the row).
* -- The wildcard, indicating zero or more characters. It can only appear at the end of a word.
" -- Anything enclosed in the double quotes is taken as a whole (so you can match phrases, for example).
In boolean mode, searches are not returned in order of relevance, nor does the 50% limit apply. Stopwords and the word minimum and maximum lengths still apply as usual.
A query expansion search is a modification of a natural language search. The search string is used to perform a regular natural language search. Then, words from the most relevant rows returned by the search are added to the search string and the search is done again. The query returns the rows from the second search. The IN NATURAL LANGUAGE MODE WITH QUERY EXPANSION or WITH QUERY EXPANSION modifier specifies a query expansion search. It can be useful when relying on implied knowledge within the data, for example that MariaDB is a database.
Creating a table, and performing a basic search:
CREATE TABLE ft_myisam(copy TEXT,FULLTEXT(copy)) ENGINE=MyISAM;
INSERT INTO ft_myisam(copy) VALUES ('Once upon a time'),
('There was a wicked witch'), ('Who ate everybody up');
SELECT * FROM ft_myisam WHERE MATCH(copy) AGAINST('wicked');
+--------------------------+
| copy |
+--------------------------+
| There was a wicked witch |
+--------------------------+
Multiple words:
SELECT * FROM ft_myisam WHERE MATCH(copy) AGAINST('wicked,witch');
+---------------------------------+
| copy |
+---------------------------------+
| There was a wicked witch |
+---------------------------------+
Since 'Once' is a stopword, no result is returned:
SELECT * FROM ft_myisam WHERE MATCH(copy) AGAINST('Once');
Empty set (0.00 sec)
Inserting the word 'wicked' into more than half the rows excludes it from the results:
INSERT INTO ft_myisam(copy) VALUES ('Once upon a wicked time'),
('There was a wicked wicked witch'), ('Who ate everybody wicked up');
SELECT * FROM ft_myisam WHERE MATCH(copy) AGAINST('wicked');
Empty set (0.00 sec)
Using IN BOOLEAN MODE to overcome the 50% limitation:
SELECT * FROM ft_myisam WHERE MATCH(copy) AGAINST('wicked' IN BOOLEAN MODE);
+---------------------------------+
| copy |
+---------------------------------+
| There was a wicked witch |
| Once upon a wicked time |
| There was a wicked wicked witch |
| Who ate everybody wicked up |
+---------------------------------+
Returning the relevance:
SELECT copy,MATCH(copy) AGAINST('witch') AS relevance
FROM ft_myisam WHERE MATCH(copy) AGAINST('witch');
+---------------------------------+--------------------+
| copy | relevance |
+---------------------------------+--------------------+
| There was a wicked witch | 0.6775632500648499 |
| There was a wicked wicked witch | 0.5031757950782776 |
+---------------------------------+--------------------+
WITH QUERY EXPANSION. In the following example, 'MariaDB' is always associated with the word 'database', so it is returned when query expansion is used, even though not explicitly requested.
CREATE TABLE ft2(copy TEXT,FULLTEXT(copy)) ENGINE=MyISAM;
INSERT INTO ft2(copy) VALUES
('MySQL vs MariaDB database'),
('Oracle vs MariaDB database'),
('PostgreSQL vs MariaDB database'),
('MariaDB overview'),
('Foreign keys'),
('Primary keys'),
('Indexes'),
('Transactions'),
('Triggers');
SELECT * FROM ft2 WHERE MATCH(copy) AGAINST('database');
+--------------------------------+
| copy |
+--------------------------------+
| MySQL vs MariaDB database |
| Oracle vs MariaDB database |
| PostgreSQL vs MariaDB database |
+--------------------------------+
3 rows in set (0.00 sec)
SELECT * FROM ft2 WHERE MATCH(copy) AGAINST('database' WITH QUERY EXPANSION);
+--------------------------------+
| copy |
+--------------------------------+
| MySQL vs MariaDB database |
| Oracle vs MariaDB database |
| PostgreSQL vs MariaDB database |
| MariaDB overview |
+--------------------------------+
4 rows in set (0.00 sec)
Partial word matching with IN BOOLEAN MODE:
SELECT * FROM ft2 WHERE MATCH(copy) AGAINST('Maria*' IN BOOLEAN MODE);
+--------------------------------+
| copy |
+--------------------------------+
| MySQL vs MariaDB database |
| Oracle vs MariaDB database |
| PostgreSQL vs MariaDB database |
| MariaDB overview |
+--------------------------------+
Using boolean operators:
SELECT * FROM ft2 WHERE MATCH(copy) AGAINST('+MariaDB -database'
IN BOOLEAN MODE);
+------------------+
| copy |
+------------------+
| MariaDB overview |
+------------------+
For simpler searches of a substring in text columns, see the LIKE operator.
This page is licensed: CC BY-SA / Gnu FDL
Stopwords are used to provide a list of commonly-used words that can be ignored for the purposes of Full-text-indexes.
Full-text indexes built in MyISAM and InnoDB have different stopword lists by default.
For full-text indexes on MyISAM tables, by default, the list is built from the file storage/myisam/ft_static.c, and searched using the server's character set and collation. The ft_stopword_file system variable allows the default list to be overridden with words from another file, or for stopwords to be ignored altogether.
If the stopword list is changed, any existing full-text indexes need to be rebuilt.
The following is the default list of stopwords, although you should always treat storage/myisam/ft_static.c as the definitive list. See the Fulltext Index Overview for more details, and Full-text-indexes for related articles.
a's
able
about
above
according
accordingly
across
actually
after
afterwards
again
against
ain't
all
allow
allows
almost
alone
along
already
also
although
always
am
among
amongst
an
and
another
any
anybody
anyhow
anyone
anything
anyway
anyways
anywhere
apart
appear
appreciate
appropriate
are
aren't
around
as
aside
ask
asking
associated
at
available
away
awfully
be
became
because
become
becomes
becoming
been
before
beforehand
behind
being
believe
below
beside
besides
best
better
between
beyond
both
brief
but
by
c'mon
c's
came
can
can't
cannot
cant
cause
causes
certain
certainly
changes
clearly
co
com
come
comes
concerning
consequently
consider
considering
contain
containing
contains
corresponding
could
couldn't
course
currently
definitely
described
despite
did
didn't
different
do
does
doesn't
doing
don't
done
down
downwards
during
each
edu
eg
eight
either
else
elsewhere
enough
entirely
especially
et
etc
even
ever
every
everybody
everyone
everything
everywhere
ex
exactly
example
except
far
few
fifth
first
five
followed
following
follows
for
former
formerly
forth
four
from
further
furthermore
get
gets
getting
given
gives
go
goes
going
gone
got
gotten
greetings
had
hadn't
happens
hardly
has
hasn't
have
haven't
having
he
he's
hello
help
hence
her
here
here's
hereafter
hereby
herein
hereupon
hers
herself
hi
him
himself
his
hither
hopefully
how
howbeit
however
i'd
i'll
i'm
i've
ie
if
ignored
immediate
in
inasmuch
inc
indeed
indicate
indicated
indicates
inner
insofar
instead
into
inward
is
isn't
it
it'd
it'll
it's
its
itself
just
keep
keeps
kept
know
knows
known
last
lately
later
latter
latterly
least
less
lest
let
let's
like
liked
likely
little
look
looking
looks
ltd
mainly
many
may
maybe
me
mean
meanwhile
merely
might
more
moreover
most
mostly
much
must
my
myself
name
namely
nd
near
nearly
necessary
need
needs
neither
never
nevertheless
new
next
nine
no
nobody
non
none
noone
nor
normally
not
nothing
novel
now
nowhere
obviously
of
off
often
oh
ok
okay
old
on
once
one
ones
only
onto
or
other
others
otherwise
ought
our
ours
ourselves
out
outside
over
overall
own
particular
particularly
per
perhaps
placed
please
plus
possible
presumably
probably
provides
que
quite
qv
rather
rd
re
really
reasonably
regarding
regardless
regards
relatively
respectively
right
said
same
saw
say
saying
says
second
secondly
see
seeing
seem
seemed
seeming
seems
seen
self
selves
sensible
sent
serious
seriously
seven
several
shall
she
should
shouldn't
since
six
so
some
somebody
somehow
someone
something
sometime
sometimes
somewhat
somewhere
soon
sorry
specified
specify
specifying
still
sub
such
sup
sure
t's
take
taken
tell
tends
th
than
thank
thanks
thanx
that
that's
thats
the
their
theirs
them
themselves
then
thence
there
there's
thereafter
thereby
therefore
therein
theres
thereupon
these
they
they'd
they'll
they're
they've
think
third
this
thorough
thoroughly
those
though
three
through
throughout
thru
thus
to
together
too
took
toward
towards
tried
tries
truly
try
trying
twice
two
un
under
unfortunately
unless
unlikely
until
unto
up
upon
us
use
used
useful
uses
using
usually
value
various
very
via
viz
vs
want
wants
was
wasn't
way
we
we'd
we'll
we're
we've
welcome
well
went
were
weren't
what
what's
whatever
when
whence
whenever
where
where's
whereafter
whereas
whereby
wherein
whereupon
wherever
whether
which
while
whither
who
who's
whoever
whole
whom
whose
why
will
willing
wish
with
within
without
won't
wonder
would
wouldn't
yes
yet
you
you'd
you'll
you're
you've
your
yours
yourself
yourselves
zero
Stopwords on InnoDB full-text indexes are only enabled if the innodb_ft_enable_stopword system variable was set (it is by default) at the time the index was created.
The stopword list is determined as follows:
If the innodb_ft_user_stopword_table system variable is set, that table is used as a stopword list.
If innodb_ft_user_stopword_table is not set, the table set by innodb_ft_server_stopword_table is used.
If neither variable is set, the built-in list is used, which can be viewed by querying the INNODB_FT_DEFAULT_STOPWORD table in the Information Schema.
In the first two cases, the specified table must exist at the time the system variable is set and the full-text index created. It must be an InnoDB table with a single column, a VARCHAR named VALUE.
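For example, the built-in list can be inspected, and a custom list installed, like this (the database and table names here are placeholders):
-- View the built-in InnoDB stopword list:
SELECT * FROM information_schema.INNODB_FT_DEFAULT_STOPWORD;
-- Install a custom list; the table must exist before the FULLTEXT index is created:
CREATE TABLE mydb.my_stopwords (value VARCHAR(30)) ENGINE=InnoDB;
INSERT INTO mydb.my_stopwords (value) VALUES ('the'), ('and'), ('with');
SET GLOBAL innodb_ft_server_stopword_table = 'mydb/my_stopwords';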
The default InnoDB stopword list differs from the default MyISAM list, being much shorter, and contains the following words:
a
about
an
are
as
at
be
by
com
de
en
for
from
how
i
in
is
it
la
of
on
or
that
the
this
to
was
what
when
where
who
will
with
und
the
www
This page is licensed: CC BY-SA / Gnu FDL