MariaDB Internals

Documentation on the internal workings of MariaDB.

Writing Plugins for MariaDB

About

Generally speaking, writing plugins for MariaDB is very similar to writing plugins for MySQL.

Authentication Plugins

See Pluggable Authentication.

Storage Engine Plugins

Storage engines can extend CREATE TABLE syntax with optional index, field, and table attribute clauses. See Extending CREATE TABLE for more information.

See Storage Engine Development.

Information Schema Plugins

Information Schema plugins can have their own FLUSH and SHOW statements. See FLUSH and SHOW for Information Schema plugins.

Encryption Plugins

Encryption plugins in MariaDB are used for the data at rest encryption feature. They are responsible for both key management and for the actual encryption and decryption of data.

Function Plugins

Function plugins add new SQL functions to MariaDB. Unlike the old UDF API, function plugins can do almost anything that a built-in function can.

Plugin Declaration Structure

The MariaDB plugin declaration differs from the MySQL plugin declaration in the following ways:

  1. it has no useless 'reserved' field (the very last field in the MySQL plugin declaration)

  2. it has a 'maturity' declaration

  3. it has a field for a text representation of the version field

MariaDB can load plugins that only have the MySQL plugin declaration, but both PLUGIN_MATURITY and PLUGIN_AUTH_VERSION will show up as 'Unknown' in the INFORMATION_SCHEMA.PLUGINS table.

For compiled-in (not dynamically loaded) plugins, the presence of the MariaDB plugin declaration is mandatory.

Example Plugin Declaration

The MariaDB plugin declaration looks like this:

/* MariaDB plugin declaration */
maria_declare_plugin(example)
{
   MYSQL_STORAGE_ENGINE_PLUGIN, /* the plugin type (see include/mysql/plugin.h) */
   &example_storage_engine_info, /* pointer to type-specific plugin descriptor   */
   "EXAMPLEDB", /* plugin name */
   "John Smith",  /* plugin author */
   "Example of plugin interface", /* the plugin description */
   PLUGIN_LICENSE_GPL, /* the plugin license (see include/mysql/plugin.h) */
   example_init_func,   /* Pointer to plugin initialization function */
   example_deinit_func,  /* Pointer to plugin deinitialization function */
   0x0001 /* Numeric version 0xAABB means AA.BB version */,
   example_status_variables,  /* Status variables */
   example_system_variables,  /* System variables */
   "0.1 example",  /* String version representation */
   MariaDB_PLUGIN_MATURITY_EXPERIMENTAL /* Maturity (see include/mysql/plugin.h)*/
}
maria_declare_plugin_end;

This page is licensed: CC BY-SA / Gnu FDL

Encryption Plugin API

MariaDB's data-at-rest encryption requires the use of a key management and encryption plugin. These plugins are responsible both for the management of encryption keys and for the actual encryption and decryption of data.

MariaDB supports the use of multiple encryption keys. Each encryption key uses a 32-bit integer as a key identifier. If the specific plugin supports key rotation, then encryption keys can also be rotated, which creates a new version of the encryption key.

See Data at Rest Encryption and Encryption Key Management for more information.

Encryption Plugin API

The Encryption plugin API was created to allow a plugin to:

  • implement key management, provide encryption keys to the server on request and change them according to internal policies.

  • implement actual data encryption and decryption with the algorithm defined by the plugin.

This is how the API reflects that:

/* Returned from get_latest_key_version() */
#define ENCRYPTION_KEY_VERSION_INVALID (~(unsigned int)0)
#define ENCRYPTION_KEY_NOT_ENCRYPTED (0)

#define ENCRYPTION_KEY_SYSTEM_DATA 1
#define ENCRYPTION_KEY_TEMPORARY_DATA 2

/* Returned from get_key()  */
#define ENCRYPTION_KEY_BUFFER_TOO_SMALL (100)

#define ENCRYPTION_FLAG_DECRYPT 0
#define ENCRYPTION_FLAG_ENCRYPT 1
#define ENCRYPTION_FLAG_NOPAD 2

struct st_mariadb_encryption {
  int interface_version; /**< version plugin uses */

  /********************* KEY MANAGEMENT ***********************************/

  /**
    Function returning latest key version for a given key id.

    @return A version or ENCRYPTION_KEY_VERSION_INVALID to indicate an error.
  */
  unsigned int (*get_latest_key_version)(unsigned int key_id);

  /**
    Function returning a key for a key version

    @param key_id       The requested key id
    @param version      The requested key version
    @param key          The key will be stored there. Can be NULL -
                        in which case no key will be returned
    @param key_length   in: key buffer size
                        out: the actual length of the key

    This method can be used to query the key length - the required
    buffer size - by passing key==NULL.

    If the buffer size is less than the key length the content of the
    key buffer is undefined (the plugin is free to partially fill it with
    the key data or leave it untouched).

    @return 0 on success, or
            ENCRYPTION_KEY_VERSION_INVALID, ENCRYPTION_KEY_BUFFER_TOO_SMALL
            or any other non-zero number for errors
  */
  unsigned int (*get_key)(unsigned int key_id, unsigned int version,
                          unsigned char *key, unsigned int *key_length);

  /********************* ENCRYPTION **************************************/
  /*
    The caller uses encryption as follows:
      1. Create the encryption context object of the crypt_ctx_size() bytes.
      2. Initialize it with crypt_ctx_init().
      3. Repeat crypt_ctx_update() until there are no more data to encrypt.
      4. Write the remaining output bytes and destroy the context object
         with crypt_ctx_finish().
  */

  /**
    Returns the size of the encryption context object in bytes
  */
  unsigned int (*crypt_ctx_size)(unsigned int key_id, unsigned int key_version);
  /**
    Initializes the encryption context object.
  */
  int (*crypt_ctx_init)(void *ctx, const unsigned char *key, unsigned int klen,
                        const unsigned char *iv, unsigned int ivlen, int flags,
                        unsigned int key_id, unsigned int key_version);
  /**
    Processes (encrypts or decrypts) a chunk of data

    Writes the output to the dst buffer. Note that it might write
    more bytes than were in the input, or fewer, or none at all.
  */
  int (*crypt_ctx_update)(void *ctx, const unsigned char *src,
                          unsigned int slen, unsigned char *dst,
                          unsigned int *dlen);
  /**
    Writes the remaining output bytes and destroys the encryption context

    crypt_ctx_update might have cached part of the output in the context;
    this method flushes that data out.
  */
  int (*crypt_ctx_finish)(void *ctx, unsigned char *dst, unsigned int *dlen);
  /**
    Returns the length of the encrypted data

    It returns the exact length, given only the source length.
    This means the API only supports encryption algorithms where
    the length of the encrypted data depends only on the length of the
    input (in other words, compression is not supported).
  */
  unsigned int (*encrypted_length)(unsigned int slen, unsigned int key_id,
                                   unsigned int key_version);
};

The first method is used for key rotation. A plugin that doesn't support key rotation — for example, file_key_management — can return a fixed version for any valid key id. Note that it still has to return an error for an invalid key id. The version ENCRYPTION_KEY_NOT_ENCRYPTED means that the data should not be encrypted.

The second method is used for key management; the server uses it to retrieve the key corresponding to a specific key identifier and a specific key version.

The last five methods deal with encryption. Note that they take both the key to use and the key identifier and version. This is needed because the server can derive a session-specific, user-specific, or tablespace-specific key from the original encryption key as returned by get_key(), so the key argument doesn't have to match the encryption key as the plugin knows it. On the other hand, the encryption algorithm may depend on the key identifier and version (and in the example_key_management plugin it does), so the plugin needs to know them to be able to encrypt the data.

Encryption methods are optional — if unset (as in the debug_key_management plugin), the server will fall back to AES_CBC.
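
As an illustration, here is a rough sketch of a key-management-only plugin descriptor. The function names, the hard-coded demo key and the single supported key id are hypothetical, the header and interface version macro are taken from include/mysql/plugin_encryption.h, and a real plugin would of course fetch keys from secure storage. Since the encryption methods are left unset, the server falls back to its built-in AES_CBC encryption:

/* Hypothetical sketch: a key-management-only encryption plugin descriptor */
#include <mysql/plugin_encryption.h>
#include <string.h>

#define EXAMPLE_KEY_ID 1

static unsigned int example_get_latest_key_version(unsigned int key_id)
{
  /* no key rotation: a single, never-changing key version */
  return key_id == EXAMPLE_KEY_ID ? 1 : ENCRYPTION_KEY_VERSION_INVALID;
}

static unsigned int example_get_key(unsigned int key_id, unsigned int version,
                                    unsigned char *key, unsigned int *key_length)
{
  static const unsigned char fixed_key[16]= "0123456789abcde"; /* demo only */
  if (key_id != EXAMPLE_KEY_ID || version != 1)
    return ENCRYPTION_KEY_VERSION_INVALID;
  if (*key_length < sizeof(fixed_key))
  {
    *key_length= sizeof(fixed_key);      /* report the required buffer size */
    return ENCRYPTION_KEY_BUFFER_TOO_SMALL;
  }
  if (key)
    memcpy(key, fixed_key, sizeof(fixed_key));
  *key_length= sizeof(fixed_key);
  return 0;
}

static struct st_mariadb_encryption example_descriptor=
{
  MariaDB_ENCRYPTION_INTERFACE_VERSION,
  example_get_latest_key_version,
  example_get_key,
  0, 0, 0, 0, 0   /* encryption methods unset: server falls back to AES_CBC */
};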

Current Encryption Plugins

The MariaDB source tree has four encryption plugins. All these plugins are fairly simple and can serve as good examples of the Encryption plugin API.

file_key_management

It reads encryption keys from a plain-text file. It supports two different encryption algorithms. It supports multiple encryption keys. It does not support key rotation. See the File Key Management Plugin article for more details.

Versions

Version   Status   Introduced
1.0       Stable   MariaDB 10.1.18
1.0       Gamma    MariaDB 10.1.13
1.0       Alpha    MariaDB 10.1.3

aws_key_management

The AWS Key Management plugin uses the Amazon Web Services (AWS) Key Management Service (KMS) to generate and store AES keys on disk, in encrypted form, using the Customer Master Key (CMK) kept in AWS KMS. When MariaDB Server starts, the plugin will decrypt the encrypted keys, using the AWS KMS "Decrypt" API function. MariaDB data will then be encrypted and decrypted using the AES key. It supports multiple encryption keys. It supports key rotation.

See the AWS Key Management Plugin article for more details.

Versions

Version   Status         Introduced
1.0       Stable         MariaDB 10.2.6, MariaDB 10.1.24
1.0       Beta           MariaDB 10.1.18
1.0       Experimental   MariaDB 10.1.13

example_key_management

Uses randomly generated time-based keys, ignores key identifiers, supports key versions and key rotation. Uses AES_ECB and AES_CBC as encryption algorithms and changes them automatically together with key versions.

Versions

Version   Status         Introduced
1.0       Experimental   MariaDB 10.1.3

debug_key_management

The key is generated from the key version; the user manually controls key rotation. Only key identifier 1 is supported, and only AES_CBC is used.

Versions

Version   Status         Introduced
1.0       Experimental   MariaDB 10.1.3

Encryption Service

Encryption is generally needed at a very low level, inside the storage engine. That is, the storage engine needs to support encryption and have access to the encryption and key management functionality. The usual way for a plugin to access some functionality in the server is via a service. In this case the server provides the Encryption Service for storage engines (and other interested plugins) to use. These service functions are hooked directly into the encryption plugin methods (described above).

Service functions are declared as follows:

unsigned int encryption_key_get_latest_version(unsigned int key_id);
unsigned int encryption_key_get(unsigned int key_id, unsigned int key_version,
                                unsigned char *buffer, unsigned int *length);
unsigned int encryption_ctx_size(unsigned int key_id, unsigned int key_version);
int encryption_ctx_init(void *ctx, const unsigned char *key, unsigned int klen,
                        const unsigned char *iv, unsigned int ivlen, int flags,
                        unsigned int key_id, unsigned int key_version);
int encryption_ctx_update(void *ctx, const unsigned char *src,
                          unsigned int slen, unsigned char *dst,
                          unsigned int *dlen);
int encryption_ctx_finish(void *ctx, unsigned char *dst, unsigned int *dlen);
unsigned int encryption_encrypted_length(unsigned int slen, unsigned int key_id,
                                         unsigned int key_version);

There are also convenience helpers to check for a key or key version existence and to encrypt or decrypt a block of data with one function call.

unsigned int encryption_key_id_exists(unsigned int id);
unsigned int encryption_key_version_exists(unsigned int id,
                                           unsigned int version);
int encryption_crypt(const unsigned char *src, unsigned int slen,
                     unsigned char *dst, unsigned int *dlen,
                     const unsigned char *key, unsigned int klen,
                     const unsigned char *iv, unsigned int ivlen, int flags,
                     unsigned int key_id, unsigned int key_version);
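
For illustration, a storage engine might use these helpers roughly as follows. This is a hedged sketch, not code from the server: the function name encrypt_block, the 32-byte key buffer, the caller-supplied IV, and the service header name are assumptions.

/* Hypothetical sketch: encrypt one buffer using the latest key version.
   Returns 0 on success, non-zero on any error. */
#include <mysql/service_encryption.h>

static int encrypt_block(unsigned int key_id,
                         const unsigned char *src, unsigned int slen,
                         unsigned char *dst, unsigned int *dlen,
                         const unsigned char *iv, unsigned int ivlen)
{
  unsigned char key[32];                    /* big enough for an AES-256 key */
  unsigned int klen= sizeof(key);
  unsigned int key_version= encryption_key_get_latest_version(key_id);

  if (key_version == ENCRYPTION_KEY_VERSION_INVALID)
    return 1;
  if (encryption_key_get(key_id, key_version, key, &klen))
    return 1;

  /* the caller must supply a dst buffer of at least this many bytes */
  if (*dlen < encryption_encrypted_length(slen, key_id, key_version))
    return 1;

  return encryption_crypt(src, slen, dst, dlen, key, klen, iv, ivlen,
                          ENCRYPTION_FLAG_ENCRYPT, key_id, key_version);
}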

This page is licensed: CC BY-SA / Gnu FDL

Information Schema plugins: SHOW and FLUSH statements

Information Schema plugins can support SHOW and FLUSH statements.

SHOW

SHOW statements support is enabled automatically. A plugin only needs to specify column names for the SHOW statement in the old_name member of the field declaration structure. Columns with the old_name set to 0 will be hidden from the SHOW statement. If all columns are hidden, the SHOW statement will not work for this plugin.

Note that the SHOW statement is a user-friendly shortcut; it's easier to type and easier to view: if the Information Schema table contains many columns, the SHOW statement is supposed to display only the most important ones and fit nicely on an 80x25 terminal screen.

Consider an example, the LOCALES plugin:

static ST_FIELD_INFO locale_info_locale_fields_info[]=
{
  {"ID", 4, MYSQL_TYPE_LONGLONG, 0, 0, "Id", 0},
  {"NAME", 255, MYSQL_TYPE_STRING, 0, 0, "Name", 0},
  {"DESCRIPTION", 255,  MYSQL_TYPE_STRING, 0, 0, "Description", 0},
  {"MAX_MONTH_NAME_LENGTH", 4, MYSQL_TYPE_LONGLONG, 0, 0, 0, 0},
  {"MAX_DAY_NAME_LENGTH", 4, MYSQL_TYPE_LONGLONG, 0, 0, 0, 0},
  {"DECIMAL_POINT", 2, MYSQL_TYPE_STRING, 0, 0, 0, 0},
  {"THOUSAND_SEP", 2, MYSQL_TYPE_STRING, 0, 0, 0, 0},
  {"ERROR_MESSAGE_LANGUAGE", 64, MYSQL_TYPE_STRING, 0, 0, "Error_Message_Language", 0},
  {0, 0, MYSQL_TYPE_STRING, 0, 0, 0, 0}
};

While the INFORMATION_SCHEMA.LOCALES table has 8 columns, the SHOW LOCALES statement will only display 4 of them:

MariaDB [test]> show locales;
+-----+-------+-------------------------------------+------------------------+
| Id  | Name  | Description                         | Error_Message_Language |
+-----+-------+-------------------------------------+------------------------+
|   0 | en_US | English - United States             | english                |
|   1 | en_GB | English - United Kingdom            | english                |
|   2 | ja_JP | Japanese - Japan                    | japanese               |
|   3 | sv_SE | Swedish - Sweden                    | swedish                |
...

FLUSH

To support the FLUSH statement a plugin must declare the reset_table callback. For example, in the QUERY_RESPONSE_TIME plugin:

static int query_response_time_info_init(void *p)
{
  ST_SCHEMA_TABLE *i_s_query_response_time= (ST_SCHEMA_TABLE *) p;
  i_s_query_response_time->fields_info= query_response_time_fields_info;
  i_s_query_response_time->fill_table= query_response_time_fill;
  i_s_query_response_time->reset_table= query_response_time_flush;
  query_response_time_init();
  return 0;
}

This page is licensed: CC BY-SA / Gnu FDL

Password Validation Plugin API

“Password validation” means ensuring that user passwords meet certain minimal security requirements. A dedicated plugin API allows the creation of password validation plugins that will check user passwords as they are set (in SET PASSWORD and GRANT statements) and either allow or reject them.

SQL-Level Extensions

MariaDB comes with three password validation plugins: the simple_password_check plugin, the cracklib_password_check plugin and the password_reuse_check plugin. They are not enabled by default; use the INSTALL SONAME (or INSTALL PLUGIN) statement to install them.

When at least one password validation plugin is loaded, all new passwords will be validated and password-changing statements will fail if the password does not pass the validation checks. Several password validation plugins can be loaded at the same time; in this case a password must pass the validation checks of all plugins.

Password-Changing Statements

One can use various SQL statements to change a user password:

With Plain Text Password

SET PASSWORD = PASSWORD('plain-text password');
SET PASSWORD FOR `user`@`host` = PASSWORD('plain-text password');
SET PASSWORD = OLD_PASSWORD('plain-text password');
SET PASSWORD FOR `user`@`host` = OLD_PASSWORD('plain-text password');
CREATE USER `user`@`host` IDENTIFIED BY 'plain-text password';
GRANT PRIVILEGES TO `user`@`host` IDENTIFIED BY 'plain-text password';

These statements are subject to password validation. If at least one password validation plugin is loaded, plain-text passwords specified in these statements will be validated.

With Password Hash

SET PASSWORD = 'password hash';
SET PASSWORD FOR `user`@`host` = 'password hash';
CREATE USER `user`@`host` IDENTIFIED BY PASSWORD 'password hash';
CREATE USER `user`@`host` IDENTIFIED VIA mysql_native_password USING 'password hash';
CREATE USER `user`@`host` IDENTIFIED VIA mysql_old_password USING 'password hash';
GRANT PRIVILEGES TO `user`@`host` IDENTIFIED BY PASSWORD 'password hash';
GRANT PRIVILEGES TO `user`@`host` IDENTIFIED VIA mysql_native_password USING 'password hash';
GRANT PRIVILEGES TO `user`@`host` IDENTIFIED VIA mysql_old_password USING 'password hash';

These statements cannot use password validation at all: there is nothing to validate, as the original plain-text password is not available. MariaDB introduces a strict password validation mode, controlled by the strict_password_validation global server variable. If strict password validation is enabled and at least one password validation plugin is loaded, these “unvalidatable” passwords will be rejected. Otherwise they will be accepted. By default strict password validation is enabled (but note that it has no effect if no password validation plugin is loaded).

Examples

Failed password validation:

GRANT SELECT ON *.* to foobar IDENTIFIED BY 'raboof';
ERROR HY000: Your password does not satisfy the current policy requirements

SHOW WARNINGS;
+---------+------+----------------------------------------------------------------+
| Level	  | Code | Message                                                        |
+---------+------+----------------------------------------------------------------+
| Warning | 1819 | cracklib: it is based on your username                         |
| Error	  | 1819 | Your password does not satisfy the current policy requirements |
+---------+------+----------------------------------------------------------------+

Strict password validation:

GRANT SELECT ON *.* TO foo IDENTIFIED BY PASSWORD '2222222222222222';
ERROR HY000: The MariaDB server is running with the --strict-password-validation option so it cannot execute this statement

Plugin API

The password validation plugin API is very simple. A plugin must implement only one method, validate_password(). This method takes two arguments: the user name and the plain-text password. It returns 0 when the password has passed the validation and 1 otherwise.

See also mysql/plugin_password_validation.h and the password validation plugins in plugin/simple_password_check/ and plugin/cracklib_password_check/.
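
For illustration only, a minimal plugin that rejects passwords shorter than eight characters could look roughly like the sketch below. It assumes the two-argument form of validate_password() with MYSQL_CONST_LEX_STRING pointers and descriptor/macro names modelled on mysql/plugin_password_validation.h; check the header of your server version for the exact declaration, as it has changed between versions.

/* Hypothetical sketch of a password validation plugin */
#include <mysql/plugin_password_validation.h>

static int too_short_validate(const MYSQL_CONST_LEX_STRING *username,
                              const MYSQL_CONST_LEX_STRING *password)
{
  /* return 0 to accept the password, 1 to reject it */
  return password->length < 8;
}

static struct st_mariadb_password_validation too_short_descriptor=
{
  MariaDB_PASSWORD_VALIDATION_INTERFACE_VERSION,
  too_short_validate
};

maria_declare_plugin(too_short_check)
{
  MariaDB_PASSWORD_VALIDATION_PLUGIN,
  &too_short_descriptor,
  "too_short_check",
  "Example author",
  "Rejects passwords shorter than 8 characters",
  PLUGIN_LICENSE_GPL,
  NULL,                          /* no init function */
  NULL,                          /* no deinit function */
  0x0100,                        /* numeric version 1.0 */
  NULL,                          /* no status variables */
  NULL,                          /* no system variables */
  "1.0",
  MariaDB_PLUGIN_MATURITY_EXPERIMENTAL
}
maria_declare_plugin_end;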

This page is licensed: CC BY-SA / Gnu FDL

Merging into MariaDB

This category explains how we merge various source trees into MariaDB.

Creating a New Merge Tree

This article is obsolete. We don't use bzr anymore. This howto needs to be rewritten to explain how to create a merge tree in git.

A merge tree in the context of this HOWTO is a tree created specifically to simplify merges of third-party packages into MariaDB. With a merge tree there's a clear separation between upstream changes and our changes, and in most cases bzr can do the merges automatically.

Here's how I created a merge tree for pcre:

  • prerequisites: we already have pcre in the MariaDB tree, together with our changes (otherwise one can trivially create a bzr repository out of a source pcre tarball).

  • create an empty repository:

mkdir pcre
cd pcre
bzr init
  • download pcre source tarball of the same version that we have in the tree — pcre-8.34.tar.bz2

  • unpack it in the same place where the files are in the source tree:

tar xf ~/pcre-8.34.tar.bz2
mv pcre-8.34 pcre
  • Add files to the repository with the same file-ids as in the MariaDB tree!

bzr add --file-ids-from ~/Abk/mysql/10.0
  • All done. Commit and push

bzr commit -m pcre-8.34
bzr push --remember lp:~maria-captains/maria/pcre-mergetree
  • Now null-merge that into your MariaDB tree. Note that for the initial merge you need to specify the revision range 0..1

cd ~/Abk/mysql/10.0
bzr merge -r 0..1 ~/mergetrees/pcre/
  • Remove pcre files that shouldn't be in the MariaDB tree, revert all changes that came from pcre (remember: it's a null-merge, pcre-8.34 is already in the MariaDB tree), rename files in place as needed, resolve conflicts:

bzr rm `bzr added`
bzr revert --no-backup `bzr modified`
bzr resolve pcre
  • Verify that the tree is unchanged and commit:

bzr status
bzr commit -m 'pcre-8.34 mergetree initial merge'
  • Congratulations, your new merge tree is ready!

Now see Merging with a merge tree.

This page is licensed: CC BY-SA / Gnu FDL

Merging from MySQL (obsolete)

Note: This page is obsolete. The information is old, outdated, or otherwise currently incorrect. We are keeping the page for historical reasons only. Do not rely on the information in this article.

Merging from MySQL into MariaDB

Merging code changes from MySQL bzr repository

We generally merge only released versions of MySQL into MariaDB trunk. This is to be able to release a well-working release of MariaDB at any time, without having to worry about including half-finished changes from MySQL. Merges of MySQL revisions in-between MySQL releases can still be done (eg. to reduce the merge task to smaller pieces), but should then be pushed to the maria-5.1-merge branch, not to the main lp:maria branch.

The merge command should thus generally be of this form:

bzr merge -rtag:mysql-<MYSQL-VERSION> lp:mysql-server/5.1

As a general rule, when the MySQL and MariaDB side has changes with the same meaning but differing text, pick the MySQL variant when resolving this conflict. This will help reduce the number of conflicts in subsequent merges.

Buildbot testing

To assist in understanding test failures that arise during the merge, we pull the same revision to be merged into the lp:maria-captains/maria/mysql-5.1-testing tree for buildbot testing. This makes it easy to check whether any failures introduced are also present in the vanilla MySQL tree being merged.

Helpful tags and diffs

To help keep track of merges, we tag the result of a merge:

mariadb-merge-mysql-<MYSQL-VERSION>

For example, when merging MySQL 5.1.39, the commit of the merge would be tagged like this:

mariadb-merge-mysql-5.1.39

The right-hand parent of tag:mariadb-merge-mysql-5.1.39 will be the revision tag:mysql-5.1.39. The left-hand parent will be a revision on the MariaDB trunk.

When merging, these tags and associated revisions can be used to generate some diffs, which are useful when resolving conflicts. Here is a diagram of the history in a merge:

B----maria------A0-------A1
 \              /       /
  \            /       /
   ---mysql---Y0------Y1

Here,

  • 'B' is the base revision when MariaDB was originally branched from MySQL.

  • 'A0' is the result of the last MySQL merge, eg. tag:mariadb-merge-mysql-5.1.38.

  • 'Y0' is the MySQL revision that was last merged, eg. tag:mysql-5.1.38.

  • 'Y1' is the MySQL revision to be merged in the new merge, eg. tag:mysql-5.1.39.

  • 'A1' is the result of committing the new merge, to be tagged as eg. tag:mariadb-merge-mysql-5.1.39.

Then, these diffs can be useful:

  • 'bzr diff -rY0..before:A1' - this is the MariaDB side of changes to be merged.

  • 'bzr diff -rY0..Y1' - this is the MySQL side of changes to be merged.

  • 'bzr diff -rA0..before:A1' - these are the new changes on the MariaDB side to be merged; this can be useful to separate them from other MariaDB-specific changes that have already been resolved against conflicting MySQL changes.

Merging documentation from MySQL source tarballs

The documentation for MySQL is not maintained in the MySQL source bzr repository. Therefore changes to MySQL documentation need to be merged separately.

Only some of the MySQL documentation is available under the GPL (man pages, help tables, installation instructions). Notably the MySQL manual is not available under the GPL, and so is not included in MariaDB in any form.

The man pages, help tables, and installation instruction READMEs are obtained from MySQL source tarballs and manually merged into the MariaDB source trees. The procedure for this is as follows:

There is a tree on Launchpad used for tracking merges:

lp:~maria-captains/maria/mysql-docs-merge-base

(At the time of writing, this procedure only exists for the 5.1 series of MySQL and MariaDB. Additional merge base trees will be needed for other release series.)

This tree must only be used to import new documentation files from new MySQL upstream source tarballs. The procedure to import a new set of files when a new MySQL release happens is as follows:

  • Download the new MySQL source tarball and unpack it, say to mysql-5.1.38

  • run these commands:

T=../mysql-5.1.38
bzr branch lp:~maria-captains/maria/mysql-docs-merge-base
cd mysql-docs-merge-base
for i in Docs/INSTALL-BINARY INSTALL-SOURCE INSTALL-WIN-SOURCE support-files/MacOSX/ReadMe.txt scripts/fill_help_tables.sql $(cd "$T" && find man -type f | grep '\.[0-9]$' | grep -v '^man/ndb_' | grep -v '^man/mysqlman.1$') ; do cp "$T/$i" $i; bzr add $i ; done
bzr commit -m"Imported MySQL documentation files from $T"
bzr push lp:~maria-captains/maria/mysql-docs-merge-base
  • Now do a normal merge from lp:maria-captains/maria/mysql-docs-merge-base into lp:maria

This page is licensed: CC BY-SA / Gnu FDL

Merging New XtraDB Releases (obsolete)

Note: This page is obsolete. The information is old, outdated, or otherwise currently incorrect. We are keeping the page for historical reasons only. Do not rely on the information in this article.

Background

Percona used to maintain XtraDB as a patch series against the InnoDB plugin. This affected how we started merging XtraDB in.

Now Percona maintains a normal source repository on launchpad (lp:percona-server). But we continue to merge the old way to preserve the history of our changes.

Merging

There used to be a lp:percona-xtradb tree, that we were merging from as:

bzr merge lp:percona-xtradb

Now we have to maintain our own XtraDB-5.5 repository to merge from. It is lp:~maria-captains/maria/xtradb-mergetree-5.5. Follow the procedures as described in Merging with a merge tree to merge from it.

This page is licensed: CC BY-SA / Gnu FDL

Merging TokuDB (obsolete)

Note: This page is obsolete. The information is old, outdated, or otherwise currently incorrect. We are keeping the page for historical reasons only. Do not rely on the information in this article.

We merge TokuDB from the Tokutek git repositories on GitHub:

  • tokudb-engine

  • ft-index

Just merge normally at release points (use tag names) and don't forget to update storage/tokudb/CMakeLists.txt, setting TOKUDB_VERSION correctly.

This page is licensed: CC BY-SA / Gnu FDL

Merging with a Merge Tree

If you have a merge tree, you merge into MariaDB as follows:

  1. MariaDB merge trees are in the mergetrees repository. Add it as a new remote:

git remote add merge https://github.com/MariaDB/mergetrees
  2. Check out the branch you want to update and merge, for example:

git checkout merge-innodb-5.6
  3. Delete everything in the branch.

  4. Download the latest released source tarball, unpack it, and copy the files into the repository:

  • for InnoDB-5.6: use the content of the storage/innobase/ of the latest MySQL 5.6 source release tarball.

  • for performance schema 5.6: use storage/perfschema, include/mysql/psi, mysql-test/suite/perfschema, and mysql-test/suite/perfschema_stress from the latest MySQL 5.6 source release tarball.

  • for SphinxSE: use mysqlse/ subdirectory from the latest Sphinx source release tarball.

  • for XtraDB: use the content of the storage/innobase/ of the latest Percona-Server source release tarball (5.5 or 5.6 as appropriate).

  • for pcre: simply unpack the latest pcre release source tarball into the repository, rename pcre-X-XX/ to pcre.

  5. Now git add ., git commit (use the tarball version as a comment), git push

  6. Merge this branch into MariaDB.

  7. Sometimes after a merge, some changes may be needed:

  • for performance schema 5.6: update storage/perfschema/ha_perfschema.cc, plugin version under maria_declare_plugin.

  • for InnoDB-5.6: update storage/innobase/include/univ.i, setting INNODB_VERSION_MAJOR, INNODB_VERSION_MINOR, INNODB_VERSION_BUGFIX to whatever MySQL version you were merging from.

  • for XtraDB-5.5: update storage/xtradb/include/univ.i, setting PERCONA_INNODB_VERSION, INNODB_VERSION_STR to whatever Percona-Server version you were merging from.

  • for XtraDB-5.6: update storage/xtradb/include/univ.i, setting PERCONA_INNODB_VERSION, INNODB_VERSION_MAJOR, INNODB_VERSION_MINOR, INNODB_VERSION_BUGFIX to whatever Percona-Server version you were merging from.

This page is licensed: CC BY-SA / Gnu FDL

Query Optimizer

Delve into the MariaDB Server query optimizer. This section provides internal documentation on how queries are parsed, optimized, and executed for maximum efficiency and performance.

Block-Based Join Algorithms

In the versions of MariaDB/MySQL before 5.3, only one block-based join algorithm was implemented: the Block Nested Loops (BNL) join algorithm, which could only be used for inner joins.

MariaDB 5.3 enhanced the implementation of BNL joins and provides a variety of block-based join algorithms that can be used for inner joins, outer joins, and semi-joins. Block-based join algorithms in MariaDB employ a join buffer to accumulate records of the first join operand before they start looking for matches in the second join operand.

This page documents the various block-based join algorithms.

  • Block Nested Loop (BNL) join

  • Block Nested Loop Hash (BNLH) join

  • Block Index join known as Batch Key Access (BKA) join

  • Block Index Hash join known as Batch Key Access Hash (BKAH) join

Block Nested Loop Join

The major difference in the implementation of BNL join in MariaDB 5.3 compared to earlier versions of MariaDB/MySQL is that the new implementation uses a new format for records written into join buffers. This new format allows:

  • More efficient use of buffer space for null field values and field values of flexible length types (like the varchar type)

  • Support for so-called incremental join buffers saving buffer space for multi-way joins

  • Use of the algorithm for outer joins and semi-joins

How Block Nested Loop Join Works

The algorithm performs a join operation of tables t1 and t2 according to the following schema. The records of the first operand are written into the join buffer one by one until the buffer is full. The records of the second operand are read from the base/temporary table one by one. For every read record r2 of table t2, the join buffer is scanned, and for any record r1 from the buffer such that r2 matches r1, the concatenation of the interesting fields of r1 and r2 is sent to the result stream of the corresponding partial join. To read the records of t2, a full table scan, a full index scan or a range index scan is performed. Only the records that meet the condition pushed to table t2 are checked for a match against the records from the join buffer. When the scan of table t2 is finished, a new portion of the records of the first operand fills the buffer, and matches for these records are looked for in t2. The buffer refills and the scans of the second operand are repeated until the records of the first operand are exhausted. In total, the algorithm scans the second operand as many times as the join buffer is refilled.
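
The following self-contained sketch (plain C over integer arrays, not server code) illustrates the refill-and-scan structure described above: the second operand t2 is scanned once per refill of the join buffer.

/* Simplified illustration of block nested loop join:
   join two integer arrays t1 and t2 on equality, using a small join buffer. */
#include <stdio.h>

#define BUF_SIZE 3   /* how many t1 rows fit into the join buffer */

int main(void)
{
  int t1[]= {1, 2, 3, 4, 5, 2};   /* rows of the first join operand  */
  int t2[]= {2, 4, 6, 2};         /* rows of the second join operand */
  int n1= sizeof(t1)/sizeof(t1[0]);
  int n2= sizeof(t2)/sizeof(t2[0]);
  int buf[BUF_SIZE];              /* the join buffer */

  for (int start= 0; start < n1; start+= BUF_SIZE)     /* one pass per refill */
  {
    int filled= 0;
    for (; filled < BUF_SIZE && start + filled < n1; filled++)
      buf[filled]= t1[start + filled];                 /* fill the join buffer */

    for (int j= 0; j < n2; j++)                        /* scan t2 once per refill */
      for (int i= 0; i < filled; i++)                  /* scan the join buffer */
        if (buf[i] == t2[j])                           /* r2 matches r1 */
          printf("match: t1=%d t2=%d\n", buf[i], t2[j]);
  }
  return 0;
}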

More Efficient Usage of Join Buffer Space

No join buffer space is used for null field values, and field values of flexible-length types (such as varchar) are no longer padded with zeros up to the maximum field size.

Incremental Join Buffers

If we have a query with a join of three tables t1, t2, t3 such that table t1 is joined with table t2 and the result of this join operation is joined with table t3, then two join buffers can be used to execute the query. The first join buffer B1 is used to store the records comprising the interesting fields of table t1, while the second join buffer B2 contains the records with fields from the partial join of t1 and t2. For each record (r1,r2) of the partial join of t1 and t2, the interesting fields of r1 are copied from B1 into B2 together with the interesting fields of r2. One could suggest storing in B2 just a pointer to the position of the r1 fields in B1 together with the interesting fields from t2. So for any record r2 matching the record r1, the buffer B2 would contain a reference to the fields of r1 in B1 and the fields of r2. In this case the buffer B2 is called incremental. Incremental buffers make it possible to avoid copying field values from one buffer into another. They also save a significant amount of buffer space if several matches from t2 are expected for a record from t1.

Using Join Buffers for Simple Outer Joins and Semi-joins

If a join buffer is used for a simple left outer join of tables t1 and t2, t1 LEFT JOIN t2 ON P(t1,t2), then each record r1 stored in the buffer is provided with a match flag. Initially this flag is off. As soon as the first match for r1 is found, the flag is set on. When all matching candidates from t2 have been checked, the records in the join buffer are scanned, and null-complemented rows are generated for those whose match flags are still off. The same match flag is used for records in the join buffer when a semi-join operation t1 SEMI JOIN t2 ON P(t1,t2) is performed with a block-based join algorithm. When this flag is set on for a record r1 in the buffer, no more matches from table t2 are looked for for r1.

Block Hash Join

The block-based hash join algorithm is a new option for join operations in MariaDB 5.3. It can be employed when there are equi-join sub-conditions for the joined tables, in other words, when equalities of the form t2.f1=e1(t1),...,t2.fn=en(t1) can be extracted from the full join condition. Like any block-based join algorithm, this one uses a join buffer filled with the records of the first operand and looks through the records of the second operand to find matches for the records in the buffer.

How Block Hash Join Works

For each refill of the join buffer, the algorithm builds a hash table with keys constructed over the values e1(r1),...,en(r1) of every record r1 in the buffer. Then the records of t2 are looked through. For each record r2 from t2 that satisfies the condition pushed to table t2, a hash key over the fields r2.f1,...,r2.fn is calculated to probe into the hash table. The probing returns those records from the buffer to which r2 matches. As with the BNL join algorithm, this algorithm scans the second operand as many times as the buffer is refilled. Yet when looking for the records that a record from t2 matches, it only has to look through the records of one bucket in the hash table, not through all records in the join buffer as the BNL join algorithm does. The implementation of this algorithm in MariaDB builds the hash table with hash keys at the very end of the join buffer. That's why the number of records written into the buffer at one refill is smaller than for the BNL join algorithm. However, the much shorter list of possible matching candidates usually makes the block hash join algorithm much faster than BNL join.
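
A similar self-contained sketch (again plain C over integer arrays, not server code) shows the difference from BNL join: the join buffer is hashed on the join column, so each record of t2 probes a single bucket instead of scanning the whole buffer.

/* Simplified illustration of block hash join */
#include <stdio.h>

#define BUF_SIZE 4   /* how many t1 rows fit into the join buffer */
#define BUCKETS  8

int main(void)
{
  int t1[]= {1, 2, 3, 4, 5, 2};   /* rows of the first join operand  */
  int t2[]= {2, 4, 6, 2};         /* rows of the second join operand */
  int n1= sizeof(t1)/sizeof(t1[0]);
  int n2= sizeof(t2)/sizeof(t2[0]);
  int buf[BUF_SIZE];              /* the join buffer */
  int next[BUF_SIZE];             /* chains records within one bucket */
  int head[BUCKETS];              /* hash table built over the buffer */

  for (int start= 0; start < n1; start+= BUF_SIZE)     /* one pass per refill */
  {
    int filled= 0;
    for (int b= 0; b < BUCKETS; b++)
      head[b]= -1;

    /* fill the join buffer and hash it on the join column */
    for (; filled < BUF_SIZE && start + filled < n1; filled++)
    {
      int h= (unsigned) t1[start + filled] % BUCKETS;
      buf[filled]= t1[start + filled];
      next[filled]= head[h];
      head[h]= filled;
    }

    /* scan t2 once per refill, probing only one bucket per record */
    for (int j= 0; j < n2; j++)
      for (int i= head[(unsigned) t2[j] % BUCKETS]; i != -1; i= next[i])
        if (buf[i] == t2[j])
          printf("match: t1=%d t2=%d\n", buf[i], t2[j]);
  }
  return 0;
}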

Batch Key Access Join

The Batch Key Access join algorithm performs index look-ups when looking for possible matching candidates provided by the second join operand. In this respect the algorithm behaves like the regular join algorithm. Yet BKA performs the index look-ups for a whole batch of records from the join buffer. For conventional database engines like InnoDB/MyISAM this allows fetching matching candidates in an optimal way. For engines with a remote data store, such as FederatedX/Spider, the algorithm saves on transfers between the MySQL node and the data store nodes.

How Batch Key Access Join Works

The implementation of the algorithm in 5.3 heavily exploits the multi-range-read (MRR) interface and its properties. The interface hides the actual mechanism of fetching possible candidates for matching records from the table to be joined. Like any block-based join algorithm, BKA join repeatedly fills the join buffer with records of the first operand and, for each refill, finds records from the joined table that could match the records in the buffer. To find such records it asks the MRR interface to perform index look-ups with the keys constructed over all records from the buffer. Together with each key the interface receives a return address: a reference to the record over which this key has been constructed. The actual implementation functions of the MRR interface organize and optimize the process of fetching the records of the joined table by the received keys. Each fetched record r2 is appended with the return address associated with the key by which the record has been found, and the result is passed to the BKA join procedure. The procedure takes the record r1 from the join buffer by the return address, joins it with r2 and checks the join condition. If the condition evaluates to true, the joined record is sent to the result stream of the join operation. So for each record returned by the MRR interface only one record from the join buffer is accessed. The number of records from table t2 fetched by the BKA join is exactly the same as for the regular nested loops join algorithm, yet BKA join makes it possible to optimize the order in which the records are fetched.

Interaction of BKA Join With the MRR Functions

BKA join interacts with the MRR functions according to the following contract. The join procedure calls the MRR function multi_range_read_init, passing it callback functions that allow initializing the reading of keys for the records in the join buffer and iterating over these keys. It also passes the parameters of the buffer for MRR needs, allocated within the join buffer space. Then BKA join repeatedly calls the MRR function multi_range_read_next. This function works as an iterator over the records fetched by index look-ups with the keys produced by the callback function set in the call of multi_range_read_init. A call of multi_range_read_next returns the next fetched record through the dedicated record buffer, and the associated reference to the matched record from the join buffer as an output parameter of the function.

Managing Usage of Block-Based Join Algorithms

Currently 4 different types of block-based join algorithms are supported. For a particular join operation each of them can be employed with a regular (flat) join buffer or with an incremental join buffer.

Three optimizer switches - join_cache_incremental, join_cache_hashed, join_cache_bka – and the system variable join_cache_level control which of the 8 variants of the block-based algorithms will be used for join operations.

If join_cache_bka is off then BKA and BKAH join algorithms are not allowed. If join_cache_hashed is off then BNLH and BKAH join algorithms are not allowed. If join_cache_incremental is off then no incremental variants of the block-based join algorithms are allowed.

By default the switches join_cache_incremental, join_cache_hashed, and join_cache_bka are set to 'on'. However, this does not mean that by default all block-based join algorithms are allowed to be used. All of them are allowed only if the system variable join_cache_level is set to 8. This variable can take an integer value in the interval from 0 to 8.

If the value is set to 0, no block-based algorithm can be used for a join operation. The values from 1 to 8 correspond to the following variants of block-based join algorithms:

  • 1 – Flat BNL

  • 2 – Incremental BNL

  • 3 – Flat BNLH

  • 4 – Incremental BNLH

  • 5 – Flat BKA

  • 6 – Incremental BKA

  • 7 – Flat BKAH

  • 8 – Incremental BKAH

If the value of join_cache_level is set to N, any block-based algorithm with a level greater than N is disallowed.

So if join_cache_level is set to 5, neither BKAH nor incremental BKA may be used, while usage of all remaining variants is controlled by the settings of the optimizer switches join_cache_incremental, join_cache_hashed and join_cache_bka.

By default join_cache_level is set to 2; in other words, only flat or incremental BNL is allowed.

By default block-based algorithms can be used only for regular (inner) join operations. To allow them for outer join operations (left outer joins and right outer joins) the optimizer switch outer_join_with_cache has to be set to 'on'. Setting the optimizer switch semijoin_with_cache to 'on' allows using these algorithms for semi-join operations.

Currently, only incremental variants of the block-based join algorithms can be used for nested outer joins and nested semi-joins.

Size of Join Buffers

The maximum size of join buffers used by block-based algorithms is controlled by setting the join_buffer_size system variable. This value must be large enough in order for the join buffer employed for a join operation to contain all relevant fields for at least one joined record.

MariaDB 5.3 introduced the system variable join_buffer_space_limit that limits the total memory used for join buffers in a query.

To optimize the usage of the join buffers within the limit set by join_buffer_space_limit, one should use the optimizer switch optimize_join_buffer_size=on. When this flag is set to 'off' (default until MariaDB 10.4.2), the size of the used join buffer is taken directly from the join_buffer_size system variable. When this flag is set to 'on' (default from MariaDB 10.4.3) then the size of the buffer depends on the estimated number of rows in the partial join whose records are to be stored in the buffer.

Related MRR Settings

To use the BKA/BKAH join algorithms for InnoDB/MyISAM, one must set the optimizer switch mrr to 'on'. When using these algorithms for InnoDB/MyISAM, the overall performance of the join operations can be dramatically improved if the optimizer switch mrr_sort_keys is set to 'on'.

This page is licensed: CC BY-SA / Gnu FDL

Condition Selectivity Computation Internals

This page describes how the MariaDB optimizer computes condition selectivities.

calculate_cond_selectivity_for_table(T)

This function computes selectivity of the restrictions on a certain table T. (TODO: name in the optimizer trace)

Selectivity is computed from

  • selectivities of restrictions on different columns ( histogram data)

  • selectivities of potential range accesses.

Restrictions on different columns, as well as disjoint sets of columns, are considered independent, so their selectivities are multiplied.

Data From Potential Range Accesses

First, we take into account the selectivities of potential range accesses.

If range accesses on indexes IDX1 and IDX2 do not use the same table column (either the indexes do not have common columns, or they do but range accesses do not use them), then they are considered independent, and their selectivities can be multiplied.

However, in general, range accesses on different indexes may use restrictions on the same column and so cannot be considered independent.

In this case, the following approach is used:

We start with selectivity=1, an empty set of range accesses, and an empty set of columns for which we have taken the selectivity into account.

Then, we add range accesses one by one, updating the selectivity value and noting which columns we have taken into account.

Range accesses that use more key parts are added first.

If we are adding a range access $R whose columns do not overlap with the ones already added, we can just multiply the total selectivity by $R's selectivity.

If $R's columns overlap with columns we've got selectivity data for, the process is as follows:

Find the prefix of columns whose selectivity hasn't been taken into account yet. Then, take the selectivity of the whole range access and multiply it by

rec_per_key[i-1]/rec_per_key[i]

(TODO: and this logic is not clear. More, one can produce table->cond_selectivity>1 this way. See MDEV-20740)

Data From Histograms

Then, we want to take into account selectivity data from histograms. Each histogram covers one single column.

If the selectivity of a column hasn't been taken into account on the previous step, we take it into account now by multiplying the selectivity by it. Otherwise, we assume that range access has fully taken the column selectivity into account and do nothing.

The third step is sampling-based selectivity data which is out of the scope of this document.

table_cond_selectivity()

This function computes selectivity of restrictions that can be applied after table T has been joined with the join prefix {T1, ..., Tk}.

There are two cases:

  • Table T uses ref access. In this case, the returned rows match the equalities ref_access is constructed from. Restrictions on just table T are not checked, yet.

  • Table T uses ALL/index/quick select. In this case, restrictions on table T have been applied but cross-table restrictions were not.

This page is licensed: CC BY-SA / Gnu FDL

Extended Keys

Syntax

Enable:

SET optimizer_switch='extended_keys=on';

Disable:

SET optimizer_switch='extended_keys=off';

Description

Extended Keys is an optimization set with the optimizer_switch system variable, which makes use of existing components of InnoDB keys to generate more efficient execution plans. Using these components in many cases allows the server to generate execution plans which employ index-only look-ups. It is enabled by default.

Extended keys can be used with:

  • ref and eq-ref accesses

  • range scans

  • index-merge scans

  • loose scans

  • min/max optimizations

Examples

An example of how extended keys could be employed for a query built over a DBT-3/TPC-H database with one added index defined on p_retailprice:

SELECT o_orderkey
FROM  part, lineitem, orders
WHERE p_retailprice > 2095 AND o_orderdate='1992-07-01'
      AND o_orderkey=l_orderkey AND p_partkey=l_partkey;

The above query asks for the orderkeys of the orders placed on 1992-07-01 which contain parts with a retail price greater than $2095.

Using Extended Keys, the query could be executed by the following execution plan:

  1. Scan the entries of the index i_p_retailprice where p_retailprice>2095 and read p_partkey values from the extended keys.

  2. For each value p_partkey make an index look-up into the table lineitem employing index i_l_partkey and fetch the values of l_orderkey from the extended index.

  3. For each fetched value of l_orderkey, append it to the date '1992-07-01' and use the resulting key for an index look-up by index i_o_orderdate to fetch the values of o_orderkey from the found index entries.

None of the access methods in this plan touch table rows, which results in much better performance.

Here is the explain output for the above query:

MariaDB [dbt3sf10]> EXPLAIN
   -> SELECT o_orderkey
   ->   FROM part, lineitem, orders
   ->   WHERE p_retailprice > 2095 AND o_orderdate='1992-07-01'
   ->         AND o_orderkey=l_orderkey AND p_partkey=l_partkey\G
*************************** 1. row ***************************
          id: 1
 select_type: SIMPLE
       table: part
        type: range
possible_keys: PRIMARY,i_p_retailprice
         key: i_p_retailprice
     key_len: 9
         ref: NULL
        rows: 100
       Extra: Using where; Using index
*************************** 2. row ***************************
          id: 1
 select_type: SIMPLE
       table: lineitem
        type: ref
possible_keys: PRIMARY,i_l_suppkey_partkey,i_l_partkey,i_l_orderkey,i_l_orderkey_quantity
         key: i_l_partkey
     key_len: 5
         ref: dbt3sf10.part.p_partkey
        rows: 15
       Extra: Using index
*************************** 3. row ***************************
          id: 1
 select_type: SIMPLE
       table: orders
        type: ref
possible_keys: PRIMARY,i_o_orderdate
         key: i_o_orderdate
     key_len: 8
         ref: const,dbt3sf10.lineitem.l_orderkey
        rows: 1
       Extra: Using index
3 rows in set (0.00 sec)

See Also

  • MWL#247

  • Blog post about the development of this feature

This page is licensed: CC BY-SA / Gnu FDL

MIN/MAX optimization

Min/Max optimization without GROUP BY

MariaDB and MySQL can optimize the MIN() and MAX() functions to be a single row lookup in the following cases:

  • There is only one table used in the SELECT.

  • You only have constants, MIN() and MAX() in the SELECT part.

  • The argument to MIN() and MAX() is a simple column reference that is part of a key.

  • There is no WHERE clause or the WHERE is used with a constant for all prefix parts of the key before the argument to MIN()/MAX().

  • If the argument is used in the WHERE clause, it can be compared to a constant with < or <= in case of MAX() and with > or >= in case of MIN().

Here are some examples to clarify this. In these cases we assume there is an index on columns (a,b,c):

SELECT MIN(a),MAX(a) FROM t1
SELECT MIN(b) FROM t1 WHERE a=const
SELECT MIN(b),MAX(b) FROM t1 WHERE a=const
SELECT MAX(c) FROM t1 WHERE a=const AND b=const
SELECT MAX(b) FROM t1 WHERE a=const AND b<const
SELECT MIN(b) FROM t1 WHERE a=const AND b>const
SELECT MIN(b) FROM t1 WHERE a=const AND b BETWEEN const AND const
SELECT MAX(b) FROM t1 WHERE a=const AND b BETWEEN const AND const
  • Instead of a=const the condition a IS NULL can be used.

The above optimization also works for subqueries:

SELECT x FROM t2 WHERE y= (SELECT MIN(b) FROM t1 WHERE a=const)

Cross joins, where there is no join condition for a table, can also be optimized to a few key lookups:

SELECT MIN(t1.key_part_1), MAX(t2.key_part_1) FROM t1, t2

Min/Max optimization with GROUP BY

MariaDB and MySQL support loose index scan, which can speed up certain GROUP BY queries. The basic idea is that when scanning a BTREE index (the most common index type for the MariaDB storage engines) we can jump over identical values for any prefix of a key and thus speed up the scan significantly.

Loose scan is possible in the following cases:

  • The query uses only one table.

  • The GROUP BY part only uses indexed columns in the same order as in the index.

  • The only aggregate functions in the SELECT part are MIN() and MAX(), and all of them use the same column, which is the next index part after the used GROUP BY columns.

  • Partial indexed columns cannot be used (like only indexing 10 characters of a VARCHAR(20) column).

Loose scan will apply for your query if EXPLAIN shows Using index for group-by in the Extra column. In this case the optimizer will do only one extra row fetch to calculate the value for MIN() or MAX() for every unique key prefix.

The following examples assume that the table t1 has an index on (a,b,c).

SELECT a, b, MIN(c),MAX(c) FROM t1 GROUP BY a,b

See also

  • MIN()

  • MAX()

  • MySQL manual on loose index scans

This page is licensed: CC BY-SA / Gnu FDL

Notes When an Index Cannot Be Used

MariaDB starting with 10.6.16

This is a new note added in 10.6.16.

Warning About Incompatible Index Comparison

A frequent mistake database developers make is to compare an indexed column with another column that is not compatible with the indexed column. For example, comparing string columns with number columns, or using incompatible character sets or collations.

Because of this, we have introduced notes (low severity warnings) for when an indexed column cannot use the index to look up rows.

The warnings are of different types:

If one compares an indexed column with a value of a different type, one will get a warning like the following:

Note   1105    Cannot use key `PRIMARY` part[0] for lookup: `test`.`t1`.`id` of 
  type `char` = "1" of type `bigint`

If one compares indexed character columns with a value of an incompatible collation, one will get a warning like the following:

Note   1105    Cannot use key `s2` part[0] for lookup: `test`.`t1`.`s2` of 
  collation `latin1_swedish_ci` = "'a' collate latin1_german1_ci" of collation `latin1_german1_ci`

Note that in MariaDB 10.6 to MariaDB 11.3 we will use the error 1105 (Unknown error), as we cannot add a new error code in a GA version. In MariaDB 11.4 we will change this to be a unique error code.

Enabling the Note

By default, the warning is only shown when executing EXPLAIN on a query. To enable for all queries, use the option/server variable:

In config file:
--note-verbosity=all

As a server variable:
@@note_verbosity="all";

note_verbosity describes which note categories one wants to get notes for. Be aware that if the old sql_notes variable is 0, one will not get any notes.

It can have one or more of the following options:

Option          Description
basic           All old notes.
unusable_keys   Give warnings for unusable keys for SELECT, DELETE and UPDATE.
explain         Give warnings about unusable keys for EXPLAIN.

One can also set note_verbosity to the value of all to set all options.

Enabling Warnings and Notes for the Slow Query Log

One can also get the note about incompatible keys in the slow query log by adding the warnings option to the log_slow_verbosity option/variable. It is automatically enabled if one uses log_slow_verbosity=all.

In config file:
--log-slow-verbosity=warnings

As a server variable:
@@log_slow_verbosity="all";

See Also

  • Type conversions

This page is licensed: CC BY-SA / Gnu FDL

Optimizer Debugging With GDB

Some useful things for debugging optimizer code.

Useful Print Functions

  • dbug_print_item() prints the contents of an Item object into a buffer and returns pointer to it.

  • dbug_print_sel_arg() prints an individual SEL_ARG object (NOT the whole graph or tree) and returns pointer to the buffer holding the printout.

  • dbug_print_table_row prints the current row buffer of the given table.

  • There are more dbug_print_XX functions for various data structures

Printing the Optimizer Trace

The optimizer trace is collected as plain text. One can print the contents of the trace collected so far as follows:

printf "%s\n", thd->opt_trace->current_trace->current_json->output.str.Ptr

Starting from 11.0, there is a dbug_print_opt_trace() function which one can call from gdb.

Printing Current Partial Join Prefix

best_access_path() is a function that adds another table to the join prefix.

When in or around that function, the following can be useful:

A macro to print the join prefix already constructed:

define bap_print_prefix
  set $i=0
  while ($i < idx)
    p join->positions[$i++].table->table->alias.Ptr
  end
end

Other Settings

  • May need to set innodb_fatal_semaphore_wait_threshold to be high enough?

Other development topics

mtr tests

In order to get innodb to have stable statistics, use this in the test:

--source include/innodb_stable_estimates.inc

Also consider set global innodb_max_purge_lag_wait=0; (TODO: how these two compare?)

Coding guidelines

Run this to get your patch auto-formatted according to our coding style in .clang-format:

git diff -U0 --no-color --relative HEAD^ | clang-format-diff -p1 -i

See Also

  • How to collect large optimizer traces

This page is licensed: CC BY-SA / Gnu FDL

Optimizer Development

Notes about Optimizer Development

mysql-test

InnoDB Estimates are unstable

This is caused by the background statistics update. It may cause the numbers in EXPLAIN output to be off-by-one. It may also cause different query plans to be picked on different runs (see e.g. MDEV-32901 for details).

On a per-table basis, one can use STATS_AUTO_RECALC=0 as table parameter.

On a per-file basis, one can use this include:

--source mysql-test/include/innodb_stable_estimates.inc

Run mtr with Optimizer Trace enabled

TODO

This page is licensed: CC BY-SA / Gnu FDL

optimizer_max_sel_arg_weight

Basics

As mentioned in the Range Optimizer, ranges on multiple key parts can create a combinatorial amount of ranges.

The optimizer_max_sel_arg_weight setting limits the number of generated ranges by dropping restrictions on higher key parts if the number of ranges becomes too high.

(Note that there is also optimizer_max_sel_args, which limits the number of intermediary SEL_ARG objects that can be created. This is a different limitation.)

Combinatorial number of ranges

Let's reuse the example from the Range Optimizer page.

CREATE TABLE t2 (
  keypart1 INT,
  keypart2 VARCHAR(100),
  keypart3 INT,
  INDEX idx(keypart1, keypart2, keypart3)
);
SELECT * FROM t2 
WHERE
  keypart1 IN (1,2,3,4,5,6,7,8,9,10) AND keypart2 IN ('a','b', 'c') AND keypart3 IN (1,2,3,4);

The range optimizer will produce 10 * 3 * 4 = 120 ranges.

SELECT * FROM information_schema.optimizer_trace\G
//...
                    "range_scan_alternatives": [
                      {
                        "index": "idx",
                        "ranges": [
                          "(1,a,1) <= (keypart1,keypart2,keypart3) <= (1,a,1)",
                          "(1,a,2) <= (keypart1,keypart2,keypart3) <= (1,a,2)",
                          "(1,a,3) <= (keypart1,keypart2,keypart3) <= (1,a,3)",
                          "(1,a,4) <= (keypart1,keypart2,keypart3) <= (1,a,4)",
                          "(1,b,1) <= (keypart1,keypart2,keypart3) <= (1,b,1)",
                          //... # 114 lines omitted ...
                           "(3,b,3) <= (keypart1,keypart2,keypart3) <= (3,b,3)",
                          "(3,b,4) <= (keypart1,keypart2,keypart3) <= (3,b,4)",
                         ],

This number is fine, but if your IN-lists contain thousands of elements, the number of ranges can be in the millions, which may cause excessive CPU and memory usage. (Note: in some cases this is avoided because the IN-predicate is converted into a subquery, but there are cases when that is not done.)

SEL_ARG graph

Internally, the Range Optimizer builds this kind of graph:

Vertical black lines connect adjacent "intervals" on the same key part. Red lines connect a key part to a subsequent key part.

To produce ranges, one walks this graph starting from the leftmost corner. Walking right "attaches" the ranges on one key part to the next to form multi-part ranges. Note that not all combinations produce multi-part ranges, though.

Walking top-to-bottom produces adjacent ranges.

Weight of SEL_ARG graph

How do we limit the number of ranges? We should remove the parts of SEL_ARG graph that describe ranges on big key parts. That way, we can still build ranges, although we will build fewer ranges that may contain more rows.

Due to the way the graph is constructed, we cannot tell how many ranges it would produce, so we introduce a parameter "weight" which is easy to compute and is roughly proportional to the number of ranges we estimate to produce.

Here is how the weight is computed:

  • The weight of subgraph3 is just the number of nodes, 4.

  • The weight of subgraph2 is the number of nodes for keypart2 (3), plus the weight of subgraph3 multiplied by 3, since there are 3 references to it.

  • The weight of subgraph1 is the number of nodes for keypart1 (10) plus the weight of subgraph2 multiplied by 10 since there are 10 references to it.

Here the total weight is 160 which has the same order of magnitude as the number of ranges.

SEL_ARG graphs are constructed for all parts of the WHERE clause and are AND/ORed according to the AND/OR structure of the WHERE clause (after normalization). If the optimizer notices that it has produced a SEL_ARG graph that exceeds the maximum weight, the parts of the graph describing higher key parts are removed until the weight is within the limit.

Example of effect of limiting weight

Continuing with our example:

-- This is very low, don't use in production:
SET @@optimizer_max_sel_arg_weight=50;
SELECT * FROM t2 WHERE keypart1 IN (1,2,3,4,5,6,7,8,9,10) AND keypart2 IN ('a','b', 'c') AND keypart3 IN (1,2,3,4);
SELECT * FROM information_schema.optimizer_trace\G

shows

"range_scan_alternatives": [
                      {
                        "index": "idx",
                        "ranges": [
                          "(1,a) <= (keypart1,keypart2) <= (1,a)",
                          "(1,b) <= (keypart1,keypart2) <= (1,b)",
                         // (30 lines in total)
                          "(10,b) <= (keypart1,keypart2) <= (10,b)",
                          "(10,c) <= (keypart1,keypart2) <= (10,c)"
                        ],

One can see that now the range list is much smaller, 30 lines instead of 120. This was achieved by discarding the restrictions on keypart3.

This page is licensed: CC BY-SA / Gnu FDL

Range Optimizer

The range optimizer is the part of the MariaDB (and MySQL) optimizer that takes as input

  • the table and index definition(s)

  • the WHERE condition (or ON expression if the table is inner in an outer join)

and constructs a list of ranges one can scan in an index to read the rows that match the WHERE condition, or a superset of it. It can also construct an "index_merge" plan, where ranges from two or more indexes are needed to compute a union (formed from condition disjunctions) and/or an intersection (formed from condition conjunctions).

Basic example

Consider a table

CREATE TABLE t1 (
  key1 INT,
  key2 VARCHAR(100),
  ...
  INDEX(key1),
  INDEX(key2)
);

and query

-- Turn on optimizer trace so we can see the ranges:
SET optimizer_trace=1; 
EXPLAIN SELECT * FROM t1 WHERE key1<10 AND key2='foo';
SELECT * FROM information_schema.optimizer_trace\G

This shows the ranges that the optimizer was able to infer:

"range_scan_alternatives": [
                      {
                        "index": "key1",
                        "ranges": ["(NULL) < (key1) < (10)"],
                        ...
                      },
                      {
                        "index": "key2",
                        "ranges": ["(foo) <= (key2) <= (foo)"],
                        ...
                      }
                    ],

Ranges are non-overlapping

The range optimizer produces a list of ranges without overlaps. Consider this WHERE clause, where the conditions do overlap:

SELECT * FROM t1 WHERE (key1 BETWEEN 10 AND 20  AND key1 > 14)  OR key1 IN (17, 22, 33);
SELECT * FROM information_schema.optimizer_trace\G

We get

...
                  "analyzing_range_alternatives": {
                    "range_scan_alternatives": [
                      {
                        "index": "key1",
                        "ranges": [
                          "(14) < (key1) <= (20)",
                          "(22) <= (key1) <= (22)",
                          "(33) <= (key1) <= (33)"
                        ],

Ranges for multi-part indexes

Let's consider an index with multiple key parts. (Note: due to the Extended Keys optimization, an index may have more key parts than you've explicitly defined.)

CREATE TABLE t2 (
  keypart1 INT,
  keypart2 VARCHAR(100),
  keypart3 INT,
  INDEX idx(keypart1, keypart2, keypart3)
);

The range optimizer will generate a finite set of ranges over the lexicographical ordering of (keypart1, keypart2, ...).

Example:

SELECT * FROM t2 WHERE keypart1 IN (1,2,3) AND keypart2 BETWEEN 'bar' AND 'foo';
SELECT * FROM information_schema.optimizer_trace\G

gives

"range_scan_alternatives": [
                      {
                        "index": "idx",
                        "ranges": [
                          "(1,bar) <= (keypart1,keypart2) <= (1,foo)",
                          "(2,bar) <= (keypart1,keypart2) <= (2,foo)",
                          "(3,bar) <= (keypart1,keypart2) <= (3,foo)"
                        ],

Compare with a similar query:

SELECT * FROM t2 WHERE keypart1 BETWEEN 1 AND 3 AND keypart2 BETWEEN 'bar' AND 'foo';
SELECT * FROM information_schema.optimizer_trace\G

this will generate just one bigger range:

"range_scan_alternatives": [
                      {
                        "index": "idx",
                        "ranges": ["(1,bar) <= (keypart1,keypart2) <= (3,foo)"],
                        ...

which includes, for example, rows like (keypart1,keypart2)=(1,zzz). One could argue that the optimizer should be able to figure out that for the condition keypart1 BETWEEN 1 AND 3 the only possible values are 1, 2 and 3, but this is not implemented.

Not all comparisons produce ranges

Note that some keypart comparisons produce multi-part ranges while some do not. The governing rule is the same: the conditions together must produce an interval (or a finite number of intervals) in lexicographic order in the index.

Some examples:

WHERE keypart1<= 10 AND keypart2<'foo'

can use the second keypart:

"ranges": ["(NULL) < (keypart1,keypart2) < (10,foo)"],

but the interval will still include rows like (keypart1, keypart2) = (8, 'zzzz').

A non-inclusive bound on keypart1 prevents any use of keypart2. For

WHERE keypart1< 10 AND keypart2<'foo';

we get

"ranges": ["(NULL) < (keypart1) < (10)"]

Non-agreeing comparisons (less-than and greater-than) do not produce a multi-part range:

WHERE keypart1<= 10 AND keypart2>'foo';

gives

"ranges": ["(NULL) < (keypart1) <= (10)"],

A "hole" in the key parts means that higher key parts cannot be used:

WHERE keypart1= 10 AND keypart3<='foo';

gives

"ranges": ["(10) <= (keypart1) <= (10)"]

Combinatorial blow-ups

For multi-part keys, the range analyzer can produce ranges that are "tight", that is, they only include rows that will match the WHERE condition. On the other hand, some SQL constructs can produce a very large (combinatorial) number of ranges. Consider the query

SELECT * FROM t2 WHERE keypart1 IN (1,2,3,4) AND keypart2 IN ('a','b', 'c')

The two IN-lists produce 4 * 3 = 12 ranges:

"range_scan_alternatives": [
                      {
                        "index": "idx",
                        "ranges": [
                          "(1,a) <= (keypart1,keypart2) <= (1,a)",
                          "(1,b) <= (keypart1,keypart2) <= (1,b)",
                          "(1,c) <= (keypart1,keypart2) <= (1,c)",
                          "(2,a) <= (keypart1,keypart2) <= (2,a)",
                          "(2,b) <= (keypart1,keypart2) <= (2,b)",
                          "(2,c) <= (keypart1,keypart2) <= (2,c)",
                          "(3,a) <= (keypart1,keypart2) <= (3,a)",
                          "(3,b) <= (keypart1,keypart2) <= (3,b)",
                          "(3,c) <= (keypart1,keypart2) <= (3,c)",
                          "(4,a) <= (keypart1,keypart2) <= (4,a)",
                          "(4,b) <= (keypart1,keypart2) <= (4,b)",
                          "(4,c) <= (keypart1,keypart2) <= (4,c)"
                        ],

If one adds AND keypart3 IN (1,2,3,4,5), the number of ranges becomes 4 * 3 * 5 = 60, and so forth. See optimizer_max_sel_arg_weight on how to combat this.

This page is licensed: CC BY-SA / Gnu FDL

The Optimizer Cost Model from MariaDB 11.0

Background

Before MariaDB 11.0, the MariaDB Query optimizer used a 'basic cost' of 1 for:

  • One disk access

  • Fetching a key

  • Fetching a row based on the rowid (= unique row identifier) from the key

There were some smaller costs:

  • filter lookup: 0.01

  • Examining a where clause: 0.20

  • Comparing two keys: 0.05

  • Fetching a row through an index from a temporary memory table: 0.05

The above costs are reasonable for finding out the best index to use. However, they were not good for deciding whether to use a table scan, an index scan or a range lookup, and the costs for the different engines were not properly calibrated.

New Cost Model

In MariaDB 11.0 we have fixed the above shortcomings by changing the basic cost for 'storage engine operations' to be 1 millisecond. This means that for most queries the query cost (LAST_QUERY_COST) should be close (or at least proportional) to the time the server is spending in the storage engine + join_cache + sorting.

Note that the user-level costs are in microseconds (milliseconds would require so many leading zeros that values would be hard to compare).

The engine costs have also been separated into smaller parts to make things more accurate.

The "disk"-read cost now assumes a mid-level SSD with a 400MB/second read speed. This can be changed by the end user by modifying OPTIMIZER_DISK_READ_COST.

All engine-specific costs are visible in information_schema.optimizer_costs.

For example:

The "default" cost for an engine can be found with:

SELECT * FROM information_schema.optimizer_costs WHERE engine="DEFAULT"\G
*************************** 1. row ***************************
                         ENGINE: DEFAULT
       OPTIMIZER_DISK_READ_COST: 10.240000
OPTIMIZER_INDEX_BLOCK_COPY_COST: 0.035600
     OPTIMIZER_KEY_COMPARE_COST: 0.011361
        OPTIMIZER_KEY_COPY_COST: 0.015685
      OPTIMIZER_KEY_LOOKUP_COST: 0.435777
   OPTIMIZER_KEY_NEXT_FIND_COST: 0.082347
      OPTIMIZER_DISK_READ_RATIO: 0.020000
        OPTIMIZER_ROW_COPY_COST: 0.060866
      OPTIMIZER_ROW_LOOKUP_COST: 0.130839
   OPTIMIZER_ROW_NEXT_FIND_COST: 0.045916
   OPTIMIZER_ROWID_COMPARE_COST: 0.002653
      OPTIMIZER_ROWID_COPY_COST: 0.002653

The above costs are the default (base) for all engines and should be reasonable for engines that do not have a clustered index (like MyISAM, Aria, etc.). The default costs can be changed by specifying just the cost as an argument, like mariadbd --optimizer-disk-read-cost=20, or from SQL: set global optimizer_disk_read_cost=20. An engine-specific cost can be tuned by prefixing the cost with the engine name, like set global innodb.optimizer_disk_read_cost=20.
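
For example, from SQL (mirroring the options just mentioned):

-- Change the default (base) cost used by all engines:
SET GLOBAL optimizer_disk_read_cost=20;
-- Change the cost for the InnoDB engine only:
SET GLOBAL innodb.optimizer_disk_read_cost=20;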

An engine can tune some or all of the above costs in the storage engine interface. Here follow the costs for the InnoDB storage engine:

SELECT * FROM information_schema.optimizer_costs WHERE engine="innodb"\G
*************************** 1. row ***************************
                         ENGINE: InnoDB
       OPTIMIZER_DISK_READ_COST: 10.240000
OPTIMIZER_INDEX_BLOCK_COPY_COST: 0.035600
     OPTIMIZER_KEY_COMPARE_COST: 0.011361
        OPTIMIZER_KEY_COPY_COST: 0.015685
      OPTIMIZER_KEY_LOOKUP_COST: 0.791120
   OPTIMIZER_KEY_NEXT_FIND_COST: 0.099000
      OPTIMIZER_DISK_READ_RATIO: 0.020000
        OPTIMIZER_ROW_COPY_COST: 0.060870
      OPTIMIZER_ROW_LOOKUP_COST: 0.765970
   OPTIMIZER_ROW_NEXT_FIND_COST: 0.070130
   OPTIMIZER_ROWID_COMPARE_COST: 0.002653
      OPTIMIZER_ROWID_COPY_COST: 0.002653

As can be seen, the ROW_LOOKUP_COST is close to the KEY_LOOKUP_COST, because InnoDB has a clustered primary key index and uses it to find the row from a secondary index.

Some engines, like HEAP/MEMORY, implement their own cost functions because different indexes in the same engine can have different costs. This is why some of the cost numbers for these engines are 0.
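
One can check this by looking at the MEMORY engine's row in the same table (a sketch following the pattern above; output not shown here):

SELECT * FROM information_schema.optimizer_costs WHERE engine="MEMORY"\G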

There are also some SQL level costs that are independent of the storage engine:

SELECT * FROM information_schema.global_variables WHERE variable_name LIKE "%WHERE%cost%" OR variable_name LIKE "%scan%cost%";
+---------------------------+----------------+
| VARIABLE_NAME             | VARIABLE_VALUE |
+---------------------------+----------------+
| OPTIMIZER_SCAN_SETUP_COST | 10.000000      |
| OPTIMIZER_WHERE_COST      | 0.032000       |
+---------------------------+----------------+

Description of the Different Cost Variables

Time and cost are quite interchangeable in the new cost model. Below we use "cost" for most things, except for OPTIMIZER_DISK_READ_COST, where one should use published/tested timings for the SSD/hard disk if one wants to change the value.

Each cost is either an engine cost ("Engine") or an SQL-level cost ("Session"):

  • OPTIMIZER_DISK_READ_COST (Engine) - Time in microseconds to read a 4K block from a disk/SSD. The default is set for a 400MB/second SSD.

  • OPTIMIZER_INDEX_BLOCK_COPY_COST (Engine) - Cost to lock and copy a block from the global cache to a local cache. This cost is added for every block accessed, independent of whether the block is cached or not.

  • OPTIMIZER_KEY_COMPARE_COST (Engine) - Cost to compare two keys.

  • OPTIMIZER_KEY_COPY_COST (Engine) - Cost to copy a key from the index to a local buffer as part of searching for a key.

  • OPTIMIZER_KEY_LOOKUP_COST (Engine) - Cost to find a key entry in the index (index read).

  • OPTIMIZER_KEY_NEXT_FIND_COST (Engine) - Cost to find the next key in the index (index next).

  • OPTIMIZER_DISK_READ_RATIO (Engine) - The ratio of BLOCK_NOT_IN_CACHE/CACHE_READS. The cost of disk usage is calculated as estimated_blocks * OPTIMIZER_DISK_READ_RATIO * OPTIMIZER_DISK_READ_COST. A value of 0 means that all blocks are always in the cache; a value of 1 means that a block is never in the cache.

  • OPTIMIZER_ROW_COPY_COST (Engine) - Cost of copying a row to a local buffer. Should be slightly more than OPTIMIZER_KEY_COPY_COST.

  • OPTIMIZER_ROW_LOOKUP_COST (Engine) - Cost to find a row based on the rowid (the rowid is stored in the index together with the key).

  • OPTIMIZER_ROW_NEXT_FIND_COST (Engine) - Cost of finding the next row.

  • OPTIMIZER_ROWID_COMPARE_COST (Engine) - Cost of comparing two rowids.

  • OPTIMIZER_ROWID_COPY_COST (Engine) - Cost of copying a rowid from the index.

  • OPTIMIZER_SCAN_SETUP_COST (Session) - Cost of starting a table or index scan. This has a low value to encourage the optimizer to use index lookups also for tables with very few rows.

  • OPTIMIZER_WHERE_COST (Session) - Cost to execute the WHERE clause for every found row. Increasing this variable will encourage the optimizer to find plans which read fewer rows.
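
As a worked illustration of the OPTIMIZER_DISK_READ_RATIO formula above, using the default values shown earlier and a hypothetical estimate of 1000 block accesses:

-- 1000 estimated blocks * 0.02 (DISK_READ_RATIO) * 10.24 (DISK_READ_COST) = 204.8
SELECT 1000 * 0.020000 * 10.240000 AS disk_read_cost_contribution;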

More information about the costs and how they were calculated can be found in the Docs/optimizer_costs.txt file in the MariaDB source distribution.

Other Optimizer Cost Changes

  • When counting disk accesses, we assume that all rows and index data are cached for the duration of the query. This is to avoid the following problem:

    • Table t1 with 1 million rows is scanned.

      • For each row we do a lookup in table t2, which has only 10 rows.

If we counted every lookup in t2 as a potential disk access, there would be 1 million lookups. In that case, the optimizer would choose to use a join cache on the rows in t1 and do a table scan over t2.

  • The cost of sorting (filesort) is now more accurate, which allows the optimizer to better choose between index scan and filesort for ORDER BY/GROUP BY queries.

A lot of rule-based decisions have been changed to be cost-based:

  • The decision to use an index (and which index) for resolving ORDER BY/GROUP BY were only partly cost-based before.

  • The old optimizer would limit the number of ‘expected key lookups’ to 10% of the number of rows. This would cause the optimizer to use an index to scan a big part of a table when a full table scan would be much faster. This code is now removed.

  • InnoDB would limit the number of rows in a range to 50% of the total rows, which would confuse the optimizer for big ranges. The cap is now removed.

  • If there was a usable filter for an index, it was sometimes used without checking the complete cost of the filter.

  • ‘Aggregate distinct optimization with indexes’ is now cost-based. This will change many queries from "Using index for group-by (scanning)” to “Using index for group-by”.

Other Notable Plan Changes

  • Indexes can now be used for ORDER BY/GROUP BY in sub queries (instead of filesort)

  • Derived tables and queries with UNION can now create a distinct key (instead of a key with duplicates) to speed up key accesses.

  • Indexes with more used key parts are preferred if the number of resulting rows is the same:

    • WHERE key_part_1 = 1 and key_part_2 < 10

    • This will now use a RANGE over both key parts instead of using lookups on key_part_1.

  • For very small tables, index lookup is preferred over table scan.

  • EXPLAIN does not report "Using index" for scans using a clustered primary key, as technically this is a table scan.

When the Optimizer Changes Matter

The new, improved optimizer should be able to find a better plan:

  • If you are using queries with more than two tables.

  • If you have indexes with a lot of identical values.

  • If you are using ranges that cover more than 10% of a table.

    • WHERE key between 1 and 1000 -- Table has values 1-2000

  • If you have complex queries when not all used columns are or can be indexed.

    • In which case you may need to depend on selectivity to get the right plan.

  • If you are using queries mixing different storage engines.

    • Like using tables from both InnoDB and Memory in the same query.

  • If you have had to use FORCE INDEX to get a good plan.

  • If using ANALYZE TABLE made your plans worse (or not good enough).

  • If your queries have lots of derived tables (subselects).

  • If you are using ORDER BY / GROUP BY that could be resolved via indexes.

Changing Costs

All engine and “SQL level” cost variables can be changed via MariaDB startup options, in configuration files or dynamically using SQL.

In Configuration Files (and Command Line)

[mariadbd]
# Archive is using a hard disk (typical seek is 8-10 ms)
archive.OPTIMIZER_DISK_READ_COST=8000
# All other engines are using an SSD.
OPTIMIZER_DISK_READ_COST=10.240000

From SQL

# Tell the optimizer to find a plan with as few accepted rows as possible
SET SESSION OPTIMIZER_WHERE_COST=1.0;
# Inform the optimizer that the InnoDB buffer pool has an 80% hit rate
SET GLOBAL innodb.OPTIMIZER_DISK_READ_RATIO=0.20;
  • Note that engine costs are GLOBAL, while other costs can also be SESSION.

  • To keep things fast, engine-specific costs are stored in the table definition (TABLE_SHARE). One effect of this is that if one changes the cost for an engine, it only takes effect for tables that are opened after the change, not for already-cached tables. You can use FLUSH TABLES to force a table to use the new costs at the next access.

Examples of Changing Costs

  • OPTIMIZER_WHERE_COST is added as a cost for every 'accepted row'. Increasing this variable will cause the optimizer to choose plans with fewer estimated rows.

  • One can specify the kind of disk used by the system by changing OPTIMIZER_DISK_READ_COST. This should be the time to do a random read of a 4096 byte block.

  • The cost of a potential disk read is calculated as OPTIMIZER_DISK_READ_COST * OPTIMIZER_DISK_READ_RATIO. Increasing OPTIMIZER_DISK_READ_RATIO will inform the optimizer that not all data is cached.

  • OPTIMIZER_SCAN_SETUP_COST will increase the cost of a table scan. One can increase this to avoid using table scans.
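
The following sketch combines the suggestions above (the values are illustrative only, not recommendations):

-- A hard disk with roughly 8 ms random-read time for a 4K block (value in microseconds):
SET GLOBAL optimizer_disk_read_cost=8000;
-- Tell the optimizer that a noticeable share of InnoDB blocks is not cached:
SET GLOBAL innodb.optimizer_disk_read_ratio=0.5;
-- Prefer plans that examine fewer rows (higher than the default 0.032):
SET SESSION optimizer_where_cost=0.1;
-- Make the changed engine costs take effect for already-open tables:
FLUSH TABLES;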

For Storage Engine Developers

The costs for an engine are set the following way when the engine plugin is loaded/initialized:

  • Copy the "default" storage engine costs to the plugin engine costs.

    • handlerton->costs points to the engine-specific cost data.

  • Call handlerton->update_optimizer_costs() to let the storage engine update the costs.

  • Apply all user specific engine costs (from configuration files/startup) to the engine costs structure.

  • When a TABLE_SHARE is created, the costs are copied from handlerton->costs to TABLE_SHARE.optimizer_costs. handler::update_optimizer_costs() is called to allow the engine to tune the cost for this specific table instance. This is done to avoid having to take any "cost" mutex while running queries.

  • User changes to engine costs are stored in the data pointed to by handlerton->costs. This is why FLUSH TABLES is needed to activate new engine costs.

  • To speed up cost access for the optimizer, handler::set_optimizer_costs() is called for each query to copy OPTIMIZER_WHERE_COST and OPTIMIZER_SCAN_SETUP_COST to the engine cost structure.

This page is licensed: CC BY-SA / Gnu FDL

Optimizer Trace

Basic Optimizer Trace Example

MariaDB> set optimizer_trace='enabled=on';

MariaDB> select * from t1 where a<10;

MariaDB> select * from information_schema.optimizer_trace limit 1\G
*************************** 1. row ***************************
                            QUERY: select * from t1 where a<10
                            TRACE: {
  "steps": [
    {
      "join_preparation": {
        "select_id": 1,
        "steps": [
          {
            "expanded_query": "select t1.a AS a,t1.b AS b,t1.c AS c from t1 where t1.a < 10"
          }
        ]
      }
    },
    {
      "join_optimization": {
        "select_id": 1,
        "steps": [
          {
            "condition_processing": {
              "condition": "WHERE",
              "original_condition": "t1.a < 10",
              "steps": [
                {
                  "transformation": "equality_propagation",
                  "resulting_condition": "t1.a < 10"
                },
                {
                  "transformation": "constant_propagation",
                  "resulting_condition": "t1.a < 10"
                },
                {
                  "transformation": "trivial_condition_removal",
                  "resulting_condition": "t1.a < 10"
                }
              ]
            }
          },
          {
            "table_dependencies": [
              {
                "table": "t1",
                "row_may_be_null": false,
                "map_bit": 0,
                "depends_on_map_bits": []
              }
            ]
          },
          {
            "ref_optimizer_key_uses": []
          },
          {
            "rows_estimation": [
              {
                "table": "t1",
                "range_analysis": {
                  "table_scan": {
                    "rows": 1000,
                    "cost": 206.1
                  },
                  "potential_range_indexes": [
                    {
                      "index": "a",
                      "usable": true,
                      "key_parts": ["a"]
                    },
                    {
                      "index": "b",
                      "usable": false,
                      "cause": "not applicable"
                    }
                  ],
                  "setup_range_conditions": [],
                  "group_index_range": {
                    "chosen": false,
                    "cause": "no group by or distinct"
                  },
                  "analyzing_range_alternatives": {
                    "range_scan_alternatives": [
                      {
                        "index": "a",
                        "ranges": ["(NULL) < (a) < (10)"],
                        "rowid_ordered": false,
                        "using_mrr": false,
                        "index_only": false,
                        "rows": 10,
                        "cost": 13.751,
                        "chosen": true
                      }
                    ],
                    "analyzing_roworder_intersect": {
                      "cause": "too few roworder scans"
                    },
                    "analyzing_index_merge_union": []
                  },
                  "chosen_range_access_summary": {
                    "range_access_plan": {
                      "type": "range_scan",
                      "index": "a",
                      "rows": 10,
                      "ranges": ["(NULL) < (a) < (10)"]
                    },
                    "rows_for_plan": 10,
                    "cost_for_plan": 13.751,
                    "chosen": true
                  }
                }
              },
              {
                "selectivity_for_indexes": [
                  {
                    "index_name": "a",
                    "selectivity_from_index": 0.01
                  }
                ],
                "selectivity_for_columns": [],
                "cond_selectivity": 0.01
              }
            ]
          },
          {
            "considered_execution_plans": [
              {
                "plan_prefix": [],
                "table": "t1",
                "best_access_path": {
                  "considered_access_paths": [
                    {
                      "access_type": "range",
                      "resulting_rows": 10,
                      "cost": 13.751,
                      "chosen": true
                    }
                  ]
                }
              }
            ]
          },
          {
            "attaching_conditions_to_tables": {
              "original_condition": "t1.a < 10",
              "attached_conditions_computation": [],
              "attached_conditions_summary": [
                {
                  "table": "t1",
                  "attached": "t1.a < 10"
                }
              ]
            }
          }
        ]
      }
    },
    {
      "join_execution": {
        "select_id": 1,
        "steps": []
      }
    }
  ]
}
MISSING_BYTES_BEYOND_MAX_MEM_SIZE: 0
          INSUFFICIENT_PRIVILEGES: 0

This page is licensed: CC BY-SA / Gnu FDL

How to Collect Large Optimizer Traces

Optimizer traces can be large for some queries.

In order to collect a large trace, you need to perform the following steps (using 128 MB as an example):

set global max_allowed_packet=128*1024*1024;

Reconnect specifying --max-allowed-packet=128000000 for the client as well.

set optimizer_trace=1;
set optimizer_trace_max_mem_size=127*1024*1024;

Now, one can run the query and save the large trace.

See Also

  • optimizer_trace system variable

  • optimizer_trace_max_mem_size system variable

  • max_allowed_packet system variable

This page is licensed: CC BY-SA / Gnu FDL

Optimizer Trace for Developers

This article describes guidelines for what/how to write to Optimizer Trace when doing server development.

Basic considerations

The trace is a "structured log" of what was done by the optimizer. Prefer to do tracing as soon as a rewrite/decision is made (instead of having a separate trace_something() function).

Generally, a function should expect to find the trace in a state where we're writing an array. The rationale is that array elements are ordered, while object members are not (even if they come in a certain order in the JSON text). We're writing a log, so it's natural for different entries to form an array.

Typically you'll want to start an unnamed object, then use member names to show what kind of entry you're about to write:

[
  ...,  # Something before us
  {
    "my_new_rewrite": {
       "from": "foo", 
       "to": "bar",
       ...
    }
  }
  ...

(TODO other considerations)

Making sure the trace is valid

The Json_writer_object and Json_writer_array classes use the RAII idiom and ensure that JSON objects and arrays are "closed" in the reverse order they were started.

However, they do not ensure these constraints:

  • JSON objects must have named members.

  • JSON arrays must have unnamed members.

The tracing code has runtime checks for these: an attempt to write invalid JSON will cause an assertion failure.

Test coverage

It is possible to run mysql-test-run with this argument

--mysqld=--optimizer_trace=enabled=on

This will run all tests with tracing on. As mentioned earlier, a debug build will perform checks that we are not producing an invalid trace.

The MariaDB BuildBot instance also runs tests with this argument; see the mtr_opttrace pass in kvm-fulltest and kvm-fulltest2.

Debugging

See "Printing the Optimizer Trace" in Optimizer Debugging With GDB for commands to print the trace for the current statement.

This page is licensed: CC BY-SA / Gnu FDL


Optimizer Trace Guide

Optimizer trace uses the JSON format. It is basically a structured log file showing what actions were taken by the query optimizer.

A Basic Example

Let's take a simple query:

MariaDB> explain select * from t1 where a<10;
+------+-------------+-------+-------+---------------+------+---------+------+------+-----------------------+
| id   | select_type | table | type  | possible_keys | key  | key_len | ref  | rows | Extra                 |
+------+-------------+-------+-------+---------------+------+---------+------+------+-----------------------+
|    1 | SIMPLE      | t1    | range | a             | a    | 5       | NULL | 10   | Using index condition |
+------+-------------+-------+-------+---------------+------+---------+------+------+-----------------------+

One can see the full trace here. Taking only the component names, one gets:

MariaDB> select * from information_schema.optimizer_trace limit 1\G
*************************** 1. row ***************************
                            QUERY: select * from t1 where a<10
                            TRACE: 
{
  "steps": [
    {
      "join_preparation": { ... }
    },
    {
      "join_optimization": {
        "select_id": 1,
        "steps": [
          { "condition_processing": { ... } },
          { "table_dependencies": [ ... ] },
          { "ref_optimizer_key_uses": [ ... ] },
          { "rows_estimation": [
              {
                "range_analysis": {
                   "analyzing_range_alternatives" : { ... },
                  "chosen_range_access_summary": { ... },
                },
                "selectivity_for_indexes" : { ... },
                "selectivity_for_columns" : { ... }
              }
            ]
          },
          { "considered_execution_plans": [ ... ] },
          { "attaching_conditions_to_tables": { ... } }
         ]
      }
    },
    {
      "join_execution": { ... }
    }
  ]
}

Trace Structure

For each SELECT, there are two "Steps":

  • join_preparation

  • join_optimization

Join preparation shows early query rewrites. join_optimization is where most of the query optimizations are done. The steps are:

  • condition_processing - basic rewrites in WHERE/ON conditions.

  • ref_optimizer_key_uses - Construction of possible ways to do ref and eq_ref accesses.

  • rows_estimation - Consideration of range and index_merge accesses.

  • considered_execution_plans - Join optimization itself, that is, choice of the join order.

  • attaching_conditions_to_tables - Once the join order is fixed, parts of the WHERE clause are "attached" to tables to filter out rows as early as possible.

The above steps are for just one SELECT. If the query has subqueries, each SELECT will have these steps, and there will be extra steps/rewrites to handle the subquery construct itself.

Extracting Trace Components

If you are interested in some particular part of the trace, MariaDB has two functions that come in handy:

  • JSON_EXTRACT extracts a part of JSON document

  • JSON_DETAILED presents it in a user-readable way.

For example, the contents of the analyzing_range_alternatives node can be extracted like so:

MariaDB> select JSON_DETAILED(JSON_EXTRACT(trace, '$**.analyzing_range_alternatives')) 
   ->   from INFORMATION_SCHEMA.OPTIMIZER_TRACE\G
*************************** 1. row ***************************
JSON_DETAILED(JSON_EXTRACT(trace, '$**.analyzing_range_alternatives')): [
    {
        "range_scan_alternatives": 
        [
            {
                "index": "a_b_c",
                "ranges": 
                [
                    "(1) <= (a,b) < (4,50)"
                ],
                "rowid_ordered": false,
                "using_mrr": false,
                "index_only": false,
                "rows": 4,
                "cost": 6.2509,
                "chosen": true
            }
        ],
        "analyzing_roworder_intersect": 
        {
            "cause": "too few roworder scans"
        },
        "analyzing_index_merge_union": []
    }
]

Examples of Various Information in the Trace

Basic Rewrites

A lot of applications construct database query text on the fly, which sometimes means that the query has constructs that are repetitive or redundant. In most cases, the optimizer will be able to remove them. One can check the trace to be sure:

explain select * from t1 where not (col1 >= 3);

Optimizer trace will show:

"steps": [
  {
    "join_preparation": {
      "select_id": 1,
      "steps": [
        {
          "expanded_query": "select t1.a AS a,t1.b AS b,t1.col1 AS col1 from t1 where t1.col1 < 3"
        }

Here, one can see that NOT was removed.

Similarly, one can also see that IN(...) with one element is the same as equality:

explain select * from t1 where col1  in (1);

will show

"join_preparation": {
    "select_id": 1,
    "steps": [
      {
        "expanded_query": "select t1.a AS a,t1.b AS b,t1.col1 AS col1 from t1 where t1.col1 = 1"

On the other hand, converting a UTF-8 column to UTF-8 is not removed:

explain select * from t1 where convert(utf8_col using utf8) = 'hello';

will show

"join_preparation": {
    "select_id": 1,
    "steps": [
      {
        "expanded_query": "select t1.a AS a,t1.b AS b,t1.col1 AS col1,t1.utf8_col AS utf8_col from t1 where convert(t1.utf8_col using utf8) = 'hello'"
          }

so CONVERT calls should be used with caution, since redundant ones are not removed.

VIEW Processing

MariaDB has two algorithms to handle VIEWs: merging and materialization. If you run a query that uses a VIEW, the trace will have either

"view": {
              "table": "view1",
              "select_id": 2,
              "algorithm": "merged"
            }

or

{
            "view": {
              "table": "view2",
              "select_id": 2,
              "algorithm": "materialized"
            }
          },

depending on which algorithm was used.
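
This node can be extracted with the JSON helper functions described in "Extracting Trace Components" (a sketch following the same pattern):

select JSON_DETAILED(JSON_EXTRACT(trace, '$**.view')) from information_schema.optimizer_trace\G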

Range Optimizer - What Ranges Will Be Scanned

The MariaDB optimizer has a complex part called the Range Optimizer. This is a module that examines WHERE (and ON) clauses and constructs index ranges that need to be scanned to answer the query. The rules for constructing the ranges are quite complex.

An example: Consider a table

CREATE TABLE some_events ( 
  start_date DATE, 
  end_date DATE, 
  ...
  key (start_date, end_date)
);

and a query:

explain select * from some_events where start_date >= '2019-09-10' and end_date <= '2019-09-14';
+------+-------------+-------------+------+---------------+------+---------+------+------+-------------+
| id   | select_type | table       | type | possible_keys | key  | key_len | ref  | rows | Extra       |
+------+-------------+-------------+------+---------------+------+---------+------+------+-------------+
|    1 | SIMPLE      | some_events | ALL  | start_date    | NULL | NULL    | NULL | 1000 | Using where |
+------+-------------+-------------+------+---------------+------+---------+------+------+-------------+

One might think that the optimizer would be able to use the restrictions on both start_date and end_date to construct a narrow range to be scanned. But this is not so: one of the restrictions creates a left-endpoint range and the other creates a right-endpoint range, hence they cannot be combined.

select 
   JSON_DETAILED(JSON_EXTRACT(trace, '$**.analyzing_range_alternatives')) as trace 
from information_schema.optimizer_trace\G
*************************** 1. row ***************************
trace: [
    {
        "range_scan_alternatives": 
        [
            {
                "index": "start_date",
                "ranges": 
                [
                    "(2019-09-10,NULL) < (start_date,end_date)"
                ],
...

the potential range only uses one of the bounds.

Ref Access Options

Index-based Nested-loops joins are called "ref access" in the MariaDB optimizer.

The optimizer analyzes the WHERE/ON conditions and collects all equality conditions that can be used by ref access using some index.

The list of conditions can be found in the ref_optimizer_key_uses node. (TODO example)
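
The node can be extracted in the same way as the other trace components (a sketch following the pattern shown earlier):

select JSON_DETAILED(JSON_EXTRACT(trace, '$**.ref_optimizer_key_uses')) from information_schema.optimizer_trace\G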

Join Optimization

The join optimizer's node is named considered_execution_plans.

The optimizer constructs the join orders in a left-to-right fashion. That is, if the query is a join of three tables:

SELECT * FROM t1, t2, t3 WHERE ...

then the optimizer will

  • Pick the first table (say, it is t1),

  • consider adding another table (say, t2), and construct a prefix "t1, t2"

  • consider adding the third table (t3), and construct the prefix "t1, t2, t3", which is a complete join plan. Other join orders will be considered as well.

The basic operation here is: "given a join prefix of tables A,B,C ..., try adding table X to it". In JSON, it looks like this:

{
        "plan_prefix": ["t1", "t2"],
        "table": "t3",
        "best_access_path": {
          "considered_access_paths": [
            {
              ...
            }
          ]
        }
      }

(search for plan_prefix followed by table).

If you are interested in how the join order of t1, t2, t3 was constructed (or not constructed), you need to search for these patterns:

  • "plan_prefix":[], "table":"t1"

  • "plan_prefix":["t1"], "table":"t2"

  • "plan_prefix":["t1", "t2"], "table":"t3"

This page is licensed: CC BY-SA / Gnu FDL

Optimizer Trace Overview

Usage

This feature produces a trace as a JSON document for any SELECT/UPDATE/DELETE containing information about decisions taken by the optimizer during the optimization phase (choice of table access method, various costs, transformations, etc). This feature helps to explain why some decisions were taken by the optimizer and why some were rejected.

Associated System Variables

  • optimizer_trace='enabled=on/off'

    • Default value is off

  • optimizer_trace_max_mem_size=value

    • Default value: 1048576

INFORMATION_SCHEMA.OPTIMIZER_TRACE

Each connection stores a trace from the last executed statement. One can view the trace by reading the Information Schema OPTIMIZER_TRACE table.

Structure of the optimizer trace table:

SHOW CREATE TABLE INFORMATION_SCHEMA.OPTIMIZER_TRACE \G
*************************** 1. row ***************************
       Table: OPTIMIZER_TRACE
Create Table: CREATE TEMPORARY TABLE `OPTIMIZER_TRACE` (
  `QUERY` longtext NOT NULL DEFAULT '',
  `TRACE` longtext NOT NULL DEFAULT '',
  `MISSING_BYTES_BEYOND_MAX_MEM_SIZE` int(20) NOT NULL DEFAULT 0,
  `INSUFFICIENT_PRIVILEGES` tinyint(1) NOT NULL DEFAULT 0
) ENGINE=Aria DEFAULT CHARSET=utf8 PAGE_CHECKSUM=0

Optimizer Trace Contents

See Optimizer Trace Guide for an overview of what one can find in the trace.

Traceable Queries

These include SELECT, UPDATE, DELETE as well as their multi-table variants and all of the preceding prefixed by EXPLAIN and ANALYZE.

Enabling Optimizer Trace

To enable optimizer trace run:

SET optimizer_trace='enabled=on';

Memory Usage

Each trace is stored as a string. It is extended (with realloc()) as the optimization progresses and appends data to it. The optimizer_trace_max_mem_size variable sets a limit on the total amount of memory used by the current trace. If this limit is reached, the current trace isn't extended (so it will be incomplete), and the MISSING_BYTES_BEYOND_MAX_MEM_SIZE column will show the number of bytes missing from this trace.

Privilege Checking

In complex scenarios where the query uses SQL SECURITY DEFINER views or stored routines, it may be that a user is denied from seeing the trace of its query because it lacks some extra privileges on those objects. In that case, the trace will be shown as empty and the INSUFFICIENT_PRIVILEGES column will show "1".

Limitations

Currently, only one trace is stored. It is not possible to trace the sub-statements of a stored routine; only the statement at the top level is traced.

This page is licensed: CC BY-SA / Gnu FDL

Optimizer Trace Resources

  • Optimizer Trace Walkthrough talk at MariaDB Fest 2020

  • A tool for processing Optimizer Trace: opttrace. It doesn't work with MariaDB at the moment, but everyone is welcome to make it work.

This page is licensed: CC BY-SA / Gnu FDL

MariaDB Source Code Internals

Articles about MariaDB source code and related internals

Stored Procedure Internals

Implementation Specification for Stored Procedures

How Parsing and Execution of Queries Work

In order to execute a query, the function sql_parse.cc:mysql_parse() is called, which in turn calls the parser (yyparse()) with an updated Lex structure as the result. mysql_parse() then calls mysql_execute_command() which dispatches on the command code (in Lex) to the corresponding code for executing that particular query.

There are three structures involved in the execution of a query which are of interest to the stored procedure implementation:

  • Lex (mentioned above) is the "compiled" query, that is, the output from the parser and what is then interpreted to do the actual work. It contains an enum value (sql_command), which is the query type, and all the data collected by the parser needed for the execution (table names, fields, values, etc).

  • THD is the "run-time" state of a connection, containing all that is needed for a particular client connection, and, among other things, the Lex structure currently being executed.

  • Item_*: During parsing, all data is translated into "items", objects of the subclasses of "Item", such as Item_int, Item_real, Item_string, etc., for basic datatypes, and also various more specialized Item types for expressions to be evaluated (Item_func objects).

How to Fit Stored Procedures into this Scheme

Overview of the Classes and Files for Stored Procedures

(More detailed APIs at the end of this page)

class sp_head (sp_head.{cc,h})

This contains, among other things, an array of "instructions" and the method for executing the procedure.

class sp_pcontext (sp_pcontext.{cc,h})

This is the parse context for the procedure. It's primarily used during parsing to keep track of local parameters, variables and labels, but it's also used at CALL time to find the parameters' modes (IN, OUT or INOUT) and types when setting up the runtime context.

class sp_instr (sp_head.{cc,h})

This is the base class for "instructions", that is, what is generated by the parser. It turns out that we only need a minimum of 5 different subclasses:

  • sp_instr_stmt Execute a statement. This is the "call-out" to any normal SQL statement, like a SELECT, INSERT, etc. It contains the Lex structure for the statement in question.

  • sp_instr_set Set the value of a local variable (or parameter)

  • sp_instr_jump An unconditional jump.

  • sp_instr_jump_if_not Jump if condition is not true. It turns out that the negative test is most convenient when generating the code for the flow control constructs.

  • sp_instr_freturn Return a value from a FUNCTION and exit. For condition HANDLERs some special instructions are also needed, see that section below.

class sp_rcontext (sp_rcontext.h)

This is the runtime context in the THD structure. It contains an array of items, the parameters and local variables for the currently executing stored procedure. This means that variable value lookup at runtime is constant time, a simple index operation.

class Item_splocal (Item.{cc,h})

This is a subclass of Item. Its sole purpose is to hide the fact that the real Item is actually in the current frame (runtime context). It contains the frame offset and defers all methods to the real Item in the frame. This is what the parser generates for local variables.

Utility Functions (sp.{cc,h})

This contains functions for creating, dropping and finding a stored procedure in the mysql.proc table (or the internal cache).

Parsing CREATE PROCEDURE

When parsing a CREATE PROCEDURE, the parser first initializes the sphead and spcont (runtime context) fields in the Lex. The sql_command code for the result of parsing is SQLCOM_CREATE_PROCEDURE.

The parsing of the parameter list and body is relatively straightforward:

  • Parameters: name, type and mode (IN/OUT/INOUT) is pushed to spcont

  • Declared local variables: Same as parameters (mode is then IN)

  • Local Variable references: If an identifier is found in spcont, an Item_splocal is created with the variable's frame index, otherwise an Item_field or Item_ref is created (as before).

  • Statements: The Lex in THD is replaced by a new Lex structure and the statement is parsed as usual. A sp_instr_stmt is created, containing the new Lex, and added to the instructions in sphead. Afterwards, the procedure's Lex is restored in THD.

  • SET var: Setting a local variable generates a sp_instr_set instruction, containing the variable's frame offset, the expression (an Item), and the type.

  • Flow control: Flow control constructs such as IF, WHILE, etc, generate conditional and unconditional jumps in the "obvious" way, but a few notes may be required:

  • Forward jumps: When jumping forward, the exact destination is not known at the time of the creation of the jump instruction. The sphead therefore contains a list of instruction-label pairs for each forward reference. When the position is later known, the instructions in the list are updated with the correct location.

  • Loop constructs have optional labels. If a loop doesn't have a label, an anonymous label is generated to simplify the parsing.

  • There are two types of CASE. The "simple" case is implemented with an anonymous variable bound to the value to be tested (both forms are sketched below).
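
For reference, here is how the two CASE forms look at the SQL level (a minimal sketch; x and y are hypothetical local variables of the enclosing routine):

-- "Simple" CASE: the tested value is bound to an anonymous variable
CASE x
  WHEN 1 THEN SET y = 'one';
  WHEN 2 THEN SET y = 'two';
  ELSE SET y = 'many';
END CASE;

-- "Searched" CASE: each WHEN branch has its own condition
CASE
  WHEN x < 10 THEN SET y = 'small';
  ELSE SET y = 'big';
END CASE;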

A Simple Example

Parsing the procedure:

CREATE PROCEDURE a(s CHAR(16))
      BEGIN
        DECLARE x INT;
        SET x = 3;
        WHILE x > 0 DO
          SET x = x-1;
          INSERT INTO db.tab VALUES (x, s);
        END WHILE;
      END

would generate the following structures:

            ______
      thd: |      |     _________
           | lex -+--->|         |                     ___________________
           |______|    | spcont -+------------------->| "s",in,char(16):0 |
                       | sphead -+------              |("x",in,int     :1)|
                       |_________|      |             |___________________|
                                    ____V__________________
                                   | m_name: "a"           |
                                   | m_defstr: "create ..."|
                                   | m_instr: ...          |
                                   |_______________________|

Note that the contents of spcont change during the parsing, at all times reflecting the state of the would-be runtime frame. The m_instr member is an array of instructions:

Pos.  Instruction
       0    sp_instr_set(1, '3')
       1    sp_instr_jump_if_not(5, 'x>0')
       2    sp_instr_set(1, 'x-1')
       3    sp_instr_stmt('insert into ...')
       4    sp_instr_jump(1)
       5    <end>

Here, '3', 'x>0', etc, represent the Items or Lex for the respective expressions or statements.

Parsing CREATE FUNCTION

Creating a function is essentially the same thing as for a PROCEDURE, with the addition that a FUNCTION has a return type and a RETURN statement, but no OUT or INOUT parameters.

The main difference during parsing is that we store the result type in the sp_head. However, there are big differences when it comes to invoking a FUNCTION. (See below.)

Storing, Caching, Dropping

As seen above, the entire definition string, including the "CREATE PROCEDURE" (or "CREATE FUNCTION"), is kept. The procedure definition string is stored in the table mysql.proc with the name and type as the key, the type being one of the enum values ("procedure", "function").

A PROCEDURE is just stored in the mysql.proc table. A FUNCTION has an additional requirement. They will be called in expressions with the same syntax as UDFs, so UDFs and stored FUNCTIONs share the namespace. Thus, we must make sure that we do not have UDFs and FUNCTIONs with the same name (even if they are stored in different places).

This means that we can reparse the procedure as many times as we want. The first time, the resulting Lex is used to store the procedure in the database (using the function sp.c:sp_create_procedure()).

The simplest way would be to just leave it at that, and re-read the procedure from the database each time it is called. (And in fact, that's the way the earliest implementation will work.) However, this is not very efficient, and we can do better. The full implementation should work like this:

  1. Upon creation time, parse and store the procedure. Note that we still need to parse it to catch syntax errors, but we can't check whether called procedures exist, for instance.

  2. Upon first CALL, read from the database, parse it, and cache the resulting Lex in memory. This time we can do more error checking.

  3. Upon subsequent CALLs, use the cached Lex.

Note that this implies that the Lex structure with its sphead must be reentrant, that is, reusable and shareable between different threads and calls. The runtime state for a procedure is kept in the sp_rcontext in THD.

The mechanisms of storing, finding, and dropping procedures are encapsulated in the files sp.{cc,h}.

CALLing a Procedure

A CALL is parsed just like any statement. The resulting Lex has the sql_command SQLCOM_CALL; the procedure's name and the parameters are pushed to the Lex's value_list.

sql_parse.cc:mysql_execute_command() then uses sp.cc:sp_find() to get the sp_head for the procedure (which may have been read from the database or fetched from the in-memory cache) and calls the sp_head's method execute(). Note: It's important that substatements called by the procedure do not do send_ok(). Fortunately, there is a flag in THD->net to disable this during CALLs. If a substatement fails, it will however send an error back to the client, so the CALL mechanism must return immediately and without sending an error.

The sp_head::execute() method works as follows:

  1. Keep a pointer to the old runtime context in THD (if any)

  2. Create a new runtime context. The information about the required size is in sp_head's parse time context.

  3. Push each parameter (from the CALL's Lex->value_list) to the new context. If it's an OUT or INOUT parameter, the parameter's offset in the caller's frame is set in the new context as well.

  4. For each instruction, call its execute() method. The result is a pointer to the next instruction to execute (or NULL) if an error occurred.

  5. On success, set the new values of the OUT and INOUT parameters in the caller's frame.

USE database

Before executing the instructions we also keep track of the current default database (if any). If it was changed during execution (i.e. a USE statement has been executed), we restore the current database to the original one.

This is the most useful way to handle USE in procedures. If we didn't, the caller would find himself in a different database after calling a function, which can be confusing. Restoring the database also gives full freedom to the procedure writer:

  • It's possible to write "general" procedures that are independent of the actual database name.

  • It's possible to write procedures that work on a particular database by calling USE, without having to use fully qualified table names everywhere (which doesn't help if you want to call other, "general", procedures anyway).

Evaluating Items

There are three occasions where we need to evaluate an expression:

  • When SETing a variable

  • When CALLing a procedure

  • When testing an expression for a branch (in IF, WHILE, etc)

The semantics in stored procedures is "call-by-value", so we have to evaluate any "func" Items at the point of the CALL or SET, otherwise we would get a kind of "lazy" evaluation with unexpected results with respect to OUT parameters, for instance. For this, the support function sp_head.cc:eval_func_item() is needed.

Calling a FUNCTION

Functions don't have an explicit call keyword like procedures. Instead, they appear in expressions with the conventional syntax "fun(arg, ...)". The problem is that we already have User Defined Functions (UDFs) which are called the same way. A UDF is detected by the lexical analyzer (not the parser!), in the find_keyword() function, and returns a UDF_*_FUNC or UDA_*_SUM token with the udf_func object as the yylval.

So, stored functions must be handled in a similar way, and as a consequence, UDFs and functions must not have the same name.

Detecting and Parsing a FUNCTION Invocation

The existence of UDFs is checked during the lexical analysis (in sql_lex.cc:find_keyword()). This has the drawback that they must exist before they are referred to, which was OK before SPs existed, but then it becomes a problem. The first implementation of SP FUNCTIONs will work the same way, but this should be fixed a.s.a.p. (This will require some reworking of the way UDFs are handled, which is why it's not done from the start.) For the time being, a FUNCTION is detected the same way, and returns the token SP_FUNC. During the parsing we only check for the existence of the function; we don't parse it, since we can't call the parser recursively.

When encountering a SP_FUNC with parameters in the expression parser, an instance of the new Item_func_sp class is created. Unlike UDFs, we don't have different classes for different return types, since we at this point don't know the type.

Collecting FUNCTIONs to invoke

A FUNCTION differs from a PROCEDURE in one important aspect: Whereas a PROCEDURE is CALLed as statement by itself, a FUNCTION is invoked "on-the-fly" during the execution of another statement. This makes things a lot more complicated compared to CALL:

  • We can't read and parse the FUNCTION from the mysql.proc table at the point of invocation; the server requires that all tables used are opened and locked at the beginning of the query execution. One "obvious" solution would be to simply push "mysql.proc" to the list of tables used by the query, but this implies a "join" with this table if the query is a select, so it doesn't work (and we can't exclude this table easily, since a privileged user might in fact want to search the proc table). Another solution would of course be to allow the opening and closing of the mysql.proc table during a query execution, but this is not possible at present.

So, the solution is to collect the names of the referred FUNCTIONs during parsing in the lex. Then, before doing anything else in mysql_execute_command(), read all functions from the database and keep them in the THD, where the function sp_find_function() can find them during execution. Note: even with an in-memory cache, we must still make sure that the functions are indeed read and cached at this point. The code that reads and caches functions from the database must also be invoked recursively for each read FUNCTION to make sure we have all the functions we need.

Parsing DROP PROCEDURE/FUNCTION

The procedure name is pushed to Lex->value_list. The sql_command code resulting from parsing is SQLCOM_DROP_PROCEDURE or SQLCOM_DROP_FUNCTION.

Dropping is done by simply getting the procedure with the sp_find() function and calling sp_drop() (both in sp.{cc,h}).

DROP PROCEDURE/DROP FUNCTION also supports the non-standard "IF EXISTS", analogous to other DROP statements in MariaDB.

Condition and Handlers

Condition names are lexical entities and are kept in the parser context just like variables. But conditions are just "aliases" for SQLSTATE strings, or mysqld error codes (which is a non-standard MySQL extension), and are only used during parsing.

Handlers come in three types: CONTINUE, EXIT and UNDO. The last is like an EXIT handler with an implicit rollback, and is currently not implemented. The EXIT handler jumps to the end of its BEGIN-END block when finished. The CONTINUE handler returns to the statement following the one that invoked the handler.

The handlers in effect at any point are part of each thread's runtime state, so we need to push and pop handlers in the sp_rcontext during execution. We use special instructions for this:

  • sp_instr_hpush_jump Push a handler. The instruction contains the necessary information, like which conditions we handle and the location of the handler. The jump takes us to the location after the handler code.

  • sp_instr_hpop Pop the handlers of the current frame (which we are just leaving).

It might seem strange to jump past the handlers like that, but there is no extra cost in doing so, and for technical reasons it's easiest for the parser to generate the handler instructions where they occur in the source.

When an error occurs, one of the error routines is called and an error message is normally sent back to the client immediately. Catching a condition must be done in these error routines (there are quite a few) to prevent them from doing this. We do this by calling a method in the THD's sp_rcontext (if there is one). If a handler is found, this is recorded in the context and the routine returns without sending the error message. The execution loop (sp_head::execute()) checks for this after each statement and invokes the handler that has been found. If several errors or warnings occur during one statement, only the first is caught; the rest are ignored.
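In outline, and only as a simplified illustration of this mechanism (the exact member names, such as thd->spcont and the instruction pointer ip, are used here for illustration and need not match the sources), the cooperation between the error routines and the execution loop looks roughly like this, using the sp_rcontext methods listed later in this document:

/* Simplified sketch, not the literal server code. */

/* In an error routine: suppress the message if an SP handler matches. */
if (thd->spcont && thd->spcont->find_handler(sql_errno))
  return;                              /* handler found; don't send the error */

/* In sp_head::execute(), after executing one instruction: */
uint hip, hfp;
if (thd->spcont->found_handler(&hip, &hfp))
{
  thd->spcont->push_hstack(ip);        /* CONTINUE case: remember where to return */
  thd->spcont->save_variables(hfp);    /* save locals above the handler's frame */
  ip= hip;                             /* jump to the handler's first instruction */
  thd->spcont->clear_handler();
}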

Invoking and returning from a handler is trivial in the EXIT case. We simply jump to it, and it will have an sp_instr_jump as its last instruction.

Calling and returning from a CONTINUE handler poses some special problems. Since we need to return to the point after its invocation, we push the return location on a stack in the sp_rcontext (this is done by the execution loop). The handler then ends with a special instruction, sp_instr_hreturn, which returns to this location.

CONTINUE handlers have one additional problem: They are parsed at the lexical level where they occur, so variable offsets will assume that it's actually called at that level. However, a handler might be invoked from a sub-block where additional local variables have been declared, which will then share the location of any local variables in the handler itself. So, when calling a CONTINUE handler, we need to save any local variables above the handler's frame offset, and restore them upon return. (This is not a problem for EXIT handlers, since they will leave the block anyway.) This is taken care of by the execution loop and the sp_instr_hreturn instruction.

Examples

EXIT handler:

begin
        declare x int default 0;

        begin
          declare exit handler for 'XXXXX' set x = 1;

          (statement1);
          (statement2);
        end;
        (statement3);
      end
Pos.  Instruction
       0    sp_instr_set(0, '0')
       1    sp_instr_hpush_jump(4, 1)           # location and frame size
       2    sp_instr_set(0, '1')
       3    sp_instr_jump(6)
       4    sp_instr_stmt('statement1')
       5    sp_instr_stmt('statement2')
       6    sp_instr_hpop(1)
       7    sp_instr_stmt('statement3')

CONTINUE handler:

CREATE PROCEDURE hndlr1(val INT)
      BEGIN
        DECLARE x INT DEFAULT 0;
        DECLARE foo CONDITION FOR 1146;
        DECLARE CONTINUE HANDLER FOR foo SET x = 1;

        INSERT INTO t3 VALUES ("hndlr1", val);     # Non-existing table?
        IF x>0 THEN
          INSERT INTO t1 VALUES ("hndlr1", val);   # This instead then
        END IF;
      END|
Pos.  Instruction
       0    sp_instr_set(1, '0')
       1    sp_instr_hpush_jump(4, 2)
       2    sp_instr_set(1, '1')
       3    sp_instr_hreturn(2)                 # frame size
       4    sp_instr_stmt('insert ... t3 ...')
       5    sp_instr_jump_if_not(7, 'x>0')
       6    sp_instr_stmt('insert ... t1 ...')
       7    sp_instr_hpop(2)

Cursors

For stored procedures to be really useful, you want to have cursors. MySQL doesn't yet have "real" cursor support (with API and ODBC support, allowing updating, arbitrary scrolling, etc.), but a simple asensitive, non-scrolling, read-only cursor can be implemented in SPs using the class Protocol_cursor. This class intercepts the creation and sending of result sets and instead stores them in memory, as MYSQL_FIELDS and MYSQL_ROWS (as in the client API).

To support this, we need the usual name binding support in sp_pcontext (similar to variables and conditions) to keep track of declared cursor names, and a corresponding run-time mechanism in sp_rcontext. Cursors are lexically scoped like everything with a body or BEGIN/END block, so they are pushed and popped as usual (see conditions and variables above). The basic operations on a cursor are OPEN, FETCH and CLOSE, which will each have a corresponding instruction. In addition, we need instructions to push a new cursor (this will encapsulate the LEX of the SELECT statement of the cursor), and a pop instruction:

  • sp_instr_cpush Push a cursor to the sp_rcontext. This instruction contains the LEX for the select statement

  • sp_instr_cpop Pop a number of cursors from the sp_rcontext.

  • sp_instr_copen Open a cursor: this will execute the select and get the result set in a separate memroot.

  • sp_instr_cfetch Fetch the next row from the in-memory result set. The instruction contains a list of the variables (frame offsets) to set.

  • sp_instr_cclose Free the result set.

A cursor is a separate class, sp_cursor (defined in sp_rcontext.h), which encapsulates the basic operations used by the above instructions. This class contains the LEX, the Protocol_cursor object and its memroot, as well as the cursor's current state. Compiling and executing is fairly straightforward. sp_instr_copen is a subclass of sp_instr_stmt and uses its mechanism to execute a substatement.

Example

begin
        declare x int;
        declare c cursor for select a from t1;

        open c;
        fetch c into x;
        close c;
      end
Pos.  Instruction
       0    sp_instr_cpush('select a from ...')
       1    sp_instr_copen(0)                   # The 0'th cursor
       2    sp_instr_cfetch(0)                  # Contains the variable list
       3    sp_instr_cclose(0)
       4    sp_instr_cpop(1)

The SP cache

There are two ways to cache SPs:

  1. one global cache, shared by all threads/connections,

  2. one cache per thread.

There are pros and cons with both methods:

  • Global cache: saves memory, and each SP is only read from the table once; but it needs locking (= serialization on access) and requires thread-safe data structures.

  • Per-thread cache: fast, with (almost) no locking and only limited thread-safety requirements; but it uses more memory, and each SP is read from the table once per thread.

Unfortunately, we cannot use alternative 1 for the time being, as most of the data structures to be cached (lex and items) are not reentrant and thread-safe. (Things are modified at execution, we have THD pointers stored everywhere, etc.) This leaves us with alternative 2, one cache per thread; or actually two, since we keep FUNCTIONs and PROCEDUREs in separate caches. This is not that terrible; the only case when it will perform significantly worse than a global cache is when we have an application where new threads are connecting, calling a procedure, and disconnecting, over and over again.

The cache implementation itself is simple and straightforward, a hashtable wrapped in a class and a C API (see APIs below).

There is however one issue with multiple caches: dropping and altering procedures. Normally, this should be a very rare event in a running system; it's typically something you do during development and testing, so it's not unthinkable that we would simply ignore the issue and let any thread running with a cached version of an SP keep doing so until it disconnects. But assuming we want to keep the caches consistent with respect to drop and alter, it can be done:

  1. A global counter is needed, initialized to 0 at start.

  2. At each DROP or ALTER, increase the counter by one.

  3. Each cache has its own copy of the counter, copied at the last read.

  4. When looking up a name in the cache, first check if the global counter is larger than the local copy. If so, clear the cache and return "not found", and update the local counter; otherwise, lookup as usual.

This minimizes the cost to a single brief lock for accessing an integer during normal operation. Only in the event of an actual drop or alter is the cache cleared. This may seem drastic, but since we assume that this is a rare event, it's not a problem. It would of course be possible to have a much more fine-grained solution, keeping track of each SP, but the overhead of doing so is not worth the effort.
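A minimal sketch of this counter scheme follows; the helper names (lock_version(), cache_clear(), cache_hash_lookup()) and the local_version member are illustrative, not the actual implementation:

/* Illustrative sketch of the invalidation scheme described above. */
static ulong global_sp_version;          /* bumped (under a lock) on every DROP/ALTER */

sp_head *cache_lookup(sp_cache *cache, const char *name, uint namelen)
{
  ulong v;

  lock_version();                        /* brief lock around the counter only */
  v= global_sp_version;
  unlock_version();

  if (v > cache->local_version)          /* something was dropped or altered */
  {
    cache_clear(cache);                  /* throw away all cached routines */
    cache->local_version= v;
    return NULL;                         /* caller re-reads from mysql.proc */
  }
  return cache_hash_lookup(cache, name, namelen);   /* normal hash lookup */
}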

Class and Function APIs

This is an outline of the key types. Some types and other details in the actual files have been omitted for readability.

The parser context: sp_pcontext.h

typedef enum
      {
        sp_param_in,
        sp_param_out,
        sp_param_inout
      } sp_param_mode_t;

      typedef struct
      {
        LEX_STRING name;
        enum enum_field_types type;
        sp_param_mode_t mode;
        uint offset;                    // Offset in current frame
        my_bool isset;
      } sp_pvar_t;

      typedef struct sp_cond_type
      {
        enum { number, state, warning, notfound, exception } type;
        char sqlstate[6];
        uint mysqlerr;
      } sp_cond_type_t;

      class sp_pcontext
      {
        sp_pcontext();

        // Return the maximum frame size
        uint max_framesize();

        // Return the current frame size
        uint current_framesize();

        // Return the number of parameters
        uint params();

        // Set the number of parameters to the current frame size
        void set_params();

        // Set type of the variable at offset 'i' in the frame
        void set_type(uint i, enum enum_field_types type);

        // Mark the i:th variable to "set" (i.e. having a value) with
        // 'val' true.
        void set_isset(uint i, my_bool val);

        // Push the variable 'name' to the frame.
        void push_var(LEX_STRING *name,
                      enum enum_field_types type, sp_param_mode_t mode);

        // Pop 'num' variables from the frame.
        void pop_var(uint num = 1);

        // Find variable by name
        sp_pvar_t *find_pvar(LEX_STRING *name);

        // Find variable by index
        sp_pvar_t *find_pvar(uint i);

        // Push label 'name' of instruction index 'ip' to the label context
        sp_label_t *push_label(char *name, uint ip);

        // Find label 'name' in the context
        sp_label_t *find_label(char *name);

        // Return the last pushed label
        sp_label_t *last_label();

        // Return and remove the last pushed label.
        sp_label_t *pop_label();

        // Push a condition to the context
        void push_cond(LEX_STRING *name, sp_cond_type_t *val);

        // Pop a 'num' condition from the context
        void pop_cond(uint num);

        // Find a condition in the context
        sp_cond_type_t *find_cond(LEX_STRING *name);

        // Increase the handler count
        void add_handler();

        // Returns the handler count
        uint handlers();

	// Push a cursor
        void push_cursor(LEX_STRING *name);

	// Find a cursor
	my_bool find_cursor(LEX_STRING *name, uint *poff);

	// Pop 'num' cursors
	void pop_cursor(uint num);

	// Return the number of cursors
	uint cursors();
      }

Run-time context (call frame): sp_rcontext.h:

#define SP_HANDLER_NONE      0
    #define SP_HANDLER_EXIT      1
    #define SP_HANDLER_CONTINUE  2
    #define SP_HANDLER_UNDO      3

    typedef struct
    {
      struct sp_cond_type *cond;
      uint handler;             // Location of handler
      int type;
      uint foffset;             // Frame offset for the handlers declare level
    } sp_handler_t;

    class sp_rcontext
    {
      // 'fsize' is the max size of the context, 'hmax' the number of handlers,
      // 'cmax' the number of cursors
      sp_rcontext(uint fsize, uint hmax, uint cmax);

      // Push value (parameter) 'i' to the frame
      void push_item(Item *i);

      // Set slot 'idx' to value 'i'
      void set_item(uint idx, Item *i);

      // Return the item in slot 'idx'
      Item *get_item(uint idx);

      // Set the "out" index 'oidx' for slot 'idx. If it's an IN slot,
      // use 'oidx' -1.
      void set_oindex(uint idx, int oidx);

      // Return the "out" index for slot 'idx'
      int get_oindex(uint idx);

      // Set the FUNCTION result
      void set_result(Item *i);

      // Get the FUNCTION result
      Item *get_result();

      // Push handler at location 'h' for condition 'cond'. 'f' is the
      // current variable frame size.
      void push_handler(sp_cond_type_t *cond, uint h, int type, uint f);

      // Pop 'count' handlers
      void pop_handlers(uint count);

      // Find a handler for this error. This sets the state for a found
      // handler in the context. If called repeatedly without clearing,
      // only the first call's state is kept.
      int find_handler(uint sql_errno);

      // Returns 1 if a handler has been found, with '*ip' and '*fp' set
      // to the handler location and frame size respectively.
      int found_handler(uint *ip, uint *fp);

      // Clear the found handler state.
      void clear_handler();

      // Push a return address for a CONTINUE handler
      void push_hstack(uint ip);

      // Pop the CONTINUE handler return stack
      uint pop_hstack();

      // Save variables from frame index 'fp' and up.
      void save_variables(uint fp);

      // Restore saved variables from frame index 'fp' and up.
      void restore_variables(uint fp);

      // Push a cursor for the statement (lex)
      void push_cursor(LEX *lex);

      // Pop 'count' cursors
      void pop_cursors(uint count);

      // Pop all cursors
      void pop_all_cursors();

      // Get the 'i'th cursor
      sp_cursor *get_cursor(uint i);

    }

The procedure: sp_head.h:

#define TYPE_ENUM_FUNCTION  1
      #define TYPE_ENUM_PROCEDURE 2

      class sp_head
      {
        int m_type;             // TYPE_ENUM_FUNCTION or TYPE_ENUM_PROCEDURE

        sp_head();

        void init(LEX_STRING *name, LEX *lex, LEX_STRING *comment, char suid);

        // Store this procedure in the database. This is a wrapper around
        // the function sp_create_procedure().
        int create(THD *);

        // Invoke a FUNCTION
        int
        execute_function(THD *thd, Item **args, uint argcount, Item **resp);

        // CALL a PROCEDURE
        int
        execute_procedure(THD *thd, List<Item> *args);

        // Add the instruction to this procedure.
        void add_instr(sp_instr *);

        // Returns the number of instructions.
        uint instructions();

        // Returns the last instruction
        sp_instr *last_instruction();

        // Resets lex in 'thd' and keeps a copy of the old one.
        void reset_lex(THD *);

        // Restores lex in 'thd' from our copy, but keeps some status from the
        // one in 'thd', like ptr, tables, fields, etc.
        void restore_lex(THD *);

        // Put the instruction on the backpatch list, associated with
        // the label.
        void push_backpatch(sp_instr *, struct sp_label *);

        // Update all instruction with this label in the backpatch list to
        // the current position.
        void backpatch(struct sp_label *);

        // Returns the SP name (with optional length in '*lenp').
        char *name(uint *lenp = 0);

        // Returns the result type for a function
        Item_result result();

        // Sets various attributes
        void sp_set_info(char *creator, uint creatorlen,
                         longlong created, longlong modified,
                         bool suid, char *comment, uint commentlen);
      }

Instructions

The base class

class sp_instr
        {
          // 'ip' is the index of this instruction
          sp_instr(uint ip);

          // Execute this instrution.
          // '*nextp' will be set to the index of the next instruction
          // to execute. (For most instruction this will be the
          // instruction following this one.)
          // Returns 0 on success, non-zero if some error occurred.
          virtual int execute(THD *, uint *nextp);
        }

Statement instruction

        class sp_instr_stmt : public sp_instr
        {
          sp_instr_stmt(uint ip);

          int execute(THD *, uint *nextp);

          // Set the statement's Lex
          void set_lex(LEX *);

          // Return the statement's Lex
          LEX *get_lex();
        }

SET instruction

class sp_instr_set : public sp_instr
        {
          // 'offset' is the variable's frame offset, 'val' the value,
          // and 'type' the variable type.
          sp_instr_set(uint ip,
                       uint offset, Item *val, enum enum_field_types type);

          int execute(THD *, uint *nextp);
        }

Unconditional jump

class sp_instr_jump : public sp_instr
        {
          // No destination, must be set.
          sp_instr_jump(uint ip);

          // 'dest' is the destination instruction index.
          sp_instr_jump(uint ip, uint dest);

          int execute(THD *, uint *nextp);

          // Set the destination instruction 'dest'.
          void set_destination(uint dest);
        }

Conditional jump

class sp_instr_jump_if_not : public sp_instr_jump
        {
          // Jump if 'i' evaluates to false. Destination not set yet.
          sp_instr_jump_if_not(uint ip, Item *i);

          // Jump to 'dest' if 'i' evaluates to false.
          sp_instr_jump_if_not(uint ip, Item *i, uint dest);

          int execute(THD *, uint *nextp);
        }

Return a function value

class sp_instr_freturn : public sp_instr
        {
          // Return the value 'val'
          sp_instr_freturn(uint ip, Item *val, enum enum_field_types type);
          
          int execute(THD *thd, uint *nextp);
        }

Push a handler and jump

class sp_instr_hpush_jump : public sp_instr_jump
        {
          // Push handler of type 'htype', with current frame size 'fp'
          sp_instr_hpush_jump(uint ip, int htype, uint fp);

          int execute(THD *thd, uint *nextp);

          // Add condition for this handler
          void add_condition(struct sp_cond_type *cond);
        }

Pops handlers

class sp_instr_hpop : public sp_instr
        {
          // Pop 'count' handlers
          sp_instr_hpop(uint ip, uint count);

          int execute(THD *thd, uint *nextp);
        }

Return from a CONTINUE handler

class sp_instr_hreturn : public sp_instr
        {
          // Return from handler, and restore variables to 'fp'.
          sp_instr_hreturn(uint ip, uint fp);

          int execute(THD *thd, uint *nextp);
        }

Push a CURSOR

class sp_instr_cpush : public sp_instr_stmt
	{
          // Push a cursor for statement 'lex'
	  sp_instr_cpush(uint ip, LEX *lex);

	  int execute(THD *thd, uint *nextp);
        }

Pop CURSORs

class sp_instr_cpop : public sp_instr_stmt
	{
          // Pop 'count' cursors
	  sp_instr_cpop(uint ip, uint count);

	  int execute(THD *thd, uint *nextp);
        }

Open a CURSOR

class sp_instr_copen : public sp_instr_stmt
	{
          // Open the 'c'th cursor
	  sp_instr_copen(uint ip, uint c);

	  int execute(THD *thd, uint *nextp);
        }

Close a CURSOR

class sp_instr_cclose : public sp_instr
	{
          // Close the 'c'th cursor
	  sp_instr_cclose(uint ip, uint c);

	  int execute(THD *thd, uint *nextp);
        }

Fetch a row with CURSOR

class sp_instr_cfetch : public sp_instr
	{
          // Fetch next with the 'c'th cursor
	  sp_instr_cfetch(uint ip, uint c);

	  int execute(THD *thd, uint *nextp);

	  // Add a target variable for the fetch
	  void add_to_varlist(struct sp_pvar *var);
        }

Utility functions: sp.h

#define SP_OK                 0
      #define SP_KEY_NOT_FOUND     -1
      #define SP_OPEN_TABLE_FAILED -2
      #define SP_WRITE_ROW_FAILED  -3
      #define SP_DELETE_ROW_FAILED -4
      #define SP_GET_FIELD_FAILED  -5
      #define SP_PARSE_ERROR       -6

      // Finds a stored procedure given its name. Returns NULL if not found.
      sp_head *sp_find_procedure(THD *, LEX_STRING *name);

      // Store the procedure 'name' in the database. 'def' is the complete
      // definition string ("create procedure ...").
      int sp_create_procedure(THD *,
                              char *name, uint namelen,
                              char *def, uint deflen,
                              char *comment, uint commentlen, bool suid);

      // Drop the procedure 'name' from the database.
      int sp_drop_procedure(THD *, char *name, uint namelen);

      // Finds a stored function given its name. Returns NULL if not found.
      sp_head *sp_find_function(THD *, LEX_STRING *name);

      // Store the function 'name' in the database. 'def' is the complete
      // definition string ("create function ...").
      int sp_create_function(THD *,
                             char *name, uint namelen,
                             char *def, uint deflen,
                             char *comment, uint commentlen, bool suid);

      // Drop the function 'name' from the database.
      int sp_drop_function(THD *, char *name, uint namelen);

The cache: sp_cache.h

/* Initialize the SP caching once at startup */
      void sp_cache_init();

      /* Clear the cache *cp and set *cp to NULL */
      void sp_cache_clear(sp_cache **cp);

      /* Insert an SP to cache. If **cp points to NULL, it's set to a
         new cache */
      void sp_cache_insert(sp_cache **cp, sp_head *sp);

      /* Lookup an SP in cache */
      sp_head *sp_cache_lookup(sp_cache **cp, char *name, uint namelen);

      /* Remove an SP from cache */
      void sp_cache_remove(sp_cache **cp, sp_head *sp);

The mysql.proc schema

This is the mysql.proc table used in MariaDB 10.4:

CREATE TABLE `proc` (
  `db` char(64) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL DEFAULT '',
  `name` char(64) NOT NULL DEFAULT '',
  `type` enum('FUNCTION','PROCEDURE','PACKAGE','PACKAGE BODY') NOT NULL,
  `specific_name` char(64) NOT NULL DEFAULT '',
  `language` enum('SQL') NOT NULL DEFAULT 'SQL',
  `sql_data_access` enum('CONTAINS_SQL','NO_SQL','READS_SQL_DATA','MODIFIES_SQL_DATA') NOT NULL DEFAULT 'CONTAINS_SQL',
  `is_deterministic` enum('YES','NO') NOT NULL DEFAULT 'NO',
  `security_type` enum('INVOKER','DEFINER') NOT NULL DEFAULT 'DEFINER',
  `param_list` blob NOT NULL,
  `returns` longblob NOT NULL,
  `body` longblob NOT NULL,
  `definer` char(141) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL DEFAULT '',
  `created` timestamp NOT NULL DEFAULT current_timestamp() ON UPDATE current_timestamp(),
  `modified` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
  `sql_mode` set('REAL_AS_FLOAT','PIPES_AS_CONCAT','ANSI_QUOTES','IGNORE_SPACE','IGNORE_BAD_TABLE_OPTIONS','ONLY_FULL_GROUP_BY','NO_UNSIGNED_SUBTRACTION','NO_DIR_IN_CREATE','POSTGRESQL','ORACLE','MSSQL','DB2','MAXDB','NO_KEY_OPTIONS','NO_TABLE_OPTIONS','NO_FIELD_OPTIONS','MYSQL323','MYSQL40','ANSI','NO_AUTO_VALUE_ON_ZERO','NO_BACKSLASH_ESCAPES','STRICT_TRANS_TABLES','STRICT_ALL_TABLES','NO_ZERO_IN_DATE','NO_ZERO_DATE','INVALID_DATES','ERROR_FOR_DIVISION_BY_ZERO','TRADITIONAL','NO_AUTO_CREATE_USER','HIGH_NOT_PRECEDENCE','NO_ENGINE_SUBSTITUTION','PAD_CHAR_TO_FULL_LENGTH','EMPTY_STRING_IS_NULL','SIMULTANEOUS_ASSIGNMENT') NOT NULL DEFAULT '',
  `comment` text CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `character_set_client` char(32) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL,
  `collation_connection` char(32) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL,
  `db_collation` char(32) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL,
  `body_utf8` longblob DEFAULT NULL,
  `aggregate` enum('NONE','GROUP') NOT NULL DEFAULT 'NONE',
  PRIMARY KEY (`db`,`name`,`type`)
) ENGINE=Aria DEFAULT CHARSET=utf8 PAGE_CHECKSUM=1 TRANSACTIONAL=1 COMMENT='Stored Procedures'

This page is licensed: CC BY-SA / Gnu FDL

MariaDB Memory Usage

How MariaDB uses memory

Connect Memory Usage

When creating a connection, a THD object is created for that connection. This contains all connection information and also caches to speed up queries and avoid frequent malloc() calls.

When creating a new connection, the following malloc() calls are done for the THD:

The following information is the state in MariaDB 10.6.1 when compiled without debugging.

Local Thread Memory

This is part of select memory_used from information_schema.processlist.

Amount allocated | Where allocated | Description
26646 | THD::THD | Allocation of THD object
256 | Statement_map::Statement_map(), my_hash_init(key_memory_prepared_statement_map, &st_hash | Prepared statements
256 | my_hash_init(key_memory_prepared_statement_map, &names_hash | Names of used prepared statements
128 | wsrep_wfc(), Opt_trace_context(), dynamic_array()
1024 | Diagnostics_area::init(), init_sql_alloc(PSI_INSTRUMENT_ME, &m_warn_root
120 | Session_sysvars_tracker, global_system_variables.session_track_system_variables | Tracking of changed session variables
280 | THD::THD, my_hash_init(key_memory_user_var_entry, &user_vars
280 | THD::THD, my_hash_init(PSI_INSTRUMENT_ME, &sequences | Cache of used sequences
1048 | THD::THD, m_token_array= my_malloc(PSI_INSTRUMENT_ME, max_digest_length
16416 | CONNECT::create_thd(), my_net_init(), net_allocate_new_packet() | For reading data from the connected user
16416 | check_connection(), thd->packet.alloc() | For sending data to the connected user

Objects Stored in THD->memroot During Connect

Amount allocated | Where allocated | Description
72 | send_server_handshake_packet, mpvio->cached_server_packet.pkt=
64 | parse_client_handshake_packet, thd->copy_with_error(...db,db_len)
32 | parse_client_handshake_packet, sctx->user=
368 | ACL_USER::copy(), root= | Allocation of ACL_USER object
56 | ACL_USER::copy(), dst->user= safe_lexcstrdup_root(root, user)
56 | ACL_USER::copy() | Allocation of other connect attributes
56 | ACL_USER::copy()
64 | ACL_USER::copy()
64 | ACL_USER::copy()
32 | mysql_change_db() | Store current db in THD
48 | dbname_cache->insert(db_name) | Store db name in db name cache
40 | mysql_change_db(), my_register_filename(db.opt) | Store filename db.opt
8216 | load_db_opt(), init_io_cache() | Disk cache for reading db.opt
1112 | load_db_opts(), put_dbopts() | Cache default database parameters

State at First Call to mysql_execute_command

(gdb) p thd->status_var.local_memory_used
$24 = 75496
(gdb) p thd->status_var.global_memory_used
$25 = 17544
(gdb) p thd->variables.query_prealloc_size
$30 = 24576
(gdb) p thd->variables.trans_prealloc_size
$37 = 4096

This page is licensed: CC BY-SA / Gnu FDL

Using MariaDB with Your Programs (API)

Progress Reporting

MariaDB supports progress reporting for some long running commands.

What is Progress Reporting?

Progress reporting means that:

  • There is a Progress column in SHOW PROCESSLIST which shows the total progress (0-100%)

  • INFORMATION_SCHEMA.PROCESSLIST has three columns which allow you to see in which process stage we are and how much of that stage is completed:

    • STAGE

    • MAX_STAGE

    • PROGRESS (within current stage).

  • The client receives progress messages which it can display to the user to indicate how long the command will take.

We have separate progress reporting for stages because different stages take different amounts of time.

Supported Commands

Currently, the following commands can send progress report messages to the client:

  • ALTER TABLE

  • CREATE INDEX

  • DROP INDEX

  • LOAD DATA INFILE (not LOAD DATA LOCAL INFILE, as in that case we don't know the size of the file).

Some Aria storage engine operations also support progress messages:

  • CHECK TABLE

  • REPAIR TABLE

  • ANALYZE TABLE

  • OPTIMIZE TABLE

Limitations

Although the above commands support progress reporting, there are some limitations to what progress is reported. To be specific, when executing one of these commands against an InnoDB table with ALGORITHM=INPLACE (which is the default in MariaDB 10.0+), progress is only reported during the merge sort phase while reconstructing indexes.

Enabling and Disabling Progress Reporting

mysqld (the MariaDB server) automatically sends progress report messages to clients that support the new protocol, using the value of the progress_report_time variable. They are sent every max(global.progress_report_time, progress_report_time) seconds (by default 5). You can disable the sending of progress report messages to the client by setting either the local variable (affects only the current connection) or the global variable (affects all connections) to 0.

If the extra column in SHOW PROCESSLIST gives you a compatibility problem, you can disable it by starting mysqld with the --old flag.

Clients Which Support Progress Reporting

  • The mariadb command line client

  • The mytop that comes with MariaDB has a '%' column which shows the progress.

Progress Reporting in the mariadb Command Line Client

Progress reporting is enabled by default in the mariadb client. You can disable it with --disable-progress-reports. It is automatically disabled in batch mode.

When enabled, for every supported command you get a progress report like:

ALTER TABLE my_mail ENGINE=aria;
Stage: 1 of 2 'copy to tmp table'  5.37% of stage done

This is updated every progress_report_time seconds (the default is 5). If the global progress_report_time is higher, that value is used instead. You can also disable progress reporting by setting the variable to 0.

How to Add Support for Progress Reporting to a Client

You need to use the MariaDB 5.3 or later client library. You can check that the library supports progress reporting by doing:

#ifdef CLIENT_PROGRESS

To enable progress reporting to the client you need to add CLIENT_PROGRESS to the connect_flag in mysql_real_connect():

mysql_real_connect(mysql, host, user, password,
                   database, opt_mysql_port, opt_mysql_unix_port,
                   connect_flag | CLIENT_PROGRESS);

Then you need to provide a callback function for progress reports:

static void report_progress(const MYSQL *mysql, uint stage, uint max_stage,
                            double progress, const char *proc_info,
                            uint proc_info_length);

mysql_options(&mysql, MYSQL_PROGRESS_CALLBACK, (void*) report_progress);

The above report_progress function will be called for each progress message.

This is the implementation used by mysql.cc:

uint last_progress_report_length;

static void report_progress(const MYSQL *mysql, uint stage, uint max_stage,
                            double progress, const char *proc_info,
                            uint proc_info_length)
{
  uint length= printf("Stage: %d of %d '%.*s' %6.3g%% of stage done",
                      stage, max_stage, proc_info_length, proc_info, 
                      progress);
  if (length < last_progress_report_length)
    printf("%*s", last_progress_report_length - length, "");
  putc('\r', stdout);
  fflush(stdout);
  last_progress_report_length= length;
}

If you want only one number for the total progress, you can calculate it with:

double total_progress=
 ((stage -1) / (double) max_stage * 100.00 + progress / max_stage);

Note: proc_info is totally independent of stage. You can have many different proc_info values within a stage. The idea behind proc_info is to give the user more information about what the server is doing.

How to Add Support for Progress Reporting to a Storage Engine

The functions to use for progress reporting are:

void thd_progress_init(MYSQL_THD thd, unsigned int max_stage);

Initialize progress reporting with stages. This is mainly used for commands that are executed entirely within the engine, like CHECK TABLE. You should not use this for operations that could be called by, for example, ALTER TABLE, as that has already called the function.

max_stage is the number of stages your storage engine will have.

void thd_progress_report(MYSQL_THD thd, unsigned long long progress,
                         unsigned long long max_progress);

The above is used for reporting progress.

  • progress is how much of the file/rows/keys you have gone through.

  • max_progress is the max number of rows you will go through.

You can call this with varying numbers, but normally the ratio progress/max_progress should be increasing.

This function can be called even if you are not using stages, for example when enabling keys as part of ALTER TABLE or ADD INDEX.

void thd_progress_next_stage(MYSQL_THD thd);

Call this to go to the next stage in a multi-stage process initiated by thd_progress_init().

void thd_progress_end(MYSQL_THD thd);

End progress reporting; Sets 'Progress' back to 0 in SHOW PROCESSLIST.

const char *thd_proc_info(thd, 'stage name');

This sets the name of the current status/stage that is displayed in SHOW PROCESSLIST and in the client. It's recommended that you call this between stages, and thus before thd_progress_report() and thd_progress_next_stage().

This function returns the last used proc_info. It's recommended that you restore proc_info to its original value when you are done processing.

Note: thd_proc_info() is totally independent of stage. You can have many different proc_info values within a stage to give the user more information about what is going on.
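Putting these together, a storage-engine operation with two stages might use the API roughly as in the following sketch. Everything except the thd_progress_* and thd_proc_info() calls, including the function name and the row/key counts, is hypothetical:

/* Hypothetical two-stage check routine illustrating the progress API. */
static int example_check(MYSQL_THD thd, unsigned long long n_rows,
                         unsigned long long n_keys)
{
  const char *old_proc_info;
  unsigned long long i;

  thd_progress_init(thd, 2);                     /* two stages: rows, then keys */

  old_proc_info= thd_proc_info(thd, "checking rows");
  for (i= 0; i < n_rows; i++)
  {
    /* ... check one row ... */
    thd_progress_report(thd, i + 1, n_rows);
  }

  thd_progress_next_stage(thd);                  /* stage 2 of 2 */
  thd_proc_info(thd, "checking keys");
  for (i= 0; i < n_keys; i++)
  {
    /* ... check one key ... */
    thd_progress_report(thd, i + 1, n_keys);
  }

  thd_progress_end(thd);                         /* set Progress back to 0 */
  thd_proc_info(thd, old_proc_info);             /* restore previous stage name */
  return 0;
}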

Examples to Look at in the MariaDB Source:

  • client/mysql.cc for an example of how to use reporting.

  • libmysql/client.c:cli_safe_read() to see how progress packets are handled in the client.

  • sql/protocol.cc::net_send_progress_packet() for how progress packets are handled in the server.

Format of Progress Packets

The progress packet is sent as an error packet with error number 65535.

It contains the following data (in addition to the error header):

Option | Number of bytes | Other info
1 | 1 | Number of strings. For future
Stage | 1 | Stage from 1 - Max_stage
Max_stage | 1 | Max number of stages
Progress | 3 | Progress in % * 1000
Status_length | 1-2 | Packet length of string in net_field_length() format
Status | Status_length | Status / Stage name

See Also

  • What is MariaDB 5.3

This page is licensed: CC BY-SA / Gnu FDL

libMariaDB

libmysqld

Articles about libmysqld.so, the embedded MariaDB server

Embedded MariaDB Interface

The embedded MariaDB server, libmysqld, has an interface identical to that of the C client library, libmysqlclient.

The normal usage of the embedded server is to use the normal mysql.h include file in your application and link with libmysqld instead of libmysqlclient.

The intention is that one should be able to move from a server/client version of MariaDB to a single server version of MariaDB by just changing which library you link with.

This means that the embedded C client API only changes when the normal C API changes, usually only between major releases.

The only major change required in your application if you are going to use the embedded server is that you have to call the following functions from your application:

int mysql_library_init(int argc, char **argv, char **groups)
void mysql_library_end(void);

This is also safe to do when using the standard C client library.
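As a minimal sketch of this (the option values, option group names and the data directory below are only examples, not required values):

#include <mysql.h>

static char *server_args[]= {
  "myapp",                       /* argv[0] is ignored, but must be present */
  "--datadir=/path/to/data",     /* example data directory */
  "--skip-grant-tables"
};
static char *server_groups[]= { "embedded", "server", "myapp_SERVER", (char *) NULL };

int main(void)
{
  if (mysql_library_init(sizeof(server_args) / sizeof(char *),
                         server_args, server_groups))
    return 1;                    /* embedded server failed to start */

  /* ... normal C API usage here: mysql_init(), mysql_real_connect(), queries ... */

  mysql_library_end();           /* shut the embedded server down cleanly */
  return 0;
}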

Notes

  • libmysqld.so has many more exported symbols than the C library to allow one to expose and use more parts of MariaDB. In normal applications one should not use them, as they may change between every release.

  • Before MariaDB 5.5.60 (MariaDB 10.0.35, MariaDB 10.1.33, MariaDB 10.2.15, MariaDB 10.3.7), the embedded server library did not support SSL when it was used to connect to remote servers.

  • Starting with MariaDB 10.5 the embedded server library and related test binaries are no longer part of binary tarball release archives.

See Also

  • mysql_library_init

  • mariadb client with MariaDB embedded

This page is licensed: CC BY-SA / Gnu FDL

The mariadb-test and mariadb-test-embedded Programs

The mariadb-test program runs a test case against a MariaDB or MySQL server and optionally compares the output with a result file. This program reads input written in a special test language. Typically, you invoke mariadb-test via mariadb-test-run.pl rather than invoking it directly.

mariadb-test_embedded is similar but is built with support for the libmariadbd embedded server.

Features of mariadb-test:

  • Can send SQL statements to the server for execution

  • Can execute external shell commands

  • Can test whether the result from an SQL statement or shell command is as expected

  • Can connect to one or more standalone mariadbd servers and switch between connections

  • Can connect to an embedded server (libmariadbd), if MariaDB is compiled with support for libmariadbd. (In this case, the executable is named mariadb-test_embedded rather than mariadb-test.)

By default, mariadb-test reads the test case on the standard input. To run mariadb-test this way, you normally invoke it like this:

shell> mariadb-test [options] [db_name] < test_file

You can also name the test case file with a --test-file=file_name option.

The exit value from mariadb-test is 0 for success, 1 for failure, and 62 if it skips the test case (for example, if after checking some preconditions it decides not to run the test).

Options

mariadb-test supports the following options:

Option
Description

--help, -?

Display a help message and exit.

--basedir=dir, -b dir

The base directory for tests.

--character-sets-dir=path

The directory where character sets are installed.

--compress, -C

Compress all information sent between the client and the server if both support compression.

--connect-timeout=N

This can be used to set the MYSQL_OPT_CONNECT_TIMEOUT parameter of mysql_options to change the number of seconds before an unsuccessful connection attempt times out.

--continue-on-error

Continue the test even if we got an error. This is mostly useful when testing a storage engine, to see which statements from a test file it can execute, or to find all syntax errors in a newly created big test file.

--cursor-protocol

Use cursors for prepared statements.

--database=db_name, -D db_name

The default database to use.

--debug[=debug_options], -#[debug_options]

Write a debugging log if MariaDB is built with debugging support. The default debug_options value is d:t:S:i:O,/tmp/mysqltest.trace on Unix and d:t:i:O,\mysqld.trace on Windows.

--debug-check

Print some debugging information when the program exits.

--debug-info

Print debugging information and memory and CPU usage statistics when the program exits.

--host=host_name, -h host_name

Connect to the server on the given host.

--logdir=dir_name

The directory to use for log files.

--mark-progress

Write the line number and elapsed time to test_file.progress.

--max-connect-retries=num

The maximum number of connection attempts when connecting to server.

--max-connections=num

The maximum number of simultaneous server connections per client (that is, per test). If not set, the maximum is 128. Minimum allowed limit is 8, maximum is 5120.

--no-defaults

Do not read default options from any option files. If used, this must be the first option.

--non-blocking-api

Use the non-blocking client API for communication.

--overlay-dir=name

Overlay directory.

--password[=password], -p[password]

The password to use when connecting to the server. If you use the short option form (-p), you cannot have a space between the option and the password. If you omit the password value following the --password or -p option on the command line, you are prompted for one.

--plugin-dir=path

Directory for client-side plugins.

--port=port_num, -P port_num

The TCP/IP port number to use for the connection, or 0 for default to, in order of preference, my.cnf, $MYSQL_TCP_PORT, /etc/services, built-in default (3306).

--prologue=name

Include the contents of the given file before processing the contents of the test file. The included file should have the same format as other mariadb-test test files. This option has the same effect as putting a --source file_name command as the first line of the test file.

--protocol=name

The protocol to use for connection (tcp, socket, pipe, memory).

--ps-protocol

Use the prepared-statement protocol for communication.

--quiet

Suppress all normal output. This is a synonym for --silent.

--record, -r

Record the output that results from running the test file into the file named by the --result-file option, if that option is given. It is an error to use this option without also using --result-file.

--result-file=file_name, -R file_name

This option specifies the file for test case expected results. --result-file, together with --record, determines how mariadb-test treats the actual and expected results for a test case: If the test produces no results, mariadb-test exits with an error message to that effect, unless --result-file is given and the named file is an empty file. Otherwise, if --result-file is not given, mariadb-test sends test results to the standard output. With --result-file but not --record, mariadb-test reads the expected results from the given file and compares them with the actual results; if the results do not match, mariadb-test writes a reject file in the same directory as the result file, outputs a diff of the two files, and exits with an error. With both --result-file and --record, mariadb-test updates the given file by writing the actual test results to it.

--result-format-version=#

Version of the result file format to use.

--server-arg=value, -A value

Pass the argument as an argument to the embedded server. For example, --server-arg=--tmpdir=/tmp or --server-arg=--core. Up to 64 arguments can be given.

--server-file=file_name, -F file_name

Read arguments for the embedded server from the given file. The file should contain one argument per line.

--shared-memory-base-name

Shared-memory name to use for Windows connections using shared memory to a local server (started with the --shared-memory option). Case-sensitive.

--silent, -s

Suppress all normal output.

--sleep=num, -T num

Cause all sleep commands in the test case file to sleep num seconds. This option does not affect real_sleep commands. An option value of 0 can be used, which effectively disables sleep commands in the test case.

--socket=path, -S path

The socket file to use when connecting to localhost (which is the default host).

--sp-protocol

Execute DML statements within a stored procedure. For every DML statement, mariadb-test creates and invokes a stored procedure that executes the statement rather than executing the statement directly.

--ssl

Enable TLS for secure connection (automatically enabled with other flags). Disable with --skip-ssl.

--ssl-ca=name

CA file in PEM format (check OpenSSL docs, implies --ssl).

--ssl-capath=name

CA directory (check OpenSSL docs, implies --ssl).

--ssl-cert=name

X509 cert in PEM format (implies --ssl).

--ssl-cipher=name

SSL cipher to use (implies --ssl).

--ssl-key=name

X509 key in PEM format (implies --ssl).

--ssl-crl=name

Certificate revocation list (implies --ssl).

--ssl-crlpath=name

Certificate revocation list path (implies --ssl).

--ssl-verify-server-cert

Verify server's "Common Name" in its cert against hostname used when connecting. This option is disabled by default.

--suite-dir=name

Suite directory.

--tail-lines=nn

Specify how many lines of the result to include in the output if the test fails because an SQL statement fails. The default is 0, meaning no lines of result printed.

--test-file=file_name, -x file_name

Read test input from this file. The default is to read from the standard input.

--timer-file=file_name, -m file_name

If given, the number of microseconds spent running the test will be written to this file. This is used by mariadb-test-run.pl for its reporting.

--tmpdir=dir_name, -t dir_name

The temporary directory where socket files are created.

--user=user_name, -u user_name

The user name to use when connecting to the server.

--verbose, -v

Verbose mode. Print out more information about what the program does.

--version, -V

Display version information and exit.

--view-protocol

Every SELECT statement is wrapped inside a view.

--wait-longer-for-timeouts

Wait longer for timeouts. Useful when running under valgrind.

See Also

  • New Features for mysqltest in MariaDB

This page is licensed: GPLv2

Non-Blocking Client Library

The MariaDB client library (starting with version 5.5.21) and MySQL Connector/C (starting with version 2.1.0) support non-blocking operations.

About Non-blocking Operation in the Client Library

MariaDB, starting with version 5.5.21 supports non-blocking operations in the client-library. This allows an application to start a query or other operation against the database, and then continue to do other work (in the same thread) while the request is sent over the network, the query is processed in the server, and the result travels back. As parts of the result become ready, the application can — at its leisure — call back into the library to continue processing, repeating this until the operation is completed.

Non-blocking operation is implemented entirely within the client library. This means no special server support is necessary and non-blocking operation works with any version of the MariaDB or MySQL server, the same as the normal blocking API. It also means that it is not possible to have two queries running at the same time on the same connection (this is a protocol limitation). But a single thread can have any number of non-blocking queries running at the same time, each using its own MYSQL connection object.

Non-blocking operation is useful when an application needs to run a number of independent queries in parallel at the same time, to speed up operation compared to running them sequentially one after the other. This could be multiple queries against a single server (to better utilize multiple CPU cores and/or a high-capacity I/O system on the server), or it could be queries against multiple servers (e.g. SHOW STATUS against all running servers for monitoring, or a map/reduce-like operation against a big sharded database).

Non-blocking operation is also very useful in applications that are already written in a non-blocking style, for example using a framework like libevent, or a GUI application using an event loop. Using the non-blocking client library allows the integration of database queries into such applications, without the risk of long-running queries "hanging" the user interface or stalling the event loop, and without having to manually spawn separate threads to run the queries and re-synchronize with them to get the results back.

In this context, "blocking" means the situation where communication on the network socket to the server has to wait while processing the query. Waiting can be necessary because the server has not yet had time to process the query, or because the data needs to travel over the network from the server, or even because the first part of a large request needs to be sent out on the network before local socket buffers can accept the last part. Whenever such a wait is necessary, control returns to the application. The application will then run select() or poll() (or something similar) to detect when any wait condition is satisfied, and then call back into the library to continue processing.
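As a rough sketch of that pattern (not taken from the MariaDB sources; error handling omitted), a single query could be driven to completion like this:

#include <poll.h>
#include <string.h>
#include <mysql.h>

/* Translate the library's wait flags into a poll() call and back. */
static int wait_for_mysql(MYSQL *mysql, int status)
{
  struct pollfd pfd;
  int timeout, res;

  pfd.fd= mysql_get_socket(mysql);
  pfd.events= (status & MYSQL_WAIT_READ   ? POLLIN  : 0) |
              (status & MYSQL_WAIT_WRITE  ? POLLOUT : 0) |
              (status & MYSQL_WAIT_EXCEPT ? POLLPRI : 0);
  /* Honour a pending timeout if the library asked for one. */
  timeout= (status & MYSQL_WAIT_TIMEOUT) ?
           1000 * (int) mysql_get_timeout_value(mysql) : -1;

  res= poll(&pfd, 1, timeout);
  if (res == 0)
    return MYSQL_WAIT_TIMEOUT;
  status= 0;
  if (pfd.revents & POLLIN)  status|= MYSQL_WAIT_READ;
  if (pfd.revents & POLLOUT) status|= MYSQL_WAIT_WRITE;
  if (pfd.revents & POLLPRI) status|= MYSQL_WAIT_EXCEPT;
  return status;
}

/* Run one query non-blockingly on an already-connected handle. */
static int run_query(MYSQL *mysql, const char *query)
{
  int err, status;

  status= mysql_real_query_start(&err, mysql, query, strlen(query));
  while (status)                 /* non-zero: the library is waiting for I/O */
  {
    status= wait_for_mysql(mysql, status);
    status= mysql_real_query_cont(&err, mysql, status);
  }
  return err;                    /* 0 on success, as with mysql_real_query() */
}

In an event-driven application the poll() call would of course be replaced by registering the socket and wait flags with the framework's own event loop, as done in the example program mentioned below.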

An example program is available in the MariaDB source tree:

tests/async_queries.c

It uses libevent to run a set of queries in parallel from within a single thread / event loop. This is a good example of how to integrate non-blocking query processing into an event-based framework.

The non-blocking API in the client library is entirely optional. The new library is completely ABI- and source-compatible with existing applications. Also, applications not using non-blocking operations are not affected, nor is there any significant performance penalty for having support for non-blocking operations in the library for applications which do not use them.

The library internally uses co-routines, and requires a co-routine implementation to work. Native implementations are included for i386, amd64, and (since Connector/C version 3.3.12) aarch64 architectures. For other architectures, a fallback to ucontext is automatically used if available. An alternate fallback boost::context can also be used instead of ucontext by building with -DWITH_BOOST_CONTEXT=ON (boost::context is not used by default). If no co-routine implementation is available the non-blocking operations are disabled and will not work.

This page is licensed: CC BY-SA / Gnu FDL

Non-blocking API Reference

Here is a list of all functions in the non-blocking client API and their parameters. Apart from operating in a non-blocking way, they all work exactly the same as their blocking counterparts, so their exact semantics can be obtained from the documentation of the normal client API.

The API also contains the following three functions which are used to get the socket fd and timeout values when waiting for events to occur:

my_socket mysql_get_socket(const MYSQL *mysql)

Return the descriptor of the socket used for the connection.

unsigned int STDCALL mysql_get_timeout_value(const MYSQL *mysql)

This should only be called when a _start() or _cont() function returns a value with the MYSQL_WAIT_TIMEOUT flag set. In this case, it returns the value, in seconds, after which a timeout has occurred and the application should call the appropriate _cont() function passing MYSQL_WAIT_TIMEOUT as the event that occurred.

This is used to handle connection and read timeouts.

unsigned int STDCALL mysql_get_timeout_value_ms(const MYSQL *mysql)

This function is available starting from MariaDB 5.5.28 and MariaDB 10.0.0.

Like mysql_get_timeout_value(), this should only be called when a _start() or _cont() function returns a value with the MYSQL_WAIT_TIMEOUT flag set. In this case, it returns the value, in milliseconds, after which a timeout has occurred and the application should call the appropriate _cont() function passing MYSQL_WAIT_TIMEOUT as the event that occurred.

The difference to mysql_get_timeout_value() is that this provides millisecond resolution for timeouts, rather than just whole seconds. In MariaDB 10.0, internal timeouts can now be in milliseconds, while in 5.5 and below it was only whole seconds.

This millisecond version is also provided in MariaDB 5.5 (from 5.5.28 onwards) to make it easier for applications to work with either library version. However, in 5.5 it always returns a multiple of 1000 milliseconds.

At the end is a list of all functions from the normal API which can be used safely in a non-blocking program, since they never need to block.

int mysql_real_connect_start(MYSQL **ret, MYSQL *mysql, const char *host,
                         const char *user, const char *passwd, const char *db,
                         unsigned int port, const char *unix_socket,
                         unsigned long client_flags)
int mysql_real_connect_cont(MYSQL **ret, MYSQL *mysql, int ready_status)

mysql_real_connect_start() initiates a non-blocking connection request to a server.

When mysql_real_connect_start() or mysql_real_connect_cont() returns zero, a copy of the passed 'mysql' argument is stored in *ret.

int mysql_real_query_start(int *ret, MYSQL *mysql, const char *stmt_str,
unsigned long length)
int mysql_real_query_cont(int *ret, MYSQL *mysql, int ready_status)
int mysql_fetch_row_start(MYSQL_ROW *ret, MYSQL_RES *result)
int mysql_fetch_row_cont(MYSQL_ROW *ret, MYSQL_RES *result, int ready_status)

Initiate fetch of another row from a SELECT query.

If the MYSQL_RES was obtained from mysql_use_result(), then this function allows stream processing, where initial rows are returned to the application while the server is still sending subsequent rows. When no more data is available on the socket, mysql_fetch_row_start() or mysql_fetch_row_cont() will return MYSQL_WAIT_READ (or possibly MYSQL_WAIT_WRITE if using TLS and TLS re-negotiation is needed; MYSQL_WAIT_TIMEOUT may also be set if a read timeout is enabled). When data becomes available, more rows can be fetched with mysql_fetch_row_cont().

If the MYSQL_RES was obtained from mysql_store_result() / mysql_store_result_start() / mysql_store_result_cont(), then this function cannot block; mysql_fetch_row_start() will always return 0 (and if desired, plain mysql_fetch_row() may be used instead with equivalent effect).
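For the streaming case, a sketch of the fetch loop (reusing a poll()-based helper such as the wait_for_mysql() function sketched earlier) could look like this:

/* Sketch: stream all rows of a mysql_use_result() result set without blocking. */
static void fetch_all_rows(MYSQL *mysql, MYSQL_RES *res)
{
  MYSQL_ROW row;
  int status;

  for (;;)
  {
    status= mysql_fetch_row_start(&row, res);
    while (status)                   /* wait until more data is available */
    {
      status= wait_for_mysql(mysql, status);
      status= mysql_fetch_row_cont(&row, res, status);
    }
    if (!row)
      break;                         /* NULL row: end of data (or an error) */
    /* ... process 'row' here ... */
  }
}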

int mysql_set_character_set_start(int *ret, MYSQL *mysql, const char *csname)
int mysql_set_character_set_cont(int *ret, MYSQL *mysql, int ready_status)
int mysql_select_db_start(int *ret, MYSQL *mysql, const char *db)
int mysql_select_db_cont(int *ret, MYSQL *mysql, int ready_status)
int mysql_send_query_start(int *ret, MYSQL *mysql, const char *q, unsigned long length)
int mysql_send_query_cont(int *ret, MYSQL *mysql, int ready_status)
int mysql_store_result_start(MYSQL_RES **ret, MYSQL *mysql)
int mysql_store_result_cont(MYSQL_RES **ret, MYSQL *mysql, int ready_status)
int mysql_free_result_start(MYSQL_RES *result)
int mysql_free_result_cont(MYSQL_RES *result, int ready_status)

This function may need to wait if not all rows were fetched before it was called (it then needs to consume any pending rows sent from the server so they do not interfere with any subsequent queries).

If all rows were already fetched, then this function will not need to wait: mysql_free_result_start() will return zero (or, if so desired, plain mysql_free_result() may be used instead).

Note that mysql_free_result() returns no value, so there is no extra 'ret' parameter for mysql_free_result_start() or mysql_free_result_cont().

int mysql_close_start(MYSQL *sock)
int mysql_close_cont(MYSQL *sock, int ready_status)

mysql_close() sends a COM_QUIT request to the server, though it does not wait for any reply.

Thus it can theoretically block (if the socket buffer is full), though in practice this is unlikely to occur frequently.

The non-blocking version of mysql_close() is provided for completeness; for many applications using the normal mysql_close() is probably sufficient (and may be simpler).

Note that mysql_close() returns no value, so there is no extra 'ret' parameter for mysql_close_start() or mysql_close_cont().

int mysql_change_user_start(my_bool *ret, MYSQL *mysql, const char *user, const
                            char *passwd, const char *db)
int mysql_change_user_cont(my_bool *ret, MYSQL *mysql, int ready_status)
int mysql_query_start(int *ret, MYSQL *mysql, const char *q)
int mysql_query_cont(int *ret, MYSQL *mysql, int ready_status)
int mysql_shutdown_start(int *ret, MYSQL *mysql, enum mysql_enum_shutdown_level
                        shutdown_level)
int mysql_shutdown_cont(int *ret, MYSQL *mysql, int ready_status)
int mysql_dump_debug_info_start(int *ret, MYSQL *mysql)
int mysql_dump_debug_info_cont(int *ret, MYSQL *mysql, int ready_status)
int mysql_refresh_start(int *ret, MYSQL *mysql, unsigned int refresh_options)
int mysql_refresh_cont(int *ret, MYSQL *mysql, int ready_status)
int mysql_kill_start(int *ret, MYSQL *mysql, unsigned long pid)
int mysql_kill_cont(int *ret, MYSQL *mysql, int ready_status)
int mysql_set_server_option_start(int *ret, MYSQL *mysql,
                              enum enum_mysql_set_option option)
int mysql_set_server_option_cont(int *ret, MYSQL *mysql, int ready_status)
int mysql_ping_start(int *ret, MYSQL *mysql)
int mysql_ping_cont(int *ret, MYSQL *mysql, int ready_status)
int mysql_stat_start(const char **ret, MYSQL *mysql)
int mysql_stat_cont(const char **ret, MYSQL *mysql, int ready_status)
int mysql_list_dbs_start(MYSQL_RES **ret, MYSQL *mysql, const char *wild)
int mysql_list_dbs_cont(MYSQL_RES **ret, MYSQL *mysql, int ready_status)
int mysql_list_tables_start(MYSQL_RES **ret, MYSQL *mysql, const char *wild)
int mysql_list_tables_cont(MYSQL_RES **ret, MYSQL *mysql, int ready_status)
int mysql_list_processes_start(MYSQL_RES **ret, MYSQL *mysql)
int mysql_list_processes_cont(MYSQL_RES **ret, MYSQL *mysql, int ready_status)
int mysql_list_fields_start(MYSQL_RES **ret, MYSQL *mysql, const char *table,
                        const char *wild)
int mysql_list_fields_cont(MYSQL_RES **ret, MYSQL *mysql, int ready_status)
int mysql_read_query_result_start(my_bool *ret, MYSQL *mysql)
int mysql_read_query_result_cont(my_bool *ret, MYSQL *mysql, int ready_status)
int mysql_stmt_prepare_start(int *ret, MYSQL_STMT *stmt, const char *query,
                         unsigned long length)
int mysql_stmt_prepare_cont(int *ret, MYSQL_STMT *stmt, int ready_status)
int mysql_stmt_execute_start(int *ret, MYSQL_STMT *stmt)
int mysql_stmt_execute_cont(int *ret, MYSQL_STMT *stmt, int ready_status)
int mysql_stmt_fetch_start(int *ret, MYSQL_STMT *stmt)
int mysql_stmt_fetch_cont(int *ret, MYSQL_STMT *stmt, int ready_status)
int mysql_stmt_store_result_start(int *ret, MYSQL_STMT *stmt)
int mysql_stmt_store_result_cont(int *ret, MYSQL_STMT *stmt, int ready_status)
int mysql_stmt_close_start(my_bool *ret, MYSQL_STMT *stmt)
int mysql_stmt_close_cont(my_bool *ret, MYSQL_STMT *stmt, int ready_status)
int mysql_stmt_reset_start(my_bool *ret, MYSQL_STMT *stmt)
int mysql_stmt_reset_cont(my_bool *ret, MYSQL_STMT *stmt, int ready_status)
int mysql_stmt_free_result_start(my_bool *ret, MYSQL_STMT *stmt)
int mysql_stmt_free_result_cont(my_bool *ret, MYSQL_STMT *stmt, int ready_status)
int mysql_stmt_send_long_data_start(my_bool *ret, MYSQL_STMT *stmt,
                                unsigned int param_number,
                                const char *data, unsigned long length)
int mysql_stmt_send_long_data_cont(my_bool *ret, MYSQL_STMT *stmt, int ready_status)
int mysql_commit_start(my_bool *ret, MYSQL *mysql)
int mysql_commit_cont(my_bool *ret, MYSQL *mysql, int ready_status)
int mysql_rollback_start(my_bool *ret, MYSQL *mysql)
int mysql_rollback_cont(my_bool *ret, MYSQL *mysql, int ready_status)
int mysql_autocommit_start(my_bool *ret, MYSQL *mysql, my_bool auto_mode)
int mysql_autocommit_cont(my_bool *ret, MYSQL *mysql, int ready_status)
int mysql_next_result_start(int *ret, MYSQL *mysql)
int mysql_next_result_cont(int *ret, MYSQL *mysql, int ready_status)
int mysql_stmt_next_result_start(int *ret, MYSQL_STMT *stmt)
int mysql_stmt_next_result_cont(int *ret, MYSQL_STMT *stmt, int ready_status)

Client API functions which never block

The following client API functions never need to do I/O and thus can never block. Therefore, they can be used as normal in programs using non-blocking operations; there is no need to call any special _start() variant. (Even if a _start() variant were available, it would always return zero, so no _cont() call would ever be needed.)

  • mysql_num_rows()

  • mysql_num_fields()

  • mysql_eof()

  • mysql_fetch_field_direct()

  • mysql_fetch_fields()

  • mysql_row_tell()

  • mysql_field_tell()

  • mysql_field_count()

  • mysql_affected_rows()

  • mysql_insert_id()

  • mysql_errno()

  • mysql_error()

  • mysql_sqlstate()

  • mysql_warning_count()

  • mysql_info()

  • mysql_thread_id()

  • mysql_character_set_name()

  • mysql_init()

  • mysql_ssl_set()

  • mysql_get_ssl_cipher()

  • mysql_use_result()

  • mysql_get_character_set_info()

  • mysql_set_local_infile_handler()

  • mysql_set_local_infile_default()

  • mysql_get_server_info()

  • mysql_get_server_name()

  • mysql_get_client_info()

  • mysql_get_client_version()

  • mysql_get_host_info()

  • mysql_get_server_version()

  • mysql_get_proto_info()

  • mysql_options()

  • mysql_data_seek()

  • mysql_row_seek()

  • mysql_field_seek()

  • mysql_fetch_lengths()

  • mysql_fetch_field()

  • mysql_escape_string()

  • mysql_hex_string()

  • mysql_real_escape_string()

  • mysql_debug()

  • myodbc_remove_escape()

  • mysql_thread_safe()

  • mysql_embedded()

  • mariadb_connection()

  • mysql_stmt_init()

  • mysql_stmt_fetch_column()

  • mysql_stmt_param_count()

  • mysql_stmt_attr_set()

  • mysql_stmt_attr_get()

  • mysql_stmt_bind_param()

  • mysql_stmt_bind_result()

  • mysql_stmt_result_metadata()

  • mysql_stmt_param_metadata()

  • mysql_stmt_errno()

  • mysql_stmt_error()

  • mysql_stmt_sqlstate()

  • mysql_stmt_row_seek()

  • mysql_stmt_row_tell()

  • mysql_stmt_data_seek()

  • mysql_stmt_num_rows()

  • mysql_stmt_affected_rows()

  • mysql_stmt_insert_id()

  • mysql_stmt_field_count()

  • mysql_more_results()

  • mysql_get_socket()

  • mysql_get_timeout_value()

  • mysql_get_timeout_value_ms()

This page is licensed: CC BY-SA / Gnu FDL

Using the Non-blocking Library

The MariaDB non-blocking client API is modelled after the normal blocking library calls. This makes it easy to learn and remember. It makes it easier to translate code from using the blocking API to using the non-blocking API (or vice versa). And it also makes it simple to mix blocking and non-blocking calls in the same code path.

For every library call that may block on socket I/O, such as 'int mysql_real_query(MYSQL, query, query_length)', two additional non-blocking calls are introduced:

int mysql_real_query_start(&status, MYSQL, query, query_length)
int mysql_real_query_cont(&status, MYSQL, wait_status)

To do non-blocking operation, an application first calls mysql_real_query_start() instead of mysql_real_query(), passing the same parameters.

If mysql_real_query_start() returns zero, then the operation completed without blocking, and 'status' is set to the value that would normally be returned from mysql_real_query().

Otherwise, the return value from mysql_real_query_start() is a bitmask of events that the library is waiting on. This can be MYSQL_WAIT_READ, MYSQL_WAIT_WRITE, or MYSQL_WAIT_EXCEPT, corresponding to the similar flags for select() or poll(); it can also include MYSQL_WAIT_TIMEOUT when waiting for a timeout to occur (e.g. a connection timeout).

In this case, the application continues other processing and eventually checks for the appropriate condition(s) to occur on the socket (or for timeout). When this occurs, the application can resume the operation by calling mysql_real_query_cont(), passing in 'wait_status' a bitmask of the events which actually occurred.

Just like mysql_real_query_start(), mysql_real_query_cont() returns zero when done, or a bitmask of events it needs to wait on. Thus the application repeatedly calls mysql_real_query_cont(), intermixed with other processing of its choice, until zero is returned, after which the result of the operation is available in 'status'.

Some calls, like mysql_options(), do not do any socket I/O, and so can never block. For these, there are no separate _start() or _cont() calls. See the "Non-blocking API reference" page for a full list of which functions can and cannot block.

The checking for events on the socket / timeout can be done with select() or poll() or a similar mechanism. Though often it will be done using a higher-level framework (such as libevent), which supplies facilities for registering and acting on such conditions.

The descriptor of the socket on which to check for events can be obtained by calling mysql_get_socket(). The duration of any timeout can be obtained from mysql_get_timeout_value().

Here is a trivial (but full) example of running a query with the non-blocking API. The example is found in the MariaDB source tree as client/async_example.c. (A larger, more realistic example using libevent is found as tests/async_queries.c in the source):

#include <stdio.h>
#include <stdlib.h>
#include <poll.h>
#include <mysql.h>

/* Pass a string literal together with its length (not counting the
   terminating NUL). */
#define SL(s) (s), sizeof(s)-1

static int wait_for_mysql(MYSQL *mysql, int status);

/* Print the client library error message and abort. */
static void fatal(MYSQL *mysql, const char *msg)
{
  fprintf(stderr, "%s: %s\n", msg, mysql_error(mysql));
  exit(1);
}

static void run_query(const char *host, const char *user, const char *password) {
  int err, status;
  MYSQL mysql, *ret;
  MYSQL_RES *res;
  MYSQL_ROW row;

  mysql_init(&mysql);
  mysql_options(&mysql, MYSQL_OPT_NONBLOCK, 0);

  status = mysql_real_connect_start(&ret, &mysql, host, user, password, NULL, 0, NULL, 0);
  while (status) {
    status = wait_for_mysql(&mysql, status);
    status = mysql_real_connect_cont(&ret, &mysql, status);
  }

  if (!ret)
    fatal(&mysql, "Failed to mysql_real_connect()");

  status = mysql_real_query_start(&err, &mysql, SL("SHOW STATUS"));
  while (status) {
    status = wait_for_mysql(&mysql, status);
    status = mysql_real_query_cont(&err, &mysql, status);
  }
  if (err)
    fatal(&mysql, "mysql_real_query() returns error");

  /* This method cannot block. */
  res= mysql_use_result(&mysql);
  if (!res)
    fatal(&mysql, "mysql_use_result() returns error");

  for (;;) {
    status= mysql_fetch_row_start(&row, res);
    while (status) {
      status= wait_for_mysql(&mysql, status);
      status= mysql_fetch_row_cont(&row, res, status);
    }
    if (!row)
      break;
    printf("%s: %s\n", row[0], row[1]);
  }
  if (mysql_errno(&mysql))
    fatal(&mysql, "Got error while retrieving rows");
  mysql_free_result(res);
  mysql_close(&mysql);
}

/* Helper function to do the waiting for events on the socket. */
static int wait_for_mysql(MYSQL *mysql, int status) {
  struct pollfd pfd;
  int timeout, res;

  pfd.fd = mysql_get_socket(mysql);
  pfd.events =
    (status & MYSQL_WAIT_READ ? POLLIN : 0) |
    (status & MYSQL_WAIT_WRITE ? POLLOUT : 0) |
    (status & MYSQL_WAIT_EXCEPT ? POLLPRI : 0);
  if (status & MYSQL_WAIT_TIMEOUT)
    timeout = 1000*mysql_get_timeout_value(mysql);
  else
    timeout = -1;
  res = poll(&pfd, 1, timeout);
  if (res == 0)
    return MYSQL_WAIT_TIMEOUT;
  else if (res < 0)
    /* poll() failed; a real application would check errno (e.g. EINTR) and
       retry, but for this simple example we just report a timeout. */
    return MYSQL_WAIT_TIMEOUT;
  else {
    int status = 0;
    if (pfd.revents & POLLIN) status |= MYSQL_WAIT_READ;
    if (pfd.revents & POLLOUT) status |= MYSQL_WAIT_WRITE;
    if (pfd.revents & POLLPRI) status |= MYSQL_WAIT_EXCEPT;
    return status;
  }
}

Setting MYSQL_OPT_NONBLOCK

Before using any non-blocking operation, it is necessary to enable it first by setting the MYSQL_OPT_NONBLOCK option:

mysql_options(&mysql, MYSQL_OPT_NONBLOCK, 0);

This call can be made at any point; typically it is done at the start, before mysql_real_connect(), but it can also be made later to start using non-blocking operations on an already-established connection.

If a non-blocking operation is attempted without setting the MYSQL_OPT_NONBLOCK option, the program will typically crash with a NULL pointer dereference.

The argument for MYSQL_OPT_NONBLOCK is the size of the stack used to save the state of a non-blocking operation while it is waiting for I/O and the application is doing other processing. Normally, applications will not have to change this, and it can be passed as zero to use the default value.

Mixing blocking and non-blocking operation

It is possible to freely mix blocking and non-blocking calls on the same MYSQL connection.

Thus, an application can do a normal blocking mysql_real_connect() and subsequently do a non-blocking mysql_real_query_start(). Or vice versa: do a non-blocking mysql_real_connect_start(), and later do a blocking mysql_real_query() on the resulting connection.

Mixing can be useful to allow code to use the simpler blocking API in parts of the program where waiting is not a problem, for example when establishing the connection(s) at program startup, or when running small, quick queries in between large, long-running ones.

The only restriction is that any previous non-blocking operation must have finished before a new blocking (or non-blocking) operation is started; see the next section, "Terminating a non-blocking operation early", below.
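For example, a program could connect with the plain blocking call and then run queries non-blockingly on the same connection. A sketch, reusing the wait_for_mysql() and fatal() helpers and the SL() macro from the example above (host, user and password are the usual connection parameters):

MYSQL mysql;
int err, status;

mysql_init(&mysql);
mysql_options(&mysql, MYSQL_OPT_NONBLOCK, 0);  /* Enable non-blocking operation */

/* Blocking connect at program startup, where waiting is acceptable. */
if (!mysql_real_connect(&mysql, host, user, password, NULL, 0, NULL, 0))
  fatal(&mysql, "Failed to mysql_real_connect()");

/* Non-blocking query on the same connection. */
status= mysql_real_query_start(&err, &mysql, SL("SELECT 1"));
while (status) {
  status= wait_for_mysql(&mysql, status);
  status= mysql_real_query_cont(&err, &mysql, status);
}
if (err)
  fatal(&mysql, "mysql_real_query() returns error");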

Terminating a non-blocking operation early

When a non-blocking operation is started with mysql_real_query_start() or another _start() function, it must be allowed to finish before starting a new operation. Thus, the application must continue calling mysql_real_query_cont() until zero is returned, indicating that the operation is completed. It is not allowed to leave one operation "hanging" in the middle of processing and then start a new one on top of it.

It is, however, permissible to terminate the connection completely with mysql_close() in the middle of processing a non-blocking call. A new connection must then be initiated with mysql_real_connect() before new queries can be run, either with a new MYSQL object or re-using the old one.
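A sketch of that pattern, assuming a non-blocking query was started earlier and is still pending, and re-using the same MYSQL object afterwards (the variables and the wait_for_mysql() and fatal() helpers are as in the example above):

/* Abandon the pending operation by closing the connection. */
mysql_close(&mysql);

/* Set up the connection again before running further queries. */
mysql_init(&mysql);
mysql_options(&mysql, MYSQL_OPT_NONBLOCK, 0);
status= mysql_real_connect_start(&ret, &mysql, host, user, password, NULL, 0, NULL, 0);
while (status) {
  status= wait_for_mysql(&mysql, status);
  status= mysql_real_connect_cont(&ret, &mysql, status);
}
if (!ret)
  fatal(&mysql, "Failed to mysql_real_connect()");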

In the future, we may implement an abort facility to force an on-going operation to terminate as quickly as possible (but it will still be necessary to call mysql_real_query_cont() one last time after abort, allowing it to clean up the operation and return immediately with an appropriate error code).

Restrictions

DNS

When mysql_real_connect_start() is passed a hostname (as opposed to a local unix socket or an IP address), it may need to look up the hostname in DNS, depending on the local host configuration (e.g. if the name is not in /etc/hosts or cached). Such DNS lookups do not happen in a non-blocking way. This means that mysql_real_connect_start() will not return control to the application while waiting for the DNS response, so the application may "hang" for some time if DNS is slow or non-functional.

If this is a problem, the application can avoid it by passing an IP address to mysql_real_connect_start() instead of a hostname. The IP address can be obtained by the application with whatever non-blocking DNS lookup operation is available to it from the operating system or event framework used. Alternatively, a simple solution may be to just add the hostname to the local host lookup file (/etc/hosts on POSIX/Unix/Linux machines).
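One simple approach is to resolve the hostname once at program startup (where blocking is acceptable) and pass the resulting numeric address string to the non-blocking connect. A sketch using the standard getaddrinfo() and inet_ntop() calls (IPv4 only, for brevity):

#include <string.h>
#include <netdb.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Resolve 'hostname' to a numeric IPv4 address string; returns 0 on success.
   The result (e.g. "10.0.0.5") can then be passed as the 'host' argument
   to mysql_real_connect_start(). Done once at startup, so blocking here is
   acceptable. */
static int resolve_host(const char *hostname, char *ipbuf, size_t ipbuflen)
{
  struct addrinfo hints, *res;
  int err;

  memset(&hints, 0, sizeof(hints));
  hints.ai_family= AF_INET;
  hints.ai_socktype= SOCK_STREAM;
  if ((err= getaddrinfo(hostname, NULL, &hints, &res)) != 0)
    return err;
  inet_ntop(AF_INET, &((struct sockaddr_in *)res->ai_addr)->sin_addr,
            ipbuf, ipbuflen);
  freeaddrinfo(res);
  return 0;
}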

Windows Named Pipes and Shared Memory connections

There is no support in the non-blocking API for connections using Windows named pipes or shared memory.

Named pipes and shared memory can still be used, with either the blocking or the non-blocking API. However, operations that need to wait on I/O on the named pipe will not return control to the application; instead, they will "hang" waiting for the operation to complete, just like the normal blocking API calls.

This page is licensed: CC BY-SA / Gnu FDL