High Availability

MariaDB ColumnStore ensures high availability with multi-node setups and shared storage, while MaxScale adds monitoring and failover for continuous analytics.

Performance Related Configuration Settings

MariaDB ColumnStore

  1. Introduction

  2. Memory management - NumBlocksPct and TotalUmMemory

  3. Query concurrency - MaxOutstandingRequests

  4. Join tuning - PmMaxMemorySmallSide

  5. Multi table join tuning

  6. Disk based joins - AllowDiskBasedJoin

Introduction

A number of system configuration variables exist to allow fine-tuning of the system to suit the physical hardware and query characteristics. In general, the default values work reasonably well for many cases.

The configuration parameters are maintained in the /usr/local/mariadb/columnstore/etc/Columnstore.xml file. In a multiple-server deployment, these should only be edited on the PM1 server, as the file is automatically replicated to the other servers by the system. A system restart is required for a configuration change to take effect.

The convenience utility programs getConfig and setConfig are available to safely update Columnstore.xml without editing the XML file by hand. The -h argument displays usage information for each.
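
For example, an invocation might look like the following sketch; the HashJoin section and PmMaxMemorySmallSide parameter are described later on this page, and the syntax shown (section, parameter, value) should be verified against the -h help output on your installation:

# Read the current value of a parameter (section name, then parameter name)
/usr/local/mariadb/columnstore/bin/getConfig HashJoin PmMaxMemorySmallSide

# Set a new value; restart the system for the change to take effect
/usr/local/mariadb/columnstore/bin/setConfig HashJoin PmMaxMemorySmallSide 1G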

Memory management - NumBlocksPct and TotalUmMemory

The NumBlocksPct configuration parameter specifies the percentage of physical memory to utilize for disk block caching. For a single-server or combined multi-server deployment the default value is 50, to leave enough physical memory for the UM; for a non-combined multi-server deployment the default value is 70.

The TotalUmMemory configuration parameter specifies the percentage of physical memory to utilize for joins, intermediate results and set operations on the UM. This specifies an upper limit for small table results in joins rather than a pre-allocation of memory.

In a single-server or combined deployment, the sum of NumBlocksPct and TotalUmMemory should typically not exceed 75% of physical memory. With very large memory servers this could be raised, but the key point is to leave enough memory for other processes, including mysqld.

From version 1.2.2 onwards, these can be set to static numeric limits instead of percentages by entering a number with 'M' or 'G' at the end to signify MiB or GiB.
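
As a sketch, on a large-memory server the two parameters might be set to static limits as follows; the DBBC and HashJoin section names reflect how these parameters are typically grouped in Columnstore.xml, so confirm them with getConfig before applying:

# Cache 8 GiB of disk blocks instead of a percentage of memory (1.2.2+)
setConfig DBBC NumBlocksPct 8G

# Allow up to 4 GiB for joins, intermediate results and set operations on the UM
setConfig HashJoin TotalUmMemory 4G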

Query concurrency - MaxOutstandingRequests

ColumnStore handles concurrent query execution by managing the rate of concurrent batch primitive steps from the UM to the PM. This is configured using the MaxOutstandingRequests parameter, which has a default value of 20. Each batch primitive step is executed within the context of one column extent, according to this high-level process:

  • The UM issues up to MaxOutstandingRequests batch primitive steps.

  • The PM processes each request using many threads and returns its response. These generally take from a fraction of a second up to a few seconds, depending on the amount of physical I/O and the performance of that storage.

  • The UM issues new requests as prior requests complete, maintaining the maximum number of outstanding requests.

This scheme allows large queries to use all available resources when they are not otherwise being consumed, and allows smaller queries to execute with minimal delay. A lower value optimizes for throughput of smaller queries, while a larger value optimizes for the response time of a given large query. The default value works well under most circumstances; however, the value should be increased as the number of PM nodes is increased.
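
For instance, after adding PM nodes the value might be raised along these lines; the JobList section name is an assumption, so locate MaxOutstandingRequests in your Columnstore.xml or query it with getConfig first:

# Assumed section name (verify before applying); widen the request window
setConfig JobList MaxOutstandingRequests 40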

The number of queries currently running and the number currently queued can be checked with:

SELECT calgetsqlcount();

Join tuning - PmMaxMemorySmallSide

ColumnStore maintains statistics for each table and uses them to determine which of the two tables in a join is larger. This is based both on the number of blocks in each table and on an estimation of predicate cardinality. The first step is to apply any filters to the smaller table and return that data set to the UM. The size of this data set is compared against the configuration parameter PmMaxMemorySmallSide, which has a default value of 64 (MB) and can be set as high as 4GB. The default allows approximately 1M rows on the small table side to be joined against billions (or trillions) on the large table side. If the size of the small data set is less than PmMaxMemorySmallSide, the data set is sent to the PM for creation of a distributed hashmap; otherwise it is created on the UM. This setting is therefore important to join tuning, since it determines whether the operation can be distributed or not. It should be set to support your largest expected small-table join size, up to available memory (a configuration sketch follows the list):

  • Although this will increase the size of data sent from the UM to PM to support the join, it means that the join and subsequent aggregates are pushed down, scaled out, and a smaller data set is returned back to the UM.

  • In a multiple PM deployment, the sizing should be based from available physical memory on the PM servers, how much memory to reserve for block caching, and the number of simultaneous join operations that can be expected to run times the average small table join data size.
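
As a sketch, the limit could be raised to 1 GiB so that larger small-side data sets still result in a distributed join; the HashJoin section name matches how this parameter is typically grouped in Columnstore.xml:

# Permit small-side join data sets of up to 1 GiB to be shipped to the PMs
setConfig HashJoin PmMaxMemorySmallSide 1G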

Multi table join tuning

The above logic for a single-table join extrapolates to multi-table joins, where the small-table values are precalculated and applied in one single scan against the large table. This works well for the typical star schema case of joining multiple dimension tables with a large fact table. For some join scenarios it may be necessary to sequence joins to create intermediate data sets for joining; this happens, for instance, with a snowflake schema structure. In some extreme cases it may be hard for the optimizer to determine the most optimal join path. In this case a hint is available to force a join ordering. The INFINIDB_ORDERED hint forces the first table in the FROM clause to be considered the largest table, overriding any statistics-based decision, for example:

SELECT /*! INFINIDB_ORDERED */ r_regionkey     
FROM region r, customer c, nation n    
WHERE r.r_regionkey = n.n_regionkey      
AND n.n_nationkey = c.c_nationkey

Note: INFINIDB_ORDERED is deprecated and no longer works in ColumnStore 1.2 and above. Use instead:

SET infinidb_ordered_only=ON;

and for 1.4:

SET columnstore_ordered_only=ON;

Disk based joins - AllowDiskBasedJoin

When a join is very large and exceeds the PmMaxMemorySmallSide setting, it is performed in memory on the UM server. For very large joins this can exceed the available memory, in which case the condition is detected and a query error is reported. A number of configuration parameters are available to enable and configure the use of disk overflow should this occur (a configuration sketch follows the list):

  • AllowDiskBasedJoin – Controls the option to use disk Based joins or not. Valid values are Y (enabled) or N (disabled). By default, this option is disabled.

  • TempFileCompression – Controls whether the disk join files are compressed or noncompressed. Valid values are Y (use compressed files) or N (use non-compressed files).

  • TempFilePath – The directory path used for the disk joins. By default, this path is the tmp directory for your installation (i.e., /usr/local/mariadb/columnstore/tmp). Files (named infinidb-join-data*) in this directory are created and cleaned on an as-needed basis. The entire directory is removed and recreated by ExeMgr at startup. It is strongly recommended that this directory be stored on a dedicated partition.
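
A minimal configuration sketch, assuming these parameters live in the HashJoin section of Columnstore.xml (verify with getConfig) and that /mnt/columnstore-tmp is a hypothetical dedicated partition:

# Enable disk overflow for joins and compress the temporary files (restart required)
setConfig HashJoin AllowDiskBasedJoin Y
setConfig HashJoin TempFileCompression Y

# /mnt/columnstore-tmp is a placeholder for a dedicated partition
setConfig HashJoin TempFilePath /mnt/columnstore-tmp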

A MariaDB global or session variable is available to specify a memory limit at which the query switches over to a disk-based join (see the example after the list):

  • infinidb_um_mem_limit - Memory limit in MB per user (i.e. switch to disk based join if this limit is exceeded). By default, this limit is not set (value of 0).
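
For example, to switch the current session to a disk-based join once 4 GiB (4096 MB) of UM memory is used:

SET SESSION infinidb_um_mem_limit = 4096;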

This page is licensed: CC BY-SA / Gnu FDL

Query Tuning

MariaDB ColumnStore query tuning optimizes analytics using data types, joins, projection elimination, WHERE clauses, and EXPLAIN for performance insights.


Query Plans and Optimizer Trace

MariaDB ColumnStore's query plans and Optimizer Trace show how analytical queries run in parallel across its distributed, columnar architecture, aiding performance tuning.

Execution Plan (CSEP)

Overview

The ColumnStore storage engine uses a ColumnStore Execution Plan (CSEP) to represent a query plan internally.

When the select handler receives the SELECT_LEX object, it transforms it into a CSEP as part of the query planning and optimization process. For additional information, see "MariaDB Enterprise ColumnStore Query Evaluation."

Viewing the CSEP

The CSEP for a given query can be viewed by performing the following:

  1. Calling the calSetTrace(1) function:

SELECT calSetTrace(1);

  2. Executing the query:

SELECT column1, column2
FROM columnstore_tab
WHERE column1 > '2020-04-01'
AND column1 < '2020-11-01';

  3. Calling the calGetTrace() function:

SELECT calGetTrace();

This page is: Copyright © 2025 MariaDB. All rights reserved.

Job Steps

Overview

When Enterprise ColumnStore executes a query, the ExeMgr process on the initiator/aggregator node translates the ColumnStore execution plan (CSEP) into a job list. A job list is a sequence of job steps.

Enterprise ColumnStore uses many different types of job steps that provide different scalability benefits:

  • Some types of job steps perform operations in a distributed manner, using multiple nodes that each operate on different extents. Distributed operations provide horizontal scalability.

  • Some types of job steps perform operations in a multi-threaded manner using a thread pool. Performing multi-threaded operations provides vertical scalability.

As you increase the number of ColumnStore nodes or the number of cores on each node, Enterprise ColumnStore can use those resources to more efficiently execute job steps.

For additional information, see "MariaDB Enterprise ColumnStore Query Evaluation.".

Batch Primitive Step (BPS)

Enterprise ColumnStore defines a batch primitive step to handle many types of tasks, such as scanning/filtering columns, JOIN operations, aggregation, functional filtering, and projecting (putting values into a SELECT list).

In calGetTrace() output, a batch primitive step is abbreviated BPS.

Batch primitive steps are evaluated on multiple nodes in parallel. The PrimProc process on each node evaluates the batch primitive step to one extent at a time. The PrimProc process uses a thread pool to operate on individual blocks within the extent in parallel.

Cross Engine Step (CES)

Enterprise ColumnStore defines a cross-engine step to perform cross-engine joins, in which a ColumnStore table is joined with a table that uses a different storage engine.

In calGetTrace() output, a cross-engine step is abbreviated CES.

Cross-engine steps are evaluated locally by the ExeMgr process on the initiator/aggregator node.

Enterprise ColumnStore can perform cross-engine joins when the mandatory utility user is properly configured.

For additional information, refer to "Mandatory Utility User Account".

Dictionary Structure Step (DSS)

Enterprise ColumnStore defines a dictionary structure step to scan the dictionary extents that ColumnStore uses to store variable-length string values.

In calGetTrace() output, a dictionary structure step is abbreviated DSS.

Dictionary structure steps are evaluated on multiple nodes in parallel. The PrimProc process on each node evaluates the dictionary structure step to one extent at a time. It uses a thread pool to operate on individual blocks within the extent in parallel.

Dictionary structure steps can require a lot of I/O for a couple of reasons:

  • Dictionary structure steps do not support extent elimination, so all extents for the column must be scanned.

  • Dictionary structure steps must read the column extents to find each pointer and the dictionary extents to find each value, doubling the number of extents to scan.

It is generally recommended to avoid queries that will cause dictionary scans.
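
For example, a filter that must examine long string values (table and column names here are hypothetical) typically forces a dictionary scan, because the matching happens against dictionary extents:

-- notes is assumed to be a long VARCHAR column stored in dictionary extents
SELECT order_id
FROM orders
WHERE notes LIKE '%expedited%';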

For additional information, see "Avoid Creating Long String Columns".

Hash Join Step (HJS)

Enterprise ColumnStore defines a hash join step to perform a hash join between two tables.

In calGetTrace() output, a hash join step is abbreviated HJS.

Hash join steps are evaluated locally by the ExeMgr process on the initiator/aggregator node.

Enterprise ColumnStore performs the hash join in memory by default. If you perform large joins, you may be able to get better performance by changing some configuration defaults with mcsSetConfig (see the sketch after this list):

  • Enterprise ColumnStore can be configured to use more memory for in-memory hash joins.

  • Enterprise ColumnStore can be configured to use disk-based joins.

For additional information, see "Configure in-memory joins" and "Configure Disk-Based Joins".

Having Step (HVS)

Enterprise ColumnStore defines a having step to evaluate a HAVING clause on a result set.

In calGetTrace() output, a having step is abbreviated HVS.

Subquery Step (SQS)

Enterprise ColumnStore defines a subquery step to evaluate a subquery.

In calGetTrace() output, a subquery step is abbreviated SQS.

Tuple Aggregation Step (TAS)

Enterprise ColumnStore defines a tuple aggregation step to collect intermediate aggregation results prior to the final aggregation and evaluation of the result set.

In calGetTrace() output, a tuple aggregation step is abbreviated TAS.

Tuple aggregation steps are primarily evaluated by the ExeMgr process on the initiator/aggregator node. However, the PrimProc process on each node also plays a role, since the PrimProc process on each node provides the intermediate aggregation results to the ExeMgr process on the initiator/aggregator node.

Tuple Annexation Step (TNS)

Enterprise ColumnStore defines a tuple annexation step to perform the final aggregation and evaluation of the results.

In calGetTrace() output, a tuple annexation step is abbreviated TNS.

Tuple annexation steps are evaluated locally by the ExeMgr process on the initiator/aggregator node.

Enterprise ColumnStore 5 performs aggregation operations in memory. As a consequence, more complex aggregation operations require more memory in that version.

In Enterprise ColumnStore 6, disk-based aggregations can be enabled.

For additional information, see "Configure Disk-Based Aggregations".

Tuple Union Step (TUS)

Enterprise ColumnStore defines a tuple union step to perform a union of two subqueries.

In calGetTrace() output, a tuple union step is abbreviated TUS.

Tuple union steps are evaluated locally by the ExeMgr process on the initiator/aggregator node.

Tuple Constant Step (TCS)

Enterprise ColumnStore defines a tuple constant step to evaluate constant values.

In calGetTrace() output, a tuple constant step is abbreviated TCS.

Tuple constant steps are evaluated locally by the ExeMgr process on the initiator/aggregator node.

Window Function Step (WFS)

Enterprise ColumnStore defines a window function step to evaluate window functions.

In calGetTrace() output, a window function step is abbreviated WFS.

Window function steps are evaluated locally by the ExeMgr process on the initiator/aggregator node.

This page is: Copyright © 2025 MariaDB. All rights reserved.

Collecting Statistics with ANALYZE TABLE

Overview

In MariaDB Enterprise ColumnStore 6, the ExeMgr process uses optimizer statistics in its query planning process.

ColumnStore uses the optimizer statistics to add support for queries that contain circular inner joins.

In Enterprise ColumnStore 5 and before, ColumnStore would raise the following error when a query containing a circular inner join was executed:

ERROR 1815 (HY000): Internal error: IDB-1003: Circular joins are not supported.

The optimizer statistics store each column's NDV (Number of Distinct Values), which can help the ExeMgr process choose the optimal join order for queries with circular joins. When Enterprise ColumnStore executes a query with a circular join, the query's execution can take longer if ColumnStore chooses a sub-optimal join order. When you collect optimizer statistics for your ColumnStore tables, the ExeMgr process is less likely to choose a sub-optimal join order.
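
For illustration, a circular inner join is one whose join conditions form a cycle across the joined tables. A minimal sketch with hypothetical tables t1, t2, and t3:

-- t1 joins t2, t2 joins t3, and t3 joins back to t1, closing the cycle
SELECT t1.a
FROM t1
JOIN t2 ON t1.a = t2.a
JOIN t3 ON t2.b = t3.b
AND t3.c = t1.c;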

Enterprise ColumnStore's optimizer statistics can be collected for ColumnStore tables by executing ANALYZE TABLE:

ANALYZE TABLE columnstore_tab;

Enterprise ColumnStore's optimizer statistics are not updated automatically. To update the optimizer statistics for a ColumnStore table, ANALYZE TABLE must be re-executed.

Enterprise ColumnStore does not implement an interface to show optimizer statistics.

This page is: Copyright © 2025 MariaDB. All rights reserved.

Query Tuning Recommendations

When tuning queries for MariaDB Enterprise ColumnStore, there are some important details to consider.

Avoid Selecting Unnecessary Columns

Enterprise ColumnStore only reads the columns that are necessary to resolve a query.

For example, the following query selects every column in the table:

SELECT * FROM tab;

Whereas the following query only selects two columns in the table, so it requires less I/O:

SELECT col1, col2 FROM tab;

For best performance, only select the columns that are necessary to resolve a query.

Avoid Large Sorts

When Enterprise ColumnStore performs ORDER BY and LIMIT operations, the operations are performed in a single-threaded manner after the rest of the query processing has been completed and the full unsorted result set has been retrieved. For large data sets, the performance overhead can be significant.

Avoid Excessive Aggregations

When Enterprise ColumnStore 5 performs aggregations (i.e., DISTINCT, GROUP BY, COUNT(*), etc.), all of the aggregation work happens in-memory by default. As a consequence, more complex aggregation operations require more memory in that version.

For example, the following query could require a lot of memory in Enterprise ColumnStore 5, since it has to calculate many distinct values in memory:

SELECT DISTINCT col1 FROM tab LIMIT 10000;

Whereas the following query could require much less memory in Enterprise ColumnStore 5, since it has to calculate fewer distinct values:

SELECT DISTINCT col1 FROM tab LIMIT 100;

In Enterprise ColumnStore 6, disk-based aggregations can be enabled.

For best performance, avoid excessive aggregations or enable disk-based aggregations.

For additional information, see "Configure Disk-Based Aggregations".

Avoid Non-Distributed Functions

When Enterprise ColumnStore evaluates built-in functions and aggregate functions, it can often evaluate the function in a distributed manner. Distributed evaluation of functions can significantly improve performance.

Enterprise ColumnStore supports distributed evaluation for some built-in functions. For other built-in functions, the function must be evaluated serially on the final result set.

Enterprise ColumnStore also supports distributed evaluation for user-defined functions developed with ColumnStore's User-Defined Aggregate Function (UDAF) C++ API. For functions developed with Enterprise Server's standard User-Defined Function (UDF) API, the function must be evaluated serially on the final result set.

For best performance, avoid non-distributed functions.

Optimize Large Joins

By default, Enterprise ColumnStore performs all joins as in-memory hash joins.

If the joined tables are very large, the in-memory hash join can require too much memory for the default configuration. There are a few options to work around this:

  • Enterprise ColumnStore can be configured to use more memory for in-memory hash joins.

  • Enterprise ColumnStore can be configured to use disk-based joins.

  • Enterprise ColumnStore can use optimizer statistics to better optimize the join order.

For additional information, see "Configure In-Memory Joins", "Configure Disk-Based Joins", and "Optimizer Statistics".

Load Ordered Data in Proper Order

Enterprise ColumnStore uses extent elimination to optimize queries. Extent elimination uses the minimum and maximum values in the extent map to determine which extents can be skipped for a query.

When data is loaded into Enterprise ColumnStore, it appends the data to the latest extent. When an extent reaches the maximum number of column values, Enterprise ColumnStore creates a new extent. As a consequence, if ordered data is loaded in its proper order, then similar values will be clustered together in the same extent. This can improve query performance, because extent elimination performs best when similar values are clustered together.

For example, if you expect to query a table with a filter on a timestamp column, you should sort the data using the timestamp column before loading it into Enterprise ColumnStore. Later, when the table is queried with a filter on the timestamp column, Enterprise ColumnStore would be able to skip many extents using extent elimination.

For best performance, load ordered data in proper order.
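
A minimal sketch, assuming a CSV file whose first field is an ISO-8601 timestamp and a hypothetical database and table; cpimport is ColumnStore's bulk loader:

# ISO-8601 timestamps sort correctly as text, so a plain sort on field 1 suffices
sort -t ',' -k 1 /tmp/events_unsorted.csv > /tmp/events_sorted.csv

# Bulk load the sorted file (database and table names are hypothetical)
cpimport mydb events /tmp/events_sorted.csv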

Enable Decimal Overflow Checks

When Enterprise ColumnStore performs mathematical operations with very big values using the DECIMAL data type (and its synonyms NUMERIC and FIXED), the operation can sometimes overflow ColumnStore's maximum precision or scale. The maximum precision and scale depend on the version of Enterprise ColumnStore:

  • In Enterprise ColumnStore 6, the maximum precision (M) is 38, and the maximum scale (D) is 38.

  • In Enterprise ColumnStore 5, the maximum precision (M) is 18, and the maximum scale (D) is 18.

In Enterprise ColumnStore 6, applications can configure Enterprise ColumnStore to check for decimal overflows by setting the columnstore_decimal_overflow_check system variable, but only when the column has a decimal precision of 18 or more:

SET SESSION columnstore_decimal_overflow_check=ON;

SELECT (big_decimal1 * big_decimal2) AS product
FROM columnstore_tab;

When decimal overflow checks are enabled, math operations have extra overhead.

When the decimal overflow check fails, MariaDB Enterprise ColumnStore raises an error with the ER_INTERNAL_ERROR error code, and it writes detailed information about the overflow check failure to the ColumnStore system logs.

User-Defined Aggregate Function (UDAF) C++ API

MariaDB Enterprise ColumnStore supports Enterprise Server's standard User-Defined Function (UDF) API. However, UDFs developed using that API cannot be executed in a distributed manner.

To support distributed execution of custom SQL, MariaDB Enterprise ColumnStore provides a Distributed User-Defined Aggregate Function (UDAF) C++ API:

  • The Distributed User Defined Aggregate Functions (UDAF) C++ API allows anyone to create aggregate functions of arbitrary complexity for distributed execution in the ColumnStore storage engine.

  • These functions can also be used as Analytic (Window) functions just like any built-in aggregate function.

This page is: Copyright © 2025 MariaDB. All rights reserved.