Lesson 15 of the SQL Optimization Course: Index Design (MySQL Index Structures)

Introduction

For relational databases, the design of tables and SQL is written are particularly crucial. It wouldn’t be an exaggeration to say that they account for 90% of performance. So this time, specifically targeting these two major knowledge areas, we’ll conduct a detailed analysis for you, peeling back the layers.

This Series uses plain and understandable language and selects a large number of examples to elaborate on the subtleties for you.

🧑‍💻 Target audience:

DBA
Database developers
Students

We will use MySQL as the demonstration database.

In the previous article (Episode 15: Index Design (B+ Tree Organization)), we discussed why databases predominantly use B+ trees for index storage: they are well-suited for disk storage, leverage the properties of balanced multi-way trees, support disk pre-reading, and efficiently handle equality, range, and sequential scans. This article focuses on the index organization methods of MySQL’s two commonly used engines: MyISAM and InnoDB. Understanding these storage mechanisms is crucial for database optimization.

Clustered Index: Also known as a Clustered Index, this type of index ensures that the physical order of records in a table matches the logical order of the index. Since a table can only be stored in one physical order, a table can have at most one clustered index. Compared to non-clustered indexes, clustered indexes offer faster retrieval speeds.

In MySQL, only InnoDB tables support clustered indexes. The data in InnoDB tables is itself a clustered index, often referred to as an Index-Organized Table (IOT). Non-leaf nodes are ordered by the primary key, and leaf nodes contain the primary key and the corresponding row data. As a result, performing a full table scan in sequence is very fast for InnoDB tables.

Non-Clustered Index: Also called a Secondary Index, this type of index stores non-leaf nodes ordered by the index key values, while leaf nodes contain the index key values and the corresponding primary key values. In MySQL, apart from the primary key in InnoDB tables, all other indexes are secondary indexes. Engines like MyISAM and memory tables use non-clustered indexes. In simple terms, the index and row data are stored separately. A table can have multiple secondary indexes.

MyISAM tables are typical examples of tables where data and indexes are stored separately, with no fundamental difference between the primary key and secondary indexes. For instance, in a MyISAM table, the primary key and unique indexes are essentially the same.

MyISAM B+ Tree Primary Key

The above figure illustrates a 3rd-order B+ tree where non-leaf nodes are stored in order based on the primary key values, and leaf nodes are similarly ordered by the primary key values and contain pointers to the physical data rows on disk.

MyISAM B+ Tree Secondary Index

The above figure shows the B+ tree for the age column index, which is also a 3rd-order B+ tree. Non-leaf nodes are stored in order based on the age column values, and leaf nodes store the age values and pointers to the physical data rows on disk.

From the above two figures, it is evident that the biggest drawback of MyISAM tables is the lack of physical data row order. This means that both primary key and secondary index retrievals require an additional sorting step.

Let’s demonstrate this with a simple example:

The following SQL 1 statement outputs records in an unordered manner. To output them in ID order, we need to use SQL 2 with an explicit ORDER BY clause.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
-- SQL 1: Default unordered output
mysql> select * from t1;
+-------+----------+--------+------+--------------+
| id    | username | gender | age  | phone_number |
+-------+-------+--------+------+--------------+
| 10001 | Alice   | Female |   18 | 18501877098  |
| 10005 | Lily    | Female |   21 | 15827654555  |
| 10006 | Jack    | Male   |   38 | 19929933000  |
| 10009 | Mike    | Male   |   35 | 19012378676  |
| 10002 | Tom     | Male   |   20 | 17760500293  |
| 10003 | Emma    | Female |   29 | 13581386000  |
| 10004 | Lucy    | Female |   25 | 13456712000  |
| 10007 | Harry   | Male   |   23 | 19800092354  |
| 10008 | Sarah   | Female |   22 | 18953209331  |
+-------+-------+--------+------+--------------+
9 rows in set (0.00 sec)

-- SQL 2: Ordered output with explicit ORDER BY
mysql> select * from t1 order by id;
+-------+-------+--------+------+--------------+
| id    | username | gender | age  | phone_number |
+-------+-------+--------+------+--------------+
| 10001 | Alice   | Female |   18 | 18501877098  |
| 10002 | Tom     | Male   |   20 | 17760500293  |
| 10003 | Emma    | Female |   29 | 13581386000  |
| 10004 | Lucy    | Female |   25 | 13456712000  |
| 10005 | Lily    | Female |   21 | 15827654555  |
| 10006 | Jack    | Male   |   38 | 19929933000  |
| 10007 | Harry   | Male   |   23 | 19800092354  |
| 10008 | Sarah   | Female |   22 | 18953209331  |
| 10009 | Mike    | Male   |   35 | 19012378676  |
+-------+-------+--------+------+--------------+
9 rows in set (0.00 sec)

Next, let’s examine the composition of primary key indexes and secondary indexes in InnoDB tables.

InnoDB tables are index-organized tables, meaning the index is the data. The following diagram illustrates the data rows of table T1 in a clustered index format. Non-leaf nodes store the primary key values, leaf nodes store the primary key values and the corresponding data rows, and each page includes pointers to the previous and next pages.

Unlike MyISAM, InnoDB tables have their own page management system with a default page size of 16KB. MyISAM tables rely on the file system for data management, which typically has a default block size of 4KB, and MyISAM blocks are also 4KB. MyISAM tables do not have their own crash recovery mechanism and depend entirely on the file system.

InnoDB B+ Tree Primary Key

The design of InnoDB tables has two main advantages:

Data is stored in primary key order: The physical order of records matches the primary key order, eliminating the need for additional sorting. This is particularly beneficial since sorting is a resource-intensive operation. For table t1, the data is directly retrieved in primary key ID order.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
mysql> select * from t1;
+-------+-------+--------+------+--------------+
| id    | username | gender | age  | phone_number |
+-------+-------+--------+------+--------------+
| 10001 | Alice   | Female |   18 | 18501877098  |
| 10002 | Tom     | Male   |   20 | 17760500293  |
| 10003 | Emma    | Female |   29 | 13581386000  |
| 10004 | Lucy    | Female |   25 | 13456712000  |
| 10005 | Lily    | Female |   21 | 15827654555  |
| 10006 | Jack    | Male   |   38 | 19929933000  |
| 10007 | Harry   | Male   |   23 | 19800092354  |
| 10008 | Sarah   | Female |   22 | 18953209331  |
| 10009 | Mike    | Male   |   35 | 19012378676  |
+-------+-------+--------+------+--------------+
9 rows in set (0.00 sec)

Leaf nodes contain pointers to previous and next nodes: This allows for efficient row insertion and page splitting by simply adjusting the corresponding pointers.

Now, let’s look at the secondary indexes in InnoDB tables, as shown in the following diagram:

InnoDB B+ Tree Secondary Index

In InnoDB secondary indexes, non-leaf nodes store the index field values (e.g., the age field in the above example), and leaf nodes contain the index field values and the corresponding primary key values.

The advantages of this design are:

When data rows move or pages split, secondary indexes do not require unnecessary maintenance. Updates only need to modify the clustered index.

However, there are also disadvantages:

Secondary indexes are larger in size: Since they store primary key values, this can be particularly problematic if the primary key is poorly designed (e.g., using UUIDs as primary keys). The next article will discuss how to design effective primary keys.
Secondary index lookups require two index tree searches: First, the secondary index leaf node is searched to find the filtered row’s primary key value. Then, the clustered index is searched using this primary key value to retrieve the corresponding row.

For example, consider the following SQL statement to retrieve records where age is 23:

1
mysql> select * from t1 where age = 23;

This operation is effectively split into two steps:

First, find the primary key value corresponding to the age field:

1
2

mysql> select id from t1 where age = 23;

Then, retrieve the required data row from the clustered index using the primary key ID = 10005:

1
2

mysql> select * from t1 where id = 10005;

Although this requires an additional lookup, MySQL optimizes this process with data preheating (details of which can be found in the MySQL manual).

This concludes the content of this article. To summarize, we have explored the index organization methods and their respective advantages and disadvantages in MySQL’s two common engines: MyISAM and InnoDB. Feel free to leave comments or questions, and in the next article, we will discuss how to design effective primary keys in MySQL.

👋 See you in the next lesson.

What is SQLFlash?

SQLFlash is your AI-powered SQL Optimization Partner.

Based on AI models, we accurately identify SQL performance bottlenecks and optimize query performance, freeing you from the cumbersome SQL tuning process so you can fully focus on developing and implementing business logic.

Lesson 15 of the SQL Optimization Course: Index Design (MySQL Index Structures)

Introduction

Recommended reading

What is SQLFlash?

How to use SQLFlash in a database?

Ready to elevate your SQL performance?