Why do we sometimes want to keep duplicate indexes?
I’ve done dutiful DBA work in the past to identify and remove what are commonly called duplicate indexes. That is, those indexes that look like (a) and (a,b). The thought is that a query will utilize an index as easily on (a) as on (a,b), and removing (a) will save storage cost and write performance. I’ve had the experience, though, of removing (a) and seeing performance tank.
(As an aside, these are really redundant indexes. A duplicate index would be (a,b) and (a,b) by two different names – this can commonly be done by object relational mapping (ORM) or other automated schema creation tools. I’ll call (a) and (a,b) redundant indexes below.)
This test is on Percona Server 5.7.14 with the sys schema installed and performance schema enabled.
Given two tables with the same number of rows and with the same schema except an extra index on the second table:
mysql>show create table mysqlslap.t5\G
Table: t5
Create Table: CREATE TABLE `t5` (
`intcol1` int(11) DEFAULT NULL,
`intcol2` int(11) DEFAULT NULL,
`intcol3` int(11) DEFAULT NULL,
`charcol1` varchar(255) DEFAULT NULL,
`charcol2` varchar(255) DEFAULT NULL,
`charcol3` varchar(255) DEFAULT NULL,
KEY `one` (`intcol1`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 row in set (0.00 sec)
mysql>show create table mysqlslap.t6\G
Table: t6
Create Table: CREATE TABLE `t6` (
`intcol1` int(11) DEFAULT NULL,
`intcol2` int(11) DEFAULT NULL,
`intcol3` int(11) DEFAULT NULL,
`charcol1` varchar(255) DEFAULT NULL,
`charcol2` varchar(255) DEFAULT NULL,
`charcol3` varchar(255) DEFAULT NULL,
KEY `one` (`intcol1`),
KEY `two` (`intcol1`,`charcol1`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 row in set (0.00 sec)
Notice that table t6 has a redundant index on intcol1.
The tables were populated with a statement like the following:
sudo mysqlslap –concurrency=2 –iterations=10000 –query=“insert into mysqlslap.t5(intcol1,intcol2,intcol3,charcol1,charcol2,charcol3,charcol4) values(FLOOR(RAND() * 10000),FLOOR(RAND() * 10000),FLOOR(RAND() * 10000),SUBSTRING(MD5(RAND()) FROM 1 FOR 50),SUBSTRING(MD5(RAND()) FROM 1 FOR 50),SUBSTRING(MD5(RAND()) FROM 1 FOR 50),now());” –delimiter=“;” –verbose
And then for the remaining tables, like this:
mysql>insert into t6 select * from t5;
In both cases, the explain plan is showing the following query will use index one (look at the read_cost value, too):
mysql>explain format=json select intcol1 from mysqlslap.t5 where intcol1>1910858200\G
EXPLAIN: {
“query_block”: {
“select_id”: 1,
“cost_info”: {
“query_cost”: “7092.73”
},
“table”: {
“table_name”: “t5”,
“access_type”: “range”,
“possible_keys”: [
“one”
],
“key”: “one”,
“used_key_parts”: [
“intcol1”
],
“key_length”: “5”,
“rows_examined_per_scan”: 17670,
“rows_produced_per_join”: 17670,
“filtered”: “100.00”,
“using_index”: true,
“cost_info”: {
“read_cost”: “3558.73”,
“eval_cost”: “3534.00”,
“prefix_cost”: “7092.73”,
“data_read_per_join”: “13M”
},
“used_columns”: [
“intcol1”
],
“attached_condition”: “(`mysqlslap`.`t5`.`intcol1`>1910858200)”
}
}
}
mysql>explain format=json select intcol1 from mysqlslap.t6 where intcol1>1910858200\G
EXPLAIN: {
“query_block”: {
“select_id”: 1,
“cost_info”: {
“query_cost”: “7092.73”
},
“table”: {
“table_name”: “t6”,
“access_type”: “range”,
“possible_keys”: [
“one”,
“two”
],
“key”: “one”,
“used_key_parts”: [
“intcol1”
],
“key_length”: “5”,
“rows_examined_per_scan”: 17670,
“rows_produced_per_join”: 17670,
“filtered”: “100.00”,
“using_index”: true,
“cost_info”: {
“read_cost”: “3558.73”,
“eval_cost”: “3534.00”,
“prefix_cost”: “7092.73”,
“data_read_per_join”: “13M”
},
“used_columns”: [
“intcol1”
],
“attached_condition”: “(`mysqlslap`.`t6`.`intcol1`>1910858200)”
}
}
}
mysql>explain format=json select intcol1 from mysqlslap.t6 force index(two) where intcol1>1910858200\G
EXPLAIN: {
“query_block”: {
“select_id”: 1,
“cost_info”: {
“query_cost”: “7508.43”
},
“table”: {
“table_name”: “t6”,
“access_type”: “range”,
“possible_keys”: [
“two”
],
“key”: “two”,
“used_key_parts”: [
“intcol1”
],
“key_length”: “5”,
“rows_examined_per_scan”: 17368,
“rows_produced_per_join”: 17368,
“filtered”: “100.00”,
“using_index”: true,
“cost_info”: {
“read_cost”: “4034.84”,
“eval_cost”: “3473.60”,
“prefix_cost”: “7508.44”,
“data_read_per_join”: “12M”
},
“used_columns”: [
“intcol1”
],
“attached_condition”: “(`mysqlslap`.`t6`.`intcol1`>1910858200)”
}
}
}
But the explain tool only shows what the optimizer predicts it will do; is the optimizer predicting the best execution plan? What really happens?
mysqlslap –concurrency=2 –iterations=1000 –create-schema=mysqlslap –query=“select intcol1 from t5 where intcol1>FLOOR(RAND() * 10000000);select intcol1 from t6 where intcol1>FLOOR(RAND() * 10000000);select intcol1 from t6 force index(two) where intcol1>FLOOR(RAND() * 10000000);” –delimiter=“;” –verbose
mysql>select * from sys.schema_index_statistics;
------------------------------------------------------------------------------------------------------------
| table_schema | table_name | index_name | rows_selected | select_latency | rows_inserted | insert_latency | rows_updated | update_latency | rows_deleted | delete_latency |
------------------------------------------------------------------------------------------------------------
| mysqlslap | t6 | two | 200692044 | 5.32 m | 0 | 0 ps | 0 | 0 ps | 0 | 0 ps |
| mysqlslap | t6 | one | 200692044 | 4.92 m | 0 | 0 ps | 0 | 0 ps | 0 | 0 ps |
| mysqlslap | t5 | one | 200692044 | 4.87 m | 0 | 0 ps | 0 | 0 ps | 0 | 0 ps |
| sys | sys_config | PRIMARY | 0 | 0 ps | 0 | 0 ps | 0 | 0 ps | 0 | 0 ps |
------------------------------------------------------------------------------------------------------------
The select_latency is higher for the index on (intcol1,charcol1). That value is coming from the performance schema via the sys schema:
…`sys`.`format_time`(`performance_schema`.`table_io_waits_summary_by_index_usage`.`SUM_TIMER_FETCH`) AS `select_latency`…
The full sys schema view is:
CREATE ALGORITHM=MERGE DEFINER=`mysql.sys`@`localhost` SQL SECURITY INVOKER VIEW `sys`.`schema_index_statistics` AS select `performance_schema`.`table_io_waits_summary_by_index_usage`.`OBJECT_SCHEMA` AS `table_schema`,`performance_schema`.`table_io_waits_summary_by_index_usage`.`OBJECT_NAME` AS `table_name`,`performance_schema`.`table_io_waits_summary_by_index_usage`.`INDEX_NAME` AS `index_name`,`performance_schema`.`table_io_waits_summary_by_index_usage`.`COUNT_FETCH` AS `rows_selected`,`sys`.`format_time`(`performance_schema`.`table_io_waits_summary_by_index_usage`.`SUM_TIMER_FETCH`) AS `select_latency`,`performance_schema`.`table_io_waits_summary_by_index_usage`.`COUNT_INSERT` AS `rows_inserted`,`sys`.`format_time`(`performance_schema`.`table_io_waits_summary_by_index_usage`.`SUM_TIMER_INSERT`) AS `insert_latency`,`performance_schema`.`table_io_waits_summary_by_index_usage`.`COUNT_UPDATE` AS `rows_updated`,`sys`.`format_time`(`performance_schema`.`table_io_waits_summary_by_index_usage`.`SUM_TIMER_UPDATE`) AS `update_latency`,`performance_schema`.`table_io_waits_summary_by_index_usage`.`COUNT_DELETE` AS `rows_deleted`,`sys`.`format_time`(`performance_schema`.`table_io_waits_summary_by_index_usage`.`SUM_TIMER_INSERT`) AS `delete_latency` from `performance_schema`.`table_io_waits_summary_by_index_usage` where (`performance_schema`.`table_io_waits_summary_by_index_usage`.`INDEX_NAME` is not null) order by `performance_schema`.`table_io_waits_summary_by_index_usage`.`SUM_TIMER_WAIT` desc
Just to prove the results aren’t being confounded by the redundant index itself, here are two new tables with the same data:
mysql>show create table mysqlslap.t7\G
Table: t7
Create Table: CREATE TABLE `t7` (
`intcol1` int(11) DEFAULT NULL,
`intcol2` int(11) DEFAULT NULL,
`intcol3` int(11) DEFAULT NULL,
`charcol1` varchar(255) DEFAULT NULL,
`charcol2` varchar(255) DEFAULT NULL,
`charcol3` varchar(255) DEFAULT NULL,
KEY `one` (`intcol1`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 row in set (0.00 sec)
mysql>show create table mysqlslap.t8\G
Table: t8
Create Table: CREATE TABLE `t8` (
`intcol1` int(11) DEFAULT NULL,
`intcol2` int(11) DEFAULT NULL,
`intcol3` int(11) DEFAULT NULL,
`charcol1` varchar(255) DEFAULT NULL,
`charcol2` varchar(255) DEFAULT NULL,
`charcol3` varchar(255) DEFAULT NULL,
KEY `two` (`intcol1`,`charcol1`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 row in set (0.00 sec)
(Restarted to reset the performance schema and sys tables.)
mysqlslap –concurrency=2 –iterations=1000 –create-schema=mysqlslap –query=“select intcol1 from t7 where intcol1>FLOOR(RAND() * 10000000);select intcol1 from t8 where intcol1>FLOOR(RAND() * 10000000);” –delimiter=“;” –verbose
mysql>select * from sys.schema_index_statistics;
------------------------------------------------------------------------------------------------------------
| table_schema | table_name | index_name | rows_selected | select_latency | rows_inserted | insert_latency | rows_updated | update_latency | rows_deleted | delete_latency |
------------------------------------------------------------------------------------------------------------
| mysqlslap | t8 | two | 200012000 | 5.16 m | 0 | 0 ps | 0 | 0 ps | 0 | 0 ps |
| mysqlslap | t7 | one | 200012000 | 4.67 m | 0 | 0 ps | 0 | 0 ps | 0 | 0 ps |
------------------------------------------------------------------------------------------------------------
Remember, this is the same data and schema except for the indexes.
Conclusion: Sometimes redundant indexes are helpful.
Thank you to Abhinav Gupta for the technical review.
6 Comments. Leave new
Check out mysqlindexcheck from the MySQL Utilities (https://dev.mysql.com/doc/mysql-utilities/1.6/en/mysqlindexcheck.html) to help find redundant indexes.
Thank you, Dave! I will give this a try and report back.
Hi!
Because InnoDB indexes are clustered and also have the primary key included, there are some subtle cases where (a,b) is not the superset of (a) because of the left most rule :) Consider this test case:
https://www.tocker.ca/files/redundant-indexes-useful.txt
I agree with your clarification. These are really redundant* indexes – not duplicate indexes. In some specific cases, redundancy is fine (that’s what indexes are after all.)
That’s a good one, Morgan. Thank you for sharing it! I learned a new trick for inserting data there, too.
Hi Valerie,
I like to make a difference between duplicate indexes which is something like:
PRIMARY KEY(ID), UNIQUE(ID), KEY(ID)
Which I have seen multiple time with developers explaining what they do not only ID to be primary key but also be unique and be and key because it is going to be used by lookups not understanding it is all the same thing.
MySQL should have just created one index in this case but it does not.
Duplicate Indexes are always bad in this definition as they are truly duplicate and produce no value.
Redundant indexes are different keys something like (A) and (A,B) are typically called as such and as you describe indeed these might be helpful for more reasons than one :)
Thank you for the details, Peter — it’s an important distinction between duplicate and redundant indexes. The tools to detect duplicate indexes can be misleading sometimes. I’m currently working out a way to combine the tables in the sys schema with one of the existing tools to provide some guidance for the DBA into which indexes should be removed.