I was reading a brilliant book Relational Database Index Design and the Optimizers by Tapio Lahdenmaki and Mike Leach. At the end of one of the chapters I came across the exercise to design two indexes for a given query: one to minimize the number of index rows scanned and second one to eliminate sorting.
Using algorithm described in the book I quickly came up with two indexes and while first one looked fine, I was really confused by the second one for the elimination of the sort. Let me show an example, not copy one from the book, but rather show a test I did with MySQL.
Here is a table that I have:
CREATE TABLE `index_test1` ( `A` int(11) default NULL, `B` int(11) default NULL, `C` int(11) default NULL, `D` int(11) default NULL, `F` int(11) default NULL ) ENGINE=InnoDB DEFAULT CHARSET=latin1
The query that I was designing index for:
SELECT D FROM index_test1 WHERE C=4 AND F=6 ORDER BY A,B,C,F LIMIT 10;
Again this is not exactly the same query as in the book, but a simplified version I used for testing. Following the algorithm from the book, I quickly came up with the following index:
KEY `idx_1` (`C`,`F`,`A`,`B`)
According to the authors of the book this index eliminates the sort for the query above. But I was confused: “Wait, how can this index eliminate sorting, if the order of the columns in index is different from one in the ORDER BY?”. So I decided to do a test, I populated the test table with about 200k rows of sample data and made sure columns selectivity is such that it makes sense to MySQL to choose my index.
Here is what I saw, when I ran the EXPLAIN on the query:
mysql> EXPLAIN SELECT D FROM index_test1 WHERE C=4 AND F=6 ORDER BY A,B,C,F LIMIT 10; +----+-------------+-------------+------+---------------+-------+---------+-------------+-------+-------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+-------------+------+---------------+-------+---------+-------------+-------+-------------+ | 1 | SIMPLE | index_test1 | ref | idx_1 | idx_1 | 10 | const,const | 13288 | Using where | +----+-------------+-------------+------+---------------+-------+---------+-------------+-------+-------------+ 1 row in set (0.00 sec)
So, the sorting is indeed eliminated! It took me some time and a couple of drawings to actually understand why this is happening. The main reason, why additional sorting is not required in this case is the type of the predicate for columns C and F. In this query, we compare the columns against constants (see ref column of explain), this means that only one value for C and F will be in the resulting index slice:
C | F | A | B |
… | … | … | … |
4 | 6 | 1 | 10 |
4 | 6 | 1 | 20 |
4 | 6 | 2 | 15 |
4 | 6 | 2 | 80 |
… | … | … | … |
This is an example of index slice for index on (C,F,A,B). As you can see, when a matching predicate is used (column=const), there is only 1 possible value selected and thus there is absolutely no need to sort by this column. In our case we have 2 matching columns, C and F with only one value for each column, there is no difference in which order will you place this values. What matters is next index columns after matching columns. In this case the next columns are A and B and as you can see they provide a sorting order we need.
This will becomes more clear if we take a look at what happens if there is a range predicate instead of matching one:
SELECT D FROM index_test1 WHERE C BETWEEN 4 AND 5 AND F=6 ORDER BY A,B,C,F LIMIT 10;
The index slice for this one may look like this:
C | F | A | B |
… | … | … | … |
4 | 6 | 1 | 10 |
4 | 6 | 1 | 20 |
4 | 6 | 2 | 5 |
5 | 6 | 0 | 11 |
5 | 6 | 10 | 7 |
5 | 6 | 10 | 8 |
… | … | … | … |
In this case, we have two values for C: 4 and 5. All index columns following this one will have two sorted blocks, one for 4 and one for 5. But these index columns are not sorted for the whole result set. So if there is a range predicate, the rest of the index can not be used for sorting.
Why is eliminating the sort is important? The answer is that it really depends on the situation. With modern hardware, sorting a small amount of data does not produce significant impact on performance especially if sort operation can be done in memory. But in some cases having data presorted can reduce the number of rows the query has to scan and thus improve performance greatly.
Let’s take a look at the query once again:
SELECT D FROM index_test1 WHERE C=4 AND F=6 ORDER BY A,B,C,F LIMIT 10;
Predicated C=4 and F=6 identify the index slice which has to be scanned to produce the result set. Lets assume that selectivity of these two columns is such that for these parameters the size of index slice would be 10% of the whole data set, or in my case around 2000 rows. So, if index on fields (C,F,A,B)
could not be used for sorting, MySQL would have to get the 2000 rows, sort them, and then return top 10 rows to the client.
But the index produces the result set already in the order we need, so what happens in reality for this query is that instead of scanning 2000 rows only the first 10 rows will be scanned.
No comments