

Some Fun with MySQL's History List

Why this article?

First of all, because I had fun digging in the code. Also, I had been reading a lot about the improvements coming in MySQL 5.6 and about some already present in 5.5. Most of them are well covered by people who certainly know more than me, so I read and read, but after a while I became curious. I began reading the code and performing tests, comparing versions: 5.1, 5.5, and 5.6. One of the things I was looking for was how the new purge thread mechanism works and what its implications are. I have to say that it seems to work better than in previous versions, and Dimitri's blog (see references) seems to confirm that. So again, why the article? Because I think there are some traps here and there, and I feel the need to write about them. The worst behavior is in MySQL 5.5, because 5.5 is in an intermediate situation: the purge is not fully rewritten as in 5.6, but it is no longer bound to the main thread either.

What is the history list?

MySQL uses (from InnoDB 1.1 in MySQL 5.5) 1 to 128 rollback segments, each able to support 1023 transactions. Each transaction is assigned to one of the rollback segments and remains tied to it for the duration of the transaction. This enhancement improves both scalability (a higher number of concurrent transactions) and performance (less contention when different transactions access the rollback segments). The history list is tightly bound to the undo log: it represents the number of header pages of undo log segments, segments that are related to finalized transactions, whether committed or rolled back. In short: the history list represents the number of not-yet-flushed segments in the undo log.

Old and new

Previously, the purge thread was directly controlled by the main thread in InnoDB, causing serious side effects (more info here). The main change was to move the purge out and allow it to run in isolation, so that it doesn't impact other processes. The move was done in InnoDB 1.1 (present in 5.1 and 5.5). But that is not all: the new version of purge has a mechanism that allows it to be more or less aggressive in relation to the increase or decrease of the history list length. This option is enabled in the 5.6 release and takes innodb_purge_threads as the "up to" value it can use when increasing the number of threads dedicated to purging.

Different behavior

An obvious fact about the new purge mechanism is that its behavior is quite different from the previous one, i.e. 5.0 (InnoDB 1.0) or 5.1. In the previous versions of InnoDB, we were taught to consider the history list as something that should stay as close to 0 as possible. Frankly speaking, that was always a little bit confusing, but the manual suggested in several places to keep it low:

"If you insert and delete rows in smallish batches at about the same rate in the table, the purge thread can start to lag behind and the table can grow bigger and bigger because of all the “dead” rows, making everything disk-bound and very slow. In such a case, throttle new row operations, and allocate more resources to the purge thread by tuning the innodb_max_purge_lag system variable." --InnoDB Multi-Versioning

A classic scenario for such an issue is an application tracking activity on the network, which needs to write a huge number of small inserts into the database. From my tests, I have seen an incredible number of entries in the history list in 5.5 that were not present in the previous InnoDB version and that are not present again in 5.6. The point is that there could be so many transactions doing INSERT, UPDATE, or DELETE that the history list and the un-flushed undo log grow too much. To prevent issues, we should tune the value of innodb_max_purge_lag in order to allow InnoDB to complete the PURGE operations. innodb_max_purge_lag is the highest number we want to have in the history list, above which InnoDB will start to apply an indiscriminate delay to incoming operations. The formula is quite simple:

[code] ((purge_lag/innodb_max_purge_lag)×10)–5 milliseconds. [/code]

Or, following the code, we have:

[code] float ratio = (float) trx_sys->rseg_history_len / srv_max_purge_lag; ((ratio - .5) * 10000); [/code]

If we have a history list of 1200 and have set innodb_max_purge_lag to 1000, the result should be:

[code] ((1200/1000)×10)-5 = 7 ms delay per operation, following the manual ((1200/1000) - 0.5) * 10000 = 7000 microseconds, following the code [/code]

Both match, and the delay will be 7 ms. Also, note that in previous versions the max limit for the delay was 4295 seconds!! In 5.5/5.6, we see a different way of managing the maximum:

[code] if (delay > srv_max_purge_lag_delay) { delay = srv_max_purge_lag_delay; } [/code]

srv_max_purge_lag_delay's max value is 10 seconds, so in the worst-case scenario the max delay will be 10 seconds. In the past, we were used to seeing the history list going up and down (relatively) fast, so the forced delay was playing its role efficiently.
At the same time, we knew that all operations in the main thread were slowed down, so the forced delay was something we had to live with, or worse things could happen, like the need to perform an emergency page flush from the buffer pool to allow the REDO log to recover space. But something has changed... For the better, obviously, but we must be careful. It's better because now the purge thread works independently and can scale, and pending undo flushes do not slow down the other operations. Also, in 5.6 MySQL can be more or less aggressive in relation to the history list to purge. But these operations remain something we should monitor and tune, for two main reasons:
- The space taken by undo log segments is still an issue, and now that their number has increased, it could be even worse.
- Setting a wrong value for innodb_max_purge_lag could kill our performance.
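The two forms of the delay formula can be checked against each other with a quick sketch (a toy calculation with variable names of my own choosing, not server code):

```python
# Toy check that the manual's formula and the InnoDB code's formula agree.

def delay_manual_ms(history_len, max_purge_lag):
    # Manual: ((purge_lag / innodb_max_purge_lag) * 10) - 5 milliseconds
    return (history_len / max_purge_lag) * 10 - 5

def delay_code_us(history_len, max_purge_lag):
    # Code: (ratio - 0.5) * 10000 microseconds
    ratio = history_len / max_purge_lag
    return (ratio - 0.5) * 10000

# History list of 1200 with innodb_max_purge_lag = 1000:
print(delay_manual_ms(1200, 1000))  # 7.0 ms
print(delay_code_us(1200, 1000))    # ~7000 microseconds -> the same 7 ms
```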

Let us start digging a bit

First of all, the history list and the purge thread are still very badly documented in the InnoDB status report. In 5.5, we can count on the history list length, the transaction number the purge has run up to, and the number of purged records. Finally, in 5.6 we have the generic state. There isn't much here; better info, like the number of running threads, the real number of segments used (1-128), the number of header pages in the segments, and their dimension (at least in pages, as Percona does), would be nice to have.

The undo log is currently stored inside the system tablespace. This means that IT WILL BE IMPOSSIBLE to shrink the size of the tablespace once the undo log has taken a huge amount of space. That's it: 80% or more of the size of a system tablespace can be undo log when using innodb_file_per_table. This was already true when InnoDB was using a single set of segments (1023), but now that it can scale up to 130944 and support more transactions, the size on disk can explode.

Some numbers, to get a better understanding:
- History list: 359585
- Inserts/s: 35649.67
- Pages to flush in the undo log: 429126
- Total size of the undo log: ~6.7 GB

Considering that the normal block size is 4K and each page takes 4 I/O operations, we will have 1,716,504 operations, assuming each page is written sequentially. With 3 ms for seek plus half a rotation, 1 ms transfer for the first operation, and 2 ms for each of the remaining, each page takes about 12 ms to delete on disk. At ~6.7 MB/s, given the partially random writes, it will take 5149.512 seconds (about 86 minutes) to flush the whole undo log. Obviously, this number changes with the available cache and available spindles. Also, more threads means more write capacity and less time, so the innodb_purge_threads option is more than welcome.
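As a sanity check on the arithmetic above, here is a small sketch. The 16 KB page size, the 4 I/O operations per page, and the 12 ms per page are the article's working assumptions, not measurements; expressed in GiB, the total comes out near the rounded ~6.7 GB figure:

```python
# Back-of-envelope model of flushing the undo log, using the figures above.

def undo_flush_estimate(pages_to_flush, page_size_kb=16, ms_per_page=12.0):
    size_gib = pages_to_flush * page_size_kb / (1024 * 1024)  # total undo size
    io_ops = pages_to_flush * 4            # four 4 KB blocks per 16 KB page
    seconds = pages_to_flush * ms_per_page / 1000.0
    return size_gib, io_ops, seconds

size_gib, io_ops, seconds = undo_flush_estimate(429126)
print(f"~{size_gib:.1f} GiB, {io_ops} I/O operations, "
      f"{seconds:.0f} s (~{seconds / 60:.0f} min)")
```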

Setting the right value for innodb_max_purge_lag

When setting this value, we must take the following into consideration:
- The value is the number of head pages, each representing an undo log related to a running/executed transaction.
- Purge can apply only to complete transactions.
- The delay applies to all write operations inside InnoDB.

Nowadays it is not uncommon to see high numbers in the history list, as in the case above with "just" 359,585 head pages. It is therefore very important to correctly balance the point where the delay starts against the real load, like transactions/sec and the kind of operations ongoing on the server. To clarify that reads matter here, and not only writes, let me cite:

Running Very Long Transaction If you’re running very long transaction, be it even SELECT, Innodb will be unable to purge records for changes which are done after this transaction has started, in default REPEATABLE-READ isolation mode. This means very long transactions are very bad causing a lot of garbage to be accommodated in the database. It is not limited to undo slots. When we’re speaking about Long Transactions the time is a bad measure. Having transaction in read only database open for weeks does no harm, however if database has very high update rate, say 10K+ rows are modified every second even 5 minute transaction may be considered long as it will be enough to accumulate about 3 million of row changes. (Peter Z.)

But what can really harm the system is the delay defined by the purge lag to improve the flushing. Assume we set innodb_max_purge_lag=200,000, and we have the number of pending flush pages seen above, 359585. Doing the calculation as per the previous formula:

[code] ((359585/200000)×10)-5 = 12.97925 ms delay per operation. [/code]

Hey, that's not too bad. I will delay operations/inserts by only 12.97925 ms to help the purge. But what is unclear is what an operation is for the delay, or rather, where the delay really applies. Ready? Are you sitting comfortably?

In row0mysql.c:

[code] /*******************************************************************//** Delays an INSERT, DELETE or UPDATE operation if the purge is lagging. */ static void row_mysql_delay_if_needed(void) /*===========================*/ { if (srv_dml_needed_delay) { os_thread_sleep(srv_dml_needed_delay); } } [/code]

And in os0thread.c:

[code] /*****************************************************************//** The thread sleeps at least the time given in microseconds. */ UNIV_INTERN void os_thread_sleep( /*============*/ ulint tm) /*!< in: time in microseconds */ { #ifdef __WIN__ Sleep((DWORD) tm / 1000); #else struct timeval t; t.tv_sec = tm / 1000000; t.tv_usec = tm % 1000000; select(0, NULL, NULL, NULL, &t); #endif } [/code]

Do you get it? The delay is per ROW. So assume that you have a system checking connection status and traffic, collecting statistics every minute for your 100,000 connected users, each user generating at least 8 inserts, plus a consolidation operation that runs every 10 minutes. Each insert takes 0.002 seconds, and all traffic is managed by 300 threads: 100,000 x 8 = 800,000 inserts / 300 threads = 2,666 inserts per thread x 0.002s each = 5.3 sec to run all the operations. Now, what if we apply the delay, given that the history list is quite high? I have to add the ~13 ms to the 0.002, which gives us about 0.015 sec per insert.
This gives us about 40 seconds for each thread instead of 5.3!!!! Do not forget the consolidation process, which needs to run every 10 minutes. It has to process (1 read per user per minute x 100,000 users) x 10 minutes, split across 10 threads working by user-id set; assume each read per user takes 0.001 sec (data already in memory) and each write 0.002 sec. Without the delay, we will have 1,000,000 / 10 = 100,000 operations x 0.003s = 300 sec (5 minutes) per thread. With the ~13 ms delay added to every write, it becomes 100,000 x 0.016s = 1,600 seconds (almost 27 minutes) per thread. Last but not least, the consolidation will collide with the inserts, possibly increasing the delay further because of REPEATABLE READ. Another possible issue is that the consolidation will cause pages to remain dirty for too long, possibly causing serious issues with the REDO log in case we need to free space. I did push some numbers a little bit just to make the scenario clearer. Now, just to remove any possible doubt:
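The per-thread arithmetic above can be sketched as follows (all numbers are the scenario's hypothetical figures, not benchmarks):

```python
# Model the per-thread batch time with and without the purge-lag delay.

def batch_time_s(rows_per_thread, base_s, delay_s=0.0):
    # Each row pays its own base cost plus, when purge lags, the forced sleep.
    return rows_per_thread * (base_s + delay_s)

rows = 100_000 * 8 // 300                 # ~2666 inserts per thread
print(batch_time_s(rows, 0.002))          # ~5.3 s without the delay
print(batch_time_s(rows, 0.002, 0.013))   # ~40 s with ~13 ms added per row
```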

Are you sure that it is really applied per ROW?

Let us create a working test:

[code] CREATE TABLE `City_test2` ( `ID` int(11) NOT NULL DEFAULT '0', `Name` char(35) NOT NULL DEFAULT '', `CountryCode` char(3) NOT NULL DEFAULT '', `District` char(20) NOT NULL DEFAULT '', `Population` int(11) NOT NULL DEFAULT '0', `satisfaction_grade` char(50) DEFAULT NULL, `previous_satgrade` varchar(50) DEFAULT NULL, KEY `satisfaction_grade` (`satisfaction_grade`,`previous_satgrade`) ) ENGINE=InnoDB DEFAULT CHARSET=latin1;

insert into City_test2 select * from city;

mysql> truncate table City_test2; insert into City_test2 select * from city limit 1000;
Query OK, 0 rows affected (0.24 sec)
Query OK, 1000 rows affected (0.10 sec)
Records: 1000 Duplicates: 0 Warnings: 0 [/code]

Running the query, we see the history list going up and down, but never to 0 as before. However, if we set innodb_max_purge_lag to a value lower than the history list (say the history is 137 and we set the lag to 100), in theory there should be the following delay, as per the formula:

[code] ((purge_lag/innodb_max_purge_lag)×10)–5 milliseconds. [/code]

The result should be:

[code] ((137/100)×10)-5 = 8.7 ms [/code]

[code] mysql> set global innodb_max_purge_lag=100;
Query OK, 0 rows affected (0.00 sec)

mysql> truncate table City_test2; insert into City_test2 select * from city limit 1000;
Query OK, 0 rows affected (0.10 sec)
Query OK, 1000 rows affected (7.40 sec)
Records: 1000 Duplicates: 0 Warnings: 0 [/code]

Close enough from my point of view, and the time increases as the number of rows increases. Changing innodb_purge_batch_size, innodb_rollback_segments, and innodb_purge_threads will change how the purge works, but not the delay. innodb_max_purge_lag is the only parameter we can use, and it's not enough.
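Assuming, as row_mysql_delay_if_needed() suggests, that the sleep is applied once per row, the expected total for the 1000-row insert can be sketched as:

```python
# Expected cumulative delay for the 1000-row test with history=137, lag=100.

def per_row_delay_ms(history_len, max_purge_lag):
    return (history_len / max_purge_lag) * 10 - 5

expected_s = 1000 * per_row_delay_ms(137, 100) / 1000.0
print(expected_s)  # ~8.7 s expected, against the 7.40 s observed
```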

Conclusions

The new mechanism for the purge is much more flexible and able to scale. The fact that it's now separate from the main thread reduces a lot of the negative effects. Nevertheless, the capacity that the undo log purge now has also presents possible risks. Huge pending flushes mean possibly huge space on disk and/or huge delays. The delay behavior is not new; it was already present in MySQL 5.0, but the new capacity brings it to a higher level of risk, especially in 5.5. In 5.6 the purge thread is much more efficient, and getting a huge history list was difficult, but when I did get one I saw the same behavior. Whenever we need to tune the purge lag settings, the value needs to be set not in relation to the history list, but in relation to the maximum acceptable cumulative delay on the inserts.
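One way to act on that advice is to invert the delay formula: start from the per-row delay you can tolerate and the history list length you expect under load, and derive the setting from there (a hypothetical helper of my own, not a server feature):

```python
# delay_ms = (history / max_lag) * 10 - 5  =>  max_lag = 10 * history / (delay_ms + 5)

def max_purge_lag_for(expected_history, acceptable_delay_ms):
    # Largest innodb_max_purge_lag that keeps the per-row delay at or
    # below the acceptable value for the expected history list length.
    return int(10 * expected_history / (acceptable_delay_ms + 5))

# Tolerating at most 1 ms per row with a history list around 360,000:
print(max_purge_lag_for(360_000, 1.0))  # 600000
```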
