Exposing Innodb Internals via System Variables: Part 3, I/O (Table data)

Posted in: Technical Track


This is part three of a five part blog series to explore InnoDB internals by looking at the related tunable system variables. In part 2 we covered variables that had the greatest impact on the file structure of InnoDB as well as how data is written to logs. In this section we will continue looking at I/O but more specifically looking at the mechanics on how data gets written to table files as well as how background threads read from them.

Just like in part two, I would like to emphasize something that was written in part one of this blog post series.

“I should note that while tuning recommendations are provided, this objective of this blog post series was NOT meant to be a tuning primer, but instead to explore the mechanics that each variable interacts with. As such I would like to advise against reading this guide and trying to fine tune all of the available InnoDB variables. System variable tuning is an exercise in diminishing returns, the most benefit you’ll get out of tuning your MySQL server will occur within the first 15 minutes of configuration. In order to help prevent excess tuning I will provide a list of variables out of each section that should be set when configuring a new system.”


Okay! Let’s have a look at the rest of the I/O related InnoDB system variables.

Definition: Innodb has to flush dirty pages from the buffer pool back to the table space as part of the checkpointing process, which ensures that you don’t run out of redo log space. Note, the redo log is cyclical so data information must be purged in order to make space for transaction information that’s coming in. Removing entries from the redo log space requires flushing the associated dirty pages back to the table space.
If adaptive flushing is enabled then the rate of dirty page flushing adjusts itself automatically based on the current system workload. Flushing too many pages at once may cause contention for other operations, both read and write. Flushing too few pages may cause InnoDB to fall behind in its checkpointing process and it may run out of redo log space. Adaptive flushing ensures that there is a maintained balance.
Default: 1 (ON)
Tuning notes: I would recommend leaving this enabled, but if you would like InnoDB to take on characteristics of previous MySQL versions, you can disable it. Without adaptive flushing InnoDB has a tendency to flush data based solely on the amount of remaining space in the redo log as well as the maximum dirty page percentage and can cause spike filled I/O traffic patterns as opposed to something more balanced. You can see examples of spiky flushing patterns that are typically observed with disabling adaptive flushing by looking at this blog post by Devananda van der Veen. Considering the amount of effort that’s been placed on advancing the algorithms that drive adaptive flushing I would recommend that you exercise caution when disabling it and make sure you’ve done extensive testing to support disabling this feature should you choose to explore that option.
Associated Metrics: global status: innodb_buffer_pool_pages_dirty and innodb_buffer_pool_bytes_dirty, this will show you what is pending to be flushed.

Definition: If you have chosen to disable adaptive flushing, this is the ‘low water mark’ of the redo log where adaptive flushing will be enabled. IE: When only x% of the redo log space is remaining, adaptive flushing will be enabled, even if you set it to be disabled.
Default: 10 (percent)
Tuning notes: Change this when you have adaptive flushing disabled and either want it to never be enabled (set to 0) or you want it to be enabled sooner (set higher than 10)
Per the MySQL reference guide this should only be modified on servers that run write-heavy workloads for long periods of time and have little reduced load time to catch up with changes waiting to be written to disk.

Definition: This is one of the elements that feeds into the flushing algorithm (adaptive or not) that determines how much data should be flushed. The algorithm is going to try and keep the buffer pool dirty pages below this percent based on the number of dirty pages in the buffer pool and the rate of accumulation of the redo log. It should be noted that this does not impact the rate of flushing, but is used to determine a target for flushing activity at each flush iteration.
Default: 75 (percent)
Tuning notes: Reduce this if you want to lower the amount of dirty pages in order to support more unwritten/unchanged pages for read heavy environments. You can also impact the target for flushing activity by adjusting this variable, lowering it will increase the activity per flush, raising it will decrease the activity. Also, reducing this variable will speed up the process of shutting down MySQL as InnoDB has to flush dirty pages to disk as part of the MySQL shutdown process.
Associated Metrics: global status: innodb_buffer_pool_pages_dirty and innodb_buffer_pool_bytes_dirty, this will show you what is pending to be flushed.

Definition: When the dirty page percentage of the buffer pool reaches the number set by this variable, it will begin ‘pre-flushing’ dirty pages out of the buffer pool (essentially more aggressively flushing at each flush iteration) in hopes of avoiding the innodb_max_dirty_pages_pct value.
Default: 0 (percent) (disabled)
Tuning notes: Adjust this variable when you want InnoDB to more aggressively flush when a certain dirty page percentage is hit. Per the MySQL reference guide this should only be modified on servers that run write-heavy workloads for long periods of time and have little reduced load time to catch up with changes waiting to be written to disk.

Definition: When InnoDB performs flushes, it keeps a snapshot of the flushing ‘state’, IE: how is the current rate of flushing working when we compare this against the growth of the redo log, dirty pages, etc. It keeps a history of these states in order to develop a trend in which it will use to forecast the needed rate / target for flushing. The value of this variable is the number of snapshots that are kept in history.
Default: 30 (states)
Tuning notes: This is like any kind of trending in that the more data you have, the smoother the trend line is going to be. If you have spiky IO loads on a server where the workload quickly changes, you can increase this value to smooth out the I/O pattern. However, if you have a system where the load is predictable, but need to respond faster (think milliseconds) to sudden spikes in activity, you should lower this value.
Per the MySQL reference guide this should only be modified on servers that run write-heavy workloads for long periods of time and have little reduced load time to catch up with changes waiting to be written to disk.
Associated Metrics: You need to have access to historical trending data in order to determine the effectiveness of this variable.You will want to see how spikes in transactions compare against the content of dirty pages and how quickly spikes are handled.

Definition: InnoDB has a process that runs once a second called the ‘page_cleaner’ and this feeds into the algorithm that’s used to determine rate and target of flushing. This value specifies how far down the LRU list (least recently used, see section 1 in the the innodb_old_blocks_pct section for how this applies to the InnoDB buffer pool) page_cleaner process should scan when looking for dirty pages to flush. Note that in the buffer pool, the LRU can consist of both old and dirty pages, and since this process is looking for dirty pages, the deeper into the LRU list the cleaner scans the more pages it will find to flush.
Default: 1024 (pages)
Tuning notes: If you have a large buffer pool, and if you have a write intensive workload that’s saturating I/O, then decrease this value. It’s also theorized that if you increase innodb_io_capacity, you should increase this variable as well, which you can read about in this blog post by Mark Callaghan.

Definition: If this variable is set to enabled, when InnoDB performs a flush of a page from the buffer pool to persistent storage, it will check to see if any other dirty pages in the buffer pool belong in the same extent, and if so it will flush those as well in order to group writes and avoid disk seek time latency. The options for this variable are as follows…

1 = Enabled, but only flushes dirty neighbor pages from the buffer pool if they are in the same extent and the are considered contiguous.

2 = Enabled, but the requirement for contiguous pages is not present in this setting, which means neighbor flushing is more likely to occur.

0 = Disabled

Default: 1
Tuning notes: Disable when you are using SSDs or fast IO solution.

Definition: This defines the system calls the InnoDB will use to push data to physical storage for both data and log files. I would recommend reading Ivan Groenewold’s blog post ‘Innodb flushing and linux I/O’ to get a better understanding of what each option means and how it relates to the usage of fsync(), fdatasync(), O_SYNC, and O_DIRECT.
Default: Null (On linux this defaults to fsync, on windows it defaults to async_unbuffered)
Tuning notes: I agree with Ivan’s recommendation that O_DIRECT should be considered for most modern workloads, especially if you have a properly configured InnoDB buffer pool.
It should also be noted that if you are using InnoDB file per table you may see overhead using fsync, which makes use of OS level write buffers to help restructure the order of write operations to disk. When InnoDB file per table is in use, these write operations cannot be merged and each file will need to have it’s own fsync process. As such O_DIRECT should be considered when using InnoDB file per table. If you are not using InnoDB file per table, then you may see better performance with fsync.

Definition: The double write buffer is used to ensure that a request to write a page to disk actually makes it to the disk before the associated change records is purged from the InnoDB redo log. More specifically it’s there to protect against partial page writes. Pages are written to the double write buffer first before being written to table space. When InnoDB performs recovery it will check page integrity by verifying the data against the associate checksum at the end of the page. It will do this for both pages in the table and in the double write buffer. If it find a page in the table space that’s inconsistent with its checksum it will recover it from the doublewrite buffer. If it finds a pages in the double write buffer that is inconsistent with its checksum, it discards it. I would recommend reading Peter Zaitsev’s blog post for more details on how this works.
Default: On
Tuning notes: The double write buffer is really small, only large enough to support 100 pages, and is sequential in nature. Despite having to write the page twice, the I/O overhead is not double, but it is notable.
It’s been stated in the past that you can consider disabling the double write buffer if you’re using a file system that supports block level journaling. However, if you take a closer look in the comments section of this blog post by Yves Trudeau, you’ll notice that there have been several individuals that have reported issues with disabling the doublewrite buffer in this case. As such it’s only advisable to disable the double write buffer if you are using a file system that has atomic writes, such as FusionIO or ZFS, or if you are willing to sacrifice data integrity for performance. Otherwise my recommendation is to leave is enabled.
Associated Metrics: Performance schema: wait/synch/mutex/InnoDB/buf_dblwr_mutex

Definition: Innodb has 2 different methods of read ahead and this variable ties directly to the linear read ahead method. This setting dictates how many pages must be read sequentially in one extent (1 meg of page data, typically 64 pages unless you’re using the compression row format) in order for read ahead to occur. When it does occur, it automatically pulls in all of the pages from the next extent. For example, if you stay with the default of 56, then when InnoDB reads data from an extent if 56 pages are pulled in sequentially, then it will automatically pull in the full contents of the next extent.
Default: 56 (pages)
Tuning notes: Read ahead was originally implemented to reduce the amount of seek time that was needed to pull information from disk to the buffer pool. If you have SSDs you may want to consider increasing the value of this variable to make linear read ahead less likely.
Associated Metrics: In SHOW ENGINE INNODB STATUS under BUFFER POOL AND MEMORY section there is Pages read ahead and evicted without access. You will also see the values innodb_buffer_pool_read_ahead and innodb_buffer_pool_read_ahead_evicted in SHOW GLOBAL STATUS. The read ahead shows how many pages have been pulled into the buffer pool via read ahead requests, and the evicted status shows how many of them were evicted from the buffer pool without ever having been accessed.

Definition: This is like innodb_read_ahead_threshold, but this variable is associated with random read ahead. When this is enabled InnoDB will check the buffer pool and if 13 consecutive pages are found in the buffer pool it will make an I/O request for the rest of the pages in the associated extent.
Default: OFF
Tuning notes: Unlike linear read ahead where you can tune the likelihood of read head, random read ahead can only be enabled or disabled. It’s generally considered best to leave this at it’s default value of ‘off’, especially if you have SSDs, but if you have have a lot of read requests that pull data from a table in a way that doesn’t always coincide with the order of the clustered index, you may see improvement from enabling this feature. However you should keep a close eye on the associated metric to make sure you’re benefiting from having this enabled.
Associated Metrics: In the output of SHOW ENGINE INNODB STATUS, in the BUFFER POOL AND MEMORY section you will find random read aheads and evicted without access. Much like linear read aheads, SHOW GLOBAL STATUS has a status to show how many pages have been pulled in from the random read ahead requests, innodb_buffer_pool_read_ahead_rnd. Unfortunately there is only one evicted status, innodb_buffer_pool_read_ahead_evicted, and that leaves you unable to know whether the pages that were evicted came from the linear or random read ahead.

Definition: When dirty pages are flushed in InnoDB, the flushing typically only pertains to the clustered index of the table, which is the main table data for InnoDB. Secondary indexes are typically not written as part of this process unless the associated secondary index page is already in the buffer pool as updates to these values typically leave the secondary index in a highly unsorted order. These modified secondary index pages are saved in the change buffer until the secondary index page is pulled into the buffer pool by other means. This process prevents InnoDB from performing unnecessary expensive I/O operations. This variable denotes whether the change buffer is enabled or disabled, and what types of data changes should be handled by the change buffer.
Default: All (see the reference guide for all available options)
Tuning notes: You can disable this if you have a very read heavy workload with minimal updates and want to preserve space in the buffer pool to allow for more unchanged pages. However I would recommend revising the innodb_change_buffer_max_size instead of disabling change buffering entirely.
Associated Metrics: In the output of SHOW ENGINE INNODB STATUS in the INSERT BUFFER AND ADAPTIVE HASH INDEX section, you will see some basic information pertaining to the current state of the buffer. Please note that in the output the change buffer is referred to as the insert buffer. The reason for this is due to the fact that in previous versions of MySQL, changes to the secondary index were part of the buffer only if they were the result of an insert operation. This is no longer the case. Here are some of the more important metrics to consider.

  • Seg size: The total number of change buffer pages allocated. This includes 1 page for the change buffer header.
  • Free list len: The number of available pages in the change buffer.
  • Ibuf: size: The number of populated pages in the change buffer.
  • Inserts: Number of inserts in the buffer waiting to be merged. It’s important to know that since the change buffer persists to the system tablespace, this is NOT the number of inserts that have gone into the change buffer since the server started and instead notes the total number of changes waiting to be merged.
  • Merged Recs: The number of records that have been merged since the server started
  • Merges: The number of merges that occurred since the server started.
  • Merged operations: The represents records that were merged by change buffering
    • Insert: number of inserts merged
    • Delete mark: number of deleted records merged
    • Delete: number of purge records merged
  • Discarded operations: Number of operations in the change buffer that were discarded
    • Insert: number of insert merged operations discarded
    • Delete mark: number of deleted merged operations discarded
    • Delete: number of purge merged operations discarded

Definition: The change buffer (noted above) uses space in the buffer pool as well as at the system tablespace, which allows the change buffer to persist between MySQL restarts. This value indicates the maximum percent of space that the change buffer can occupy in the buffer pool.
Default: 25 (percent)
Tuning notes: The reference guide notes that this should be increased if you are running a system with high write volume, but remember that the change buffer only houses pending changes for secondary indexes. If the tables you’re writing to do not have secondary indexes then adjusting this variable may not be beneficial. If you have a reporting server where the the utilization is mostly read oriented, you should reduce this setting in order to make more room in the buffer pool for data pages. Also, you can consider reducing the value of this variable if you are using SSDs where the cost of pulling in arbitrary secondary index pages is lesser than what you would have on magnetic storage.
Associated Metrics: See the entry above for ‘innodb_change_buffering’.

innodb_undo_logs (innodb_rollback_segments)
Definition: When InnoDB makes changes to records, it creates an undo record for rollback purposes and to support MVCC. By default there are 128 undo logs in InnoDB stored in the system table space (though you can move these to their own table space), each able to support 1024 data modifying transaction. You can adjust this if you see mutex contention with the undo log.
Default: 128 (undo logs)
Tuning notes: Start with a low number and use a full stack benchmark, slowly increasing the number on each benchmark iteration until no further performance gain is observed. However, you can also use the associated metric to help gauge modification needs.
Associated Metrics: global status: innodb_available_undo_logs. This shows the available number of undo logs. If trending shows this is always over or under capacity, adjust as needed.

Definition: Originally the undo tablespace was part of the system tablespace, however now you have an option of moving the undo logs to their own table space.
Default: 0 (tablespaces) (disabled, undo logs stored in system tablespace)
Tuning notes: If you have fast storage available, you can move the undo logs to that fast storage location using the innodb_undo_directory variable and set this to a value of 1 or greater. It’s recommend that if you are going to have undo logs outside of the system tablespace that you set this variable to 2 or greater to reduce contention on a single file
Associated Metrics: If you have undo tablespaces enabled, you can use the performance schema table file_summary_by_instance and look up read and write information for the files associated with undo tablespace files.

Definition: A purge threads is responsible for removing undo records from memory and disk. The number of rollback segments, undo logs, and records pending to be removed makes up the history list length in SHOW ENGINE INNODB STATUS. Originally, the process of purging was part of the InnoDB master thread, but now has been isolated into its own thread. This variable allows you to specify how many of these purge threads should be operating simultaneously.
Default: 1 (threads)
Tuning notes: If your history list length is growing and you have available I/O capacity, you may want to consider increasing the number of running purge threads. However, this is going to be most effective when you’re working on an instance that has several tables with frequent deletes or updates where purging would be necessary. If you have few tables where purging is necessary and have multiple purge threads running, you may see a decrease in performance due to index lock contention on those tables.
Associated Metrics: Check the history list length in the SHOW ENGINE INNODB STATUS output in the transactions section to see how many items are waiting to be purged. In the same section you can also find information about the purge thread and what trx id it has purges up to in the Purge done for trx’s n:o output.

Definition: When the purge thread begins its task, it starts by looking at the InnoDB redo log and checks the number of entries specified by this variable to determine what needs to be cleaned up via purge operations and ultimately allows changed pages in the buffer pool to be flushed to disk.
Default: 20 (redo log records)
Tuning notes: If you find that you’re adjusting the innodb_purge_threads variable, you may want to consider adjusting this variable as well.
Associated Metrics: See the innodb_purge_threads entry above for metrics associated with checking lag in purge. However you may want to also see the performance schema instrument wait/io/file/InnoDB/innodb_log_file in order to ensure that larger reads from the InnoDB redo log file aren’t causing increased mutex contention, though this is less likely to occur if you’re working with multiple InnoDB redo logs.

Definition: When the purge thread lags behind, it can become costly. So you can set this value to state that once the lag (history list length) hits a certain number, it should delay inserts and updates in order to allow the purge thread to catch up.
Default: 0 (history list length) (disabled)
Tuning notes: Enable this if you need to throttle incoming requests in order to allow the purge thread to catch up. However I would recommend exploring making modifications to innodb_purge_threads first to see if you can increase purge throughput as enabling this feature is more likely to cause problems than it is to fix them. For a more detailed description of what’s going on with the source code and tuning advice, I would recommend reading this blog post by Marco Tusa.

Definition: This is the maximum delay that any DML statement will wait when a purge lag is put in place. See innodb_max_purge_lag for definition of the purge lag.
Default: 0 (milliseconds) (no limit is set, DML statements will wait indefinitely)
Tuning notes: If you have enabled the purge lag, but want to ensure that DML statements do not have to wait indefinitely, then set this to the maximum value. However enabling this will only temporarily delay spike activity. For example. If you have a number of inserts being performed at a rate that exceeds what the purge thread can handle and you have this set to 10000 (10 seconds) it just means all those inserts are going to wait 10 seconds before being put back into the InnoDB kernel queue. In that 10 seconds of time the purge thread may have been able to catch up, but likely will lag again when all of that delayed traffic hits, which can cause a flapping delay issue.
Associated Metrics: Indirectly, the history list length in SHOW ENGINE INNODB STATUS

innodb_read_io_threads / innodb_write_io_threads
Definition: Innodb has a series of background threads that handle background I/O requests and separates these threads by reads and writes. Each thread can handle 256 pending I/O requests. Please note that these are just for background threads and are not used for incoming requests as you can see in the MySQL reference guide here. For reads, this would be something along the line of read ahead requests (see innodb_read_ahead_threshold and innodb_random_read_ahead above), for writes this would apply to flushing, etc.
Default: 4 (threads) (each)
Tuning notes: Previously each read and write thread could handle single I/O requests, however now they each handle 256 pending requests, meaning that increasing this variable has less impact then it would have in previous releases of MySQL, but only if you’re using linux asynchronous I/O (if you are on a linux system with innodb_use_native_aio enabled, which is the default). This is due to the fact that the asynchronous I/O requests move to the OS much faster since they are stored in OS level caches. That said, you may want to considering increasing this when you have a high performance I/O system. This should be tested with a full stack benchmark in order to determine where an increase is no longer effective. Be sure to note that if you are using multiple MySQL instances on a single server, you may need to lower this value in order to ensure that you’re not surpassing OS asynchronous I/O limitations as noted in the MySQL reference guide under the innodb_read_io_threads and innodb_use_native_aio sections.
Associated Metrics: In the output of SHOW ENGINE INNODB STATUS under the FILE I/O sections you will see the status and pending I/O requests for each thread. Increasing the number of threads may introduce contention which would be detectable in the performance_schema.file_summary_by_instance table.

Definition: This is the upper limit in I/O operations that InnoDB will use for background I/O tasks like flushing dirty pages and moving secondary index data from the change buffer into its associated table.
Default: 200 (I/O operations)
Tuning notes: This will need to be increased for a majority of servers where MySQL is being installed. Particularly on systems with fast I/O or SSDs. When configuring, you should check the manufacturer specs for IOPS and use that as a baseline, however I would not recommend setting this to the max value, and instead you should consider 66% the max available IOPS for the drive. The algorithm that dictates what pages should be written will abandon write combining in order to flush more rapidly when it extends beyond innodb_io_capacity, which can undo the performance gains of increasing the InnoDB redo log size. So save the max available IOPS on the drive for innodb_io_capacity_max, but make sure that innodb_io_capacity is large enough to work with the common system load. Also, setting this too high can negate the effects of the change buffer which is documented in this blog post which is found on chriscalender.com.

Definition: When InnoDB needs to more aggressively flush in order to avoid running out of redo log space, it will extend beyond the innodb_io_capacity variable setting, but will never exceed beyond the setting of innodb_io_capacity_max. Arguably this is the more important of the two variables to set because it sets the hard upper limit, and prevents InnoDB from getting to a point where it will reach saturation at physical storage.
Default: 2000 (I/O operations)
Tuning notes: For most systems this should be increased along with the value of the innodb_io_capacity variable. If you don’t specify a value for innodb_io_capacity_max it will default to double the value of innodb_io_capacity, which may not be your intention.

Definition: This determines what InnoDB it going to do when you prompt MySQL to stop.
0 = Slow shutdown. This will perform all the steps in a fast shutdown (variable setting: 1) as well as a full purge of all records that are pending deletion on disk and a full change buffer merge.

1 = Fast shutdown. This will skip the full purge and change buffer merge. But dirty pages are still flushed to disk and tables are checkpointed.

2 = Crash shutdown. Stop the InnoDB engine as if MySQL had just crashed. No pages are flushed to disk, no purge occurs, no change buffer merge occurs. When you start up, InnoDB will have to recover and may experience lost data or may not be able to start up at all without being put into force recovery mode.
Default: 1
Tuning notes: Leave this at 1. Set temporarily to 0 before shutting down MySQL for a version upgrade.

Closing Thoughts

Out of the variables listed above, I would recommend you set the following when configuring a new MySQL instance.

  • innodb_flush_method
  • innodb_io_capacity
  • innodb_io_capacity_max

In the next installment we’ll be looking at variables that will show us more about the mechanics of concurrency and CPU usage for InnoDB.


Interested in working with Peter? Schedule a tech call.

No comments

Leave a Reply

Your email address will not be published. Required fields are marked *