Exposing InnoDB Internals via System Variables: Part 2, I/O (structure and logs)



This is part two of a five part blog series to explore InnoDB internals by looking at the related tunable system variables. In part one we covered variables that had the greatest impact on memory, and in this installment we will cover the variables that have the greatest impact on the file structure of InnoDB, as well as how data is written to logs, which can have a large impact on transaction commit overhead. Originally I was considering putting all of the I/O-related InnoDB variables in one post; however, that's a lot of ground to cover. Instead you can look forward to an upcoming installment of this blog series that will cover I/O as it relates to table data.

Before we begin, I would like to emphasize something that was written in part one of this blog post series.

“I should note that while tuning recommendations are provided, the objective of this blog post series was NOT meant to be a tuning primer, but instead to explore the mechanics that each variable interacts with. As such I would like to advise against reading this guide and trying to fine tune all of the available InnoDB variables. System variable tuning is an exercise in diminishing returns; the most benefit you’ll get out of tuning your MySQL server will occur within the first 15 minutes of configuration. In order to help prevent excess tuning I will provide a list of variables out of each section that should be set when configuring a new system.”


Let’s see what we can learn from taking a closer look at I/O structure and log related variables!

innodb_file_per_table
Definition: There are two options for how to store your InnoDB data files: you can store them all in the system tablespace (ibdata1), or you can store each table in its own file. It should be noted that if you have file per table enabled there will be two files per table: an .frm file that holds the definition and metadata for the table, and an .ibd file that stores the table's contents.
Default: 1 (on)
Tuning notes: There are certain performance considerations for tuning this variable. You can speed up operations of statements like 'truncate table' and help relieve file descriptor contention by spreading out data over multiple files, but the real advantage to using file per table is easing maintenance of the database. You can easily move tablespaces and reclaim table space where you otherwise couldn't before when everything was stored in the system tablespace. I would advise leaving this option enabled.
Associated Metrics: Performance schema: wait/io/file/innodb/innodb_data_file. Use tables file_summary_by_event_name and file_summary_by_instance in order to determine the amount of I/O wait on the system tablespace prior to enabling file per table and then again after making the modification and converting your tables to measure wait time reduction.
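As a sketch, the before/after measurement described above could look like this (assuming performance_schema and its file I/O instrumentation are enabled):

```sql
-- Aggregate I/O wait (in picoseconds) attributed to InnoDB data files.
SELECT event_name, count_star, sum_timer_wait
  FROM performance_schema.file_summary_by_event_name
 WHERE event_name = 'wait/io/file/innodb/innodb_data_file';

-- Per-file breakdown, so the system tablespace (ibdata1) can be
-- compared against individual .ibd files after conversion.
SELECT file_name, count_star, sum_timer_wait
  FROM performance_schema.file_summary_by_instance
 WHERE event_name = 'wait/io/file/innodb/innodb_data_file'
 ORDER BY sum_timer_wait DESC;
```

Run the queries before enabling file per table, again after converting your tables, and compare the wait totals.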

innodb_file_format
Definition: Currently the two file formats for InnoDB are 'antelope' and 'barracuda'. Barracuda was a new file format that was made available for MySQL 5.1 as part of the InnoDB plugin, and was shipped as part of MySQL 5.5. The Barracuda file format supports the existing 'compact' and 'redundant' row formats, and adds the new row formats 'dynamic' and 'compressed'.
Default: antelope (format)
Tuning notes: There are two main reasons for switching to barracuda:
  • You want to enable compression for your data in order to reduce I/O overhead at the cost of increased CPU usage. If that's the case, you'll want to use barracuda with the compressed row format.
  • Your table has off-page data such as BLOB or TEXT columns that are too large to fit within the page, and the whole of that data is typically accessed whenever it is read. In this case you may want to use the dynamic row format.
If you're using the antelope file format, both of its row formats (compact and redundant) will store the first 768 bytes of the blob/text data in the page and then establish a pointer that refers to where the rest of the data can be found. This is good if you're only going to typically look at or search the contents of the first 768 bytes, but if that's not the case then you're wasting page space that could be used for non-blob/text data. The dynamic row format will only store a 20-byte pointer to the off-page data, so use this if you are going to access your off-page data infrequently, or need to access more than 768 bytes of it more often than not.
I would recommend switching this to barracuda unless you have concerns about backwards compatibility with earlier MySQL versions as you will still be able to use the redundant and compact row formats found in the antelope file format even with barracuda enabled.
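A minimal sketch of enabling Barracuda and using the two new row formats; the table and column names here are illustrative only:

```sql
-- Barracuda is required for the compressed and dynamic row formats,
-- and both also require innodb_file_per_table.
SET GLOBAL innodb_file_format = 'Barracuda';
SET GLOBAL innodb_file_per_table = 1;

-- Compressed row format: trades CPU for reduced I/O.
CREATE TABLE compressed_example (
  id INT UNSIGNED NOT NULL PRIMARY KEY,
  payload TEXT
) ENGINE=InnoDB ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;

-- Dynamic row format: stores only a 20-byte pointer to off-page data.
CREATE TABLE dynamic_example (
  id INT UNSIGNED NOT NULL PRIMARY KEY,
  payload TEXT
) ENGINE=InnoDB ROW_FORMAT=DYNAMIC;
```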

innodb_open_files
Definition: In order to have a really good understanding of what is and isn't being configured here, I would recommend you watch this video on file descriptors and read this page on how MySQL opens and closes tables. For most storage engines, an 'open table' is not so much the opening of a file as the establishment of a file descriptor that's used to parse the file containing the table contents. A good way to think of it: a file descriptor is to a file what the MySQL socket is to the MySQL server when you access it from the command line. It's an abstraction that allows you to access a set (stationary or stream) of data.
This variable, since it relates specifically to InnoDB, is not quite like table_open_cache, which limits the number of file descriptors that can be used by MySQL. As you'll see in this blog post by Peter Zaitsev, keeping an InnoDB table open does not require maintaining an open file descriptor; an InnoDB table being open is merely a logical state that the table can be in. This variable simply limits how many tables can be in that open state. Please note that while the aforementioned blog post by Peter Zaitsev is 7 years old, not much about this concept has changed since then.
Default: 300 if innodb_file_per_table is disabled; otherwise the greater of 300 and the value of table_open_cache (number of tables allowed to be in the open state)
Tuning notes: Going back to the blog post by Peter Zaitsev, there may not be much need to tune this variable assuming you’re working with a local file system. You may need to experiment when working with something external.
Associated Metrics: global status: Innodb_num_open_files
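A quick sketch of comparing the current count of open InnoDB files against the configured ceiling:

```sql
-- Number of files InnoDB currently holds open.
SHOW GLOBAL STATUS LIKE 'Innodb_num_open_files';

-- The configured limit on tables in the open state.
SHOW GLOBAL VARIABLES LIKE 'innodb_open_files';
```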

innodb_autoextend_increment
Definition: This sets the amount by which the InnoDB system tablespace (ibdata1) should expand if it runs out of space. This only applies to the system tablespace; it has no impact on the growth of table files created by InnoDB file per table, nor on other explicitly created tablespaces, which each expand by 4MB.
Default: 64 (mb)
Tuning notes: If you are not using InnoDB file per table and expect rapid growth of your system tablespace, consider increasing this value. Every time the tablespace needs to be expanded, a mutex is held while the file grows, which can be disruptive to ongoing storage engine activity. However, don't set this variable to such a high value that you won't get any warning about disk space capacity. For example, if you have a 10G volume and you set this value to 2000, the file will expand by 2G every time, and if your disk monitor is only set to alert at 85% capacity, it's likely that you won't get an alert until after the file has expanded and consumed all the disk space.
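As a sketch, this is a dynamic variable and can be adjusted at runtime; the value is in megabytes:

```sql
-- Expand the system tablespace 256MB at a time instead of the
-- default 64MB, reducing how often the expansion mutex is taken.
SET GLOBAL innodb_autoextend_increment = 256;
```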

innodb_page_size
Definition: This sets the page size for data in InnoDB. It is set once when the MySQL instance is created and cannot be changed afterwards.
Default: 16k
Tuning notes: When you are setting up a MySQL instance you can reduce the InnoDB page size to that of the page size of the disk itself in order to prevent unchanged data being re-written. This is common for SSDs as they typically use smaller blocks. However, this reduces the maximum row width of your InnoDB records and cannot be changed after the instance has been created; you will have to use a logical backup and restore in order to adjust the page size at a later date. I would not recommend changing this from the default unless you have been VERY thorough in your investigation of the data set and are 100% sure that no row width will need to exceed the specified page size.
One other thing to consider is block alignment, which is covered in detail in this blog post by Aurimas Mikalauskas. You'll want to ensure that your page size is not greater than the disk stripe, which will ensure that unneeded excess pages are not pulled from disk. It should be noted that you will have to adjust your disks appropriately, as specified in the blog post.
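A sketch of what this looks like in my.cnf; note that the setting is only honored when initializing a new data directory, and that innodb_page_size only became configurable in MySQL 5.6:

```ini
[mysqld]
# Match the page size to the device block/stripe size (e.g. a 4K SSD).
# Only takes effect when the instance is first created; changing it
# later requires a logical backup and restore.
innodb_page_size = 4k
```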

innodb_checksum_algorithm / innodb_checksums (deprecated)
Definition: When InnoDB writes a page to disk, it calculates a checksum of the page data and stores that checksum at the end of the page. When the page is retrieved from disk, InnoDB checks the checksum at the end of the page in order to ensure that the data in the page matches what was originally written. This is to help detect issues with storage like bit rot. This is also the basis of the tool innochecksum, which recalculates the checksum for each page and checks it against the stored value.
This variable allows you to specify the checksum algorithm that's used, or specify that no algorithm should be used at all. The variable 'innodb_checksums' at one time was used to specify whether checksumming would be used, but has been deprecated; its functionality is now covered by the 'innodb_checksum_algorithm' variable.
Default: innodb (proprietary checksumming method specific to MySQL)
Note: for a short period of time this default was crc32, but it went back to the proprietary InnoDB checksum method in order to ensure that downgrades would still be possible.
Tuning notes: While you have the option of disabling checksumming, and while that would improve performance, I would not recommend it as you want to ensure you’re not reading corrupted data.
If you’re SURE that you will not have to downgrade your version of mysql, switch it to crc32. It’s a faster algorithm and it scans blocks 32 bits at a time, which is a lot faster than the InnoDB method which can only scan blocks 8 bits at a time. Again, if you switch to crc32, you will in no way be able to downgrade to a version of MySQL that does not support crc32 without having to do a logical backup and restore.
You’ll also notice that there are ‘strict’ versions of each option. Let’s say you start a new instance with the default value of innodb. If you switch the checksumming to crc32 you’re fine as it’s going to be able to work with whatever checksum was used when it’s reading the page from disk, however when it writes it’s going to write the crc32 checksum. If you use the strict_crc32 option and it encounters a page with the InnoDB proprietary checksum during a read operation, it will result in an error. Strict mode checksumming is faster than non-strict methods, but should only be used on new MySQL installations.
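A sketch of switching the algorithm at runtime (it is a dynamic variable) and verifying the change:

```sql
-- crc32 is faster, but pages written with it cannot be read by
-- MySQL versions that predate crc32 checksum support.
SET GLOBAL innodb_checksum_algorithm = 'crc32';

-- Confirm the active algorithm.
SHOW GLOBAL VARIABLES LIKE 'innodb_checksum_algorithm';
```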

innodb_log_file_size
Definition: The size of each redo log file in the log group. The redo log stores information about changes that have been made to InnoDB data but have not yet been copied from the buffer pool to the tables on disk. Pages in the buffer pool that are modified but not yet flushed to disk are considered 'dirty'. These dirty pages can remain in the buffer pool so long as the associated change is stored in the redo log. By using redo logs, changes do not have to be flushed to the table on disk immediately upon transaction commit, which increases performance while allowing InnoDB to remain ACID compliant.
Default: 50331648 (48 megs)
Tuning notes: The more redo log space you have, the less aggressively InnoDB has to flush. Make sure that the total redo log size (log file size * log files in group) is large enough to accommodate at least one hour of traffic. In the SHOW ENGINE INNODB STATUS output, in the 'log' section, you'll find the log sequence number. Collect that information, then check it again in one hour. The difference between the two collected numbers is the amount of data, in bytes, that has been written to the InnoDB redo log over that period of time.
Keep in mind that increasing the log size will also increase the time it takes to recover after a server crash.
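The measurement described above can be sketched as follows; as an alternative to diffing the log sequence number, the Innodb_os_log_written status counter tracks cumulative bytes written to the redo log and is easier to script:

```sql
-- Option 1: note "Log sequence number" in the LOG section.
SHOW ENGINE INNODB STATUS\G

-- Option 2: cumulative bytes written to the redo log.
SHOW GLOBAL STATUS LIKE 'Innodb_os_log_written';

-- Take the same reading one hour later; the difference is redo bytes
-- written per hour. Size the logs so that
--   innodb_log_file_size * innodb_log_files_in_group >= that amount.
```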

innodb_log_files_in_group
Definition: This denotes how many redo log files will be present in the log group. This, in combination with innodb_log_file_size (innodb_log_files_in_group * innodb_log_file_size), dictates how large the redo log is.
Default: 2 (files)
Tuning notes: This variable was originally added to allow for file systems that were not capable of handling large files. In the case where a large redo log was necessary, you could span it over multiple smaller files instead of a small number of larger files. The only benefit you could stand to gain from having multiple smaller files would be to decrease the possibility of contention between the flushing process that writes to the redo log vs the purge processes that remove data from the redo log, but both of these processes work sequentially so if everything is working with high efficiency it’s likely that these processes will be working with the same file. It’s generally considered best practice to leave this at the default value of 2.

innodb_log_buffer_size
Definition: This is the buffer that's used for collecting transaction information that will be written to the redo log. Data is collected here as a transaction runs, and is then written to the redo log as dictated by the value of the innodb_flush_log_at_trx_commit variable. If you have a large transaction that writes more changes than the log buffer can hold, the buffer will have to be flushed to the redo log, potentially multiple times, before the transaction commits, so that it does not exceed its capacity.
Default: 8388608 (8MB)
Tuning notes: Make sure this is large enough so that large transactions don't have to flush multiple times before the commit. It's a good idea to make sure that the amount of data in the log buffer stays at 30% or less of the log buffer size. Historical monitoring would be best to capture this data and look for spikes; however, if you don't have this option you can fall back to using the associated metrics noted below.
Associated Metrics: In SHOW ENGINE INNODB STATUS, in the LOG section, there are two relevant metrics: Log sequence number and Log flushed up to. The difference between these two numbers (in bytes) is the amount of data currently in the log buffer. Innodb_log_waits from SHOW GLOBAL STATUS keeps count of the number of times the log buffer space was too small and had to be flushed before continuing. If this keeps increasing, it is a good idea to increase the log buffer size.
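A sketch of the checks described above:

```sql
-- Rough gauge of unflushed log data: compare "Log sequence number"
-- and "Log flushed up to" in the LOG section of the output.
SHOW ENGINE INNODB STATUS\G

-- Times the log buffer was too small and had to be flushed before
-- continuing; a steadily rising value suggests increasing
-- innodb_log_buffer_size.
SHOW GLOBAL STATUS LIKE 'Innodb_log_waits';
```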

innodb_flush_log_at_trx_commit
Definition: This determines the method that InnoDB will use to write/flush recently changed data to the redo log, and allows you to trade away some of the ACID durability of redo logging in order to gain performance. There are three values to pick from.

1 = (lowest performance, ACID compliant) When a transaction commits, the contents of the log buffer are written and flushed to the redo log file on disk.

2 = (better performance, not ACID compliant) The contents of the log buffer are written to the redo log file after every transaction commit, but the file is only flushed to disk on a time interval, in seconds, specified by the innodb_flush_log_at_timeout variable.

0 = (best performance, not ACID compliant) The contents of the log buffer are written to the redo log on a time interval, in seconds, specified by the innodb_flush_log_at_timeout variable, then the write is flushed to disk.

Up until recently, if you used a value other than 1, the rate of log flush to disk was every 1 second, but the rate of flushing is now configurable using the variable innodb_flush_log_at_timeout. Should you choose to set this variable to anything other than 1, a crash of MySQL can cause you to lose N seconds of transactions, where N is the value of innodb_flush_log_at_timeout.
Default: 1
Tuning notes: I would only recommend tuning this variable away from its default if you are able to live without full ACID compliance. I've seen values of 0 and 2 used to move past unexpected DML spikes, to overcome temporary replication lag, and even on OLAP systems where data can be reloaded from OLTP if it's lost. However, I would advise keeping this value at the default value of 1 for all production OLTP systems. It should be noted that setting this to 2 will only cause data loss in the case of an unexpected shutdown of the host; if only MySQL crashes, there will be no data loss.
Associated Metrics: Performance schema: wait/io/file/innodb/innodb_log_file
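A sketch of relaxing durability temporarily (for example, during a bulk load) and then restoring the default; this is a dynamic variable:

```sql
-- Relax redo durability: write at commit, but flush roughly once per
-- innodb_flush_log_at_timeout seconds. An OS/host crash can lose up
-- to that many seconds of committed transactions.
SET GLOBAL innodb_flush_log_at_trx_commit = 2;

-- ... run the bulk load ...

-- Restore full ACID durability.
SET GLOBAL innodb_flush_log_at_trx_commit = 1;
```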

innodb_flush_log_at_timeout
Definition: This is the number of seconds that InnoDB will wait to flush changes occurring in the redo log to disk if innodb_flush_log_at_trx_commit is set to a value other than 1.
Default: 1 (second)
Tuning notes: If you have set innodb_flush_log_at_trx_commit to a value other than 1, you can increase performance further by increasing this value. However, in the case of a MySQL crash, this is the number of seconds of InnoDB transactions that will potentially be lost.
If you have innodb_flush_log_at_trx_commit set to 0, then you need to make sure innodb_log_buffer_size is large enough to accommodate the number of transactions it needs to hold. Remember, log buffer data will get written to the redo log when the log buffer is nearing capacity. For example, if you have innodb_flush_log_at_trx_commit set to 0 and have this value set to 5, the log buffer should only get written to the redo log once every 5 seconds; however, if the log buffer is only large enough to hold 1 second's worth of transactions, then it's still going to write to the redo log once a second in order to make room for future transaction data.
Associated Metrics: See innodb_log_buffer_size noted above. Also see Pending flushes (fsync) log under File I/O section of InnoDB status.

sync_binlog
Definition: When InnoDB completes a transaction, the binary log is updated as part of the commit process. The DML portion (inserts, updates, deletes) of the transaction is placed in a session-level memory buffer called the transaction cache until a commit occurs. At that point all of the transaction data is formed into a group commit with other committing transactions and is written to the binary log before the commit is considered complete. This variable specifies how many transaction group commits can be processed before fdatasync() is called and the pending binary log entries are flushed to disk. Despite not technically being an InnoDB variable, it does have an impact on the InnoDB commit process, and as such I figured it best that it be included in this post.
Default: 0 (commit groups) (Off)
Tuning notes: The only way to ensure that your binary log will stay completely in sync with your InnoDB transactions is to set the value of this variable to 1. This contradicts the server system variables page of the MySQL reference guide, which specifies that you can lose up to N transaction group commits where N is the value of this variable; however, the binary log section of the reference guide confirms that the binary log write is part of the transaction commit process for InnoDB. So if you have this variable set to 1, then when InnoDB commits, the binary log will be updated, the changes will be flushed to disk, and then the commit will complete. So the actual number of potential transaction group commits that could be lost is N - 1, where N is the value of this variable.
There is one important thing to note: there is a small chance that MySQL could crash between the phase where the binary log entry is flushed to disk and the InnoDB transaction is committed. If that's the case you will have an entry in the binary log for a transaction that never completed in InnoDB. When MySQL is started and InnoDB recovers, it will roll back the transaction, but may not roll back the binary log entry. You can avoid this scenario by ensuring that innodb_support_xa is enabled, which it is by default. That way, when InnoDB recovers and rolls back the transaction, the entry in the binary log will be removed as well. So if you have this variable set to 1, make sure you have innodb_support_xa enabled as well.
From a performance perspective, the higher you set this variable, the less frequently you will flush to disk. So performance goes up, but the risk of transaction loss in the binary log also goes up. 0 is considered to be the most performant setting, but the potential number of transactions lost in the binary logs would be impossible to determine in the event of a crash without reading the binary logs manually.
Associated Metrics: Performance schema: wait/io/file/sql/binlog
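A sketch of the fully durable binary log configuration described above:

```sql
-- Flush the binary log to disk at every transaction group commit.
SET GLOBAL sync_binlog = 1;

-- innodb_support_xa (enabled by default) keeps the binary log and
-- InnoDB consistent through crash recovery when sync_binlog = 1.
SHOW GLOBAL VARIABLES LIKE 'innodb_support_xa';
```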

Closing Thoughts

Out of the variables listed above, I would recommend you set the following when configuring a new MySQL instance.

  • innodb_file_per_table (ensure it’s at the default value of 1, in my opinion)
  • innodb_file_format
  • innodb_checksum_algorithm
  • innodb_log_file_size
  • sync_binlog

Stay tuned for the next installment of this blog series where we will cover the variables that are related to I/O, specifically in relation to table data.



About the Author

Internal Principal Consultant
Peter Sylvester is one of the Internal Principal Consultants in the Open Source Database Consulting Group at Pythian. He has been with Pythian since January of 2015 and has been working with MySQL since 2008. Apart from work, and in keeping with the stereotypical Canadian lifestyle, Peter is an avid ice hockey player, typically playing no less than twice a week!
