Since documentation is not very clear to me on the topic of InnoDB flushing in combination with Linux IO (specifically the write system call), I decided to put together this article in hopes of shedding some light on the matter.
How Linux does I/O
By default, the write() system call returns after all data has been copied from the user space file descriptor into the kernel space buffers. There is no guarantee that data has actually reached the physical storage.
The fsync() call is our friend here. This will block and return only after the data and metadata (e.g. file size, last update time) is completely transferred to the actual physical storage.
There is also fdatasync() which only guarantees the data portion will be transferred, so it should be faster.
There are a few options that we can specify at file open time, that modify the behaviour of write():
O_SYNC
In this case, the write() system call will still write data to kernel space buffers, but it will block until the data is actually transferred from the kernel space buffers to the physical storage. There is no need to call fsync() after.
O_DIRECT
This completely bypasses any kernel space buffers, but requires that the writes are the same size as the underlying storage block size (usually 512 bytes or 4k). By itself, it does not guarantee that the data is completely transferred to the device when the call returns.
O_SYNC + O_DIRECT
As stated above, we would need to use both options together guarantee true synchronous IO.
Relation with InnoDB flushing
Innodb_flush_method parameter controls which options will be used by MySQL when writing to files:
At the time of this writing, we have the following options:
NULL
This is the default value, and is equivalent to specifying fsync option.
fsync
Both data and redo log files will be opened without any special options, and fsync() will be used when the db needs to make sure the data is transferred to the underlying storage.
O_DSYNC
This one is confusing, as O_DSYNC us actually replaced with O_SYNC within the source code before calling open(). It is mentioned this is due to some problems on certain Unix versions. So O_SYNC will be used to open the log files, and no special options for the datafiles. This means fsync() needs to be used to flush the data files only.
O_DIRECT
Data files are opened with O_DIRECT. Log files are opened with no extra options. Some filesystems (e.g. XFS) do not guarantee metadata without the fsync() call, so it is still used as safety measure.
O_DIRECT_NO_FSYNC
InnoDB uses O_DIRECT during flushing I/O, but skips the fsync() system call afterwards. This can provide some performance benefits if you are using a filesystem that does not require the fsync() to sync metadata.
I am deliberately not mentioning the experimental options littlesync and nosync.
There is also an extra option in Percona Server:
ALL_O_DIRECT
It uses O_DIRECT to open the log files and data files and uses fsync() to flush both the data and the log files.
Which InnoDB flushing method should I use?
The general consensus if you have a battery backed write cache or fast IO subsystem (e.g. SSD’s) is to use the O_DIRECT method. However it is a better practice to run tests to determine which method provides a better performance for each particular environment.
One downside of using O_DIRECT is that it requires the innodb-buffer-pool-size to be configured correctly. For example, if you accidentally leave your buffer pool size at the default value of 128M, but have 16G of RAM, the buffer pool contents will at least sit in the filesystem cache. This will not be true if you have O_DIRECT enabled (I would like to thank Morgan Tocker for his contribution regarding this section of the post).
3 Comments. Leave new
Hi Ivan, Thank you for writing about this. I would like to clarify your summary on “Which InnoDB flushing method should I use?”:
– I would say that it is not just for battery backed write caches, but anyone who has fast IO systems (i.e. SSDs count too).
– But the downside of turning on O_DIRECT is that it requires the innodb-buffer-pool-size to also be configured correctly. What I mean by this, is that if you accidentally leave your buffer pool size at 128M, but have 16G of RAM, the buffer pool contents will at least sit in the filesystem cache. This will not be true if you have O_DIRECT enabled.
One idea that we are floating is to have InnoDB configured for fast-io systems by default. You can see the proposal here (‘InnoDB IO Configuration’): https://mysqlserverteam.com/planning-the-defaults-for-mysql-5-8/
Hi Morgan, thanks for the feedback! I am including your input in the summary as well.
There’s another important detail which isn’t clear in the above description: Not using O_DIRECT causes the kernel to cache the written pages… just in case they get read again in the near future. So let’s say you’re writing a 1 GB file without O_DIRECT and only have e.g. 200 MB in kernel memory for file page caching. In this example after using write() without O_DIRECT then 20% of the file will end up not only on disk but also cached in memory. If this caching is not what you want then O_DIRECT is the only way to go AFAIK. To find out how much of your file on disk is cached then use e.g. this utility [1].
[1] https://hoytech.com/vmtouch/