Since documentation is not very clear to me on the topic of InnoDB flushing in combination with Linux IO (specifically the write system call), I decided to put together this article in hopes of shedding some light on the matter.
How Linux does I/O
By default, the write() system call returns after all data has been copied from the user space file descriptor into the kernel space buffers. There is no guarantee that data has actually reached the physical storage.
The fsync() call is our friend here. This will block and return only after the data and metadata (e.g. file size, last update time) is completely transferred to the actual physical storage.
There is also fdatasync() which only guarantees the data portion will be transferred, so it should be faster.
There are a few options that we can specify at file open time, that modify the behaviour of write():
In this case, the write() system call will still write data to kernel space buffers, but it will block until the data is actually transferred from the kernel space buffers to the physical storage. There is no need to call fsync() after.
This completely bypasses any kernel space buffers, but requires that the writes are the same size as the underlying storage block size (usually 512 bytes or 4k). By itself, it does not guarantee that the data is completely transferred to the device when the call returns.
O_SYNC + O_DIRECT
As stated above, we would need to use both options together guarantee true synchronous IO.
Relation with InnoDB flushing
Innodb_flush_method parameter controls which options will be used by MySQL when writing to files:
At the time of this writing, we have the following options:
This is the default value, and is equivalent to specifying fsync option.
Both data and redo log files will be opened without any special options, and fsync() will be used when the db needs to make sure the data is transferred to the underlying storage.
This one is confusing, as O_DSYNC us actually replaced with O_SYNC within the source code before calling open(). It is mentioned this is due to some problems on certain Unix versions. So O_SYNC will be used to open the log files, and no special options for the datafiles. This means fsync() needs to be used to flush the data files only.
Data files are opened with O_DIRECT. Log files are opened with no extra options. Some filesystems (e.g. XFS) do not guarantee metadata without the fsync() call, so it is still used as safety measure.
InnoDB uses O_DIRECT during flushing I/O, but skips the fsync() system call afterwards. This can provide some performance benefits if you are using a filesystem that does not require the fsync() to sync metadata.
I am deliberately not mentioning the experimental options littlesync and nosync.
There is also an extra option in Percona Server:
It uses O_DIRECT to open the log files and data files and uses fsync() to flush both the data and the log files.
Which InnoDB flushing method should I use?
The general consensus if you have a battery backed write cache or fast IO subsystem (e.g. SSD’s) is to use the O_DIRECT method. However it is a better practice to run tests to determine which method provides a better performance for each particular environment.
One downside of using O_DIRECT is that it requires the innodb-buffer-pool-size to be configured correctly. For example, if you accidentally leave your buffer pool size at the default value of 128M, but have 16G of RAM, the buffer pool contents will at least sit in the filesystem cache. This will not be true if you have O_DIRECT enabled (I would like to thank Morgan Tocker for his contribution regarding this section of the post).