As the amount of data companies are interested in collecting grows, life becomes all the more difficult for IT staff at every level of an organization. SAS enterprise storage devices that were once considered giants are now being phased out in favor of SSD arrays with features such as de-duplication, tape storage has largely been abandoned, and database engines face the same pressure to evolve.
For many customers, simply storing data is not enough because of the CAPEX and OPEX involved; smarter ways of storing the same data are required, and databases generally account for the greatest portion of storage requirements across an application stack. Lately they are used not only for storing business data but, in many cases, for storing logs as well. IT managers, developers and system administrators very often turn to the DBA and pose the age-old question, “is there a way we can cut down on the space the database is taking up?”, and this question seems to be asked all the more frequently as time goes by.
This is a dilemma that a MySQL DBA cannot easily solve. What would the best way to resolve this issue be? Should I cut down on binary logging? Hmm… I need the binary logs in case I have to track down the transactions that have been executed and perform point-in-time recovery. Perhaps I should have a look at archiving data to disk and then compressing it with tar and gzip? Heck, if I do that I’ll have to manage and track multiple files and perform countless imports to regenerate the dataset whenever a report is needed from historical data. Maybe, just maybe, I should look into compressing the data files? This seems like a good idea… that way I can keep all my data and just spend a few extra CPU cycles to keep it at a reasonable size – or can I?
Inspired by this age-old dilemma, I decided to take the latest version of TokuDB for a test run and compare it to InnoDB compression, which has been around for a while. Both technologies promise a great reduction in disk usage and even performance benefits: naturally, if data resides on a smaller portion of the disk, access and seek times decrease, although this hardly applies to the SSDs generally used in the industry today. So I put together a test system using an HP ProLiant Gen8 server with 4x Intel® Xeon® E3 processors, 4GB of ECC RAM and a Samsung EVO SATA III SSD rated at 6Gb/s, and installed the latest version of Ubuntu 14.04 to run some benchmarks. I used the standard innodb-heavy configuration from the support-files directory with one change: innodb_file_per_table = ON. The reason for this is that TokuDB will not compress the shared tablespace, which would otherwise have skewed the benchmark results. To be objective, I ran the benchmarks on both MySQL and MariaDB using 5.5.38, the latest version bundled with TokuDB.
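For anyone reproducing the setup, the change amounts to a one-line addition to the server configuration; a sketch of the relevant my.cnf fragment (all other settings come from the bundled innodb-heavy file):

```ini
# my.cnf fragment – the single change made to the stock
# innodb-heavy configuration from support-files/
[mysqld]
# Store each InnoDB table in its own .ibd file so that per-table
# size measurements and compression comparisons are meaningful.
innodb_file_per_table = ON
```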
The databases were benchmarked both for speed and for the space consumed by the tpcc-mysql dataset, generated with 20 warehouses. So let's first have a look at how much space was needed by TokuDB vs. InnoDB (using both compressed and uncompressed tables):
|Table type|Size (GB)|
|---|---|
|InnoDB Compressed Tables|4.2|
|InnoDB Regular Tables|4.8|
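Going by the two InnoDB figures above (4.8 GB uncompressed vs. 4.2 GB compressed), InnoDB compression saved only about 12.5% of disk space on this dataset; a quick sketch of the arithmetic:

```python
# Disk-space saving from InnoDB table compression on the
# 20-warehouse tpcc-mysql dataset (figures from the table above).
uncompressed_gb = 4.8  # InnoDB regular tables
compressed_gb = 4.2    # InnoDB compressed tables

saving_pct = (uncompressed_gb - compressed_gb) / uncompressed_gb * 100
print(f"InnoDB compression saved {saving_pct:.1f}% of disk space")
```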
TokuDB was a clear winner here. Of course, the space savings depend on the type of data stored in the database, but with this dataset TokuDB is clearly in the lead. Seeing such a gain in storage efficiency naturally makes you wonder how much overhead is incurred in reading and writing the data, so let's have a look at the “tpm-C” figures to understand how many orders can be processed per minute on each setup. Here I have also included results for MariaDB vs. MySQL. The first graph shows the number of orders processed per 10-second interval, and the second shows the total “tpm-C” after the tests were run for 120 seconds:
Figure 1 – Orders processed @ 10 sec interval
(Chart series, plotted per 10-second interval: MariaDB 5.5.38, MariaDB 5.5.38 InnoDB Compressed, TokuDB on MariaDB 5.5.38, MySQL 5.5.38, MySQL 5.5.38 InnoDB Compressed, TokuDB on MySQL 5.5.38)
Figure 2 – “tpm-C” for 120 test run
|Configuration|tpm-C|
|---|---|
|TokuDB on MySQL 5.5.38|32669.5|
|TokuDB on MariaDB 5.5.38|30827.5|
|MySQL 5.5.38 InnoDB Compressed Tables|4151|
|MariaDB 5.5.38 InnoDB Compressed Tables|3437|
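For readers unfamiliar with the metric, “tpm-C” is simply the number of New-Order transactions completed per minute. Since tpcc-mysql reports order counts per 10-second interval, the total can be sketched as below (the interval counts here are illustrative, not the benchmark's actual output):

```python
# Convert per-interval order counts into a tpm-C figure.
# The counts below are made up for illustration only.
interval_seconds = 10
orders_per_interval = [5400, 5450, 5430, 5460, 5440, 5420,
                       5445, 5435, 5455, 5425, 5448, 5442]  # a 120 s run

total_orders = sum(orders_per_interval)
total_minutes = len(orders_per_interval) * interval_seconds / 60
tpm_c = total_orders / total_minutes
print(f"tpm-C: {tpm_c:.1f}")
```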
Surprisingly, however, the InnoDB table compression results were very low; perhaps compression would have fared better on traditional rotating SAS / SATA disks. The impact on performance was remarkably high and the savings in disk space were marginal compared to those of TokuDB, so once again it seems we have a clear winner! TokuDB on MySQL outperformed both MySQL and MariaDB with uncompressed tables. The findings are interesting because in previous benchmarks of older versions of MariaDB and MySQL, MariaDB would generally outperform MySQL; however, there are many factors that should be considered.
These tests were performed on Ubuntu 14.04, while the previous tests I mentioned were performed on CentOS 6.5, and the hardware was also slightly different (Corsair SSD 128GB vs. Samsung EVO 256GB). Please keep in mind that these benchmarks reflect performance on a specific configuration, and there are many factors to consider when choosing the MySQL / MariaDB edition to run in production.
In this benchmark, the results for TokuDB were nothing short of impressive, and it will be very interesting to see the results on the newer versions of MySQL (5.6) and MariaDB (10)!
Nikolaos, nice write up. As you describe, TokuDB is both smaller on disk and faster. Also worth noting is that if you look at your results individually the story becomes even more interesting.
TokuDB vs. InnoDB Compressed : TokuDB is much smaller on disk, and dramatically faster (8x to be exact).
TokuDB vs. InnoDB Uncompressed : The performance of both is similar, but the disk savings are huge.
As you mention in your write up, the on-disk size of TokuDB is largely dependent on your actual data. While TPC-C data can be compressed, we usually see much higher compression ratios on real-world data.
I was surprised at the level of compression achieved on the tpcc test data, which is largely numeric; since the greater portion of data in databases is generally strings, real-world data would of course compress at an even higher ratio.
This always needs to be tested against the specific schema to identify the savings and speed benefits that can result.
To answer your final question.. yes, InnoDB compression should work much better in MySQL 5.6. The InnoDB team was able to work with the MySQL team at Facebook, and incorporate dynamic padding to reduce compression failures.
It would be great to see a retest with MySQL 5.6 and how much dynamic padding improves performance by reducing compression failures.
To be fair, the dynamic padding means that your rows are larger by the amount of dynamic padding, which means less data fits in the InnoDB cache, and thus more IO is required. This is an often overlooked trade-off of dynamic padding.
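The cache trade-off described here is easy to quantify with a back-of-the-envelope sketch (the row size, buffer pool size and padding figures below are hypothetical, chosen only to illustrate the effect):

```python
# Illustrative numbers only: how per-row padding overhead shrinks
# the number of rows that fit in a fixed-size InnoDB buffer pool.
page_bytes = 16 * 1024      # default InnoDB page size
buffer_pool_pages = 65536   # e.g. a 1 GB buffer pool
row_bytes = 200             # hypothetical average row size
padding_bytes = 20          # hypothetical padding added per row

rows_unpadded = buffer_pool_pages * (page_bytes // row_bytes)
rows_padded = buffer_pool_pages * (page_bytes // (row_bytes + padding_bytes))
loss_pct = (rows_unpadded - rows_padded) / rows_unpadded * 100
print(f"cached rows drop by {loss_pct:.1f}% with padding")
```

Fewer cached rows means more page reads from disk for the same workload, which is the overlooked IO cost mentioned above.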
I hope you can run the tests, or at least one for Toku & Inno, for a longer time — 12+ hours — to show how performance changes as the tree (b-tree, fractal-tree) is subject to random updates (b-tree fragments, fractal gets old versions of rows to GC).
Yes, ideally these tests should be run for a longer period of time however I’m always somewhat hesitant doing heavier benchmarks on brand new hardware… the EVO was a recent purchase and the MTBF for SSDs leaves much to be desired.
That being said, I’ll look into running this on some rotational SATA disks running at 3Gb/s to see how the IO performance trend lines fare over a longer period of time, as well as to see the performance gains from dynamic padding for MySQL 5.6 vs. MySQL 5.5 (greatly anticipating the official TokuDB 5.6 release).