(In the previous post, Part 3, we covered Compaction.)
In this blog post, we continue our series exploring MyRocks mechanics by looking at the configurable server variables and column family options. In the last post, I explained at a high level how data moves from its initial disk-written files into the full data set structure of MyRocks using a process called compaction. In this post, we’re going to look a little closer at two important features that are leveraged as data cascades down through this compaction process: bloom filters and compression.
Before we approach how bloom filters are used in MyRocks, we need to know what a bloom filter is. The short definition is that a bloom filter is a space-efficient data structure used to tell you if an element is present in a set. Make sense? No? No problem! When I read that I didn’t really understand what it was either, at least not to a useful extent. For a better and more complete description of what a bloom filter is I’m going to call for a three-and-a-half minute time-out so you can go to YouTube and check out this awesome video by cube-drone on bloom filters.
Makes a bit more sense now, right? Okay, let’s get back to it.
When a data file is written, there are a number of other metadata components that are written in the file as well and one of those things, if enabled, is a bloom filter. During the process of reading data, the bloom filter can be leveraged to determine if the key that you’re looking for is in a data file. This can be incredibly beneficial considering that, given how compaction works, information about the present state of the value of the key (what’s currently in the record) can be in multiple compaction layers. Speeding up the process of invalidating suspect data files can be crucial to increasing the efficiency of data reads.
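To make that read path a little more concrete, here is a minimal Python sketch of the idea. It is not how RocksDB implements its filters: the class, the SHA-256 based hashing, and the file names are all assumptions made for illustration. It only shows how a per-file filter lets a reader skip files that definitely don’t hold the key.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: k hash probes into a fixed-size bit array."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: bytes):
        # Derive k bit positions from two halves of one SHA-256 digest
        # (double hashing). Real engines use much cheaper hash functions.
        digest = hashlib.sha256(key).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key: bytes) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: bytes) -> bool:
        # False means "definitely absent"; True only means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


def files_to_read(filters: dict, key: bytes) -> list:
    """Return the (hypothetical) data files whose filter cannot rule out the key."""
    return [name for name, bf in filters.items() if bf.might_contain(key)]
```

With one filter per data file, a point lookup only has to open the files returned by files_to_read; every other file is ruled out without touching its data blocks.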
Another interesting thing to consider is that, in the cube-drone video, Curtis explains that removing data from the data file cannot be reflected in the bloom filter: you can’t safely clear an element’s bits in the hash array, because there is a greater than 0 chance that another element hashes to the same positions. This would make bloom filters for engines like InnoDB almost completely useless, as data within the index is being updated all the time by DML statements.
With MyRocks, we already know that this isn’t the case given that files are never updated and are only created via the compaction process. When the new data file is written, it will come with a brand new bloom filter. The whole concept of being able to make better use of bloom filters due to compaction is one key way the MyRocks engine stands out.
There is one serious shortcoming to the bloom filter approach that you should be aware of. Using a bloom filter only works if you are searching for a record based on a const operator, like ‘=’ or ‘IN’. If you’re looking for a single record based on the key value, the bloom filter can be used to tell you if that key is not in the data file. However, if you’re looking for data in a range, looking for data using a bloom filter suddenly doesn’t make any sense as you really don’t want to waste time checking a bloom filter for all possible values in the range.
Keeping this important shortcoming in mind, you will need to determine if bloom filtering will work for you.
Compression is far from a new topic in the tech world. Taking large data and making it smaller via compression is something that we’ve been doing for a very long time so I don’t think we need to talk about how compression works, but we do need to take a close look at how it works with MyRocks and how it can be configured.
You can enable compression for MyRocks, but you will want to consider being more detailed in how you configure it, as you can specify the compression algorithm you want to use, or none at all, for each compaction layer. For your bottom-most compaction layer you’ll likely want to use the highest level of compression. For higher levels like L0 and L1, you really don’t want to use compression at all, as you’ll want to make it as easy as possible for the engine to initially flush data to disk and to do the single-job sorting and deduplication that we noted in a previous post. We’ll go into this in more detail when we look at the variables and column family options that are available to you for compression.
One very important thing to note here is that compression should absolutely not be the only factor taken into consideration when selecting a storage engine. I have actually run into issues with real client production systems where decisions were made in critical systems to use non-standard storage engines simply because they compress better than InnoDB compressed row format.
Do not do this.
I think my previous blog series on the mechanics of InnoDB and this blog series clearly demonstrate that there is a lot more to a storage engine beyond any single feature. You don’t need to know the detailed inner workings of a storage engine in order to select it, but you should at least be familiar at a high level and understand what the advantages and drawbacks are instead of accepting blind adoption.
The reason I’m mentioning this in the compression section is because compression seems to be one of those hot button features that people tend to gravitate toward.
Variables and CF_OPTIONS
With all that said, let’s take a closer look at the mechanics surrounding compression and bloom filters by looking at the variables that control them.
The easiest way to enable compression for your data set is to use the compression column family option. Assuming that no other compression options are set, setting this variable will establish what compression is used for every level of compaction.
MyRocks will let you know what compression methods are available to you in its info log. You can find this file in the MyRocks data directory and by default, it is simply named ‘LOG’. You can use the following command to display what the supported compression methods are.
[root@centos7-1 .rocksdb]# cat ./LOG | grep -A 10 "Compression algorithms supported"
2019/03/01-09:28:38.437724 7ff6cfd44880 Compression algorithms supported:
2019/03/01-09:28:38.437727 7ff6cfd44880 kZSTDNotFinalCompression supported: 1
2019/03/01-09:28:38.439318 7ff6cfd44880 kZSTD supported: 1
2019/03/01-09:28:38.439324 7ff6cfd44880 kXpressCompression supported: 0
2019/03/01-09:28:38.439326 7ff6cfd44880 kLZ4HCCompression supported: 1
2019/03/01-09:28:38.439327 7ff6cfd44880 kLZ4Compression supported: 1
2019/03/01-09:28:38.439329 7ff6cfd44880 kBZip2Compression supported: 0
2019/03/01-09:28:38.439330 7ff6cfd44880 kZlibCompression supported: 1
2019/03/01-09:28:38.439332 7ff6cfd44880 kSnappyCompression supported: 0
2019/03/01-09:28:38.439339 7ff6cfd44880 Fast CRC32 supported: Supported on x86
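If you would rather check this from a script, the same information can be pulled out of the info log with a few lines of Python. This is only a sketch: the function name is made up, and it assumes the log lines keep the “kName supported: 0/1” shape shown in the output above.

```python
import re

# Matches lines such as "... kLZ4Compression supported: 1".
_SUPPORTED = re.compile(r"\b(k\w+)\s+supported:\s*([01])\s*$")


def supported_compression(log_text: str) -> dict:
    """Map each compression name found in the log text to True/False."""
    result = {}
    for line in log_text.splitlines():
        match = _SUPPORTED.search(line)
        if match:
            result[match.group(1)] = match.group(2) == "1"
    return result
```

You would feed it the contents of the LOG file from your own MyRocks data directory, for example supported_compression(open("LOG").read()) when run from inside that directory.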
Default: kSnappyCompression (Snappy) / kNoCompression (No compression used) if Snappy is not available on your system.
Assuming your system supports it, I would recommend using kLZ4Compression (LZ4) as it’s lightweight, fast, and recommended by the RocksDB team. If you are going to use the same compression algorithm across the entire data set, compression and decompression speeds are something you’re going to want to consider.
You can make your compression configuration a little more detailed by using the bottommost_compression column family option to specify a different compression algorithm for the bottom-most compaction layer, which is L6 assuming you are using the default value for the column family option num_levels.
Default: kSnappyCompression (Snappy) / kNoCompression (No compression used) if Snappy is not available on your system.
Your bottom-most compaction layer will contain the vast majority of the data set, so this is where you would want to get optimal compression in order to reduce storage. Assuming your system supports it, I would recommend using kZSTD (ZSTD) as it achieves a higher compression ratio and it’s also recommended by the RocksDB team.
You can make your compression configuration even more detailed by specifying a compression algorithm for each compaction layer. When configuring the compression_per_level column family option, you will need to specify a compression algorithm for each compaction layer, colon delimited. An example of this can be found in Mark Callaghan’s Small Datum blog post on Default options in MyRocks.
I’m inclined to agree with the RocksDB team recommendation to use no compression for compaction layers L0 and L1, a fast compression algorithm for compaction layers L2 and below, and a heavy compression algorithm for the bottom-most compaction layer. If I were configuring this in my.cnf with the default seven levels, that would mean kNoCompression for L0 and L1, kLZ4Compression for L2 through L5, and kZSTD for L6.
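As a rough illustration of that layout, here is a small Python helper that builds the colon-delimited value. The helper names are hypothetical, the default of seven levels and the choice of LZ4 and ZSTD simply follow the recommendations in this post, and the resulting line assumes you are setting the option through rocksdb_default_cf_options.

```python
def compression_per_level(num_levels: int = 7,
                          uncompressed_levels: int = 2,
                          fast: str = "kLZ4Compression",
                          heavy: str = "kZSTD") -> str:
    """Build a colon-delimited list with one algorithm per compaction layer."""
    if num_levels < uncompressed_levels + 1:
        raise ValueError("need at least one level after the uncompressed ones")
    # L0 and L1 (by default) stay uncompressed to keep flushes cheap.
    levels = ["kNoCompression"] * uncompressed_levels
    # The middle layers get the fast algorithm.
    levels += [fast] * (num_levels - uncompressed_levels - 1)
    # The bottom-most layer gets the heavy algorithm.
    levels.append(heavy)
    return ":".join(levels)


def my_cnf_line(num_levels: int = 7) -> str:
    """Format the value as a my.cnf line (assumes no other CF options are set)."""
    return ("rocksdb_default_cf_options=compression_per_level="
            + compression_per_level(num_levels))
```

Generating the value from num_levels rather than typing it by hand makes it harder to end up with a list that no longer matches the number of compaction layers.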
Further, in my opinion, I would rather simply work with the compression_per_level column family option than compression or bottommost_compression. However, you need to be mindful of the fact that if you change the number of compaction layers with the column family option num_levels, then you’ll need to update this setting as well.
CF_OPTION block_based_table_factory: Filter_policy
The filter policy of a column family is what enables or disables bloom filters for the column family. Changing the value for this variable differs from other column family options because it’s part of the block-based table factory group of settings. See below for an example of how this would be configured.
Default: nullptr (disabled)
I would highly encourage you to enable bloom filtering. Much like compression, you can fine-tune bloom filtering to work under certain conditions and limitations, but without enabling the filter policy, you cannot work with bloom filtering at all.
You can enable bloom filtering by including the following in my.cnf.
rocksdb_default_cf_options=block_based_table_factory={filter_policy=bloomfilter:10:false}
The above setting will enable bloom filtering and specify that 10 bits should be stored in the bloom filter blocks for every key that’s hashed. By passing ‘false’ as the last parameter, we’re stating that we want a full filter, which means there will be one bloom filter for every data file. Had we specified ‘true’, it would mean that we want a block-based filter, with one bloom filter for every data block.
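To get a feel for what 10 bits per key buys you, the textbook bloom filter formula can be evaluated in a few lines of Python. Treat this as an approximation under stated assumptions: it models an idealized filter with independent hash functions, the function names are made up, and RocksDB’s actual filter implementations will not match it exactly.

```python
import math
from typing import Optional


def optimal_num_hashes(bits_per_key: float) -> int:
    """Number of hash probes that minimizes false positives: (m/n) * ln 2."""
    return max(1, round(bits_per_key * math.log(2)))


def false_positive_rate(bits_per_key: float,
                        num_hashes: Optional[int] = None) -> float:
    """Idealized false-positive probability: (1 - e^(-k / bits_per_key))^k."""
    k = num_hashes or optimal_num_hashes(bits_per_key)
    return (1.0 - math.exp(-k / bits_per_key)) ** k
```

Under this model, 10 bits per key works out to a false-positive rate of roughly 1%, while halving it to 5 bits per key pushes the rate several times higher, which is why 10 is such a common starting point.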
As previously mentioned, you can configure the size of each bloom filter when you configure the filter policy for your column family using column family option block_based_table_factory: filter_policy. The common recommendation for this is to use 10 bits per key in order to reduce the likelihood of false positives. This means that the closer you get to the bottom-most compaction layer, the more space is used to store filter blocks. By enabling the column family option optimize_filters_for_hits you can disable the creation of bloom filters for the bottom-most layer of compaction.
Eventually, about 85% of your data set will be found in the bottom-most compaction layer while newly added data will be found in compaction layers above it. If you believe that you will mostly be working with the newest data in your data set, then enabling this feature may be helpful, but I would be inclined to leave this feature disabled by default until it’s proven that there will not be a lot of data reads from the bottom-most compaction layer.
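The trade-off can be put into rough numbers with a back-of-the-envelope calculation. The sketch below assumes that keys are spread across layers in the same proportion as the data (85% in the bottom-most layer), that every filter uses the same bits per key, and it ignores block and metadata overhead; the function names are hypothetical.

```python
def filter_bytes(num_keys: int, bits_per_key: int = 10) -> int:
    """Approximate filter size in bytes for a number of keys."""
    return num_keys * bits_per_key // 8


def total_filter_bytes(total_keys: int,
                       bottom_percent: int = 85,
                       bits_per_key: int = 10,
                       skip_bottom: bool = False) -> int:
    """Estimate filter space with or without bottom-most layer filters."""
    bottom_keys = total_keys * bottom_percent // 100
    upper_keys = total_keys - bottom_keys
    total = filter_bytes(upper_keys, bits_per_key)
    if not skip_bottom:
        # skip_bottom=True mimics the effect of optimize_filters_for_hits.
        total += filter_bytes(bottom_keys, bits_per_key)
    return total
```

For one billion keys this estimate comes to about 1.25 GB of filter data with bottom-most filters and about 187.5 MB without them, which is the space you would be trading against slower lookups for keys that don’t exist.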
Once bloom filter data is created, you have options as to whether you would like to leverage caching for access. The first option would be to cache the bloom filter blocks in the block cache using the column family option block_based_table_factory: cache_index_and_filter_blocks (the block cache is also used to store uncompressed data for reads; we’ll cover this in more detail in a later post in this series). If you have a heavy local access pattern this may work well for you, but apart from that, it’s been suggested that this may actually degrade performance. If you attempt to look at a filter block and it’s not in the cache, the filter block needs to be pulled from disk.
The other option would be to not cache this data and read from disk when filter blocks need to be read. This would be limited by your configuration of the variable rocksdb_max_open_files, as you would potentially need to leverage many file descriptors to access this filter data as files are being read.
If you check the RocksDB reference guide you will see that there are pros and cons to caching filter blocks in the block cache, as this can create additional overhead if you are constantly going to the cache to look for blocks that may have been evicted due to LRU. Although this feature is disabled by default in RocksDB (unlike MyRocks, where it’s enabled by default), I would be inclined to leave it enabled unless the associated metrics lead me to believe it may be better to disable it.
One of the things I noted about the caching of filter blocks in the block cache is that old pages may get evicted and then have to be pulled back into memory if they are requested. Part of this overhead can be avoided for the L0 compaction layer specifically by enabling the system variable rocksdb_pin_l0_filter_and_index_blocks_in_cache which will ensure that filter blocks associated with the L0 compaction layer do not get evicted from the block cache.
Default: 1 (ON)
Given that data overlapping is allowed in the L0 compaction layer, I feel like bloom filtering could be a serious advantage when it comes to data reads and having those filter blocks readily available in the block cache could further extend that advantage. Also, considering that L0 should be the smallest compaction layer, the cost of storing the filter blocks would not be anywhere near as great as compaction layers closer to the bottom so the amount of memory needed to store these filter blocks should be nominal.
Bloom filter partitioning
When it comes to caching bloom filters in the block cache, you have to be cognizant of the fact that it’s taking space from data blocks. One bloom filter stored in the cache could be stealing the space of as many as a thousand data blocks. Plus, bloom filters have to be stored in their entirety by default, despite the possibility of only a small part of it being used on a single read, so that can cause a lot of other data to be evicted.
There are two ways to help with this issue, the first of which is partitioning your bloom filters. If this feature is enabled, it will break a bloom filter down into several subfilters, each handling a subset of the range of keys that’s located in the data file. A top-level index noting the range of each subfilter is also created.
When data is read from a data file, it will first load the top-level filter index into memory and use that to determine which subfilter to also pull into memory. That subfilter is then used, much like any other bloom filter, to determine if actual data blocks in the data file should be read.
The hope here is that the top-level filter index and the subfilter would both occupy less space in the block cache, thus causing less data eviction. However, it’s possible that you run into issues where you could see increased IO to repeatedly get more and more subfilters for the data in the associated data file.
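The lookup flow described above can be sketched in Python as follows. This is a simplified model, not RocksDB’s implementation: the class is hypothetical, a frozenset stands in for each subfilter (so unlike a real bloom filter it never gives false positives), and the loaded set just records which subfilters a real engine would have had to pull into memory.

```python
import bisect


class PartitionedFilter:
    """Top-level index over subfilters, each covering a contiguous key range."""

    def __init__(self, sorted_keys: list, keys_per_partition: int) -> None:
        self.last_keys = []    # top-level index: last key covered by each subfilter
        self.partitions = []   # stand-in subfilters, one per key range
        for i in range(0, len(sorted_keys), keys_per_partition):
            chunk = sorted_keys[i:i + keys_per_partition]
            self.last_keys.append(chunk[-1])
            self.partitions.append(frozenset(chunk))
        self.loaded = set()    # subfilters that lookups have "read" so far

    def might_contain(self, key: bytes) -> bool:
        # Step 1: consult the top-level index to pick the one relevant subfilter.
        idx = bisect.bisect_left(self.last_keys, key)
        if idx == len(self.partitions):
            return False  # key sorts after every key in the file
        # Step 2: load only that subfilter and probe it.
        self.loaded.add(idx)
        return key in self.partitions[idx]
```

A lookup only ever touches the top-level index plus one subfilter, which is where the cache savings come from; the flip side is that a workload spread across the whole key range ends up loading many subfilters, one read at a time.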
Enabling this feature requires configuring several variables:
- CF_OPTION block_based_table_factory: index_type
  - Set to IndexType::kTwoLevelIndexSearch
- CF_OPTION block_based_table_factory: filter_policy
  - Set to use a full filter (see CF_OPTION block_based_table_factory: Filter_policy above)
- CF_OPTION block_based_table_factory: partition_filters
  - Set to true
- CF_OPTION block_based_table_factory: metadata_block_size
  - Set to 4096
- CF_OPTION block_based_table_factory: cache_index_and_filter_blocks
  - Can be either true or false
- CF_OPTION block_based_table_factory: pin_top_level_index_and_filter
  - Must be set to true if cache_index_and_filter_blocks is also set to true
- CF_OPTION block_based_table_factory: cache_index_and_filter_blocks_with_high_priority
  - Set to true
- Set to ON
You can read more about filter partitioning in the RocksDB wiki.
This feature is disabled by default, but if your data set is starting to expand to the point where you can no longer hold your entire active data set in memory, you may want to consider enabling this feature.
Bloom filter prefixing
The second option to reduce the footprint of filter blocks in the block cache would be to leverage filter prefixing. Instead of hashing the entire key of each data record, you can specify that you would rather hash just a prefix of the key.
I find this particular feature to be more on the advanced side so I didn’t want to go into too much detail on it in this blog series, but I wanted to make sure that you knew that it exists. You can read more about filter prefixing in the RocksDB wiki.
Here are some of the metrics you should be paying attention to when it comes to bloom filters.
You can find the following information using system status variables:
- Rocksdb_bloom_filter_useful: The number of times that the usage of a bloom filter resulted in the avoidance of a file read since the last MySQL restart.
- Rocksdb_block_cache_filter_add: The number of filter blocks added to the block cache since the last MySQL restart.
- Rocksdb_block_cache_filter_bytes_evict: The number of bytes of filter blocks evicted from the block cache since the last MySQL restart.
- Rocksdb_block_cache_filter_hit: The number of times a filter block was accessed in the block cache without having to go to disk since the last MySQL restart.
- Rocksdb_block_cache_filter_miss: The number of times a filter block was not found in the block cache resulting in a disk read since the last MySQL restart.
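One way to turn these counters into something actionable is to compute a hit ratio for filter blocks in the block cache. The sketch below assumes you have already collected the counters into a dictionary, for example from the output of SHOW GLOBAL STATUS LIKE 'Rocksdb_block_cache_filter%'; the function name is made up for illustration.

```python
from typing import Mapping, Optional


def filter_cache_hit_ratio(status: Mapping[str, str]) -> Optional[float]:
    """Fraction of filter block lookups served from the block cache.

    Returns None when there have been no lookups since the last restart.
    """
    # Status values usually arrive as strings, so convert them first.
    hits = int(status["Rocksdb_block_cache_filter_hit"])
    misses = int(status["Rocksdb_block_cache_filter_miss"])
    total = hits + misses
    return hits / total if total else None
```

A ratio that keeps dropping while Rocksdb_block_cache_filter_bytes_evict climbs would suggest that filter blocks are being pushed out of the cache and re-read from disk, which is the situation that pinning or partitioning is meant to help with.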
In this post, we covered the important details regarding bloom filters and compression. The importance of these features should be clear at this point, as well as how they are implemented. Further, we know that compression alone does not dictate storage engine adoption as we need to be aware of all the advantages and shortcomings.
This concludes the blog posts that focus on writing data into the system. Stay tuned for the next post in the series where we’re going to cover how that data is read.
In case you missed the previous posts, here they are:
Part 1: Data Writing
Part 2: Data Flushing
Part 3: Compaction