(In the previous post, Part 5, we covered Data Reads.)
In this blog post, we continue our series of exploring MyRocks mechanics by looking at the configurable server variables and column family options. In our last post, I explained at a high level how reads occur in MyRocks, concluding the arc of covering how data moves into and out of MyRocks. In this post, we’re going to explore replication with MyRocks, more specifically read-free replication.
Some of you may already be familiar with the concepts of read-free replication as it was a key feature of the TokuDB engine, which leveraged fractal tree indexing. TokuDB was similar to MyRocks in the sense that it had a pseudo log-based storage mechanism, but unlike MyRocks, that change log eventually flushed to a data page similar to what you would see in InnoDB. The log structure of the engine means that we achieve additional efficiency by simply forwarding data change requests into the log structure without having to pull data into memory and then let compaction sort out the update later on.
This feature can be of great benefit to you when it comes to keeping replication lag to a minimum, but you have to be absolutely sure that no writes are occurring on the replica as these features require confidence that the replica’s data will be consistent with the master.
There is also another key feature of replication optimization that appears to be included in the base MyRocks project from Facebook, but does not appear to be included in the Percona implementation at this point and that’s skipping unique index lookups. Controlled by the variable rocksdb_skip_unique_check_tables, this would allow you to take read-free replication a step further and skip the check for uniqueness as part of the process that replays data changes as part of replication. Just like the main component of read-free replication, this is highly dependent on the replica staying in sync with the master. We’ll need to keep an eye out for future releases of Percona server that might include this option.
Variables and CF_OPTIONS
Let’s have a closer look at these features and the variables that apply to them.
Changes to the data record originate on the master and are then logged locally in the binary log. Data in the binary log is requested by the replica, which copies binary log data from the master into its own relay log and then replays the events locally.
Binlog-format allows you to control the format of the binary log, allowing you to specify statement, row, or mixed based binary logging formats. For a full description of all of these binary logging formats you can check this page in the MySQL community reference. In order to use read-free replication in MyRocks, you must use the row-based binary logging format.
The default for this variable is dependent on the MySQL fork / version, but as of 5.7 the default is row. MyRocks became available in Percona Server starting with 5.7, so it’s unlikely you would have to change this variable.
As noted above, you need to use row-based binary logging if you want to take advantage of read-free replication. But regardless of whether or not you plan to use read-free replication, you still want to avoid using statement-based binary logging given the fact that it doesn’t guarantee that data on the replicas won’t fall out of sync with the master. You can read more about this on this page of the MySQL community reference.
MyRocks recognizes the dangers of using statement-based binary logging for replication and as such it leverages the variable rocksdb_unsafe_for_binlog to ensure that statement-based binary logging is not used.
The default value ensures that statement based binary logging cannot be used. I would recommend leaving this variable at its default value.
Row-based binary logs will provide you the before and after images of the data change for the row in question. Depending on how binary logging is configured, this information can contain just the columns that changed, or all the columns in the row. This configuration is controlled by the variable binlog_row_image. For a full list of all row image options that are available you can see this page of the MySQL community reference.
In order to use read-free replication in MyRocks, you must use full binlog row images.
Read_only & Super_read_only
Read-free replication works under the assumption that data on the replicas is the same as the data on the master within the transaction boundaries of the data change that is being applied via replication. This is commonly referred to as “cluster level data consistency”. When working with read-free replication it’s imperative that we maintain cluster level consistency, as failure to do so can result in index level corruption as documented on this page in the MyRocks wiki.
One of the easiest ways to ensure cluster level data consistency is to ensure that no writing occurs directly on your replicas. Up until version 5.7, the only way to protect your replica from direct data writes was to use the read_only system server variables which, if enabled, signaled an error to the user in the case that they attempted to make data changes on the instance.
The problem with this feature is that it did not prevent data changes or signal any kind of error if the user in question had super level privileges. I’ve seen many cases where super has been granted to application accounts which wrote to replicas with the read_only variable enabled due to improper traffic routing. As soon as this happens, you’ve lost cluster level data consistency.
With 5.7 a new variable called super_read_only was made available which works basically the same as read_only, but it applies to all users regardless of what privileges they have. This also has other handy features like not allowing replication changes via the change master to command. This assures that data changes will come only from replication and that users cannot accidentally change the replication status, offering the greatest level of assurance of cluster level data consistency.
Default: Off (for both variables)
Regardless of whether you are using MyRocks or any other storage engine in MySQL, I would highly advise that you enable both of these variables on your replicas so long as you are not using an architecture that is dependent on replica writes.
Once the binary logs have been collected from the master and placed in the relay logs of the replica they will be replayed by the replication sql_thread. As this occurs, checks will be made against the list of tables provided in the variable rocksdb_read_free_rpl_tables to see if read-free replication should occur or not. If the table is in the list, there will be no row lookup as part of the process of replaying the row change.
The following instruments can be found in the performance schema:
There isn’t a lot of documentation when it comes to setup instruments in performance schema, but in this case, I would make an educated guess that read_free_rpl_tables is for the locking associated with maintaining of metadata for read-free replication tables. Skip_unique_check_tables would fall out of scope for now given that skipping unique checks is not currently available as noted in the first section of this blog post.
In this post, we covered the replication specific mechanics of MyRocks and the basics of how to work with read-free replication. Stay tuned for the next and final post where we’re going to cover the use cases where you’ll want to consider including MyRocks as part of your data platform as well as some final high-level considerations when implementing.
In case you missed the previous posts, here they are:
Part 1: Data Writing
Part 2: Data Flushing
Part 3: Compaction
Part 4: Compression and Bloom Filters
Part 5: Data Reads