BTRFS performance compared to LVM+EXT4 with regards to database workloads

Posted in: Technical Track

Introduction

In many database deployments, backups pose a very large problem. Most backup systems require an exclusive table lock and have no support for incremental backups; they require a full backup every time. When database sizes grow to several terabytes, this becomes impractical. The usual solution is to rely on snapshots. In the cloud this is quite easy, since the cloud platform can take snapshots while still guaranteeing a certain level of performance. In the datacenter, few good solutions exist. One frequently used method is LVM on Linux, which performs the snapshot at the block device layer.


LVM snapshots

LVM is a Linux technology that allows for advanced block device manipulation: splitting block devices into many smaller ones, and combining smaller block devices into larger ones through concatenation or striping, including redundant striping commonly referred to as RAID. In addition, it supports a copy-on-write (CoW) feature that enables snapshots: a section of the underlying physical volumes is set aside, and original blocks are copied there before the corresponding blocks on the main logical volume are updated.
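
In a backup workflow, the snapshot provides a stable, point-in-time view of the volume that can be copied at leisure while the origin stays live. A minimal sketch of that flow; the volume group vg0, logical volume lv0, and the mount and backup paths are illustrative assumptions:

# Reserve 10G of CoW space for blocks overwritten on the origin.
lvcreate --size 10G --snapshot --name lv0snap vg0/lv0
# Mount the snapshot read-only and back it up while lv0 stays in service.
mount -o ro /dev/vg0/lv0snap /mnt/backup
tar -czf /backup/db-$(date +%F).tar.gz -C /mnt/backup .
umount /mnt/backup
# Drop the snapshot to stop paying the CoW cost on every origin write.
lvremove -f vg0/lv0snap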

BTRFS

According to the Btrfs kernel wiki: “Btrfs is a modern copy-on-write (CoW) filesystem for Linux aimed at implementing advanced features while also focusing on fault tolerance, repair and easy administration.” It is an inherently CoW filesystem, which means it supports snapshotting at the filesystem level in addition to many more advanced features.
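
The equivalent flow on Btrfs operates on subvolumes rather than block devices. A minimal sketch, where /data and the subvolume names are illustrative assumptions:

# Keep the data in its own subvolume so it can be snapshotted independently.
btrfs subvolume create /data/db
# -r creates a read-only, point-in-time snapshot, suitable as a backup source.
btrfs subvolume snapshot -r /data/db /data/db-snap
tar -czf /backup/db-$(date +%F).tar.gz -C /data/db-snap .
btrfs subvolume delete /data/db-snap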


Experiment 1: a simple benchmark

The hypothesis

Since both LVM snapshots and Btrfs use CoW, it stands to reason that the solution providing the feature at a higher layer, where more information is available for optimization, will be more performant and more flexible than one at a lower layer. Btrfs should therefore perform better than, or at least similarly to, LVM with snapshots, while providing more flexibility and simpler management.

The experiment

The experiment consisted of a custom-written script that allocates a large block of data, pauses to allow a snapshot to be taken, then randomly updates sections of that data. A custom script was chosen because few benchmarks allow pausing between the initialization and testing stages. The EXT4 filesystem on top of LVM was created with the flags -E lazy_itable_init=0,lazy_journal_init=0 so that all metadata was initialized up front; Btrfs was created with the default options. The script is reproduced below:


import multiprocessing
import datetime
import random

EXTENT_SIZE = 4000
EXTENTS = 100000000000 // EXTENT_SIZE  # integer division so range() gets an int
THREADS = 8
FRAGMENT_EXTENTS = 250000

def thread_setup(file):
    # Each pool worker gets its own handles to /dev/urandom and the test file.
    global urandom
    global output
    urandom = open('/dev/urandom', 'rb')
    output = open(file, 'w+b')

def fill_random(args):
    # Linearly fill this worker's slice of the file with random data.
    output.seek(args['start'] * EXTENT_SIZE)
    for i in range(args['size']):
        output.write(urandom.read(EXTENT_SIZE))
    output.flush()

def fill_random_list(extents):
    # Rewrite the given extents in place to trigger copy-on-write.
    for extent in extents:
        output.seek(extent * EXTENT_SIZE)
        output.write(urandom.read(EXTENT_SIZE))
    output.flush()

if __name__ == '__main__':
    # Pass thread_setup as the pool initializer rather than calling it here.
    p = multiprocessing.Pool(THREADS, thread_setup, ('test',))
    args = []
    for i in range(THREADS):
        args.append({'start': int((EXTENTS / THREADS) * i), 'size': int(EXTENTS / THREADS)})
    start = datetime.datetime.now()
    # Fill a test file
    p.map(fill_random, args, chunksize=1)
    end = datetime.datetime.now()
    print(end - start)
    print("File made, please make a snapshot now.")
    input("Press enter when snapshot made.")
    # Randomly fragment X pages
    extents = list(range(EXTENTS))
    random.shuffle(extents)
    extents = extents[:FRAGMENT_EXTENTS]
    start = datetime.datetime.now()
    # Hand each worker a list of extents rather than a single extent.
    chunks = [extents[i::THREADS] for i in range(THREADS)]
    p.map(fill_random_list, chunks)
    end = datetime.datetime.now()
    print(end - start)
    # Finally, a big linear read
    start = datetime.datetime.now()
    with open('test', 'rb') as f:
        for i in range(EXTENTS):
            f.read(EXTENT_SIZE)
    end = datetime.datetime.now()
    print(end - start)


This was tested on a dedicated server to remove as many layers of abstraction as possible that could lead to measurement error. It was also performed on a single spinning disk, meaning the random seeking caused by CoW should be amplified compared to SSDs. In addition, fragmentation was measured with the filefrag utility both before and after the update test.
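
For reference, filefrag reports how many extents back a file. A minimal sketch of the measurement, assuming the filesystem under test is mounted at /mnt and the script's test file lives there:

# Count the extents backing the test file; run once before the update phase
# and once after. Output looks like "/mnt/test: 70576 extents found".
filefrag /mnt/test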


The results

The results are tabulated below:


Value                          LVM              BTRFS            Ratio (LVM / BTRFS)
Initial Creation Time          0:22:09.089155   0:28:43.236595   0.77
Time to Randomly Update        0:03:22.869733   0:01:55.728375   1.75
Linear Read After Update       0:16:46.113980   0:04:54.382375   3.42
Fragmentation before Update    69 extents       100 extents      0.69
Fragmentation after Update     70576 extents    63848 extents    1.11


Btrfs took about 30% longer on the initial create, which is expected: CoW was not yet in play for LVM at that point, so Btrfs carries more overhead. For the core part of the hypothesis, Btrfs was 75% faster than LVM on the update workload, which aligns with the hypothesis. More surprising is that Btrfs was about 3.4 times as fast as LVM on a single-threaded linear read after the update test. This could be explained by Btrfs having more aggressive readahead policies than either EXT4 or LVM. Another surprising find was that after updating, Btrfs had 9.5% fewer fragmented extents than EXT4, which could explain part of LVM's slower read. If fragmentation alone were responsible for the slowdown, then by Amdahl's law an operation on a fragmented extent would have to be on average roughly 3,600% slower than one on an unfragmented extent.

Experiment 2: a real world benchmark

With the success of the previous experiment, a more real-world benchmark was warranted.


The hypothesis

The hypothesis is that the previous findings will hold with a more mainstream benchmarking tool.

The experiment

I chose blogbench as the test platform since it provides a good mix of linear and random writes and reads. I targeted 10 GB of space being used, which equated to 136 iterations. Blogbench 1.1 was used for the benchmark. The following script was used to automate the testing process:


#!/bin/sh
iterations=30

# Do BTRFS
mkfs.btrfs /dev/sdb
mount /dev/sdb /mnt
cd /root/blogbench-1.1/src
./blogbench -d /mnt -i $iterations | tee ~/btrfs.blogbench.initial
btrfs subvolume snapshot /mnt/ /mnt/snapshot
./blogbench -d /mnt -i $iterations | tee ~/btrfs.blogbench.snapshot
umount /mnt
wipefs -a /dev/sdb

# Do LVM
pvcreate /dev/sdb
vgcreate vg0 /dev/sdb
lvcreate -l 75%FREE -n lv0 vg0
mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 /dev/vg0/lv0
mount /dev/vg0/lv0 /mnt
cd /root/blogbench-1.1/src
./blogbench -d /mnt -i $iterations | tee ~/lvm.blogbench.initial
lvcreate -l 100%FREE --snapshot -n lv0snap vg0/lv0
./blogbench -d /mnt -i $iterations | tee ~/lvm.blogbench.snapshot
umount /mnt
lvremove -f /dev/vg0/lv0snap
lvremove -f /dev/vg0/lv0
vgremove /dev/vg0
wipefs -a /dev/sdb


The results

The results are tabulated below:


Value                       LVM      BTRFS    Ratio (LVM / BTRFS)
Initial Read Score          167695   346567   0.48
Initial Write Score         1155     1436     0.80
Post-snapshot Read Score    88398    233204   0.38
Post-snapshot Write Score   848      964      0.88


In this test, Btrfs outperformed LVM in every benchmark (higher scores are better). Btrfs was 107% faster in initial read scores and 24% faster in initial write scores. It was also 164% faster in post-snapshot reads and about 14% faster in post-snapshot writes. This is consistent with the previous experiment and the hypothesis. Another thing to note is that post-snapshot, LVM suffered greatly from locking issues: for several iterations nothing happened at all, as shown in the output below:


# cat lvm.blogbench.snapshot

Frequency = 10 secs
Scratch dir = [/mnt]
Spawning 3 writers...
Spawning 1 rewriters...
Spawning 5 commenters...
Spawning 100 readers...
Benchmarking for 30 iterations.
The test will run during 5 minutes.

 Nb blogs  R articles  W articles  R pictures  W pictures  R comments  W comments
      351      255030       17729      222611       19185      174246         354
      519       38783        8539       32165        8203       20519           0
      521       91712         195       75868         225       52156         486
      524      265205          44      219897          61      147229           0
      524         312           0         257           0         264           0
      524           0           0           0           0           0           0
      524           0           0           0           0           0           0
      524           0           0           0           0           0           0
      524           0           0           0           0           0           0
      524           0           1           0           0           0           0
      524           0           0           0           0           0           0
      524           0          49           0          44           0          61
      542      204263         869      170643        1062      113274        2803
      576      263147        1805      218163        1715      142694        1409
      601      223393        1474      186252        1326      120374           0
      630      229142        1252      191061        1876      122406           0
      658      230185        1437      191368        1241      117970           0
      693      294852        2044      240333        1635      144919         488
      737      330354        2093      272406        2153      174214         805
      778      379635        1635      313989        1963      184188           0
      812      302766        1697      248385        1608      151070           0
      814      385820          97      316903         143      184704           0
      814      275654           0      228639           0      132450           0
      814      412152           0      340600           0      195353           0
      814      276715           0      227402           0      131327           0
      842      230882        1243      191560        1226      113133        1314
      848      274873         209      226790         296      126418         257
      848      355217           0      291825           0      168253           0
      848      237893           0      196491           0      110130           0
      848      396703           0      323357           0      179002           0

Final score for writes:           848
Final score for reads :         88398


Reasoning and final notes

Both of these benchmarks show that Btrfs outperforms LVM in the presence of snapshots. The reason is actually fairly intuitive and comes down to how each system implements CoW. In LVM, CoW is achieved by first copying the old block from the main logical volume to the snapshot logical volume, then updating the main logical volume. That is one read and two writes to update a single block! Btrfs does this better: because it never overwrites data in place, an update requires only a single write of the new block, and new blocks can be allocated largely sequentially. This explains why the initial create times in Experiment 1 were comparable, as the overhead there was not CoW but data checksumming and other features, and it explains why Btrfs was so much faster than LVM once CoW was active. Using a CoW system when one isn't necessary leads to severe performance degradation, especially in database workloads. But if you will be implementing CoW anyway, it stands to reason to use a CoW system that operates at the filesystem layer or higher. An example of CoW above the filesystem layer would be a database engine that uses CoW internally to create snapshots: a sort of named, persistent transaction that can be referenced.
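
The write amplification on the LVM side is easy to observe directly: the snapshot's CoW area fills up as the origin is overwritten. A rough demonstration, reusing the vg0/lv0 layout from the script above (the snapshot size and write volume here are illustrative):

# Every overwrite of the origin first copies the old block into the snapshot
# area, so the snapshot's Data% climbs as the origin is rewritten in place.
lvcreate -L 1G --snapshot -n lv0snap vg0/lv0
lvs vg0     # lv0snap Data% starts near 0
dd if=/dev/urandom of=/mnt/test bs=4K count=100000 oflag=direct conv=notrunc
lvs vg0     # Data% has grown by roughly the amount of data overwritten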


Further work

I would like to perform a similar benchmark on MySQL. Initial work was done on this, but due to time limitations I could not complete a benchmark using SysBench and MySQL. It would be interesting to see the results from a real database, which has traditionally seen terrible performance on top of a CoW filesystem.
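
For anyone picking this up, a plausible starting point is sysbench's OLTP workload against a MySQL data directory placed on each filesystem, with the snapshot taken between the prepare and run phases. This is an untested sketch; the connection credentials, paths, and table sizing are placeholders:

# Load a dataset, snapshot the filesystem under the datadir, then measure.
sysbench oltp_read_write --mysql-user=sbtest --mysql-password=sbtest \
    --mysql-db=sbtest --tables=16 --table-size=1000000 prepare
btrfs subvolume snapshot /var/lib/mysql /var/lib/mysql-snap   # or lvcreate --snapshot
sysbench oltp_read_write --mysql-user=sbtest --mysql-password=sbtest \
    --mysql-db=sbtest --tables=16 --table-size=1000000 --time=300 run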
