How to set up flashback for MongoDB testing

Posted in: MongoDB, Technical Track

 

After you’ve upgraded your database to a new version, it’s common that the performance degrades in some cases. To prevent this from happening, we could capture the production database operations and replay them in the testing environment which has the new version installed.

Flashback is a MongoDB benchmark framework that allows developers to gauge database performance by benchmarking queries. Flashback records the real traffic to the database and replays operations with different strategies. The framework is comprised of a set of scripts that fall into 2 categories:

  1. Records the operations(ops) that occur during a stretch of time
  2. Replays the recorded ops

Installation

The framework was tested on Ubuntu 10.04.4 LTS

Prerequisites

-go 1.4

-git 2.3.7

-python 2.6.5

-pymongo 2.7.1

-libpcap0.8 and libpcap0.8-dev

 

  1. Download Parse/Flashback source code

# go get github.com/ParsePlatform/flashback/cmd/flashback

  1. Manually modify the following file to workaround a mongodb-tools compatibility issue

In pass_util.go file:

func GetPass() string {
–    return string(gopass.GetPasswd())
+    if data, errData := gopass.GetPasswd(); errData != nil {
+        return “
+    } else {
+        return string(data)
+    }

 

  1. Compile the go lang part of the tool

# go build -i ./src/github.com/ParsePlatform/flashback/cmd/flashback

 

Configuration

Suppose you have to two shards, Shard a and Shard b. Each shard has 3 nodes. In each shard a, primary is a1. In shard b, primary is b2.

1. copy sample config file for editing

# cp ./src/github.com/ParsePlatform/flashback/record/config.py.example  config.py

2. Change config for testing

DB_CONFIG = {

# Indicates which database(s) to record.

“target_databases”: [“test”],

# Indicates which collections to record. If user wants to capture all the

# collections’ activities, leave this field to be `None` (but we’ll always

# skip collection `system.profile`, even if it has been explicit

# specified).

“target_collections”: [“testrecord”],

 

“oplog_servers”: [

{ “mongodb_uri”: “mongodb://mongodb.a2.com:27018” },

{ “mongodb_uri”: “mongodb://mongodb.b1.com:27018” }

 

],

 

# In most cases you will record from the profile DB on the primary

# If you are also sending queries to secondaries, you may want to specify

# a list of secondary servers in addition to the primary

“profiler_servers”: [

{ “mongodb_uri”: “mongodb://mongodb.a1.com:27018” },

{ “mongodb_uri”: “mongodb://mongodb.b2:27018” }

],

 

“oplog_output_file”: “./testrecord_oplog_output”,

“output_file”: “./testrecord_output”,

 

# If overwrite_output_file is True, the same output file will be

# overwritten is False in between consecutive calls of the recorer. If

# it’s False, the recorder will append a unique number to the end of the

# output_file if the original one already exists.

“overwrite_output_file”: True,

 

# The length for the recording

“duration_secs”: 3600

}

 

APP_CONFIG = {

“logging_level”: logging.DEBUG

}

 

duration_secs indicates the length for the recording. For production capture, should set it at least to 10-12 hrs.

Make sure has write permission to the output dir

Recording

  1. Set all primary servers profiling level to 2

db.setProfilingLevel(2)

2. Start operations recording

./src/github.com/ParsePlatform/flashback/record/record.py

3. The script starts multiple threads to pull the profiling results and oplog entries for collections and databases that we are interested in. Each thread works independently. After fetching the entries, it will merge the results from all sources to get a full picture of all operations as one output file.

4. You can run the record.py from any server as long as the server has flashback installed  and can connect to all mongod servers.

5. As a side note, running mongod in replica set mode is necessary (even when there is only one node), in order to generate and access the oplogs

 

Replay

  1. Run flashback. Style can be “real” or ”stress”

        Real: replay ops in accordance to their original timestamps, which allows us to imitate regular traffic.

        Stress: will preload the ops to the memory and replay them as fast as possible. This potentially limits the number of  ops played back per session to the             available memory on the Replay host.

For sharded collections, point the tool to a mongos. You could also point to a single shard primary for non-sharded collections.

./flashback -ops_filename=”./testrecord_output” -style=”real” -url=”localhost:27018″ -workers=10

Observations

  • Several pymongo (python’s MongoDB driver) arguments in the code are deprecated causing installation and running errors.
  • Need to define a faster restore method (ie. LVM snapshots) to rollback the test environment after each replay.
  • Need to capture execution times for each query included in the test set to be able to detect excecution plan changes.
  • In a sharded cluster, record can be executed from a single server with access to all primaries and/or secondaries.
  • Pulling oplogs from secondaries is recommended if we are looking to reduce load on the primaries.
  • Memory available would dramatically affect operation’s merge process after recording
  • Memory available would also affect replay times (see Tests summary)

Tests summary

 

Record test scenario 1

 

Record server: mongos server (8G RAM)

Time : about 2 hours to finish the recording

Details: Ran record while inserting and updating 1000 documents

 

Record test scenario 2

 

Record server: shard a primary node a1 (80G RAM)

Time: about 2 minutes to finish the recording

Details: Ran record while inserting and updating 1000 documents

Record test scenario 3

 

Record server: shard a primary node a1 (80G RAM)

Time: it took about 20 minutes to finish the recording

Details: Ran record while inserting and updating 100,000 documents

Replay test scenario 1

Replay server: mongos server (8G RAM)

Time: it took about 1 hour to finish the replay

Details: replayed 1000 operations in “real” style

 

Replay test scenario 2

Replay server: shard a primary node a1 (80G RAM)

Time: about 5 minutes to finish the replay

Details: replayed 1000 operations in “real” style

Replay test scenario 3

Replay server: mongos server (8G RAM)

Time: failed due to insufficient memory

Details: replayed 1000 operations in “stress” style

 

Replay test scenario 4

Replay server: shard a primary node a1 (80G RAM)

Time: about 1minute to finish the replay

Details: replayed 1000 operations in “stress” style

 

Replay test scenario 5

Replay server: shard a primary node a1 (80G RAM)

Time: about 20 minutes to finish the replay

Details: replayed 50,000 operations in “stress” style

email
Want to talk with an expert? Schedule a call with our team to get the conversation started.

No comments

Leave a Reply

Your email address will not be published. Required fields are marked *