After you upgrade your database to a new version, performance can degrade for some workloads. To catch such regressions before they reach production, you can capture the operations running against the production database and replay them in a testing environment that has the new version installed.
Flashback is a MongoDB benchmark framework that allows developers to gauge database performance by benchmarking queries. Flashback records the real traffic to the database and replays the operations with different strategies. The framework consists of a set of scripts that fall into two categories:
- Scripts that record the operations (ops) that occur during a stretch of time
- Scripts that replay the recorded ops
Installation
The framework was tested on Ubuntu 10.04.4 LTS
Prerequisites
- go 1.4
- git 2.3.7
- python 2.6.5
- pymongo 2.7.1
- libpcap0.8 and libpcap0.8-dev
- Download Parse/Flashback source code
# go get github.com/ParsePlatform/flashback/cmd/flashback
- Manually modify the following file to work around a mongodb-tools compatibility issue
In the pass_util.go file:
func GetPass() string {
- return string(gopass.GetPasswd())
+ if data, errData := gopass.GetPasswd(); errData != nil {
+     return ""
+ } else {
+     return string(data)
+ }
}
- Compile the Go part of the tool
# go build -i ./src/github.com/ParsePlatform/flashback/cmd/flashback
Configuration
Suppose you have two shards, shard a and shard b, each with 3 nodes. In shard a the primary is a1; in shard b the primary is b2.
1. Copy the sample config file for editing
# cp ./src/github.com/ParsePlatform/flashback/record/config.py.example config.py
2. Change the config for your test
DB_CONFIG = {
    # Indicates which database(s) to record.
    "target_databases": ["test"],
    # Indicates which collections to record. If you want to capture all the
    # collections' activities, leave this field as `None` (but we'll always
    # skip the collection `system.profile`, even if it has been explicitly
    # specified).
    "target_collections": ["testrecord"],
    "oplog_servers": [
        { "mongodb_uri": "mongodb://mongodb.a2.com:27018" },
        { "mongodb_uri": "mongodb://mongodb.b1.com:27018" }
    ],
    # In most cases you will record from the profile DB on the primary.
    # If you are also sending queries to secondaries, you may want to specify
    # a list of secondary servers in addition to the primary.
    "profiler_servers": [
        { "mongodb_uri": "mongodb://mongodb.a1.com:27018" },
        { "mongodb_uri": "mongodb://mongodb.b2.com:27018" }
    ],
    "oplog_output_file": "./testrecord_oplog_output",
    "output_file": "./testrecord_output",
    # If overwrite_output_file is True, the same output file will be
    # overwritten in between consecutive calls of the recorder. If
    # it's False, the recorder will append a unique number to the end of the
    # output_file if the original one already exists.
    "overwrite_output_file": True,
    # The length of the recording, in seconds
    "duration_secs": 3600
}
APP_CONFIG = {
    "logging_level": logging.DEBUG
}
duration_secs indicates the length of the recording in seconds. For a production capture, set it to at least 10-12 hours (36000-43200 seconds).
Make sure the user running the recorder has write permission to the output directory.
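A quick Python sanity check for this, assuming the output path from the sample config above (adjust the path for your setup):
import os

# Resolve the directory that will hold the recorder's output files
out_dir = os.path.dirname(os.path.abspath("./testrecord_output"))
if not os.access(out_dir, os.W_OK):
    raise SystemExit("no write permission on %s" % out_dir)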
Recording
1. Set the profiling level to 2 on all primary servers (a pymongo sketch follows this list)
db.setProfilingLevel(2)
2. Start recording operations
# ./src/github.com/ParsePlatform/flashback/record/record.py
3. The script starts multiple threads to pull the profiling results and oplog entries for the collections and databases of interest. Each thread works independently. After fetching the entries, the script merges the results from all sources into a single output file, giving a full picture of all operations.
4. You can run record.py from any server, as long as that server has flashback installed and can connect to all mongod servers.
5. As a side note, mongod must run in replica set mode (even when there is only one node) in order to generate and expose the oplog; the sketch below includes a quick check.
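A minimal pymongo sketch covering steps 1 and 5 above. The hostnames come from the example topology and the database name matches target_databases in config.py; adjust both for your cluster:
from pymongo import MongoClient

# Primaries that must run the profiler (step 1)
primaries = ["mongodb://mongodb.a1.com:27018", "mongodb://mongodb.b2.com:27018"]

for uri in primaries:
    client = MongoClient(uri)
    client["test"].set_profiling_level(2)  # 2 = profile all operations

# Verify an oplog server is a replica set member and has oplog entries (step 5)
oplog_client = MongoClient("mongodb://mongodb.a2.com:27018")
oplog_client.admin.command("replSetGetStatus")  # raises OperationFailure if not in replica set mode
print(oplog_client["local"]["oplog.rs"].count())  # entries available for recording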
Replay
- Run flashback. The style can be "real" or "stress"
Real: replays ops in accordance with their original timestamps, which allows us to imitate regular traffic.
Stress: preloads the ops into memory and replays them as fast as possible. This potentially limits the number of ops played back per session to the memory available on the replay host.
For sharded collections, point the tool at a mongos. For non-sharded collections, you can also point it at a single shard's primary.
# ./flashback -ops_filename="./testrecord_output" -style="real" -url="localhost:27018" -workers=10
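For a stress-style run against shard a's primary, the same flags apply (hostname taken from the example topology):
# ./flashback -ops_filename="./testrecord_output" -style="stress" -url="mongodb.a1.com:27018" -workers=10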
Observations
- Several pymongo (Python's MongoDB driver) arguments used in the code are deprecated, causing installation and runtime errors.
- A faster restore method (e.g. LVM snapshots) is needed to roll back the test environment after each replay.
- Execution times need to be captured for each query in the test set in order to detect execution plan changes (a pymongo sketch follows this list).
- In a sharded cluster, record can be executed from a single server with access to all primaries and/or secondaries.
- Pulling oplogs from secondaries is recommended if you are looking to reduce load on the primaries.
- Available memory dramatically affects the operation merge process after recording.
- Available memory also affects replay times (see the tests summary below).
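As a starting point for capturing those execution times, a pymongo sketch that reads per-operation timings from the profiler collection (database and collection names match the example config; system.profile must still contain the capture window):
from pymongo import MongoClient

client = MongoClient("mongodb://mongodb.a1.com:27018")
profile = client["test"]["system.profile"]

# Each profiler document records the operation type, namespace, timestamp,
# and execution time in milliseconds ("millis")
for op in profile.find({"ns": "test.testrecord"}).sort("ts", 1):
    print("%s %s %sms" % (op["ts"], op["op"], op.get("millis")))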
Tests summary
Record test scenario 1
Record server: mongos server (8G RAM)
Time: about 2 hours to finish the recording
Details: Ran record while inserting and updating 1000 documents
Record test scenario 2
Record server: shard a primary node a1 (80G RAM)
Time: about 2 minutes to finish the recording
Details: Ran record while inserting and updating 1000 documents
Record test scenario 3
Record server: shard a primary node a1 (80G RAM)
Time: about 20 minutes to finish the recording
Details: Ran record while inserting and updating 100,000 documents
Replay test scenario 1
Replay server: mongos server (8G RAM)
Time: about 1 hour to finish the replay
Details: replayed 1000 operations in “real” style
Replay test scenario 2
Replay server: shard a primary node a1 (80G RAM)
Time: about 5 minutes to finish the replay
Details: replayed 1000 operations in “real” style
Replay test scenario 3
Replay server: mongos server (8G RAM)
Time: failed due to insufficient memory
Details: replayed 1000 operations in “stress” style
Replay test scenario 4
Replay server: shard a primary node a1 (80G RAM)
Time: about 1 minute to finish the replay
Details: replayed 1000 operations in “stress” style
Replay test scenario 5
Replay server: shard a primary node a1 (80G RAM)
Time: about 20 minutes to finish the replay
Details: replayed 50,000 operations in “stress” style