Wednesday, November 4, 2015

MongoDB 3.2 vs Linkbench

I used LinkbenchX to compare performance and efficiency for MongoDB 3.2.0rc0 vs 3.0.7 with the RocksDB and WiredTiger engines. The Linkbench test has two phases: load and query. The test was run in three configurations: cached with data on disk, too big to cache with data on disk and too big to cache with data on SSD. My summary:

Performance:
  • load rates are similar for disk and SSD with RocksDB and WiredTiger
  • load rate for WiredTiger is ~2X better in 3.2 versus 3.0
  • load rate for WiredTiger is more than 2X better than RocksDB
  • query rate for WiredTiger is ~1.3X better than RocksDB for cached database
  • query rate for RocksDB is ~1.5X better than WiredTiger for not cached database
Efficiency:
  • disk space used is ~1.33X higher for WiredTiger vs RocksDB
  • disk bytes written per document during the load is ~5X higher for RocksDB
  • disk bytes written per query is ~3.5X higher for WiredTiger
  • RocksDB uses ~1.8X more CPU during the load
  • WiredTiger uses ~1.4X more CPU during the query phase
  • with a 32G block cache mongod RSS is ~42G with WiredTiger vs ~34G with RocksDB

LinkbenchX


LinkbenchX is a fork of Linkbench. Source for Linkbench and LinkbenchX is on GitHub. LinkbenchX adds support for MongoDB and an option to sustain a fixed request arrival rate. For this test I use the MongoDB support but not the fixed request arrival rate, by running bin/linkbench from Linkbench. I am grateful to Percona for porting Linkbench to MongoDB.
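For reference, this is roughly how the two phases are driven; the -c, -l and -r flags are documented by Linkbench, but the exact harness is not shown here, so treat the paths and run counts as assumptions:
# Minimal sketch of driving the two phases with bin/linkbench; the -c, -l and
# -r flags are from the Linkbench docs, paths and run counts are assumptions.
import subprocess

CONFIG = "config/LinkConfigMongoDBv2.properties"

# Load phase: populate the node, link and count collections.
subprocess.check_call(["bin/linkbench", "-c", CONFIG, "-l"])

# Query phase: a sequence of 1-hour runs (maxtime=3600 in the properties file).
# 24 runs for the disk configuration, fewer for the others.
for _ in range(24):
    subprocess.check_call(["bin/linkbench", "-c", CONFIG, "-r"])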

The Linkbench workload requires transactions to update the count collection when a document is added to or removed from the link collection. MongoDB doesn't support per-shard transactions, so the Linkbench results will be incorrect (the count collection can drift out of sync with the link collection). I understand that cross-shard transactions are hard, but per-shard transactions and per-shard consistent read are valuable for Linkbench and for selling to the enterprise. I hope they arrive in MongoDB 3.4.
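To make the consistency problem concrete, this is a sketch in Python with PyMongo of the two writes that must be applied together when a link is added; the collection and field names are illustrative, not the exact LinkbenchX schema:
# Sketch of the link-insert path that needs per-shard transactions. The
# collection and field names are illustrative, not the LinkbenchX schema.
from pymongo import MongoClient

db = MongoClient("localhost", 27017).linkdb

def add_link(id1, id2, link_type):
    # Write 1: insert the edge into the link collection.
    db.link.insert_one({"id1": id1, "id2": id2, "link_type": link_type})
    # Write 2: bump the per-node edge count in the count collection. If the
    # client dies between these two writes, count and link disagree, which is
    # why per-shard transactions (and consistent read) matter for Linkbench.
    db.count.update_one({"id": id1, "link_type": link_type},
                        {"$inc": {"count": 1}},
                        upsert=True)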

Linkbench is run in two phases: load and query. I configured Linkbench to use 12 threads for both phases. The query phase was done as a sequence of 1-hour tests to measure whether performance and efficiency changed over time.

For the cached database test I set the value of maxid1 in the Linkbench file config/FBWorkload.properties to 50,000,001 and the database block cache to 32G. The database was cached by the OS but not by WiredTiger or RocksDB in this setup, as the compressed database was ~50G.

For the not cached database test I set the value of maxid1 to 250,000,001, set the database block cache to 8G and started a background process that used mlock to leave at most 40G of RAM for mongod and the OS page cache. The database was at least 150G and the workload did a lot of storage IO.
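This is a rough sketch of such a background process, assuming a 144G server and a 40G target; it is not the helper I used, and it needs root or a large memlock ulimit:
# Rough sketch (not the helper used here): pin ~104G of RAM with mlock so that
# mongod plus the OS page cache are left with at most ~40G on a 144G server.
# Requires root or a large "ulimit -l"; the size is an assumption (144G - 40G).
import ctypes
import ctypes.util
import mmap
import time

GB = 1 << 30
LOCK_BYTES = 104 * GB

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)

# Anonymous private mapping; mlock() faults the pages in and keeps them resident.
buf = mmap.mmap(-1, LOCK_BYTES, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))
if libc.mlock(ctypes.c_void_p(addr), ctypes.c_size_t(LOCK_BYTES)) != 0:
    raise OSError(ctypes.get_errno(), "mlock failed")

# Sleep forever so the memory stays locked for the duration of the benchmark.
while True:
    time.sleep(3600)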

For all tests I changed the Linkbench file config/LinkConfigMongoDBv2.properties to use transaction_support_level=1, requesters=12, loaders=12, maxtime=3600, requestrate=0 and requests=100000000000. Fsync is not done on commit.

The test server has 24 cores with hyperthreading enabled, 144G of RAM and either one 400G SSD (Intel DC s3700) or 6 SAS disks with HW RAID 0. The server uses Fedora release 19, Linux 3.14.27-100, gcc 4.8.3 and MongoDB was linked with tcmalloc.

Snappy compression was used for both RocksDB and WiredTiger. For all tests this was in the mongod configuration file:
processManagement:
  fork: true
systemLog:
  destination: file
  logAppend: true
storage:
  syncPeriodSecs: 600
  journal:
    enabled: true
storage.wiredTiger.collectionConfig.blockCompressor: snappy
storage.wiredTiger.engineConfig.journalCompressor: none
For the not cached database this was in the mongod configuration file:
replication.oplogSizeMB: 4000
storage.wiredTiger.engineConfig.cacheSizeGB: 8
storage.rocksdb.cacheSizeGB: 8
For the cached database this was in the mongod configuration file:
replication.oplogSizeMB: 8000
storage.wiredTiger.engineConfig.cacheSizeGB: 32
storage.rocksdb.cacheSizeGB: 32

Legend

This is the legend for the data in the following sections. The disk and CPU metrics are collected from iostat and vmstat. Most rates below are normalized by operations/second, where an operation is an insert during the load phase and a query during the query phase. The real insert rate across all collections is reported for ops below, but I use the number of inserts to the node collection (50M or 250M) when normalizing the iostat and vmstat rates. A sketch of the normalization follows the list.

  • ops - average rate for operations/second (inserts or queries per second)
  • db.gb - database size in GB (two numbers are from du without and with --apparent-size)
  • r/o - disk reads per operation
  • wKB/o - disk KB written per operation
  • cs/o - context switches per operation
  • cpu/o - CPU/operation from vmstat us+sy divided by ops multiplied by 1M 
  • rss - RSS from ps for the mongod process
  • setup - wt (wiredtiger), rx (rocksdb), 307 (mongo 3.0.7), 320 (mongo 3.2.0rc0), op0/op1 - oplog off/on
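This is a minimal sketch of the normalization; the real scripts are not shown in this post, so the function and variable names are assumptions:
# Sketch of how the per-operation rates are derived from iostat and vmstat.
def per_op(rate_per_sec, normalizing_ops_per_sec):
    # For the load phase the normalizing rate is inserts/second to the node
    # collection (50M or 250M rows over the load time); for the query phase
    # it is queries/second.
    return rate_per_sec / normalizing_ops_per_sec

def cpu_per_op(vmstat_us_plus_sy, normalizing_ops_per_sec):
    # cpu/o from the legend: vmstat us+sy divided by ops, multiplied by 1M.
    return vmstat_us_plus_sy / normalizing_ops_per_sec * 1000000

# Example with made-up numbers: 100 MB/s written at 40,000 operations/second.
print(per_op(100 * 1024, 40000))      # wKB/o ~ 2.56
print(cpu_per_op(50.0, 40000))        # cpu/o = 1250.0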

Cached database

For this test there were 50M, ~220M and X docs in the node, link and count collections after the load phase. In addition to the conclusions listed at the top of this post, WiredTiger 3.2 uses less CPU per insert than 3.0, although it has more context switches per insert. It is good to see it become more efficient, and the insert rate for WiredTiger has almost doubled from 3.0 to 3.2.

The context switch rate per insert is much larger for RocksDB because of the global mutex that serializes inserts into the memtable. There are no disk reads during this test because the database fits in the OS page cache. The CPU rate per insert is also much higher for RocksDB during the load. That might be a side effect of the mutex contention.

The difference in database sizes for WiredTiger vs RocksDB is small after the load but grows quickly during the run phases. I did not try to debug it, but the growth for WiredTiger could be a problem. WiredTiger also uses much more memory than RocksDB, but I don't know whether that is a fixed overhead (~8G) or a relative overhead (~30% of the block cache size).

Using the oplog roughly doubles the wKB/o rate because writes are done twice -- once to the oplog and once to the database. The internal write-amplification reported by RocksDB for rx.320.op1 is 6.1.
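A quick check of that claim against the wKB/o column in the load table below: the ratio with vs without the oplog is ~1.7X for WiredTiger and ~2.2X for RocksDB, perhaps more than 2X for RocksDB because the oplog entries are also rewritten by compaction.
# Ratio of wKB/o with the oplog on vs off, using the load table below.
wkb_per_insert = {
    "wt.320.op0": 1.355, "wt.320.op1": 2.316,
    "wt.307.op0": 1.358, "wt.307.op1": 2.304,
    "rx.320.op0": 5.318, "rx.320.op1": 11.643,
}
for setup in ("wt.320", "wt.307", "rx.320"):
    ratio = wkb_per_insert[setup + ".op1"] / wkb_per_insert[setup + ".op0"]
    print(setup, round(ratio, 2))   # 1.71, 1.7, 2.19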

--- load
ops    db.gb   r/o     wKB/o   cs/o    cpu/o   setup
51416  53/36   0.0     1.355    2.8    2171    wt.320.op0
46156  44/40   0.0     2.316    4.0    2460    wt.320.op1
28171  47/41   0.0     1.358    0.9    3161    wt.307.op0
28080  46/35   0.0     2.304    1.8    3520    wt.307.op1
26654  31/31   0.0     5.318   16.0    3787    rx.320.op0
19033  36/36   0.0     11.643  18.4    4881    rx.320.op1

--- run, 2nd hour
ops     db.gb   r/o    wKB/o   cs/o    cpu/o   rss   setup
14483   86/72   0.0    2.486    3.6    2170    42G   wt.320.op1
14312   78/71   0.0    2.338    3.6    2259    43G   wt.307.op1
10794   38/38   0.0    1.357    3.9    2470    35G   rx.320.op1

--- run, 12th hour
ops     db.gb   r/o    wKB/o   cs/o    cpu/o   rss   setup
13042   100/90  0.0    2.588    4.1    2378    36G   wt.320.op1
12742   94/88   0.0    2.623    4.0    2414    43G   wt.307.op1
10550   45/45   0.0    1.491    4.1    2533    35G   rx.320.op1

Not cached, disk array

For this test there were 250M, 1B and X documents in the node, link and count collections after the load phase. The database did not fit in RAM and the disk array can do ~1200 IOPS. The query phase was run for 24 hours.

The new result here is that RocksDB sustained a much higher QPS rate during the query phase. From the response times listed at the end of this post, the difference appears to come from better response times for the most frequent operation -- GET_LINKS_LIST -- which does a short range scan. RocksDB also benefits from a better cache hit rate because the database is smaller, so the r/o rate is slightly smaller. It also uses less IO capacity for writes (wKB/o is smaller and writes are less random), leaving more IO capacity for reads.

--- load
ops    db.gb    r/o     wKB/o   cs/o    cpu/o   setup
40667  191/177  0.0     2.319    2.8    2803    wt.320.op1
25149  191/178  0.0     2.306    1.8    4041    wt.307.op1
18725  153/153  0.0    11.568   18.7    4968    rx.320.op1

--- run, 2nd hour
ops    db.gb    r/o     wKB/o   cs/o    cpu/o   rss   setup
504    199/188  0.842   5.326    8.0    5481    12G   wt.320.op1
507    206/187  0.798   5.171    7.5    8013    13G   wt.307.op1
850    153/153  0.746   1.743    5.3    3684    11G   rx.320.op1

--- run, 12th hour
ops    db.gb    r/o     wKB/o   cs/o    cpu/o   rss   setup
491    196/195  0.831   5.173    8.1    5700    12G   wt.320.op1
488    195/195  0.794   5.273    7.6    8380    12G   wt.307.op1
864    155/155  0.725   1.588    5.4    3967    11G   rx.320.op1

--- run, 24th hour
ops    db.gb    r/o     wKB/o   cs/o    cpu/o   rss   setup
494    199/197  0.799   5.404    8.1    5615    10G   wt.320.op1
471    197/197  0.814   5.303    7.8    8564    12G   wt.307.op1
833    156/156  0.738   1.721    5.5    4301    10G   rx.320.op1

Not cached, SSD

RocksDB sustained a higher QPS than WiredTiger for the query phase, similar to the result for the disk array with the not cached database. I didn't expect that result here or for the disk array.

--- load
ops    db.gb    r/o     wKB/o   cs/o    cpu/o   setup
40742  195/180  0.0     2.318    2.9    2798    wt.320.op1
25308  188/177  0.0     2.306    1.8    4007    wt.307.op1
18603  155/154  0.0    11.458   18.6    5005    rx.320.op1

--- run, 2nd hour
ops    db.gb    r/o     wKB/o   cs/o    cpu/o   rss   setup
3870   229/213  0.821   4.869   6.3     4179    11G   wt.320.op1
3814   228/210  0.813   4.620   6.1     4163    13G   wt.307.op1
6146   155/155  0.735   1.344   5.3     3063    11G   rx.320.op1

--- run, 12th hour
ops    db.gb    r/o     wKB/o   cs/o    cpu/o   rss   setup
3715   232/221  0.855   4.810   6.6     4449    9G    wt.320.op1
3415   223/217  0.825   4.582   6.4     4332    11G   wt.307.op1
5827   162/162  0.776   1.356   5.6     3239    11G   rx.320.op1

Response time metrics

These are the response time metrics for each configuration, as printed by Linkbench at the end of each run. The most interesting result is for the GET_LINKS_LIST operation, which uses a short range scan (~10 rows on average). For the cached database, the p99 for RocksDB is ~14ms vs ~3ms for WiredTiger. I think the problem for RocksDB is too many tombstones; we recently came up with a more efficient way to remove them in MyRocks. The p99 for RocksDB in the not cached databases (disk and SSD) is better than WiredTiger: ~12ms for SSD and ~47ms for disk.

RocksDB compaction stats

These are the compaction IO statistics from RocksDB at the end of the 24th 1-hour query run for the not cached, disk configuration.

2 comments:

  1. Two comments: looking at the insert ops during the load phase, it looks like the insert rates are not HW bound but are DB bound. The s3700 supposedly can do 36K IOPS at 4K random - if you are seeing 40K ops of ~2.3KB each, does that mean that wt 3.2.0rc was HW IO bound or bound by index creation etc.?

    Also, ~1.6x performance on 3.2.0rc vs 3.0.7 is very impressive on the SSD non cached config.

    Rocks query performance is solid, and so is the 99th percentile latency on every op -- 3.2.0rc showing 2k+ ms on max latency on add or update link is bizarre though.

    Mark, do you think a PCIe drive like the Intel P3608 would fare better, or are we not HW IO bound?

  2. I was surprised and impressed by improvements in WiredTiger for the 3.2 release.

    Secondary index maintenance for WiredTiger is read-modify-write (read page, make change). For RocksDB that is write-only for non-unique indexes -- no need to read old keys.

    For the load phase there are few reads; the write rate to SSD is ~43 MB/s for WiredTiger and ~97 MB/s for RocksDB. RocksDB has worse write-amp during the load, so it writes more per insert. Given the higher write rate for RocksDB I don't think WT is IO bound. From experience I suspect the RocksDB bottleneck is the mutex on the memtable. Average CPU utilization during the load was 52% for WT and 19% for RocksDB.

    For the query phase looking at the not cached database and the 12th 1-hour run I see:
    * WiredTiger: 32% CPU, 6247 r/s, 91 MB/s read, 35 MB/s write -> maybe IO bound
    * RocksDB: 5.6% CPU, 8880 r/s, 79 MB/s read, 16 MB/s write -> maybe IO bound

    Large outliers (p99, p99.9, p99.99) for SSD reads don't surprise me when reads can queue behind a flash block erase.

    Repeating this on HW with a modern SSD would be interesting.
