Wednesday, November 4, 2015

MongoDB 3.2 vs Linkbench

I used LinkbenchX to compare performance and efficiency for MongoDB 3.2.0rc0 vs 3.0.7 with the RocksDB and WiredTiger engines. The Linkbench test has two phases: load and query. The test was run in three configurations: cached with data on disk, too big to cache with data on disk and too big to cache with data on SSD. My summary:

Performance:
  • load rates are similar for disk and SSD with RocksDB and WiredTiger
  • load rate for WiredTiger is ~2X better in 3.2 versus 3.0
  • load rate for WiredTiger is more than 2X better than RocksDB
  • query rate for WiredTiger is ~1.3X better than RocksDB for cached database
  • query rate for RocksDB is ~1.5X better than WiredTiger for not cached database
Efficiency:
  • disk space used is ~1.33X higher for WiredTiger vs RocksDB
  • disk bytes written per document during the load is ~5X higher for RocksDB
  • disk bytes written per query is ~3.5X higher for WiredTiger
  • RocksDB uses ~1.8X more CPU during the load
  • WiredTiger uses ~1.4X more CPU during the query phase
  • with a 32G block cache mongod RSS is ~42G with WiredTiger vs ~34G with RocksDB

LinkbenchX


LinkbenchX is a fork of Linkbench. Source for Linkbench and LinkbenchX is on GitHub. LinkbenchX adds support for MongoDB and an option to sustain a fixed request arrival rate. For this test I use the MongoDB support but not the fixed request arrival rate, by running bin/linkbench from Linkbench. I am grateful to Percona for porting Linkbench to MongoDB.
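For reference, this is roughly how the two phases are driven; the -c, -l and -r flags are documented by Linkbench, but the exact harness is not shown here, so treat the paths and run counts as assumptions:
# Minimal sketch of driving the two phases with bin/linkbench; the -c, -l and
# -r flags are from the Linkbench docs, paths and run counts are assumptions.
import subprocess

CONFIG = "config/LinkConfigMongoDBv2.properties"

# Load phase: populate the node, link and count collections.
subprocess.check_call(["bin/linkbench", "-c", CONFIG, "-l"])

# Query phase: a sequence of 1-hour runs (maxtime=3600 in the properties file).
# 24 runs for the disk configuration, fewer for the others.
for _ in range(24):
    subprocess.check_call(["bin/linkbench", "-c", CONFIG, "-r"])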

The Linkbench workload requires transactions to update the count collection when a document is added to or removed from the link collection. MongoDB doesn't support per-shard transactions, so the Linkbench results will be incorrect (the count collection can drift out of sync with the link collection). I understand that cross-shard transactions are hard, but per-shard transactions and per-shard consistent read are valuable for Linkbench and for selling to the enterprise. I hope they arrive in MongoDB 3.4.
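To make the consistency problem concrete, this is a sketch in Python with PyMongo of the two writes that must be applied together when a link is added; the collection and field names are illustrative, not the exact LinkbenchX schema:
# Sketch of the link-insert path that needs per-shard transactions. The
# collection and field names are illustrative, not the LinkbenchX schema.
from pymongo import MongoClient

db = MongoClient("localhost", 27017).linkdb

def add_link(id1, id2, link_type):
    # Write 1: insert the edge into the link collection.
    db.link.insert_one({"id1": id1, "id2": id2, "link_type": link_type})
    # Write 2: bump the per-node edge count in the count collection. If the
    # client dies between these two writes, count and link disagree, which is
    # why per-shard transactions (and consistent read) matter for Linkbench.
    db.count.update_one({"id": id1, "link_type": link_type},
                        {"$inc": {"count": 1}},
                        upsert=True)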

Linkbench is run in two phases: load and query. I configured Linkbench to use 12 threads for both phases. The query phase was done as a sequence of 1-hour tests to measure whether performance and efficiency changed over time.

For the cached database test I set the value of maxid1 in the Linkbench file config/FBWorkload.properties to 50,000,001 and the database block cache to 32G. The database was cached by the OS but not by WiredTiger or RocksDB in this setup, as the compressed database was ~50G.

For the not cached database test I set the value of maxid1 to 250,000,001, set the database block cache to 8G and started a background process that used mlock to leave at most 40G of RAM for mongod and the OS page cache. The database was at least 150G and the workload did a lot of storage IO.
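This is a rough sketch of such a background process, assuming a 144G server and a 40G target; it is not the helper I used, and it needs root or a large memlock ulimit:
# Rough sketch (not the helper used here): pin ~104G of RAM with mlock so that
# mongod plus the OS page cache are left with at most ~40G on a 144G server.
# Requires root or a large "ulimit -l"; the size is an assumption (144G - 40G).
import ctypes
import ctypes.util
import mmap
import time

GB = 1 << 30
LOCK_BYTES = 104 * GB

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)

# Anonymous private mapping; mlock() faults the pages in and keeps them resident.
buf = mmap.mmap(-1, LOCK_BYTES, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))
if libc.mlock(ctypes.c_void_p(addr), ctypes.c_size_t(LOCK_BYTES)) != 0:
    raise OSError(ctypes.get_errno(), "mlock failed")

# Sleep forever so the memory stays locked for the duration of the benchmark.
while True:
    time.sleep(3600)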

For all tests I changed the Linkbench file config/LinkConfigMongoDBv2.properties to use transaction_support_level=1, requesters=12, loaders=12, maxtime=3600, requestrate=0 and requests=100000000000. Fsync is not done on commit.

The test server has 24 cores with hyperthreading enabled, 144G of RAM and either one 400G SSD (Intel DC s3700) or 6 SAS disks with HW RAID 0. The server uses Fedora release 19, Linux 3.14.27-100, gcc 4.8.3 and MongoDB was linked with tcmalloc.

Snappy compression was used for both RocksDB and WiredTiger. For all tests this was in the mongod configuration file:
processManagement:
  fork: true
systemLog:
  destination: file
  logAppend: true
storage:
  syncPeriodSecs: 600
  journal:
    enabled: true
storage.wiredTiger.collectionConfig.blockCompressor: snappy
storage.wiredTiger.engineConfig.journalCompressor: none
For the not cached database this was in the mongod configuration file:
replication.oplogSizeMB: 4000
storage.wiredTiger.engineConfig.cacheSizeGB: 8
storage.rocksdb.cacheSizeGB: 8
For the cached database this was in the mongod configuration file:
replication.oplogSizeMB: 8000
storage.wiredTiger.engineConfig.cacheSizeGB: 32
storage.rocksdb.cacheSizeGB: 32

Legend

This is the legend for the data in the following sections. The disk and CPU metrics are collected from iostat and vmstat. Most rates below are normalized by operations/second, where an operation is an insert during the load phase and a query during the query phase. The real insert rate across all collections is reported for ops below, but I use the number of inserts to the node collection (50M or 250M) when normalizing the iostat and vmstat rates. A sketch of the normalization follows the list.

  • ops - average rate for operations/second (inserts or queries per second)
  • db.gb - database size in GB (two numbers are from du without and with --apparent-size)
  • r/o - disk reads per operation
  • wKB/o - disk KB written per operation
  • cs/o - context switches per operation
  • cpu/o - CPU/operation from vmstat us+sy divided by ops multiplied by 1M 
  • rss - RSS from ps for the mongod process
  • setup - wt (wiredtiger), rx (rocksdb), 307 (mongo 3.0.7), 320 (mongo 3.2.0rc0), op0/op1 - oplog off/on
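This is a minimal sketch of the normalization; the real scripts are not shown in this post, so the function and variable names are assumptions:
# Sketch of how the per-operation rates are derived from iostat and vmstat.
def per_op(rate_per_sec, normalizing_ops_per_sec):
    # For the load phase the normalizing rate is inserts/second to the node
    # collection (50M or 250M rows over the load time); for the query phase
    # it is queries/second.
    return rate_per_sec / normalizing_ops_per_sec

def cpu_per_op(vmstat_us_plus_sy, normalizing_ops_per_sec):
    # cpu/o from the legend: vmstat us+sy divided by ops, multiplied by 1M.
    return vmstat_us_plus_sy / normalizing_ops_per_sec * 1000000

# Example with made-up numbers: 100 MB/s written at 40,000 operations/second.
print(per_op(100 * 1024, 40000))      # wKB/o ~ 2.56
print(cpu_per_op(50.0, 40000))        # cpu/o = 1250.0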

Cached database

For this test there were 50M, ~220M and X docs in the node, link and count collections after the load phase. In addition to the conclusions listed at the top of this post, WiredTiger 3.2 uses less CPU per insert than 3.0, although it has more context switches per insert. It is good to see it become more efficient, and the insert rate for WiredTiger has almost doubled from 3.0 to 3.2.

The context switch rate per insert is much larger for RocksDB because of the global mutex that serializes inserts into the memtable. There are no disk reads during this test because the database fits in the OS page cache. The CPU rate per insert is also much higher for RocksDB during the load. That might be a side effect of the mutex contention.

The difference in database sizes for WiredTiger vs RocksDB is small after the load but grows quickly during the run phases. I did not try to debug it, but the growth for WiredTiger could be a problem. WiredTiger also uses much more memory than RocksDB, but I don't know whether that is a fixed overhead (~8G) or a relative overhead (~30% of the block cache size).

Using the oplog roughly doubles the wKB/o rate because writes are done twice -- once to the oplog and once to the database. The internal write-amplification reported by RocksDB for rx.320.op1 is 6.1.
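A quick check of that claim against the wKB/o column in the load table below: the ratio with vs without the oplog is ~1.7X for WiredTiger and ~2.2X for RocksDB, perhaps more than 2X for RocksDB because the oplog entries are also rewritten by compaction.
# Ratio of wKB/o with the oplog on vs off, using the load table below.
wkb_per_insert = {
    "wt.320.op0": 1.355, "wt.320.op1": 2.316,
    "wt.307.op0": 1.358, "wt.307.op1": 2.304,
    "rx.320.op0": 5.318, "rx.320.op1": 11.643,
}
for setup in ("wt.320", "wt.307", "rx.320"):
    ratio = wkb_per_insert[setup + ".op1"] / wkb_per_insert[setup + ".op0"]
    print(setup, round(ratio, 2))   # 1.71, 1.7, 2.19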

--- load
ops    db.gb   r/o     wKB/o   cs/o    cpu/o   setup
51416  53/36   0.0     1.355    2.8    2171    wt.320.op0
46156  44/40   0.0     2.316    4.0    2460    wt.320.op1
28171  47/41   0.0     1.358    0.9    3161    wt.307.op0
28080  46/35   0.0     2.304    1.8    3520    wt.307.op1
26654  31/31   0.0     5.318   16.0    3787    rx.320.op0
19033  36/36   0.0     11.643  18.4    4881    rx.320.op1

--- run, 2nd hour
ops     db.gb   r/o    wKB/o   cs/o    cpu/o   rss   setup
14483   86/72   0.0    2.486    3.6    2170    42G   wt.320.op1
14312   78/71   0.0    2.338    3.6    2259    43G   wt.307.op1
10794   38/38   0.0    1.357    3.9    2470    35G   rx.320.op1

--- run, 12th hour
ops     db.gb   r/o    wKB/o   cs/o    cpu/o   rss   setup
13042   100/90  0.0    2.588    4.1    2378    36G   wt.320.op1
12742   94/88   0.0    2.623    4.0    2414    43G   wt.307.op1
10550   45/45   0.0    1.491    4.1    2533    35G   rx.320.op1

Not cached, disk array

For this test there were 250M, 1B and X documents in the node, link and count collections after the load phase. The database did not fit in RAM and the disk array can do ~1200 IOPS. The query phase was run for 24 hours.

The new result here is that RocksDB sustained a much higher QPS rate during the query phase. From the response times listed at the end of this post, the difference appears to come from better response times for the most frequent operation -- GET_LINKS_LIST -- which does a short range scan. RocksDB also benefits from a better cache hit rate because the database is smaller, so the r/o rate is slightly smaller. It also uses less IO capacity for writes (wKB/o is smaller and writes are less random), leaving more IO capacity for reads.

--- load
ops    db.gb    r/o     wKB/o   cs/o    cpu/o   setup
40667  191/177  0.0     2.319    2.8    2803    wt.320.op1
25149  191/178  0.0     2.306    1.8    4041    wt.307.op1
18725  153/153  0.0    11.568   18.7    4968    rx.320.op1

--- run, 2nd hour
ops    db.gb    r/o     wKB/o   cs/o    cpu/o   rss   setup
504    199/188  0.842   5.326    8.0    5481    12G   wt.320.op1
507    206/187  0.798   5.171    7.5    8013    13G   wt.307.op1
850    153/153  0.746   1.743    5.3    3684    11G   rx.320.op1

--- run, 12th hour
ops    db.gb    r/o     wKB/o   cs/o    cpu/o   rss   setup
491    196/195  0.831   5.173    8.1    5700    12G   wt.320.op1
488    195/195  0.794   5.273    7.6    8380    12G   wt.307.op1
864    155/155  0.725   1.588    5.4    3967    11G   rx.320.op1

--- run, 24th hour
ops    db.gb    r/o     wKB/o   cs/o    cpu/o   rss   setup
494    199/197  0.799   5.404    8.1    5615    10G   wt.320.op1
471    197/197  0.814   5.303    7.8    8564    12G   wt.307.op1
833    156/156  0.738   1.721    5.5    4301    10G   rx.320.op1

Not cached, SSD

RocksDB sustained a higher QPS than WiredTiger for the query phase, similar to the result for the disk array with the not cached database. I didn't expect that result here or for the disk array.

--- load
ops    db.gb    r/o     wKB/o   cs/o    cpu/o   setup
40742  195/180  0.0     2.318    2.9    2798    wt.320.op1
25308  188/177  0.0     2.306    1.8    4007    wt.307.op1
18603  155/154  0.0    11.458   18.6    5005    rx.320.op1

--- run, 2nd hour
ops    db.gb    r/o     wKB/o   cs/o    cpu/o   rss   setup
3870   229/213  0.821   4.869   6.3     4179    11G   wt.320.op1
3814   228/210  0.813   4.620   6.1     4163    13G   wt.307.op1
6146   155/155  0.735   1.344   5.3     3063    11G   rx.320.op1

--- run, 12th hour
ops    db.gb    r/o     wKB/o   cs/o    cpu/o   rss   setup
3715   232/221  0.855   4.810   6.6     4449    9G    wt.320.op1
3415   223/217  0.825   4.582   6.4     4332    11G   wt.307.op1
5827   162/162  0.776   1.356   5.6     3239    11G   rx.320.op1

Response time metrics

These are the response time metrics for each configuration, as printed by Linkbench at the end of each run. The most interesting result is for the GET_LINKS_LIST operation, which uses a short range scan (~10 rows on average). For the cached database, the p99 for RocksDB is ~14ms vs ~3ms for WiredTiger. I think the problem for RocksDB is too many tombstones; we recently came up with a more efficient way to remove them in MyRocks. The p99 for RocksDB in the not cached databases (disk and SSD) is better than WiredTiger: ~12ms for SSD and ~47ms for disk.

RocksDB compaction stats

These are the compaction IO statistics from RocksDB at the end of the 24th 1-hour query run for the not cached, disk configuration.

2 comments:

  1. Two comments: looking at the insert ops during the load phase, it looks like the insert rates are not HW bound but are DB bound. The s3700 supposedly can do 36K IOPS at 4K random - if you are seeing 40K ops of ~2.3KB each, does that mean that wt 3.2.0rc was HW IO bound or bound by index creation etc.?

    Also, ~1.6x performance on 3.2.0rc vs 3.0.7 is very impressive on the SSD non cached config.

    Rocks query performance is solid, and so is the 99th percentile latency on every op -- 3.2.0rc showing 2k+ ms on max latency on add or update link is bizarre though.

    Mark, do you think a PCIe drive like the Intel P3608 would fare better, or are we not HW IO bound?

  2. I was surprised and impressed by improvements in WiredTiger for the 3.2 release.

    Secondary index maintenance for WiredTiger is read-modify-write (read page, make change). For RocksDB that is write-only for non-unique indexes -- no need to read old keys.

    For the load phase there are few reads; the write rate to SSD is ~43 MB/s for WiredTiger and ~97 MB/s for RocksDB. RocksDB has worse write-amp during the load, so it writes more per insert. Given the higher write rate for RocksDB I don't think WT is IO bound. From experience I suspect the RocksDB bottleneck is the mutex on the memtable. Average CPU utilization during the load was 52% for WT and 19% for RocksDB.

    For the query phase looking at the not cached database and the 12th 1-hour run I see:
    * WiredTiger: 32% CPU, 6247 r/s, 91 MB/s read, 35 MB/s write -> maybe IO bound
    * RocksDB: 5.6% CPU, 8880 r/s, 79 MB/s read, 16 MB/s write -> maybe IO bound

    Large outliers (p99, p99.9, p99.99) for SSD reads don't surprise me when reads can queue behind a flash block erase.

    Repeating this on HW with a modern SSD would be interesting.
