Small Datum: 2016

Friday, December 30, 2016

The MyRocks community in 2016

The MyRocks community grew in 2016 with significant contributions from early evaluators. I appreciate their help and have been busy documenting the problems that they reported. I look forward to more contributions in 2017.

MariaDB Corporation and Percona announced plans to include MyRocks in their distributions. You will be able to get MyRocks with expert support. Equally big have been the contributions from Sergey Petrunia to MyRocks.

The MySQL team at Alibaba reported poor performance from range scans with sysbench. The problem was memory system contention on a performance counter and the fix made long range scans 2X faster on concurrent workloads.

I have been using sysbench more and created a script to run a sequence of tests with it. During my tests I noticed too much variance during the read-only tests and the problem is non-determinism from the state of the memtable and level 0 of the LSM tree. The overhead for a search of the LSM tree depends on the amount of data in the memtable and the number of files in level 0. This state is not deterministic when a read-only test follows a read-write test. I filed issue 427 to see if we can get RocksDB to adapt and flush data earlier than normal from the memtable and level 0 when a workload becomes read-only.

Sysbench is extremely useful even though the workloads it provides are synthetic. That is especially true for the read-only tests, which use a small (multi-GB) database, and most tests use a uniform distribution for keys. There is a talk on sysbench at FOSDEM 2017 and I look forward to using it more next year.

Justin Swanhart pointed out the lousy performance MyRocks provided with a default configuration. As a result we improved the default configuration by increasing the memtable to 64M and block cache to 512M and I will share results at FOSDEM. See issues 369, 375, and 441.

Justin Swanhart and Rick Pizzi reported problems with too-large transactions leading to lousy performance and OOM. MyRocks buffers in memory all changes from a transaction until commit and at commit time the changes are briefly double buffered when copied into the memtable. Until we make this better the workaround is to use rocksdb_commit_in_the_middle for bulk loads. Even after we make this better that option will be useful for bulk loads. Today we have a poor limit on the max size of a transaction via rocksdb_max_row_locks. This is a limit on the number of locked rows rather than on the memory used by a transaction. Even worse, the default is 1B. We are changing the limit to be on the amount of memory used by a transaction. Issues for this include 341, 347, 348 and 365.
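The idea behind committing in the middle of a bulk load can be sketched from the client side: split the load into bounded batches and commit after each one, so the changes buffered by any single transaction stay small. This is only an illustration in Python, not how MyRocks implements rocksdb_commit_in_the_middle. The names batches, bulk_load, insert_row and commit, and the batch size, are made up for the sketch.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batches(rows: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size rows."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    it = iter(rows)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        yield chunk


def bulk_load(rows: Iterable[T], batch_size: int,
              insert_row: Callable[[T], None],
              commit: Callable[[], None]) -> int:
    """Insert all rows, committing after each batch. Returns the commit count."""
    commits = 0
    for chunk in batches(rows, batch_size):
        for row in chunk:
            insert_row(row)
        # Committing here bounds what one transaction buffers to batch_size rows.
        commit()
        commits += 1
    return commits


# Stand-ins for a real connection: collect rows in a list, commit is a no-op.
loaded: List[int] = []
commit_count = bulk_load(range(10), 4, loaded.append, lambda: None)
print(commit_count, len(loaded))
```

The tradeoff is that the load is no longer atomic: a failure part way through leaves the earlier batches committed.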

The MySQL team at Alibaba reported a performance regression from MyRocks group commit when running sysbench. They also reported a correctness problem that I ran into with sysbench. Auditing the code found another bug. A related problem is that rocksdb_disable_2pc was enabled by default, meaning engine & binlog crash safety wasn't guaranteed. We have changed the default configuration to be crash safe and renamed the option to avoid double negatives. We also used YCSB and sysbench to reproduce the performance regression and have work in progress to make this better. That requires collaboration between the MyRocks and RocksDB teams. It is fun to watch the teams solve problems. Issues for this problem include 474, 481, 482, 488.

Wednesday, November 23, 2016

MyRocks: use less IO on writes to have more IO for reads

Holiday is almost here and I wrote a long blog post on write-efficiency yesterday so this one will be short. A longer version of this is in progress because this is an interesting result for me to explain. We assume that an LSM is less efficient for reads because it is more efficient for writes and it is hard to be optimal for all of read, write & space efficiency.

For real workloads it is complicated and for now I include benchmarks in "real workloads". Here is one interesting result from my IO-bound tests of Linkbench. The summary is that when you spend less on IO to write back changes then you can spend more on IO to handle user queries. That benefit is more apparent on slower storage (disk array) than on faster storage (MLC NAND flash) because slower storage is more likely to be the bottleneck.

IO-bound Linkbench means that I used a server with 50G of RAM and ran Linkbench with maxid1=1B (1B nodes). The MyRocks database was ~400G and the InnoDB database was ~1.6T. Both MyRocks and InnoDB used MySQL 5.6.26. The workload is IO-heavy and the database working set is not cached.

The interesting result is that the difference between MyRocks and InnoDB becomes larger as storage gets slower. Another way to describe this is that InnoDB loses more performance than MyRocks when moving from faster to slower storage. I assume this is because MyRocks uses less IO capacity for writing back database changes so it has more IO capacity for handling user queries.

Transactions per second

Storage      MyRocks   InnoDB   MyRocks/InnoDB
Disk array      2195      414   5.3
Slow SSD       23484    10143   2.3
Fast SSD       28965    21414   1.4

The random operations per second provided by the storage devices above is approximately 1k for the disk array, 10k for the slow SSD and more than 100k for the fast SSD.
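The ratios in the table, and the claim that InnoDB loses more than MyRocks when storage gets slower, can be recomputed from the TPS numbers. This is a small Python sketch: the numbers are copied from the table, and the ratio helper and variable names are mine.

```python
# Transactions per second copied from the table above: (MyRocks, InnoDB).
TPS = {
    "Disk array": (2195, 414),
    "Slow SSD": (23484, 10143),
    "Fast SSD": (28965, 21414),
}


def ratio(a: float, b: float) -> float:
    """Return a / b, used both for engine ratios and for slowdowns."""
    return a / b


# MyRocks/InnoDB ratio per storage type (the last column of the table).
for storage, (myrocks, innodb) in TPS.items():
    print(f"{storage}: MyRocks/InnoDB = {ratio(myrocks, innodb):.1f}")

# How much each engine slows down when moving from fast SSD to the disk array.
myrocks_drop = ratio(TPS["Fast SSD"][0], TPS["Disk array"][0])
innodb_drop = ratio(TPS["Fast SSD"][1], TPS["Disk array"][1])
print(f"fast SSD -> disk array: MyRocks {myrocks_drop:.1f}x slower, "
      f"InnoDB {innodb_drop:.1f}x slower")
```

With these numbers InnoDB slows down by roughly 52x and MyRocks by roughly 13x when moving from the fast SSD to the disk array.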

Tuesday, November 22, 2016

Why is MyRocks more write-efficient than InnoDB?

This year I shared results where InnoDB wrote between 10X and 20X more data to storage than MyRocks for the same workload. I use KB written to storage per transaction as a measure of write efficiency and I usually compute this with data from the benchmark client and iostat. I get KB written/second from iostat, average transaction/second from the benchmark client and divide the former by the latter to compute KB written/transaction. When using SSD this excludes the writes done by SSD firmware and I previously reported that the overhead was worse for InnoDB than for RocksDB on one vendor's device.
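The calculation described above is a single division. A minimal Python sketch follows; the function name and the sample inputs are placeholders, not measurements from these tests.

```python
def kb_written_per_transaction(iostat_wkb_per_sec: float, tps: float) -> float:
    """KB written per transaction = (KB written/second) / (transactions/second)."""
    if tps <= 0:
        raise ValueError("tps must be positive")
    return iostat_wkb_per_sec / tps


# Hypothetical inputs: 40,000 KB/s from iostat and 20,000 transactions/second
# from the benchmark client.
example = kb_written_per_transaction(40000.0, 20000.0)
print(example)
```

Both inputs should be averages over the same time window, otherwise the quotient mixes different phases of the test.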

An engine that writes less to storage per transaction is more write efficient. It is a good thing if MyRocks writes 10X less to storage than InnoDB for the same workload. This might enable MyRocks to use lower-endurance SSD for workloads where InnoDB required higher-endurance SSD. This might enable MyRocks to use SSD for workloads in which the device would not last with InnoDB. This also means that MyRocks needs less overprovisioning on the SSD, which is another way of saying you get more capacity from the device.

This is an update on results I previously shared.

Why
There are a few reasons why MyRocks is more write-efficient than InnoDB:

Doublewrite buffer
Configuration
Page size

Doublewrite

The InnoDB doublewrite buffer doubles the storage write rate. It does this for a good reason -- to protect against partial page writes. Perhaps one day SSD vendors will agree on an atomic-write solution that works across vendors and with popular file systems on Linux. Perhaps one day crash safe RAM will be a common thing in data centers. Perhaps one day we will have a copy-on-write filesystem that is widely used for InnoDB on Linux. Until then we are stuck with 2X write-amplification from the doublewrite buffer.

Configuration

I might be overstating this to make a point. If you have fast storage (NAND flash) and the database working set fits in RAM then you have too much RAM. If you have fast storage then configure the database to use it. Or keep the working set in RAM and use a disk array instead of NAND flash. But if you have an in-memory workload and a database engine that does random IO (update-in-place b-tree) then you still need IOPs capacity or you should switch to a proper in-memory database engine like Tarantool.

Most of the servers that I care about are setup so that the database working set isn't in RAM. I configure benchmarks like Linkbench in the same manner. I am not promising that MyRocks will write 10X less to storage than InnoDB for all use cases - workload and configuration matter. But it tends to be better.

Page size

InnoDB page size has a big impact on write-efficiency when the working set isn't cached because dirty b-tree pages will be evicted from the tail of the LRU earlier and when evicted they must be written back to storage (twice with InnoDB thanks to the doublewrite buffer). In the worst case pages are written back with only one dirty row and the write-amplification in that case is sizeof(page) / sizeof(row). In the best case all rows on the page are dirty but the best case isn't likely when the working set isn't cached.
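The worst-case arithmetic can be written out as a short Python sketch. The 128-byte row is a made-up size chosen for illustration, and the function name and parameters are mine.

```python
def page_write_amplification(page_bytes: int, row_bytes: int,
                             dirty_rows: int = 1,
                             doublewrite: bool = True) -> float:
    """Bytes written to storage per byte of changed row data for one page write-back."""
    if row_bytes <= 0 or dirty_rows <= 0:
        raise ValueError("row_bytes and dirty_rows must be positive")
    if row_bytes * dirty_rows > page_bytes:
        raise ValueError("dirty rows cannot exceed the page size")
    # The doublewrite buffer means the page is written twice.
    written = page_bytes * (2 if doublewrite else 1)
    return written / (row_bytes * dirty_rows)


# Worst case (one dirty row per page) for a hypothetical 128-byte row.
for page_kb in (16, 8, 4):
    amp = page_write_amplification(page_kb * 1024, 128)
    print(f"{page_kb}kb page: {amp:.0f}x")
```

Halving the page size halves the worst case, which is the motivation for trying smaller InnoDB pages.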

The obvious way to reduce write-amplification (and improve write-efficiency) is to reduce the database page size. The default page size for InnoDB is 16kb, but it is possible to use 8kb or 4kb pages at initialization by setting innodb_page_size or when using compressed tables. Both of these have a cost. First, InnoDB performance is greatly reduced when using compressed tables. Second, rows must fit in half of a page, excluding LOB columns, so a smaller page also means a smaller max row size. A smaller page also reduces the max size of an index key. See the manual for more details.

I repeated Linkbench with an IO-bound configuration - 50G of RAM, database is ~400G for MyRocks with zlib compression and ~1.6T for InnoDB without compression. The data below shows throughput (TPS is transactions/second) and the storage write-rate per transaction (iostat wKB/t) for MyRocks versus InnoDB with different page sizes. When using a smaller page size for InnoDB both the storage write rate per transaction and TPS improve. I assume TPS is better because when less IO capacity is used for writes then more IO capacity is available for reads.

Engine    Page-size   TPS     iostat-wKB/t
MyRocks   16kb        28965    1.25
InnoDB    4kb         24845    6.13
InnoDB    8kb         24352   10.52
InnoDB    16kb        21414   19.70

And graphs for the same data.
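Two derived numbers help when reading the table: how much more each configuration writes per transaction than MyRocks, and the implied storage write rate (TPS multiplied by wKB/t). This is a Python sketch with the numbers copied from the table; the helper name is mine and the MB/s values are derived, not measured separately.

```python
# (engine, page size, TPS, iostat wKB/t) copied from the table above.
ROWS = [
    ("MyRocks", "16kb", 28965, 1.25),
    ("InnoDB", "4kb", 24845, 6.13),
    ("InnoDB", "8kb", 24352, 10.52),
    ("InnoDB", "16kb", 21414, 19.70),
]


def write_rate_kb_per_sec(tps: float, wkb_per_txn: float) -> float:
    """Storage write rate implied by throughput and KB written per transaction."""
    return tps * wkb_per_txn


baseline_wkb = ROWS[0][3]  # MyRocks wKB/t
for engine, page, tps, wkb in ROWS:
    rate_mb = write_rate_kb_per_sec(tps, wkb) / 1024
    print(f"{engine} {page}: {wkb / baseline_wkb:.1f}X the MyRocks wKB/t, "
          f"~{rate_mb:.0f} MB/s written")
```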

Monday, November 21, 2016

Sysbench, InnoDB, transaction isolation and the performance schema

I used sysbench to understand the impact of transaction isolation and the performance schema for InnoDB from upstream MySQL 5.6.26.

The test server has 24 CPU cores, 48 HW threads with hyperthreading enabled, 256G of RAM and fast SSD. For sysbench I used the 1.0 version with support for Lua. Tests were run in two configurations -- cached and IO-bound. For the cached configuration I used 8 tables, 1M rows/table and the database cache was large enough to cache all data. For the IO-bound configuration I used 8 tables, 10M rows/table, a 2G database cache and buffered IO so that all data was in the OS page cache. The database was ~2G for the cached configuration and ~20G for the IO-bound configuration. InnoDB table compression was not used and jemalloc was used. The binlog was enabled but sync-on-commit was disabled for the binlog and InnoDB redo log.

With 8 tables and 1M rows per table the database is very small -- a few GB. I am wary of drawing too many conclusions from sysbench results for such a small database but other people will use it to evaluate MyRocks so I have been spending more time to understand sysbench performance.

Both the cached and IO-bound configurations require data to be written to storage. For the cached test all page reads are served from the database cache. For the IO-bound test some reads are served from the database cache and misses are served from the OS page cache. In both cases nothing is read from storage.

tl;dr

1. For InnoDB using repeatable-read can help performance.
2. Disabling the performance schema can help performance.
3. My tests are from MySQL 5.6.26. It will be interesting to understand what has changed in 5.7 and 8. I know that the overhead from #1 and #2 should be much less in modern MySQL.

Sysbench

I have a script that uses sysbench to run tests in an interesting pattern and used it for this test. The pattern is in the all.sh script and is explained below. The all.sh script uses the run.sh script to set the sysbench command line options. I run most of the read-write tests before the read-only tests to fragment the database before evaluating query performance.

load - load the 8 tables
update-index - uses oltp.lua and each transaction is an UPDATE statement that finds & changes 1 row by PK. Secondary index maintenance is required for the update statement.
update-nonindex - like update-index but secondary index maintenance is not required.
read-write - uses oltp.lua in read-write mode, a classic sysbench workload. Run with oltp-range-size=100.
read-only - uses oltp.lua in read-only mode, a classic sysbench workload. Run four times with oltp-range-size set 10, 100, 1000 and 10000.
point-query - uses oltp.lua to fetch one row by PK per query
select - uses select.lua to fetch one row per query
insert - uses insert.lua to insert one row per transaction. This grows the database and the growth amount depends on the insert rate. So a faster engine will grow the database more than a slower engine. It also means that when I run the test for a long time the database won't fit in the database or OS page cache. For these reasons I run this test last.

The test was run for 1, 2, 4, 8, 16, 24, 32, 40, 48, 64, 80, 96 and 128 concurrent clients. This is currently hardwired in the all.sh script. For each level of concurrency I ran sysbench for 3 minutes for the read-only tests and 5 minutes for the read-write tests. Eventually I will run it for more time at each level of concurrency but I had a large number of tests to run and am trying to figure out which configurations are interesting.
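The test pattern and the time it takes can be sketched in Python. This is not the all.sh script. It assumes that point-query and select count as read-only tests (3 minutes per concurrency level) and that the update and insert tests count as read-write tests (5 minutes). The test labels are mine.

```python
CLIENTS = [1, 2, 4, 8, 16, 24, 32, 40, 48, 64, 80, 96, 128]

# (test name, minutes per concurrency level), in the order listed above.
# Assumption: point-query and select are treated as read-only (3 minutes),
# update-index, update-nonindex and insert as read-write (5 minutes).
TESTS = [
    ("update-index", 5),
    ("update-nonindex", 5),
    ("read-write.range100", 5),
    ("read-only.range10", 3),
    ("read-only.range100", 3),
    ("read-only.range1000", 3),
    ("read-only.range10000", 3),
    ("point-query", 3),
    ("select", 3),
    ("insert", 5),
]


def schedule():
    """Yield (test, clients, minutes) for every step after the load."""
    for name, minutes in TESTS:
        for clients in CLIENTS:
            yield name, clients, minutes


steps = list(schedule())
total_minutes = sum(minutes for _, _, minutes in steps)
print(f"{len(steps)} steps, {total_minutes} minutes "
      f"({total_minutes / 60:.1f} hours) excluding the load")
```

Under those assumptions one pass takes a bit more than 8 hours per configuration, which is why the time per concurrency level is short for now.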

The my.cnf for these tests is here.

Results

Data for the results is here. The numbers provided are queries per second (QPS) not transactions per second (TPS). The configurations tested are described below:

innodb.8t.1m.rr.ps0 - 8 tables, 1M rows/table, repeatable-read, performance_schema=0
innodb.8t.1m.rr.ps1 - 8 tables, 1M rows/table, repeatable-read, performance_schema=1
innodb.8t.1m.rc.ps0 - 8 tables, 1M rows/table, read-committed, performance_schema=0
innodb.8t.1m.rc.ps1 - 8 tables, 1M rows/table, read-committed, performance_schema=1

My summary of performance is:

update-index - disabling the performance schema has a small impact on QPS (between 1% and 5% more QPS is common). Changing transaction isolation has no impact on QPS.
update-nonindex - same as update-index
read-write - disabling the performance schema frequently boosts QPS by 5% to 10% and the impact is greater at high concurrency. Using repeatable-read boosts performance because it reduces the mutex contention from getting a consistent read snapshot as that is done once per transaction rather than once per statement.
read-only - see the conclusions for read-write for shorter range scans (oltp-range-size set to 10 or 100 and maybe 1000). For longer range scans (oltp-range-size set to 10000) transaction isolation and the performance schema have little impact because the overhead is elsewhere
point-query - disabling the performance schema has a big impact on performance (between 5% and 10% more QPS) and the benefit is larger at high concurrency. Transaction isolation has no impact on performance because transactions are single statement.
select - see point-query
insert - disabling the performance schema has a smaller impact on QPS. Transaction isolation doesn't have an impact on performance because transactions are single statement.
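The isolation results above follow from how often a consistent read snapshot is taken. A minimal Python sketch of that count is below; the function name is mine and the 10 statements per transaction is a placeholder, not the actual statement count for oltp.lua.

```python
def read_snapshots(isolation: str, statements: int) -> int:
    """Number of consistent read snapshots taken by one transaction."""
    if statements < 1:
        raise ValueError("a transaction has at least one statement")
    if isolation == "repeatable-read":
        return 1  # one snapshot for the whole transaction
    if isolation == "read-committed":
        return statements  # a new snapshot for every statement
    raise ValueError(f"unsupported isolation level: {isolation}")


# Placeholder multi-statement transaction versus a single-statement one.
for statements in (10, 1):
    rr = read_snapshots("repeatable-read", statements)
    rc = read_snapshots("read-committed", statements)
    print(f"{statements} statement(s): repeatable-read={rr}, read-committed={rc}")
```

For single-statement transactions both isolation levels take one snapshot, which matches the lack of impact on point-query, select and insert.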

Thursday, October 27, 2016

Benchmarketing MyRocks

I have been spending time understanding MyRocks performance for new workloads including benchmarks that potential MyRocks users run. One of those benchmarks is sysbench and I wrote a script to make it easier for me to run.

sysbench

Like most synthetic benchmarks sysbench is valuable but has its flaws. It helps to understand the flaws when looking at results. Most uses of sysbench are for very small databases. A typical run for me is 8 tables with 1M rows per table. That uses about 2G of space with uncompressed InnoDB tables. For a typical MyRocks configuration that will use a 3 level LSM tree with data in levels 0, 1 and 2 and I usually disable compression for those levels. And if you are running performance tests for a 2G database that fits in cache I wouldn't use compression. Small databases save time when running benchmarks as the load happens real fast. But you might miss the real overheads that occur with a larger database.

Another possible problem with sysbench is that several of the test configurations are for read-only workloads. If your real workload isn't read-only, then you might miss real overheads. For example, the RocksDB memtable might be empty for a read-only workload. That avoids the cost of checking the memtable on a query and can overstate the QPS you will measure.

I spent a day explaining unexpected performance variance on a read-only sysbench test. I took too long to notice that the LSM on the slower server had data in levels 0, 1 and 2 while the LSM on the faster server only used levels 1 and 2. By not having data in level 0 there was less work to do to process a query and the faster server got more QPS. This was visible in the compaction IO statistics displayed by SHOW ENGINE ROCKSDB STATUS. Had this been a read-write workload the LSM would have been in a steadier state with data (usually) in the memtable and level 0. But in this case the memtable was empty and compaction was stopped because there were no writes and the compaction scores for all levels were <= 1. I wonder whether we can add a feature to RocksDB to trigger compaction during read-only workloads when the LSM tree can be made more performant for queries?
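A rough way to see why level 0 matters is to count the sorted runs a point query might have to check. The Python sketch below is a simplified model that assumes leveled compaction and ignores bloom filters and key-range checks. Files in level 0 can overlap so each one counts, while each deeper level counts once. The function name and the count of 3 level 0 files are made up.

```python
def sorted_runs_to_check(nonempty_memtables: int, l0_files: int,
                         nonempty_deeper_levels: int) -> int:
    """Upper bound on sorted runs consulted by one point lookup (simplified model)."""
    if min(nonempty_memtables, l0_files, nonempty_deeper_levels) < 0:
        raise ValueError("counts must be non-negative")
    # Each memtable and each L0 file is its own sorted run. Levels 1 and up
    # are each one sorted run made of non-overlapping files.
    return nonempty_memtables + l0_files + nonempty_deeper_levels


# Slower server: empty memtable, data in levels 0, 1 and 2 (3 L0 files assumed).
slower = sorted_runs_to_check(0, 3, 2)
# Faster server: empty memtable, data only in levels 1 and 2.
faster = sorted_runs_to_check(0, 0, 2)
print(slower, faster)
```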

configuration

The best settings for the MyRocks my.cnf file are also a source of confusion. I almost always enable the concurrent memtable. See the comments for the options allow_concurrent_memtable_write and enable_write_thread_adaptive_yield. I explained the benefits of these options in a previous post. Alas the options are disabled by default and not mentioned in the suggested my.cnf options. They are enabled by adding this to my.cnf:

rocksdb_allow_concurrent_memtable_write=1
rocksdb_enable_write_thread_adaptive_yield=1

I enable the concurrent memtable for most of my benchmarks. When MyRocks arrives in MariaDB Server and Percona Server I wonder whether other users will do the same. For read-write workloads the concurrent memtable can be a big deal.

Wednesday, October 19, 2016

Make MyRocks 2X less slow

Fixing mutex contention has been good for my career. I had the workload, an RDBMS that needed a few improvements and support from a great team. Usually someone else found the bugs and I got to fix many of them. Sometimes I got too much credit because a good bug report is as valuable as the bug fix. These days I don't see many mutex contention bugs but I have begun to see more bugs from memory contention. My perf debugging skills need refreshing. They are far from modern. Thankfully we have Brendan Gregg.

For someone who debugs performance, shared_ptr is a gift. Mistakenly passing shared_ptr by value means the reference count will be changed too much and that is not good on a concurrent workload. I have encountered that at least twice in RocksDB and MyRocks. I even encountered it in MongoDB with SERVER-13382.

I have twice made MyRocks 2X less slow. First with issue 231 peak compaction throughput was doubled and now with issue 343 we almost double range-scan throughput (for long range scans with many concurrent queries). Someone important recently reported a disappointing performance result when comparing MyRocks with InnoDB. After a few days with sysbench I was able to reproduce it. This should be easy to fix.

Not mutex contention

In this bug, with sysbench read-only and read-write the peak QPS for MyRocks saturated long before InnoDB. While MyRocks and InnoDB had similar QPS at low concurrency, the QPS at high concurrency was almost 2X better for InnoDB. This was only an issue for longer range scans (try --oltp-range-size=10000) and the default was a shorter range scan (--oltp-range-size=100). My first guess was mutex contention. There was an odd pattern in vmstat where the context switch rate alternated every second for MyRocks but was steady for InnoDB. Spikes in context switch rate sometimes mean mutex contention but I did not see that with PMP. What next?

The next guess is memory system contention but my debugging skills for that problem are weak. I have told myself many times this year that I need to refresh my skills. So I started with this blog post from Brendan Gregg and tried perf stat and found that InnoDB completed almost twice the number of instructions compared to MyRocks in the same time period. Why is IPC almost 2X better for InnoDB? Results from perf are here.

I then tried a different perf stat command to get other hardware perf counters and results are here. This also shows that InnoDB completed twice the number of instructions while both have a similar value for bus-cycles, so MyRocks uses 2X the number of bus-cycles per instruction. That can explain why it is slower. What are bus-cycles? Most of the documentation only explained this as [Hardware event] and without more details I can't look that up in Intel manuals. I asked internally and learned the magic code, 0x013c, that leads to more information. Start with this article (and consider subscribing to LWN, I do).
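The comparison is simple arithmetic on the perf stat counters. The Python sketch below uses made-up counter values chosen only to reproduce the pattern described (similar bus-cycles, 2X the instructions for InnoDB). They are not the values from my perf output, and the function and dictionary names are mine.

```python
def bus_cycles_per_instruction(bus_cycles: int, instructions: int) -> float:
    """Bus-cycles consumed per completed instruction."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return bus_cycles / instructions


# Hypothetical counters for the same time period, NOT real measurements.
innodb = {"instructions": 2_000_000, "bus-cycles": 100_000}
myrocks = {"instructions": 1_000_000, "bus-cycles": 100_000}

innodb_bpi = bus_cycles_per_instruction(innodb["bus-cycles"], innodb["instructions"])
myrocks_bpi = bus_cycles_per_instruction(myrocks["bus-cycles"], myrocks["instructions"])
print(f"MyRocks uses {myrocks_bpi / innodb_bpi:.1f}X the bus-cycles per instruction")
```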

The next step was to get call graphs when bus-cycles was incremented. I used the command below to find the offending code. Disabling that code fixes the problem, but work remains to make that code performant. InnoDB and MyRocks have similar code to count rows read and InnoDB doesn't fall over because of it. I want to make MyRocks not fall over because of it.

perf record -ag -p $( pidof mysqld ) -e bus-cycles -- sleep 10

Useful commands

I used all of these commands today:

perf stat -a sleep 10
perf stat -e cycles,instructions,cache-references,cache-misses,bus-cycles -a sleep 10
perf stat -e 'syscalls:sys_enter_*' -a sleep 10
perf stat -e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores -a sleep 10
perf stat -e dTLB-loads,dTLB-load-misses,dTLB-prefetch-misses -a sleep 10
perf stat -e LLC-loads,LLC-load-misses,LLC-stores,LLC-prefetches -a sleep 10
perf top -e raw_syscalls:sys_enter -ns comm
perf list --help
perf record -ag -p $( pidof mysqld ) -e bus-cycles -- sleep 10

Saturday, October 15, 2016

scons verbose command line

Hopefully I can find this blog post the next time I get stuck. How do you see command lines when building your favorite open source project? Try one of the variants below. I am sure this list will grow over time. The scons variant is my least favorite. I use too many tools for source configuration and compiling. I am barely competent with most of them, but it is easy to find answers for popular tools. I get to use scons with MongoDB. It is less fun searching for answers to problems with less popular tools.

make V=1
make VERBOSE=1
scons --debug=presub

Pagerank seems to be busted for scons. Top results are for too-old versions of scons. Top-ranked results usually tell you how to solve the problem with Python, but users aren't writing scons input files, we are doing things via the command line. At least with MongoDB's use of scons, the separator for construction variables is a space, not a colon. So do LIBS="lz4 zstd" but not LIBS="lz4:zstd".

This is my second scons inspired post. Just noticed my previous one.

Wednesday, October 12, 2016

MongoRocks and WiredTiger versus linkbench on a small server

I spent a lot of time evaluating open-source database engines over the past few years and WiredTiger has been one of my favorites. The engine and the team are excellent. I describe it as a copy-on-write-random (CoW-R) b-tree as defined in a previous post. WiredTiger also has a log-structured merge tree. It isn't officially supported in MongoDB. Fortunately we have MongoRocks if you want an LSM.

This one is long. Future reports will be shorter and reference this. My tl;dr for Linkbench with low concurrency on a small server:

I think there is something wrong in how WiredTiger uses zlib, at least in MongoDB 3.2.4
mmapV1 did better than I expected.
We can improve MongoRocks write efficiency and write throughput. The difference for write efficiency between MongoRocks and other MongoDB engines isn't as large as it is between MyRocks and other MySQL engines.

Update - I have a few followup tasks to do after speaking with WiredTiger and MongoRocks gurus. First, I will repeat tests using MongoDB 3.2.10. Second, I will use zlib and zlib-noraw compression for WiredTiger. Finally, I will run tests with and without the oplog to confirm whether the oplog hurts MongoRocks performance more than WiredTiger.

All about the algorithm

Until recently I have not been sharing my performance evaluations that compare MongoRocks with WiredTiger. In some cases MongoRocks performance is much better than WiredTiger and I want to explain those cases. There are two common reasons. First, WiredTiger is a new engine and there is more work to be done to improve performance. I see progress and I know more is coming. This takes time.

The second reason for differences is the database algorithm. An LSM and a B-Tree make different tradeoffs for read, write and space efficiency. See the Rum Conjecture for more details. In most cases an LSM should have better space and write efficiency while a B-Tree should have better read efficiency. But better write and space efficiency can enable better read efficiency. First, when less IO capacity is consumed for writing back database changes then more IO capacity is available for the storage reads done for user queries. Second, when less space is wasted for caching database blocks then the cache hit ratio is higher. I expect the second reason is more of an issue for InnoDB than for WiredTiger because WT does prefix encoding for indexes and should have less or no fragmentation for database pages in cache.

Page write-back is a hard feature to get right for a B-Tree. There will be dirty pages at the end of the buffer pool LRU and these pages must be written back as they approach the LRU tail. Things that need to read a page into the buffer pool will take a page from the LRU tail. If the page is still dirty at that point, the thing requesting the page will stall until the page has been written back. It took a long time to make this performant for InnoDB and work still remains. It will take more time to get this right for WiredTiger. Checkpoint and eviction are the steps by which dirty pages are written back for WT. While I am far from an expert on this I have filed several performance bugs and feature requests (and many of them have been fixed). One open problem is that checkpoint is still single threaded. This one thread must find dirty pages, compress them and then do buffered writes. When zlib is used then that is too much work for one thread. Even with a faster compression algorithm I think more threads are needed, and the cost of faster decompression is more space used for the database. Server-16736 is open as a feature request for this.

Test setup

I have three small servers at home. They used Ubuntu 14.04 at the time, but have since been upgraded to 16.04. Each is a core i3 with 2 CPUs, 4 HW threads and 8G of RAM. The storage is a 120G Samsung 850 EVO m.2 SSD for the database and a 7200 RPM disk for the OS. I like the NUC servers but my next cluster will use a better CPU (core i5) with more RAM.

The benchmark is Linkbench using LinkbenchX from Percona that has support for MongoDB. For WiredTiger and MongoRocks engines this doesn't use transactions to protect the multi-operation transactions. I look forward to multi-document transactions in a future MongoDB release. I use main from my Linkbench fork rather than from LinkbenchX to avoid the use of the feature to sustain a constant request rate because that has added too much CPU overhead in some tests.

I ran two tests. First, I used an in-memory database workload with maxid1=2M. Second, I used an IO-bound database with maxid1=40M. By IO-bound I mean that the database is larger than 8G but smaller than 120G and the SSD is very busy during the test. Both tests were run with 2 connections for loading and 1 connection (client) for the query tests. The query tests were run for 24 1-hour loops and the result from the 24th hour is shared. I provide results for performance, quality of service (QoS) and efficiency. Note that for the mmapv1 IO-bound test I had to use maxid1=20M rather than 40M to avoid a full storage device.

The oplog is enabled, sync-on-commit is disabled and WiredTiger/MongoRocks get 2G of RAM for cache. Tests were run with zlib and snappy compression. I reduced file system readahead from 128 to 16 for the mmapV1 engine tests. For MongoRocks I disabled compression for the smaller levels of the LSM. For the cached database, much more of the database is not compressed because of this. I limited the oplog to 2000MB. The full mongo.conf is at the end of this post.

I used MongoDB 3.2.4 to compare the MongoRocks, WiredTiger and mmapv1 engines. I will share more results soon for MongoDB 3.3.5 and I think results are similar. When MongoDB 3.4 is published I will repeat my tests and hope to include zstandard.

If you measure storage write rates and use iostat then be careful because iostat includes bytes trimmed as bytes written. If the filesystem is mounted with discard enabled and the database engine frequently deletes files (RocksDB does) then iostat might overstate bytes written. The results I share here have been corrected for that.
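The correction amounts to subtracting trimmed bytes before dividing by the operation count. This is a Python sketch under the assumption that the amount trimmed is known from some other source (for example the sizes of the files the engine deleted). The function name and the sample numbers are hypothetical.

```python
def corrected_wkb_per_op(iostat_kb_written: float, kb_trimmed: float,
                         operations: int) -> float:
    """KB written per operation after removing trimmed bytes that iostat counted as writes."""
    if operations <= 0:
        raise ValueError("operations must be positive")
    if not 0 <= kb_trimmed <= iostat_kb_written:
        raise ValueError("kb_trimmed must be between 0 and iostat_kb_written")
    return (iostat_kb_written - kb_trimmed) / operations


# Hypothetical run: iostat reports 1,000,000 KB written, 200,000 KB of that
# was discard (trim), and the test did 100,000 inserts.
uncorrected = corrected_wkb_per_op(1_000_000, 0, 100_000)
corrected = corrected_wkb_per_op(1_000_000, 200_000, 100_000)
print(uncorrected, corrected)
```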

Cached database load

These are the results for maxid1=2M. The database is cached for all engines except mmapV1.

Legend:

ips - average inserts/second
wKB/i - average KB written to storage per insert measured by iostat
Mcpu/i - CPU usecs/insert, measured by vmstat
Size - database size in GB at the end of the load
rss - mongod process size (RSS) in GB from ps at the end of the load
engine - rx.snap/rx.zlib is MongoRocks with snappy or zlib. wt.snap/wt.zlib is WiredTiger with snappy or zlib

Summary:

MongoRocks has the worst insert rate. Some of this is because more efficient writes can mean less efficient reads and the LSM does more key comparisons than a B-Tree when navigating the memtable. But I think that most of the reason is management of the oplog where there are optimizations we have yet to do for MongoRocks.
MongoRocks writes the most to storage per insert. See the previous bullet point.
MongoRocks and WiredTiger use a similar amount of space. Note that during the query test that follows the load the size of WT will be much larger than MongoRocks. As expected, the database is much larger with mmapV1.

ips    wKB/i   Mcpu/i   size   rss    engine
5359   4.81     6807     2.5   0.21   rx.snap
4876   4.82    10432     2.2   0.45   rx.zlib
8198   1.84     3361     2.7   1.82   wt.snap
7949   1.79     4149     2.1   1.98   wt.zlib
7936   1.64     3353    13.0   6.87   mmapV1

Cached database query

These are the results for maxid1=2M for the 24th 1-hour loop. The database is cached for all engines except mmapV1.

Legend:

tps - average transactions/second
wKB/t - average KB written to storage per transaction measured by iostat
Mcpu/t - CPU usecs/transaction, measured by vmstat
Size - database size in GB at test end
rss - mongod process size (RSS) in GB from ps at test end
un, gn, ul, gll - p99 response time in milliseconds for the most popular transactions: un is updateNode, gn is getNode, ul is updateList, gll is getLinkedList. See the Linkbench paper for details.
engine - rx.snap/rx.zlib is MongoRocks with snappy or zlib. wt.snap/wt.zlib is WiredTiger with snappy or zlib

Summary:

WiredTiger throughput is much worse with zlib than with snappy. I think the problem is that dirty page write back doesn't keep up because of the extra overhead from zlib compression. See above for my feature request for multi-threaded checkpoint. There is also a huge difference in the CPU overhead for WiredTiger with zlib compared to WT with snappy. That pattern does not repeat for MongoRocks. I wish I had looked at that more closely.
While WiredTiger and MongoRocks used a similar amount of space after the load, WT uses much more space after the query steps. I am not sure whether this is from live or dead versions of B-Tree pages.
Response time is better for MongoRocks than for WiredTiger. It is pretty good for mmapV1.
mmapV1 has the best throughput. I have been surprised by mmapV1 on several tests.
MongoRocks writes the least amount to storage per transaction.

tps r/t rKB/t wKB/t Mcpu/t size rss un gn ul gll engine
1741 0 0 2.72 16203 3.7 2.35 0.3 0.1 1 0.9 rx.snap
1592 0 0 2.66 19306 3.0 2.36 0.4 0.2 1 1 rx.zlib
1763 0 0 5.70 23687 4.9 2.52 0.4 0.1 2 1 wt.snap
933 0 0 8.94 81250 6.5 2.97 1 0.6 7 5 wt.zlib
1967 0.2 9.70 4.61 12048 20.0 4.87 0.9 0.7 1 1 mmapV1

IO-bound database load

These are the results for maxid1=40M for the 24th 1-hour loop. The database does not fit in cache. I used maxid1=20M for mmapV1 to avoid a full SSD. So tests for it ran with half the data.

The summary is the same as it was for the cached database and I think we can make MongoRocks a lot faster.

ips wKB/i Mcpu/i size rss engine
4896 7.11 8177 27 2.45 rx.snap
4436 6.67 11979 22 2.29 rx.zlib
7979 1.93 3526 29 2.20 wt.snap
7719 1.89 4330 24 2.30 wt.zlib
7612 1.85 3476 66 6.93 mmapV1, 20m

IO-bound database query

These are the results for maxid1=40M for the 24th 1-hour loop. The database does not fit in cache. I used maxid1=20M for mmapV1 to avoid a full SSD. So tests for it ran with half the data.

Summary:

Like the cached test, WiredTiger with zlib is much worse than with snappy. Most metrics are much worse for it. This isn't just zlib. I wonder if there is a bug in the way WT uses zlib.
Throughput continues to be better than I expected for mmapV1, but it has started to do more disk reads per transaction. It uses about 2X the space of the other engines for half the data.
MongoRocks provides the best efficiency with performance comparable to other engines. This is the desired result.

tps r/t rKB/t wKB/t Mcpu/t size rss un gn ul gll engine
1272 1.22 12.82 4.02 17475 29 2.56 0.8 0.5 2 1 rx.snap
1075 1.03 10.44 4.56 25223 23 2.53 0.9 0.6 2 2 rx.zlib
1037 1.24 17.23 11.60 45335 34 2.69 1 1 3 2 wt.snap
446 1.21 23.61 18.00 151628 33 3.38 13 11 21 18 wt.zlib
1261 2.43 34.77 5.28 13357 72 2.05 0.9 0.5 3 2 mmapV1, 20m

mongo.conf

This is the full mongo.conf for zlib. It needs to be edited to enable snappy.

processManagement:
  fork: true
systemLog:
  destination: file
  path: /path/to/log
  logAppend: true
storage:
  syncPeriodSecs: 60
  dbPath: /path/to/data
  journal:
    enabled: true
  mmapv1:
    journal:
      commitIntervalMs: 100
operationProfiling.slowOpThresholdMs: 2000
replication.oplogSizeMB: 2000
storage.wiredTiger.collectionConfig.blockCompressor: zlib
storage.wiredTiger.engineConfig.journalCompressor: none
storage.wiredTiger.engineConfig.cacheSizeGB: 2
storage.rocksdb.cacheSizeGB: 2
storage.rocksdb.configString: "compression_per_level=kNoCompression:kNoCompression:kNoCompression:kZlibCompression:kZlibCompression:kZlibCompression:kZlibCompression;compression_opts=-14:1:0;"

Tuesday, October 11, 2016

Making the case for MyRocks. It is all about efficiency.

I had two talks at Percona Live - one on MyRocks and another on web-scale. The talk links include the slides, but slides lose a lot of context. But first, the big news is that MyRocks will appear in MariaDB Server and Percona Server. I think MyRocks is great for the community and getting it into supported distributions makes it usable.

Efficiency is the reason for MyRocks. The RUM Conjecture explains the case in detail. The summary is that MyRocks has the best space efficiency, better write efficiency and good read efficiency compared to other storage engines for MySQL. The same is true of MongoRocks and MongoDB storage engines. Better efficiency is a big deal. Better compression means you can use less SSD. Better write efficiency means you get better SSD endurance or that you can switch from MLC to TLC NAND flash. Better write efficiency also means that more IO capacity will be available to handle reads from user queries.

But performance in practice has nuance that theory can miss. While I expect read performance to suffer with MyRocks compared to InnoDB, I usually don't see that when evaluating production and benchmark workloads. I spent most of this year doing performance evaluations for MyRocks and MongoRocks. I haven't shared much beyond summaries. I expect to share a lot in the future.

I prefer to not write about performance in isolation. I want to write about performance, quality of service and efficiency. By performance I usually mean peak or average throughput under realistic conditions. By quality of service I mean the nth (95th, 99th) percentile response time for queries and transactions. By efficiency I mean the amount of hardware (CPU time, disk reads, disk KB written, disk space) consumed. I have frequently written about performance in isolation in the past. I promise to do that less frequently in the future.

My other goal is to explain the performance that I measure. This is hard to do. I define benchmarketing as the use of unexplained performance results to claim that one product is better than another. While I am likely to do some benchmarketing for MyRocks I will also provide proper benchmarks where results are explained and details on quality of service and efficiency are included.

Let me end this with benchmarking and benchmarketing. For benchmarking I have a result from Linkbench on a small server: Intel 5th generation core i3 NUC, 4 HW threads, 8G RAM, Samsung 850 EVO SSD. The result here is typical of results from many tests I have done. MySQL does better than MongoDB, MyRocks does better than InnoDB and MongoRocks does better than WiredTiger. MyRocks and MongoRocks have better QoS based on the p99 update time in milliseconds. The hardware efficiency metrics explain why MyRocks and MongoRocks have more throughput (TPS is transactions/second). M*Rocks does fewer disk reads per transaction (iostat r/t), writes less to disk per transaction (iostat wKB/t) and uses less space on disk (size GB). It uses more CPU time per transaction than uncompressed InnoDB. That is the price of better compression. Why it has better hardware efficiency is a topic for another post and conference talk.

For benchmarketing I have a result from read-only sysbench for an in-memory database. MyRocks matches InnoDB at low and mid concurrency and does better at high-concurrency. This is a workload (read-only & in-memory) that favors InnoDB.

Friday, October 7, 2016

MyRocks, MongoRocks, RocksDB and Mr. Mime

The big news is that MyRocks will arrive in proper distributions with expert support (Percona Server and MariaDB Server). This is a big deal for me as it helps make MyRocks better faster and it gives you a chance to evaluate MyRocks. Earlier this year Percona announced support for MongoRocks.

After two weeks in Europe (Dublin, London, Amsterdam) I have yet to catch or even encounter Mr. Mime, the Europe-only Pokemon Go character. So I will return in November to spend more time searching for Mr. Mime and speak at CodeMesh in London and HighLoad++ in Moscow.

codemesh.io - I look forward to attending as many talks as possible at CodeMesh. I speak on November 4 on the relationship between performance and efficiency. I think the RUM Conjecture makes it easier to understand the choices between database engines which matters more given that we aren't limited to update-in-place b-trees today. Contact me directly if you want a discount code for a CodeMesh ticket.
HighLoad++ - I visit Moscow to speak at HighLoad++, explain the case for MyRocks and MongoRocks and learn more about Tarantool, one of my favorite projects which is also in the process of adding a write-optimized database engine (Vinyl).

Tuesday, September 20, 2016

MyRocks and InnoDB with large objects and compression

I ran tests to explain the difference between MyRocks and InnoDB when storing large objects and data with varying amounts of compressibility.

Compression in MyRocks is simpler than in InnoDB. You should expect the database to use about 1.1X the size of the compressed output. When rows compress to 60% of their original size and are 10kb before compression, then each row should use about 6.6kb in the database. The 1.1X adjustment is for space-amplification from leveled compaction.
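The arithmetic above can be written as a small helper. This is only a sketch: the function name and parameters are made up for illustration, and the 1.1X factor is the space-amplification estimate from the text rather than something MyRocks reports.

```python
def estimate_myrocks_row_kb(row_kb, compressed_fraction, space_amp=1.1):
    """Estimate on-disk KB per row.

    row_kb: row size before compression
    compressed_fraction: compressed size / original size (0.60 means 60%)
    space_amp: assumed space-amplification from leveled compaction
    """
    return row_kb * compressed_fraction * space_amp


# The example from the text: 10kb rows that compress to 60% of their size.
print(round(estimate_myrocks_row_kb(10, 0.60), 2))
```

The same helper makes it easy to see how the estimate moves when the data is more or less compressible.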

Predicting the space used for InnoDB is harder. First, large LOB columns are not stored inline and overflow pages are not shared. Second, disk pages have a fixed size and you risk using too much space or getting too many page splits when searching for a good value for key_block_size. More details are here.

I ran two tests for two types of data. The first test is an insert only workload in PK-order for data with varying amounts of compressibility. The second test determined how fast point queries could be done on that data while rate-limited inserts were in progress. By varying amounts of compressibility I mean that there was one large varchar column per row and that 20%, 45%, 75% or 95% of the data in the column was random and the remainder was constant and easily compressed. Both tests used one connection for inserts. The query test also used one connection for queries.
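To make "varying amounts of compressibility" concrete, here is a rough sketch of how such a column value could be built and how well zlib level 6 compresses it. This is not the test client used for these results. The helper names, the choice of random ASCII letters, the constant filler character and the 15,000 character length are all assumptions for illustration.

```python
import random
import string
import zlib


def make_column(length, pct_rand, rng):
    """Build a string where pct_rand% of the characters are random letters
    and the remainder is a constant, easily compressed filler."""
    n_rand = length * pct_rand // 100
    rand_part = "".join(rng.choice(string.ascii_letters) for _ in range(n_rand))
    const_part = "a" * (length - n_rand)
    return rand_part + const_part


def compressed_fraction(text, level=6):
    """Return compressed size / original size using zlib at the given level."""
    raw = text.encode("ascii")
    return len(zlib.compress(raw, level)) / len(raw)


rng = random.Random(42)  # fixed seed so repeated runs produce the same data
fractions = {
    pct: compressed_fraction(make_column(15000, pct, rng))
    for pct in (20, 45, 75, 95)
}
for pct_rand, frac in fractions.items():
    print(pct_rand, round(frac, 2))
```

The compressed fraction should grow with pctRand, which is the property the tests depend on.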

The test pattern was run twice. In both cases the large column was a varchar. In the first case it had a length between 10,000 and 20,000 characters. In the second case it had a length between 100 and 1000 characters. The database block size was 16kb for MyRocks and InnoDB.

Insert only

For the insert-only workload the space used for MyRocks can be predicted from the compressibility of the data. That is much less true for InnoDB. For example compressed InnoDB uses about the same amount of space for pctRand in 20, 45 and 75.

MyRocks used the least amount of space. InnoDB used much more space when the column was larger (10,000 to 20,000 vs 100 to 1000). Overflow pages are the root cause.

The insert rates are better for MyRocks than for InnoDB. They were also stable for MyRocks and uncompressed InnoDB independent of the compressibility. Rates for uncompressed InnoDB are better than compressed InnoDB. While this wasn't a performance benchmark, it matches many other results I get. It is hard to get performance and compression from InnoDB. The CPU overhead per insert was similar between MyRocks and uncompressed InnoDB. CPU overheads were mostly larger for compressed InnoDB.

Legend for the data:

ips - inserts per second
size - database size in GB at test end
Mcpu - microseconds of CPU per insert
pctRand - percentage of random data in large column
engine - rx.zlib-6 is MyRocks with zlib level 6 compression. i6n is InnoDB in MySQL 5.6.26 without compression. i6c is InnoDB in MySQL 5.6.26 with compression.

column up to 20,000 column up to 1000
ips size Mcpu ips size Mcpu pctRand engine
5489 7.7 1090 34468 11 151 20 rx.zlib-6
5540 16 1127 34824 19 149 45
5532 24 1307 34517 27 166 75
5523 30 1467 34701 33 160 95

ips size Mcpu ips size Mcpu pctRand engine
3995 87 933 23470 66 173 20 i6n
3981 87 928 23704 66 174 45
3981 86 917 23487 66 175 75
3995 88 914 23658 66 176 95

ips size Mcpu ips size Mcpu pctRand engine
3339 36 1064 13429 33 262 20 i6c
2779 32 1278 13124 33 271 45
2133 35 1750 8767 30 392 75
1757 50 2061 7228 38 461 95

Point queries

MyRocks provides the best compression, the best query throughput, and the least CPU overhead per query. My conclusions for InnoDB space consumption are similar to the results from the insert-only workload.

Legend for the data:

qps - queries per second
size - database size in GB at test end
Mcpu - microseconds of CPU per query
pctRand - percentage of random data in large column
engine - rx.zlib-6 is MyRocks with zlib level 6 compression. i6n is InnoDB in MySQL 5.6.26 without compression. i6c is InnoDB in MySQL 5.6.26 with compression.

qps size Mcpu qps size Mcpu pctRand engine
984 9.3 4308 2214 11 1585 20 rx.zlib-6
910 19 4532 2113 19 1627 45
846 30 4952 2102 27 1601 75
795 37 5598 2051 33 1691 95

qps size Mcpu qps size Mcpu pctRand engine
628 113 6240 1302 62 2527 20 i6n
624 110 6226 1300 63 2501 45
624 114 6312 1302 63 2536 75
628 115 6218 1305 66 2474 95

qps size Mcpu qps size Mcpu pctRand engine
708 38 5560 770 34 4450 20 i6c
629 39 6643 687 34 4895 45
513 44 8494 589 30 6046 75
418 57 10619 576 39 6599 95

Thursday, September 15, 2016

Peak benchmarketing season for MySQL

Maybe this is my XKCD week. With Oracle Open World and Percona Live Amsterdam we are approaching peak benchmarketing season for MySQL. I still remember when MySQL 4.0 was limited to about 10k QPS on 4 and 8 core servers back around 2005, so the 1M QPS results we see today are a reminder of the great progress that has been made thanks to investments by upstream and the community.

In General

But getting 1.5M QPS today compared to 1M QPS last year isn't at the top of the list for many (potential) users of MySQL. I use performance, usability, manageability, availability and efficiency to explain what matters for web-scale DBMS users. My joke is that each of these makes a different group happy: performance -> marketing, usability -> developers, manageability -> operations, availability -> end users, efficiency -> management.

The benchmarketing results mostly focus on performance. Whether InnoDB does a bit more QPS than Amazon Aurora isn't going to make Aurora less popular. Aurora might have excellent performance but I assume people are deploying it for other reasons. I hope we make it easier to market usability, manageability, availability and efficiency in the MySQL community. MongoDB has gone a long way by marketing and then delivering usability and manageability.

Even when limited to performance we need to share more than peak QPS. Efficiency and quality-of-service (QoS) are equally important. QPS without regard to response time is frequently a bogus metric. I get more IOPs from a disk by using a too large queue depth. But more IOPs at the cost of 100 millisecond disk read response times is an expensive compromise. Even when great QPS is accompanied by a good average response time I want to know if there is lousy QoS from frequent stalls leading to lousy 99th percentile response times. Percona has built their business in part by being excellent at documenting and reducing stalls in InnoDB that occur on benchmarks and real workloads.

I have been guilty of sharing too many benchmark reports in the past that ignored efficiency and QoS. I have been trying to change that this year and hope that other providers of MySQL performance results do the same. This is an example of a result that includes performance, efficiency and QoS.

MyRocks and RocksDB

A lot of the RocksDB marketing message has been about performance. Database access is faster with an embedded database than client/server because you avoid network latency. The MyRocks message has been about efficiency. The target has been better compression and less write amplification than InnoDB so you can use less SSD and lower-endurance SSD. For a workload I care about we see 2X better compression and 1/10 the write rate to storage. This is a big deal.

When starting the project we had many discussions about the amount of performance loss (reduced QPS, higher response time) we could tolerate to get more efficiency. While we were vague the initial goal was to get similar QPS and response time to InnoDB for real workloads, but we were willing to accept some regressions. It turned out that there was no regression and similar performance with much better efficiency is a big deal.

But benchmarks aren't real workloads and there will soon be more benchmark results. Some of these will repeat what I have claimed, others will not. I don't expect to respond to every result that doesn't match my expectations. I will consult when possible.

One last disclaimer. If you care about read-mostly/in-memory workloads then InnoDB is probably an excellent choice. MyRocks can still be faster than InnoDB for in-memory workloads. That is more likely when the bottleneck for InnoDB is page write-back performance. So write-heavy/in-memory can still be a winner for MyRocks.

Seriously, this is the last disclaimer. While we are bickering about benchmark results others are focusing on usability and manageability and getting all of the new deployments.

Wednesday, September 14, 2016

Excited about Percona Live Amsterdam

I look forward to attending and speaking at Percona Live Amsterdam. There are presentations and people with expertise on MySQL, MongoDB and PostgreSQL. This is a great chance to learn from people with different experience and different expertise. I hope you make the most of it.

A great way to learn is to ask questions like:

Why are you using X?
What problems are you trying to solve?

A great way to end the conversation early and avoid learning anything is to ask questions like:

Why aren't you using Y?
You should be using Z! (this isn't a question)

zlib vs zstd for MyRocks running Linkbench

I used an IO-heavy configuration to determine the impact of zstandard vs zlib compression for MyRocks. There was about 1 read from SSD per transaction and decompression is done after each page read from the OS page cache and storage.

The results are impressive. Zstandard compresses like zlib level 1 but uses much less CPU.

zstandard reduces CPU by 45% vs zlib level 1 for the load test
zstandard reduces CPU by 11% vs zlib level 1 for the query test
zstandard gets 8% more TPS vs zlib level 1 for the query test
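The three percentages above follow from the Mcpu and tps columns in the result tables for this post. A short sketch of the arithmetic, where the dictionaries simply copy the zstd and zlib-1 values from those tables and pct_change is a helper named for illustration:

```python
# Values copied from the load and query result tables in this post.
load_cpu = {"zstd": 81, "zlib-1": 146}          # Mcpu/i for the load
query_cpu = {"zstd": 878, "zlib-1": 992}        # Mcpu/t for the 24th hour
query_tps = {"zstd": 39366, "zlib-1": 36524}    # tps for the 24th hour


def pct_change(new, base):
    """Percent change of new relative to base. Negative means a reduction."""
    return 100.0 * (new - base) / base


print(f"load CPU:  {pct_change(load_cpu['zstd'], load_cpu['zlib-1']):.0f}%")    # -45%
print(f"query CPU: {pct_change(query_cpu['zstd'], query_cpu['zlib-1']):.0f}%")  # -11%
print(f"query TPS: {pct_change(query_tps['zstd'], query_tps['zlib-1']):.0f}%")  # 8%
```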

Configuration

Configuration for MyRocks is still complex. The templates for the MyRocks my.cnf files for Linkbench and general usage are explained on the wiki. I used no compression for L0, L1, L2, then lz4 for all but the max level and then one of zlib level 1, zlib level 6 or zstd for the max level. The tests used an Aug5 build of MyRocks, so this used kZSTDNotFinalCompression as the build preceded the 1.0 release of zstandard.

The test host has 50G of RAM available to userland, fast storage (5TB of NVMe MLC) and 24 CPU cores with 48 HW threads. The RocksDB block cache was set to 10G, the binlog was enabled but sync-on-commit was disabled for the binlog and RocksDB. Linkbench is run with maxid1=1B, the load test uses 2 clients and the query tests use 16 clients. Query tests are run as 24 1-hour loops and I report metrics from the 24th hour. I used my branch of linkbench and support scripts.

Results

The results for zstandard are impressive. I look forward to using this in production. Thanks Yann.

Legend:

ips/tps - inserts & transactions per second
r/i, r/t - iostat reads per insert and per transaction
wKB/i, wKB/t - iostat KB written per insert and per transaction
Mcpu/i, Mcpu/t - usecs of CPU time per insert and per transaction
size - database size in GB
rss - mysqld RSS size in GB
un, gn, ul, gl - p99 response time in milliseconds for the most frequent transactions (Update Node, Get Node, Update Link, Get Link List)

Results for the load

ips r/i rKB/i wKB/i Mcpu/i size rss engine
61543 0 0 0.98 81 324 3.1 zstd
61504 0 0 0.98 146 331 2.0 zlib-1
61457 0 0 0.97 153 312 2.2 zlib-6

Results for the 24th hour of the query test

tps r/t rKB/t wKB/t Mcpu/t size rss un gn ul gl engine
39366 1.00 10.38 2.36 878 377 12.2 0.6 0.6 1 0.8 zstd
36524 1.00 10.47 2.45 992 381 12.1 0.7 0.6 1 0.9 zlib-1
37233 0.97 9.76 2.30 1002 360 12.0 0.7 0.7 1 0.9 zlib-6

Wednesday, September 7, 2016

Tuning the RocksDB block cache

I spent many years using InnoDB with direct IO and I didn't worry about buffered IO performance. Well, I didn't worry until Domas told me to worry. My focus has switched to RocksDB and now I worry about buffered IO performance. Fortunately, another co-worker (Jens Axboe) promises to make buffered writeback much better.

With direct IO, InnoDB stores compressed and uncompressed pages in the InnoDB buffer pool. It has a clever algorithm to determine how much memory to use for each based on whether the workload appears to be IO or CPU bound. My vague memory is that we tune my.cnf to keep it from being too clever.

With buffered IO, RocksDB manages a block cache for uncompressed blocks and then depends on the OS page cache for compressed blocks. While I think there is an opportunity to be more efficient in that area, that is not the topic for today.

The question today is how to divide memory between the RocksDB block cache and the OS page cache. I have read tuning advice for other buffered IO databases that suggests giving as much RAM as possible to the database. I disagree and my advice is:

If the uncompressed working set fits in the RocksDB block cache then give as much RAM as possible to the block cache.
Else if the compressed working set fits in the OS page cache then give most RAM to the OS page cache by using a small RocksDB block cache.
Else give the RocksDB block cache about 20% of host RAM.

This is a rule of thumb. Sometimes in rule 3 I suggest giving 25% or 30% to the block cache, but I hope you get the point. The goal is to avoid reads from storage by caching more data in RAM. I assume that decompressing a block is much faster than reading it from storage which is more likely when you use zstandard.
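The three rules can be sketched as a function. This is only an illustration of the rule of thumb and not a tuning tool: the function name is made up, and the fractions used for "as much RAM as possible" (75%) and "a small block cache" (5%) are assumptions that the text does not specify. Only the 20% default comes from rule 3.

```python
def suggest_block_cache_gb(host_ram_gb, uncompressed_ws_gb, compressed_ws_gb,
                           max_fraction=0.75,      # assumed "as much as possible"
                           small_fraction=0.05,    # assumed "small block cache"
                           default_fraction=0.20): # rule 3 from the text
    """Return a suggested RocksDB block cache size in GB."""
    # Rule 1: the uncompressed working set fits in the block cache.
    largest_cache = host_ram_gb * max_fraction
    if uncompressed_ws_gb <= largest_cache:
        return largest_cache

    # Rule 2: the compressed working set fits in the OS page cache
    # once the block cache is kept small.
    small_cache = host_ram_gb * small_fraction
    if compressed_ws_gb <= host_ram_gb - small_cache:
        return small_cache

    # Rule 3: neither fits, so split RAM and favor the OS page cache.
    return host_ram_gb * default_fraction


# Hypothetical working set sizes on a 50G host.
print(suggest_block_cache_gb(50, 20, 10))    # rule 1 applies
print(suggest_block_cache_gb(50, 100, 30))   # rule 2 applies
print(suggest_block_cache_gb(50, 700, 350))  # rule 3 applies
```

Estimating the working set is the hard part, so the inputs here are guesses you have to supply.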

This isn't proven unless you accept proof by anecdote. I ran a test with Linkbench on a host with 50G of RAM and a ~350G database. The test was repeated with the RocksDB block cache set to 5G, 10G, 20G and 35G. Using a smaller block cache reduced the storage read cost per transaction by between 10% and 20% using iostat r/s and iostat rKB/s. My advice might not work for you, but might help you to consider your choices before following tuning advice you read on the web.

Tuesday, August 23, 2016

Web-scale MySQL back in the day

I spent a few years at Facebook where I was extremely busy helping to make MySQL better at web-scale. I worked a lot with Domas. He found so many problems and I helped fix them along with a few others (the MySQL db-eng team was small). Domas made it easy to understand what was broken and there was a lot of low-hanging fruit. This slide deck is one perspective on what we did. I doubt I have the energy to go through another few years like that, but it was a great time. The timing was also right as there were many people at Oracle/MySQL pushing to make MySQL scale on modern hardware.

Wednesday, June 8, 2016

RocksDB, InnoDB and the insert benchmark

Today I share results for RocksDB and InnoDB using MySQL 5.6 and the insert benchmark. I have many more results that I am currently sharing with engine vendors to help them improve (WiredTiger) or to show them the potential of RocksDB. Percona has embraced MongoRocks. I hope to see similar support for MyRocks in a widely used MySQL distribution.

tl;dr - RocksDB did better than InnoDB for a write-heavy workload and a range-scan heavy workload. The former is expected, the latter is a welcome surprise. I hope we get RocksDB into a supported MySQL distribution to make it easier for you to try it.

I have been running several database benchmarks on a variety of hardware, storage, database engines and configurations. The benchmarks include the insert benchmark, linkbench and sysbench. The hardware includes dual-socket, many-core servers with 144G or more of RAM and a 1-socket, few-core server. The few-core server has 2 cores, 4 HW threads and is an Intel NUC 5i3ryh with 8G of RAM and a Samsung m.2 SSD. The storage includes m.2 SSD (Samsung 850 EVO 110G), older PCIe SSD and newer NVMe SSD. The engines include RocksDB, InnoDB, MyISAM and TokuDB for MySQL and RocksDB, WiredTiger and mmapv1 for MongoDB. The configurations include databases that fit in cache and databases that do not. Even when the database fits in cache the workload can be IO-bound from the writes required to persist a database.

Graphs

If you just want the graphs here they are.

Test pattern

I used the small server (Intel NUC, 2 cores, 8G RAM, m.2 Samsung SSD) for this test.

The workload is the insert benchmark using the pattern described below. I use this pattern to include a mix of workloads (write-only, reads with write-heavy, reads without write-heavy). I don't include a read-only step. The mixed workload (reads + writes) is very important especially when the database is larger than RAM:

1. Insert 250M rows into a table in PK order. The table has 3 secondary indexes and the inserts are in random order for each of the secondary indexes. Measure insert performance.
2. Insert 5M rows into the table with the writer limited to 1000/second and measure query performance for short range queries. The range queries use LIMIT 10 and randomly select the index (secondary or PK) and the starting offset for the range. The index is covering for the query. Assuming the writer is able to do 1000/second this takes 5000 seconds. For this step I care about throughput and response time for reads. For writes my main concern is whether the writer was able to sustain the rate of 1000/second.
3. Same as #2 except 500k rows are inserted and the writer is limited to 100/second. Assuming the writer can do 100/second this also takes 5000 seconds.

Performance & Efficiency

I am interested in performance and efficiency. I include both. Sometimes better efficiency explains better performance. Sometimes we need better efficiency even if that comes at the cost of decreased performance.

The metrics I use for performance are throughput and response time. For performance I measure the average rate for queries and inserts. I also want to determine whether there are stalls in throughput so the benchmark client reports average throughput per interval (~10 seconds each) and at test end I compute the 50th, 75th, 90th, 95th and 99th percentile per-interval throughput. For many tests including the insert benchmark it is even better to show throughput over time as some database engines suffer from stalls or throughput degrades. Some of that is visible via the p95 and p99 throughput metric.
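A sketch of how per-interval percentile throughput could be computed. I am assuming a nearest-rank percentile over rates sorted from best to worst, so that p99 is the rate that 99% of intervals met or exceeded. That reading is consistent with result tables where the p99 rate is lower than the p75 rate, but the benchmark client may compute it differently. The function name and the sample rates are made up.

```python
def throughput_percentile(rates, pct):
    """Nearest-rank percentile over per-interval rates sorted best to worst.

    pct is an integer such as 50, 75, 90, 95 or 99. The result is the rate
    that at least pct% of the intervals met or exceeded.
    """
    ordered = sorted(rates, reverse=True)
    # Integer ceiling of pct * n / 100 avoids floating point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


# Hypothetical per-interval insert rates with two stalls.
rates = [1000, 980, 990, 1010, 400, 995, 1005, 985, 970, 120]
for pct in (50, 75, 90, 95, 99):
    print(pct, throughput_percentile(rates, pct))
```

With these sample rates the median is near the typical rate, while p90 and above expose the stalled intervals.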

For efficiency I measure HW consumed per operation, including CPU microseconds, storage reads and storage KB read/written. For efficiency I also include the database size and the RSS of the database process. I don't distinguish storage reads and writes done for user inserts versus those done for user queries. If the storage read rate is 200/second, the query rate is 10/second and the insert rate is 5/second then the metrics below will show 20 reads/query and 40 reads/insert.
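The normalization in that example is a division of the same device-level rate by each operation rate. A minimal sketch, with a helper name chosen for illustration:

```python
def per_operation(device_rate_per_sec, ops_per_sec):
    """Normalize a device-level rate (from iostat or vmstat) by an operation rate."""
    return device_rate_per_sec / ops_per_sec


storage_reads_per_sec = 200
queries_per_sec = 10
inserts_per_sec = 5

# The same storage read rate is charged in full to both queries and inserts.
print(per_operation(storage_reads_per_sec, queries_per_sec))  # 20.0 reads/query
print(per_operation(storage_reads_per_sec, inserts_per_sec))  # 40.0 reads/insert
```

Because reads are not attributed to the operation that caused them, the per-query and per-insert values overlap rather than sum to the total.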

I usually ignore storage writes for 2 reasons. First, the penalty for random writes is small with SSD. Endurance is still an issue and KB written captures that. Second, it is hard to distinguish between slow and fast write operations. Some write operations are fast because they are mostly sequential or are small log writes that benefit from battery backed write cache on a RAID device.

I use vmstat and iostat to collect efficiency metrics. Note that Linux in mid-2016 still counts trim as bytes written. So with a database engine like RocksDB that deletes files as frequently as it creates them in the steady state and a kernel+filesystem+device setup to do trim on file delete then KB written as reported by iostat will be twice the real rate. You can confirm this by comparing iostat metrics with metrics reported by your SSD.
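A sketch of the correction for trim, assuming the steady-state case described above where iostat reports exactly twice the real write rate. The function name and the sample numbers are hypothetical, and the 2X factor should be confirmed against the metrics reported by your SSD.

```python
def corrected_wkb_per_insert(iostat_wkb_per_sec, inserts_per_sec,
                             trim_counted_as_write=True):
    """Return KB written per insert, halved when trim is counted as a write.

    The factor of 2 assumes every file that is written is later deleted
    and trimmed, so each byte is counted once for the write and once
    for the trim.
    """
    wkb_per_insert = iostat_wkb_per_sec / inserts_per_sec
    if trim_counted_as_write:
        return wkb_per_insert / 2
    return wkb_per_insert


# Hypothetical numbers: iostat shows 40,000 KB/s written at 10,000 inserts/s.
print(corrected_wkb_per_insert(40000, 10000))         # 2.0
print(corrected_wkb_per_insert(40000, 10000, False))  # 4.0
```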

Configuration

I used a 2G block cache for RocksDB which uses the rest of RAM for compressed blocks in the OS page cache. I used a 6G buffer pool for InnoDB which used direct IO.

For RocksDB I used Facebook MySQL with the git hash 2406e9. For InnoDB I used 5.6.29 and 5.7.10. All mysqld builds used jemalloc. The binlog was enabled for all tests but sync-on-commit was disabled for the database engine and the binlog. The server runs Ubuntu 14.04.

MyRocks is changing fast. I used a build from May 7 and since then several performance bugs have been fixed. My my.cnf settings for RocksDB are also stale compared to the latest & greatest values suggested by Yoshinori. It is fun to work on a project that is improving so quickly.

The my.cnf files for RocksDB, InnoDB in MySQL 5.6, InnoDB in MySQL 5.7.

Results for insert-only

Summary:

InnoDB writes between 20X and 30X more per insert than RocksDB
RocksDB does between 4X and 10X more inserts/second. The difference increases as the database:RAM ratio increases because the insert rate for RocksDB is almost flat over time while InnoDB drops dramatically once the database gets larger than RAM.

legend:

* ips.av - average insert rate
* ips.75, ips.95, ips.99 - 75th, 95th and 99th percentile insert rate over the per-interval (~10 seconds) average rates
* r/i - storage reads per insert via iostat
* rkb/i, wkb/i - storage KB read and written per insert via iostat. iostat overstates the wkb/i result for RocksDB by 2X because it counts trim as bytes written and the results here have been corrected.
* Mcpu/i - usecs CPU per insert via vmstat
* size - database size in GB at test end
* rss - mysqld RSS in GB at test end
* engine - format is $e.$comp[.skip] and .skip means that unique constraints are not checked (only on RocksDB). For $e, "rx" is RocksDB, "i7" is InnoDB in MySQL 5.7.10 and "i6" is InnoDB in MySQL 5.6.29. For $comp "snap" is snappy, "zlib" is zlib level 1, "zlib-6" is zlib level 6 and "none" is no compression.

ips.av ips.75 ips.95 ips.99 r/i rkb/i wkb/i Mcpu/i size rss engine
23129 24515 12276 9515 0.02 0.25 1.55 3354 19g 2.47 rx.snap.skip
18177 18356 7283 6245 0.02 0.23 1.46 3805 17g 2.52 rx.zlib.skip
17705 15976 15282 14766 0.02 0.25 1.75 3944 18g 2.59 rx.snap
15357 14945 11532 8114 0.02 0.23 1.57 4920 16g 2.67 rx.zlib
12800 14557 5681 4694 0.01 0.19 1.46 5433 16g 2.53 rx.zlib-6
12008 10525 9355 9007 0.02 0.41 2.01 4585 27g 2.64 rx.none
3811 2854 2607 2533 0.06 1.76 30.69 2715 42g 6.98 i7.none
2699 1995 1497 1281 0.29 5.97 36.17 4565 42g 6.78 i6.none
2120 1619 1338 1278 0.09 0.79 51.70 11142 23g 7.06 i7.zlib
2121 1604 1335 1275 0.09 0.81 51.34 11137 23g 6.85 i6.zlib

Results with inserts limited to 1000/second

RocksDB gets more QPS because it does fewer reads from storage per query. That is expected for uncompressed InnoDB because that database is about 2X the size compared to RocksDB. But the compressed InnoDB database isn't much larger than RocksDB so it should be able to keep a similar amount of hot data in cache. I wonder if key prefix compression as done by RocksDB allows it to get a better cache hit rate.

RocksDB also benefits from not doing reads for non-unique secondary index pages during index maintenance. That is a write-only operation for RocksDB.

legend:

* ips.av, qps.av - average insert and query rates
* ips.99, qps.99 - 99th percentile insert and query rates over the per-interval average rates
* r/q - storage reads per query via iostat
* rkb/q, wkb/q - storage KB read and written per query
* Mcpu/q - usecs CPU per query via vmstat
* size - database size in GB at test end
* rss - mysqld RSS in GB at test end
* engine - format is $e.$comp[.skip] and .skip means that unique constraints are not checked (only on RocksDB). For $e, "rx" is RocksDB, "i7" is InnoDB in MySQL 5.7.10 and "i6" is InnoDB in MySQL 5.6.29. For $comp "snap" is snappy, "zlib" is zlib level 1, "zlib-6" is zlib level 6 and "none" is no compression.

ips.av ips.99 qps.av qps.99 r/q rkb/q Mcpu/q size rss engine
998 997 2877 1579 0.27 3.40 12190 19g 2.53 rx.snap
998 997 2635 1526 0.22 2.32 11544 16g 2.34 rx.zlib
998 997 2652 1531 0.23 2.49 12396 15g 2.45 rx.zlib-6
999 961 256 10 6.68 123.54 29334 42g 6.98 i7.none
999 966 256 12 6.86 128.08 42332 42g 6.79 i6.none
999 967 218 173 6.86 58.00 110277 23g 7.07 i7.zlib
999 966 210 161 6.71 59.37 113406 23g 6.85 i6.zlib

Results with inserts limited to 100/second

InnoDB still does many more storage reads per user request and storage reads done for secondary index maintenance don't explain why RocksDB is more efficient.

ips.av ips.99 qps.av qps.99 r/q rkb/q Mcpu/q size rss engine
100 100 3059 2339 0.22 2.59 9754 19g 2.41 rx.snap
100 100 2773 2312 0.18 1.61 9051 15g 2.33 rx.zlib
100 100 2899 2158 0.18 1.60 8329 15g 2.42 rx.zlib-6
100 100 728 438 5.78 93.25 11258 42g 6.98 i7.none
100 100 637 463 5.82 94.04 16334 42g 6.79 i6.none
100 100 456 439 4.64 37.10 42641 23g 7.08 i7.zlib
100 100 457 354 4.72 37.98 43306 23g 6.85 i6.zlib