Thursday, January 14, 2016

RocksDB vs InnoDB via Linkbench : performance and efficiency

Compared to InnoDB, MyRocks can reduce by half the hardware, or at least the storage hardware, required to run Linkbench. That is kind of a big deal.

A significant performance problem was recently fixed in MyRocks courtesy of the SingleDelete optimization. With this optimization RocksDB removes tombstones faster so that queries encounter fewer tombstones and waste less time on them. We hope to get the same feature into MongoRocks. I have been waiting a few months for this change and started another round of Linkbench tests when it arrived.

Performance and efficiency for MyRocks look great relative to InnoDB. We are far from done but I am amazed we reached this state so fast. The performance summary from my recent tests with IO-bound Linkbench and PCIe flash:
  • Uncompressed InnoDB loads faster than MyRocks and MyRocks loads faster than compressed InnoDB. I hope to figure out how to make MyRocks load faster than uncompressed InnoDB.
  • MyRocks uses about half the disk space compared to compressed InnoDB.
  • MyRocks writes much less to storage than InnoDB. This allows a workload to run on low-endurance SSD with MyRocks when it requires high-endurance SSD with InnoDB.
  • Average and p99 response times are much better for MyRocks.
  • Maximum response times are usually better for InnoDB.

Compression vs Device Endurance

Imagine a workload that uses 2 TB of storage and writes 100 MB/second to it. With 2X better compression it uses 1 TB of storage but might continue to write 100 MB/second. A side effect of compression is that it increases the endurance required from a storage device: the same write rate lands on half the capacity, so the device must survive twice as many writes per unit of capacity, and better endurance from SSD isn't free. Domas has been pointing out this side effect for many years.

This isn't a problem with MyRocks. For Linkbench it provides 2X better compression than compressed InnoDB and a write rate that is much less than half the InnoDB rate. When InnoDB uses 2 TB of storage and writes 100 MB/second to storage, MyRocks uses 1 TB of storage and writes less than 50 MB/second to storage. Alternatively, we can put twice the number of databases on the server without increasing the storage or endurance requirements. For a server with 4 TB of storage we can either run 2 InnoDB databases that are 2 TB each and sustain 200 MB/second of writes, or run 4 MyRocks databases that are 1 TB each and sustain less than 200 MB/second of writes. This assumes the server has sufficient CPU capacity. The sketch below works through the arithmetic.
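
To make the endurance arithmetic concrete, here is a minimal sketch in Python. It uses the illustrative numbers from the example above, not measurements, and it treats the database size as the provisioned device capacity:

    def drive_writes_per_day(write_mb_per_sec, capacity_tb):
        # Endurance required from the device, expressed as full
        # device writes per day (DWPD).
        bytes_per_day = write_mb_per_sec * 1e6 * 86400
        return bytes_per_day / (capacity_tb * 1e12)

    # With 2X compression the workload uses half the space but keeps the
    # same write rate, so the endurance requirement doubles.
    print(drive_writes_per_day(100, 2.0))  # uncompressed: ~4.3 DWPD
    print(drive_writes_per_day(100, 1.0))  # 2X compressed: ~8.6 DWPD

    # MyRocks uses half the space of InnoDB and writes less than half as
    # much, so the endurance requirement does not grow.
    print(drive_writes_per_day(50, 1.0))   # MyRocks: <= ~4.3 DWPD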

Note that InnoDB writes more than 8X the number of bytes to storage per query compared to MyRocks. Many more details on that are below.


Configuration


I used my Linkbench repo to compare MyRocks with InnoDB in MySQL 5.6.26 and 5.7.10, with my.cnf files for MyRocks, 5.6.26 and 5.7.10. The test server has 2 sockets, 10 cores (20 HW threads) per socket, 144 GB of RAM and 3+ TB of PCIe flash. For Linkbench I configured loaders=20, requesters=20, maxid1=1B and maxtime=3600. The test pattern is to first do the load and then run the query step 24 times, for one hour each time. I collect performance data from MySQL, vmstat, iostat and the storage device during the tests. This is the schema for MyRocks, compressed InnoDB and uncompressed InnoDB.
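
For anyone repeating the test, the pattern (load, then 24 1-hour query steps) can be driven by a small script like the sketch below. It is illustrative rather than exact: it assumes the linkbench launcher and a MySQL properties file from my repo, uses -D property overrides, and spells maxid1=1B out as 1000000001.

    import subprocess

    LINKBENCH = "./bin/linkbench"                 # LinkBench launcher
    CONFIG = "config/LinkConfigMysql.properties"  # assumed config file name

    # Load phase: 20 concurrent loaders, maxid1 = 1B.
    subprocess.check_call([LINKBENCH, "-c", CONFIG, "-l",
                           "-Dloaders=20", "-Dmaxid1=1000000001"])

    # Query phase: 24 query steps, 1 hour each, 20 concurrent requesters.
    for step in range(24):
        subprocess.check_call([LINKBENCH, "-c", CONFIG, "-r",
                               "-Drequesters=20", "-Dmaxtime=3600"])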

I tested the following binaries:
  • rocksdb.zlib - MyRocks with zlib compression
  • orig5626.zlib - MySQL 5.6.26, InnoDB and zlib compression
  • orig5710.zlib - MySQL 5.7.10, InnoDB and zlib compression
  • orig5626.none - MySQL 5.6.26, InnoDB and no compression
  • orig5710.none - MySQL 5.7.10, InnoDB and no compression

Compression


This shows the database size at the end of each 1-hour query step. The step function appears in the orig5710.none line because the size is rounded to the nearest 100 GB once it is >= 1 TB. Later in this post I show that MyRocks sustains a higher QPS during the query steps, so it added the most data to the database. Yet the size growth is larger for InnoDB, likely caused by index fragmentation. This is the data for the graph. MyRocks uses less than half the space compared to compressed InnoDB.


Load


This displays the rate at which rows are inserted during the load with 20 concurrent clients. Uncompressed InnoDB has the best rate, but I strongly prefer to use compression for this workload. Regardless, I will debug this to see what can be done to help MyRocks have the best load rate on SSD. The data for the chart is here.

Query


This displays the average QPS from each 1-hour query step of Linkbench. The rate for compressed InnoDB increases significantly across the query steps because it suffers from contention as the b-tree becomes fragmented and page splits are done, and this occurs most often immediately after the load. The data for the graph is here. The QPS for MyRocks is much higher than for all of the InnoDB configurations.

Efficiency


MyRocks is more IO efficient than InnoDB for Linkbench. It has the lowest per-query rates for disk reads and disk bytes written. I think it does fewer disk reads because it keeps more data in cache. One reason is that InnoDB wastes space in the buffer pool because b-tree fragmentation leaves uncompressed pages far from full. There are other reasons I won't describe here. There are several benefits from doing fewer disk reads. First, you can get more throughput when the storage device is close to saturation. Second, you use less CPU for decompression because every compressed page must be decompressed after it is read. Finally, you waste less time managing the buffer pool -- page eviction has been a source of performance problems for InnoDB. The data for the chart and graph is here.
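
The per-query efficiency metrics are simple ratios of device-level rates to QPS over the same interval. A minimal sketch, with made-up numbers rather than results from this test:

    def io_per_query(read_mb_per_sec, write_mb_per_sec, qps):
        # Convert iostat rates into per-query efficiency metrics:
        # KB read and KB written per query.
        return (read_mb_per_sec * 1024.0 / qps,
                write_mb_per_sec * 1024.0 / qps)

    # Illustrative numbers only.
    r_kb, w_kb = io_per_query(read_mb_per_sec=80, write_mb_per_sec=40, qps=10000)
    print("%.2f KB read/query, %.2f KB written/query" % (r_kb, w_kb))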

Uncompressed InnoDB writes 10.4X more data to storage per query than MyRocks. Compressed InnoDB writes 8.4X more data to storage per query than MyRocks. MyRocks enables workload consolidation because it has much better compression and a lower write rate than InnoDB. The MyRocks rate for bytes read from storage per query includes reads done in the background for LSM compaction, but I am not certain why that rate falls between the rates for uncompressed and compressed InnoDB.

Quality of Service


Linkbench has 10 database transactions and reports response time metrics per transaction type. The transactions are ADD_NODE, UPDATE_NODE, DELETE_NODE, GET_NODE, ADD_LINK, UPDATE_LINK, DELETE_LINK, COUNT_LINK, MULTIGET_LINK and GET_LINKS_LIST. By far the most frequent transaction is GET_LINKS_LIST, which requires a short range scan on a covering secondary index. The workload is explained in a blog post and conference paper. For the 24th 1-hour run the metrics are listed here per engine (MyRocks, InnoDB) and then reordered with results per transaction type for all engines. From the latter it is clear that while maximum response times are usually better for InnoDB, the average and p99 response times are much better for MyRocks.
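
For context on why GET_LINKS_LIST stresses short range scans, this is roughly the statement it issues. The column and index layout are assumed from the standard LinkBench MySQL schema; the exact SQL in the driver may differ:

    # Approximate shape of the GET_LINKS_LIST query. A secondary index on
    # (id1, link_type, visibility, time, ...) that includes the selected
    # columns covers the scan, so it never touches the base rows.
    GET_LINKS_LIST = """
        SELECT id1, id2, link_type, visibility, data, time, version
        FROM linktable
        WHERE id1 = %s AND link_type = %s AND visibility = 1
        ORDER BY time DESC
        LIMIT %s
    """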

7 comments:

  1. There are some special my.cnf parameters to make MyRocks data loading much faster (https://github.com/facebook/mysql-5.6/wiki/data-loading).

  2. "MyRocks has 10 database transactions" - I think you meant "Linkbench has ...."

    1. I want to learn more about rocksdb. The documentation I find on rocksdb.org is pretty thin. Is there a place I can go to read and understand more about rocksdb?

    2. There are many wiki pages at https://github.com/facebook/rocksdb/wiki/_pages

  3. Very interesting result... Usually people would assume an LSM tree like RocksDB does better during the load phase and worse during the query phase compared to a b-tree, but the result is the opposite (which means more optimization can be done...)

    Anyway, is the MySQL configuration used in LinkBench tuned for your workload?

    1. Config is tuned for production and a big part of the production workload is similar to Linkbench.

      Better performance from MyRocks comes in part because it is good, but also because it has 2X better compression than InnoDB and the workload was IO-bound.

      I think that MyRocks has a bigger advantage versus InnoDB on disk than on flash, and more of an advantage when IO-bound than when CPU-bound.
