Tuesday, May 23, 2017

MyRocks in MariaDB 10.2.6

MariaDB 10.2.6 is GA. It includes MyRocks, which isn't GA yet but is much easier to use thanks to the hard work of the MariaDB team. This is a big deal if you want performant efficiency or efficient performance. For some workloads it provides better performance than InnoDB. For most (approximately all) workloads it uses much less space and writes much more efficiently. We continue to make RocksDB and MyRocks better and there are many interesting projects in progress. It runs in production, so Domas has begun finding even more things to improve. My contribution is performance evaluation and while I have been busy running tests I have fallen behind on sharing the results. I will do better over the next few months.
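
If you want to try it, enabling the engine should look something like this. This is a sketch that assumes the plugin library ships as ha_rocksdb; depending on how you install MariaDB it might arrive via a separate plugin package:

mysql -e "INSTALL SONAME 'ha_rocksdb'"
mysql -e "CREATE TABLE test.t1 (id INT PRIMARY KEY, val VARCHAR(32)) ENGINE=ROCKSDB"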

Saturday, May 20, 2017

Small servers for database performance tests

I use Intel NUC servers at home to test open source databases. I like them because they are small, quiet and don't use much power. For about 2 years I have been using NUC5i3ryh servers with a 5th gen core i3 CPU, 8gb of RAM, 2.5" SATA disk for the OS and 120gb Samsung 850 EVO m.2 for the database. I used this so much that I replaced the SSD devices last year after one reached the endurance limit.

I am upgrading to a new setup using NUC7i5bnh. This has a 7th gen core i5, 16gb of RAM, 2.5" SATA SSD (Samsung 850 EVO) for the OS and m.2 SSD (Samsung 960 EVO) for the database. It has twice the RAM, twice the CPU and more than twice the IOPs of my old setup. The old and new setups use Ubuntu 16.04 server.

The install was easy with one exception. The old setup used wired networking. This time I enabled wireless after the install finished and that took a few hours to figure out. The important steps are:
  1. Install the HWE-enabled kernel to get the drivers that support the Intel wireless HW in this server. I didn't do this at first and dmesg | grep iwl showed nothing even though the firmware for that Intel HW was installed. With the HWE kernel I see this in the dmesg output: Detected Intel(R) Dual Band Wireless AC 8265. The HWE kernel can be selected at the GRUB menu during the install. I assume this step won't be needed once the NUC7i5bnh HW becomes less new.
  2. After the install finishes, install wireless-tools via sudo apt-get install wireless-tools. Without this ifup -v wlan0 failed.
  3. Edit /etc/network/interfaces. This assumes you are using an unsecured network. See below.
I changed /etc/network/interfaces to enable wireless and disable wired using the following contents. After editing the file I tested my changes via sudo ifup -v wlp58s0. If you get it wrong this will take a few minutes to fail. Note that $name is the name of your wireless network and that this works only when you are running an unprotected network.
# The loopback network interface
auto lo
iface lo inet loopback 
# Wired networking is not started automatically
auto eno1
iface eno1 inet manual
#iface eno1 inet dhcp 
# Wireless networking is started automatically
auto wlp58s0
iface wlp58s0 inet dhcp
wireless-essid $name
wireless-mode Managed
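
To recap, these are the commands from the steps above, assuming the interface name is wlp58s0 as it is on my NUC7i5bnh:

# confirm the wireless HW is detected (requires the HWE kernel)
dmesg | grep iwl
# if the HWE kernel wasn't selected during the install, I assume it
# can be added afterwards via:
#   sudo apt-get install --install-recommends linux-generic-hwe-16.04
# install the tools that ifup needs for wireless
sudo apt-get install wireless-tools
# bring up the wireless interface with the new config
sudo ifup -v wlp58s0
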
First performance comparison is "make -j4" for MySQL 8.0.1 - 1307 seconds for old NUC, 684 seconds for new NUC.

Friday, May 12, 2017

Database IO performance tests

I work with InnoDB and RocksDB storage engines for MySQL and do performance tests to compare both storage engines and storage devices. I have expertise in MySQL and storage engines but not so much in storage devices, so I don't mind running MySQL. Other people have expertise in the layers under MySQL (Linux, storage) and might mind running MySQL. Fortunately, we have benchmark clients for them.

Obviously there is fio and it is my first choice. The fio team even added support for coordinated omission when I asked for it. Alas, it can't generate all of the IO patterns that I need.
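
For example, something like this is where I start for a random-write pattern with an InnoDB-like 16kb block size. Treat the file name, size, job count and runtime as placeholders:

fio --name=dbwrite --filename=/data/fio.test --size=10G \
    --rw=randwrite --bs=16k --direct=1 --ioengine=psync \
    --numjobs=8 --time_based --runtime=300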

I would share a link to io.go here, were Domas to publish that code.

I wrote innosim many years ago to simulate InnoDB IO patterns. Docs are here and a helper script to run a sequence of tests is here.

Finally there is db_bench for RocksDB. One challenge with RocksDB is tuning, so I have a script to help with that and use good options to run a sequence of tests in a special pattern. Well, it works as long as I keep the script current and I just updated it today. It runs these benchmarks in order:
  1. fillseq - Put N key-value pairs in key order
  2. overwrite - Put N key-value pairs in random order. Queries done after a key-order load avoid a few sources of overhead that usually should not be avoided, so this shuffles the database.
  3. overwrite - Put key-value pairs in random order. Runs for K seconds.
  4. updaterandom - do read-modify-write in random order. Runs for K seconds.
  5. readwhilewriting - 1 rate-limited writer and T threads doing Get.
  6. seekrandomwhilewriting - 1 rate-limited writer and T threads doing range scans.
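
A sketch of that sequence as raw db_bench calls follows. My script sets many more options, so treat N, K and T as placeholders and the write rate limit below as an example value only:

# N = number of key-value pairs, K = seconds, T = reader threads
N=1000000; K=300; T=8
# 1: load in key order
./db_bench --benchmarks=fillseq --num=$N
# 2: shuffle the database
./db_bench --benchmarks=overwrite --use_existing_db=1 --num=$N
# 3: overwrite for a fixed duration
./db_bench --benchmarks=overwrite --use_existing_db=1 --num=$N --duration=$K
# 4: read-modify-write
./db_bench --benchmarks=updaterandom --use_existing_db=1 --num=$N --duration=$K
# 5, 6: T readers plus 1 rate-limited writer
./db_bench --benchmarks=readwhilewriting --use_existing_db=1 --num=$N \
    --duration=$K --threads=$T --benchmark_write_rate_limit=$((2*1024*1024))
./db_bench --benchmarks=seekrandomwhilewriting --use_existing_db=1 --num=$N \
    --duration=$K --threads=$T --benchmark_write_rate_limit=$((2*1024*1024))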

Thursday, May 11, 2017

InnoDB, MyRocks and TokuDB on the insert benchmark

This post shows some of the improvements we recently made to RocksDB to reduce response time variance for write-heavy workloads. This work helps RocksDB, MyRocks and MongoRocks.

This also extends the result I shared for the impact of the InnoDB redo log size on insert benchmark load throughput. Here I add results for MyRocks and TokuDB. In case you haven't heard, the goal for MyRocks is to provide similar performance to InnoDB with much better space and write efficiency. We have real workloads where InnoDB uses 4X more space than MyRocks.
tl;dr
  • Sorry for overusing the word much.
  • Write response time variance is much better in recent RocksDB builds.
  • All engines have room for improvement to reduce write response time variance.
  • The MyRocks advantage increases when moving from faster to slower storage.
  • MyRocks insert rates are not slowed when compression is enabled.
  • While the write rates I demonstrate here for all engines are impressive, the rates might not be sustainable if I expect the SSD devices to last for more than one year. Of course, better write-efficiency from MyRocks helps a lot with endurance.
Making RocksDB better

My summary of the recent improvements to RocksDB is too vague. I hope the authors write real posts soon to explain their contributions. The work includes:
  • perform L0 to L0 compaction to reduce the number of L0 files when the L1 is busy
  • add compaction_pri=kMinOverlappingRatio to be more clever about the key on which SST files start (or stop) to reduce write-amplification during compaction
  • increase delayed_write_rate from 2MB/second to 16MB/second to smooth the impact from throttling
Configuration

While I usually run the insert benchmark in 3 steps (insert only, insert & queries, insert & queries) I only share results for the insert only step here. Be wary about drawing too many conclusions from an insert-only workload. I will soon share results for the insert & queries steps.

What I wrote in the Configuration section of my previous post is still valid. I tested InnoDB, MyRocks and TokuDB and the following names describe the engine and configuration:
  • Rocks.def10.kmin.mar - MyRocks from March 17 using the def10 my.cnf without compression and with kMinOverlappingRatio
  • Rocks.def10.nokmin.mar - MyRocks from March 17 using the def10 my.cnf without compression and without kMinOverlappingRatio.
  • Rocks.def10.kmin.apr - MyRocks from April 14 using the def10 my.cnf without compression and with kMinOverlappingRatio.
  • Rocks.def10.nokmin.apr - MyRocks from April 14 using the def10 my.cnf without compression and without kMinOverlappingRatio.
  • Rocks.def10.kmin.apr.zstd - MyRocks from April 14 using the def10 my.cnf with zstd compression and with kMinOverlappingRatio.
  • Rocks.def10.nokmin.apr.zstd - MyRocks from April 14 using the def10 my.cnf with zstd compression and without kMinOverlappingRatio.
  • InnoDB-5.7.17.redoX - InnoDB from MySQL 5.7.17 with an Xgb redo log.
  • InnoDB-5.6.35.redoX - InnoDB from MySQL 5.6.35 with an Xgb redo log.
  • TokuDB-5.7.17-none - TokuDB without compression from Percona Server for MySQL 5.7.17
  • TokuDB-5.7.17-zlib - TokuDB with zlib compression from Percona Server for MySQL 5.7.17
I compiled MySQL from source for MyRocks (FB MySQL) and InnoDB (upstream MySQL). I used Percona Server for TokuDB. The TokuDB block cache was 150gb for in-memory workloads and 10gb for IO-bound workloads and the my.cnf is here. The InnoDB buffer pool was 180gb for in-memory workloads and 35gb for IO-bound workloads. The my.cnf for InnoDB is here for 5.6.35 and here for 5.7.17. The base my.cnf for MyRocks is here. It needs to be edited (grep for TODO) because the MyRocks my.cnf files listed above differ in a few ways:
  1. The value for RocksDB block_cache_size (150gb for in-memory, 10gb for IO-bound).
  2. Whether compaction_pri=kMinOverlappingRatio was set
  3. Whether compression was enabled. With compression enabled I used none for L0, L1 & L2, then LZ4 starting at L3 and finally bottommost_compression=kZSTDCompression.
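
As a sketch, those edits end up looking something like this in my.cnf. It shows only the fields from the list above, assumes the RocksDB options-string syntax for rocksdb_default_cf_options, and the real files set many more fields:

# 10gb block cache for IO-bound, 150gb for in-memory
rocksdb_block_cache_size=10G
# kMinOverlappingRatio; no compression for L0-L2, LZ4 from L3,
# zstd for the bottommost level
rocksdb_default_cf_options=compaction_pri=kMinOverlappingRatio;compression_per_level=kNoCompression:kNoCompression:kNoCompression:kLZ4Compression;bottommost_compression=kZSTD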
In-memory load

The database fits in the database cache for the in-memory load. There should be no reads to storage and many writes to storage.

For fast SSD, InnoDB in MySQL 5.7.17 has the best throughput and is much faster than in 5.6.35. A larger InnoDB redo log improves throughput. TokuDB is faster than MyRocks, but that changes when the workload switches from insert-only to inserts and queries.

Results are different for the server with slow SSD. InnoDB depends on fast random writes and the slow SSD provides less of that than the fast SSD. Here MyRocks is faster than InnoDB in 5.7.17 except when a large redo log (32gb) is used. It is faster than InnoDB in 5.6.35 in all cases. I did not test TokuDB on this hardware.

From the In-memory load metrics section below, InnoDB writes more to storage per insert (see wKB/i), uses more CPU per insert (see Mcpu/i) and uses more space than MyRocks (see size) with or without compression. A larger redo log reduces the InnoDB write rate per insert by more than half. The kMinOverlappingRatio option in MyRocks greatly reduces the write rate to storage.

Compared to MyRocks, TokuDB uses more than 2X the disk space in the database directory (see size) and uses more CPU/insert (see Mcpu/i). TokuDB might be trading more space amplification to get less write amplification and faster inserts as explained by the RUM Conjecture. MyRocks doesn't make that trade with leveled compaction. It does with universal compaction, which I did not test.

In-memory load metrics

See the In-memory load metrics section in my previous post for the legend for the tables below.

Fast SSD

ips.av  ips.99  wKB/i   Mcpu/i  size    rss  wMB/s   cpu  engine
120250   4386   0.78    157      61     2.0  186.1  18.9  Rocks.def10.kmin.mar
132979   7786   0.78    156      63    22.0  206.7  20.8  Rocks.def10.kmin.apr
232666   9925   5.31    176      97   104.4 1235.1  41.0  InnoDB-5.7.17.redo4
295683  18009   1.66    154      97   104.1  490.6  45.5  InnoDB-5.7.17.redo32
 92404   3504   5.11    201      97   106.2  472.5  18.6  InnoDB-5.6.35.redo4
171116  10165   1.74    175      97   106.2  297.9  30.0  InnoDB-5.6.35.redo32
207555  10852   0.78    212     146   176.3  162.3  44.0  TokuDB-5.7.17-none

Slow SSD

ips.av ips.99  wKB/i Mcpu/i size    rss   wMB/s   cpu  engine
115607   4315  0.72     131   61    1.9   163.2   15.2 Rocks.def10.kmin.mar
123793   5834  1.16     168   64    1.8   285.5   20.8 Rocks.def10.nokmin.mar
130548   7724  0.72     130   61   21.8   184.2   17.0 Rocks.def10.kmin.apr
128833   7555  1.09     154   69   20.4   280.9   19.8 Rocks.def10.nokmin.apr
 68672   3277  4.95     165   97  104.4   339.9   11.3 InnoDB-5.7.17.redo4
177179   7231  0.85     130   97  104.3   151.3   23.0 InnoDB-5.7.17.redo32
 38058   1691  5.01     176   97  106.3   190.6    6.7 InnoDB-5.6.35.redo4
 71317   2914  1.26     145   97  106.3    89.5   10.3 InnoDB-5.6.35.redo32

IO-bound load

Things are different for the IO-bound load compared to the in-memory load. MyRocks is strictly faster than InnoDB for the IO-bound load and the redo log size doesn't make a big difference for InnoDB. InnoDB in 5.7.17 does better than in 5.6.35. MyRocks insert rates don't drop when compression is enabled, while they do for TokuDB.

Secondary index maintenance for InnoDB is read-modify-write. Whether the reads are done as part of the insert or deferred to the change buffer, eventually those random reads must get done and they use IO capacity that is then not available for random writes. Statements can also stall on page writeback when the buffer pool is full and pages at the tail of the LRU are dirty (insert joke about single page flush here).

Results for fast SSD and slow SSD are similar but the difference between MyRocks and InnoDB is larger on slow SSD because InnoDB depends more on random IO performance. The slow SSD results also show the benefit from using kMinOverlappingRatio when compression is enabled. The IO-bound load metrics section shows the benefit from kMinOverlappingRatio -- wKB/i and rKB/i are about 2/3 the rate compared to MyRocks without that option.

TokuDB has the fastest insert rate but it also uses much more space (see size) and CPU (see Mcpu/i) compared to MyRocks. I suspect it is trading more space amplification to get less write amplification and faster inserts as explained by the RUM Conjecture.

Compared to MyRocks, InnoDB writes more to storage per insert (see wKB/i), uses more CPU per insert (see Mcpu/i) and uses more space (see size) even compared to uncompressed MyRocks.


IO-bound load metrics

See the IO-bound load metrics section in my previous post for the legend for the tables below.

Fast SSD

ips.av ips.99  r/i   rKB/i wKB/i Mcpu/i size  rss   r/s  rMB/s  wMB/s   cpu  engine
126727   4671  0     0.23  2.07  175    226   4.9   266   29.6  523.2  22.2  Rocks.def10.kmin.mar
129879   7638  0     0.34  1.98  174    225  12.8   383   43.6  513.3  22.7  Rocks.def10.kmin.apr
128816   7687  0     0.13  1.64  209    101  14.1   147   16.6  421.0  27.0  Rocks.def10.kmin.apr.zstd
 61711   3353  0.09  1.48 14.65  562    402   4.0  5698   91.2  904.3  34.7  InnoDB-5.7.17.redo4
 63959   3419  0.09  1.43 10.11  535    402   4.0  5732   91.7  646.6  34.2  InnoDB-5.7.17.redo32
 45874   1790  0.11  1.71 11.25  585    402  43.2  4915   78.6  516.0  26.8  InnoDB-5.6.35.redo4
 59312   2473  0.10  1.56  8.36  593    403  43.2  5776   92.4  495.7  35.2  InnoDB-5.6.35.redo32
156250   4150  0.01  0.82  2.97  340    400  11.2  1854  128.1  464.5  53.1  TokuDB-5.7.17-none
123259   1535  0.01  0.25  1.00  540    143  11.2   854   30.4  123.5  66.6  TokuDB-5.7.17-zlib

Slow SSD

ips.av ips.99  r/i   rKB/i wKB/i Mcpu/i size  rss   r/s  rMB/s  wMB/s   cpu  engine
122963   5280  0.01  1.46  1.55  192    232  12.8  1443  179.6  380.0  23.7  Rocks.def10.nokmin.apr
127316   7283  0     0.32  0.91  148    225  12.9   334   40.8  230.1  18.8  Rocks.def10.kmin.apr
125842   7420  0     0.11  0.73  177    101  14.1   112   13.4  182.7  22.3  Rocks.def10.kmin.apr.zstd
 82771   1980  0.01  0.66  1.11  259    114  13.7   445   54.5  182.7  21.4  Rocks.def10.nokmin.apr.zstd
 34101   1975  0.10  1.60 10.91  402    400  39.7  3412   54.6  372.0  13.7  InnoDB-5.7.17.redo4
 49244   1750  0.10  1.56  6.39  454    403  40.0  4803   76.9  314.5  22.4  InnoDB-5.7.17.redo32
 17654    605  0.18  2.79 12.22  448    398  43.2  3083   49.3  215.7   7.9  InnoDB-5.6.35.redo4
 30734   1276  0.09  1.50  5.01  407    403  43.2  2882   46.1  153.9  12.5  InnoDB-5.6.35.redo32

In-memory throughput over time

Fast SSD

InnoDB variance and throughput are much worse with a 4gb vs 32gb redo log. MyRocks variance has improved significantly from the March 17 to the April 14 build thanks to support for L0 to L0 compaction. TokuDB has a few stalls, similar to the March 17 build of MyRocks.

Slow SSD

InnoDB suffers from too much variance regardless of redo log size. It needs random write IOPs and the slow SSD has less of that than the fast SSD. MyRocks with the April 14 build is much better than the March 17 build because of the change to perform L0 to L0 compaction to reduce the chance of stalls.

IO-bound throughput over time

Fast SSD

This is similar to the in-memory results. InnoDB variance and throughput are much worse with a 4gb vs 32gb redo log. MyRocks variance has improved significantly from the March 17 to the April 14 build thanks to support for L0 to L0 compaction. TokuDB has a few stalls.

Slow SSD

InnoDB has too much variance; redo log size isn't significant, but 5.7.17 is better than 5.6.35. For MyRocks the April 14 build is better than the March 17 build and the kMinOverlappingRatio feature helps.

Tuesday, May 9, 2017

Benchmark(et)ing with InnoDB redo log size

What is a typical InnoDB redo log size? By size I mean the product of innodb_log_file_size and innodb_log_files_in_group. A typical setup for me is 4gb via 2gb/file and 2 files.
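
In my.cnf terms that typical setup is:

# 4gb of redo log via 2 files at 2gb/file
innodb_log_file_size=2G
innodb_log_files_in_group=2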

Using a large InnoDB redo log can help performance but it comes at a cost. I have been using a 4gb redo log in my tests because that is what we frequently use in production. Dimitri mentioned using a 32gb redo log to show great throughput for InnoDB in recent releases. Here I share results from the insert benchmark with a 4gb, 8gb, 16gb and 32gb redo log.

tl;dr - conclusions specific to my test
  1. A larger redo log improves throughput
  2. A larger redo log helps more with slower storage than with faster storage because page writeback is more of a bottleneck with slower storage and a larger redo log reduces writeback.
  3. A larger redo log can help more when the working set is cached because there are no stalls from storage reads and storage writes are more likely to be a bottleneck.
  4. InnoDB in MySQL 5.7.17 is much faster than 5.6.35 in all cases except IO-bound + fast SSD
A larger redo log means that less checkpoint IO will be done and less IO is good. But there are costs with a larger redo log. It uses more storage and you might not want to spend 32gb of SSD for the redo log. It makes crash recovery slower. With buffered IO you can have 32gb of redo log competing to stay in the OS page cache and waste memory. Finally when using a disk-array, the read-before-write that occurs with buffered IO means that redo log writes may be wasting too much IO.

Configuration

I used my insert benchmark client with two test servers -- one with slow SSD that does ~10k IOPs and one with fast SSD that does more than 100k IOPs. Both servers have 24 cores, 48 HW-threads and 256gb of RAM. Tests were repeated for an in-memory workload (insert 500m rows) and an IO-bound workload (limit server RAM to 50gb and insert 2b rows). The test used 16 tables, 16 clients and each client inserted rows to a different table. Typical command lines were:
    # insert 500m rows
    bash iq.sh innodb "" /path/to/bin/mysql /path/to/datadir md2 \
        1 16 no no no 0 no 500000000
    # insert 2b rows
    bash iq.sh innodb "" /path/to/bin/mysql /path/to/datadir md2 \
        1 16 no no no 0 no 2000000000

Tests were run with upstream MySQL 5.6.35 and 5.7.17. I used these my.cnf files for 5.7.17 and 5.6.35. I compiled MySQL from source and used jemalloc. The benchmark client ran on the same host as mysqld to remove variance from the network. The binlog is enabled but sync-on-commit is disabled for both InnoDB and the binlog to support faster insert rates.
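
One way to use jemalloc without relinking is the --malloc-lib option for mysqld_safe. A sketch, assuming the library path used by the Ubuntu libjemalloc1 package:

mysqld_safe --malloc-lib=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1 &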

In-memory load

For the in-memory load the redo log size has more impact with slow SSD than with fast SSD. From the In-memory load metrics section below you can see the benefit from a larger redo log. It reduces the amount of data written to storage per insert by more than half (see the wKB/i column) when going from a 4gb to a 32gb redo log. For fast SSD with 5.7.17, InnoDB writes to storage 5.31 KB/insert with a 4gb redo log versus 1.66 KB/insert with a 32gb redo log. Similar reductions occur for slow SSD and for 5.6.35. The larger redo log helps slow SSD more than fast SSD based on the reduction in wKB/i.
In-memory load metrics

Legend:
  • ips.av, ips.99 - average and p99 insert rates. The 99th percentile is computed from the per-interval rates where the interval is ~10 seconds.
  • wKB/i, rKB/i - KB written to and KB read from storage per inserted row. Measured by iostat.
  • r/i - storage reads per inserted row. Measured by iostat.
  • Mcpu/i - relative CPU overhead per inserted row. Measured by vmstat us and sy columns.
  • size - database size in GB at test end
  • rss - mysqld RSS in GB at test end
  • wMB/s, rMB/s - average MB/second written to and read from storage. Measured by iostat.
  • r/s - average storage reads/second. Measured by iostat.
  • cpu - average CPU utilization. Measured by vmstat us and sy columns.
  • engine - database engine. none means that no compression was used. 150g and 35g are the size of the InnoDB buffer pool. redoX is the size of the InnoDB redo log.
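
As a sanity check, the per-insert rates can be derived from the average rates: wKB/i is roughly (wMB/s * 1024) / ips.av. For the first fast SSD row below that gives 1235.1 * 1024 / 232666 ≈ 5.4, close to the reported 5.31; the small difference comes from how the iostat and benchmark measurement intervals line up.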

Fast SSD

ips.av  ips.99  wKB/i   Mcpu/i  size    rss  wMB/s   cpu  engine
232666   9925   5.31    176      97   104.4 1235.1  41.0  inno5717.none.150g.redo4
286041  16559   3.46    162      97   104.1  989.0  46.4  inno5717.none.150g.redo8
292740  17587   2.41    157      97   104.3  706.5  46.0  inno5717.none.150g.redo16
295683  18009   1.66    154      97   104.1  490.6  45.5  inno5717.none.150g.redo32
 92404   3504   5.11    201      97   106.2  472.5  18.6  inno5635.none.150g.redo4
134844   5987   2.97    192      97   106.1  400.8  25.8  inno5635.none.150g.redo8
169319   9247   2.02    176      97   106.1  342.5  29.8  inno5635.none.150g.redo16
171116  10165   1.74    175      97   106.2  297.9  30.0  inno5635.none.150g.redo32

Slow SSD

ips.av ips.99  wKB/i Mcpu/i size    rss   wMB/s   cpu     engine
 68672   3277  4.95     165   97  104.4   339.9   11.3    inno5717.none.150g.redo4
 95896   4518  2.67     154   97  104.4   256.4   14.7    inno5717.none.150g.redo8
131787   5868  1.47     135   97  104.4   194.2   17.8    inno5717.none.150g.redo16
177179   7231  0.85     130   97  104.3   151.3   23.0    inno5717.none.150g.redo32
 38058   1691  5.01     176   97  106.3   190.6    6.7    inno5635.none.150g.redo4
 52493   2537  2.74     156   97  106.3   144.0    8.2    inno5635.none.150g.redo8
 61043   2660  1.64     151   97  106.3   100.2    9.2    inno5635.none.150g.redo16
 71317   2914  1.26     145   97  106.3    89.5   10.3    inno5635.none.150g.redo32

IO-bound load

For the IO-bound load and fast SSD the redo log size has little impact for 5.7.17 and some impact for 5.6.35. For slow SSD the redo log size has more impact. But compared to the in-memory results above, the impact from redo log size is much less for the IO-bound workload. From the IO-bound load metrics section below, the reduction in wKB/i is less for IO-bound than for in-memory workloads but increasing the redo log from 4gb to 32gb still cuts the write rate in half for slow SSD per the wKB/i column.

Per the ips.av column InnoDB with 5.7.17 isn't much faster than 5.6.35 for the fast SSD. It is still much faster for the slow SSD.

IO-bound load metrics

Fast SSD

ips.av ips.99  r/i   rKB/i wKB/i Mcpu/i size  rss   r/s  rMB/s  wMB/s   cpu  engine
 61711   3353  0.09  1.48 14.65  562    402   4.0  5698   91.2  904.3  34.7  inno5717.none.35g.redo4
 62929   3471  0.09  1.47 11.39  546    402   4.0  5771   92.3  717.0  34.4  inno5717.none.35g.redo8
 63593   2542  0.10  1.54 11.01  554    404   4.0  6133   98.1  700.2  35.2  inno5717.none.35g.redo16
 63959   3419  0.09  1.43 10.11  535    402   4.0  5732   91.7  646.6  34.2  inno5717.none.35g.redo32
 45874   1790  0.11  1.71 11.25  585    402  43.2  4915   78.6  516.0  26.8  inno5635.none.35g.redo4
 58682   2500  0.10  1.55  8.57  589    403  43.2  5667   90.7  502.8  34.6  inno5635.none.35g.redo8
 59179   2285  0.10  1.52  8.30  581    402  43.2  5607   89.7  491.2  34.4  inno5635.none.35g.redo16
 59312   2473  0.10  1.56  8.36  593    403  43.2  5776   92.4  495.7  35.2  inno5635.none.35g.redo32

Slow SSD

ips.av ips.99  r/i   rKB/i wKB/i Mcpu/i size  rss   r/s  rMB/s  wMB/s   cpu  engine
 34101   1975  0.10  1.60 10.91  402    400  39.7  3412   54.6  372.0  13.7  inno5717.none.35g.redo4
 46712   1837  0.09  1.49  7.29  433    403  39.9  4362   69.8  340.7  20.2  inno5717.none.35g.redo8
 45231   1649  0.10  1.54  5.34  420    404  40.0  4355   69.7  241.5  19.0  inno5717.none.35g.redo16
 49244   1750  0.10  1.56  6.39  454    403  40.0  4803   76.9  314.5  22.4  inno5717.none.35g.redo32
 17654    605  0.18  2.79 12.22  448    398  43.2  3083   49.3  215.7   7.9  inno5635.none.35g.redo4
 26607    863  0.12  1.86  7.56  438    402  43.2  3099   49.6  201.2  11.7  inno5635.none.35g.redo8
 28069   1143  0.09  1.51  5.50  398    403  43.2  2650   42.4  154.2  11.2  inno5635.none.35g.redo16
 30734   1276  0.09  1.50  5.01  407    403  43.2  2882   46.1  153.9  12.5  inno5635.none.35g.redo32

In-memory throughput over time

The results above show average throughput and that hides a lot of interesting behavior. We expect throughput over time to not suffer from variance -- for both InnoDB and for MyRocks. For many of the results below there is a lot of variance (jitter).

Fast SSD

InnoDB suffers from jitter with 4gb and 8gb redo logs and the problem is much worse for 5.6.35. Another problem occurs with a 4gb redo log -- throughput drops over time. That problem repeats for an 8gb redo log with 5.6.35.

Slow SSD

With slow SSD jitter is much worse for 5.7.17 than for 5.6.35 and the difference is more significant with the larger redo log. Throughput also drops over time for 5.7.17. This is a surprising result.

IO-bound throughput over time

Fast SSD

MySQL 5.6.35 suffers from jitter while 5.7.17 suffers from throughput dropping over time. The drop in 5.7 might be reasonable, assuming it occurs because the working set doesn't fit in cache and storage reads must be done during index maintenance. It is also interesting that 5.7.17 becomes as slow as 5.6.35 over time.

Slow SSD

Both 5.6.35 and 5.7.17 suffer from jitter.

Wednesday, March 8, 2017

How to build MongoRocks for MongoDB 3.4

This explains how to build MongoRocks for MongoDB 3.4 and is derived from my notes for building for MongoDB 3.2. My server uses Ubuntu 16.04.

# Install many of the dependencies for MongoRocks.
# I assume this is still valid. These are yum package names; on
# Ubuntu 16.04 the equivalents are libsnappy-dev zlib1g-dev
# libbz2-dev liblz4-dev plus scons g++ git via apt-get.
sudo yum install snappy-devel zlib-devel bzip2-devel lz4-devel
sudo yum install scons gcc-c++ git

# Unpack MongoDB 3.4 source in $MONGOSRC

# Directory in which git repos are created
mkdir ~/git

# Get MongoRocks engine
cd ~/git
git clone https://github.com/mongodb-partners/mongo-rocks.git
cd mongo-rocks
git checkout --track origin/v3.4 -b v34

# figure out which version of gcc & g++ is installed
# for ubuntu 16.04 that is 5.4

g++ --version


# get and build RocksDB libraries
# disable the use of jemalloc features

git clone https://github.com/facebook/rocksdb.git
cd rocksdb
git checkout --track origin/5.2.fb -b 52fb
EXTRA_CFLAGS=-fPIC EXTRA_CXXFLAGS=-fPIC DISABLE_JEMALLOC=1 make static_lib

# prepare source build with support for RocksDB
cd $MONGOSRC
mkdir -p src/mongo/db/modules/
ln -sf ~/git/mongo-rocks src/mongo/db/modules/rocks

# Build mongod & mongo binaries
# You can edit LIBS="..." depending on the compression libs
# installed on your build server and enabled for RocksDB.
# To debug and see command lines add --debug=presub
# To use glibc rather than tcmalloc add --allocator=system

scons CPPPATH=/home/mdcallag/git/rocksdb/include \
      LIBPATH=/home/mdcallag/git/rocksdb \
      LIBS="lz4 zstd bz2" mongod mongo

# install mongod, I used ~/b/m342 you can use something else
mkdir -p ~/b/m342
cd ~/b/m342
mkdir data
mkdir bin
cp $MONGOSRC/build/opt/mongo/mongod bin
cp $MONGOSRC/build/opt/mongo/mongo bin

# create mongo.conf file with the text that follows. You must
# change $HOME and consider changing the value for cacheSizeGB
---
processManagement:
  fork: true
systemLog:
  destination: file
  path: $HOME/b/m342/log
  logAppend: true
storage:
  syncPeriodSecs: 600
  dbPath: $HOME/b/m342/data
  journal:
    enabled: true
operationProfiling.slowOpThresholdMs: 2000
replication.oplogSizeMB: 4000
storage.rocksdb.cacheSizeGB: 1
---

# start mongod, consider using numactl --interleave=all
bin/mongod --config mongo.conf --master --storageEngine rocksdb
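# with NUMA interleave that becomes, for example:
numactl --interleave=all bin/mongod --config mongo.conf --master --storageEngine rocksdb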

# confirm RocksDB is there
ls data/db
> 000007.sst  CURRENT  IDENTITY  journal  LOCK  LOG  MANIFEST-000008  OPTIONS-000005

$ head -5 log
2017-03-08T09:38:33.747-0800 I CONTROL  [initandlisten] MongoDB starting : pid=19869 port=27017 dbpath=/home/mdcallag/b/m342/data master=1 64-bit host=nuc2
2017-03-08T09:38:33.747-0800 I CONTROL  [initandlisten] db version v3.4.2
2017-03-08T09:38:33.747-0800 I CONTROL  [initandlisten] git version: 3f76e40c105fc223b3e5aac3e20dcd026b83b38b
2017-03-08T09:38:33.747-0800 I CONTROL  [initandlisten] allocator: tcmalloc
2017-03-08T09:38:33.747-0800 I CONTROL  [initandlisten] modules: rocks

Friday, March 3, 2017

Part 5: sysbench, a larger server, an IO-bound database and really fast storage

I used a database that doesn't fit in the DBMS cache for part 5 in my unending series on sysbench and MyRocks. In this case I used really fast storage - the database fits in the OS page cache, reads are served from the OS page cache and writes are done to the fast SSD. A previous post has more details on the hardware.

tl;dr
  • MyRocks is much faster than compressed InnoDB except for InnoDB-5.7 with long range scans
  • MyRocks had throughput better than or similar to uncompressed InnoDB-5.6 for update-only, insert-only and short range scans. It was slower for long range scans and the point query test.
  • MyRocks was faster than uncompressed InnoDB-5.7 for the update-only tests and slower for insert-only and tests with range/point queries.
  • MyRocks was faster than TokuDB for all tests except the longest range queries

Disclaimer

The goal for MyRocks is the best space efficiency, great write efficiency and good enough read efficiency. The best space efficiency means it needs less SSD. Great write efficiency means that SSD lasts longer, and I just replaced 3 SSDs on my home servers so I appreciate the value of that. Good enough read efficiency is harder to define and my goal over time has become better than good enough. I want MyRocks to match uncompressed InnoDB in response time. We aren't there yet but we are making progress.

For this workload I only share data on read efficiency. MyRocks loses one of its advantages given that the database is in the OS page cache -- when you spend less on writes you can spend more on reads. But when the database fits in the OS page cache there will be no reads from the storage device.

Tests

The tests are fully described in a previous post. In this case I used 8 tables with 10M rows per table and a 2gb DBMS cache. Tests were repeated with no compression, fast compression and slow compression. Fast compression was lz4 for MyRocks and snappy for TokuDB. Slow compression was zlib for all engines. Tests were run for 1 to 128 connections (threads) on a server with 24 CPU cores and 48 HW threads. All engines used jemalloc. The binlog was enabled but fsync for the binlog and storage engine was disabled.
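
As a rough sketch, an invocation for one of the range-query tests looks like the following with sysbench 0.5. My tests use modified lua scripts and wrapper scripts, so treat the script path and values here as placeholders:

sysbench --test=tests/db/oltp.lua --mysql-user=root \
    --oltp-tables-count=8 --oltp-table-size=10000000 \
    --oltp-range-size=100 --num-threads=48 \
    --max-time=180 --max-requests=0 run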

Tests were run for several storage engines:
  • myrocks-5635 - MyRocks from FB MySQL merged to upstream MySQL 5.6.35. The full my.cnf was listed in a previous post and the paste is here. Then I edited that to reduce the RocksDB block cache and enable or disable compression using this.
  • innodb-5626 - InnoDB from upstream MySQL 5.6.26. In a few weeks or months I will upgrade to a more recent 5.6 build.
  • innodb-5717 - InnoDB from upstream MySQL 5.7.17
  • tokudb-5717 - TokuDB from Percona Server 5.7.17-11. The my.cnf is here.

Update-only with secondary index maintenance

The numbers in the table are the QPS for 1 to 128 threads.

   1       2        4       8      16      24      32      40      48      64      80      96     128   threads
- none
2804    9616    18823   33177   38211   41248   44460   44723   44151   43671   43289   43150   42621   myrocks-5635
2355    4903     8127   12703   15987   28899   29468   30260   30785   30991   30710   29922   28946   innodb-5626
3413    7834    14519   23386   32131   36263   36582   37110   37751   39091   40058   40945   42019   innodb-5717
2099    3958     7085   13135   20494   26263   28408   28723   28472   27786   26615   25577   23243   tokudb-5717
- lz4/snappy
2801    9638    18841   33199   38234   41270   44190   44347   44047   43404   42889   42601   42199   myrocks-5635
2070    3921     7004   12988   20146   25857   28196   28545   28322   27784   26663   25550   23293   tokudb-5717
- zlib
2683    8937    17422   31287   37139   39406   41880   42230   42337   41907   41877   41482   41040   myrocks-5635
 328    1004     1971    3346    4782    5343    5666    5863    6016    6154    6239    6260    6215   innodb-5626
 455    1164     2302    4040    5994    6513    6944    7184    7319    7516    7534    7596    7528   innodb-5717
1988    3741     6665   12441   19575   24062   26028   26211   26129   25727   24554   23746   21799   tokudb-5717

Summary:
  • Uncompressed: MyRocks >> InnoDB-5.7 >> InnoDB-5.6 >> TokuDB
  • Compressed: MyRocks >> TokuDB >> InnoDB
  • Compressed InnoDB might suffer from the per-index mutex
  • MyRocks benefits because secondary index maintenance is read-free (write-only)

Update-only without secondary index maintenance

The numbers in the table are the QPS for 1 to 128 threads.

   1       2        4       8      16      24      32      40      48      64      80      96     128   threads
- none
2772    9935    19432   35272   43243   46621   49285   51651   51780   51154   50176   49766   48711   myrocks-5635
3888    7639    12655   17174   28267   31354   34335   36421   34686   38179   39268   40883   41326   innodb-5626
2917    6166    18918   35674   55660   69910   74162   74654   74734   76431   78150   79040   80607   innodb-5717
2071    4089     7508   14548   24884   31050   34702   37987   40152   41042   39219   37496   33944   tokudb-5717
- lz4/snappy
2779    9948    19471   35516   42875   46143   48964   51227   51290   50592   49984   49372   48504   myrocks-5635
2062    4045     7432   14484   24403   30637   34182   37107   39212   40452   38857   37256   33901   tokudb-5717
- zlib
2660    9235    18026   33259   41518   44511   46920   48697   48865   48870   48668   47668   47119   myrocks-5635
 600    1125     2034    3754    5356    5946    6184    6278    6355    6417    6446    6433    6384   innodb-5626
 688    1292     2652    4776    6950    7825    8186    8329    8435    8551    8535    8518    8408   innodb-5717
1964    3839     7112   13739   23446   28967   32118   34603   36506   37325   36384   35010   32236   tokudb-5717

Summary:
  • Uncompressed <= 4 threads: MyRocks >> InnoDB-5.7 >> InnoDB-5.6 >> TokuDB
  • Uncompressed >= 8 threads: InnoDB-5.7 >> MyRocks >> InnoDB-5.6, TokuDB
  • Compressed: MyRocks >> TokuDB >> InnoDB
  • Compressed InnoDB might suffer from the per-index mutex
  • MyRocks doesn't benefit from read-free secondary index maintenance because there is no index maintenance.

Read-write with --oltp-range-size=100

The numbers in the table are the QPS for 1 to 128 threads.

   1       2        4       8      16      24      32      40      48      64      80      96     128   threads
- none
4615    9749    19889   44836   95983  128906  150236  161720  164109  164003  165044  164140  163066   myrocks-5635
4927   12937    28513   54213   85564   82470   85415  109080  124297  132075  132059  132434  130368   innodb-5626
7686   15402    31015   61110  106977  144695  167321  185388  197180  205404  207852  208216  207327   innodb-5717
4403    7548    15377   36443   66814   82179   92476   96790   97598   97450   96157   93420   91241   tokudb-5717
- lz4/snappy
4581    9728    19962   44480   95368  128023  149742  160837  164039  162767  162522  164242  161043   myrocks-5635
4087    7413    14971   35184   64774   79351   87971   92652   94746   94083   93201   91028   89282   tokudb-5717
- zlib
4300    8899    18466   40781   84814  113870  129297  143581  144207  140118  145012  142208  141812   myrocks-5635
2008    3917     7290   13214   20586   24610   26661   27897   28638   29370   30041   30478   30752   innodb-5626
2175    4234     8150   15318   25087   30660   33600   35528   36792   38187   38967   39302   39509   innodb-5717
3580    6824    14221   30690   55008   68552   76448   81532   82780   83096   81667   80006   79417   tokudb-5717

Summary:
  • Uncompressed <= 8 threads: InnoDB-5.7 >> InnoDB-5.6 >> MyRocks >> TokuDB
  • Uncompressed >= 16 threads: InnoDB-5.7 >> MyRocks >> InnoDB-5.6 >> TokuDB
  • Compressed: MyRocks >> TokuDB >> InnoDB
  • Compressed InnoDB might suffer from the per-index mutex

Read-write with --oltp-range-size=10000

The numbers in the table are the QPS for 1 to 128 threads.

   1       2        4       8      16      24      32      40      48      64      80      96     128   threads
- none
 250     503      991    1932    3771    5083    5937    6400    6354    6605    6522    6276    6556   myrocks-5635
 267     439     1070    2258    4403    5974    6889    7556    7680    7509    7212    6965    6466   innodb-5626
 471     948     1866    3675    6951    8974    9959   10497   10673   10590   10569   10470   10493   innodb-5717
- lz4/snappy
 237     473      946    1861    3597    5089    6025    6267    6465    6802    6450    6472    6643   myrocks-5635
 329     646     1292    2378    4546    6003    6504    6775    6906    6807    6751    6671    6620   tokudb-5717
- zlib
 224     457      907    1775    3459    4709    5243    5928    6008    5822    6052    5823    5639   myrocks-5635
 139     326      668    1376    2723    3851    4471    4845    5039    4997    4821    4822    4822   innodb-5626
 176     431      904    1820    3835    5204    5830    6237    6487    6515    6469    6412    6357   innodb-5717
 276     570     1136    2149    4065    5396    5861    6146    6272    6205    6157    6112    6042   tokudb-5717

Summary:
  • Uncompressed: InnoDB-5.7 >> InnoDB-5.6 >> MyRocks
  • Compressed <= 32 threads: TokuDB >> InnoDB-5.7 >> MyRocks >> InnoDB-5.6
  • Compressed >= 40 threads: InnoDB-5.7 >> TokuDB >> MyRocks >> InnoDB-5.6
  • MyRocks does better here relative to others than on the same test with a cached database.
  • Something was done in MySQL 5.7 to make InnoDB more efficient for long range scans.

Read-only with --oltp-range-size=10

The numbers in the table are the QPS for 1 to 128 threads.

   1       2        4       8      16      24      32      40      48      64      80      96     128   threads
- none
5583   12053    25340   60323  125323  173556  201507  228848  240600  260551  268105  270922  270802   myrocks-5635
5802   11302    23312   45754   86042  119485  132435  182210  278491  282393  285004  286285  286280   innodb-5626
5998   12663    27213   67057  139873  191166  229855  267232  302766  304396  304294  308140  310807   innodb-5717
4619   10658    22704   52257  108814  152185  180060  207094  229817  232031  234555  238575  233373   tokudb-5717
- lz4/snappy
5533   11988    25118   59446  123771  170236  200173  226406  241030  257188  266852  268078  265281   myrocks-5635
5019   10563    22409   51342  107286  150102  177104  204018  227579  229317  231358  232494  228827   tokudb-5717
- zlib
4989   10682    22073   50882  106786  150247  177378  202741  221048  230274  229999  230863  231839   myrocks-5635
3104    8163    18296   40498   84292  117789  136762  155641  172568  173673  174538  174994  175127   innodb-5626
4094    8760    17970   40191   85050  120897  141525  161327  179816  180670  180304  180439  182237   innodb-5717
4490    9828    20436   45973   97078  136122  160800  185617  207077  208896  211602  212022  213916   tokudb-5717

Summary:
  • Uncompressed <= 8 threads: InnoDB-5.7 >> MyRocks >> TokuDB, InnoDB-5.6
  • Uncompressed 16 to 40 threads: InnoDB-5.7 >> MyRocks >> TokuDB >> InnoDB-5.6
  • Uncompressed >= 48 threads: InnoDB-5.7 >> InnoDB-5.6 >> MyRocks >> TokuDB
  • Compressed: MyRocks >> TokuDB >> InnoDB

Read-only with --oltp-range-size=100

The numbers in the table are the QPS for 1 to 128 threads.

   1       2        4       8      16      24      32      40      48      64      80      96     128   threads
- none
4557    9527    19771   44629   95037  133472  155356  175533  189690  196465  196471  197426  196431   myrocks-5635
5076   10543    21840   49775  101828  141198  166103  190612  213432  211546  215583  216149  216561   innodb-5626
5384   11440    23741   56913  118847  165338  193619  220664  246599  250321  251309  249837  252654   innodb-5717
4525    9456    19643   43853   91855  128489  148978  168766  185897  185482  189092  190814  187785   tokudb-5717
- lz4/snappy
4486    9364    19322   43591   93011  131252  151314  173058  189554  193018  193373  192635  191932   myrocks-5635
4445    9270    19275   42894   90120  126824  146602  164909  180813  182457  183962  184567  182690   tokudb-5717
- zlib
3959    8205    16871   36958   79526  113109  130868  148129  164466  165194  165251  164002  164486   myrocks-5635
3438    6997    14207   30336   63505   90598  103758  116782  128542  128625  129293  130163  130199   innodb-5626
3553    7291    14880   32345   68779   99275  113500  127408  139032  140490  141352  141673  140846   innodb-5717
4073    8413    17317   38102   80324  113206  132046  149543  163231  165632  166191  169175  169615   tokudb-5717

Summary:
  • Uncompressed <= 8 threads: InnoDB-5.7 >> InnoDB-5.6 >> MyRocks >> TokuDB
  • Uncompressed 16 to 40 threads: InnoDB-5.7 >> InnoDB-5.6 >> MyRocks >> TokuDB
  • Uncompressed >= 48 threads: InnoDB-5.7 >> InnoDB-5.6 >> MyRocks >> TokuDB
  • Compressed: MyRocks, TokuDB >> InnoDB

Read-only with --oltp-range-size=10000

The numbers in the table are the QPS for 1 to 128 threads.

   1       2        4       8      16      24      32      40      48      64      80      96     128   threads
- none
 197     396      784    1553    3068    4211    4696    5069    5303    5260    5248    5203    5150   myrocks-5635
 227     456      908    1799    3525    4798    5538    6114    6343    6310    6287    6200    6182   innodb-5626
 381     756     1489    2926    5641    7272    7949    8407    8604    8531    8476    8442    8254   innodb-5717
 287     577     1151    2262    4419    5836    6335    6616    6695    6575    6542    6496    6427   tokudb-5717
- lz4/snappy
 189     378      749    1486    2941    4035    4555    4922    5160    5127    5096    5070    5005   myrocks-5635
 278     559     1112    2187    4276    5670    6149    6436    6558    6427    6405    6343    6278   tokudb-5717
- zlib
 159     318      628    1251    2480    3470    3886    4183    4405    4403    4340    4299    4283   myrocks-5635
 140     282      566    1117    2210    3092    3538    3853    4109    4102    4051    3903    4015   innodb-5626
 187     375      744    1488    2925    3992    4460    4716    4890    4914    4908    4881    4835   innodb-5717
 242     485      971    1920    3770    5070    5564    5836    5997    5889    5841    5791    5744   tokudb-5717

Summary:
  • Uncompressed: InnoDB-5.7 >> TokuDB >> InnoDB-5.6 >> MyRocks
  • Compressed: TokuDB >> InnoDB-5.7 >> MyRocks >> InnoDB-5.6

Point-query

The numbers in the table are the QPS for 1 to 128 threads.

   1       2        4       8      16      24      32      40      48      64      80      96     128   threads
- none
6412   13400    28931   72991  152003  206627  245257  282682  317475  320832  320487  319250  325351   myrocks-5635
6831   14592    31265   80304  165013  218684  259701  302272  341855  347161  348867  350069  346924   innodb-5626
6591   13831    29885   77583  159026  215990  258119  300289  336777  347714  351820  353239  353298   innodb-5717
5950   12385    26277   62763  130650  180310  212967  244552  271554  276659  276588  278345  281433   tokudb-5717
- lz4/snappy
6395   13451    28854   72695  151874  205623  245214  283065  317367  314263  319040  324845  323703   myrocks-5635
5828   12137    25693   60984  127016  175532  207269  237836  265395  269802  271339  273249  274790   tokudb-5717
- zlib
5859   12482    26097   62841  134703  183953  219125  253690  284868  288741  285110  291383  292424   myrocks-5635
4658    9672    19931   45142   97018  135486  158160  180528  200518  200171  203003  204169  204478   innodb-5626
4557    9471    19508   44014   94149  132992  155439  177183  197399  198790  200071  200243  200104   innodb-5717
5378   11181    23504   54303  114715  159725  189099  217428  242933  246185  248465  250252  255635   tokudb-5717

Summary:
  • Uncompressed <= 40 threads: InnoDB-5.6 >> InnoDB-5.7 >> MyRocks >> TokuDB
  • Uncompressed >= 48 threads: InnoDB-5.6, InnoDB-5.7 >> MyRocks >> TokuDB
  • Compressed: MyRocks >> TokuDB >> InnoDB

Insert-only

The numbers in the table are the QPS for 1 to 128 threads.


   1       2        4       8      16      24      32      40      48      64      80      96     128   threads
- none
5206   16893    31320   41382   48020   52252   54483   54546   54497   54196   54106   53575   53047   myrocks-5635
5743   12052    32062   49692   56744   57288   56026   53698   51452   50519   48907   47300   44191   innodb-5626
5305   15574    29411   53523   80249   91492   93392   89487   86214   84496   84736   85002   84752   innodb-5717
2938    5797     8947   17728   29184   37570   38571   35497   33231   30409   27484   24989   21330   tokudb-5717
- lz4/snappy
5210   16827    31047   40876   47854   52096   54152   54273   54120   53872   53554   53346   52546   myrocks-5635
2925    5740     8963   17824   29350   37647   38781   35571   33359   30580   27645   25080   21443   tokudb-5717
- zlib
5226   16682    31500   40534   47009   51833   53281   53719   53613   53433   53463   52761   52412   myrocks-5635
3256    5874    12234   20365   28593   31562   32943   33475   33195   33641   32953   33919   34103   innodb-5626
3297    6442    12727   22377   34871   40052   42320   43433   42824   43592   46455   46183   45197   innodb-5717
2920    5801     8931   17724   28971   37195   38456   35110   32937   30220   27492   24924   21337   tokudb-5717

Summary:
  • Uncompressed <= 40 threads: InnoDB-5.7 >> InnoDB-5.6 >> MyRocks >> TokuDB
  • Uncompressed >= 48 threads: InnoDB-5.7 >> MyRocks >> InnoDB-5.6 >> TokuDB
  • Compressed <= 8 threads: MyRocks >> InnoDB >> TokuDB
  • Compressed 16 to 40 threads: MyRocks >> InnoDB-5.7 >> TokuDB >> InnoDB-5.6
  • Compressed >= 48 threads: MyRocks >> InnoDB-5.7 >> InnoDB-5.6 >> TokuDB