Release the Kraken!

By Ryan Meredith - 2017-08-21

How fast can a Ceph implementation go? 1

As a Principal Storage Solutions Engineer with Micron, I am tasked with figuring that out. I was proud to announce Micron’s All-NVMe Ceph® reference architecture at the OpenStack Summit 2017 in Boston. It’s based on Red Hat® Ceph Storage 2.1 (Jewel 10.2) and it is crazy fast.

In case you missed it, here’s a diagram of our Reference Architecture:

[Figure: Reference Architecture]

Here are links to the reference architecture and the video of my session at the OpenStack Summit:

This reference architecture is built with the Micron 9100 MAX 2.4TB NVMe SSD, Red Hat Enterprise Linux 7.3 + Red Hat Ceph Storage 2.1, and the Supermicro SYS-1028U-TN10RT+.

Faster!?! Kraken + BlueStore Performance

Many conference attendees asked me about the performance difference when using BlueStore in Ceph instead of FileStore. Using BlueStore should help alleviate the write penalty inherent in Ceph and provide greater performance. Check out this slide deck from Sage Weil (the founder and chief architect of Ceph) on the problems with FileStore and how BlueStore addresses them.

I did some testing with the latest GA version of Ceph, Kraken (Ceph 11.2) using BlueStore on the same hardware from the reference architecture and got some impressive performance improvements.

  • 39% higher 4KB random read performance
  • 21% higher 4KB random write performance
  • 25% lower 4MB object average read latency with higher throughput
  • 2.3X higher 4MB object write throughput + 37% lower average latency

Giant Squid-Sized Caveat: BlueStore in Kraken is listed as stable but is still flagged as experimental and potentially data corrupting, so use it at your own risk and not anywhere near production.

4KB Random Block Workload

I used FIO against the RBD driver on 10 load generation servers to push 4KB random read and write workloads. Tests were run on a 2x replicated pool with 5TB of data (10TB data accounting for replication).
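For readers who want to try a similar workload, here is a minimal sketch of how a 4KB random FIO job for the RBD engine could be generated. It is not the job file used in these tests: the pool name, image name, client name, queue depth and runtime are all placeholder assumptions, and only the 4KB block size and the random read/write patterns come from the description above.

```python
def fio_rbd_job(name, rw, pool="rbd_bench", image="fio_image",
                client="admin", iodepth=32, runtime_s=600):
    """Return the text of a fio job file that drives an RBD image.

    pool, image, client, iodepth and runtime_s are hypothetical defaults
    chosen for illustration; they are not the values used in the article.
    """
    if rw not in ("randread", "randwrite"):
        raise ValueError("rw must be 'randread' or 'randwrite'")
    lines = [
        "[global]",
        "ioengine=rbd",            # fio's librbd engine, no kernel mapping needed
        f"clientname={client}",    # cephx user name without the 'client.' prefix
        f"pool={pool}",
        f"rbdname={image}",
        "bs=4k",                   # 4KB blocks, as in the tests described above
        f"iodepth={iodepth}",
        "time_based=1",
        f"runtime={runtime_s}",
        "",
        f"[{name}]",
        f"rw={rw}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(fio_rbd_job("randread-4k", "randread"))
```

Each load generation server would run its own copy of a job like this against its own image; how the images were laid out in the actual tests is not stated here.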

[Charts: 4KB Random Read IOPs (RBD, FIO); 4KB Random Write IOPs (RBD, FIO)]

Kraken + BlueStore reaches 1.6 Million 4KB random read IOPs at 6.3ms average latency, a 39% increase in read IOPs over Jewel. 4KB random writes hit 291k IOPs with 4.4ms of average latency, a 21% increase.

CPU utilization is lower with Kraken + BlueStore, topping out at 85%+ on reads and 70%+ on writes. There should be more performance to gain with further optimizations and tuning as BlueStore becomes GA.

Workload          Ceph version              IOPs    Average Latency
4KB Random Read   RHCS 2.1: Jewel 10.2.3    1.15M   1.1 ms
4KB Random Read   Ceph Kraken 11.2          1.60M   6.3 ms
4KB Random Write  RHCS 2.1: Jewel 10.2.3    241k    1 ms
4KB Random Write  Ceph Kraken 11.2          291k    5.4 ms
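As a quick sanity check on the IOPs figures, the percentage gains can be recomputed from the table values. The sketch below also applies Little's law (outstanding I/Os = IOPs x average latency) to the two read rows. That second step is my own back-of-the-envelope estimate, assuming steady state, and the function names are made up for illustration.

```python
def pct_gain(old, new):
    """Percentage increase going from old to new."""
    return (new - old) / old * 100.0


def outstanding_ios(iops, latency_s):
    """Little's law: average number of I/Os in flight at steady state."""
    return iops * latency_s


# IOPs values copied from the 4KB results table.
read_gain = pct_gain(1.15e6, 1.60e6)    # about 39%
write_gain = pct_gain(241e3, 291e3)     # about 21%

# Estimated I/Os in flight for the two 4KB random read rows.
jewel_read_inflight = outstanding_ios(1.15e6, 1.1e-3)    # about 1,265
kraken_read_inflight = outstanding_ios(1.60e6, 6.3e-3)   # about 10,080

if __name__ == "__main__":
    print(f"read gain:  {read_gain:.0f}%")
    print(f"write gain: {write_gain:.0f}%")
    print(f"in flight, Jewel reads:  {jewel_read_inflight:.0f}")
    print(f"in flight, Kraken reads: {kraken_read_inflight:.0f}")
```

If the steady-state assumption holds, the two read rows correspond to quite different numbers of I/Os in flight, so the latency columns are best read alongside the IOPs columns rather than on their own.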

4MB Object Workload

I used Rados Bench on 10 load generation servers to push 4MB object read and write workloads. Tests were run on a 2x replicated pool with 5TB of data (10TB data accounting for replication).
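A rough sketch of how the Rados Bench command lines for a test like this could be assembled is shown below. The pool name, runtime and thread count are placeholder assumptions, not the settings from these tests; only the 4MB object size and the read/write modes come from the description above.

```python
def rados_bench_cmd(pool, seconds, mode, threads=16,
                    object_bytes=4 * 1024 * 1024):
    """Build an argv list for `rados bench`.

    pool, seconds and threads are hypothetical values for illustration.
    mode is 'write', 'seq' (sequential read) or 'rand' (random read).
    """
    if mode not in ("write", "seq", "rand"):
        raise ValueError("mode must be 'write', 'seq' or 'rand'")
    cmd = ["rados", "bench", "-p", pool, str(seconds), mode,
           "-t", str(threads)]
    if mode == "write":
        # -b sets the object size for writes; --no-cleanup keeps the
        # objects so a later read pass has data to read.
        cmd += ["-b", str(object_bytes), "--no-cleanup"]
    return cmd


if __name__ == "__main__":
    for mode in ("write", "seq"):
        print(" ".join(rados_bench_cmd("bench_pool", 600, mode)))
```

The write pass has to run first with `--no-cleanup`, since the read modes read back objects left behind by an earlier write run.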

[Charts: 4MB Object Read Throughput (GB/s), Rados Bench; 4MB Object Write Throughput (GB/s), Rados Bench]

4MB object read throughput with Rados Bench is close between Jewel and Kraken because both tests are network limited on 50GbE networking. Kraken achieves higher throughput (+900 MB/s) and 25% lower average latency.

4MB object writes are 2.3X higher than Jewel. This large difference is due to BlueStore writing 2 copies of each object + metadata versus FileStore writing 4 copies of every object due to journaling. This creates a massive throughput increase along with a 37% reduction in latency. With BlueStore, 4MB object writes are network limited.
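The copy counts above can be written out as simple arithmetic. This is a simplified model of my own for illustration: it counts only full-object writes to the drives and ignores BlueStore's metadata writes. The throughput figures are the 4MB object write results reported in this section.

```python
def device_writes_per_object(replicas, journaled):
    """Full-object writes hitting the drives for one client write.

    Simplified model: a journaled backend (FileStore) writes each replica
    twice (journal + data); a non-journaled one (BlueStore) writes it once.
    Metadata writes are ignored.
    """
    copies_per_replica = 2 if journaled else 1
    return replicas * copies_per_replica


REPLICAS = 2  # the tests used a 2x replicated pool

filestore_writes = device_writes_per_object(REPLICAS, journaled=True)    # 4
bluestore_writes = device_writes_per_object(REPLICAS, journaled=False)   # 2
modelled_ratio = filestore_writes / bluestore_writes                     # 2.0

# Measured 4MB object write throughput in GB/s (Jewel vs. Kraken).
measured_ratio = 10.7 / 4.6                                              # about 2.3

if __name__ == "__main__":
    print(f"FileStore writes per object: {filestore_writes}")
    print(f"BlueStore writes per object: {bluestore_writes}")
    print(f"modelled ratio: {modelled_ratio:.1f}x, measured: {measured_ratio:.1f}x")
```

Halving the number of full-object writes lines up roughly with the measured 2.3X. The model does not try to explain the remaining difference.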

Workload          Ceph version              Throughput   Average Latency
4MB Object Read   RHCS 2.1: Jewel 10.2.3    21.8 GB/s    35 ms
4MB Object Read   Ceph Kraken 11.2          22.7 GB/s    28 ms
4MB Object Write  RHCS 2.1: Jewel 10.2.3    4.6 GB/s     41 ms
4MB Object Write  Ceph Kraken 11.2          10.7 GB/s    30 ms

Faster Ceph Implementations, Coming Soon!

A Ceph implementation is faster with BlueStore, full stop.

BlueStore addresses two of the major drawbacks in the Ceph stack: the penalty of write amplification from using journals, and the overhead of the XFS filesystem for storing data. The performance improvements of BlueStore allow Ceph to take better advantage of high performance drives like the Micron 9100 MAX NVMe SSD.

BlueStore in its current state is encouraging. While it is still experimental, my test cluster was stable with Kraken 11.2. There is no reason to doubt that BlueStore will provide even higher performance in its final form. BlueStore will become the default in the next GA version of Ceph, Luminous, which should be released in late 2017.

The ceph.conf tuning settings for BlueStore and RocksDB I used in these tests are available here:

1 © 2017 Micron Technology, Inc. All rights reserved. All information is provided on an “AS IS” basis without warranties of any kind.

Ryan Meredith

Ryan Meredith is a senior manager of Storage Solutions Engineering at Micron. He's worked in enterprise storage since 2007 for U.S. Bank, IBM and Gemalto. He currently leads a team focused on architecting and performance testing enterprise solutions using Micron's DRAM and flash technologies. He likes dogs, games, travel and scuba diving.

Ryan has a Master of Science in management information systems from the University of South Florida.