Managing immense data growth is a balancing act. Tip the scales in your favor.

By Doug Rollins - 2018-07-10

I’ve heard and read about how data growth rates are trending up, but this was driven home for me last weekend. I was looking through a box of retired camera (flash) cards and before I got to the bottom of the box, I found a few 2GB cards (and they were not *that* old). Digging a bit deeper… 4MB (OK, that one was a CF card and was older, but still…).

My “ah hah” moment? My own data footprint is big, and getting bigger. Then I looked at a project on which I’d been working – revisions, edits, comments – a dozen or more. My second “ah hah” moment? My work-related footprint is growing too (is yours?).

I’m guessing neither of these is a surprise, right? Personal data. Work data. Corporate data. We’re recording more, storing more and managing more in our personal and work lives.

So how to maintain balance between storing so much and getting to it efficiently? For my pictures, I’ve just about given up. For work related data, we have options.

In this blog post I’ll focus on work data, discussing how flash storage can dramatically improve performance - even for very large data sets.

The Setup

When we scale a database - either locally or in the cloud - high performance is imperative. Without it, a massive-scale database is little more than a (semi?) active archive.

When an entire data set is small and fits into memory (DRAM), performance is straightforward and storage system capability is less important. However, with immense data growth, a dwindling percentage of data affordably fits into memory. Combined with the constant demand for faster and more detailed analytics, we have arrived at a data-driven crossroads: We need high performance, high capacity and affordability.

Enterprise SATA SSDs can help. Building with these SSDs lets us future-proof Apache Cassandra® deployments to perform soundly as active data sets grow, extending well beyond memory capacity. Cassandra’s ability to support massive scaling, combined with multi-terabyte, high-IOPS enterprise SATA SSDs, lets us build high-capacity NoSQL platforms with extreme capacity, extreme agility and extreme capability.

Note: Due to the broad range of Cassandra deployments, we tested multiple workloads. You may find some results more relevant than others for your deployment.

Act 1: Enterprise SSDs Meet Growing Demands

When we built Cassandra nodes with legacy HDD storage, we scaled out by adding more nodes to the cluster. We scaled up by upgrading to larger drives. Sometimes we did both.

Adding more legacy nodes was effective (to a point), but it quickly became unwieldy. We gained capacity and a bit more performance, but as we added nodes, clusters became larger and more complex, consuming more rack space and support resources.

Upgrading to larger HDDs was somewhat effective (also to a point) because we got more capacity per node and more capacity per cluster, but these upgrades gave limited additional performance.

With both techniques, performance was expensive and did not scale well with growth.

High-capacity, lightning-quick SSDs, such as the Micron® 5200 series, are changing the design rules. With single-SSD capacities measured in terabytes (TB), throughput in hundreds of megabytes per second (MB/s) and IOPS in the tens of thousands, these drives enable new design opportunities and performance thresholds.

Act 2: SSD Clusters: Real Results from Immense Data Sets

As you plan your next high-capacity, high-demand Cassandra cluster, SSDs can support amazing capacity and provide compelling results. Figures 1a-1c summarize our tested storage configurations.


We used the Yahoo! Cloud Serving Benchmark (YCSB) workloads A–D and F to compare three four-node Cassandra test cluster configurations:

  • SSD Configuration 1: 1x Micron 5200 ECO (3.8TB each)
  • SSD Configuration 2: 2x Micron 5200 ECO (3.8TB each)
  • Legacy Configuration: 4x 15,000 RPM HDD (300GB each)

Note: Due to the broad range of Cassandra deployments, we tested multiple thread counts. See the How We Tested section for details.
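For readers who want to reproduce a comparison like this, a YCSB run against a Cassandra cluster follows a load-then-run pattern. The sketch below uses the standard YCSB `cassandra-cql` binding; the host addresses, record/operation counts and thread count are illustrative placeholders, not the values used in our tests (see the How We Tested section for those).

```shell
# Load the initial data set into the cluster, then run workload A.
# Host IPs, record/operation counts and thread count are illustrative.
bin/ycsb load cassandra-cql -P workloads/workloada \
    -p hosts="10.0.0.1,10.0.0.2,10.0.0.3,10.0.0.4" \
    -p recordcount=1000000

bin/ycsb run cassandra-cql -P workloads/workloada \
    -p hosts="10.0.0.1,10.0.0.2,10.0.0.3,10.0.0.4" \
    -p operationcount=1000000 -threads 64
```

Repeating the `run` phase across workloads A-D and F, and across several `-threads` values, produces the throughput and latency numbers summarized below.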

With the same number of nodes and a single SSD in each node, the 1x SSD test cluster offers roughly a 3X capacity increase over the legacy configuration (the 2x SSD test cluster offers roughly a 6X capacity increase). We also measured significant performance improvements across all the workloads tested with each SSD test cluster, ranging from a minimum improvement of about 1.7X to a maximum improvement of about 10.7X, along with lower and more consistent latency.
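The capacity multiples above follow directly from the per-node raw capacities of the three configurations, as this quick check shows:

```python
# Per-node raw capacity for each tested configuration, in TB.
ssd_1x = 1 * 3.8   # one Micron 5200 ECO (3.8TB)
ssd_2x = 2 * 3.8   # two Micron 5200 ECO (3.8TB each)
legacy = 4 * 0.3   # four 300GB 15,000 RPM HDDs

# Capacity multiple of each SSD cluster vs. the legacy cluster.
print(round(ssd_1x / legacy, 1))  # 3.2 -> roughly 3X
print(round(ssd_2x / legacy, 1))  # 6.3 -> roughly 6X
```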


Act 3: SSD Clusters Provide More Consistent Responses

Read Response Consistency: Since many Cassandra deployments rely heavily on fast, consistent responses, we compared the 99th percentile read response times for each test cluster and workload. Figure 3 shows the 99th percentile read latency for each configuration.

Workload A: An update-heavy workload, with 50% of the total I/Os writing data. At the application level, this workload is similar to recording recent session actions.

Workload B: A read-mostly workload (95% read). At the application level, this workload is similar to adding metadata (such as tagging photographs or articles) to existing content.

Workload C: A read-only workload. At the application level, this workload is similar to reading user profiles or other static data where profiles are constructed elsewhere.

Workload D: Reading the latest entries (most recent records are the most popular). At the application level, this workload is similar to reading user status updates.

Workload F: A read-modify-write workload, where records are read, modified and written back. At the application level, this workload is similar to a user database where records are read and updated.

The Grand Finale

High-capacity, high-performance SSDs can produce amazing results with Cassandra. Whether you are scaling your local or cloud-based Cassandra deployment for higher performance or faster, more consistent read responses, SSDs are a great option.

We expect impressive performance when our data set fits into memory, but immense data growth means that smaller and smaller portions of that data fit into memory affordably.

We are at a crossroads. Business demands drive us toward higher performance, and data growth drives us toward affordable capacity. When we combine these, the answer is clear: Enterprise SSDs deliver strong results, helping tame performance demands and data growth.

Given results like these, our customers increasingly find that deploying SSDs in the data center is a high-value option that improves overall total cost of ownership (TCO).

If you want more details about our testing, you can read the entire technical brief here.

Doug Rollins

Doug Rollins is a principal technical marketing engineer for Micron's Storage Business Unit, with a focus on enterprise solid-state drives. He’s an inventor, author, public speaker and photographer. Follow Doug on Twitter: @GreyHairStorage.