
QLC Pushing Enterprise HDDs Into Obscurity for Big Data Solutions

By Dennis Lattka - 2018-08-30

This summary of test results was created in collaboration with Tony Ansley and Ryan Meredith.

With Flash Memory Summit™ (FMS) now behind us, we hope everyone got the chance to attend our session on the big-data benefits of flash and see the value that our latest quad-level cell (QLC) based SSDs will bring to your big-data solutions. As the economics of flash move ever closer to those of legacy HDDs, SSDs are becoming hard to ignore. They provide tremendous performance advantages over HDDs that will become more and more critical as big-data analytics moves beyond batch processing to true real-time results. If you were not able to attend…no worries! We have you covered with this blog. If you are considering what your next Hadoop solution will look like and how you can get better time to insight for your business, read on.

Configuration

The test environment consists of four Supermicro SYS-2029U-TR25M (Intel Purley platform) servers as Hadoop data nodes. We also deployed a single Dell-EMC® PowerEdge R630™ server running KVM to host the primary non-data components of the solution in individual virtual machines: the primary and secondary NameNodes, the ResourceManager, ZooKeeper, Hive and the Ambari server (Figure 1).

Figure 1 - Test Configuration Overview

The Supermicro® SYS-2029U-TR25M data node hardware configuration:

CPU             2x Intel Xeon Gold 6142 @ 2.60GHz
Memory          384GB Micron DDR4-2666 (12x 32GB)
Network         1x Mellanox ConnectX-4 100GbE
HDD storage     8x 2.4TB 10K SAS hybrid HDDs
SSD storage     4x 4TB Micron 5210 ION SSDs

For the Apache Hadoop® software implementation, we deployed Hortonworks HDP 2.6. Specifically, the following components were deployed:

  • Apache Hadoop 2
    • HDFS
    • YARN
    • MapReduce2
  • Apache ZooKeeper™
  • Apache Ambari™
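
Before running any benchmarks, it is worth confirming that all four data nodes registered with HDFS and YARN and report the expected raw capacity. A minimal sanity check using standard Hadoop CLI commands (run from any node with the HDFS and YARN clients installed):

    # Confirm all four data nodes are live and report the expected raw capacity
    hdfs dfsadmin -report | head -n 20

    # Confirm every NodeManager registered with the ResourceManager
    yarn node -list -all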

The Tests

We executed a series of benchmark tests that simulate typical map-reduce workloads. These benchmarks are part of the built-in Apache Hadoop toolset. Specifically, we used the following benchmarks (sample invocations follow the table):

Randomtextwriter        Generates 2TB of text-based data
Sort                    A map/reduce program that sorts the data written by Randomtextwriter
WordAggHist             An aggregate-based map/reduce program that computes a histogram of the words in the test dataset
WordCount               A map/reduce program that counts the words in the test dataset
TeraGen                 Generates the 2TB dataset used by the TeraSort benchmark
TeraSort                A map/reduce program that sorts the TeraGen test dataset
TestDFSIO-Write         A distributed I/O benchmark for 100% writes
TestDFSIO-Read          A distributed I/O benchmark for 100% reads
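
These correspond to the standard example programs and test tools that ship with Hadoop. The sketch below shows typical invocations on an HDP 2.6 cluster; the jar paths match a default HDP install but may vary, and the output paths and TestDFSIO file count/size are illustrative assumptions, not our exact test parameters:

    # Jar locations on a typical HDP 2.6 install (paths may vary)
    EXAMPLES=/usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar
    TESTS=/usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient-tests.jar

    # Randomtextwriter: 2TB of text data across the cluster
    hadoop jar $EXAMPLES randomtextwriter \
        -Dmapreduce.randomtextwriter.totalbytes=2000000000000 /bench/rtw

    # Sort, WordCount and WordAggHist (aggregatewordhist) consume that data
    hadoop jar $EXAMPLES sort /bench/rtw /bench/sorted
    hadoop jar $EXAMPLES wordcount /bench/rtw /bench/wc
    hadoop jar $EXAMPLES aggregatewordhist /bench/rtw /bench/hist

    # TeraGen writes 100-byte rows, so 2TB = 20 billion rows; TeraSort sorts them
    hadoop jar $EXAMPLES teragen 20000000000 /bench/teraIn
    hadoop jar $EXAMPLES terasort /bench/teraIn /bench/teraOut

    # TestDFSIO: pure distributed writes, then pure reads (sizes are illustrative)
    hadoop jar $TESTS TestDFSIO -write -nrFiles 128 -fileSize 16GB
    hadoop jar $TESTS TestDFSIO -read -nrFiles 128 -fileSize 16GB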

All benchmarks were configured such that the total dataset was larger than the aggregate data node memory capacity of 1.5TB (4x 384GB). This ensured that the complete dataset could not be cached in memory and that disk I/O was required to complete the tests.

To ensure consistent results, a series of four test runs was executed, with each run consisting of a complete set of all the benchmarks listed above. The results were then averaged across the four runs to create the graphs below. We ran each test set against the four data nodes with data placed exclusively on either the 32x 2.4TB HDDs or the 16x 4TB Micron 5210 ION SSDs.
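
A sweep like this is straightforward to script. A minimal sketch of the harness, where run_bench is a hypothetical wrapper around the per-benchmark invocations shown earlier:

    # Four passes; each pass runs the full benchmark set and logs wall-clock time
    for run in 1 2 3 4; do
        for bench in randomtextwriter sort wordcount aggregatewordhist \
                     teragen terasort dfsio-write dfsio-read; do
            start=$(date +%s)
            run_bench "$bench"    # hypothetical per-benchmark wrapper
            echo "run=$run bench=$bench seconds=$(( $(date +%s) - start ))" >> sweep.log
        done
    done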

The Results

The average time to complete all tests in each sweep is shown in the chart below. The net result: our QLC-based 5210 ION completed each test pass an average of 30% faster than the 10K HDD-based solution … using half the number of drives.

Figure 2 - Average Time to Complete All Tests (HDD vs. 5210 ION SSD)

As we break down the individual tests, the SSD advantage varies from near parity (TestDFSIO-Read and WordAggHist) to nearly 2x (TeraGen and Sort). This indicates that your specific needs still come into play when planning a deployment: you must understand the profile of the dataset(s) you are trying to analyze.

Figures 3-5 - Individual Benchmark Results (HDD vs. 5210 ION SSD)

Finally, we also want to remind everyone of our previous Hadoop testing (published on this blog), which showed the value of SSDs as a YARN cache for existing solutions based on legacy HDDs. Adding a single SSD as a YARN cache can deliver dramatic performance improvements for your big-data solution at a much smaller investment than adding nodes to the cluster.
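
In practice, that typically means pointing the NodeManager's local directories, which hold shuffle and intermediate map-reduce data, at an SSD-backed mount. A minimal sketch, where the device name and mount point are assumptions for illustration:

    # Assumed device name and mount point; adjust for your hardware
    mkfs.xfs /dev/sdb
    mkdir -p /mnt/ssd-yarn
    mount /dev/sdb /mnt/ssd-yarn
    mkdir -p /mnt/ssd-yarn/nm-local

    # Then set this in yarn-site.xml (or via Ambari) and restart the NodeManagers:
    #   yarn.nodemanager.local-dirs = /mnt/ssd-yarn/nm-local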

If you want to learn more about our testing and how Micron can help you achieve your big-data performance goals, let us know. You can comment here or, better yet, contact your reseller or OEM, or reach out to us directly at micron.com.

Dennis Lattka
