This testing results summary was created in collaboration with Tony Ansley and Ryan Meredith.
With Flash Memory Summit™ (FMS) now behind us, we hope everyone had the chance to attend our session on the big-data benefits of flash and see the value that our latest quad-level cell (QLC) based SSDs can bring to your big-data solutions. As the economics of flash move ever closer to those of legacy HDDs, SSDs are becoming hard to ignore. They provide tremendous performance advantages over HDDs, advantages that will only become more critical as big-data analytics moves beyond batch processing toward true real-time results. If you were not able to attend, no worries! We have you covered with this blog. If you are considering what your next Hadoop solution will look like and how to get better time-to-insight for your business, read on.
Configuration
The test environment consists of four Supermicro SYS-2029U-TR25M (Intel Purley platform) servers acting as Hadoop data nodes. We also deployed a single Dell-EMC® PowerEdge R630™ server running KVM to host the primary non-data components of the solution in individual virtual machines: the primary and secondary NameNodes, the YARN ResourceManager, ZooKeeper, Hive, and the Ambari server (Figure 1).
Figure 1: Test Configuration Overview
The Supermicro® SYS-2029U-TR25M data node hardware configuration:
| Component | Configuration |
| --- | --- |
| CPU | 2x Intel Xeon Gold 6142 @ 2.60GHz |
| Memory | 384GB Micron DDR4-2666 (12x 32GB) |
| Network | 1x Mellanox ConnectX-4 100GbE |
| HDD Storage | 8x 2.4TB SAS 10K hybrid HDDs |
| SSD Storage | 4x 4TB Micron 5210 ION SSDs |
For the Apache Hadoop® software implementation, we deployed Hortonworks HDP 2.6 with the following components:
- Apache Hadoop 2
- HDFS
- YARN
- MapReduce2
- Apache ZooKeeper™
- Apache Ambari™
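If you are recreating a similar stack, one quick sanity check is to ask Ambari which services it is managing. The sketch below uses Ambari's REST API; the host, cluster name, and credentials are placeholders, not values from our lab.

```python
import base64
import json
import urllib.request

# Minimal sketch: list the services Ambari reports for a cluster, as a sanity
# check that the expected components (HDFS, YARN, MapReduce2, ZooKeeper, ...)
# are deployed. Host, cluster name, and credentials below are placeholders.
AMBARI_URL = "http://ambari-host:8080"
CLUSTER = "my_cluster"

req = urllib.request.Request(f"{AMBARI_URL}/api/v1/clusters/{CLUSTER}/services")
# Ambari's REST API uses HTTP Basic authentication.
token = base64.b64encode(b"admin:admin").decode()
req.add_header("Authorization", f"Basic {token}")

with urllib.request.urlopen(req) as resp:
    services = json.load(resp)

for item in services["items"]:
    print(item["ServiceInfo"]["service_name"])  # e.g. HDFS, YARN, ZOOKEEPER
```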
The Tests
We executed a series of benchmark tests that simulate typical map-reduce functions. These benchmarks are part of the built-in Apache Hadoop toolset. Specifically, we used the following benchmarks:
| Benchmark | Description |
| --- | --- |
| Randomtextwriter | Generates 2TB of text-based data |
| Sort | A map/reduce program that sorts the data written by Randomtextwriter |
| WordAggHist | An aggregate-based map/reduce program that computes a histogram of the words in the test dataset |
| WordCount | A map/reduce program that counts the words in the test dataset |
| TeraGen | Generates 2TB of data to be used with the TeraSort benchmark |
| TeraSort | A map/reduce program that sorts the data in the TeraGen test dataset |
| TestDFSIO-Write | A distributed I/O benchmark for 100% writes |
| TestDFSIO-Read | A distributed I/O benchmark for 100% reads |
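For readers who want to try the same workloads, the sketch below shows typical invocations of these benchmarks from the stock Hadoop examples and test JARs. It is a minimal illustration rather than our exact harness: the JAR paths, HDFS directories, and sizing values are assumptions, and option names can vary slightly between Hadoop releases.

```python
import subprocess

# Assumed locations of the stock Hadoop example/test JARs; actual paths vary
# by distribution (these follow a typical HDP layout).
EXAMPLES_JAR = "/usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar"
TESTS_JAR = "/usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient-tests.jar"

# One representative invocation per benchmark; HDFS paths and sizes are
# illustrative. TeraGen counts 100-byte rows, so 20 billion rows ~= 2TB.
BENCHMARKS = [
    ("randomtextwriter", ["hadoop", "jar", EXAMPLES_JAR, "randomtextwriter",
                          "-D", "mapreduce.randomtextwriter.totalbytes=2000000000000",
                          "/bench/rtw-out"]),
    # Depending on the data format, sort's -inFormat/-outFormat/-outKey
    # options may also be needed.
    ("sort", ["hadoop", "jar", EXAMPLES_JAR, "sort",
              "/bench/rtw-out", "/bench/sort-out"]),
    ("aggregatewordhist", ["hadoop", "jar", EXAMPLES_JAR, "aggregatewordhist",
                           "/bench/rtw-out", "/bench/hist-out", "1", "textinputformat"]),
    ("wordcount", ["hadoop", "jar", EXAMPLES_JAR, "wordcount",
                   "/bench/rtw-out", "/bench/wc-out"]),
    ("teragen", ["hadoop", "jar", EXAMPLES_JAR, "teragen",
                 "20000000000", "/bench/tera-in"]),
    ("terasort", ["hadoop", "jar", EXAMPLES_JAR, "terasort",
                  "/bench/tera-in", "/bench/tera-out"]),
    # TestDFSIO lives in the jobclient tests JAR; older releases spell the
    # size flag -fileSize (in MB) instead of -size.
    ("testdfsio-write", ["hadoop", "jar", TESTS_JAR, "TestDFSIO", "-write",
                         "-nrFiles", "64", "-size", "16GB"]),
    ("testdfsio-read", ["hadoop", "jar", TESTS_JAR, "TestDFSIO", "-read",
                        "-nrFiles", "64", "-size", "16GB"]),
]

for name, cmd in BENCHMARKS:
    print(f"Running {name} ...")
    subprocess.run(cmd, check=True)  # each runs as an ordinary MapReduce job
```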
All benchmarks were configured so that the total dataset was larger than the aggregate memory capacity of the cluster (four data nodes x 384GB, roughly 1.5TB of RAM). This ensured that the complete dataset could not be cached in memory and that disk I/O was required to complete the tests.
To ensure consistent results, we executed a series of four test runs, each consisting of a complete pass of every benchmark listed above, and averaged the results across the four runs to create the graphs below. Each test set ran against the four data nodes with the data placed exclusively on either the 32x 2.4TB HDDs or the 16x 4TB Micron 5210 ION SSDs.
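We are not restating our exact cluster configuration here, but one common way to pin HDFS block data to a single media type is to point each DataNode's dfs.datanode.data.dir at only the desired mounts. The sketch below illustrates the idea with hypothetical mount paths.

```python
# Minimal sketch: render the hdfs-site.xml property that pins HDFS block
# storage to a chosen set of mounts. Mount paths are hypothetical placeholders,
# not the paths used in our lab; one sweep uses the HDD list, the other the SSDs.
HDD_DIRS = [f"/data/hdd{i}" for i in range(8)]  # 8x 10K SAS HDDs per node
SSD_DIRS = [f"/data/ssd{i}" for i in range(4)]  # 4x Micron 5210 ION SSDs per node

def data_dir_property(dirs):
    """Render the dfs.datanode.data.dir property for hdfs-site.xml."""
    return (
        "<property>\n"
        "  <name>dfs.datanode.data.dir</name>\n"
        f"  <value>{','.join(dirs)}</value>\n"
        "</property>"
    )

print(data_dir_property(HDD_DIRS))  # swap in SSD_DIRS for the all-flash sweep
```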
The Results
The average time to complete all tests in each sweep is shown in the chart below. The net result: our QLC-based 5210 ION solution completed each test pass an average of 30% faster than the 10K HDD-based solution, using half the number of drives.
Breaking down the individual tests, the SSD advantage varied from near parity (TestDFSIO-Read and WordAggHist) to nearly 2x (TeraGen and Sort). This means your specific needs still come into play when planning a deployment: you must understand the profile of the dataset(s) you are trying to analyze.
Finally, we want to remind everyone of our previous Hadoop testing (published on this blog), which showed the value of SSDs as a YARN cache for existing solutions built on legacy HDDs. Adding a single SSD as a YARN cache can deliver a dramatic performance improvement for a much smaller investment than adding nodes to the cluster.
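As a rough illustration of the idea (not the exact recipe from that earlier post), directing YARN's NodeManager local storage, where MapReduce intermediate and shuffle data lands, at an SSD mount is one common way to add such a cache. The path below is a hypothetical placeholder.

```python
# Minimal sketch: point YARN's NodeManager local directories (where MapReduce
# intermediate and shuffle data spills) at a single SSD mount. The path is a
# hypothetical placeholder; the earlier blog's exact setup may differ.
SSD_CACHE_DIR = "/data/ssd0/yarn/local"

def yarn_local_dirs_property(path):
    """Render the yarn.nodemanager.local-dirs property for yarn-site.xml."""
    return (
        "<property>\n"
        "  <name>yarn.nodemanager.local-dirs</name>\n"
        f"  <value>{path}</value>\n"
        "</property>"
    )

print(yarn_local_dirs_property(SSD_CACHE_DIR))
```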
If you want to learn more about our testing and how Micron can help you achieve your big-data performance goals, let us know. You can comment here or, better yet, contact your reseller or OEM, or reach out to us directly at micron.com.