OpenPOWER Rebel Alliance and x86 Empire score HPC wins
Intel was awarded the contract today for the third of three massive new supercomputers ordered up by the U.S. Department of Energy (DOE) as part of the CORAL procurement. CORAL (Collaboration of Oak Ridge, Argonne and Livermore) provides for a shiny new box at each of the three facilities.
The new system, named Aurora, will live at Argonne National Laboratory. Aurora will run 17x faster than the Mira box that it replaces. Intel will partner with Cray to deliver the 180-petaflop system sometime in 2018.
The other two systems were awarded to IBM and their OpenPOWER Foundation partners NVIDIA and Mellanox. The new Summit system will churn out at least 5x the performance of Oak Ridge’s current Titan system, and the new Sierra box should top LLNL’s Sequoia by more than 7x.
One interesting aspect of the CORAL deal is that it barred DOE decision makers from purchasing the same system architecture for all three systems. Since there are only two viable architectures in the petaflop arena, their choice was between buying two OpenPOWER-based systems (accompanied by one x86 system) or two x86 systems (and one OpenPOWER-based box).
Last November, the DOE announced that they had selected OpenPOWER systems for the first two systems of the CORAL buy. So the Intel announcement today is a bit anticlimactic; given the procurement rules, Intel was certain to win this one.
Performance Rules
Most industry observers were surprised that OpenPOWER nabbed two-thirds of the CORAL deal. Many assumed that because x86 architecture is the standard in HPC, the majority of this procurement would go Intel’s way. Why didn’t it?
It won’t surprise anyone to learn that the procurement process for the $600 million-plus CORAL deal was quite complex. But it’s performance that dictates an HPC system purchase. In fact, I’d argue that performance, whether it’s raw flops, flops per watt, or flops per dollar, dictates almost all system purchases.
In order to fairly evaluate what each vendor brought to the table, the DOE specified a series of benchmarks in the following categories:
Scalable Science: applications that are required to scale to the entire prospective system
Throughput Benchmarks: either large runs or subsets of full applications
Data Centric Benchmarks: as proxies for new data intensive workloads
Skeleton Benchmarks: designed to test/stress various parts of the platform like network performance, memory architecture, I/O, and other components
Micro Benchmarks: the most demanding compute sections of larger HPC scalable and throughput applications
Vendors were required to run 16 benchmarks covering the above categories, but could also submit results from a list of 17 elective applications or benchmarks. The target for vendors was to build a system that could provide a 4-8x performance improvement over the labs’ current systems (Sequoia, Mira, and Titan) on full-system scalable science runs and 6-12x better performance on the throughput applications.
(If you’re interested in the actual benchmarks and submission rules, check them out here.)
Powerful Win
Given the criteria above, it can be assumed that the IBM POWER collective won the benchmark battle vs. Intel by posting better performance scores and showing a compelling roadmap for where their technology will be when the systems are delivered in 2017.
The performance of the OpenPOWER system relies on contributions from the founding members of the alliance. Here’s a quick rundown on who’s bringing what:
IBM is contributing their POWER9 CPU, which should be introduced in 2017 if not sooner. Typically, POWER processors have much higher memory bandwidth than competing Intel chips, higher frequencies, and better single-thread performance.
In the past, IBM’s POWER processors couldn’t compete directly with Intel’s Xeon line because IBM’s chips were big endian, out of step with the little-endian x86 architecture. IBM’s POWER8 chips are now little endian, meaning that the vast majority of Linux software can be moved from x86 to POWER with a simple recompile rather than a full port.
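To make the endian issue concrete, here is a minimal Python sketch of how the same 32-bit integer is laid out in memory under each byte order. The sample value and the helper name byte_layout are illustrative assumptions, not anything taken from IBM or Intel documentation.

```python
import struct
import sys

# Hypothetical sample value, chosen so each byte is easy to spot.
VALUE = 0x0A0B0C0D


def byte_layout(value: int, order: str) -> bytes:
    """Pack a 32-bit unsigned integer using '>' (big) or '<' (little) endian."""
    return struct.pack(order + "I", value)


big = byte_layout(VALUE, ">")     # most significant byte first
little = byte_layout(VALUE, "<")  # least significant byte first

print("big endian:   ", big.hex())
print("little endian:", little.hex())
print("this machine: ", sys.byteorder)

# Code that writes raw bytes on one kind of machine and reads them back
# on the other gets a different number, which is why byte order matters
# for software that touches binary data directly.
misread = struct.unpack(">I", little)[0]
print("little-endian bytes read as big endian:", hex(misread))
```

Software that never looks at raw byte layouts is unaffected either way; it is the code that reads binary files, network buffers, or packed structures that tends to need attention when byte order changes.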
NVIDIA is supplying their Pascal GPU, which may be as much as 10x faster than Maxwell for certain applications. These gains come from 3D stacked DRAM (up to 32GB of it – yikes), mixed precision capability, and the addition of the uber-fast NVLink interconnect.
NVLink will play a large role in the CORAL systems; presumably it will be used to connect multiple processors and GPUs to each other. While NVIDIA hasn’t yet released any bandwidth or latency figures for NVLink, it’ll probably be around 5x faster than PCIe Gen 3.
Mellanox is bringing their InfiniBand I/O and switching technology to the CORAL party. This will probably include their fancy new Multi-Host technology, which allows a single NIC to host up to four systems (or components like FPGAs or GPUs), all sharing a single switch connection that can be as fast as 300Gb/sec. This can cut switch and NIC cost by as much as 45%, with little loss of potential performance.
One of the most astounding aspects of the OpenPOWER architecture is that the partners estimate it will provide 5x the performance of the current Titan system with only 3,400 nodes (vs. Titan’s 18,688 nodes). That’s an incredible feat. It highlights the payoff from this architecture and, really, the OpenPOWER value proposition as a whole.
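The node counts imply a much bigger jump per node than the headline 5x suggests. Here is a back-of-the-envelope sketch using only the figures quoted above; the function name per_node_speedup is my own, and the calculation assumes performance is spread evenly across nodes, which is a simplification.

```python
# Figures quoted in the article.
TITAN_NODES = 18_688
SUMMIT_NODES = 3_400
SYSTEM_SPEEDUP = 5.0  # Summit vs. Titan, "at least 5x"


def per_node_speedup(system_speedup: float, old_nodes: int, new_nodes: int) -> float:
    """Implied per-node speedup, assuming performance divides evenly across nodes."""
    return system_speedup * old_nodes / new_nodes


node_ratio = TITAN_NODES / SUMMIT_NODES
speedup = per_node_speedup(SYSTEM_SPEEDUP, TITAN_NODES, SUMMIT_NODES)

print(f"Titan has {node_ratio:.1f}x as many nodes as Summit")
print(f"Implied per-node speedup: {speedup:.1f}x")
```

In other words, each Summit node would need to do roughly 27 times the work of a Titan node to hit the 5x system-level estimate.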
Future Implications?
HPC is to commercial computing as Formula 1 racing is to automobiles. The advances we see at the highest end of any technology eventually make their way down into the less rarefied air of widely used products. We’re going to see the server wars heat up again as the plucky Rebel Alliance, as represented by the OpenPOWER partners, takes on the Intel Empire. Get some popcorn and get comfy; it’s going to be quite a show.
