Industry Consolidation and the path to Exascale

The HPC industry saw continued consolidation throughout 2019 and the gene pool of HPC companies continues to shrink steadily, with the trend likely to carry on in 2020. What this means for the industry depends on which side of the fence you sit.

If you are a supplier, then a merger or acquisition can offer benefits including: economies of scale when buying components; easing of the cashflow constraints associated with the financial burden of going after larger procurements; and savings around acquiring new IP or merging competing R&D efforts. Additionally, fewer bidders for larger contracts may, for various reasons, mean less pressure to be price competitive in certain circumstances.

I’ve already mentioned that the traditional HPC vendor gene pool is shrinking (SGI and Cray being the most notable HPC vendor brands to be borged thus far), so one must wonder what effect this will have on the industry as a whole. In my opinion, the big winner will undoubtedly be HPE (and probably Atos), but if you are a buyer then the slow decline in the number of independent vendors will also mean a reduction in the technical variety and evolutionary pressures that have kept HPC vendors at the forefront of technical innovation for so long.

There is an explanation for this consolidation: the path to Exascale is so expensive, and the costs involved in actually delivering an Exascale system are growing at such dizzying rates (up from ~$100M for a Top 3 entry at Petascale to approaching $500M for a similarly ranked machine in the Exascale era), that only the large can survive.[1]

IBM, having found success partnering with NVIDIA and Mellanox for several pre-Exascale machines, has won none of the big recent procurements. In fact, NVIDIA and Mellanox are also locked out of most of the recent US national lab systems, with AMD being the big winner and Cray/HPE picking up all the wins as the system integrator.

Going with AMD has implications for the US national lab application ecosystem, as almost all the GPU-accelerated software has NVIDIA’s CUDA front and centre. There is significant pressure on Cray and AMD to provide the right toolset and middleware to enable these systems to work well at scale, and a reasonable chunk of these systems’ NREs is likely to fund what will be a significant software development effort from Cray. AMD really have been cleaning up in the HPC space with Rome recently and are in a position to make the most of Intel’s recent missteps.

Intel woes

Speaking of which, Intel’s OneAPI initiative is an attempt to make their new Xe range of accelerators relevant to the same space and a prerequisite for the Aurora system they are delivering. Given the knots that Intel has been tying itself in recently it will be interesting to see if they really can deliver everything in time for the targeted 2021 debut for Argonne’s Aurora system.

Intel is clearly a company in the throes of a major upheaval, with new CEO Bob Swan looking to broaden the focus of the company. Their $2B purchase of Habana Labs looks like a second attempt to secure a slice of the datacentre AI boom. Given that they previously acquired Nervana (for ~$400M), that would seem to signal that all is not well with that product line. It is yet another knock-on effect of the 10nm process difficulties, which will also have far-reaching ramifications for Intel’s Xeon family. Who else thinks the Sapphire Rapids delivery timetable is at risk? It’s even rumoured that Intel will use a foundry to fab devices for Aurora, which would be a painful moment for the company if true.

Intel also has continuing problems satisfying the CPU demand from the Hyperscalers. This is due to a combination of issues, but all principally stem from their 10nm woes, where they have capacity but apparently no truly viable process. Intel expected 10nm production to be crossing over with 14nm several years ago, so they now have capacity constraints on their 14nm lines, with fallow capacity on 10nm. To make matters worse, they have been forced to backport cores to 14nm, which means that what they are fabbing on 14nm takes more wafer area (10nm is ~30% denser than 14nm). Put all these issues together and their production is well short of where it needs to be to satisfy pent-up demand. AMD are busy making steady gains in both the HPC and consumer spaces, and Intel’s supply problems will only benefit AMD.
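As a rough back-of-the-envelope check on that backporting penalty, the sketch below turns the ~30% density figure into an area and die-count estimate. It assumes the density gain applies uniformly to the whole die and ignores yield, wafer-edge losses and reticle limits; the function names are hypothetical and exist only for this illustration.

```python
def backport_area_factor(density_gain: float) -> float:
    """Area multiplier when a design moves to the less dense node.

    density_gain is the fractional density advantage of the newer node,
    e.g. 0.30 for "~30% denser". Assumes uniform scaling (a simplification).
    """
    return 1.0 + density_gain


def relative_die_count(density_gain: float) -> float:
    """Fraction of dies per wafer left after backporting (area-only estimate)."""
    return 1.0 / backport_area_factor(density_gain)


if __name__ == "__main__":
    gain = 0.30  # the ~30% figure quoted in the text
    print(f"Area per die grows by a factor of {backport_area_factor(gain):.2f}")
    print(f"Dies per wafer fall to {relative_die_count(gain):.0%} of the original")
```

On these simplified assumptions, a backported die needs roughly 1.3 times the area, so the same wafer starts yield only around three quarters of the parts, before any yield effects are considered.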

Other notable mergers

The other notable mergers throughout 2018/19 were IBM acquiring Red Hat for $34B and, in the pure HPC space, NVIDIA buying Mellanox for $6.8B. This is a very significant move by NVIDIA and signals their desire to become a system vendor rather than be reliant on the likes of Cray/HPE. With Mellanox IP, which was already well integrated with the NVIDIA ecosystem, they can scale out their AI/ML/HPC systems. All that is missing is the CPU component, and for that they can either let the market big-boys slug it out or home-grow a datacentre-grade ARM CPU. While in my opinion ARM is important for their edge SoCs (Jetson), it seems unnecessary for NVIDIA to try their hand at making a datacentre-grade CPU while others are willing to do so.

Of course, the EU is also upping the ante in the CPU space by pushing ahead with the European Processor Initiative (EPI). The system vendor in pole position domestically (in the European sense) is Atos, and they have started to chip away at the monopoly that Cray seems to have in the larger system category. Having healthy competition for Cray is vital for the larger HPC centres globally, but also for the European dream of having a viable all-European HPC ecosystem. We’ll see lots more on this subject in the next couple of years.

Cloud momentum

The consolidation in the HPC vendor space continues to confirm our view that HPC in the Cloud will grow substantially over the next few years. In fact, Hyperion Research have recently designated 2020 as the breakout year for HPC in the Cloud. The main players here are Azure and AWS, with GCP and other vendor clouds also picking up useful revenue.

In fact, Google have finally decided to take the segment more seriously and have announced ambitions to be a top two player by 2023, which will potentially raise the stakes for AWS and Microsoft. It’s still fairly early days in the HPC segment for Cloud adoption, but a Hyperion survey records substantial growth in the use of hybrid cloud resources, even for established HPC centres, in 2018/19.

Core wars

We are entering a new phase in the evolution of processors in HPC (Intel vs AMD vs ARM) and especially in the AI/ML space. Many commentators have labelled this period a Cambrian-like explosion, with new and diverse architectures for AI/ML training and inference appearing at a phenomenal rate. The market projections for this vertical are certainly quite staggering, with some estimates reaching a total addressable market of $300B per annum in the near future. These sorts of figures dwarf the total HPC market (~$27B total spend today with modest projected CAGR), but there is a definite convergence of HPC, ML and Big Data for hardware and systems, which will mean that meaningful innovation will still be driven along strongly for some time to come.

We’ve written before that Intel, despite their still very healthy market share, will struggle to dominate HPC in the same way as they have in much of the last decade. AMD have made such a comprehensive technical comeback that it’s hard to remember a major HPC win for Intel (apart from Aurora which wasn’t an Intel win but more of a chance to stay in the game) in the last nine months. Intel’s dominant position and the pervasive nature of their ISA extensions mean that it’s not a trivial task to dethrone them and AMD is still to crack double digit market share.

In HPC it’s less about IPC and raw clock speeds and more about memory and system bandwidths. AMD is still winning most of the benchmarks but, due to the nature of many application codes, the lower bytes-to-flops ratio of the 64-core Rome means the competition is a bit closer. Intel are not going to give up the segment quietly and may have to start to publicly discount their SKUs rather than continue down the supported pricing route.
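To make the bytes-to-flops point concrete, here is a minimal sketch of how the ratio is usually calculated and why it drops as core count rises on a fixed memory system. All the input figures are illustrative assumptions rather than numbers from this article: an eight-channel DDR4-3200 socket, a 2.25 GHz clock and 16 double-precision flops per core per cycle. The function names are hypothetical.

```python
def peak_memory_bandwidth_gbs(channels: int, mega_transfers: float, bytes_per_transfer: int = 8) -> float:
    """Theoretical peak memory bandwidth in GB/s for one socket."""
    return channels * mega_transfers * bytes_per_transfer / 1000.0


def peak_gflops(cores: int, clock_ghz: float, flops_per_cycle: int) -> float:
    """Theoretical peak double-precision GFLOP/s for one socket."""
    return cores * clock_ghz * flops_per_cycle


def bytes_per_flop(bandwidth_gbs: float, gflops: float) -> float:
    """Machine balance: bytes of memory traffic available per flop."""
    return bandwidth_gbs / gflops


if __name__ == "__main__":
    # Assumed, illustrative socket: 8 channels at 3200 MT/s (not vendor-verified here).
    bandwidth = peak_memory_bandwidth_gbs(channels=8, mega_transfers=3200)
    for cores in (32, 64):
        flops = peak_gflops(cores=cores, clock_ghz=2.25, flops_per_cycle=16)
        ratio = bytes_per_flop(bandwidth, flops)
        print(f"{cores} cores: {ratio:.3f} bytes/flop")
```

With the memory system held constant, doubling the core count halves the bytes available per flop, which is why bandwidth-bound application codes see less benefit from the highest core count parts than compute-bound benchmarks do.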

ARM is still in the mix, with several vendors shipping datacentre and HPC class silicon with future roadmaps that are, on paper at least, broadly competitive with what Intel and AMD are likely to ship (Marvell, Fujitsu, Ampere and AWS, with more to come in the next couple of years – several with heavyweight co-founders and high-profile hires). Perhaps the most interesting recent entrant in the CPU space is AWS with their Graviton processors, which are both performant and very price competitive for many traditional workloads.

It will be fascinating watching the CPU market evolve over the next 12-24 months. Can AMD continue their recent execution and keep to their new product cadence? Will Intel rediscover their mojo and get a real roadmap back on track in time? Will Marvell ever deliver on the early promise of their ThunderX roadmap? Will the Fujitsu A64FX broaden its market appeal and ship to a range of non-government customers? How well will the HPE-Cray marriage work, and can Cray continue to rack up the system wins (largely dependent on how good Slingshot is in reality)?

I suspect there will be more than a few surprises mixed in amongst the more predictable stories.

Roll on the remainder of 2020!

By Dairsie Latimer – Technical Advisor

[1] It’s also the reason that Intel risked the ire of its biggest customers by taking prime status for some of the Exascale pathfinder systems and Aurora – which could be read as an acknowledgement that Cray couldn’t afford to prime because of the relative size of the procurements vs their own cash position.
