Collectively, we at Red Oak Consulting have been building TCO models for clients for decades. As a result of that work and experience, we have created a highly evolved model for TCO analysis, which drills down into all of the deep dark recesses of whole-life cost modelling for HPC and its associated requirements. These include storage; datacentre facilities; power and support costs; human resources; software efficiency metrics and porting costs; acceptance testing, obsolescence and decommissioning; and, last but not least, a very detailed model of the cost of the compute itself. So far, so conventional.
We’ve also spent a lot of time and effort helping clients understand the value of HPC to their organisation. This sounds almost unnecessary, right? Surely they already know its value if they have an HPC capability? But can they quantify it? Can they actually measure the value it returns?
• What is the actual return on their investment (RoI)?
• Are they spending their money on the right thing?
• Do they need more of it?
• Or is there too much?
• When should it be replaced?
• Where should it be?
Many organisations have partial answers to some of these questions, but it’s rare to find one so self-aware that it measures everything needed to answer all of them readily.
Why try to metricate and measure what some would classify as intangibles? Because you can’t manage what you don’t measure. Money is like water: it will find its way out through a myriad of small cracks, and before you know it the bucket is empty. So if you want to know the real costs of your HPC, it’s definitely worth quantifying the value it delivers to the organisation. Without fail, an analysis of this sort is eye-opening.
This segues into the second topic of this blog post, one of the hot industry topics at the moment: where does HPC in the Cloud fit in?
Unsurprisingly, in the past few years we’ve often been asked how best to compare on-premises HPC with various cloud HPC options. Some of these requests arise because, on the face of it, there is a lot of conflicting chatter about how well HPC can be hosted in the Cloud. What’s more, a conventional TCO evaluation is often surprisingly unfavourable to cloud solutions, which seems at odds with much of the sentiment about cloud for other workloads.
There are a fair number of reasons why Cloud comes off poorly today when compared with on-premises solutions, some of which I’ll briefly highlight here. What all our work has shown us, though, is that it’s remarkably useful to really understand how HPC brings value to the business, because that understanding opens up new ways of thinking about how to deliver that value. We’ll get back to that in a moment.
Some of the reasons why a traditional TCO analysis doesn’t show cloud HPC vendors in a favourable light include:
• You work somewhere that really is a centre of HPC excellence. You all eat, sleep and breathe HPC. (Author’s note: yes, there are lots of places that meet that definition, and many centres do an excellent job of procuring, maintaining and running their HPC systems, but there are far more that do not consider HPC central to their mission despite the value it brings to the organisation.)
• If you already have a relatively modern datacentre facility (by which I mean one with a power usage effectiveness, or PUE, of less than 2.0), you are in the third quartile, at least according to the most recent data available, from 2014. With a PUE below 1.5 you are very definitely in the fourth quartile.
The hyperscalers currently operate with PUEs below 1.2 (it is possible to get below 1.1, but that gets stupidly expensive and very esoteric very quickly). Conversely, if you are in the first quartile, you are probably haemorrhaging money on facilities and rapidly approaching the point where you have to decide between pouring more money into the ground or getting out of the datacentre game altogether.
• If you have high levels of baseline utilisation on your HPC (our analysis usually regards anything over 70% as high), you are again already doing better than at least half of HPC facilities.
There are, of course, plenty of HPC centres that routinely sustain higher levels of utilisation, but more often than not they run a limited range of codes and support a fairly homogeneous user community. Job scheduling and queueing times are also interesting metrics to consider, not just absolute utilisation, and they feed back into how to measure the value of HPC to an organisation.
• You run codes that are best described as ‘capability’ rather than ‘capacity’ jobs. You measure your job scaling characteristics not in the hundreds of cores but in the thousands and tens of thousands. You really do need that low-latency, high-bandwidth HDR InfiniBand kit you paid so much for! As a result, your codes probably have higher levels of parallel efficiency than most, and you or your colleagues have spent significant resources optimising and tuning the applications to run on the latest tin.
• You need to run a fast parallel file system (high bandwidth, not necessarily high IOPS).
• You ingest or generate (perhaps both) a lot of data, measured in terabytes and petabytes. This is the iceberg that sinks many a TCO comparison with the cloud.
• You are not an academic facility. Because you are commercial, the cloud vendors know you can pay, so they aren’t giving you breaks on data ingress or egress charges or on storage costs.
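To make the facility, utilisation and data points above a little more concrete, here is a minimal sketch of how PUE (total facility energy divided by IT equipment energy), baseline utilisation and data egress feed into a per-core-hour comparison. It is emphatically not Red Oak Consulting’s TCO model: the class and function names, the straight-line amortisation and every figure in the example run are hypothetical assumptions chosen purely for illustration.

```python
from dataclasses import dataclass, replace

HOURS_PER_YEAR = 8760  # ignores leap years


@dataclass(frozen=True)
class OnPremCluster:
    """Hypothetical, heavily simplified on-premises cost inputs."""
    capex: float           # purchase price of the system
    lifetime_years: float  # amortisation period (straight-line is an assumption)
    annual_opex: float     # staff, support, software etc., excluding power
    it_power_kw: float     # average IT load of the cluster
    pue: float             # total facility energy / IT equipment energy
    price_per_kwh: float   # electricity tariff
    cores: int
    utilisation: float     # fraction of available core-hours actually used, 0 < u <= 1

    def annual_energy_cost(self) -> float:
        # Facility energy = IT energy * PUE, so PUE scales the whole power bill.
        return self.it_power_kw * HOURS_PER_YEAR * self.pue * self.price_per_kwh

    def annual_cost(self) -> float:
        return self.capex / self.lifetime_years + self.annual_opex + self.annual_energy_cost()

    def used_core_hours(self) -> float:
        return self.cores * HOURS_PER_YEAR * self.utilisation

    def cost_per_used_core_hour(self) -> float:
        # Idle capacity still costs money, so low utilisation inflates this figure.
        return self.annual_cost() / self.used_core_hours()


def cloud_annual_cost(used_core_hours: float, price_per_core_hour: float,
                      egress_tb: float, price_per_tb_egress: float) -> float:
    """Hypothetical cloud bill: pay only for core-hours used, plus data egress."""
    return used_core_hours * price_per_core_hour + egress_tb * price_per_tb_egress


if __name__ == "__main__":
    # Every number below is an illustrative placeholder, not measured or quoted data.
    site = OnPremCluster(capex=2_000_000, lifetime_years=5, annual_opex=300_000,
                         it_power_kw=250, pue=1.5, price_per_kwh=0.15,
                         cores=20_000, utilisation=0.8)
    for variant in (site, replace(site, pue=2.0), replace(site, utilisation=0.5)):
        print(f"PUE {variant.pue}, utilisation {variant.utilisation:.0%}: "
              f"{variant.cost_per_used_core_hour():.4f} per used core-hour")

    # Assumes cloud cores deliver the same work per core-hour (no virtualisation tax).
    cloud = cloud_annual_cost(site.used_core_hours(), price_per_core_hour=0.01,
                              egress_tb=500, price_per_tb_egress=50)
    print(f"cloud / on-prem annual cost ratio: {cloud / site.annual_cost():.2f}")
```

Even in this toy form, the two levers behave differently: a worse PUE inflates only the energy term, whereas lower utilisation spreads the entire annual cost over fewer useful core-hours, while the egress term grows directly with the volume of data you move out. The sketch deliberately leaves out much of what a real whole-life model needs, such as storage, interconnect performance, virtualisation overhead and porting effort.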
If you answered yes to more than a couple of these points, then HPC in the Cloud is probably still a few years (or at least one more procurement cycle) away from looking particularly attractive. But however cheaply you think you can buy cycles today, the hyperscalers, buying at their volumes, will always be able to do it more cheaply. It’s just a matter of time: profit margins will be shaved (through more competition between the cloud vendors and consolidation in the HPC vendor ecosystem) and evolutionary pressure will do the rest (Cloud interconnects will continue to improve just as the hypervisor/virtualisation tax decreases), until pretty much all but the top quartile of HPC users find HPC in the Cloud economically attractive.
If the writing wasn’t already on the wall then the HPC OEMs wouldn’t be busy talking up their own flavours of hybrid HPC.
If you answered no to more or less all of these points, then I’m willing to bet that the comparison is already pretty close (and certainly within a factor of two).
Now, what if I told you that there are a couple more ways in which Red Oak Consulting can help you to understand the potential of HPC in the Cloud and to unlock further value that most organisations simply haven’t woken up to yet?
I know that sounds like the intro to the cheesiest of Internet pyramid-selling schemes, but there really are aspects of conventional TCO models that do not show some of the real advantages of Cloud for HPC.
If you want to know more, or just want a simple chat with one of our experts, we will be happy to talk to you about your HPC TCO modelling. As a business, we continue to help many customers evaluate their HPC costs and we would be delighted to help you shine new light on how HPC delivers value to your organisation.
Dairsie Latimer – Technical Advisor
Red Oak Consulting