Introduction
A common topic during many of our projects is the sizing of an HPC cluster. More specifically, we often come across clusters which we can loosely characterise as having a low average utilisation. This may be an intentional design point or an artifact of over-procurement or oversizing.
Reasons why an oversized cluster may be necessary
We have worked with organisations where the cluster has been designed for low utilisation. Depending on the exact requirements, this may necessitate an oversized cluster.
In other cases, underutilisation can be difficult to avoid. For example, some organisations may procure a cluster for capability so that very large jobs can be run. Nevertheless, the majority of jobs may still be small. Other organisations may oversize a cluster to deal with regular periods of increased demand. However, demand may usually be lower than the full capability of the cluster.
Disadvantages of an oversized cluster
I have alluded to some of the advantages an oversized cluster may offer; however, these come with several disadvantages.
The most obvious disadvantage is a potentially lower return on investment. By buying a larger cluster than needed, there will be idle resources not performing any useful work. The value created relative to the amount invested is therefore lower.
To add to this, more resources require more energy to run, leading to increased bills. Idle resources may require 20-25% of peak power to simply be available for use. Furthermore, more nodes generate more heat, requiring a larger investment in the cooling system and greater water usage.
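To give a feel for the scale of the idle power cost, here is a minimal sketch of the arithmetic. The 20-25% idle fraction comes from the text above; the node count, peak power per node and electricity price are purely illustrative assumptions, and cooling overhead is ignored.

```python
HOURS_PER_YEAR = 24 * 365


def annual_idle_energy_kwh(idle_nodes, peak_power_watts, idle_fraction):
    """Energy (kWh) drawn in one year by nodes that are powered on but idle.

    idle_fraction is the share of peak power an idle node still draws,
    e.g. 0.20 to 0.25 as quoted in the article.
    """
    idle_watts = idle_nodes * peak_power_watts * idle_fraction
    return idle_watts * HOURS_PER_YEAR / 1000.0


def annual_idle_cost(idle_nodes, peak_power_watts, idle_fraction, price_per_kwh):
    """Yearly electricity cost of keeping idle nodes available."""
    energy = annual_idle_energy_kwh(idle_nodes, peak_power_watts, idle_fraction)
    return energy * price_per_kwh


if __name__ == "__main__":
    # Illustrative assumptions only: 100 idle nodes, 500 W peak per node,
    # and an electricity price of 0.30 per kWh.
    for fraction in (0.20, 0.25):
        cost = annual_idle_cost(100, 500, fraction, 0.30)
        print(f"idle fraction {fraction:.0%}: {cost:,.0f} per year")
```

Real figures will depend on the hardware, the power management settings in use and the local energy tariff.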
More idle machines may also lead to increased carbon emissions (assuming 100% renewable energy is not used). Even if energy is generated cleanly, there is the ethical question of why energy is being wasted on essentially nothing useful and not diverted to something more beneficial.
Finally, a larger cluster could also necessitate a larger management overhead.
This could consume maintenance time that might otherwise be spent on other work (such as resolving user tickets). Additional staff could also be required to manage the cluster.
Make the most of what you have
With some organisations, we have found users believing that a cluster is undersized because of long job waiting times.
However, on closer inspection, we have found that the cluster is in fact oversized, with long queuing times often being a result of suboptimal configuration and resource management.
This brings up another important point. Attempting to solve the problem of increased queuing times by buying more resources without carefully investigating the causes does not incentivise good cluster management.
Put another way, the aim should be to first get the most out of what we have before buying more.
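As a starting point for that kind of investigation, average utilisation and mean waiting time can be compared side by side. The sketch below assumes a hypothetical job record with a core count and submit, start and end times in hours; in practice these values would come from the scheduler's accounting data, and the names used here are illustrative only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Job:
    """Hypothetical job record; times are hours since the start of the window."""
    cores: int
    submit: float
    start: float
    end: float


def utilisation(jobs: List[Job], total_cores: int, window_hours: float) -> float:
    """Fraction of available core-hours actually used in the window.

    Assumes every job runs entirely inside the window.
    """
    if total_cores <= 0 or window_hours <= 0:
        raise ValueError("total_cores and window_hours must be positive")
    used_core_hours = sum(j.cores * (j.end - j.start) for j in jobs)
    return used_core_hours / (total_cores * window_hours)


def mean_wait_hours(jobs: List[Job]) -> float:
    """Average time jobs spent queuing before they started."""
    if not jobs:
        return 0.0
    return sum(j.start - j.submit for j in jobs) / len(jobs)
```

Low utilisation combined with long waits suggests that configuration or scheduling policy, rather than a lack of hardware, may be the place to look first.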
What if oversizing is unavoidable?
In cases where oversizing is unavoidable due to business requirements, it may be worth taking some steps to minimise underutilisation. For example, it may prove fruitful to provide paid access to the cluster for users external to the organisation.
With some customers, we have noticed that exclusive ownership of resources is allowed on clusters. This is often due to funding reasons.
Such resources are commonly underutilised, and we would generally discourage this arrangement except in specific cases.
Accessing additional capacity only when required
In cases where cluster utilisation is very high at specific periods of time or for rare large workloads, there are other options available to access necessary capacity when needed. These usually involve paying for time-limited access to more resources.
For example, this may take the form of accessing HPC capacity from cloud providers, often referred to as ‘cloud bursting’.
Per unit of time, this is likely to be more expensive than the on-premises cluster but, because it is used only when necessary, it may work out cheaper in the long term.
This may also allow easy access to newer technology, potentially reducing time to solution.
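The cost trade-off described above can be sketched as a simple break-even calculation. This is a deliberately simplified model: it assumes a fixed annual cost per on-premises node (amortised purchase plus running costs) and a flat cloud price per node-hour, and it ignores data transfer, storage and staff costs. All figures and function names are hypothetical.

```python
def break_even_hours(onprem_annual_cost_per_node, cloud_price_per_node_hour):
    """Hours of use per year above which owning the extra node is cheaper."""
    if cloud_price_per_node_hour <= 0:
        raise ValueError("cloud price must be positive")
    return onprem_annual_cost_per_node / cloud_price_per_node_hour


def cheaper_option(burst_hours_per_year, onprem_annual_cost_per_node,
                   cloud_price_per_node_hour):
    """Compare yearly cloud spend for the burst hours with owning the node."""
    cloud_cost = burst_hours_per_year * cloud_price_per_node_hour
    if cloud_cost < onprem_annual_cost_per_node:
        return "cloud"
    return "on-premises"
```

For example, with an assumed annual on-premises cost of 6000 per node and a cloud price of 3 per node-hour, the break-even point is 2000 hours per year; a node needed for only a few hundred hours a year would be cheaper to rent under these assumptions.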
Conclusion
The right sizing of a cluster will depend on your business requirements. There are cases where requirements may necessitate oversizing a cluster.
However, in most cases, we’d encourage a restrained approach to resource utilisation and a focus on using resources efficiently.
Where additional capacity is required for short amounts of time, options such as cloud bursting are available until additional permanent capacity becomes an absolute necessity.
Manveer Munde
Principal Consultant
Red Oak Consulting