Deployment and Management from the Front Line


Infrastructure-as-code is an inherent advantage of using the cloud. Simply put, it is the ability to describe the new infrastructure required for your cloud deployment in your chosen tool, e.g. Bicep, Terraform, Pulumi, or one of the many others; that description can then be read and deployed into your environment.

In the past, an upgrade to a High Performance Computing (HPC) cluster would often involve some form of procurement and a lead time for delivery and installation. On the cloud, in contrast, this is often a case of awaiting the general availability of the desired infrastructure and editing or adding a few lines of code. This allows access to the latest compute node technology in a matter of minutes.
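As an illustration, here is a hypothetical Terraform fragment (assuming the azurerm provider; all names and values are illustrative, and required arguments such as credentials and network profile are omitted) in which moving the cluster to a newer compute SKU is a one-line edit:

```hcl
# Illustrative fragment: upgrading the compute nodes is a one-line change.
resource "azurerm_linux_virtual_machine_scale_set" "compute" {
  name                = "hpc-compute"
  resource_group_name = azurerm_resource_group.hpc.name
  location            = azurerm_resource_group.hpc.location

  # Previously: sku = "Standard_HB120rs_v2"
  sku       = "Standard_HB120rs_v3"  # adopt the newer HPC SKU once generally available
  instances = 16

  # (admin credentials, image reference and network profile omitted for brevity)
}
```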




Configuration-as-code involves configuring nodes after deployment using your chosen tool, e.g. Ansible, Chef, Puppet, or one of the many others. It is a practice carried forward from on-premises cluster management, but one which has been expanded upon in the cloud.
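A minimal, hypothetical Ansible playbook gives the flavour (the host group, package and mount source are all illustrative):

```yaml
# Illustrative playbook: configure compute nodes after they are provisioned.
- hosts: compute
  become: true
  tasks:
    - name: Install the MPI runtime
      ansible.builtin.package:
        name: openmpi
        state: present

    - name: Mount the shared scratch filesystem
      ansible.posix.mount:
        path: /scratch
        src: storage.example.com:/scratch
        fstype: nfs
        state: mounted
```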

Tools such as Packer allow configuration-as-code to be used to create preconfigured custom images for cluster nodes. This reduces provisioning time and allows swift rollback and roll-forward of image versions.
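A sketch of a Packer template (HCL2) shows how the same configuration-as-code can be baked into a versioned image; all names, the base image and the region are hypothetical:

```hcl
# Illustrative Packer template: bake node configuration into a versioned image.
source "azure-arm" "hpc_node" {
  managed_image_name                = "hpc-node-v1.4.0"  # versioned for rollback/roll-forward
  managed_image_resource_group_name = "hpc-images"
  os_type                           = "Linux"
  image_publisher                   = "Canonical"
  image_offer                       = "0001-com-ubuntu-server-jammy"
  image_sku                         = "22_04-lts"
  location                          = "uksouth"
  vm_size                           = "Standard_D4s_v3"
}

build {
  sources = ["source.azure-arm.hpc_node"]

  # Reuse the same configuration-as-code that is applied to live nodes.
  provisioner "ansible" {
    playbook_file = "./compute-node.yml"
  }
}
```

Because each image carries a version in its name, rolling back is simply a matter of pointing the cluster at the previous image.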

Perhaps more importantly, this can save on cost by cutting the time needed to prepare a node before carrying out work on it. Any changes which are best not baked into an image can be carried out at node boot through tools such as cloud-init.
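A small, hypothetical cloud-init user-data fragment illustrates the kind of last-minute, per-node configuration that belongs at boot rather than in the image (hostname, service and paths are illustrative):

```yaml
#cloud-config
# Illustrative per-node boot configuration, applied on top of the baked image.
hostname: compute-001
write_files:
  - path: /etc/profile.d/site.sh
    content: |
      export SCRATCH=/scratch/$USER
runcmd:
  - systemctl restart slurmd   # rejoin the scheduler once the hostname is set
  - mount -a                   # pick up site-specific fstab entries
```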

Tips for Cluster Deployment and Management

To further streamline cluster administration on the cloud, the various infrastructure-as-code and configuration-as-code scripts, even when written in different languages, can be stitched together into pipelines using CI/CD tools such as GitHub Actions, enabling the whole process to be carried out with a few simple clicks.
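Sketched as a GitHub Actions workflow, the stitching might look like this (job names, directory layout and playbook are hypothetical):

```yaml
# Illustrative workflow: infrastructure first, then node configuration.
name: deploy-cluster
on:
  workflow_dispatch:   # a single click in the GitHub UI starts the run

jobs:
  infrastructure:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: terraform -chdir=infra init && terraform -chdir=infra apply -auto-approve

  configuration:
    needs: infrastructure   # only configure nodes once they exist
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ansible-playbook -i inventory compute-node.yml
```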

The above approaches aid substantially in cloud HPC cluster management, but my time managing such clusters has taught me to exercise caution when utilising these tools.

A One-Click-Deploy Pipeline

As attractive as a one-click-deploy pipeline may sound, in reality, software is constantly changing and scripts which may be working one day may stop functioning the next. There are several ways to deal with this depending on context.

Firstly, getting into the habit of baking software updates into images instead of installing them at system boot ensures that there is always a tested, working version of the package on an image.

Secondly, it is often worth breaking a one-click deployment down into a ‘multi-click’ deployment. This is one way of ensuring that an initial failure doesn’t cascade into the steps which depend on it. In addition, this method naturally enables the verification of each stage before moving on to the next.
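The ‘multi-click’ idea can be sketched in a few lines of Python: each stage runs only after the previous one has been verified, so a failure cannot cascade. Stage names and checks here are entirely hypothetical; real stages would invoke Terraform, Ansible and so on.

```python
from typing import Callable

# A stage pairs a deployment action with a verification check.
Stage = tuple[str, Callable[[], None], Callable[[], bool]]

def run_pipeline(stages: list[Stage]) -> list[str]:
    """Run stages in order, halting at the first failed verification
    so a broken step cannot cascade into its dependents."""
    completed = []
    for name, deploy, verify in stages:
        deploy()
        if not verify():
            print(f"Stage '{name}' failed verification; halting pipeline.")
            break
        completed.append(name)
    return completed

# Demo: the second stage fails verification, so the third never runs.
state = {"network": False, "nodes": False}
stages: list[Stage] = [
    ("network", lambda: state.update(network=True), lambda: state["network"]),
    ("nodes", lambda: state.update(nodes=True), lambda: False),  # simulated failure
    ("scheduler", lambda: None, lambda: True),
]
completed = run_pipeline(stages)
```

In a real pipeline each verification would be a health check (nodes reachable, scheduler responding) rather than a lambda, but the control flow is the same.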

Thirdly, building error checking into pipelines, via linting tools and mandatory approval by colleagues before deployment proceeds, increases the likelihood of a successful deployment.

On the other hand, I find myself increasingly suspicious of new software versions. I have observed on a number of occasions that a release is, after a while, followed up by several minor releases containing bug fixes. Software developers, despite their best intentions, have deadlines to work towards and milestones to reach, and this inevitably can result in bugs.

In my opinion, it’s prudent to let some time pass before upgrading to the latest version of software. This doesn’t just apply to the software used to run calculations; I have encountered similar issues with scheduler versions and the very cloud orchestration frameworks which are responsible for node provisioning.
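One way to encode this ‘let it soak’ rule is to adopt a release only once it has been public for a minimum number of days; the 30-day threshold below is an illustrative policy choice, not a recommendation from the article.

```python
from datetime import date, timedelta

def ready_to_adopt(release_date: date, today: date, soak_days: int = 30) -> bool:
    """Return True once a release has been public for at least soak_days,
    giving early bug-fix minor releases time to appear first."""
    return today - release_date >= timedelta(days=soak_days)

# A release from last week is still soaking; one from two months ago is not.
recent_ok = ready_to_adopt(date(2024, 6, 1), today=date(2024, 6, 8))
older_ok = ready_to_adopt(date(2024, 4, 1), today=date(2024, 6, 8))
```

A check like this can sit in the deployment pipeline itself, turning the waiting period from a habit into an enforced gate.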

Finally, with access to such tools, one can often find oneself eager to deploy the latest technology quickly. Utilising a DevTest cluster to implement and test such changes is essential, but do yourself a favour and resist the urge to move swiftly to production! Your DevTest cluster is exactly what its name entails.

However, despite all these measures, some issues only become apparent when many calculations are run by several users. Ideally, this is something one would want to simulate in a DevTest environment.

On the cloud, though, the costs associated with doing this frequently may not be insignificant, so a carefully considered approach is necessary. This will depend on a number of factors, such as the available budget, the size of your HPC administration team, and the number of changes you intend to push to production.


Manveer Munde
Principal Consultant
Red Oak Consulting