
Building Cloud HPC the Right Way: Processes That Matter

At Red Oak I have had the privilege of working with numerous organisations to assist with expanding their HPC footprint into the cloud. With an ever-evolving set of resource types available from cloud providers, technical implementation is rarely the issue. Instead, in my experience, defining the practical processes and governance through which the HPC service will operate is the main consumer of time.

In this blog, I hope to highlight some common process questions I have encountered that should be considered when designing a cloud HPC service (though many of these are also valid for on-premises design).

Access to software and packages

Traditionally, on-premises HPC deployments are often quite siloed from other IT infrastructure, with a priority on productivity and less emphasis on security where it may reduce agility. This can make it far easier to obtain relevant packages and software quickly.

For most organisations I have worked with, moving to cloud HPC requires engaging with a cloud infrastructure team, which is almost always integrated with enterprise IT.

For the most mature organisations, this integration comes with an enhanced security posture, leading to the need to balance productivity with a reasonable set of security requirements.

These requirements almost always take an ‘access by exception’ approach to internet access to adapt to ever-more-sophisticated security threats.

The following are some process questions worth considering for this:

  1. Do files need to be quarantined, scanned and then made available? By what mechanism can scanned files be moved to the HPC environment? (A sketch of one approach follows this list.)

  2. If a firewall is in place, how can an internet destination be requested for whitelisting?

  3. How long will files be kept before being removed or archived? (More storage used in the cloud means more cost.)
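For question 1, the quarantine step can be as simple as a scheduled script. The following is a minimal sketch, assuming ClamAV’s clamscan is available; the quarantine and HPC-visible paths are purely illustrative:

    import shutil
    import subprocess
    from pathlib import Path

    # Illustrative paths: an inbound quarantine area and a directory
    # that is exported to the HPC environment.
    QUARANTINE = Path("/quarantine/inbound")
    APPROVED = Path("/shared/approved")

    for item in QUARANTINE.iterdir():
        if not item.is_file():
            continue
        # clamscan exits 0 for a clean file and 1 when a threat is found.
        result = subprocess.run(["clamscan", "--no-summary", str(item)])
        if result.returncode == 0:
            shutil.move(str(item), APPROVED / item.name)  # release to users
        else:
            print(f"{item} failed scanning; kept in quarantine for review")

However the mechanism is implemented, the key is that it is documented and predictable, so users know how long a requested file will take to arrive.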

Environments

At minimum, development/testing and production environments are essential.

In a world of infrastructure as code, even a small change in code can have platform-breaking consequences.
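One common safeguard is to gate every change behind a reviewed plan. Below is a minimal sketch of such a gate, assuming Terraform as the IaC tool; terraform plan’s -detailed-exitcode flag returns 2 when the plan contains changes:

    import subprocess
    import sys

    # Produce a plan without applying it; -detailed-exitcode distinguishes
    # "no changes" (0), "error" (1) and "changes present" (2).
    plan = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-out=change.tfplan"]
    )

    if plan.returncode == 0:
        print("No changes detected; nothing to apply.")
    elif plan.returncode == 2:
        # Changes detected: require explicit sign-off before applying.
        if input("Plan contains changes. Apply? [y/N] ").strip().lower() == "y":
            subprocess.run(["terraform", "apply", "change.tfplan"], check=True)
    else:
        sys.exit("terraform plan failed; aborting.")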

How different environments are implemented will have a lot to do with organisational policy and cost implications.

A staging/pre-production environment is also common for on-premises HPC and may be mandated by organisational policy.

Use of the cloud permits deployment of multi-regional infrastructure.

This may be necessary to access specific resources, for optimal user experience, for data sovereignty purposes or even for disaster recovery purposes.

Each region could itself require its own staging and production environments.

Some questions worth considering when thinking about environments:

  1. Does organisational policy mandate specific environment requirements around applications/platforms available to users? (e.g. a staging environment is needed)

  2. Do the environments need to be completely isolated from one another or can they use the same network?

    In an HPC context, sharing a network allows resources such as performance storage, often one of the most expensive components of a cloud HPC environment, to be shared between environments.

    However, this comes at the cost of potentially sensitive data being more exposed.

  3. Are multiple environments required in different regions?

  4. How is data transferred between different regional environments? (A sketch of one approach follows this list.)
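For question 4, the answer is often provider tooling plus a defined process around it. As a minimal sketch, assuming AWS S3 as the backing object store and hypothetical bucket names:

    import subprocess

    # Hypothetical buckets backing two regional environments.
    SOURCE = "s3://hpc-data-eu-west-1"
    DEST = "s3://hpc-data-us-east-1"

    # 'aws s3 sync' copies only new or changed objects, which keeps
    # cross-region transfer (and therefore egress cost) to a minimum.
    subprocess.run(["aws", "s3", "sync", SOURCE, DEST], check=True)

Whatever the tool, the process questions remain the same: who triggers the transfer, how often, and who pays for the egress.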

Management of upgrades

Hardware

On-premises HPC is usually cyclic, with one major upgrade every 3-5 years and more minor upgrades in between.

Maintenance periods are communicated in advance and implemented with user access restricted or unavailable during the downtime.
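The same discipline carries over to the cloud. As a small illustration, the sketch below takes a set of nodes out of service ahead of a maintenance window, assuming a Slurm scheduler (the node range and reason text are illustrative):

    import subprocess

    # Drain the nodes: running jobs are allowed to finish,
    # but no new jobs will be scheduled onto them.
    subprocess.run(
        [
            "scontrol", "update",
            "NodeName=compute[001-064]",  # illustrative node range
            "State=DRAIN",
            "Reason=Scheduled maintenance",
        ],
        check=True,
    )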

In many cases, deriving the most value out of cloud HPC necessitates using newly available infrastructure promptly.

This makes hardware updates more common and potentially more disruptive.

Questions worth considering for this:

  1. How can new infrastructure be requested?

  2. How soon should new infrastructure be adopted?

  3. Should a regular maintenance window be set for upgrades, or should upgrades be done ad hoc (or a combination of both)?

    Additionally, could a parallel running approach be used for more substantial upgrades (e.g. a major infrastructure update)?

  4. How long should older hardware be supported by the platform?

Software

In my experience, non-security software upgrades should be approached with more caution than hardware upgrades.

It is not uncommon to see an update released one day and another released shortly after (on the order of days) to address some unforeseen bugs.

This is often the result of time pressure on developers.

Major updates also remove some features as well as add new ones. For research applications, it is not uncommon for users to require continued access to older software for reproducibility or legacy-feature reasons.

Furthermore, some users with many years of experience with a research application may be more qualified than the admin to build it.

In this case, it may be worth providing such users with an environment in which to build and test the application.

Package managers such as Conda, EasyBuild and Spack greatly simplify software installation and management.

Even a single HPC package can have hundreds of dependencies, which package managers can easily keep track of.

I believe package managers should always be used where possible; however, where they are not the norm, some gentle encouragement may be needed to discourage old habits.
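As a small example, the sketch below uses Spack to install an application and then list everything the install pulled in; the package name is just an illustrative entry from Spack’s built-in repository:

    import subprocess

    # Install GROMACS; Spack resolves and builds the full dependency tree.
    subprocess.run(["spack", "install", "gromacs"], check=True)

    # List the installed package together with its dependencies,
    # often a very long list for a single application.
    subprocess.run(["spack", "find", "--deps", "gromacs"], check=True)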

Process questions worth considering around software include:

  1. How can new software be requested?

  2. How soon should software be upgraded?

  3. Which package manager/s should be used?

  4. Should select users be able to compile software? Where should they compile it?

  5. How long should older software be supported by the platform?

Conclusion

Thinking about process early, together with architectural design, is key. You’d be surprised by the number of organisations that do not consider these steps soon enough, leading to delays in implementation and an inefficient HPC service later on.

This article only scratches the surface of what is probably the most complex part of cloud HPC service design, but I hope it has given you somewhere to get started.

Organisations exploring cloud HPC often benefit from independent guidance when structuring environments and processes, an area where Red Oak has extensive experience.

 

Manveer

 

Manveer Munde
Principal Consultant
Red Oak Consulting
