For many organisations the movement of their on-premises high performance computing (HPC) solution to the cloud is an increasingly attractive proposition. Technical benefits of relatively unlimited capacity and quick access to the latest hardware are combined with cost savings in areas such as datacentre maintenance and lengthy procurement cycles. Overall, a careful analysis of organisation-specific requirements can reveal whether a move to the cloud is appropriate. 

To list but a few, caveats of the cloud relative to on-premises infrastructure include the high cost of high performance storage, relatively low networking speeds and the need for an organisation to make structural changes to accommodate the cloud. In the following sections I will derive some estimates for high performance storage costs on the cloud versus their on-premises counterparts, based on publicly available information. To keep things simple, I will concentrate on scratch file systems, which somewhat biases the argument in favour of bandwidth.

Perlmutter is a recently installed state-of-the-art on-premises system at the Berkeley Lab. The Perlmutter scratch file system is composed entirely of flash memory and has a usable capacity of 35 PB with a combined peak bandwidth of 5 TB/s. The contract for this system was worth approximately 146 million USD [1] (with support and warranty included), with approximately 10-15% of the total cost spent on the storage system [2]. This indicates less than 22 million USD spent on the scratch file system (assuming storage accounted for 15% of the total). Vendor-sponsored research suggests that, at the high end, lost productivity and revenue due to unscheduled file system outages can cost up to 12 million USD over a typical five-year period of operation [3]. Adding these numbers gives an upper-bound cost of approximately 34 million USD (accounting for costs due to lost productivity and revenue) for using Perlmutter’s scratch storage system for five years. As we will see, this is more cost-effective than a similar high performance file system on the cloud.
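The upper-bound arithmetic above can be written out as a short Python sketch. It uses only the figures quoted in the text; the function and variable names are illustrative and not taken from any of the cited sources.

```python
# Figures quoted in the text, in millions of USD.
CONTRACT_MUSD = 146.0      # total Perlmutter contract value [1]
STORAGE_SHARE = 0.15       # upper end of the 10-15% storage share [2]
OUTAGE_COST_MUSD = 12.0    # high-end outage cost over five years [3]


def five_year_upper_bound(contract_musd, storage_share, outage_musd):
    """Return (storage purchase cost, purchase plus outage cost) in M USD."""
    storage_musd = contract_musd * storage_share
    return storage_musd, storage_musd + outage_musd


storage_musd, total_musd = five_year_upper_bound(
    CONTRACT_MUSD, STORAGE_SHARE, OUTAGE_COST_MUSD
)
print(f"Scratch storage purchase: {storage_musd:.1f} M USD")
print(f"Five-year upper bound:    {total_musd:.1f} M USD")
```

This gives roughly 21.9 and 33.9 million USD, which the text rounds to 22 and 34 million USD.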

The following table summarises the approximate monthly costs of four possible solutions to cloud high performance storage when aiming to match the Perlmutter scratch system’s combined peak bandwidth of 5 TB/s. It should be noted that in all cases such a large file system is not available by default on the cloud; the estimated costs are obtained by taking the monthly pay-as-you-go (PAYG) costs of available storage units and duplicating those units until the required performance is reached. The prices are correct as of 13/09/2021 and do not take into account backup costs, discounts due to reserved capacity or any extra taxes. For BeeGFS, full networking efficiency (10 Gbps) is assumed.

Vendor  Solution  Cloud Region  Storage Space Available (to nearest whole number)  Estimated Cost per Month in USD for 5 TB/s combined peak bandwidth  Assumptions
Amazon  AWS FSx for Lustre   London eu-west-2  27 PB  9,404,081  Persistent SSD Scratch, 200 MBps/TiB throughput, no backup 
BeeGFS  AWS BeeGFS  Ireland eu-west-1  35 PB  4,707,116  Recommended c4.8xlarge nodes with 8750 GB GP2 SSD EBS volumes, 10 Gbps network bandwidth per node
NetApp  Azure NetApp Files Ultra Tier  West Europe  41 PB  14,975,786  128 MiBps/TiB throughput 

Table 1: Cost per month in USD of various cloud performance storage options. For AWS FSx for Lustre and Azure NetApp Files, the storage space available is the minimum necessary to achieve a combined peak bandwidth of 5 TB/s.  
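As a rough check on the storage space column, the following Python sketch derives the capacity each option needs for 5 TB/s from the throughput assumptions listed in the table. It is a reconstruction under stated assumptions and not necessarily the exact method used for the table: MBps is read as 10^6 bytes per second, MiBps as 2^20 bytes per second, and each BeeGFS node is assumed to carry 8750 GB of EBS storage in total and to run at its full 10 Gbps. All function names are illustrative.

```python
import math

TIB = 2**40          # bytes in one tebibyte
MIB = 2**20          # bytes in one mebibyte
PB = 1e15            # bytes in one (decimal) petabyte

TARGET_BW = 5e12     # 5 TB/s target, in bytes per second


def capacity_for_bandwidth_pb(target_bw, bw_per_tib):
    """Capacity (PB) needed when throughput scales with provisioned TiB."""
    tib_needed = target_bw / bw_per_tib
    return tib_needed * TIB / PB


def beegfs_sizing(target_bw, node_gbps=10, volume_gb=8750):
    """Node count and capacity (PB) when each node is network-limited.

    Assumption: one node contributes volume_gb of storage in total.
    """
    node_bw = node_gbps * 1e9 / 8            # bits/s -> bytes/s
    nodes = math.ceil(target_bw / node_bw)
    return nodes, nodes * volume_gb * 1e9 / PB


fsx_pb = capacity_for_bandwidth_pb(TARGET_BW, 200e6)          # 200 MB/s per TiB
netapp_pb = capacity_for_bandwidth_pb(TARGET_BW, 128 * MIB)   # 128 MiB/s per TiB
nodes, beegfs_pb = beegfs_sizing(TARGET_BW)

print(f"FSx for Lustre:     {fsx_pb:.1f} PB")
print(f"Azure NetApp Files: {netapp_pb:.1f} PB")
print(f"BeeGFS:             {nodes} nodes, {beegfs_pb:.1f} PB")
```

Under these unit assumptions the sketch yields about 27.5 PB, 41.0 PB and 35.0 PB (4000 BeeGFS nodes), in line with the 27 PB, 41 PB and 35 PB in Table 1.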

Among the cloud options, BeeGFS on AWS provides the best value for money. If the 34 million USD for the Perlmutter performance file system is spread across five years, this results in a cost of around 0.57 million USD per month ((22 + 12) million USD / 60 months ≈ 0.57 million USD). This is around 8 times less than the cheapest option in Table 1! One reason for this is network bandwidth limitations on the cloud. The virtual machines chosen for BeeGFS cannot exceed 10 Gbps, whereas Perlmutter can reach bandwidths of 200 Gbps per storage node. The cloud solutions therefore require far more storage nodes to reach the required combined bandwidth.
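The comparison can be reproduced with a few lines of Python. The costs are the figures from Table 1 and the on-premises upper bound from the text; the node-count helper is an idealised illustration that assumes every node sustains its full network line rate, so it is not a statement about Perlmutter's actual node count.

```python
import math

MONTHS = 60
ON_PREM_5Y_USD = (22 + 12) * 1e6    # storage purchase + outage upper bound

# Monthly PAYG estimates from Table 1 (USD).
CLOUD_MONTHLY_USD = {
    "AWS FSx for Lustre": 9_404_081,
    "BeeGFS on AWS": 4_707_116,
    "Azure NetApp Files Ultra": 14_975_786,
}

on_prem_monthly = ON_PREM_5Y_USD / MONTHS
ratios = {name: cost / on_prem_monthly for name, cost in CLOUD_MONTHLY_USD.items()}

print(f"On-premises: {on_prem_monthly / 1e6:.2f} M USD per month")
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{name}: {ratio:.1f}x the on-premises monthly cost")


def nodes_at_line_rate(target_gbytes_per_s, node_gbps):
    """Nodes needed if each one delivers its full network bandwidth."""
    return math.ceil(target_gbytes_per_s * 8 / node_gbps)


cloud_nodes = nodes_at_line_rate(5000, 10)      # 10 Gbps per node
fast_nodes = nodes_at_line_rate(5000, 200)      # 200 Gbps per node
print(f"Nodes for 5 TB/s: {cloud_nodes} at 10 Gbps vs {fast_nodes} at 200 Gbps")
```

The cheapest option comes out at roughly 8.3 times the on-premises monthly cost, and the twentyfold difference in per-node bandwidth translates directly into a twentyfold difference in node count.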

Based on this, if your workloads are particularly I/O heavy, the benefits of a cloud-based solution seem dramatically reduced. At the time of writing, a national lab such as Berkeley would see less clear-cut gains from a purely cloud-based scratch storage system. In such an environment the file storage system tends to be more fully utilised at any one time than in a smaller environment, such as that of a small to medium-sized university.

In our experience, HPC services in such smaller environments tend to overprovision their storage. The following table shows the estimated cloud performance storage costs corresponding to an on-premises HPC system in a small/medium university environment, aiming to achieve a typical combined peak bandwidth of 10 GB/s. Based on our experience in this sector, the typical size of an on-premises scratch storage system of this type would be around 1 PB and cost around 150,000 USD with no more than 30% annual maintenance costs. Typically, the system would have around 200 users.

Vendor  Solution  Cloud Region  Storage Space Available (to nearest whole number)  Estimated Cost per Month in USD for 10 GB/s combined peak bandwidth  Assumptions 
Amazon  AWS FSx for Lustre   London eu-west-2  50 TB  18,808  Persistent SSD, 200 MBps/TiB throughput, no backup 
BeeGFS  AWS BeeGFS  Ireland eu-west-1  70 TB  10,967  Recommended c4.8xlarge nodes with 8750 GB GP2 SSD EBS volumes, 10 Gbps network bandwidth per node
NetApp  Azure NetApp Files Ultra Tier  West Europe  82 TB  29,952  128 MiBps/TiB throughput 

Table 2: Cost per month in USD of various cloud performance storage options. For AWS FSx for Lustre and Azure NetApp Files, the storage space available is the minimum necessary to achieve a combined peak bandwidth of 10 GB/s. 

With roughly an order of magnitude less storage space than the 1 PB on-premises system, the cheapest cloud option is an improvement on before but still costs more than three times as much as the on-premises storage system, assuming the latter is kept for five years (($150,000 + 0.3 × $150,000) / 60 = $3,250 per month).

However, on the upside, a cloud-based solution comes with far more flexibility. If the file systems were constantly used at peak performance around the clock, the on-premises storage system would provide a clear advantage. In reality, utilisation varies with the time of day and the time of year. As a result, excess storage on the cloud can be shed when it is no longer required or moved to lower cost cloud storage tiers. Consequently, cost savings can be made on the above cloud prices which are not conceivable on-premises. PAYG prices are also the most expensive way to buy cloud storage: reserved capacity trades flexibility for considerable discounts. Taking all of this into account, for a smaller HPC environment a cloud-based scratch storage system begins to compete with its on-premises counterpart, and when combined with some of the clear benefits of cloud, it is often a case of ‘when’ to migrate to the cloud rather than ‘whether’.
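To make the utilisation argument concrete, here is a minimal Python sketch using the figures from Table 2 and the on-premises formula in the text (which applies the 30% maintenance figure once over the five years). The break-even calculation is a hypothetical simplification that is not part of the original estimates: it assumes cloud cost scales linearly with the fraction of time the storage is provisioned at the PAYG price, and it ignores data movement and lower-tier storage costs.

```python
PURCHASE_USD = 150_000
MAINTENANCE_FRACTION = 0.30   # applied once, as in the formula in the text
MONTHS = 60

# Monthly PAYG estimates from Table 2 (USD).
CLOUD_MONTHLY_USD = {
    "AWS FSx for Lustre": 18_808,
    "BeeGFS on AWS": 10_967,
    "Azure NetApp Files Ultra": 29_952,
}

on_prem_monthly = PURCHASE_USD * (1 + MAINTENANCE_FRACTION) / MONTHS


def break_even_fraction(cloud_monthly, on_prem_monthly):
    """Fraction of time cloud storage can stay provisioned before it
    costs more than on-premises (hypothetical linear-cost assumption)."""
    return min(1.0, on_prem_monthly / cloud_monthly)


break_even = {
    name: break_even_fraction(cost, on_prem_monthly)
    for name, cost in CLOUD_MONTHLY_USD.items()
}

print(f"On-premises: {on_prem_monthly:,.0f} USD per month")
for name, cost in CLOUD_MONTHLY_USD.items():
    print(
        f"{name}: {cost / on_prem_monthly:.1f}x on-premises, "
        f"break-even at {break_even[name]:.0%} provisioned time"
    )
```

Under these assumptions, the BeeGFS option would match the on-premises monthly cost if it only needed to be provisioned for roughly 30% of the time; reserved-capacity discounts would shift that threshold further in the cloud's favour.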

[1] https://www.datacenterdynamics.com/en/news/lawrence-berkeley-install-perlmutter-supercomputer-featuring-crays-shasta-system/ 

[2] https://youtu.be/2ibUkFOeyN4 

[3] https://www.nextplatform.com/2020/05/06/unveiling-the-hidden-costs-of-hpc-storage/ 

Manveer Munde
Senior Consultant