Quickstart II: Set up your GCP HPC Cluster
Warning
A running cluster (even idle) costs approximately $18/day for Filestore plus the always-on controller and login VMs. Each H4D compute node adds approximately $10/hour while running. If you forget to stop the cluster, a month of idle time alone is approximately $540.
This guide walks you through deploying an HPC cluster on Google
Cloud that can run GCHP at production scale. It assumes you have
completed Quickstart I: Prepare Your GCP Environment: your project is created
and billed, with IAM, APIs, and quotas in place, and the local
toolchain (gcloud, terraform, gcluster) is installed.
Workflow
| Step | What it does | Time | Cost |
|---|---|---|---|
| 1 | Write the cluster blueprint | 5 min | $0 |
| 2 | Deploy the cluster | 10-15 min | starts ~$18/day baseline |
| 3 | (Optional) Add Falcon RDMA networking | 2 min | $0 |
| 4 | Boot compute nodes from the GCHP image | 90 s/node | $10/hr/node while up |
| 5 | Build GCHP | 30-60 min | compute time |
| 6 | Run GCHP | varies | as priced |
1. Create an HPC Cluster Blueprint
A blueprint is a YAML file that Cluster Toolkit converts to Terraform modules and applies. It defines:
A regional VPC and subnet
A Filestore NFS volume mounted at /shared
One or more compute partitions (machine types, max node count)
A controller VM running slurmctld
A login VM where users SSH in
Save the following as gchp-cluster.yaml. Replace <PROJECT_ID>
with your project ID from Quickstart I: Prepare Your GCP Environment. The
other values are sensible defaults that you can leave unchanged.
blueprint_name: gchp-cluster
vars:
  project_id: <PROJECT_ID>
  deployment_name: gchpslurm
  region: us-central1
  zone: us-central1-a
deployment_groups:
  - group: primary
    modules:
      # 1. Networking
      - id: network
        source: modules/network/vpc
      # 2. NFS shared file system
      - id: homefs
        source: community/modules/file-system/filestore
        use: [network]
        settings:
          filestore_tier: BASIC_HDD
          size_gb: 1024
          local_mount: /shared
      # 3. Compute partition - c2-standard-60 baseline
      - id: c2_nodeset
        source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
        use: [network]
        settings:
          node_count_dynamic_max: 10
          machine_type: c2-standard-60
          instance_image:
            family: slurm-gcp-6-9-hpc-rocky-linux-8
            project: schedmd-slurm-public
      - id: compute_partition
        source: community/modules/compute/schedmd-slurm-gcp-v6-partition
        use: [c2_nodeset]
        settings:
          partition_name: compute
          is_default: true
      # 4. H4D partition with the published GCHP image
      - id: h4d_nodeset
        source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
        use: [network]
        settings:
          node_count_dynamic_max: 4
          machine_type: h4d-standard-192
          bandwidth_tier: tier_1_enabled
          instance_image:
            family: gchp1470-full       # the published GCHP image
            project: eece-acag          # the ACAG project that hosts it
          maintenance_policy: TERMINATE # H4D doesn't support live-migration
      - id: h4d_partition
        source: community/modules/compute/schedmd-slurm-gcp-v6-partition
        use: [h4d_nodeset]
        settings:
          partition_name: h4d
      # 5. Login + controller
      - id: slurm_login
        source: community/modules/scheduler/schedmd-slurm-gcp-v6-login
        use: [network]
        settings:
          machine_type: n2-standard-2
      - id: slurm_controller
        source: community/modules/scheduler/schedmd-slurm-gcp-v6-controller
        use:
          - network
          - homefs
          - compute_partition
          - h4d_partition
          - slurm_login
        settings:
          machine_type: c2-standard-4
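If you keep the blueprint as a template and fill in the project ID from the command line, a small helper along these lines may be convenient. This is only a sketch: fill_project_id and has_placeholder are hypothetical names and not part of Cluster Toolkit, and my-gchp-project is a made-up project ID.

```shell
# fill_project_id PROJECT_ID
# Hypothetical helper: reads a blueprint on stdin and writes it to stdout
# with every <PROJECT_ID> placeholder replaced by the given project ID.
fill_project_id() {
  sed "s/<PROJECT_ID>/$1/g"
}

# has_placeholder
# Succeeds if the text on stdin still contains the <PROJECT_ID> placeholder.
has_placeholder() {
  grep -q '<PROJECT_ID>'
}

# Example on a one-line stand-in for the blueprint:
printf 'project_id: <PROJECT_ID>\n' | fill_project_id my-gchp-project
```

On a real file you would redirect instead, for example fill_project_id my-gchp-project < template.yaml > gchp-cluster.yaml, and then run has_placeholder < gchp-cluster.yaml to confirm that nothing was left unfilled before deploying.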
2. Deploy the Cluster
From the directory containing gchp-cluster.yaml (later steps assume this is ~/cluster-toolkit):
~/cluster-toolkit/gcluster create gchp-cluster.yaml
cd gchpslurm/primary
terraform init
terraform apply # ~10-15 minutes
When terraform apply finishes you will see output similar to:
Apply complete! Resources: 47 added, 0 changed, 0 destroyed.
Outputs:
instructions_homefs = "..."
instructions_slurm_login = "..."
Find the login node’s external IP:
gcloud compute instances list --filter="name~login" \
--format="table(name,zone,networkInterfaces[0].accessConfigs[0].natIP)"
SSH in for the first time:
gcloud compute ssh gchpslurm-slurm-login-001 --zone=us-central1-a
Once logged in, confirm Slurm can see the partitions:
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute* up infinite 10 idle~ gchpslurm-compute-[0-9]
h4d up infinite 4 idle~ gchpslurm-h4d-[0-3]
The idle~ state means the nodes are powered down (cloud
burst). Slurm will boot them on demand when you sbatch a job.
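To see at a glance how many nodes are currently powered down, you can total the NODES column for every row whose state ends in ~. The sketch below assumes the default sinfo column layout shown above (node count in column 4, state in column 5); count_powered_down is a hypothetical helper name.

```shell
# count_powered_down
# Hypothetical helper: reads default-format sinfo output on stdin and prints
# the total number of nodes whose state ends in "~" (powered down).
count_powered_down() {
  awk 'NR > 1 && $5 ~ /~$/ { total += $4 } END { print total + 0 }'
}

# Example using the sample output from this section:
count_powered_down <<'EOF'
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute*  up    infinite  10    idle~ gchpslurm-compute-[0-9]
h4d       up    infinite  4     idle~ gchpslurm-h4d-[0-3]
EOF
# prints 14
```

On the login node you would pipe the live output instead: sinfo | count_powered_down.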
3. Set up Falcon RDMA Networking (multi-node MPI)
RDMA-accelerated multi-node MPI between H4D nodes requires a second NIC on a dedicated Falcon RDMA VPC subnet. The network and subnet are a one-time setup per project.
# Falcon profiles are zone-pinned; pick the zone matching your H4D partition
ZONE=us-central1-a
gcloud compute networks create gchp-falcon-net \
--subnet-mode=custom \
--bgp-routing-mode=regional \
--network-profile=${ZONE}-vpc-falcon
gcloud compute networks subnets create gchp-falcon-subnet \
--network=gchp-falcon-net \
--range=10.20.0.0/24 \
--region=us-central1
Then attach the second NIC to your H4D nodeset by editing the
h4d_nodeset block in the blueprint:
- id: h4d_nodeset
  source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
  use: [network]
  settings:
    node_count_dynamic_max: 4
    machine_type: h4d-standard-192
    bandwidth_tier: tier_1_enabled
    instance_image:
      family: gchp1470-full
      project: eece-acag
    maintenance_policy: TERMINATE
    additional_networks:
      - network: gchp-falcon-net
        subnetwork: gchp-falcon-subnet
        nic_type: IRDMA
        access_config: []  # no external IP on RDMA NIC
    resource_policies:
      - compact-placement-policy
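Regenerate the deployment folder from the edited blueprint, then re-apply:
cd ~/cluster-toolkit
./gcluster create -w gchp-cluster.yaml # -w overwrites the existing deployment folder
cd gchpslurm/primary
terraform apply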
The next H4D node that Slurm boots will have both gVNIC (eth0) and
IRDMA (rdma0) NICs. The gchp1470-full image already loads the
idpf and irdma kernel modules at boot, so no further
configuration is required.
Note
Falcon RDMA zones are limited: asia-southeast1-a,
europe-west4-b, us-central1-a, us-central1-b, and
us-west4-a. List current availability with:
gcloud compute network-profiles list | grep falcon
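Because the network profile name and the subnet region both follow from the zone, you can derive them rather than typing them separately. The sketch below assumes the zone list in the note above, which is a snapshot and may be out of date; FALCON_ZONES, falcon_profile, and region_of are hypothetical names.

```shell
# Zones taken from the note above; treat this as a snapshot, not a live list.
FALCON_ZONES="asia-southeast1-a europe-west4-b us-central1-a us-central1-b us-west4-a"

# falcon_profile ZONE
# Prints the network profile name (<zone>-vpc-falcon) if ZONE is in the list,
# otherwise prints a message on stderr and returns non-zero.
falcon_profile() {
  local zone=$1 z
  for z in $FALCON_ZONES; do
    if [ "$z" = "$zone" ]; then
      echo "${zone}-vpc-falcon"
      return 0
    fi
  done
  echo "zone $zone is not in the Falcon RDMA list" >&2
  return 1
}

# region_of ZONE
# Strips the trailing "-<letter>" from a zone name to get its region.
region_of() {
  echo "${1%-*}"
}

falcon_profile us-central1-a   # prints us-central1-a-vpc-falcon
region_of us-central1-a        # prints us-central1
```

The two values correspond to the --network-profile and --region arguments used in the gcloud commands earlier in this step.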
4. Verify the GCHP image and Falcon RDMA
Submit a one-node interactive job on the H4D partition:
srun -p h4d -N 1 -n 1 --pty bash
Once inside the compute node, confirm everything works:
$ rpm -q libmd rdma-core-devel
libmd-1.2.0-1.el8.x86_64
rdma-core-devel-48.0-10.el8_10_ciq.x86_64
$ lsmod | grep -E "irdma|idpf"
irdma 528384 0
idpf 188416 0
$ ibv_devinfo | head -5
hca_id: irdma0
transport: InfiniBand (0)
state: PORT_ACTIVE (4) # <- Falcon RDMA ready
$ mount | grep shared
10.x.x.x:/nfsshare on /shared type nfs (...) # Filestore mount
If ibv_devinfo returns Failed to open device, the iRDMA
provider swap (see The GCHP compute image and Falcon RDMA) was not applied. The
published gchp1470-full image handles this for you; if you
built your own image, see that page.
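If you want to run this check from a script rather than by eye, one option is to test the ibv_devinfo output for an active port. This is a minimal sketch that only looks for the PORT_ACTIVE state line shown above; rdma_port_active is a hypothetical helper name.

```shell
# rdma_port_active
# Hypothetical helper: reads ibv_devinfo output on stdin and succeeds only
# if it contains a "state: PORT_ACTIVE" line.
rdma_port_active() {
  grep -Eq 'state:[[:space:]]*PORT_ACTIVE'
}

# Example with a shortened copy of the output shown above:
if printf 'hca_id: irdma0\n\tstate:\t\tPORT_ACTIVE (4)\n' | rdma_port_active; then
  echo "Falcon RDMA ready"
else
  echo "RDMA port not active"
fi
# prints Falcon RDMA ready
```

On a compute node you would feed it the live output: ibv_devinfo | rdma_port_active.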
5. Build GCHP on the cluster
The compute image ships the full MPI and library stack under
/opt/gchp, so you do not need to install Spack, UCX, OpenMPI,
HDF5, netCDF, or ESMF yourself. What you do need to build is the
GCHP binary itself, configured for your meteorology and chemistry
choices.
Grab an H4D node interactively:
srun -p h4d -N 1 -n 1 --pty bash
then bring the stack onto PATH and build GCHP into a fresh rundir. The example below uses GCHP 14.7.0 with fullchem; substitute the configuration appropriate to your work:
source /opt/gchp/env.sh
# Source tree (one time per cluster)
cd /shared
git clone --recurse-submodules https://github.com/geoschem/GCHP
cd GCHP && git checkout 14.7.0
# Rundir
cd /shared/GCHP/run
./createRunDir.sh # follow the prompts
# Build
cd /shared/rundir-<your-name>
mkdir -p build && cd build
cmake -DRUNDIR=.. \
-DCMAKE_C_COMPILER=mpicc \
-DCMAKE_CXX_COMPILER=mpicxx \
-DCMAKE_Fortran_COMPILER=mpifort \
/shared/GCHP
make -j 30 && make install
The gchp binary appears at the top of the rundir, ready for
sbatch. Confirm it picked up the right libraries:
ldd ../gchp | grep -E "openmpi|esmf|netcdf|hdf5"
# all paths under /opt/gchp/spack/opt/spack/linux-zen4/...
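To make the same check fail loudly instead of relying on reading the paths, you could filter the ldd output for libraries that resolve outside /opt/gchp. The sketch below assumes the usual "name => path (address)" ldd line format; stray_libs is a hypothetical helper name, and the library lines in the example are made up for illustration.

```shell
# stray_libs
# Hypothetical helper: reads ldd output on stdin and prints the name of every
# MPI/ESMF/netCDF/HDF5 library whose resolved path is not under /opt/gchp/.
# Libraries reported as "not found" are printed too.
stray_libs() {
  awk '/openmpi|esmf|netcdf|hdf5/ && $3 !~ /^\/opt\/gchp\// { print $1 }'
}

# Example with made-up ldd lines: one good, one picked up from the system.
printf '%s\n' \
  'libnetcdf.so.19 => /opt/gchp/spack/opt/example/lib/libnetcdf.so.19 (0x1)' \
  'libhdf5.so.310 => /usr/lib64/libhdf5.so.310 (0x2)' | stray_libs
# prints libhdf5.so.310
```

From the build directory the real invocation would be ldd ../gchp | stray_libs; empty output means every matched library came from the image's stack.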
6. Running GCHP on the cluster
Save the following as run-gchp.sh in your rundir. It is the
multi-node template that uses Falcon RDMA. The /opt/gchp/env.sh
sourcing replaces all the per-library Spack loads you would normally
need.
#!/bin/bash
#SBATCH --job-name=gchp-c90
#SBATCH --partition=h4d
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=180
#SBATCH --time=04:00:00
#SBATCH --chdir=/shared/rundir-c90-360
source /opt/gchp/env.sh
# /opt/gchp/env.sh already sets OMPI_MCA_pml, UCX_TLS, etc.
# Override per-job if you want, for example to disable RDMA:
# export UCX_TLS=self,sm,sysv,posix,tcp
echo "20190701 000000" > cap_restart
source setCommonRunSettings.sh
source setRestartLink.sh
source checkRunSettings.sh
srun --mpi=pmi2 -n 360 ./gchp > h4d.20190701_0000z.log 2>&1
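The task count passed to srun has to equal nodes times tasks per node, and GCHP expects the total core count to be divisible by 6 (the cubed-sphere grid has six faces). When you change the node count, a quick check like the following can catch a mismatch before you submit. It is a sketch only; check_layout is a hypothetical helper name.

```shell
# check_layout NODES TASKS_PER_NODE
# Hypothetical helper: prints the total task count for the layout, or
# returns non-zero if that total is not divisible by 6.
check_layout() {
  local nodes=$1 tasks_per_node=$2 total
  total=$((nodes * tasks_per_node))
  if [ $((total % 6)) -ne 0 ]; then
    echo "total tasks $total is not a multiple of 6" >&2
    return 1
  fi
  echo "$total"
}

check_layout 2 180   # prints 360, matching "srun -n 360" in the template
```

The core counts in setCommonRunSettings.sh in your rundir need to agree with the same total.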
Submit and monitor:
$ sbatch run-gchp.sh
Submitted batch job 1
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1 h4d gchp-c90 you CF 0:05 2 gchpslurm-h4d-[0-1] (Powering up)
$ tail -f h4d.20190701_0000z.log
... GCHP Date: 2019/07/01 Time: 00:10:00 Throughput(days/day) ...
7. Cost Management
Important
Cost is the most common source of surprise on GCP. Develop a habit of stopping compute nodes whenever you walk away from the cluster.
Daily idle baseline
| Resource | Daily | Stoppable? |
|---|---|---|
| Filestore 1 TB BASIC_HDD | $6.67 | Only by deleting (loses all data in /shared) |
| Controller VM | $5.03 | Yes - but breaks Slurm scheduling |
| Login VM | $2.33 | Yes - but blocks SSH-in |
| Disks + IPs | $2.50 | No |
| Idle total | $18-20/day | |
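For a rough estimate before you start a campaign, you can combine the approximate figures quoted in this guide: about $18 per day of idle baseline and about $10 per H4D node-hour. The sketch below uses only those two rounded numbers, so treat its output as an order-of-magnitude guide rather than a bill; estimate_cost is a hypothetical helper name.

```shell
# estimate_cost DAYS NODE_HOURS
# Hypothetical helper: rough USD estimate from this guide's rounded rates
# ($18/day idle baseline, $10 per H4D node-hour).
estimate_cost() {
  awk -v d="$1" -v h="$2" 'BEGIN { printf "%.2f\n", d * 18 + h * 10 }'
}

estimate_cost 30 0   # a month left idle: prints 540.00
estimate_cost 1 8    # one day, plus 2 nodes for 4 hours: prints 98.00
```

Node-hours are nodes multiplied by wall-clock hours, so the two-node, four-hour job from the run template counts as 8.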
Stop the cluster when you are done for the day
# H4D nodes auto-stop when Slurm jobs finish (~5 min idle)
# To stop the always-on VMs (cluster won't accept new jobs until you start them):
gcloud compute instances stop gchpslurm-controller gchpslurm-slurm-login-001 \
--zone=us-central1-a
This drops the cost to approximately $9-10/day: Filestore ($6.67) plus the disks and IPs ($2.50) listed in the table above.
Watch your spend
Enable Billing export to BigQuery in the Cloud Console (Billing -> Billing export). After 24 hours you can query exact daily spend per SKU:
SELECT service.description, sku.description, SUM(cost) AS usd
FROM `<project>.billing.gcp_billing_export_*`
WHERE invoice.month = FORMAT_DATE('%Y%m', CURRENT_DATE())
GROUP BY 1, 2
ORDER BY usd DESC
LIMIT 20
Tear down completely
To stop all ongoing charges (including disks and Filestore):
cd ~/cluster-toolkit/gchpslurm/primary
terraform destroy # deletes everything - files in /shared are lost
Warning
terraform destroy permanently deletes the Filestore volume
and all data in /shared. Back up your GCHP build, restarts,
and output to Cloud Storage first:
gsutil -m rsync -r /shared/rundir-c90-360/OutputDir/ \
gs://<your-bucket>/c90-360/
What is next
For the engineering details of the published GCHP image and how to build your own, see The GCHP compute image and Falcon RDMA.
For terminology and pricing references, see Reference: GCP Terminology.