.. _set-up-gcp-cluster:
##########################################
Quickstart II: Set up your GCP HPC Cluster
##########################################
.. warning::
A running cluster (even idle) costs approximately **$18/day** for
Filestore plus the always-on controller and login VMs. Each H4D
compute node adds approximately **$10/hour** while running. If you
forget to stop the cluster, a month of idle time alone is
approximately **$540**.
This guide walks you through deploying an HPC cluster on Google
Cloud that can run GCHP at production scale. It assumes you have
completed :ref:`prepare-gcp-environment`: your project is created
and billed, with IAM, APIs, and quotas in place, and the local
toolchain (``gcloud``, ``terraform``, ``gcluster``) is installed.
================================================================================
Workflow
================================================================================
.. list-table::
   :header-rows: 1
   :widths: 6 50 14 30

   * - Step
     - What it does
     - Time
     - Cost
   * - 1
     - Write the cluster blueprint
     - 5 min
     - $0
   * - 2
     - Deploy the cluster
     - 10-15 min
     - starts ~$18/day baseline
   * - 3
     - (Optional) Add Falcon RDMA networking
     - 2 min
     - $0
   * - 4
     - Boot compute nodes from the GCHP image
     - 90 s/node
     - $10/hr/node while up
   * - 5
     - Build GCHP
     - 30-60 min
     - ~$10/hr (one H4D node)
   * - 6
     - Run GCHP
     - varies
     - ~$10/hr per H4D node
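To make the cost column concrete, the sketch below combines the two approximate rates quoted in this guide: $18/day for the idle baseline and $10/hour for each running H4D node. The function name ``estimate_cost`` is hypothetical, and the result is a rough estimate, not a billing figure.

```shell
# Rough cost estimate in whole dollars.
#   $1 = days the cluster exists (idle baseline, ~$18/day)
#   $2 = total H4D node-hours (~$10/hour per node)
# Both rates are the approximate figures quoted in this guide.
estimate_cost() {
    echo $(( 18 * $1 + 10 * $2 ))
}

estimate_cost 30 0   # a month of idle time, no compute nodes
estimate_cost 1 8    # one day, with a 2-node job running for 4 hours
```

The first call reproduces the $540 idle month from the warning at the top of this page. The second shows that even a short two-node job quickly outweighs a day of baseline cost.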
================================================================================
1. Create an HPC Cluster Blueprint
================================================================================
A **blueprint** is a YAML file that Cluster Toolkit converts to
Terraform modules and applies. It defines:
* A regional VPC and subnet
* A Filestore NFS volume mounted at ``/shared``
* One or more **compute partitions** (machine types, max node count)
* A **controller** VM running ``slurmctld``
* A **login** VM where users SSH in
Save the following as ``gchp-cluster.yaml`` and set ``project_id``
under ``vars`` to your project ID from :ref:`prepare-gcp-environment`.
The other values are sensible defaults that you can leave unchanged.
.. code-block:: yaml

   blueprint_name: gchp-cluster

   vars:
     project_id: <PROJECT_ID>   # replace with your project ID
     deployment_name: gchpslurm
     region: us-central1
     zone: us-central1-a

   deployment_groups:
   - group: primary
     modules:

     # 1. Networking
     - id: network
       source: modules/network/vpc

     # 2. NFS shared file system
     - id: homefs
       source: community/modules/file-system/filestore
       use: [network]
       settings:
         filestore_tier: BASIC_HDD
         size_gb: 1024
         local_mount: /shared

     # 3. Compute partition - c2-standard-60 baseline
     - id: c2_nodeset
       source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
       use: [network]
       settings:
         node_count_dynamic_max: 10
         machine_type: c2-standard-60
         instance_image:
           family: slurm-gcp-6-9-hpc-rocky-linux-8
           project: schedmd-slurm-public

     - id: compute_partition
       source: community/modules/compute/schedmd-slurm-gcp-v6-partition
       use: [c2_nodeset]
       settings:
         partition_name: compute
         is_default: true

     # 4. H4D partition with the published GCHP image
     - id: h4d_nodeset
       source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
       use: [network]
       settings:
         node_count_dynamic_max: 4
         machine_type: h4d-standard-192
         bandwidth_tier: tier_1_enabled
         instance_image:
           family: gchp1470-full       # the published GCHP image
           project: eece-acag          # the ACAG project that hosts it
         maintenance_policy: TERMINATE # H4D doesn't support live-migration

     - id: h4d_partition
       source: community/modules/compute/schedmd-slurm-gcp-v6-partition
       use: [h4d_nodeset]
       settings:
         partition_name: h4d

     # 5. Login + controller
     - id: slurm_login
       source: community/modules/scheduler/schedmd-slurm-gcp-v6-login
       use: [network]
       settings:
         machine_type: n2-standard-2

     - id: slurm_controller
       source: community/modules/scheduler/schedmd-slurm-gcp-v6-controller
       use:
       - network
       - homefs
       - compute_partition
       - h4d_partition
       - slurm_login
       settings:
         machine_type: c2-standard-4
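Deploying with an unset project ID is an easy mistake to make. The following is a minimal pre-flight sketch, not part of Cluster Toolkit: the hypothetical helper ``blueprint_project_id_set`` succeeds only when the first ``project_id:`` line in a blueprint carries a value that is neither empty nor an angle-bracket placeholder.

```shell
# Hypothetical pre-flight check (not a Cluster Toolkit command).
# Returns 0 when the blueprint's project_id has a real value,
# 1 when it is empty or still an angle-bracket placeholder.
blueprint_project_id_set() {
    # First project_id line, minus trailing comment, key, and whitespace.
    _value=$(grep -E '^[[:space:]]*project_id:' "$1" | head -n 1 \
        | sed -E 's/#.*$//; s/^[[:space:]]*project_id:[[:space:]]*//; s/[[:space:]]+$//')
    case "$_value" in
        ""|*"<"*) return 1 ;;
        *) return 0 ;;
    esac
}
```

Run it as ``blueprint_project_id_set gchp-cluster.yaml || echo "set project_id first"`` before moving on to the deploy step.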
================================================================================
2. Deploy the Cluster
================================================================================
From the directory containing ``gchp-cluster.yaml``:
.. code-block:: bash
~/cluster-toolkit/gcluster create gchp-cluster.yaml
cd gchpslurm/primary
terraform init
terraform apply # ~10-15 minutes
When ``terraform apply`` finishes you will see output similar to:
.. code-block:: text
Apply complete! Resources: 47 added, 0 changed, 0 destroyed.
Outputs:
instructions_homefs = "..."
instructions_slurm_login = "..."
Find the login node's external IP:
.. code-block:: bash
gcloud compute instances list --filter="name~login" \
--format="table(name,zone,networkInterfaces[0].accessConfigs[0].natIP)"
SSH in for the first time:
.. code-block:: bash
gcloud compute ssh gchpslurm-slurm-login-001 --zone=us-central1-a
Once logged in, confirm Slurm can see the partitions:
.. code-block:: bash
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute* up infinite 10 idle~ gchpslurm-compute-[0-9]
h4d up infinite 4 idle~ gchpslurm-h4d-[0-3]
The ``idle~`` state means the nodes are **powered down** (cloud
burst). Slurm will boot them on demand when you ``sbatch`` a job.
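To see at a glance how many nodes are currently powered down, you can sum the node counts of every state that ends in ``~``. The sketch below assumes input in the form produced by ``sinfo -h -o '%P %D %t'`` (partition, node count, compact state); the helper name ``count_powered_down`` is hypothetical.

```shell
# Sum the node counts of all rows whose state ends in "~" (powered down).
# Expects "<partition> <node-count> <state>" lines on stdin, for example
# from:  sinfo -h -o '%P %D %t'
count_powered_down() {
    awk '$3 ~ /~$/ { n += $2 } END { print n + 0 }'
}
```

On the login node, ``sinfo -h -o '%P %D %t' | count_powered_down`` should print 14 for the freshly deployed cluster shown above, since no node has been booted yet.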
================================================================================
3. Set up Falcon RDMA Networking (multi-node MPI)
================================================================================
Multi-node MPI between H4D nodes requires a second NIC attached to a
dedicated Falcon RDMA VPC subnet. This setup is needed only once per
project.
.. code-block:: bash
# Falcon profiles are zone-pinned; pick the zone matching your H4D partition
ZONE=us-central1-a
gcloud compute networks create gchp-falcon-net \
--subnet-mode=custom \
--bgp-routing-mode=regional \
--network-profile=${ZONE}-vpc-falcon
gcloud compute networks subnets create gchp-falcon-subnet \
--network=gchp-falcon-net \
--range=10.20.0.0/24 \
--region=us-central1
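The commands above hard-code the region while taking the zone from a variable. If you switch zones, both values can be derived from ``ZONE`` so they cannot drift apart. The helper names below are hypothetical, and the profile pattern ``<zone>-vpc-falcon`` is simply the one used in the command above.

```shell
# Region = zone name without its trailing "-<letter>" suffix.
region_of_zone() {
    printf '%s\n' "${1%-*}"
}

# Falcon network profile name, following the <zone>-vpc-falcon pattern
# used in the gcloud command above.
falcon_profile_of_zone() {
    printf '%s-vpc-falcon\n' "$1"
}
```

With these, ``--region="$(region_of_zone "$ZONE")"`` and ``--network-profile="$(falcon_profile_of_zone "$ZONE")"`` stay consistent with whatever zone you choose.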
Then attach the second NIC to your H4D nodeset by editing the
``h4d_nodeset`` block in the blueprint:
.. code-block:: yaml

   - id: h4d_nodeset
     source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
     use: [network]
     settings:
       node_count_dynamic_max: 4
       machine_type: h4d-standard-192
       bandwidth_tier: tier_1_enabled
       instance_image:
         family: gchp1470-full
         project: eece-acag
       maintenance_policy: TERMINATE
       additional_networks:
       - network: gchp-falcon-net
         subnetwork: gchp-falcon-subnet
         nic_type: IRDMA
         access_config: []             # no external IP on RDMA NIC
       resource_policies:
       - compact-placement-policy
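Regenerate the deployment folder from the edited blueprint, then
re-apply. Run these commands from the directory containing
``gchp-cluster.yaml``:

.. code-block:: bash

   # -w overwrites the existing deployment folder with the updated blueprint
   ~/cluster-toolkit/gcluster create gchp-cluster.yaml -w
   cd gchpslurm/primary
   terraform apply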
The next bursted H4D node will have both gVNIC (eth0) and IRDMA
(rdma0) NICs. The ``gchp1470-full`` image already loads the
``idpf`` and ``irdma`` kernel modules at boot, so no further
configuration is required.
.. note::
Falcon RDMA zones are limited: ``asia-southeast1-a``,
``europe-west4-b``, ``us-central1-a``, ``us-central1-b``, and
``us-west4-a``. List current availability with:
.. code-block:: bash
gcloud compute network-profiles list | grep falcon
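Before creating the Falcon network, it can help to check your chosen zone against the list in the note above. The sketch below hard-codes that list, so it is only a snapshot and may be out of date; the ``gcloud`` command above remains the authoritative source. The function name ``falcon_zone_listed`` is hypothetical.

```shell
# Returns 0 if the zone appears in the Falcon RDMA zone list quoted in
# this guide, 1 otherwise. The list is a snapshot and may be out of date.
falcon_zone_listed() {
    case "$1" in
        asia-southeast1-a|europe-west4-b|us-central1-a|us-central1-b|us-west4-a)
            return 0 ;;
        *)
            return 1 ;;
    esac
}
```

For example, ``falcon_zone_listed "$ZONE" || echo "check network-profiles for $ZONE"`` flags a zone that is not in the snapshot without blocking you from trying it.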
================================================================================
4. Verify the GCHP image and Falcon RDMA
================================================================================
Submit a one-node interactive job on the H4D partition:
.. code-block:: bash
srun -p h4d -N 1 -n 1 --pty bash
Once inside the compute node, confirm everything works:
.. code-block:: bash
$ rpm -q libmd rdma-core-devel
libmd-1.2.0-1.el8.x86_64
rdma-core-devel-48.0-10.el8_10_ciq.x86_64
$ lsmod | grep -E "irdma|idpf"
irdma 528384 0
idpf 188416 0
$ ibv_devinfo | head -5
hca_id: irdma0
transport: InfiniBand (0)
state: PORT_ACTIVE (4) # <- Falcon RDMA ready
$ mount | grep shared
10.x.x.x:/nfsshare on /shared type nfs (...) # Filestore mount
If ``ibv_devinfo`` returns ``Failed to open device``, the iRDMA
provider swap (see :ref:`falcon-rdma-image`) was not applied. The
published ``gchp1470-full`` image handles this for you; if you
built your own image, see that page.
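The two outcomes described above can be told apart mechanically. This sketch classifies the text printed by ``ibv_devinfo``, using only the two strings quoted in this section (``PORT_ACTIVE`` and ``Failed to open device``). The function name ``rdma_status`` and its three labels are hypothetical.

```shell
# Classify ibv_devinfo output read from stdin.
#   ready             a port reports PORT_ACTIVE
#   provider-missing  the "Failed to open device" error appeared
#   not-ready         anything else
rdma_status() {
    _report=$(cat)
    case "$_report" in
        *PORT_ACTIVE*)             echo "ready" ;;
        *"Failed to open device"*) echo "provider-missing" ;;
        *)                         echo "not-ready" ;;
    esac
}
```

On a compute node, run ``ibv_devinfo 2>&1 | rdma_status``. The ``2>&1`` is there on the assumption that the error message may be written to standard error.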
================================================================================
5. Build GCHP on the cluster
================================================================================
The compute image ships the full MPI and library stack under
``/opt/gchp``, so you do not need to install Spack, UCX, OpenMPI,
HDF5, netCDF, or ESMF yourself. What you do need to build is the
GCHP binary itself, configured for your meteorology and chemistry
choices.
Grab an H4D node interactively:
.. code-block:: bash
srun -p h4d -N 1 -n 1 --pty bash
then bring the stack onto PATH and build GCHP into a fresh rundir.
The example below uses GCHP 14.7.0 with fullchem; substitute the
configuration appropriate to your work:
.. code-block:: bash

   source /opt/gchp/env.sh

   # Source tree (one time per cluster)
   cd /shared
   git clone --recurse-submodules https://github.com/geoschem/GCHP
   cd GCHP && git checkout 14.7.0
   git submodule update --init --recursive   # sync submodules to the 14.7.0 tag

   # Rundir
   cd /shared/GCHP/run
   ./createRunDir.sh               # follow the prompts

   # Build (use the path of the rundir created by the previous step)
   cd /shared/rundir-
   mkdir -p build && cd build
   cmake -DRUNDIR=.. \
         -DCMAKE_C_COMPILER=mpicc \
         -DCMAKE_CXX_COMPILER=mpicxx \
         -DCMAKE_Fortran_COMPILER=mpifort \
         /shared/GCHP
   make -j 30 && make install
The ``gchp`` binary appears at the top of the rundir, ready for
``sbatch``. Confirm it picked up the right libraries:
.. code-block:: bash
ldd ../gchp | grep -E "openmpi|esmf|netcdf|hdf5"
# all paths under /opt/gchp/spack/opt/spack/linux-zen4/...
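Rather than reading the ``ldd`` output by eye, you can filter for the libraries that do not resolve under ``/opt/gchp``. This is a minimal sketch; the function name ``stray_libs`` is hypothetical, and it assumes the usual ``name => path (address)`` line format that ``ldd`` prints.

```shell
# Read ldd output on stdin and print the MPI/ESMF/netCDF/HDF5 lines that
# do NOT resolve to a path under /opt/gchp (including "not found" lines).
# Prints nothing when every matching library comes from the image's stack.
stray_libs() {
    grep -E 'openmpi|esmf|netcdf|hdf5' | grep -v -- '=> /opt/gchp/' || true
}
```

From the build directory, ``ldd ../gchp | stray_libs`` should print nothing. Any line it does print points to a library picked up from outside the image's stack, usually because ``/opt/gchp/env.sh`` was not sourced before running ``cmake``.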
================================================================================
6. Running GCHP on the cluster
================================================================================
Save the following as ``run-gchp.sh`` in your rundir. It is the
multi-node template that uses Falcon RDMA. The ``/opt/gchp/env.sh``
sourcing replaces all the per-library Spack loads you would normally
need.
.. code-block:: bash
#!/bin/bash
#SBATCH --job-name=gchp-c90
#SBATCH --partition=h4d
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=180
#SBATCH --time=04:00:00
#SBATCH --chdir=/shared/rundir-c90-360
source /opt/gchp/env.sh
# /opt/gchp/env.sh already sets OMPI_MCA_pml, UCX_TLS, etc.
# Override per-job if you want, for example to disable RDMA:
# export UCX_TLS=self,sm,sysv,posix,tcp
echo "20190701 000000" > cap_restart
source setCommonRunSettings.sh
source setRestartLink.sh
source checkRunSettings.sh
srun --mpi=pmi2 -n 360 ./gchp > h4d.20190701_0000z.log 2>&1
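The task count passed to ``srun -n`` must equal ``--nodes`` times ``--ntasks-per-node`` (here 2 x 180 = 360). GCHP also requires the total core count to be a multiple of 6, one share per face of the cubed sphere. The sketch below checks both when you change the layout. The function name ``gchp_total_tasks`` is hypothetical. It is likewise an assumption, not something this guide states, that the core counts in ``setCommonRunSettings.sh`` should be kept in step with the Slurm values.

```shell
# Print the total MPI task count for a nodes x tasks-per-node layout.
# Fails (exit 1) when the total is not a multiple of 6, which GCHP
# requires for its six cubed-sphere faces.
#   $1 = number of nodes, $2 = tasks per node
gchp_total_tasks() {
    _total=$(( $1 * $2 ))
    if [ $(( _total % 6 )) -ne 0 ]; then
        echo "total tasks ($_total) is not a multiple of 6" >&2
        return 1
    fi
    echo "$_total"
}

gchp_total_tasks 2 180   # the layout used in run-gchp.sh
```

Using 180 of the 192 cores on each ``h4d-standard-192`` node keeps the total divisible by 6 and leaves a few cores free for the operating system.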
Submit and monitor:
.. code-block:: bash
$ sbatch run-gchp.sh
Submitted batch job 1
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1 h4d gchp-c90 you CF 0:05 2 gchpslurm-h4d-[0-1] (Powering up)
$ tail -f h4d.20190701_0000z.log
... GCHP Date: 2019/07/01 Time: 00:10:00 Throughput(days/day) ...
================================================================================
7. Cost Management
================================================================================
.. important::
Cost is the most common source of surprise on GCP. Develop a
habit of stopping compute nodes whenever you walk away from the
cluster.
Daily idle baseline
--------------------------------------------------------------------------------
.. list-table::
   :header-rows: 1
   :widths: 38 14 48

   * - Resource
     - Daily
     - Stoppable?
   * - Filestore 1 TB BASIC_HDD
     - $6.67
     - Only by deleting (loses ``/shared``)
   * - Controller VM
     - $5.03
     - Yes - but breaks Slurm scheduling
   * - Login VM
     - $2.33
     - Yes - but blocks SSH-in
   * - Disks + IPs
     - $2.50
     - No
   * - **Idle total**
     - **$18-20/day**
     -
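As a quick sanity check on the table, the sketch below sums the itemized daily figures, first for all four rows and then for only the rows that keep charging after the VMs are stopped. The helper name ``sum_daily`` is hypothetical, and the inputs are the approximate prices from the table, not live billing data.

```shell
# Sum a list of daily dollar amounts and print the total to two decimals.
sum_daily() {
    printf '%s\n' "$@" | awk '{ s += $1 } END { printf "%.2f\n", s }'
}

sum_daily 6.67 5.03 2.33 2.50   # all itemized rows       -> 16.53
sum_daily 6.67 2.50             # Filestore + disks/IPs   -> 9.17
```

The itemized rows add up to slightly less than the quoted idle total, so treat the per-row figures as approximate and rely on the billing export described below for exact numbers.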
Stop the cluster when you are done for the day
--------------------------------------------------------------------------------
.. code-block:: bash
# H4D nodes auto-stop when Slurm jobs finish (~5 min idle)
# To stop the always-on VMs (cluster won't accept new jobs until you start them):
gcloud compute instances stop gchpslurm-controller gchpslurm-slurm-login-001 \
--zone=us-central1-a
This drops the cost to approximately **$10/day** (Filestore + disks
only).
Watch your spend
--------------------------------------------------------------------------------
Enable **Billing export to BigQuery** in the Cloud Console
(*Billing -> Billing export*). After 24 hours you can query exact
daily spend per SKU:
.. code-block:: sql

   -- Replace <PROJECT_ID> with your project ID
   SELECT service.description AS service,
          sku.description AS sku,
          SUM(cost) AS usd
   FROM `<PROJECT_ID>.billing.gcp_billing_export_*`
   WHERE invoice.month = FORMAT_DATE('%Y%m', CURRENT_DATE())
   GROUP BY 1, 2
   ORDER BY usd DESC
   LIMIT 20
Tear down completely
--------------------------------------------------------------------------------
To stop **all** ongoing charges (including disks and Filestore):
.. code-block:: bash
cd ~/cluster-toolkit/gchpslurm/primary
terraform destroy # deletes everything - files in /shared are lost
.. warning::

   ``terraform destroy`` permanently deletes the Filestore volume
   and all data in ``/shared``. Back up your GCHP build, restarts,
   and output to Cloud Storage first:

   .. code-block:: bash

      # Replace <BUCKET_NAME> with your Cloud Storage bucket
      gsutil -m rsync -r /shared/rundir-c90-360/OutputDir/ \
          gs://<BUCKET_NAME>/c90-360/
================================================================================
What is next
================================================================================
* For the engineering details of the published GCHP image and how
to build your own, see :ref:`falcon-rdma-image`.
* For terminology and pricing references, see
:ref:`gcp-terminology`.