.. |br| raw:: html

   <br />

.. _set-up-gcp-cluster:

##########################################
Quickstart II: Set up your GCP HPC Cluster
##########################################

.. warning::

   A running cluster (even idle) costs approximately **$18/day** for
   Filestore plus the always-on controller and login VMs. Each H4D compute
   node adds approximately **$10/hour** while running. If you forget to stop
   the cluster, a month of idle time alone is approximately **$540**.

This guide walks you through deploying an HPC cluster on Google Cloud that
can run GCHP at production scale. It assumes you have completed
:ref:`prepare-gcp-environment`: your project is created and billed, with IAM,
APIs, and quotas in place, and the local toolchain (``gcloud``,
``terraform``, ``gcluster``) is installed.

================================================================================
Workflow
================================================================================

.. list-table::
   :header-rows: 1
   :widths: 6 50 14 30

   * - Step
     - What it does
     - Time
     - Cost
   * - 1
     - Write the cluster blueprint
     - 5 min
     - $0
   * - 2
     - Deploy the cluster
     - 10-15 min
     - starts ~$18/day baseline
   * - 3
     - (Optional) Add Falcon RDMA networking
     - 2 min
     - $0
   * - 4
     - Boot compute nodes from the GCHP image
     - 90 s/node
     - $10/hr/node while up
   * - 5
     - Build GCHP
     - 30-60 min
     - compute time
   * - 6
     - Run GCHP
     - varies
     - as priced

================================================================================
1. Create an HPC Cluster Blueprint
================================================================================

A **blueprint** is a YAML file that Cluster Toolkit converts to Terraform
modules and applies. It defines:

* A regional VPC and subnet
* A Filestore NFS volume mounted at ``/shared``
* One or more **compute partitions** (machine types, max node count)
* A **controller** VM running ``slurmctld``
* A **login** VM where users SSH in

Save the following as ``gchp-cluster.yaml``.
Set ``project_id`` under ``vars`` to your project ID from
:ref:`prepare-gcp-environment`. The other values are sensible defaults that
you can leave as they are.

.. code-block:: yaml

   blueprint_name: gchp-cluster

   vars:
     project_id:                 # set to your project ID
     deployment_name: gchpslurm
     region: us-central1
     zone: us-central1-a

   deployment_groups:
   - group: primary
     modules:

     # 1. Networking
     - id: network
       source: modules/network/vpc

     # 2. NFS shared file system
     - id: homefs
       source: modules/file-system/filestore
       use: [network]
       settings:
         filestore_tier: BASIC_HDD
         size_gb: 1024
         local_mount: /shared

     # 3. Compute partition - c2-standard-60 baseline
     - id: c2_nodeset
       source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
       use: [network]
       settings:
         node_count_dynamic_max: 10
         machine_type: c2-standard-60
         instance_image:
           family: slurm-gcp-6-9-hpc-rocky-linux-8
           project: schedmd-slurm-public

     - id: compute_partition
       source: community/modules/compute/schedmd-slurm-gcp-v6-partition
       use: [c2_nodeset]
       settings:
         partition_name: compute
         is_default: true

     # 4. H4D partition with the published GCHP image
     - id: h4d_nodeset
       source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
       use: [network]
       settings:
         node_count_dynamic_max: 4
         machine_type: h4d-standard-192
         bandwidth_tier: tier_1_enabled
         instance_image:
           family: gchp1470-full        # the published GCHP image
           project: eece-acag           # the ACAG project that hosts it
         maintenance_policy: TERMINATE  # H4D doesn't support live-migration

     - id: h4d_partition
       source: community/modules/compute/schedmd-slurm-gcp-v6-partition
       use: [h4d_nodeset]
       settings:
         partition_name: h4d

     # 5. Login + controller
     - id: slurm_login
       source: community/modules/scheduler/schedmd-slurm-gcp-v6-login
       use: [network]
       settings:
         machine_type: n2-standard-2

     - id: slurm_controller
       source: community/modules/scheduler/schedmd-slurm-gcp-v6-controller
       use:
       - network
       - homefs
       - compute_partition
       - h4d_partition
       - slurm_login
       settings:
         machine_type: c2-standard-4

================================================================================
2. Deploy the Cluster
================================================================================

From the directory containing ``gchp-cluster.yaml``:

.. code-block:: bash

   ~/cluster-toolkit/gcluster create gchp-cluster.yaml
   cd gchpslurm/primary
   terraform init
   terraform apply        # ~10-15 minutes

When ``terraform apply`` finishes you will see output similar to:

.. code-block:: text

   Apply complete! Resources: 47 added, 0 changed, 0 destroyed.

   Outputs:

   instructions_homefs = "..."
   instructions_slurm_login = "..."

Find the login node's external IP:

.. code-block:: bash

   gcloud compute instances list --filter="name~login" \
     --format="table(name,zone,networkInterfaces[0].accessConfigs[0].natIP)"

SSH in for the first time:

.. code-block:: bash

   gcloud compute ssh gchpslurm-slurm-login-001 --zone=us-central1-a

Once logged in, confirm Slurm can see the partitions:

.. code-block:: bash

   $ sinfo
   PARTITION  AVAIL  TIMELIMIT  NODES  STATE  NODELIST
   compute*   up     infinite   10     idle~  gchpslurm-compute-[0-9]
   h4d        up     infinite   4      idle~  gchpslurm-h4d-[0-3]

The ``idle~`` state means the nodes are **powered down** (cloud burst). Slurm
boots them on demand when you ``sbatch`` a job.

================================================================================
3. Set up Falcon RDMA Networking (multi-node MPI)
================================================================================

Multi-node MPI between H4D nodes requires a second NIC on a dedicated Falcon
RDMA VPC subnet. Creating the network and subnet is a one-time setup per
project.

.. code-block:: bash

   # Falcon profiles are zone-pinned; pick the zone matching your H4D partition
   ZONE=us-central1-a

   gcloud compute networks create gchp-falcon-net \
     --subnet-mode=custom \
     --bgp-routing-mode=regional \
     --network-profile=${ZONE}-vpc-falcon

   gcloud compute networks subnets create gchp-falcon-subnet \
     --network=gchp-falcon-net \
     --range=10.20.0.0/24 \
     --region=us-central1

Then attach the second NIC to your H4D nodeset by editing the ``h4d_nodeset``
block in the blueprint:

.. code-block:: yaml

   - id: h4d_nodeset
     source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
     use: [network]
     settings:
       node_count_dynamic_max: 4
       machine_type: h4d-standard-192
       bandwidth_tier: tier_1_enabled
       instance_image:
         family: gchp1470-full
         project: eece-acag
       maintenance_policy: TERMINATE
       additional_networks:
       - network: gchp-falcon-net
         subnetwork: gchp-falcon-subnet
         nic_type: IRDMA
         access_config: []          # no external IP on RDMA NIC
       resource_policies:
       - compact-placement-policy

Editing the blueprint does not change the Terraform files that were already
generated, so regenerate the deployment directory first and then re-apply:

.. code-block:: bash

   # Run from the directory containing gchp-cluster.yaml
   ~/cluster-toolkit/gcluster create --overwrite-deployment gchp-cluster.yaml

   cd ~/cluster-toolkit/gchpslurm/primary
   terraform apply

The next bursted H4D node will have both gVNIC (eth0) and IRDMA (rdma0) NICs.
The ``gchp1470-full`` image already loads the ``idpf`` and ``irdma`` kernel
modules at boot, so no further configuration is required.

.. note::

   Falcon RDMA zones are limited: ``asia-southeast1-a``, ``europe-west4-b``,
   ``us-central1-a``, ``us-central1-b``, and ``us-west4-a``. List current
   availability with:

   .. code-block:: bash

      gcloud compute network-profiles list | grep falcon

================================================================================
4. Verify the GCHP image and Falcon RDMA
================================================================================

Submit a one-node interactive job on the H4D partition:

.. code-block:: bash

   srun -p h4d -N 1 -n 1 --pty bash

Once inside the compute node, confirm everything works:

.. code-block:: bash

   $ rpm -q libmd rdma-core-devel
   libmd-1.2.0-1.el8.x86_64
   rdma-core-devel-48.0-10.el8_10_ciq.x86_64

   $ lsmod | grep -E "irdma|idpf"
   irdma                 528384  0
   idpf                  188416  0

   $ ibv_devinfo | head -5
   hca_id: irdma0
       transport:    InfiniBand (0)
       state:        PORT_ACTIVE (4)      # <- Falcon RDMA ready

   $ mount | grep shared
   10.x.x.x:/nfsshare on /shared type nfs (...)    # Filestore mount

If ``ibv_devinfo`` returns ``Failed to open device``, the iRDMA provider swap
(see :ref:`falcon-rdma-image`) was not applied. The published
``gchp1470-full`` image handles this for you; if you built your own image,
see that page.

================================================================================
5. Build GCHP on the cluster
================================================================================

The compute image ships the full MPI and library stack under ``/opt/gchp``,
so you do not need to install Spack, UCX, OpenMPI, HDF5, netCDF, or ESMF
yourself. What you do need to build is the GCHP binary itself, configured for
your meteorology and chemistry choices.

Grab an H4D node interactively:

.. code-block:: bash

   srun -p h4d -N 1 -n 1 --pty bash

Then bring the stack onto ``PATH`` and build GCHP into a fresh rundir. The
example below uses GCHP 14.7.0 with fullchem; substitute the configuration
appropriate to your work:

.. code-block:: bash

   source /opt/gchp/env.sh

   # Source tree (one time per cluster)
   cd /shared
   git clone --recurse-submodules https://github.com/geoschem/GCHP
   cd GCHP && git checkout 14.7.0
   # Re-sync submodules to the commits pinned by the checked-out tag
   git submodule update --init --recursive

   # Rundir
   cd /shared/GCHP/run
   ./createRunDir.sh        # follow the prompts

   # Build
   cd /shared/rundir-
   mkdir -p build && cd build
   cmake -DRUNDIR=.. \
         -DCMAKE_C_COMPILER=mpicc \
         -DCMAKE_CXX_COMPILER=mpicxx \
         -DCMAKE_Fortran_COMPILER=mpifort \
         /shared/GCHP
   make -j 30 && make install

The ``gchp`` binary appears at the top of the rundir, ready for ``sbatch``.
Confirm it picked up the right libraries:

.. code-block:: bash

   ldd ../gchp | grep -E "openmpi|esmf|netcdf|hdf5"
   # all paths under /opt/gchp/spack/opt/spack/linux-zen4/...

================================================================================
6. Running GCHP on the cluster
================================================================================

Save the following as ``run-gchp.sh`` in your rundir. It is the multi-node
template that uses Falcon RDMA. Sourcing ``/opt/gchp/env.sh`` replaces all the
per-library Spack loads you would normally need.

.. code-block:: bash

   #!/bin/bash
   #SBATCH --job-name=gchp-c90
   #SBATCH --partition=h4d
   #SBATCH --nodes=2
   #SBATCH --ntasks-per-node=180
   #SBATCH --time=04:00:00
   #SBATCH --chdir=/shared/rundir-c90-360

   source /opt/gchp/env.sh

   # /opt/gchp/env.sh already sets OMPI_MCA_pml, UCX_TLS, etc.
   # Override per-job if you want, for example to disable RDMA:
   #   export UCX_TLS=self,sm,sysv,posix,tcp

   echo "20190701 000000" > cap_restart
   source setCommonRunSettings.sh
   source setRestartLink.sh
   source checkRunSettings.sh

   # 2 nodes x 180 tasks per node = 360 MPI ranks
   srun --mpi=pmi2 -n 360 ./gchp > h4d.20190701_0000z.log 2>&1

Submit and monitor:

.. code-block:: bash

   $ sbatch run-gchp.sh
   Submitted batch job 1

   $ squeue
   JOBID PARTITION     NAME  USER ST  TIME NODES NODELIST(REASON)
       1       h4d gchp-c90   you CF  0:05     2 gchpslurm-h4d-[0-1] (Powering up)

   $ tail -f h4d.20190701_0000z.log
   ...
   GCHP Date: 2019/07/01  Time: 00:10:00  Throughput(days/day) ...

================================================================================
7. Cost Management
================================================================================

.. important::

   Cost is the most common source of surprise on GCP. Develop a habit of
   stopping compute nodes whenever you walk away from the cluster.

Daily idle baseline
--------------------------------------------------------------------------------

.. list-table::
   :header-rows: 1
   :widths: 38 14 48

   * - Resource
     - Daily
     - Stoppable?
   * - Filestore 1 TB BASIC_HDD
     - $6.67
     - Only by deleting (loses ``/shared``)
   * - Controller VM
     - $5.03
     - Yes - but breaks Slurm scheduling
   * - Login VM
     - $2.33
     - Yes - but blocks SSH-in
   * - Disks + IPs
     - $2.50
     - No
   * - **Idle total**
     - **$18-20/day**
     -

Stop the cluster when you are done for the day
--------------------------------------------------------------------------------

.. code-block:: bash

   # H4D nodes auto-stop when Slurm jobs finish (~5 min idle)

   # To stop the always-on VMs (cluster won't accept new jobs until you start them):
   gcloud compute instances stop gchpslurm-controller gchpslurm-slurm-login-001 \
     --zone=us-central1-a

This drops the cost to approximately **$10/day** (Filestore + disks only).

Watch your spend
--------------------------------------------------------------------------------

Enable **Billing export to BigQuery** in the Cloud Console
(*Billing -> Billing export*). After 24 hours you can query exact daily spend
per SKU:

.. code-block:: sql

   SELECT service.description, sku.description, SUM(cost) AS usd
   FROM `.billing.gcp_billing_export_*`
   WHERE invoice.month = FORMAT_DATE('%Y%m', CURRENT_DATE())
   GROUP BY 1, 2
   ORDER BY usd DESC
   LIMIT 20

Tear down completely
--------------------------------------------------------------------------------

To stop **all** ongoing charges (including disks and Filestore):

.. code-block:: bash

   cd ~/cluster-toolkit/gchpslurm/primary
   terraform destroy        # deletes everything - files in /shared are lost

.. warning::

   ``terraform destroy`` permanently deletes the Filestore volume and all data
   in ``/shared``. Back up your GCHP build, restarts, and output to Cloud
   Storage first:

   .. code-block:: bash

      gsutil -m rsync -r /shared/rundir-c90-360/OutputDir/ \
        gs:///c90-360/

================================================================================
What is next
================================================================================

* For the engineering details of the published GCHP image and how to build
  your own, see :ref:`falcon-rdma-image`.
* For terminology and pricing references, see :ref:`gcp-terminology`.
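
Before moving on, it can help to sanity-check a budget against the approximate
rates quoted in this guide (about $18/day idle, about $10/hour per running H4D
node). The helper below is only a sketch: ``estimate_cost`` and its arguments
are hypothetical names used for illustration, not part of ``gcloud``, Slurm,
or Cluster Toolkit, and real charges depend on current GCP pricing.

.. code-block:: bash

   #!/bin/bash
   # Rough cost estimate from the approximate rates quoted in this guide.
   # These rates are assumptions taken from the text, not live pricing.
   IDLE_PER_DAY=18        # USD/day: Filestore + controller + login VMs
   H4D_PER_NODE_HOUR=10   # USD/hour for each running H4D node

   # estimate_cost DAYS_UP H4D_NODES HOURS_PER_NODE  (hypothetical helper)
   estimate_cost() {
       local days=$1 nodes=$2 hours=$3
       awk -v d="$days" -v n="$nodes" -v h="$hours" \
           -v idle="$IDLE_PER_DAY" -v rate="$H4D_PER_NODE_HOUR" \
           'BEGIN { printf "%.2f\n", d * idle + n * h * rate }'
   }

   # Cluster left up for 30 days, plus one 2-node, 4-hour GCHP run
   estimate_cost 30 2 4   # prints 620.00

With zero compute hours, ``estimate_cost 30 0 0`` reproduces the $540 idle
month mentioned in the warning at the top of this page.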