.. _gcp-terminology:

==========================
Reference: GCP Terminology
==========================

A short glossary of Google Cloud terms used in this guide.
Equivalents to AWS terms (where relevant) are listed for readers
familiar with the AWS GEOS-Chem cloud guide.


.. _term-project:

Project
----------------------

What is a GCP Project?
^^^^^^^^^^^^^^^^^^^^^^

A **Project** is the top-level GCP container for all resources.
Every VM, disk, network, and IP address belongs to exactly one
project, and billing accumulates per project. Each project is
identified by a globally unique **Project ID** (e.g.,
``gchp-prod-414000``), which ``gcloud`` commands use to select the
project they act on.

*AWS equivalent: AWS Account.*


.. _term-compute-engine:

Compute Engine
----------------------

What is Compute Engine?
^^^^^^^^^^^^^^^^^^^^^^^

**Compute Engine** is GCP's virtual machine service. It hosts the
cluster's controller, login, and burst compute nodes. The two
machine types used in this guide are:

* ``c2-standard-60`` - Cascade Lake Xeon with 30 physical cores
  per VM (hyperthreading off). ~$2.48/hour. Good for C48-C90
  single-node baseline runs.

* ``h4d-standard-192`` - AMD EPYC Turin with **192 physical
  cores per VM** and Falcon RDMA support (through the ``irdma``
  driver). ~$10/hour. Required for multi-node Falcon RDMA workloads.
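
The two hourly prices are easier to compare per physical core. The
following is a minimal sketch that uses only the approximate figures
quoted above; ``MACHINE_TYPES`` and ``price_per_core_hour`` are
illustrative names rather than part of any GCP API, and actual prices
vary by region and over time.

.. code-block:: python

   # Approximate figures quoted above: (physical cores per VM, USD per hour).
   MACHINE_TYPES = {
       "c2-standard-60": (30, 2.48),
       "h4d-standard-192": (192, 10.00),
   }


   def price_per_core_hour(machine_type):
       """Return the approximate USD cost of one physical core for one hour."""
       cores, usd_per_hour = MACHINE_TYPES[machine_type]
       return usd_per_hour / cores


   for name in MACHINE_TYPES:
       print(f"{name}: ~${price_per_core_hour(name):.3f} per core-hour")

With these inputs the H4D node comes out cheaper per core-hour than
the C2 node, although the per-node price is higher.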

*AWS equivalent: EC2.*


.. _term-cluster-toolkit:

Cluster Toolkit
----------------------

What is Cluster Toolkit?
^^^^^^^^^^^^^^^^^^^^^^^^

Google's open-source HPC cluster deployer (formerly called HPC
Toolkit). It reads a **blueprint** (YAML) and generates Terraform
code to provision a Slurm cluster on Compute Engine.

Why use Cluster Toolkit for GCHP?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

* **Reproducible**: the entire cluster (network, Filestore,
  partitions, login, controller) is described in one YAML file.

* **Slurm-integrated**: the toolkit configures Slurm-on-GCP. The
  cluster starts with zero running compute nodes; Slurm-on-GCP boots
  them on demand when a job is submitted with ``sbatch`` and shuts
  them down once they have been idle for ~5 minutes.

* **Cost-aware**: only the controller and login VMs are always
  on. Compute nodes incur charges only while they are powered on,
  that is, while a job is running plus the short idle window
  before shutdown.

*AWS equivalent: AWS ParallelCluster.*


.. _term-filestore:

Filestore
----------------------

What is Filestore?
^^^^^^^^^^^^^^^^^^

**Filestore** is GCP's managed NFS service. We use it for the
cluster's ``/shared`` mount holding the Spack stack, GCHP binary,
ExtData, and run directories.

The smallest BASIC_HDD volume is 1 TB, costing ~$6.67/day. This is
the dominant fixed-cost line for an idle GCHP cluster.
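
To see why this line dominates an idle cluster's bill, the daily
figure can be scaled to longer periods. This is a rough sketch based
only on the ~$6.67/day estimate above; the constant and the
``filestore_cost`` helper are illustrative, and real Filestore pricing
depends on region and tier.

.. code-block:: python

   # Approximate daily cost quoted above for the smallest BASIC_HDD volume.
   FILESTORE_USD_PER_DAY = 6.67


   def filestore_cost(days):
       """Return the approximate Filestore cost in USD for `days` days."""
       return FILESTORE_USD_PER_DAY * days


   for days in (7, 30, 365):
       print(f"{days:>3} days: ~${filestore_cost(days):,.2f}")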

*AWS equivalent: FSx for Lustre or EFS.*


.. _term-slurm-gcp:

Slurm (and Slurm-GCP)
----------------------

What is Slurm?
^^^^^^^^^^^^^^

Slurm is the standard HPC batch scheduler. Jobs are submitted with
``sbatch`` and monitored with ``squeue`` and ``sinfo``. Cluster
Toolkit ships a Slurm-GCP integration that handles burst node
provisioning. When you ``sbatch`` a job, Slurm-GCP boots the
requested number of compute VMs from a pool of cloud-burst nodes
(about 90 s for an H4D node), runs the job, then powers them down
~5 minutes after the last job finishes.
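
The boot time and idle window add a small overhead to every burst
job. The sketch below estimates the cost of one job under the
assumption that each node is billed from boot until power-down; the
90 s and 5 minute values are the approximate figures above, the
$10/hour rate is this guide's approximate H4D price, and the function
names are illustrative.

.. code-block:: python

   BOOT_SECONDS = 90       # approximate H4D boot time quoted above
   IDLE_SECONDS = 5 * 60   # approximate idle window before power-down


   def node_seconds_powered_on(job_seconds, nodes):
       """Estimate total VM-seconds powered on for a single job."""
       return nodes * (BOOT_SECONDS + job_seconds + IDLE_SECONDS)


   def estimated_cost(job_seconds, nodes, usd_per_node_hour):
       """Estimate job cost in USD, assuming billing from boot to power-down."""
       hours = node_seconds_powered_on(job_seconds, nodes) / 3600
       return hours * usd_per_node_hour


   # Example: a 1-hour job on 2 nodes at an assumed ~$10/hour per node.
   print(f"~${estimated_cost(3600, 2, 10.0):.2f}")

For short jobs the fixed 390 s of overhead per node is a noticeable
fraction of the bill, so batching several short runs into one
allocation can be worthwhile.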


.. _term-falcon-rdma:

Falcon RDMA
----------------------

What is Falcon RDMA?
^^^^^^^^^^^^^^^^^^^^

Google's hardware-assisted RDMA-over-Ethernet transport, exposed on
H4D instances via the ``irdma`` kernel module and a dedicated VPC
subnet with a ``<zone>-vpc-falcon`` network profile. It gives
multi-node MPI low-latency, zero-copy semantics similar to those of
InfiniBand.

The supported zones (as of 2026-06) are ``asia-southeast1-a``,
``europe-west4-b``, ``us-central1-a``, ``us-central1-b``, and
``us-west4-a``. The current list can be queried with
``gcloud compute network-profiles list | grep falcon``.
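
Before editing a blueprint it can help to check a candidate zone
against this list. The following is a small sketch that hard-codes
the zones listed above, so it goes stale whenever the live list
changes; ``FALCON_ZONES``, ``supports_falcon``, and ``region_of`` are
illustrative names.

.. code-block:: python

   # Zones listed above (as of 2026-06); the live list may differ.
   FALCON_ZONES = {
       "asia-southeast1-a",
       "europe-west4-b",
       "us-central1-a",
       "us-central1-b",
       "us-west4-a",
   }


   def supports_falcon(zone):
       """Return True if `zone` is in the hard-coded list above."""
       return zone in FALCON_ZONES


   def region_of(zone):
       """Strip the trailing zone letter, e.g. us-central1-a -> us-central1."""
       return zone.rsplit("-", 1)[0]


   print(supports_falcon("us-central1-a"))
   print(sorted({region_of(zone) for zone in FALCON_ZONES}))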

Why Falcon RDMA matters for GCHP
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Multi-node MPI via TCP over gVNIC degrades sharply at higher core
counts because the kernel network stack adds latency and CPU
overhead. Falcon RDMA bypasses the kernel for inter-node
communication, restoring near-shared-memory performance across the
cluster. Concretely, our C90 strong scaling stays linear through
360 cores (2 H4D nodes) with Falcon RDMA but would not do so over
TCP.

*AWS equivalent: EFA on c5n/hpc6id instances.*


.. _term-gvnic:

gVNIC
----------------------

What is gVNIC?
^^^^^^^^^^^^^^

Google Virtual NIC - the standard high-performance virtual NIC on
modern Compute Engine instances. Carries normal TCP/IP traffic
(including NFS to Filestore). On H4D nodes it serves as the primary
NIC alongside an IRDMA secondary NIC.


.. _term-irdma:

IRDMA NIC
----------------------

What is an IRDMA NIC?
^^^^^^^^^^^^^^^^^^^^^

The NIC type GCP uses for Falcon RDMA on H4D. Attached as a second
network interface on each H4D VM, on a Falcon-enabled VPC subnet.
The kernel-side driver is the ``irdma`` module (loaded automatically
at boot by the published GCHP image).


.. _term-image:

Image
----------------------

What is a GCP Image?
^^^^^^^^^^^^^^^^^^^^

A reusable copy of a VM's boot disk that can be used to launch new VMs.
The ``gchp-h4d-rocky8-v2`` image (this guide) has the kernel
modules, system packages, and first-boot scripts required to run
GCHP on H4D. See :ref:`falcon-rdma-image` for details.

*AWS equivalent: AMI.*


.. _term-vpc-gcp:

VPC
----------------------

What is a VPC on GCP?
^^^^^^^^^^^^^^^^^^^^^

Virtual Private Cloud - the networking layer that connects your
VMs. On GCP a VPC network is a global resource and its subnets are
regional; Cluster Toolkit creates a VPC network with a regional
subnet for the cluster. Falcon RDMA requires a second, zone-pinned
VPC with a ``vpc-falcon`` network profile.


.. _term-quota-gcp:

Service Quota
----------------------

What is a Service Quota?
^^^^^^^^^^^^^^^^^^^^^^^^

A regional or global limit on how many of a particular resource
(CPUs, IP addresses, Filestore TB) you can create. Default quotas
on a fresh project are low; raise them via **IAM & Admin -> Quotas
& System Limits**.

*AWS equivalent: Service Quotas / vCPU limits.*


.. _term-iam-gcp:

IAM (Identity and Access Management) on GCP
-------------------------------------------

What is IAM on GCP?
^^^^^^^^^^^^^^^^^^^

Google's identity and authorization framework. IAM controls who
(users, groups, service accounts) can do what (compute.admin,
file.editor, etc.) to which resources. Permission grants are made
by binding a **role** (a named set of permissions) to a **principal**
(a user, group, or service account).

When you run ``terraform apply`` to deploy a cluster, your user
needs ``roles/compute.admin``, ``roles/file.editor``, and several
others - listed in :ref:`prepare-gcp-environment`.
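
The role-to-principal binding can be pictured as plain data. The
sketch below mirrors the general shape of an IAM policy, a list of
bindings that each pair one role with a list of member strings. The
e-mail addresses and the ``has_role`` helper are hypothetical, and
real authorization is evaluated by GCP itself, not by client code.

.. code-block:: python

   # Hypothetical policy: the principals below are placeholders.
   policy = {
       "bindings": [
           {
               "role": "roles/compute.admin",
               "members": ["user:alice@example.com"],
           },
           {
               "role": "roles/file.editor",
               "members": [
                   "user:alice@example.com",
                   "group:gchp-team@example.com",
               ],
           },
       ]
   }


   def has_role(policy, member, role):
       """Return True if `member` is bound directly to `role` in `policy`."""
       return any(
           binding["role"] == role and member in binding["members"]
           for binding in policy["bindings"]
       )


   print(has_role(policy, "user:alice@example.com", "roles/compute.admin"))
   print(has_role(policy, "group:gchp-team@example.com", "roles/compute.admin"))

The helper only checks direct bindings; it does not model group
membership or roles inherited from a folder or organization.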


.. _term-billing-export:

Billing Export to BigQuery
--------------------------

What is Billing Export to BigQuery?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A GCP feature that mirrors every billing event into a BigQuery
table for SQL analysis. It is the cleanest way to audit your spend
by service over arbitrary time ranges. Enable it in **Billing ->
Billing export**.

Once export data has started to arrive (allow about 24 hours after
enabling the export), you can run queries like:

.. code-block:: sql

   -- Spend for the current invoice month, by service and SKU, largest first.
   SELECT service.description, sku.description, SUM(cost) AS usd
   FROM `<project>.billing.gcp_billing_export_*`
   WHERE invoice.month = FORMAT_DATE('%Y%m', CURRENT_DATE())
   GROUP BY 1, 2
   ORDER BY usd DESC


.. _term-cloud-storage:

Cloud Storage
----------------------

What is Cloud Storage?
^^^^^^^^^^^^^^^^^^^^^^

GCP's object store. Useful for archiving simulation output before
tearing down the cluster (``terraform destroy``) so that destroying
Filestore does not lose your results.

*AWS equivalent: S3.*


.. _term-os-login:

OS Login
----------------------

What is OS Login?
^^^^^^^^^^^^^^^^^

Google's recommended SSH access mechanism. Instead of putting
public keys into instance metadata, OS Login lets users SSH with
their GCP IAM identity. The published ``gchp-h4d-rocky8`` image
works with both approaches.


.. _term-dws-flex-start:

Dynamic Workload Scheduler (DWS) Flex Start
-------------------------------------------

What is DWS Flex Start?
^^^^^^^^^^^^^^^^^^^^^^^

A provisioning mode where you ask GCP to run a job "sometime in the
next N hours" instead of demanding the resource right now. The
scheduler queues your request and starts it when capacity is
available. Cheaper than reservations; the right choice for H4D runs
that hit ``ZONE_RESOURCE_POOL_EXHAUSTED`` stockouts.

Submit with ``--provisioning-model=FLEX_START`` on
``gcloud compute instances bulk create``.