The two non-negotiables
Every job that runs on lab compute carries two SHAs, enforced by the scheduler: the digest of its container image and the hash of its dataset manifest. If either is missing, the scheduler refuses the job. If either changes, the job is treated as a new experiment and gets a new run ID.
```
$ etr run --pin sha256:9f2c1a…
✓ image verified
✓ dataset hash matched
→ launching on cluster aurora-n7
→ run id: r7-2026-04-29-1947
```
This is not a recommendation; it is a hard requirement enforced by the scheduler. Reviewers can re-issue any run by replaying the same two SHAs against the cluster, and the scheduler will refuse the replay if anything has drifted.
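As a sketch of that admission rule, here is the check in Python. `JobSpec`, `PinVault`, and `admit` are illustrative names, not the scheduler's actual interface; this is a minimal model of the stated behaviour, nothing more:

```python
import itertools
from dataclasses import dataclass

@dataclass(frozen=True)
class JobSpec:
    image_digest: str | None   # OCI digest of the container image
    dataset_sha: str | None    # SHA-256 of the dataset manifest

class PinVault:
    """Toy in-memory stand-in for the pin vault (illustrative only)."""
    def __init__(self):
        self._runs: dict[str, JobSpec] = {}
        self._ids = itertools.count(1)

    def mint_run_id(self, job: JobSpec) -> str:
        run_id = f"r-{next(self._ids):04d}"
        self._runs[run_id] = job
        return run_id

    def lookup(self, run_id: str) -> JobSpec:
        return self._runs[run_id]

def admit(job: JobSpec, vault: PinVault, replay_of: str | None = None) -> str:
    # Rule 1: no SHA, no job.
    if not job.image_digest or not job.dataset_sha:
        raise PermissionError("refused: container digest and dataset hash required")
    # Rule 2: a replay must match the recorded pins exactly.
    if replay_of is not None:
        pinned = vault.lookup(replay_of)
        if (pinned.image_digest, pinned.dataset_sha) != (
            job.image_digest, job.dataset_sha
        ):
            raise PermissionError(f"refused: pins have drifted from {replay_of}")
    # Rule 3: any change to either pin arrives here as a new pair,
    # so it is admitted as a new experiment with its own run ID.
    return vault.mint_run_id(job)
```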
No "just one quick run" without a hash.
Ad-hoc runs are explicitly disallowed by the scheduler. If a researcher wants to try something experimental, they pin a sandbox container, which has its own SHA, and the experiment becomes part of the public ledger automatically.
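How a sandbox acquires its SHA is not spelled out in the post. One plausible flow is to hash the exported image before submission; `digest_of` below is a hypothetical helper, and a real OCI digest is computed over the image manifest rather than a flat tarball, but the principle, hash first and run second, is the same:

```python
import hashlib
from pathlib import Path

def digest_of(image_tar: Path) -> str:
    """Content digest of an exported sandbox image (illustrative)."""
    h = hashlib.sha256()
    with image_tar.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return "sha256:" + h.hexdigest()

# Even a throwaway experiment enters through the same admission check,
# so it lands in the same ledger as every other run:
# run_id = admit(JobSpec(digest_of(Path("sandbox.tar")), dataset_sha), vault)
```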
Cluster topology
The compute fabric is intentionally small and uniform: six named nodes in total, including one scheduler and one pin-vault that holds every SHA the lab has ever issued.
| Node | Role | Hardware |
|---|---|---|
| aurora-n7 | Primary training cluster | 8 × H100, 1.5 TB DRAM |
| archive-α | Append-only artefact storage | 612 TB, 3-way ECC |
| runner-β | CI / replication harness | 24×7, 4 × A100 |
| pin-vault | SHA-256 manifest authority | dedicated, immutable |
| scheduler | Slurm 24.05 with the pin enforcer | dual-node failover |
| peer-mirror | Three-replica off-site mirror | partner institutions |
Reproducibility, end to end
A reproducible run is not a collection of “best practices.” It is a property the scheduler enforces, pin by pin (a manifest sketch follows the list):
- Container digest pinned — every run carries the OCI digest of its image; floating tags are rejected.
- Dataset manifest pinned — every run names the dataset by SHA, never by mutable path.
- Random seeds versioned — seeds are recorded with the run; replays use the recorded seed by default.
- Output content-addressed — every artefact a run emits is hashed at write time and indexed in the vault.
- Replay command emitted — every run produces a one-line replicate.sh that any reviewer can run.
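Taken together, the five pins amount to a single manifest per run. A minimal sketch, with illustrative field names rather than the scheduler's actual schema:

```python
import json

# Illustrative run manifest; field names are assumptions.
manifest = {
    "run_id": "r7-2026-04-29-1947",
    "image_digest": "sha256:9f2c1a8b…",   # pinned OCI digest; no floating tags
    "dataset_sha": "sha256:7b3a4f12…",    # dataset named by hash, never by path
    "seed": "0xC0FFEE",                   # recorded; replays reuse it by default
    "outputs": {},                        # filled at write time, content-addressed
    "replay": "replicate.sh",             # the one-line reviewer command
}
print(json.dumps(manifest, indent=2))
```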
Storage, in three tiers
The archival surface is structured so the hot path is small and the historical surface is mirrored.
- Hot tier — 24 TB on cluster. Active datasets and recent run outputs. Reads are local; writes go through the pin vault.
- Warm tier — 192 TB on archive-α. Datasets and runs from the last 18 months. Reads complete within seconds; writes are append-only.
- Cold tier — 396 TB mirrored. Older artefacts replicated across the peer-mirror network. Reads can take minutes; integrity is checked at read time.
- Vault — 4 TB, dedicated. The pin authority: every SHA the lab has ever issued, append-only with hash chains; an entry is never deleted.
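The vault's "append-only with hash chains" guarantee is easy to picture in code. This is a minimal sketch of the idea, not the vault's actual on-disk format:

```python
import hashlib
import json

class HashChain:
    """Append-only log where each entry commits to the one before it,
    so a silent edit or deletion anywhere breaks verification."""
    def __init__(self):
        self.entries: list[dict] = []

    def append(self, record: dict) -> None:
        prev = self.entries[-1]["link"] if self.entries else "0" * 64
        body = json.dumps(record, sort_keys=True)
        link = hashlib.sha256((prev + body).encode()).hexdigest()
        self.entries.append({"record": record, "link": link})

    def verify(self) -> bool:
        prev = "0" * 64
        for e in self.entries:
            body = json.dumps(e["record"], sort_keys=True)
            if hashlib.sha256((prev + body).encode()).hexdigest() != e["link"]:
                return False
            prev = e["link"]
        return True
```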
How a run looks in practice
A typical training run from a working group looks like:
```
$ etr run \
    --image sha256:9f2c1a8b… \
    --dataset sha256:7b3a4f12… \
    --config configs/rl-feature-q.yaml \
    --seed 0xC0FFEE \
    --replicate manifests/r7-2026-04-29-1947.json
✓ image verified 9f2c1a8b…
✓ dataset hash matched 7b3a4f12…
✓ pin vault entry created v-r7-2947
→ launching on cluster aurora-n7
→ run id r7-2026-04-29-1947
```
When the run finishes, the manifest is published to the vault, mirrored to peer storage, and the replicate.sh is added to the paper-in-progress. A reviewer with the manifest can replay the run on their own cluster, or on runner-β if they have access.
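The generated replicate.sh itself is not shown here; assuming the manifest fields sketched earlier, it plausibly reduces to a single etr invocation using the same flags as the run above. A hypothetical renderer:

```python
def replicate_sh(manifest: dict) -> str:
    """Render the one-line replay command from a run manifest
    (field names as in the earlier sketch; illustrative only)."""
    return (
        "#!/bin/sh\n"
        f"etr run --image {manifest['image_digest']} "
        f"--dataset {manifest['dataset_sha']} "
        f"--seed {manifest['seed']} "
        f"--replicate manifests/{manifest['run_id']}.json\n"
    )
```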
Request access
Compute access is granted through one of three doors, the same three as network access. The most direct door is a collaboration proposal, which carries a 90-day allocation on aurora-n7 once accepted.
For replication runs, access is automatic: anyone with a public claim’s manifest can submit a replay against runner-β without a proposal. The replay is rate-limited to two concurrent jobs per replicating team.
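The two-concurrent-jobs cap amounts to a per-team counting gate. A minimal sketch of that policy; the `ReplayGate` class and its method names are illustrative stand-ins, not the harness's actual interface:

```python
import threading

class ReplayGate:
    """Admits at most `limit` concurrent replay jobs per team
    (the stated policy: two per replicating team)."""
    def __init__(self, limit: int = 2):
        self._limit = limit
        self._active: dict[str, int] = {}
        self._lock = threading.Lock()

    def try_acquire(self, team: str) -> bool:
        with self._lock:
            if self._active.get(team, 0) >= self._limit:
                return False          # queue or reject the replay
            self._active[team] = self._active.get(team, 0) + 1
            return True

    def release(self, team: str) -> None:
        with self._lock:
            self._active[team] -= 1
```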