Why observability is a research surface
Most “monitoring” stacks were designed to keep services alive — counter-of-the-week, threshold alerts, a dashboard in a war-room. We treat observability as a first-class research artefact: the stream of measurements is the experiment, and every transformation between sensor and figure is content-addressed.
The result is that the system that watches an experiment is the experiment. There is no separate telemetry that diverges from the published claim.
A claim is a hash, not a screenshot.
Every figure that ships with a paper resolves to a SHA-pinned dataset, a SHA-pinned container, and a SHA-pinned commit. If the inputs no longer hash to the same bytes, the figure is invalidated automatically.
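The invalidation rule can be sketched as a pure check over pinned hashes. This is a minimal illustration, not the lab's API: `figure_still_valid`, the in-memory `store`, and the input names are all hypothetical.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Content address: SHA-256 of the raw bytes."""
    return hashlib.sha256(data).hexdigest()

def figure_still_valid(pins: dict, resolve) -> bool:
    """A figure's manifest pins each input to a hash; the figure
    survives only if every input still resolves to the same bytes."""
    return all(sha256_hex(resolve(name)) == pinned
               for name, pinned in pins.items())

# Hypothetical pinned inputs for one figure.
store = {"dataset": b"rows-v1", "commit": b"abc123"}
pins = {name: sha256_hex(blob) for name, blob in store.items()}

assert figure_still_valid(pins, store.__getitem__)
store["dataset"] = b"rows-v2"   # the dataset bytes changed upstream
assert not figure_still_valid(pins, store.__getitem__)
```

The point of the sketch is that validity is a recomputation, not a record: nothing has to remember that the figure broke, because re-hashing the inputs discovers it.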
What we instrument
We instrument three layers — sensor, transform, and decision — and emit a parallel stream of audit events that mirrors the data path.
- Sensor layer
- Raw acquisition: hardware counters, simulator state vectors, cohort enrolment events. Hashed at ingest with the originating clock skew recorded.
- Transform layer
- Every parametric and non-parametric step in the analysis graph. Each transform writes a deterministic function-of-inputs hash so the path is replayable.
- Decision layer
- Where a model emits a class, a flag, or a confidence interval. We capture the full posterior, not just the winning argmax, because reviewers ask.
- Audit channel
- A side-channel of structured events tagged with experiment ID, container digest, and reviewer ID. It survives independently of the data path so a corrupted run is detectable post-hoc.
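One way to realise the transform layer's "deterministic function-of-inputs hash" is to address each node by the canonical JSON of its code identifier, parameters, and input hashes. A sketch under that assumption; the field names are illustrative, not the lab's schema.

```python
import hashlib
import json

def transform_hash(fn_name: str, params: dict, input_hashes: list) -> str:
    """Deterministic function-of-inputs address: hash the canonical
    JSON of (code id, sorted params, ordered input hashes)."""
    canon = json.dumps(
        {"fn": fn_name, "params": params, "inputs": input_hashes},
        sort_keys=True, separators=(",", ":"),
    )
    return hashlib.sha256(canon.encode()).hexdigest()

raw = hashlib.sha256(b"sensor-frame-0").hexdigest()   # sensor-layer hash
h1 = transform_hash("detrend", {"order": 2}, [raw])
h2 = transform_hash("detrend", {"order": 2}, [raw])
assert h1 == h2   # replayable: the same path always has the same address
assert h1 != transform_hash("detrend", {"order": 3}, [raw])
```

Because each transform's address folds in its inputs' addresses, the whole analysis graph becomes a Merkle structure: agreeing on the final hash means agreeing on every upstream step.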
Streaming decomposition, not batch summarisation
The telemetry surface runs an online principal-component decomposition over the embedding stream. The top-10 components are updated every 250 ms with a moving variance attribution that lets a reviewer see — before the next gradient step — whether the signal is concentrating into a single dominant axis.
In practice this means anomalies are caught at the representation level, not at the loss level. A model whose loss looks fine but whose embeddings have collapsed onto two effective dimensions is flagged before checkpoint.
from etr.observability import StreamingPCA

# loader, encoder, and audit come from the surrounding run context.
# 1.2M feature embeddings, 10 retained components
pca = StreamingPCA(n_components=10, decay=0.97)
for batch in loader:
    z = encoder(batch).detach()       # embeddings only; no gradient flow
    attr = pca.update(z)              # online variance attribution
    if attr.dominant_share() > 0.62:  # one axis holds >62% of variance
        audit.flag("representation_collapse", attr=attr.snapshot())
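One way such an online decomposition could work internally is an exponentially decayed covariance that is re-eigendecomposed on each update. The class below is a sketch under that assumption, not the `StreamingPCA` implementation.

```python
import numpy as np

class StreamingPCASketch:
    """Decayed-covariance sketch of online PCA: the covariance is an
    exponential moving average, re-eigendecomposed on each update."""
    def __init__(self, n_components: int, decay: float):
        self.k, self.decay = n_components, decay
        self.cov = None

    def update(self, z: np.ndarray) -> np.ndarray:
        zc = z - z.mean(axis=0)
        batch_cov = zc.T @ zc / len(z)
        self.cov = batch_cov if self.cov is None else (
            self.decay * self.cov + (1 - self.decay) * batch_cov)
        evals = np.linalg.eigvalsh(self.cov)[::-1][: self.k]
        return evals / evals.sum()   # variance share per retained component

pca = StreamingPCASketch(n_components=4, decay=0.97)
rng = np.random.default_rng(0)
share = pca.update(rng.normal(size=(256, 16)))
# Rank-1 data plus small noise: the stream has collapsed onto one axis.
collapsed = rng.normal(size=(256, 1)) @ rng.normal(size=(1, 16))
for _ in range(200):
    share = pca.update(collapsed + 0.01 * rng.normal(size=(256, 16)))
assert share[0] > 0.62   # the dominant-share flag would fire here
```

The decay parameter plays the same role as in the snippet above: it sets how quickly old batches stop contributing to the variance attribution.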
The audit.flag call is not a side-effect — it produces a content-addressed event that becomes part of the run’s manifest. A reviewer querying the published artefacts will see the flag in the same stride as the figures it influenced.
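A minimal sketch of what "the flag becomes part of the run's manifest" could look like. The `Manifest` class and event schema are illustrative assumptions, not the lab's types.

```python
import hashlib
import json

class Manifest:
    """Append-only run manifest: each event is stored under the hash
    of its canonical bytes, so the flag is itself content-addressed."""
    def __init__(self, run_id: str):
        self.run_id = run_id
        self.events = {}

    def flag(self, kind: str, **payload) -> str:
        event = {"run": self.run_id, "kind": kind, "payload": payload}
        blob = json.dumps(event, sort_keys=True).encode()
        digest = hashlib.sha256(blob).hexdigest()
        self.events[digest] = event   # the event's id IS its content hash
        return digest

audit = Manifest(run_id="run-7f3a")
eid = audit.flag("representation_collapse", dominant_share=0.71)
assert audit.events[eid]["kind"] == "representation_collapse"
```

Because the event's identifier is derived from its bytes, a tampered or corrupted flag no longer matches its own address, which is what makes post-hoc detection possible.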
What gets surfaced to a reviewer
Reviewers do not see a dashboard. They see the same stream of decompositions the lab sees, frozen at the run hash. Every figure in a published paper is a query against this stream — anyone with the SHA can re-issue the query and confirm the figure.
- Top-k component drift, with attribution variance bands
- Anomaly flags with their originating transform hash
- Cluster-residency timelines for the embedding stream
- Counterfactual replays at any point in the run
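The figure-as-query idea can be sketched as a pure filter over the frozen event stream, keyed by the run hash. The `query` helper and the event fields below are hypothetical.

```python
def query(events: list, run_hash: str, kind: str) -> list:
    """Re-issuing the same query over the same frozen stream must
    return the same events; a figure is just one such query."""
    return [e for e in events if e["run"] == run_hash and e["kind"] == kind]

# A frozen stream as a reviewer would receive it, pinned at the run hash.
stream = [
    {"run": "7f3a", "kind": "anomaly", "t": 120},
    {"run": "7f3a", "kind": "drift",   "t": 250},
    {"run": "9c01", "kind": "anomaly", "t": 80},
]
fig_events = query(stream, "7f3a", "anomaly")
assert fig_events == query(stream, "7f3a", "anomaly")   # reproducible
assert len(fig_events) == 1
```

Determinism here is the whole guarantee: because the stream is frozen at the run hash, anyone issuing the query gets the bytes behind the published figure, not a live view that may have drifted.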
Replication
Every published claim that depends on observability data carries a replicate.sh that pins the container, the dataset, and the streaming PCA seed. A reviewer with one machine and one GPU should be able to rebuild any anomaly figure in under twenty minutes — or the claim does not ship.
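The "or the claim does not ship" gate amounts to a pre-flight check that every replication input is pinned. A sketch only; `ready_to_ship` and the pin names are assumptions, not the contents of any real replicate.sh.

```python
def ready_to_ship(pins: dict) -> bool:
    """A claim ships only if every replication input is pinned:
    the container digest, the dataset hash, and the PCA seed."""
    required = ("container_digest", "dataset_sha", "pca_seed")
    return all(pins.get(k) not in (None, "") for k in required)

assert ready_to_ship({"container_digest": "sha256:ab12",
                      "dataset_sha": "9f3c",
                      "pca_seed": 1337})
assert not ready_to_ship({"container_digest": "sha256:ab12",
                          "dataset_sha": None,      # unpinned input
                          "pca_seed": 1337})
```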