How Conformance Works

The suite is empirical, not a proof. For each capability it:

1. Creates a fresh store per seed (isolation between runs).
2. Writes test data drawn from the locked corpus, issues queries, and checks results against an oracle that derives ground truth.
3. Scores in both directions: false positives (e.g. leakage) and false negatives (e.g. over-restriction).
4. Computes a bootstrap confidence interval per direction.
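
The per-seed part of the steps above can be sketched roughly as follows. This is a toy illustration, not the code in src/aml/eval/conformance.py: CORPUS, InMemoryStore, oracle, and score_seed are hypothetical names, and the "capability" is assumed to be a simple per-owner visibility rule.

```python
import random
from dataclasses import dataclass, field

# Hypothetical stand-ins: none of these names come from src/aml/eval/.
CORPUS = [("doc-a", "alice"), ("doc-b", "bob"), ("doc-c", "alice"), ("doc-d", "carol")]


@dataclass
class InMemoryStore:
    """Toy backend that maps each document ID to its owner."""
    rows: dict = field(default_factory=dict)

    def write(self, doc_id, owner):
        self.rows[doc_id] = owner

    def query(self, user):
        # A conforming backend returns exactly the user's own documents.
        return {d for d, o in self.rows.items() if o == user}


def oracle(written, user):
    """Derive ground truth from the written data, independently of the store."""
    return {d for d, o in written if o == user}


def score_seed(seed, store_factory=InMemoryStore):
    rng = random.Random(seed)
    store = store_factory()               # fresh store per seed
    written = rng.sample(CORPUS, k=3)     # test data drawn from the corpus
    for doc_id, owner in written:
        store.write(doc_id, owner)
    user = rng.choice(sorted({o for _, o in written}))
    got, expected = store.query(user), oracle(written, user)
    return {
        "false_positive": int(bool(got - expected)),  # leakage
        "false_negative": int(bool(expected - got)),  # over-restriction
    }


per_seed = [score_seed(s) for s in range(5)]
```

Keeping the two error directions as separate outcomes is what lets each one receive its own confidence interval later, rather than folding both into a single accuracy figure.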

A direction PASSES if and only if its CI excludes the failing outcome (GMP §8.2). When per-seed outcomes are deterministic and identical across seeds, every bootstrap resample has the same mean, so the CI is degenerate (lo = hi = mean) and a clean backend passes unambiguously.
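
A minimal sketch of the pass rule and the degenerate case, under stated assumptions: a percentile bootstrap over per-seed outcomes, with outcomes encoded as 0.0 (clean) and 1.0 (failing). The function names, the failing_outcome parameter, and the encoding are illustrative only; the authoritative definition is in GMP §8.2 and src/aml/eval/metrics.py.

```python
import random


def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-seed outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample the seeds with replacement and record each resample's mean.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * (n_resamples - 1))]
    hi = means[int((1 - alpha / 2) * (n_resamples - 1))]
    return lo, hi


def direction_passes(outcomes, failing_outcome=1.0):
    # Assumed reading of the rule: pass iff the failing value lies outside [lo, hi].
    lo, hi = bootstrap_ci(outcomes)
    return not (lo <= failing_outcome <= hi)


clean = [0.0] * 5    # every seed clean
broken = [1.0] * 5   # every seed failing

print(bootstrap_ci(clean))        # (0.0, 0.0): degenerate, lo = hi = mean
print(direction_passes(clean))    # True
print(direction_passes(broken))   # False
```

When all seeds agree, resampling cannot produce any other mean, which is why the interval collapses to a single point.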

Source: src/aml/eval/conformance.py, src/aml/eval/metrics.py