How Conformance Works
The suite is empirical, not a proof. For each capability it:
- Creates a fresh store per seed (isolation between runs).
- Writes test data drawn from the locked corpus, issues queries, and checks results against an oracle that derives ground truth.
- Scores in both directions: false positives (e.g. leakage) and false negatives (e.g. over-restriction).
- Computes a bootstrap confidence interval per direction.
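The two-sided scoring step can be illustrated with a minimal sketch. It assumes query results and oracle ground truth are both sets of record ids; `TwoSidedScore` and `score_two_sided` are hypothetical names for illustration and are not taken from src/aml/eval/metrics.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoSidedScore:
    # Share of returned items the oracle says should not have been returned.
    false_positive_rate: float
    # Share of oracle-expected items the backend failed to return.
    false_negative_rate: float


def score_two_sided(returned: set[str], expected: set[str]) -> TwoSidedScore:
    """Compare one query result against oracle ground truth (illustrative only)."""
    leaked = returned - expected   # false positives, e.g. leakage
    missed = expected - returned   # false negatives, e.g. over-restriction
    fp = len(leaked) / len(returned) if returned else 0.0
    fn = len(missed) / len(expected) if expected else 0.0
    return TwoSidedScore(false_positive_rate=fp, false_negative_rate=fn)


# Hypothetical example: one leaked id ("x") and two missed ids ("c", "d").
example = score_two_sided({"a", "b", "x"}, {"a", "b", "c", "d"})
```

Keeping the two rates separate matters because a backend can look clean in one direction by failing in the other: returning nothing never leaks, but it misses everything.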
A direction PASSES iff its CI excludes the failing outcome (GMP §8.2). With deterministic per-seed outcomes that agree across seeds, the CI is degenerate (lo = hi = mean), so a clean backend passes unambiguously.
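The degenerate case is easy to see in a small percentile-bootstrap sketch. This is an illustration under assumptions, not the code in src/aml/eval/metrics.py: `bootstrap_ci`, `direction_passes`, and the `fail_at` threshold are hypothetical, each per-seed outcome is assumed to be an error rate, and the actual pass rule is the one defined in GMP §8.2.

```python
import random
from statistics import mean


def bootstrap_ci(values, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-seed outcomes (sketch)."""
    rng = random.Random(seed)
    n = len(values)
    # Resample the per-seed outcomes with replacement and record each mean.
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def direction_passes(error_rates, fail_at):
    """Assumed rule: pass when the whole CI lies below the failing rate."""
    _, hi = bootstrap_ci(error_rates)
    return hi < fail_at


# Every seed gives the same outcome, so every resample has the same mean
# and the interval collapses to a single point.
clean_ci = bootstrap_ci([0.0] * 5)
```

When all per-seed values are identical, resampling cannot produce any other mean, which is why lo = hi = mean in that case.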
Source: src/aml/eval/conformance.py, src/aml/eval/metrics.py