Experiments
Five studies test the validity and generalizability of the GS methodology. Four are complete; the fifth, DX, is scheduled for April 2026.
| Experiment | What It Tests | Status |
|---|---|---|
| AX — Adversarial | Quality as a function of specification completeness. Eight conditions, naive through ForgeCraft treatment v7. RealWorld Conduit benchmark. | ✅ Complete |
| BX — Benchmark | Rubric validity. Three Conduit implementations scored blind against the GS rubric — two never exposed to GS. Establishes the rubric captures real quality. | ✅ Complete |
| CX — Patchability | Whether GS-specified codebases are more patchable. SWE-bench-style patch tasks on the two quality tiers characterized by BX. | ✅ Complete |
| RX — Replication | Any reader can reproduce 104 passing tests against a live PostgreSQL instance from a GS document alone. No ForgeCraft required. | ✅ Complete |
| DX — Human Practitioner | 40 developers, two conditions. Tests between-practitioner replication across engineers of varying GS skill. Crossover design. | 🗓 April 2026 |
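The DX row names a crossover design with 40 developers and two conditions but does not describe the allocation procedure. The following is a minimal sketch of one common approach, a balanced AB/BA allocation, and not the study's actual protocol. The condition labels `A` and `B`, the participant IDs, and the seed are all hypothetical.

```python
import random

def assign_crossover(participants, seed=0):
    """Split participants evenly into two sequences: A-then-B and B-then-A.

    Generic AB/BA crossover sketch. Labels "A" and "B" are placeholders
    for the two DX conditions, which this document does not name.
    """
    if len(participants) % 2 != 0:
        raise ValueError("a balanced two-sequence crossover needs an even number of participants")
    rng = random.Random(seed)        # fixed seed keeps the allocation reproducible
    shuffled = list(participants)    # copy so the caller's list is not reordered
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    schedule = {}
    for person in shuffled[:half]:
        schedule[person] = ("A", "B")    # first period A, second period B
    for person in shuffled[half:]:
        schedule[person] = ("B", "A")    # first period B, second period A
    return schedule

# Hypothetical participant IDs for the 40 developers.
developers = [f"dev-{i:02d}" for i in range(1, 41)]
schedule = assign_crossover(developers, seed=2026)
```

Every developer works under both conditions, so each acts as their own control, and the even split across the two orders balances period effects between conditions.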
Validation Structure
The five experiments address a three-layer validity problem:
| Layer | Threat | Closed By |
|---|---|---|
| Output measurement | External checks use criteria the author defined | BX: rubric applied to non-GS implementations |
| Rubric validity | Rubric rewards GS compliance, not objective quality | BX + CX: congruent with CVE count, test count, patchability |
| Guidance circularity | GS guided the implementation AND scored it | DX: blind evaluator, 40 external practitioners |
Layers 1 and 2 (output measurement and rubric validity) are closed. Layer 3 (guidance circularity) is scheduled to close with DX in April 2026.
Pre-Registration Policy
AX and DX rubrics, hypotheses, and evaluation criteria were committed to this repository before any experimental run. Commit timestamps are cryptographically signed by GitHub, so any post-hoc rubric adjustment would appear in the history as a later commit.
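A reader who wants to check the ordering claim can compare commit dates directly. The sketch below is one possible way to do that, not tooling from this repository. It reads the committer date of the commit that first added a file and tests whether the rubric predates the first run artifact. It compares only the dates recorded in the commits and does not verify signatures. The function names are hypothetical, and the caller must supply the real file paths.

```python
import subprocess
from datetime import datetime, timezone

def first_commit_time(path, repo="."):
    """Return the committer date (UTC) of the commit that first added `path`."""
    lines = subprocess.run(
        ["git", "-C", repo, "log", "--diff-filter=A", "--follow",
         "--format=%ct", "--", path],
        check=True, capture_output=True, text=True,
    ).stdout.split()
    if not lines:
        raise FileNotFoundError(f"no commit adds {path}")
    # git log lists newest first, so the last entry is the earliest addition.
    return datetime.fromtimestamp(int(lines[-1]), tz=timezone.utc)

def preregistered(rubric_time, first_run_time):
    """True when the rubric was committed strictly before the first run artifact."""
    return rubric_time < first_run_time

def check_preregistration(rubric_path, run_artifact_path, repo="."):
    """Compare the first commits of a rubric file and a run artifact (paths are caller-supplied)."""
    return preregistered(
        first_commit_time(rubric_path, repo),
        first_commit_time(run_artifact_path, repo),
    )
```

Calling `check_preregistration` with the path of a rubric file and the path of the earliest evidence file for the same experiment returns `True` when the rubric was committed first.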
Reproduce RX Yourself
git clone https://github.com/jghiringhelli/generative-specification
cd generative-specification/experiments/rx
docker compose up -d postgres
./runner/run.sh
cat evidence/jest-output.json # numFailedTests === 0
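Rather than reading the JSON by eye, the pass condition can be checked programmatically. This is a minimal sketch that assumes `evidence/jest-output.json` is a standard Jest `--json` report. The document confirms only the `numFailedTests` field; `numPassedTests` and `numTotalTests` are the usual Jest report fields and are assumed to be present. The expected count of 104 comes from the RX row of the experiments table, and the helper names are hypothetical.

```python
import json

def load_report(path="evidence/jest-output.json"):
    """Parse the Jest JSON report written by the RX runner."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)

def summarize_jest(report):
    """Extract the aggregate counters from a Jest --json report (assumed format)."""
    return {
        "passed": report["numPassedTests"],
        "failed": report["numFailedTests"],
        "total": report["numTotalTests"],
    }

def replication_ok(report, expected_passed=104):
    """True when no test failed and the expected number of tests passed."""
    summary = summarize_jest(report)
    return summary["failed"] == 0 and summary["passed"] == expected_passed

# Illustrative report with the counters a successful RX run would show.
sample = {"numPassedTests": 104, "numFailedTests": 0, "numTotalTests": 104}
print(replication_ok(sample))  # True
```

After a real run, `replication_ok(load_report())` applies the same check to the file produced by `./runner/run.sh`.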