# Experiments

Five studies establish the validity and generalizability of the GS methodology.

| Experiment | What It Tests | Status |
|---|---|---|
| AX — Adversarial | Quality as a function of specification completeness. Eight conditions, naive through ForgeCraft treatment v7. RealWorld Conduit benchmark. | ✅ Complete |
| BX — Benchmark | Rubric validity. Three Conduit implementations scored blind against the GS rubric; two were never exposed to GS. Establishes that the rubric captures real quality. | ✅ Complete |
| CX — Patchability | Whether GS-specified codebases are easier to patch. SWE-bench-style patch tasks on the two quality tiers characterized by BX. | ✅ Complete |
| RX — Replication | Any reader can reproduce 104 passing tests against a live PostgreSQL instance from a GS document alone. No ForgeCraft required. | ✅ Complete |
| DX — Human Practitioner | 40 developers, two conditions. Tests between-practitioner replication across engineers of varying GS skill. Crossover design. | 🗓 April 2026 |

## Validation Structure

The five experiments address a three-layer validity problem:

| Layer | Threat | Closed By |
|---|---|---|
| Output measurement | External checks use criteria the author defined | BX: rubric applied to non-GS implementations |
| Rubric validity | Rubric rewards GS compliance, not objective quality | BX + CX: congruent with CVE count, test count, patchability |
| Guidance circularity | GS guided the implementation AND scored it | DX: blind evaluator, 40 external practitioners |

Layers 1 and 2 are closed. Layer 3 closes April 2026.


## Pre-Registration Policy

AX and DX rubrics, hypotheses, and evaluation criteria were committed to this repository before any experimental run. Commit timestamps are cryptographically signed by GitHub. This prevents post-hoc rubric adjustment.
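The pre-registration guarantee reduces to an ordering check: the commit that introduced each rubric must strictly predate the first experimental run. A minimal sketch of that check in Python (the timestamps below are illustrative, not the repository's actual history):

```python
from datetime import datetime

def preregistered(rubric_commit_iso: str, first_run_iso: str) -> bool:
    """True if the rubric was committed strictly before the first run.

    Both arguments are ISO-8601 timestamps, e.g. from
    `git log --format=%cI` and the experiment's run log.
    """
    commit = datetime.fromisoformat(rubric_commit_iso)
    run = datetime.fromisoformat(first_run_iso)
    return commit < run

# Illustrative timestamps only (not real repository history):
assert preregistered("2025-06-01T09:00:00+00:00", "2025-06-15T12:00:00+00:00")
assert not preregistered("2025-06-20T09:00:00+00:00", "2025-06-15T12:00:00+00:00")
```

In practice the commit side of the comparison comes from the signed GitHub history, so neither timestamp can be adjusted after the fact.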


## Reproduce RX Yourself

```shell
git clone https://github.com/jghiringhelli/generative-specification
cd generative-specification/experiments/rx
docker compose up -d postgres
./runner/run.sh
cat evidence/jest-output.json   # numFailedTests === 0
```
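The final `cat` step eyeballs the Jest summary; the same check can be scripted. A minimal Python sketch that fails loudly if any test failed (the field names match Jest's `--json` reporter output; the sample payload here is illustrative):

```python
import json

def assert_all_passed(jest_json: str) -> int:
    """Parse Jest's --json reporter output; raise if any test failed,
    otherwise return the number of passing tests."""
    summary = json.loads(jest_json)
    failed = summary["numFailedTests"]
    if failed:
        raise SystemExit(f"{failed} test(s) failed")
    return summary["numPassedTests"]

# Illustrative payload in the shape of Jest's JSON reporter:
sample = '{"numFailedTests": 0, "numPassedTests": 104}'
print(assert_all_passed(sample))  # prints 104
```

Pointed at `evidence/jest-output.json`, a passing RX run should report 104 passed and zero failed.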
