Ax is a controlled, multi-agent adversarial experiment measuring the effect of Generative Specification (GS) artifacts on AI-assisted software development quality. The study runs seven conditions — from a completely artifact-free naive baseline through five progressively richer treatment levels — all against the same benchmark task.
| Directory | Label | Description |
|---|---|---|
| naive/ | Naive | No prompting strategy, no GS artifacts. Raw capability baseline. |
| control/ | Expert-prompting control | Expert prompt engineering only, no GS artifacts. Isolates prompting skill. |
| treatment/ | Treatment v1 | GS artifact cascade (CLAUDE.md, ADRs, diagrams, schema). |
| treatment-v2/ | Treatment v2 | v1 + ForgeCraft pre-commit hooks and verification protocol. |
| treatment-v3/ | Treatment v3 | v2 + full ForgeCraft scaffold (Status.md, C4 diagrams, Mermaid flows). |
| treatment-v4/ | Treatment v4 | v3 + adversarial review agent (second model challenges each commit). |
| treatment-v5/ | Treatment v5 | v4 + multi-turn correction loop (agent self-repairs on reviewer feedback). |
The benchmark task is the RealWorld (Conduit) API, the standard fullstack benchmark used across the industry: https://github.com/realworld-apps/realworld
All seven conditions implement the same backend API from the same specification
(REALWORLD_API_SPEC.md). Evaluation is automated and human-reviewed against a shared
rubric (RESULTS.md).
Pre-registration commit hashes in the source repository
(github.com/jghiringhelli/forgecraft-mcp):
| Commit | Event |
|---|---|
| bd2c05b | Naive condition pre-registered |
| 7661e62 | Control condition pre-registered |
| 7e06e78 | Treatment v1 pre-registered |
| 6c24f6d | Treatment v2–v3 pre-registered |
| 482a111 | Treatment v4–v5 pre-registered |
To verify: clone the forgecraft-mcp repository and inspect each commit timestamp.
The implementation sessions for each condition began only after the corresponding
pre-registration commit. The supplement documents the full chain of custody.
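The verification steps above can be sketched as a short shell session (the repository URL and commit hashes are taken from the tables in this section; the loop simply prints each commit's hash, committer timestamp, and subject):

```shell
# Clone the source repository listed in the pre-registration section
git clone https://github.com/jghiringhelli/forgecraft-mcp.git
cd forgecraft-mcp

# Inspect the timestamp and subject of each pre-registration commit
for c in bd2c05b 7661e62 7e06e78 6c24f6d 482a111; do
  git show -s --format='%h  %cI  %s' "$c"
done
```

Each timestamp should predate the first implementation commit of the corresponding condition.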
Each condition directory contains an evaluation/ subdirectory with:
- evaluation/scores.md: per-endpoint pass/fail scores and aggregate totals
- evaluation/metrics.md: code quality metrics (coverage, lint, type errors, test counts)

Aggregate results across all conditions are in RESULTS.md.
The full experiment design, statistical analysis, and conclusions are in:
- white-paper/README.md: entry point
- white-paper/conditions.md: condition definitions and protocol
- white-paper/data.md: raw data tables
- white-paper/metrics.md: metric definitions
- white-paper/scores.md: scoring rubric
- white-paper/conclusions.md: findings and implications
- white-paper/gs-artifacts.md: GS artifact inventory
- white-paper/code-comparison.md: side-by-side code analysis