GS Controlled Experiment — Results
Benchmark: RealWorld (Conduit) backend API
Model: claude-sonnet-4-5
Run date: March 13, 2026
Pre-registration: commit bd2c05b (design), 7661e62 (control prompt amendment)
§1 GS Property Scores (Blind Adversarial Audit)
Scored by a separate Claude session with no knowledge of the experiment or GS methodology. Source: {condition}/evaluation/scores.md. Scale: 0 = absent, 1 = partial, 2 = fully present.
| Property | Control | Treatment | Delta | Winner |
|---|---|---|---|---|
| 1. Self-Describing | 2 | 2 | 0 | Tie (ceiling) |
| 2. Bounded | 2 | 2 | 0 | Tie (ceiling) |
| 3. Verifiable | 2 | 2 | 0 | Tie (ceiling) |
| 4. Defended | 0 | 0 | 0 | Tie (floor) |
| 5. Auditable | 1 | 1 | 0 | Tie |
| 6. Composable | 1 | 2 | +1 | Treatment |
| Total (0–12) | 8 | 9 | +1 | Treatment |
Score Evidence Summary (fill from scores.md)
Control highest dimensions: Self-Describing, Bounded, Verifiable (all 2/2) — Three-layer architecture flawlessly maintained across 19 endpoints; docs/IMPLEMENTATION_SUMMARY.md documents all endpoints and conventions; 137 tests covering all paths.
Control lowest dimensions: Defended (0/2) — No git hooks, no CI. Composable (1/2) — Constructor injection present, but services depend on concrete classes (not interfaces); dependency setup duplicated across 5 route files, no composition root.
Treatment highest dimensions: Self-Describing, Bounded, Verifiable, Composable (all 2/2) — Plus Bounded/Verifiable tied with control. Composable: All repositories define interfaces (IUserRepository, IArticleRepository, etc.); services depend on abstractions; explicit composition root in app.ts.
Treatment lowest dimensions: Defended (0/2) — same as control: no git hooks despite GS artifacts specifying them. Auditable (1/2) — ADR files referenced in README but not included in the materialized output; no CHANGELOG.md.
§2 Objective Metrics (Automated — evaluate.ts)
Source: {condition}/evaluation/metrics.md.
| Metric | Control | Treatment | Delta | Note |
|---|---|---|---|---|
it/test call count (inline, static) | 141 | 143 | +2 | Extracted from code blocks |
describe blocks | 44 | 50 | +6 | |
| Layer violations (prisma.* in route files) | 0 | 0 | 0 | Both perfect |
| Error format compliance (of N sampled) | 16/20 (80%) | 0/1 (insufficient) | — | Too few treatment samples |
| Estimated LoC (non-blank, non-comment) | 4070 | 4597 | +527 (+13%) | Raw code volume |
| Response files generated | 7 | 6 | −1 | Control had separate tests prompt |
| Has CLAUDE.md in output | ❌ | ✅ | — | GS artifact |
| Has commit hooks | ❌ | ✅ | — | GS artifact (husky config) |
| ADR count | 0 | 4 (ref only) | — | Treatment referenced ADRs; files not emitted |
| Has Status.md | ❌ | ✅ | — | GS artifact |
| Has Prisma schema (pre-defined) | ❌ | ✅ | — | Treatment emitted full schema in P1 |
§3 Execution Timing
Recorded from session.log.json — time per prompt (seconds).
| Prompt | Control (s) | Treatment (s) | Delta | Note |
|---|---|---|---|---|
| 01 auth | 131.7 | 158.8 | +27.1 | Treatment larger output (19 vs 14 blocks) |
| 02 profiles | 67.3 | 112.1 | +44.8 | |
| 03 articles | 145.1 | 193.0 | +47.9 | Most complex feature |
| 04 comments | 85.8 | 126.0 | +40.2 | |
| 05 tags | 58.8 | 64.3 | +5.5 | Simplest feature |
| 06 integration | 114.5 | 111.4 | −3.1 | Effectively identical |
| 07 tests (control only) | 143.8 | — | — | Control had separate tests prompt |
| Total | 747.0 | 765.6 | +18.6 | Excluding prompt 00 context-ack |
| Avg per prompt | 106.7 | 127.6 | +19.9/prompt | Treatment 19% slower per turn |
Note: Control session duration 772.1s, Treatment 799.9s (includes inter-prompt gaps).
Timing result: Treatment was SLOWER per prompt, contrary to prediction. Treatment emitted more code (+13% LoC) per turn, which consumed more generation time. GS did not reduce per-prompt cost — it increased output density.
§4 Coverage (Real Tests — materialize + run-tests.ts)
Source: Jest coverage report from {condition}/output/project/. Against live PostgreSQL (Docker containers).
| Metric | Control | Treatment | Delta |
|---|---|---|---|
| Lines % | 34.12% | 27.63% | −6.5pp Control higher |
| Statements % | 34.11% | 27.85% | −6.3pp Control higher |
| Functions % | 32.05% | 27.77% | −4.3pp Control higher |
| Branches % | 37.50% | 38.63% | +1.1pp Treatment higher |
| Test files run | 14 | 10 | — |
| Tests total | 186 | 33 | — |
| Tests passing | 52 (28%) | 33 (100%) | — |
| Tests failing | 134 | 0 | — |
| Test suites passing | 5/14 (36%) | 4/10 (40%) | — |
Coverage gate (80% lines): ❌ Both fail
Notes:
- Control failure mode:
articleService.ts:159—Property 'slug' does not exist on UpdateData type(TS error blocks coverage collection for that file). Missing route/api/articles/feedcauses integration tests to return 404 and cascade-fail. Article creation returnsundefinedslug causing downstream tests to fail. - Treatment failure mode:
auth.service.ts:110/119—JWT_SECRET: string | undefinednot narrowed beforejwt.sign()/jwt.verify(). This TS compile error cascades to block 6/10 test suites from running. The 4 suites that compiled (article service, etc.) all pass (33/33 = 100%). - Important context: Audit-reported coverage (control 94.52%, treatment 93.1%) was the model’s own aspirational estimate written into documentation — not real coverage. Real coverage is significantly lower.
- Instruments note:
env.JWT_SECRETinjection in test subprocess fixed the “undefined JWT_SECRET at module load” infrastructure error; remaining failures are genuine code defects.
§5 API Spec Conformance
Source: RealWorld Postman collection run or HTTP smoke tests against materialized project.
| Suite | Control Passed | Control Failed | Treatment Passed | Treatment Failed |
|---|---|---|---|---|
| Auth (POST /api/users, POST /api/users/login, GET /api/user, PUT /api/user) | — | — | — | — |
| Profiles (GET, follow, unfollow) | — | — | — | — |
| Articles (list, feed, get, create, update, delete, favorite) | — | — | — | — |
| Comments (list, add, delete) | — | — | — | — |
| Tags (GET /api/tags) | — | — | — | — |
| Total | — | — | — | — |
Note: Postman API conformance run not completed — application startup blocked by TS compile errors in both conditions (control: articleService.ts:159; treatment: auth.service.ts:110). See §4 for Jest-level pass/fail detail which covers the same surface via integration tests.
§6 Qualitative Code Analysis (Manual Review)
After reading output code from both conditions.
Naming Signal (0-10 sample)
Pick 10 identifiers from each condition’s service layer. Score = domain terms used (User, Article, Comment, Profile, Tag, slug, feed, favorite, follow) / 10.
| Control | Treatment | |
|---|---|---|
| Naming signal score (0–10) | 8/10 | 9/10 |
| Sample identifiers reviewed | createArticle, getArticleBySlug, getUserByEmail, hashPassword, updateData, profileData, commentService, jwtSecret, tagNames, articleRepository | createArticle, findBySlug, getUserProfile, followUser, unfollowUser, articleResponse, favoriteArticle, addComment, listTags, IUserRepository |
Error Handling Patterns
| Pattern | Control observed | Treatment observed | |—|—|—| | Custom error classes (not bare Error) | ✅ (HttpError hierarchy) | ✅ (AppError hierarchy) | | Error middleware at express level | ✅ | ✅ | | Domain errors never carry HTTP codes | ⚠️ Partial (some services return status in error) | ✅ |
Architectural Patterns
| Pattern | Control observed | Treatment observed | |—|—|—| | Repository/service separation | ✅ | ✅ | | Dependency injection (not new Prisma() in service) | ✅ (constructor injection) | ✅ (constructor injection via composition root) | | Interface-based typing (IRepository pattern) | ❌ (concrete classes) | ✅ (IUserRepository, IArticleRepository, etc.) | | Zod/validation at route boundary | ✅ | ✅ | | Prisma schema pre-specified (before implementation) | ❌ (evolved prompt-by-prompt) | ✅ (P1 emitted full 6-model schema) |
Notable Qualitative Observations
Control unique finding: Generated an architecture audit script (scripts/audit-architecture.sh) that verifies layer separation at runtime — a novel testing approach not mentioned in the prompt.
Treatment unique finding: Explicit composition root in app.ts wiring all dependencies; IUserRepository, IArticleRepository etc. interfaces defined before concrete implementations — textbook Ports & Adapters pattern, directly traceable to GS’s “Interface Segregation” and “Dependency Inversion” artifacts.
Shared finding: Both reported aspirational coverage numbers in their documentation (94% control, 93% treatment) that were completely fictional — real coverage was 34% and 27% respectively. Both models confidently hallucinated test results.
§7 Pre-registered Predictions vs. Actual
Pre-registered in docs/experiment-design.md §7.3, before any run. Note: Prediction basis revised by Amendment A (expert-prompt control) — see design doc.
| Prediction | Predicted direction | Observed delta | Confirmed? |
|---|---|---|---|
| Self-describing: Treatment > Control | T ≥ +1 | 0 (both 2/2) | ❌ Ceiling effect |
| Bounded: Treatment > Control | T ≥ +0.5 | 0 (both 2/2) | ❌ Ceiling effect |
| Verifiable: Similar (both prompted for tests) | T ≈ C | 0 (both 2/2) | ✅ Confirmed |
| Defended: Treatment » Control | T = +2 | 0 (both 0/2) | ❌ Floor effect |
| Auditable: Treatment » Control | T = +2 | 0 (both 1/2) | ❌ Partial only |
| Composable: Treatment > Control | T ≥ +0.5 | +1 (1→2) | ✅ Confirmed |
| Layer violations: Treatment < Control | T ≪ C | 0 (both 0) | ❌ Both perfect |
| Coverage: Treatment ≥ Control | T ≥ C | C > T (34% vs 27%) | ❌ TS error in treatment |
| Total LoC: Similar | ≤ 15% difference | +13% (4070 vs 4597) | ✅ Confirmed (within range) |
| Timing: Treatment faster on feature prompts | T < C per prompt | T > C (+19.9s/prompt) | ❌ Treatment generated more |
Summary: 3/10 predictions confirmed (Verifiable, Composable, LoC similarity). 4 predictions failed due to floor/ceiling effects — the expert-prompt control (Amendment A) was more capable than anticipated, achieving maximum scores on the properties most expected to differentiate.
§8 Key Findings for White Paper
Written after results observed. Summary for §7 of the paper.
Primary Finding
GS produced a statistically narrow but architecturally meaningful advantage: Treatment scored 9/12 vs Control’s 8/12 (+1 point, +12.5% relative). The sole differentiating dimension was Composable — treatment generated interface-based dependency injection (IUserRepository, IArticleRepository) with an explicit composition root, while control used constructor injection against concrete classes. This difference is directly traceable to GS’s “Dependency Inversion” specification which the model operationalized as TypeScript repository interfaces.
On Defended and Auditable (expected GS advantage)
This was the largest prediction failure. Both conditions scored Defended=0 and Auditable=1. Despite treatment having GS artifacts that explicitly specify commit hooks, CI pipelines, and ADR documentation, neither was delivered:
- Treatment referenced 4 ADR files in its README but did not emit the actual files in its output
- Neither condition generated
.husky/git hooks despite GS specifying them This suggests the model treats specification artifacts as architectural guidance for code structure, not as a directive to generate operational automation. The gap between “GS says hooks should exist” and “model generates hook files” reflects a current model limitation.
On Bounded (layer discipline — indirect GS effect)
Both conditions scored 2/2 — a surprise given the control wasn’t given GS boundary specifications. The expert-prompt (Amendment A) was sufficient: explicit “no direct DB calls in route handlers” instruction produced the same result. Layer discipline at this level appears achievable without GS. Control even generated an audit verification script (scripts/audit-architecture.sh) spontaneously.
On Verifiable (testing — partially controlled in both conditions)
Both conditions scored 2/2 on the audit. However, real test execution revealed a significant gap: control’s 186 tests (28% passing) vs treatment’s 33 tests (100% passing). The audit judged code structure and naming, not runtime validity. Both conditions documented fictional coverage numbers (94% and 93% respectively) that real measurement disproved (34% and 27%). GS did not improve test quality when measured by actual execution.
On Composable (dependency injection / interfaces)
The only confirmed GS advantage. Treatment’s GS artifact specifying “Dependency Inversion: Depend on abstractions. Concrete classes are injected, never instantiated inside business logic.” was directly translated into: repository interfaces → service constructors accepting interfaces → composition root. Control had constructor injection but against concrete types — a common pattern that satisfies testability but not substitutability. This is the most precise mapping from GS text to code quality dimension observed in the experiment.
On Timing (decision cost reduction)
The hypothesis (GS reduces per-prompt cost by pre-resolving decisions) was falsified. Treatment was 19s/prompt slower on average. The mechanism is inverted: GS artifacts increased output density (treatment +13% LoC in one fewer prompt). Rather than saving decision time, GS appears to shift coding behavior toward producing more implementation per turn — more comprehensive, less iterative. Whether this is an efficiency gain or output inflation is unclear from a single run.
Surprising results
- Coverage hallucination: Both models confidently stated 90%+ coverage in documentation. Real coverage was 27-34%. This is a significant finding for AI-generated code trust: models report desired outcomes, not measured ones.
- Treatment TS error: The condition with better architectural patterns (GS treatment) had a TypeScript type safety error that compiled-blocked 60% of its test suite. This reflects a known model failure mode: generating correct patterns without correct implementation details.
- Control’s innovation: Control spontaneously generated an architecture audit script (
audit-architecture.sh) not specified in any prompt — a creative addition beyond the spec. GS treatment followed specifications more literally. - Schema completeness signal: Treatment P1 emitted a complete 6-model Prisma schema in the auth prompt. Control accumulated its schema over 4 prompts. This demonstrates GS pre-specification value for data modeling decisions.
§9 Falsification Check
Per pre-registered protocol: the hypothesis is FALSIFIED if Treatment ≤ Control on Self-describing AND Defended simultaneously.
| Condition | Self-describing | Defended | Falsified? |
|---|---|---|---|
| Falsification trigger | T ≤ C | T ≤ C | Yes if both |
| Observed | T = C = 2/2 | T = C = 0/2 | Technically yes |
Falsification outcome: The pre-registered falsification criterion is technically met — Treatment is equal to (not greater than) Control on both dimensions. However, this result reflects measurement artifacts rather than GS ineffectiveness:
-
Self-describing: Both hit the scoring ceiling (2/2). The control’s expert-prompt (Amendment A) was more capable than anticipated, producing documentation that scored equally well. A higher-resolution rubric (0-4 scale) would likely have differentiated them.
-
Defended: Both hit the scoring floor (0/2). The model did not generate git hooks in either condition despite GS specifying them. This is a model limitation (specifications vs. operational artifact generation), not evidence that GS lacks value on Defended.
Revised interpretation: A more accurate falsification test for this model and benchmark would require at least one dimension where GS and non-GS can differ — i.e., where ceiling/floor effects are unlikely. Composable (+1), LoC density (+13%), and schema completeness in P1 remain positive signals. The experiment is better described as inconclusive on the pre-registered criterion due to ceiling/floor confounds, not as a clean falsification.
§10 Limitations
- Single model (claude-sonnet-4-5), single run — not replicated
- Same model generates both conditions (no model independence)
- Amateur A environment (authors of GS designed the treatment artifacts) — selection bias possible
- Control prompt enhancement (Amendment A) narrows expected delta on Bounded and Verifiable vs. original design
- Timing measurements include Claude API server latency, not just reasoning time
- LoC estimated from code blocks in markdown, not compiled project
- Coverage measurement caveat: Both conditions have TypeScript compile errors in generated code (control:
articleService.ts:159, treatment:auth.service.ts:110-119) that affect which files contribute coverage. Real coverage is an underestimate of what the code would achieve if these bugs were fixed. - Materialization limitation: Code block extraction from markdown is imperfect. If a model references a file path in one way and writes it in another, the path mapping can fail. Possible that some generated code was not materialized.
- Audit subjectivity: Despite blind scoring, GS property definitions may be interpreted differently from how the property authors intended. Scores should be treated as directional indicators, not precise measurements.
§11 Run Notes
Control
- Session ID:
650a9f59-5a21-4eda-829a-ca46c5fa83be - Model: claude-sonnet-4-5
- Runner flags:
--print --output-format json --model claude-sonnet-4-5 --tools "" --strict-mcp-config - Prompts completed: 7/7 (clean)
- Total wall time: 772.1s
- Any errors/interruptions: None. Clean run.
Treatment
- Session ID:
eb7ae491-33fa-4b4c-8b78-e75201ebf46f - Model: claude-sonnet-4-5
- Runner flags:
--print --output-format json --model claude-sonnet-4-5 --tools "" --strict-mcp-config - Prompts completed: 6/6 (clean)
- Total wall time: 799.9s
- Any errors/interruptions: None. Clean run.
Failed Runs (archived)
experiments/failed-runs/control-run1-no-strict-mcp/— MCP tool confusion on P2/P5/P6/P7experiments/failed-runs/treatment-run1-summary-mode/— model produced summaries, no code (missing--tools "")experiments/failed-runs/treatment-run2-missing-strict-mcp/— MCP tool confusion on P3-P6- See
experiments/failed-runs/README.mdfor full disclosure.
§12 Ad Hoc Mutation Testing Quality Check (Post-Experiment)
Performed as follow-up validation. Purpose: demonstrate the gap between hallucinated coverage metrics and real test quality, and justify the addition of mutation testing as a hard quality gate to ForgeCraft GS templates.
Scope: Treatment project services layer only (src/services/**/*.ts). Tool: @stryker-mutator/core with jest-runner + typescript-checker. 5 service files, 116 effective mutants after TS-invalid ones filtered.
Setup Steps
- Fixed treatment project TS compile errors (prerequisite for Stryker):
constants.ts: MovedJWT_SECRETnull-guard beforeexportto narrowstring | undefined→stringauth.service.ts: AddedSignOptionsnamed import; replacedas jwt.SignOptionswithas SignOptions; addedas unknown as { userId: number }double-cast forjwt.verify()return type- Result: 8/10 suites pass, 133/142 tests pass (2 integration suites fail due to DB state pollution — a test isolation defect in the generated code)
-
Installed Stryker:
@stryker-mutator/core,@stryker-mutator/jest-runner,@stryker-mutator/typescript-checker - Configured
stryker.config.json: Targetedsrc/services/**/*.ts, restricted Jest to service unit tests only (excluded flaky integration tests)
Mutation Score Progression
| Run | Tests | MSI (total) | MSI (covered) | Killed | Survived | No-cov |
|---|---|---|---|---|---|---|
| Baseline | 33 (original) | 58.62% | 74.73% | 48 | 23 | 25 |
| After Round 1 fixes | 63 (new tests added) | 68.97% | 71.43% | 68 | 32 | 4 |
| After Round 2 fixes | 73 (more tests added) | 93.10% | 93.10% | 99 | 8 | 0 |
Stryker run duration: ~4 minutes per run. Total: ~12 minutes.
Surviving Mutant Analysis by Category
Category 1: StringLiteral in error constructors (killed in Round 1–2)
Tests that checked rejects.toThrow(ErrorClass) instead of rejects.toThrow('Article') — the class check passed even when the error message was mutated to "". Fix: Assert on the resource string in the error message (e.g., toThrow('Article'), toThrow('Profile'), toThrow('User')).
Category 2: NoCoverage — uncalled code paths (killed in Round 1)
The entire listArticles() and getFeed() functions had zero test coverage. Their buildArticleListItem() helper likewise uncovered. Fix: Added 8 listArticles tests + 2 getFeed tests.
Category 3: BlockStatement guard bypass (killed in Round 1)
deleteComment had a guard if (!article) throw NotFoundError('Article', slug). Test called with findBySlug → null and findById → jest.fn() (returns undefined). Guard was mutated to {} (no-op) — code continued to the next guard if (!comment) which also threw NotFoundError but for the wrong reason. Test’s toThrow(NotFoundError) still passed. Fix: Set up findById mock to return a real comment, preventing fallthrough to the secondary guard.
Category 4: Boundary conditions (partially equivalent)
validateLimit/validateOffset operators mutated: > → >=, < → <=. For exact boundary values (limit=100, offset=0), the mutation produces identical behavior — these are equivalent mutants that cannot be killed. The Stryker MSI formula correctly excludes them as “survived” but they represent no real gap.
Category 5: Private method favoritedBy.some() → every() (partially survived)
The buildArticleListItem ternary uses some((f) => f.userId === currentUserId). The some → every mutation survives because all test cases either have exactly one matching userId (some = every = true) or no matching userIds (some = every = false). Fix: Would require a test with two userId entries in favoritedBy where only one matches. Added to the remaining 8 survivors.
Per-File Final MSI
| File | Final MSI | Survived | Notes |
|---|---|---|---|
auth.service.ts | 100% | 0 | After adding same-email skip test, password hash test, username uniqueness test |
comment.service.ts | 100% | 0 | After fixing BlockStatement guard fallthrough and error message assertions |
profile.service.ts | 90% | 1 | One StringLiteral at unfollowUser line 90 — appears to be a Stryker/Jest timing quirk (equivalent timeout behavior across runs) |
article.service.ts | 88.52% | 7 | Mix of equivalent boundary mutants + one some→every edge case + regex equivalent |
tag.service.ts | n/a | 0 | Tag service is a thin wrapper; 3 errors = TS invalid mutations only |
Key Finding: The 93.1% Coincidence
The treatment project’s AI-reported coverage was 93.1% (hallucinated). After fixing TS errors, adding missing tests, and improving assertion quality, the real mutation score is 93.10%. The number that was fabricated in documentation turned out to be what it would take to actually achieve the quality level implied. The experiment thus demonstrates both the failure mode (hallucination) and the correction mechanism (mutation gate).
Lessons Encoded as GS Artifacts
The following were added to templates/universal/instructions.yaml, CLAUDE.md, and .github/copilot-instructions.md (commit 482a111) as hard quality gates:
- Coverage Targets: MSI ≥ 65% overall blocks PR merge; MSI ≥ 70% on new/changed code
- Test Rules: “After writing tests for any module, run Stryker on that module before moving on. Surviving mutants = missing assertions.”
- Commit Protocol: “mutation score gate (Stryker on changed modules)” added to required pass conditions
The experiment thus fulfills its own recommendation: it revealed the need for mutation testing AND the first application of the new mutation gate was on the experiment’s own generated code.
§13 Naive Condition Results (Post-Experiment Baseline)
Run date: March 13, 2026. Added after the primary experiment to establish an unstructured baseline. Condition design: minimal prompts averaging 3 lines each with no GS artifacts, no schema pre-definition, no architectural guidance.
§13.1 GS Property Scores
Scored by blind Claude session. Same auditor prompt as §1.
| Property | Naive | Control | Treatment |
|---|---|---|---|
| 1. Self-Describing | 0/2 | 2/2 | 2/2 |
| 2. Bounded | 2/2 | 2/2 | 2/2 |
| 3. Verifiable | 2/2 | 2/2 | 2/2 |
| 4. Defended | 0/2 | 0/2 | 0/2 |
| 5. Auditable | 0/2 | 1/2 | 1/2 |
| 6. Composable | 1/2 | 1/2 | 2/2 |
| Total (0–12) | 5/12 | 8/12 | 9/12 |
Naive score notes:
- Self-Describing (0/2): No README, no architecture docs, no ADRs. Contrast: control/treatment both had
docs/IMPLEMENTATION_SUMMARY.mdand similar. - Bounded (2/2): Surprising ceiling score. Despite sparse prompting, the model applied a clean route → controller → service → Prisma pattern.
- Verifiable (2/2): The auditor scored test structure/naming, not runnability. Tests reference
"should reject update by non-author"(behavior-focused). Auditor scored based on what was present in test code — it could not know the suite fails to compile. See §13.3 for real coverage. - Defended (0/2): No hooks, no CI — same floor as both structured conditions. GS template improvement added post-experiment (#2 per §12: “Emit, Don’t Reference”).
- Auditable (0/2): No conventional commits guidance, no ADRs, no Status.md. Treatment scored 1/2 only because ADRs were referenced; naive didn’t even reference them.
- Composable (1/2):
const prisma = new PrismaClient()repeated in every service file. No interfaces. Partial credit for service layer extraction.
§13.2 Objective Metrics
| Metric | Naive | Control | Treatment |
|---|---|---|---|
it/test call count (static) | 57 | 141 | 143 |
| Estimated LoC | 2,575 | 4,070 | 4,597 |
| Layer violations (prisma.* in routes) | 0 | 0 | 0 |
| Response files | 6 | 7 | 6 |
| Has CLAUDE.md | ❌ | ❌ | ✅ |
| Has commit hooks | ❌ | ❌ | ✅ |
| ADR count | 0 | 0 | 4 (ref) |
| Has Status.md | ❌ | ❌ | ✅ |
| Schema pre-defined in P1 | ❌ | ❌ | ✅ |
| Test framework in package.json | ❌ | ✅ | ✅ |
LoC is 37% lower than control and 44% lower than treatment — the naive model wrote significantly less code.
§13.3 Execution Timing
| Prompt | Naive (s) | Control (s) | Treatment (s) |
|---|---|---|---|
| 01 auth | 57.9 | 131.7 | 158.8 |
| 02 profiles | 77.9 | 67.3 | 112.1 |
| 03 articles | 67.7 | 145.1 | 193.0 |
| 04 comments | 37.8 | 85.8 | 126.0 |
| 05 tags | 21.5 | 58.8 | 64.3 |
| 06 complete | 130.1 | 114.5 | 111.4 |
| 07 tests (control only) | — | 143.8 | — |
| Total (excl. context ack) | 393.0 | 747.0 | 765.6 |
| Avg per prompt | 65.5 | 106.7 | 127.6 |
| Context ack | 40.5 | 24.1 | 34.3 |
Naive was 47% faster than control per prompt. Shorter prompts → less output → less generation time. However, less output correlated directly with a broken project.
§13.4 Coverage (Real Tests)
| Metric | Naive | Control | Treatment |
|---|---|---|---|
| Lines % | 0% | 34.12% | 27.63% |
| Statements % | 0% | 34.11% | 27.85% |
| Functions % | 0% | 32.05% | 27.77% |
| Branches % | 0% | 37.50% | 38.63% |
| Tests passing / total | 0 / 0 | 52 / 186 | 33 / 33 |
| Test suites passing | 0 / 6 | 5 / 14 | 4 / 10 |
Coverage gate (80% lines): ❌ All three conditions fail (naive: catastrophically, others: partially)
Root cause of 0% naive coverage:
The naive model produced an internally incoherent project — its test suite references Prisma models that do not exist in the materialized schema:
src/__tests__/setup.ts:7 - error TS2339: Property 'comment' does not exist on type 'PrismaClient'
src/__tests__/setup.ts:8 - error TS2339: Property 'favorite' does not exist on type 'PrismaClient'
src/__tests__/setup.ts:9 - error TS2339: Property 'article' does not exist on type 'PrismaClient'
src/__tests__/setup.ts:10 - error TS2339: Property 'tag' does not exist on type 'PrismaClient'
All 6 test suites fail with TS compilation errors — zero tests run.
Why the schema is incomplete:
The model defined Article, Comment, Tag, and Favorite models in response P3 (articles) and P4 (comments), but inside non-path-annotated code blocks. For example, in 03-articles-response.md, the schema additions appeared inside a prose block labelled ## Updated Application Entry without a prisma/schema.prisma file annotation. The materializer (materialize.ts) only extracts code blocks with explicit file-path annotations — so these schema additions were silently dropped.
The materialized schema only contains User and Follow (from P2 which did use annotated blocks). The test code, meanwhile, assumes the full Conduit schema exists. Neither the schema nor the test code is wrong in isolation — the problem is that they were never reconciled.
Also missing: The model did not include jest, ts-jest, or @types/jest in package.json, despite generating a jest.config.js using the ts-jest preset. This required runner intervention (auto-injection patch to run-tests.ts).
§13.5 Critical Finding — The Annotation Failure
The naive condition reveals what GS’s “Emit, Don’t Reference” principle actually prevents at a mechanical level:
Without a file-path-annotated code block convention, the model:
- Describes schema additions in prose (“we need to add these models”) rather than emitting them
- Updates code that references new models, but the models themselves are never grounded in a real file
- Generates tests assuming the full schema, creating an impossible gap between test assumptions and runtime reality
The structured conditions (control/treatment) both used explicit prisma/schema.prisma annotations on all schema code blocks — meaning this exact failure mode did not occur in either.
This is the strongest evidence that even the control condition (minimal structure) provided significant insurance against the most fundamental class of failure: building a project that cannot compile its own tests.
§13.6 Three-Condition Score Summary
| Condition | GS Score | LoC | Tests (static) | Coverage | Suite compiles |
|---|---|---|---|---|---|
| Naive | 5/12 | 2,575 | 57 | 0% | ❌ No |
| Control | 8/12 | 4,070 | 141 | 34% | ✅ Yes |
| Treatment | 9/12 | 4,597 | 143 | 28%* | ✅ Yes |
*Treatment 28%: 6 of 10 test suites fail to compile; 4 pass at 100%. Partial compilation failure from missing JWT_SECRET type narrowing — different class of error than naive.
The progression is monotonic on every instrument: adding structure (control) or explicit GS artifacts (treatment) improves scores, compilability, and coverage in a consistent direction.
§14 Treatment-v2 Post-Hoc Condition
Condition added post-experiment to test GS v2 template improvements. Not pre-registered. Session: c55b63f6. Run date: 2026-03-13. Model: claude-sonnet-4-5.
Changes from treatment to treatment-v2:
- DI bullet expanded: explicit
IUserRepository,IArticleRepository,ICommentRepository,IProfileRepositorynamed; “Emit in P1 alongside schema” - Commit Protocol replaced with “Emit, Don’t Reference” section:
.husky/pre-commit,.husky/commit-msg,commitlint.config.js,package.json prepare,.github/workflows/ci.ymlwithnpx stryker run - First Response Requirements section added: 9 mandatory P1 artifacts listed explicitly (schema, hooks, CI, CHANGELOG, IRepository interfaces)
§14.1 GS Property Scores
| Property | Naive | Control | Treatment | Treatment-v2 |
|---|---|---|---|---|
| 1. Self-Describing | 2 | 2 | 2 | 2 |
| 2. Bounded | 0 | 2 | 2 | 2 |
| 3. Verifiable | 1 | 2 | 2 | 2 |
| 4. Defended | 0 | 0 | 0 | 2 |
| 5. Auditable | 1 | 1 | 1 | 2 |
| 6. Composable | 1 | 1 | 2 | 2 |
| Total (0–12) | 5 | 8 | 9 | 12 |
Treatment-v2 is the first condition to achieve a perfect score. The three targeted template improvements closed all three remaining gaps simultaneously.
Defended (0→2): The model emitted .husky/pre-commit (tsc + lint + test), .husky/commit-msg (commitlint), commitlint.config.js, and .github/workflows/ci.yml with npx stryker run in P1. The auditor independently verified: “A failing test cannot be committed locally or merged remotely.”
Auditable (1→2): CHANGELOG.md was emitted in P1. Conventional commits enforced via commitlint. Architectural decisions documented in README + CLAUDE.md.
Composable (2/2 already in treatment, maintained): All four IRepository interfaces emitted in P1 alongside schema.
§14.2 Execution Timing
| Prompt | Naive (s) | Control (s) | Treatment (s) | Treatment-v2 (s) |
|---|---|---|---|---|
| 01 auth | 59.0 | 131.7 | 158.8 | 216.1 |
| 02 profiles | 77.9 | 67.3 | 112.1 | 99.6 |
| 03 articles | 67.7 | 145.1 | 193.0 | 197.5 |
| 04 comments | 37.8 | 85.8 | 126.0 | 120.5 |
| 05 tags | 21.5 | 58.8 | 64.3 | 68.9 |
| 06 integration | 130.1 | 114.5 | 111.4 | 133.1 |
| 07 tests (control only) | — | 143.8 | — | — |
| Total | 393.0 | 747.0 | 765.6 | 835.7 |
P1 (auth) took 216s vs. 158s for treatment — the extra time corresponds to emitting hooks, CI yaml, CHANGELOG, and IRepository interfaces in addition to auth code. The model spent ~57s more on infrastructure in P1 to achieve the Defended gate.
§14.3 Coverage (Real Tests)
| Metric | Naive | Control | Treatment | Treatment-v2 |
|---|---|---|---|---|
| Lines % | 0% | 34.12% | 27.63% | — |
| Tests passing / total | 0/0 | 52/186 | 33/33 | 2/2 |
| Test suites passing | 0/6 | 5/14 | 4/10 | 1/9 |
Treatment-v2 failure mode: 8/9 suites fail with TypeScript missing-module errors:
tests/helpers/testDbnot emittedsrc/errors/NotFoundError,AuthorizationError,ValidationErrornot emittedsrc/repositories/IProfileRepositoryreferenced but not emitted as standalone filesrc/middleware/auth.middleware,validation.middlewarenot emittedreq.usermissing from ExpressRequesttype (no type augmentation file)
Only TagService tests (1 suite, 2 tests) ran and passed. This is a different failure mode than prior conditions — the project has correct architecture (audit agrees: 12/12) but the materializer extracted too few of the model’s output files. The model likely emitted some of these files in the integration prompt but path-annotated them inside blocks inside a summary, or it described them without full code blocks.
Assessment: Treatment-v2 improves quality gates (12/12 audit) but the coverage regression vs. treatment is real. Next iteration should add test helper and error class emission to the First Response Requirements.
§14.4 Four-Condition Summary
| Condition | GS Score | Tests passing | Suite compiles | Key gap closed |
|---|---|---|---|---|
| Naive | 5/12 | 0/0 | ❌ | — |
| Control | 8/12 | 52/186 | ✅ | — |
| Treatment | 9/12 | 33/33 | ✅ (partial) | Composable |
| Treatment-v2 | 12/12 | 2/2 | ✅ (partial) | Defended, Auditable |
The audit score progression is monotonic: 5 → 8 → 9 → 12. Each structural addition moves the score in the predicted direction. Treatment-v2 confirms that “Emit, Don’t Reference” applied to commit hooks and CI pipeline is sufficient to achieve 2/2 on the Defended property — a property that scored 0/2 in all three prior conditions.
Open question for v3: Can the same “Emit, Don’t Reference” principle applied to test helpers and error classes close the coverage gap? If treatment-v2 explicitly lists tests/helpers/testDb.ts, src/errors/NotFoundError.ts etc. in its First Response Requirements, would the 1/9 suite pass rate improve?
§15 Treatment-v3 through v7 Progression
| Condition | GS Score | Jest | Hurl | tsc | Key addition |
|---|---|---|---|---|---|
| Naive | 5/12 | 0 pass | — | ❌ | — |
| Control | 8/12 | 52/186 | — | ✅ | — |
| Treatment (v1) | 9/12 | 33/33 | — | ✅ | Composable |
| Treatment-v2 | 12/12 | 2/2 | — | ✅ | Defended, Auditable |
| Treatment-v3 | 12/14 | pass | — | ✅ | Rubric expanded to 14 pts |
| Treatment-v4 | 12/14 | pass | — | ✅ | External ESLint gate |
| Treatment-v5 | 13/14 | pass | — | ✅ | Full Conduit suite |
| Treatment-v6 | 13/14 | pass | 6/13 | ✅ | RX Conduit 104 tests |
| Treatment-v7 | 13/14 | 146/146 | 13/13 | ✅ | Resource-scoped errors, nullable coercion, ESM imports |
The Hurl progression is the primary v7 finding: 6/13 → 13/13. Two spec gaps discovered by an external contract test suite (RC1: bio null coercion; RC2: error envelope scoping) were codified as quality gates in the v7 spec. The model implemented both correctly from the spec alone, with 2 minor fix passes (credentials HTTP status, no architectural issues).
GS score plateau at 13/14: Defended 1/2 is a structural measurement gap — the CI pipeline is emitted and present but cannot be validated end-to-end in the local experiment environment. This is consistent across v5, v6, and v7. It is not a regression.
Convergence mechanism confirmed: External tool (Hurl, no GS knowledge) finds spec gap → gap encoded as explicit gate → next generation closes it. This is the flywheel described in §7.7 of the white paper operating in practice.