GS for Machine Learning

Why This Domain Is Different

ML systems have non-deterministic outputs and data-dependent correctness. A model that scores 94% accuracy in development may score 71% on production data because of distribution shift, data leakage, or an incorrect train/test split. Structural code testing catches none of these failures.

The quality question for ML is not “does the code run” — it is “does the model generalize.”
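The gap between development and production performance can be gated mechanically. A minimal sketch, assuming a simple comparison of a dev-holdout metric against a production-sample metric; the function name `generalization_check` and the `max_gap` threshold of 0.05 are illustrative, not part of GS:

```python
def generalization_check(dev_metric, prod_metric, max_gap=0.05):
    """Flag a generalization failure when the production metric falls
    more than max_gap below the development metric.
    max_gap is an illustrative default, not a GS-mandated value."""
    return (dev_metric - prod_metric) <= max_gap

# The 94% -> 71% drop described above would fail this check:
generalization_check(0.94, 0.71)   # gap of 0.23 exceeds 0.05
```

A check like this only answers "did the model generalize to this sample"; it does not diagnose whether the cause was shift, leakage, or a bad split.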

Relevant Tag

ML — activates ML-specific gates, verification strategies, and guidance.

Active Quality Gates

| Gate | Why it matters here |
| --- | --- |
| coverage-threshold-80 | Pipeline logic (preprocessing, feature engineering) is testable |
| no-hardcoded-secrets | Model registry credentials, data warehouse keys |
| environment-variables-config | Model endpoints, evaluation thresholds, feature toggles |
| conventional-commits | Model versioning requires traceable experiment history |

Domain-Specific Gates Needed (Contribution Targets)

| Gate ID (proposed) | GS Property | Description |
| --- | --- | --- |
| model-evaluation-threshold | Verifiable | Held-out test set F1/accuracy/AUC must exceed spec threshold before merge |
| no-test-data-in-training | Defended | Test split indices must not overlap with training indices |
| dataset-version-pinned | Auditable | Training dataset must be referenced by commit SHA or versioned artifact ID |
| model-versioned | Auditable | Every model artifact is tagged with the code commit that produced it |
| no-pickle-untrusted-input | Defended | No pickle.load on user-supplied or network-sourced data |
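Two of these proposed gates can be sketched as pytest-style checks. This is a hedged illustration, not the actual GS gate implementations; the function names `check_no_split_overlap` and `check_metric_threshold` are hypothetical:

```python
def check_no_split_overlap(train_idx, test_idx):
    """Sketch of no-test-data-in-training: fail when any sample
    index appears in both the training and test splits."""
    overlap = set(train_idx) & set(test_idx)
    if overlap:
        raise AssertionError(f"train/test overlap: {sorted(overlap)}")

def check_metric_threshold(metric_name, value, spec_threshold):
    """Sketch of model-evaluation-threshold: the held-out metric
    must meet the threshold declared in the spec before merge."""
    if value < spec_threshold:
        raise AssertionError(
            f"{metric_name}={value:.3f} below spec threshold {spec_threshold}"
        )
```

In CI, checks like these would run against the recorded split indices and the evaluation report, so a failing gate blocks the merge rather than surfacing after deployment.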

Verification Strategy

| What to verify | How |
| --- | --- |
| Model performance | Holdout test set (never seen during training) — metrics against spec thresholds |
| Data integrity | Assert no row overlap between train/val/test splits |
| Reproducibility | Given pinned dataset + seed, training produces the same metric ± epsilon |
| Distribution shift | Canary evaluation on a sample of production data before full rollout |
| Feature pipeline | Unit tests on pure transformation functions; property tests on invariants |
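The reproducibility row can be demonstrated with a toy stand-in for a training run. A minimal sketch, assuming the pipeline is fully seeded; `train_toy_model` is a hypothetical placeholder for a real training function, not a GS API:

```python
import random

def train_toy_model(seed):
    """Stand-in for a real training run: a fully seeded pipeline
    that returns a single evaluation metric."""
    rng = random.Random(seed)
    data = [rng.random() for _ in range(1000)]
    return sum(data) / len(data)  # pretend this is the eval metric

# Same pinned seed (and, in a real pipeline, same pinned dataset)
# must yield the same metric within epsilon.
m1 = train_toy_model(seed=42)
m2 = train_toy_model(seed=42)
assert abs(m1 - m2) < 1e-9
```

Real training rarely achieves bit-exact repeatability (GPU nondeterminism, parallel data loading), which is why the strategy above specifies "± epsilon" rather than strict equality.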

Spec Patterns

A GS PRD for an ML project must include:

  1. Evaluation metrics — which metrics, at what thresholds, on which dataset splits
  2. Dataset description — source, version, size, class distribution, known biases
  3. Acceptable drift thresholds — when does production performance trigger a retrain
  4. Feature contract — schema of input features, types, expected ranges, null policy
  5. Model explainability requirements — is a black-box acceptable, or is interpretability required
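The feature contract in item 4 is the most mechanically enforceable of these. A minimal sketch of what such a contract could look like in code; `FeatureSpec`, `CONTRACT`, `validate_row`, and the example features `age` and `income` are all illustrative assumptions, not a prescribed GS schema:

```python
from dataclasses import dataclass

@dataclass
class FeatureSpec:
    """One entry in the feature contract: name, type, range, null policy."""
    name: str
    dtype: type
    min_val: float
    max_val: float
    nullable: bool

# Hypothetical contract for a two-feature model.
CONTRACT = [
    FeatureSpec("age", int, 0, 120, nullable=False),
    FeatureSpec("income", float, 0.0, 1e7, nullable=True),
]

def validate_row(row):
    """Return a list of contract violations for one input row."""
    errors = []
    for spec in CONTRACT:
        val = row.get(spec.name)
        if val is None:
            if not spec.nullable:
                errors.append(f"{spec.name}: null not allowed")
            continue
        if not isinstance(val, spec.dtype):
            errors.append(f"{spec.name}: expected {spec.dtype.__name__}")
        elif not (spec.min_val <= val <= spec.max_val):
            errors.append(f"{spec.name}: {val} out of range")
    return errors
```

Validating rows against the contract at both training and serving time is what keeps the two pipelines from silently diverging, a common source of the dev/production gap described earlier.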

Contribute a Gate

Most underrepresented GS property for ML: Verifiable (model evaluation is almost entirely ungated in the current library).

See CONTRIBUTING.md.