# GS for Machine Learning

## Why This Domain Is Different
ML systems have non-deterministic outputs and data-dependent correctness. A model that scores 94% accuracy in development may score 71% on production data because of distribution shift, data leakage, or an incorrect train/test split. These are not caught by structural code testing.
The quality question for ML is not “does the code run” — it is “does the model generalize.”
## Relevant Tag
`ML` — activates ML-specific gates, verification strategies, and guidance.
## Active Quality Gates
| Gate | Why it matters here |
|---|---|
| coverage-threshold-80 | Pipeline logic (preprocessing, feature engineering) is testable |
| no-hardcoded-secrets | Model registry credentials, data warehouse keys |
| environment-variables-config | Model endpoints, evaluation thresholds, feature toggles |
| conventional-commits | Model versioning requires traceable experiment history |
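The environment-variables-config gate, for instance, keeps model endpoints and evaluation thresholds out of the source. A minimal sketch, assuming hypothetical variable names `MODEL_ENDPOINT` and `EVAL_F1_THRESHOLD` (neither is a name the gate itself mandates):

```python
import os

def load_eval_config() -> dict:
    """Read evaluation settings from the environment instead of hardcoding them.

    MODEL_ENDPOINT and EVAL_F1_THRESHOLD are illustrative variable names;
    the defaults below are placeholders for local development.
    """
    return {
        "endpoint": os.environ.get("MODEL_ENDPOINT", "http://localhost:8500"),
        "f1_threshold": float(os.environ.get("EVAL_F1_THRESHOLD", "0.85")),
    }
```

Because the threshold arrives as a string, the cast to `float` happens in one place, so a malformed value fails loudly at startup rather than silently during evaluation.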
## Domain-Specific Gates Needed (Contribution Targets)
| Gate ID (proposed) | GS Property | Description |
|---|---|---|
| model-evaluation-threshold | Verifiable | Held-out test set F1/accuracy/AUC must exceed spec threshold before merge |
| no-test-data-in-training | Defended | Test split indices must not overlap with training indices |
| dataset-version-pinned | Auditable | Training dataset must be referenced by commit SHA or versioned artifact ID |
| model-versioned | Auditable | Every model artifact is tagged with the code commit that produced it |
| no-pickle-untrusted-input | Defended | No pickle.load on user-supplied or network-sourced data |
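The no-test-data-in-training gate can be enforced with a small assertion run before any model fit. A minimal sketch — the function name and error message are illustrative, not a prescribed API:

```python
def assert_disjoint_splits(train_idx, test_idx):
    """Fail fast if any test-set row index also appears in the training set.

    Catches the most common form of data leakage: overlapping splits
    produced by a buggy or re-run train/test partition.
    """
    overlap = set(train_idx) & set(test_idx)
    if overlap:
        sample = sorted(overlap)[:5]
        raise AssertionError(
            f"{len(overlap)} rows leak from test into train, e.g. {sample}"
        )
```

Calling this in the training script (or as a pytest fixture) turns split overlap from a silent accuracy inflation into a hard build failure.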
## Verification Strategy
| What to verify | How |
|---|---|
| Model performance | Holdout test set (never seen during training) — metrics against spec thresholds |
| Data integrity | Assert no row overlap between train/val/test splits |
| Reproducibility | Given pinned dataset + seed, training produces the same metric ± epsilon |
| Distribution shift | Canary evaluation on a sample of production data before full rollout |
| Feature pipeline | Unit tests on pure transformation functions; property tests on invariants |
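The reproducibility row amounts to running the same seeded training twice on the pinned dataset and comparing metrics within epsilon. A toy sketch, where `train_and_score` is a hypothetical stand-in for a real training run:

```python
import random

def train_and_score(dataset, seed):
    """Toy stand-in for a seeded training run that returns one metric.

    A real implementation would fit a model; the point is only that
    the output is a deterministic function of (dataset, seed).
    """
    rng = random.Random(seed)
    # "Training" here is a seeded shuffle plus a score derived from the data.
    shuffled = sorted(dataset, key=lambda _: rng.random())
    return sum(shuffled[: len(shuffled) // 2]) / max(sum(dataset), 1)

def check_reproducible(dataset, seed, epsilon=1e-9):
    """Gate check: two identical runs must agree to within epsilon."""
    first = train_and_score(dataset, seed)
    second = train_and_score(dataset, seed)
    return abs(first - second) <= epsilon
```

For genuinely stochastic hardware (e.g. non-deterministic GPU kernels), epsilon is widened in the spec rather than in the check.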
## Spec Patterns
A GS PRD for an ML project must include:
- Evaluation metrics — which metrics, at what thresholds, on which dataset splits
- Dataset description — source, version, size, class distribution, known biases
- Acceptable drift thresholds — when does production performance trigger a retrain
- Feature contract — schema of input features, types, expected ranges, null policy
- Model explainability requirements — is a black-box acceptable, or is interpretability required
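The feature contract item can be checked mechanically at pipeline boundaries. A minimal sketch, where `validate_features` and the `CONTRACT` dict are hypothetical names and an example schema, not a required format:

```python
def validate_features(row: dict, contract: dict) -> list:
    """Return a list of contract violations for a single feature row."""
    errors = []
    for name, spec in contract.items():
        if name not in row:
            errors.append(f"missing feature: {name}")
            continue
        value = row[name]
        if value is None:
            # Null policy: only features marked nullable may be None.
            if not spec.get("nullable", False):
                errors.append(f"{name}: null not allowed")
            continue
        if not isinstance(value, spec["type"]):
            errors.append(f"{name}: expected {spec['type'].__name__}")
            continue
        lo, hi = spec.get("range", (None, None))
        if lo is not None and value < lo:
            errors.append(f"{name}: {value} below {lo}")
        if hi is not None and value > hi:
            errors.append(f"{name}: {value} above {hi}")
    return errors

# Example contract: types, expected ranges, and null policy in one place.
CONTRACT = {
    "age": {"type": int, "range": (0, 130), "nullable": False},
    "income": {"type": float, "range": (0.0, None), "nullable": True},
}
```

Running this on every batch entering training and serving keeps the two paths honest against the same contract, which is where many train/production skews originate.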
## Contribute a Gate
Most underrepresented GS property for ML: Verifiable (model evaluation is almost entirely ungated in the current library).
See CONTRIBUTING.md.