49 rules · 28 rubrics
distilled from 53 craft samples.
Principle distribution
49 RULES · 8 CATEGORIES
By category
Each principle is one falsifiable rule. Categories grew organically as bad-case patterns repeated; the anti-pattern bucket appeared only after iteration 02.
Rubric composition
28 rubrics, with weight distributed across 3 axes. The Taste axis carries 2× weight despite having fewer items, because taste is where reward-hacking shows up.
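A minimal sketch of what "2× weight despite fewer items" implies per rubric. Only the Taste axis, its 2× weight, and the total of 28 rubrics come from the text; the other two axis names and all per-axis item counts are hypothetical placeholders, and the 2× is assumed to apply at axis level.

```python
# Hypothetical layout: axis names other than "taste" and all item counts
# are illustrative assumptions. Only the 2x taste weight and the total of
# 28 rubrics are taken from the text.
AXES = {
    "correctness": {"items": 12, "axis_weight": 1.0},
    "craft": {"items": 10, "axis_weight": 1.0},
    "taste": {"items": 6, "axis_weight": 2.0},
}


def per_item_share(axes):
    """Return each axis's per-rubric share of the total score (0..1)."""
    total_weight = sum(a["axis_weight"] for a in axes.values())
    return {
        name: a["axis_weight"] / total_weight / a["items"]
        for name, a in axes.items()
    }


shares = per_item_share(AXES)
for name, share in shares.items():
    print(f"{name}: {share:.2%} per rubric")
```

Under these assumed counts a single Taste rubric moves the total several times more than a rubric on the other axes, which is the intended effect of weighting the axis where reward-hacking shows up.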
Generator-critic inflation
HOLDOUT TEST · CRITIC INDEPENDENT
Self-eval vs blind-critic delta (10pt)
In-distribution: generator self-scores ran +0.8 to +1.5 above the blind critic across rounds. Holdout pass 2 put an independent critic in the loop, and inflation collapsed to −0.20. That inflection point is the training-maturity signal.
| RUN | GEN | CRITIC | Δ |
|---|---|---|---|
| test_01 | 8.0 | 7.20 | +0.80 |
| test_02 | 9.2 | 7.70 | +1.50 |
| test_03 | 8.5 | 7.20 | +1.30 |
| holdout | 9.10 | 9.30 | −0.20 |
| MEAN INFLATION (ID) | | | +1.20 |
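The deltas in the table can be reproduced with a short sketch. The scores are copied from the rows above; the names `RUNS`, `inflation`, and `mean_inflation_id` are illustrative, not part of the protocol.

```python
from statistics import mean

# (run, generator self-score, blind-critic score), copied from the table.
RUNS = [
    ("test_01", 8.0, 7.20),
    ("test_02", 9.2, 7.70),
    ("test_03", 8.5, 7.20),
    ("holdout", 9.10, 9.30),
]


def inflation(gen: float, critic: float) -> float:
    """Positive means the generator scored itself above the critic."""
    return round(gen - critic, 2)


deltas = {run: inflation(gen, critic) for run, gen, critic in RUNS}

# The mean covers in-distribution runs only; the holdout run is excluded.
in_distribution = [d for run, d in deltas.items() if run != "holdout"]
mean_inflation_id = round(mean(in_distribution), 2)

print(deltas)
print("mean in-distribution inflation:", mean_inflation_id)
```

Keeping the holdout run out of the mean matters: it is the only run scored with the independent critic in the loop, so averaging it in would understate the in-distribution inflation.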
S-tier closed-loop iteration
v01 → v02 · +0.40 SCORE GAIN
Re-scoring after critic-applied fixes
First pass scored 4.55/5 (high A-tier). The critic surfaced 5 specific rubric gaps. The generator applied targeted patches without a rewrite. Re-score: 4.95, which entered the S band on the closed loop.
Training protocol — 5-step loop
CRITIC IS INDEPENDENT · NOT GENERATOR
Protocol self-reflection
META-LAYER · 3 ACTIONABLE GAPS
Generator self-eval inflates by +1.2 / 10 on average.
Across 3 independent runs, the generator scored itself ≥ 8.0 / 10; blind-critic re-evaluation returned 7.20 to 7.70. Inflation is structural, not noise: reward-hacking surfaces when scorer = scored.
PRIORITY · MUST FIX
Critic reports surface newer gaps than good/bad exemplars.
4 of 15 new principles (P-036 to P-039) originated from critic reports rather than prior exemplars. The critic loop should feed back into principles.md as the primary update source, not a secondary one.
PRIORITY · RECOMMENDED
Single-rubric weight of 0.10 can't express severity of red-line violations.
A rule like "rAF must release on visibility-hidden" weighs only 2% of the total. The generator can violate critical principles and still grade out high. Tag red-line rubrics with override weights, or have a violation trigger a tier demotion.
PRIORITY · RECOMMENDED
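A minimal sketch of the red-line gap and one possible fix. The 0.10 weight and the 2% share come from the text; the total weight of 5.0 is inferred from those two numbers, and the `red_line` tag, the lumped "everything else" rubric, and the cap of 60 are hypothetical choices for illustration only.

```python
from dataclasses import dataclass


@dataclass
class RubricResult:
    name: str
    weight: float            # contribution to the total when satisfied
    passed: bool
    red_line: bool = False   # hypothetical tag for must-not-violate rules


def score_pct(results, red_line_cap=None):
    """Weighted pass rate in percent, optionally capped on a red-line miss."""
    total = sum(r.weight for r in results)
    earned = sum(r.weight for r in results if r.passed)
    pct = 100 * earned / total
    violated = any(r.red_line and not r.passed for r in results)
    if red_line_cap is not None and violated:
        pct = min(pct, red_line_cap)
    return round(pct, 1)


results = [
    # The red-line rule quoted in the text, failed by the generator.
    RubricResult("rAF must release on visibility-hidden", 0.10, False, red_line=True),
    # All other rubrics lumped together (assumed total weight 5.0), all passed.
    RubricResult("everything else", 4.90, True),
]

print(score_pct(results))                   # plain weighting: 98.0
print(score_pct(results, red_line_cap=60))  # with hypothetical cap: 60.0
```

With plain weighting the violation costs 2 points out of 100, so the output still grades high. A cap (or an equivalent tier demotion) makes the red-line rule decisive regardless of its nominal weight.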