G3→G6 Value-Added Progress Model
How well does Grade 3 achievement predict Grade 6? How much persistent school "value-added" remains after controlling for the G3 baseline and participation? Two mixed-effects models decompose that variance and let us check whether the fit holds equally well across subjects and pair types.
Variance decomposition
After controlling for G3 baseline, school value-added accounts for roughly half the remaining variance in G6 outcomes.
- % School VA: persistent school value-added on the logit scale, stable across subjects. σ² =
- % Subject-specific: extra subject-specific divergence beyond the shared school effect. Zero — M_VA2 (school × subject) adds no signal over M_VA1.
- % Residual: pair-level noise (cohort fluctuation, G3 measurement error, participation shifts). σ² =
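The decomposition can be illustrated in miniature. The sketch below simulates pair-subject outcomes with hypothetical variance components chosen to mirror the reported 48/52 split between school VA and residual noise (the real models are fit to the actual data; every number here is an assumption), then recovers the school-VA share by the method of moments:

```python
import numpy as np

rng = np.random.default_rng(0)
n_schools, n_pairs, n_subjects = 500, 5, 3

# Hypothetical variance components: school VA ~48% of the variance remaining
# after the G3 baseline; the subject-specific component is zero, matching
# the M_VA2 result.
sd_school, sd_resid = 0.30, 0.31

school_va = rng.normal(0, sd_school, n_schools)
noise = rng.normal(0, sd_resid, (n_schools, n_pairs, n_subjects))
y = school_va[:, None, None] + noise  # G6 logit net of the G3 baseline

# Method-of-moments decomposition: the variance of school means
# overstates sigma_school^2 by sigma_resid^2 / n_obs.
n_obs = n_pairs * n_subjects  # 15 pair-subject rows per school
var_within = y.var(axis=(1, 2), ddof=1).mean()
var_school = y.mean(axis=(1, 2)).var(ddof=1) - var_within / n_obs

share_school = var_school / (var_school + var_within)
print(round(share_school, 2))  # close to the 0.48 school-VA share
```

The mixed model does the same bookkeeping, but with partial pooling and unbalanced school sizes handled properly.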
School VA distribution
Each point is one school. The logit-scale BLUP captures how much better (or worse) a school's G6 cohorts perform relative to what would be predicted from their G3 scores and participation profile.
Shrinkage: G6 raw vs model-adjusted
Partial pooling pulls schools with few observations toward the grand mean. The x-axis is each school's raw (unadjusted) mean G6 logit; the y-axis is the model's BLUP after conditioning on G3 baseline, participation, and pooling. Schools that fall well off the diagonal, with extreme raw means but BLUPs pulled toward zero, are those whose raw G6 rank is largely explained by their G3 intake; the model discounts their apparent VA.
Reliability by school size
Each school has up to 15 observation rows (5 pairs × 3 subjects). Schools with fewer pairs or subjects — those that appear in a subset of years or dropped a subject — are shrunk more aggressively toward zero.
With the full 15 observations per school, reliability is λ = 0.932. Schools with only 2 observations drop to λ = 0.65 — shrunk to 65% of their raw residual. About 21% of schools (631/2,946) have fewer than 15 observations and experience meaningfully stronger shrinkage, pulling some toward the centre of the distribution.
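The quoted reliabilities are consistent with the standard formula λ(n) = σ²_school / (σ²_school + σ²_resid/n). A quick check backs the variance ratio out of the reported λ(15) = 0.932 (the ratio itself is inferred, not stated in the report) and reproduces the n = 2 value:

```python
# Reliability lambda(n) = sigma_school^2 / (sigma_school^2 + sigma_resid^2 / n).
# Back the variance ratio out of the reported lambda(15) = 0.932.
lam_15 = 0.932
ratio = lam_15 / (1 - lam_15) / 15  # sigma_school^2 / sigma_resid^2

def reliability(n):
    return ratio / (ratio + 1 / n)

print(round(reliability(2), 2))  # 0.65, matching the reported lambda at n = 2
```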
Predictive checks
Observed vs simulated outcome distribution
The model simulates G6 outcomes by drawing from N(X β̂ + BLUP, σ²_resid) for each observation. Key moments match closely:
The simulated SD (1.485) is slightly below observed (1.496) because the simulator draws new noise whereas the observations already contain sampling variation. The 5th–95th range is well captured (obs: [−1.00, 3.71]; sim: [−1.26, 3.61]). The KS statistic of 0.078 (p < 0.001) flags mild non-normality of residuals — expected with n = 41,738; the deviation is small in absolute terms.
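The logic of the check can be reproduced in miniature. A sketch, assuming hypothetical fitted means and a residual SD (none of these numbers come from the report), with the "observed" residuals given mildly heavy tails so the KS test flags a small but statistically detectable departure from normality, as the report's does:

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(1)
n = 41_738  # the reported number of observations

# Stand-in for each observation's fitted mean X beta_hat + BLUP
# (hypothetical values; the real check uses the fitted model).
mu = rng.normal(1.3, 1.45, n)
sigma_resid = 0.31

observed = mu + rng.standard_t(df=6, size=n) * sigma_resid  # heavy-ish tails
simulated = rng.normal(mu, sigma_resid)  # the model's own predictive draw

# Moment comparison, as in the report's table
print(round(observed.std(), 3), round(simulated.std(), 3))

# KS test of standardized residuals against N(0, 1): significant at this n,
# but small in absolute terms.
z = (observed - mu) / (sigma_resid * np.sqrt(6 / 4))  # t(6) variance is 6/4
print(round(kstest(z, "norm").statistic, 3))
```

At n of this size the KS test detects deviations far too small to matter practically, which is why the absolute size of the statistic, not the p-value, is the useful diagnostic.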
Per-subject predictive performance
Cohort vs cross-sectional pair performance
The data includes one true cohort (2022 G3 → 2025 G6, same students) and four cross-sectional pairs (different cohorts, same school, different years). If the model's G3 slope properly substitutes for cohort tracking, the cohort pair should not be harder to predict.
Residual diagnostics
All five diagnostics pass. The most informative is r(|residual|, n) = −0.234: smaller schools have larger residuals, as expected under binomial noise — not a model failure but confirmation that variance scales with 1/n. The G3 slope is cleanly linear (r = −0.023 for the quadratic term, well below the 0.10 flag threshold).
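The negative r(|residual|, n) pattern is exactly what binomial sampling noise produces. A toy check with hypothetical cohort sizes and proficiency rates (the school count matches the report; everything else is made up):

```python
import numpy as np

rng = np.random.default_rng(2)
n_schools = 2_946  # the reported number of schools

# Hypothetical cohort sizes and true proficiency rates per school
n_students = rng.integers(10, 200, n_schools)
p_true = 1 / (1 + np.exp(-rng.normal(0.8, 0.6, n_schools)))

# Observed logits carry binomial noise with variance ~ 1 / (n p (1 - p))
k = rng.binomial(n_students, p_true)
p_obs = (k + 0.5) / (n_students + 1)  # continuity correction for k = 0 or n
resid = np.log(p_obs / (1 - p_obs)) - np.log(p_true / (1 - p_true))

r = np.corrcoef(np.abs(resid), n_students)[0, 1]
print(round(r, 2))  # negative: smaller cohorts give larger residuals
```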
Shrinkage verification
The BLUP is shrunk to 92.0% of the raw school residual on average (theoretical reliability at median n_obs = 15: 0.932 — 1.2 pp difference). The high BLUP–raw correlation (r = 0.999) reflects that 79% of schools have the full 15 observations and receive identical shrinkage, so their ranks are preserved. The remaining 21% — schools with fewer pair-subject observations — are shrunk more (λ as low as 0.65 at n_obs = 2), which does change their rank. In practice, small-sample schools with extreme raw residuals are pulled meaningfully toward zero, with rank shifts of up to 43 positions.
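The rank argument can be verified directly: schools sharing the same n_obs get the same λ, a positive monotone scaling, so their relative order cannot change; only comparisons across different n can reshuffle. A sketch with hypothetical raw residuals (the 79% full-data share and λ(15) = 0.932 come from the report; the rest is illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n_schools = 1_000

# 79% of schools have the full 15 observations; the rest have 2-14
n_obs = np.where(rng.random(n_schools) < 0.79, 15,
                 rng.integers(2, 15, n_schools))
raw = rng.normal(0, 0.4, n_schools)  # hypothetical raw school residuals

ratio = 0.932 / (1 - 0.932) / 15  # variance ratio implied by lambda(15)
lam = ratio / (ratio + 1 / n_obs)
blup = lam * raw

# Uniform lambda preserves order among full-data schools...
full = n_obs == 15
assert np.array_equal(np.argsort(raw[full]), np.argsort(blup[full]))
# ...but small-n schools move relative to them
print(round(np.corrcoef(raw, blup)[0, 1], 3))
```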
Value-added vs G3 achievement
The x-axis uses a shrunken G3 estimate — each school's mean G3 logit pulled toward the grand mean by an empirical Bayes factor that accounts for the number of observations.
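The shrunken G3 estimate follows the usual empirical Bayes form. A small sketch (the variance components and school values here are illustrative, not the report's):

```python
import numpy as np

def shrink_g3(means, n_obs, var_between, var_within):
    """Pull each school's mean G3 logit toward the grand mean by the
    empirical Bayes factor lambda = vb / (vb + vw / n)."""
    grand = np.average(means, weights=n_obs)
    lam = var_between / (var_between + var_within / n_obs)
    return grand + lam * (means - grand)

# Illustrative values: the 2-observation school is pulled hardest
means = np.array([-0.5, 0.2, 1.8])
n_obs = np.array([2, 15, 15])
print(shrink_g3(means, n_obs, 0.25, 0.9).round(2))
```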
Note also that a zero slope is not a universal requirement of a valid VA model; it is simply what would be expected if the model fully absorbed the G3 baseline. Genuine heterogeneity in VA by G3 level is possible (schools with weaker intakes may genuinely add more relative value), and ceiling effects further compress measurable G6 gains for the highest-G3 schools. But at r = 0.66 the Mundlak bias dominates.
Interpretation
The G3→G6 progress model is well-calibrated and internally consistent. Simulated outcomes match observed moments, residuals are well-behaved, and shrinkage tracks theory.
Key findings:
- G3 logit explains roughly 55–60% of the variance in G6 outcomes (R² by subject). The other 40–45% splits into real school VA (48% of the remaining variance) and residual noise (52%).
- Math is more predictable than Reading or Writing — G3 reading may be a noisier signal, or reading achievement may diverge more between cohorts.
- Cross-sectional pairs are as predictive as the true cohort pair when they share the same G6 year. Older G3 baselines lose predictive power (72.7% for 2022 vs 78.5% for the 2022→2025 cohort). This is the expected attenuation from cohort drift.
- Subject-specific VA is zero: schools that add value do so uniformly across Reading, Writing, and Math. A school with high Math VA also has high Reading and Writing VA — so a single school-level VA score summarises all three subjects without loss.
- Reliability at the median school (n_obs = 15 pair-subject observations) is 0.93. Schools with only 2 observations drop to λ = 0.65 and are shrunk to 65% of their raw residual. About 21% of schools have fewer than 15 observations and experience meaningfully stronger shrinkage, with rank shifts of up to 43 positions relative to their raw residual order.