G3→G6 Value-Added Progress Model

How well does Grade 3 achievement predict Grade 6? How much persistent school "value-added" remains after controlling for G3 baseline and participation? Two mixed-effects models decompose the variance and allow us to check whether the model fits equally well across subjects and pair types.

Method: M_VA1 (school random intercept) and M_VA2 (school×subject random intercept) both predict G6 logit(L3/4%) from G3 logit(L3/4%) plus participation covariates (G6 participation rate, change in participation, log n-ratio). Fixed effects include pair×subject indicators to absorb cohort and cross-sectional baselines. The sample contains 41,738 school×pair×subject observations across 2,946 schools and five pair types (one true cohort, four cross-sectional). Logit-normal approximation with REML; weights proportional to n·p·(1−p).
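The logit transform and the n·p·(1−p) weighting can be sketched as follows (a minimal sketch; the function names are illustrative, not taken from the analysis code):

```python
import numpy as np

def logit(p, eps=1e-4):
    """Logit of the L3/4 proportion, clipped away from 0 and 1 so the
    logit-normal approximation stays finite for boundary schools."""
    p = np.clip(p, eps, 1 - eps)
    return np.log(p / (1 - p))

def precision_weight(n, p):
    """Weight proportional to n*p*(1-p): by the delta method,
    Var(logit(p_hat)) ~ 1/(n*p*(1-p)), so this is the inverse variance."""
    return n * p * (1 - p)

print(logit(np.array([0.10, 0.50, 0.90])))   # antisymmetric around p = 0.5
print(precision_weight(120, 0.5))            # weight peaks at p = 0.5
```

The weight is largest for big schools near 50% L3/4, where a proportion pins down the logit most precisely.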

Variance decomposition

After controlling for G3 baseline, school value-added accounts for roughly half the remaining variance in G6 outcomes.

% School VA: persistent school value-added on the logit scale, stable across subjects.

% Subject-specific: extra subject-specific divergence beyond the shared school effect; zero here, since M_VA2 (school×subject) adds no signal over M_VA1.

% Residual: pair-level noise from cohort fluctuation, G3 measurement error, and participation shifts.

Cross-subject consistency is 100%. A school's G3→G6 value-added is completely shared across Reading, Writing, and Math. Subject VA correlations are near zero (−0.007 to −0.117), meaning M_VA2 is absorbing noise, not signal. Schools that add value in one subject add it in all three.
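The shares and the consistency figure follow directly from the M_VA2 variance components. A sketch with placeholder values (illustrative numbers, not the fitted σ² values, which are not reproduced above):

```python
# Placeholder variance components (illustrative, not the fitted values):
var_school  = 0.55   # shared school intercept, sigma2_school
var_subject = 0.00   # school x subject increment (zero per the fit)
var_resid   = 0.55   # pair-level residual

total = var_school + var_subject + var_resid
share_school = var_school / total            # "% School VA"
share_resid  = var_resid / total             # "% Residual"

# Cross-subject consistency: fraction of school-level VA variance
# that is common to all three subjects.
consistency = var_school / (var_school + var_subject)
print(share_school, share_resid, consistency)
```

With a zero school×subject component, consistency is 1.0 by construction, matching the 100% quoted above.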

School VA distribution

Each point is one school. The logit-scale BLUP captures how much better (or worse) a school's G6 cohorts perform relative to what would be predicted from their G3 scores and participation profile.

The distribution is approximately centred on zero (mean = −0.000) with SD ≈ 0.74 on the logit scale. On the probability scale near 50% L3/4, 1 logit unit ≈ 25 percentage points — so the interquartile range spans roughly ±10 pp of value-added.
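The logit-to-percentage-point conversion uses the logistic derivative p(1−p), which peaks at 0.25 when p = 0.5:

```python
import numpy as np

def logit_to_pp(delta_logit, p0=0.5):
    """Approximate percentage-point change for a shift on the logit
    scale, linearized at baseline proportion p0 (dp/dlogit = p*(1-p))."""
    return delta_logit * p0 * (1 - p0) * 100

print(logit_to_pp(1.0))   # 1 logit ~ 25 pp near 50% L3/4

# Exact change for comparison: the linearization overstates slightly.
expit = lambda z: 1 / (1 + np.exp(-z))
print(100 * (expit(1.0) - expit(0.0)))   # ~23 pp
```

Away from p = 0.5 the multiplier shrinks, which is why logit-scale effects compress near the floor and ceiling.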

Shrinkage: G6 raw vs model-adjusted

Partial pooling pulls schools with few observations toward the grand mean. The x-axis is each school's raw (unadjusted) mean G6 logit; the y-axis is the model's BLUP after conditioning on G3 baseline, participation, and pooling. Schools near the bottom-left or top-right of the diagonal are those whose raw G6 rank is largely explained by their G3 intake — the model discounts their apparent VA.

High raw G6 is not the same as high VA: schools with strong G3 cohorts sit below the diagonal, because much of their raw G6 level is explained by their G3 baseline rather than by value-added. Small schools (light yellow) are pulled more aggressively toward zero. The near-zero correlation between raw G6 rank and VA rank confirms the G3 adjustment matters.

Reliability by school size

Each school has up to 15 observation rows (5 pairs × 3 subjects). Schools with fewer pairs or subjects — those that appear in a subset of years or dropped a subject — are shrunk more aggressively toward zero.

With the full 15 observations per school, reliability is λ = 0.932. Schools with only 2 observations drop to λ = 0.65 — shrunk to 65% of their raw residual. About 21% of schools (631/2,946) have fewer than 15 observations and experience meaningfully stronger shrinkage, pulling some toward the centre of the distribution.
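Under the standard random-intercept reliability formula λ(n) = σ²_u / (σ²_u + σ²_e/n), the variance ratio can be back-solved from the stated λ = 0.932 at n = 15, and it reproduces λ ≈ 0.65 at n = 2 (a sketch; the ratio is inferred from the quoted values, not taken from the fit):

```python
# Back-solve sigma2_e / sigma2_u from lambda(15) = 0.932:
# lambda(n) = 1 / (1 + ratio / n)  =>  ratio = n * (1/lambda - 1)
ratio = 15 * (1 / 0.932 - 1)

def reliability(n_obs):
    """Shrinkage factor applied to a school's raw residual mean."""
    return 1.0 / (1.0 + ratio / n_obs)

for n in (2, 5, 10, 15):
    print(n, round(reliability(n), 3))
```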


Predictive checks

Observed vs simulated outcome distribution

The model simulates G6 outcomes by drawing from N(X β̂ + BLUP, σ²_resid) for each observation. The key moments match closely.

The simulated SD (1.485) is slightly below observed (1.496) because the simulator draws new noise whereas the observations already contain sampling variation. The 5th–95th range is well captured (obs: [−1.00, 3.71]; sim: [−1.26, 3.61]). The KS statistic of 0.078 (p < 0.001) flags mild non-normality of residuals — expected with n = 41,738; the deviation is small in absolute terms.
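A posterior predictive check of this form can be sketched with synthetic stand-ins for the fitted pieces (`mu` plays the role of X β̂ + BLUP; all names and numbers here are illustrative):

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(0)
n = 5000
mu = rng.normal(1.0, 1.3, size=n)     # stand-in for X @ beta_hat + BLUP
sigma_resid = 0.7

y_obs = mu + rng.normal(0, sigma_resid, size=n)   # "observed" outcomes
y_sim = mu + rng.normal(0, sigma_resid, size=n)   # one predictive draw

print(round(y_obs.std(), 3), round(y_sim.std(), 3))  # moments should agree

# KS test of standardized residuals against N(0, 1):
z = (y_obs - mu) / sigma_resid
print(round(kstest(z, "norm").statistic, 3))         # small when residuals are normal
```

At the real sample size (n = 41,738), even a KS statistic of 0.078 corresponds to a tiny p-value, which is why the text treats it as a flag rather than a failure.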


Per-subject predictive performance

Math is the most predictable subject (R² = 60.5%, RMSE = 0.619 logit); Reading is the hardest to predict (R² = 54.6%, RMSE = 0.794). The 6 pp R² gap and 28% higher RMSE for Reading suggest G3 reading scores carry less forward signal, or that reading achievement is more volatile between pairs than Math. Writing sits in between on R² (57.0%) but has the highest RMSE (0.805).
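The per-subject figures can be computed with a simple groupby over predictions (a sketch on synthetic data; the column names are assumptions, not the analysis schema):

```python
import numpy as np
import pandas as pd

def per_subject_metrics(df, y="g6_logit", yhat="pred"):
    """R^2 and RMSE of model predictions within each subject."""
    rows = {}
    for subject, g in df.groupby("subject"):
        resid = g[y] - g[yhat]
        ss_res = float(np.sum(resid ** 2))
        ss_tot = float(np.sum((g[y] - g[y].mean()) ** 2))
        rows[subject] = {"R2": 1 - ss_res / ss_tot,
                         "RMSE": float(np.sqrt(np.mean(resid ** 2)))}
    return pd.DataFrame(rows).T

# Synthetic check: signal-to-noise chosen so R^2 lands around 0.8.
rng = np.random.default_rng(1)
df = pd.DataFrame({"subject": np.repeat(["Math", "Reading", "Writing"], 300)})
df["pred"] = rng.normal(size=len(df))
df["g6_logit"] = df["pred"] + rng.normal(0, 0.5, size=len(df))
print(per_subject_metrics(df))
```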

Cohort vs cross-sectional pair performance

The data includes one true cohort (2022 G3 → 2025 G6, same students) and four cross-sectional pairs (different cohorts, same school, different years). If the model's G3 slope properly substitutes for cohort tracking, the cohort pair should not be harder to predict.

The true cohort pair (R² = 78.5%) matches the most recent cross-sectional pair (xsec_2025: 78.4%) — the G3 baseline from the same students is no more predictive than a cross-sectional same-year pair. Older cross-sectional pairs underperform (xsec_2022: 72.7%), consistent with G3 data growing stale over time. This validates using cross-sectional pairs as proxies for cohort tracking.

Residual diagnostics

All five diagnostics pass. The most informative is r(|residual|, n) = −0.234: smaller schools have larger residuals, as expected under binomial noise — not a model failure but confirmation that variance scales with 1/n. The G3 slope is cleanly linear (r = −0.023 for the quadratic term, well below the 0.10 flag threshold).


Shrinkage verification

The BLUP is shrunk to 92.0% of the raw school residual on average (theoretical reliability at median n_obs = 15: 0.932 — 1.2 pp difference). The high BLUP–raw correlation (r = 0.999) reflects that 79% of schools have the full 15 observations and receive identical shrinkage, so their ranks are preserved. The remaining 21% — schools with fewer pair-subject observations — are shrunk more (λ as low as 0.65 at n_obs = 2), which does change their rank. In practice, small-sample schools with extreme raw residuals are pulled meaningfully toward zero, with rank shifts of up to 43 positions.
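The rank effect of differential shrinkage can be illustrated directly (synthetic data; λ uses the variance ratio implied by λ = 0.932 at n = 15, and the size mix roughly mirrors the 79%/21% split above):

```python
import numpy as np
from scipy.stats import rankdata

rng = np.random.default_rng(2)
n_schools = 1000
raw = rng.normal(0, 0.8, size=n_schools)               # raw school residuals
n_obs = rng.choice([2, 5, 15], size=n_schools, p=[0.06, 0.15, 0.79])
ratio = 15 * (1 / 0.932 - 1)                           # sigma2_e / sigma2_u
lam = 1.0 / (1.0 + ratio / n_obs)
blup = lam * raw                                       # partial pooling toward zero

print(round(np.corrcoef(raw, blup)[0, 1], 3))  # high: shared lambda preserves ranks
shift = np.abs(rankdata(raw) - rankdata(blup))
print(int(shift.max()))                        # small-n extremes move the most
```

Schools sharing the same λ keep their relative order; rank shifts come only from the minority with fewer observations crossing past fully-observed schools.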


Value-added vs G3 achievement

The x-axis uses a shrunken G3 estimate — each school's mean G3 logit pulled toward the grand mean by an empirical Bayes factor that accounts for the number of observations.

Model limitation — Mundlak/ecological bias. The strong positive slope (r ≈ 0.66) reveals that current VA estimates are confounded with school G3 intake level. This is not a ceiling effect but a slope identification problem: the model uses a single pooled G3→G6 slope estimated from both within-school (year-to-year) and between-school variation (0.55 vs 0.81 respectively). Because the between-school slope is steeper, high-G3 schools are under-predicted by the pooled slope and their random intercept absorbs the residual — falsely appearing as high value-added. A Mundlak correction (adding school-mean G3 as an additional fixed effect per subject) would separate the two slopes and remove this bias.

Note also that a zero slope is not a universal requirement of a valid VA model — it is simply what would be expected if the model fully absorbed the G3 baseline. Genuine heterogeneity in VA by G3 level is possible (schools with weaker intakes may genuinely add more relative value), and ceiling effects do further compress measurable G6 gains for the highest-G3 schools. But at r = 0.66 the Mundlak bias dominates.
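The within/between slope mechanics behind the Mundlak correction can be demonstrated on synthetic data built with the two slopes quoted above (0.55 within, 0.81 between); adding the school-mean G3 as a regressor recovers both:

```python
import numpy as np

rng = np.random.default_rng(3)
n_schools, n_pairs = 500, 5
xbar = rng.normal(0, 1, n_schools)                          # school-mean G3
x = xbar[:, None] + rng.normal(0, 1, (n_schools, n_pairs))  # pair-level G3
y = (0.81 * xbar[:, None]                                   # between-school slope
     + 0.55 * (x - xbar[:, None])                           # within-school slope
     + rng.normal(0, 0.3, (n_schools, n_pairs)))

ones = np.ones(x.size)
b_pool = np.linalg.lstsq(np.column_stack([ones, x.ravel()]),
                         y.ravel(), rcond=None)[0]
xb = np.repeat(xbar, n_pairs)                               # Mundlak term
b_mun = np.linalg.lstsq(np.column_stack([ones, x.ravel(), xb]),
                        y.ravel(), rcond=None)[0]

print(round(b_pool[1], 2))              # pooled slope, between 0.55 and 0.81
print(round(b_mun[1], 2))               # within slope ~ 0.55
print(round(b_mun[1] + b_mun[2], 2))    # between slope ~ 0.81
```

In the real model the observed school-mean G3 (per subject) would stand in for `xbar`, and the coefficient on it measures the between-minus-within slope gap that is currently leaking into the random intercepts.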


Interpretation

The G3→G6 progress model is well-calibrated and internally consistent. Simulated outcomes match observed moments, residuals are well-behaved, and shrinkage tracks theory.

Key findings:

The residual normality KS test (stat = 0.078, p < 0.001) flags mild non-normality. With n = 41,738 the test is very sensitive; the deviation reflects slight boundary effects from the logit transformation at high/low performance levels. It does not invalidate the variance decomposition but suggests caution about prediction intervals for schools near the ceiling (>90% L3/4) or floor (<10%).