Metric Stability
How much do school-level achievement metrics vary from year to year? Separating statistical noise from genuine change is essential for interpreting year-over-year comparisons.
Section 1: Year-to-Year Stability by Lag
All (school, year₁, year₂) pairs are built for lags of 1, 2, and 3 years (lag = year₂ − year₁). Observed variability (MAD, RMSD) is compared to expected sampling noise.
Two noise estimates are shown. Conservative (100/√n) is the upper bound used elsewhere in the dashboard — it equals 2× the binomial worst case (50/√n at p = 50%) and is deliberately wide so that small-school results are not over-interpreted. Binomial uses each school's actual observed proportion and is typically 40–50% of the conservative value. A signal fraction is computed from each.
MAD and RMSD bars per lag; red dashed = conservative noise ceiling (100/√n); orange dotted = binomial noise using actual observed proportions. The conservative line often exceeds both bars because it is 2× the binomial worst-case; signal fraction relative to binomial noise is the more realistic signal measure.
Signal fraction = max(0, 1 − median_noise² / RMSD²). A value of 0 means noise alone could account for all observed variability; higher values indicate genuine signal. The conservative signal fraction will often be 0 (noise ceiling exceeds RMSD); the binomial signal fraction is a more realistic estimate of how much real change is present.
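A minimal sketch of these quantities in JavaScript (function names are illustrative, not the dashboard's actual API); independent per-year SDs add in quadrature for a year-over-year change:

```js
// Conservative per-year noise: 100/√n, i.e. 2× the binomial worst case (p = 0.5).
const conservativeSD = (n) => 100 / Math.sqrt(n);

// Binomial per-year noise from the school's actually observed proportion p (0–1).
const binomialSD = (p, n) => 100 * Math.sqrt((p * (1 - p)) / n);

// SD of a year-over-year change: independent per-year SDs add in quadrature.
const changeSD = (sd1, sd2) => Math.hypot(sd1, sd2);

// Signal fraction = max(0, 1 − median_noise² / RMSD²), as defined above.
const signalFraction = (medianNoiseSD, rmsd) =>
  Math.max(0, 1 - medianNoiseSD ** 2 / rmsd ** 2);
```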
Section 2: Stability vs School Size
Smaller schools have more sampling noise. The scatter below shows lag-1 pairs only, with the theoretical noise curve overlaid.
Each dot is one school's year-over-year transition. The black line is a smoothed empirical median (rolling log-space window). The dashed red curve is the theoretical expected |change| under pure sampling noise — it uses 100/√n (conservative, p = 50% assumed) and therefore sits well above the typical data.
The noise curve sits above most of the data because 100/√n is ~2× the actual binomial SD at typical L3/L4% values. The empirical smooth shows that actual median change is much lower than the conservative noise ceiling — consistent with the table in Section 1 where conservative noise exceeds RMSD. Both reflect a real phenomenon: observed year-to-year change at most schools is modest, and the conservative noise model is intentionally pessimistic.
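For reference, a sketch of how the theoretical curve can be computed, assuming Gaussian noise, the conservative per-year SD of 100/√n, and equal n in both years (assumptions for illustration; the page's exact curve may differ). For a zero-mean normal, E|X| = σ·√(2/π):

```js
// Expected |change| for a school of size n under pure sampling noise.
const expectedAbsChange = (n) => {
  const perYearSD = 100 / Math.sqrt(n);         // conservative, p = 50% assumed
  const sdOfChange = Math.SQRT2 * perYearSD;    // two equal years in quadrature
  return Math.sqrt(2 / Math.PI) * sdOfChange;   // E|X| = σ·√(2/π) for X ~ N(0, σ²)
};
```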
By size bin
Box plots show distribution compression as n increases — the IQR narrows substantially from the smallest to largest schools.
Section 3: Surprise Scores
Sections 1–2 show aggregate variability. But the useful question is: for this school, was this change real? A "surprise score" (z-score) compares each school's observed change to what we'd expect from noise alone, using an empirically calibrated noise model — not just the binomial formula.
Method. For each lag-1 transition, the expected SD of change has two components:
- Sampling noise — known from n₁ and n₂ (binomial)
- Cohort replacement + baseline drift — estimated from the data as the residual variance after subtracting sampling noise
We estimate the "true" within-school SD (σ_true) from the full population of lag-1 pairs:
σ²_observed = σ²_true + σ²_sampling → σ²_true = max(0, σ²_observed − median(σ²_sampling))
Then each school's expected SD is √(σ²_true + σ²_sampling_i), and z = Δ / σ_expected.
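A sketch of this calibration, assuming lag-1 pairs of the form {delta, p1, n1, p2, n2} (delta in percentage points, proportions in 0–1); names are illustrative:

```js
const median = (xs) => {
  const s = [...xs].sort((a, b) => a - b);
  const m = Math.floor(s.length / 2);
  return s.length % 2 ? s[m] : (s[m - 1] + s[m]) / 2;
};

// Sampling variance of a change, in squared percentage points, from the
// binomial variances of the two years.
const samplingVar = ({ p1, n1, p2, n2 }) =>
  10000 * ((p1 * (1 - p1)) / n1 + (p2 * (1 - p2)) / n2);

function surpriseScores(pairs) {
  // σ²_observed: mean squared change, i.e. RMSD² (assumes mean change ≈ 0).
  const obsVar = pairs.reduce((s, p) => s + p.delta ** 2, 0) / pairs.length;
  // σ²_true: residual variance after removing the typical sampling variance.
  const trueVar = Math.max(0, obsVar - median(pairs.map(samplingVar)));
  // z = Δ / √(σ²_true + σ²_sampling_i) for each school's transition.
  return pairs.map((p) => p.delta / Math.sqrt(trueVar + samplingVar(p)));
}
```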
Variance decomposition (lag-1 pairs, schools with n ≥ 10):
Z-score histogram. If the noise model is well-calibrated, this should resemble a standard normal (dashed curve). Fat tails mean the model underestimates true variation; a too-narrow distribution means it overestimates noise.
Surprise score vs school size. Small schools rarely produce surprising z-scores because their wide expected SD absorbs most changes. Large schools with big moves are the ones that stand out.
Most surprising changes
The 30 most surprising year-over-year transitions (|z| ≥ 1.5). These are the changes most likely to reflect genuine school-level shifts rather than noise.
Section 4: Model-Implied Estimates
The Shrunk Level (EB) model is run on all school histories. Its estimate for each school is a blend of last year's result and the school's own weighted-historical mean, with more weight on last year for larger schools. Model-implied estimates are joined back to the observed data for visual comparison.
Observed vs model-implied (2025). Color = assessed students (log, viridis). Large schools (small measurement noise) cluster on the diagonal — the model trusts last year's result. Small schools are pulled toward their long-run weighted mean and scatter further from the diagonal.
4-year trajectories for a small (5th-percentile), median, and large (95th-percentile) school. Solid black = observed; dashed = all four model estimates. For the small school, history-based models pull the estimate toward the long-run mean; for the large school all models converge toward last year's result (less shrinkage).
Residuals vs school size. Larger schools have smaller in-sample residuals, confirming that the noise model tracks school size.
Section 5: Predictive Validation
Train on 2022–2024, predict 2025, compare to actual. Every registered model in MODELS is evaluated automatically — adding a new model to school-models.js makes it appear here with no page changes.
Prediction error by size bin and model (train 2022–2024, predict 2025).
RMSE by school size. Models: Last year (random walk), Weighted mean (flat history), EMA half-life 2yr (time-decay mean — recent years weighted more, no trend extrapolation), Shrunk level EB (James-Stein blend of last year toward school's own mean — weight on last year scales with school size).
Calibration scatter (Shrunk level EB, 2025). Well-calibrated predictions scatter symmetrically around the diagonal. Systematic bow or offset would indicate model bias.
Section 5b: All-Combination Model Comparison
Predictive accuracy for all 4 models across every combination of grade (3/6), subject (Reading / Writing / Math), and metric (L3/L4%, Worst-case L3/L4%, L4%) — 18 combinations, no clicking needed. Train 2022–2024, predict 2025.
Each line is one model; x-axis shows grade × subject combinations. Facets split by metric. Lower is better — a model that sits consistently below the others is the best predictor. Note: raw RMSE is not comparable across metrics because they have different scales (e.g. L4% lives in a much narrower range than L3/L4%). Use the R² chart below for cross-metric comparison.
R² by metric — correlation R² (Pearson r²) between predicted and actual 2025 values. Measures how well the model tracks cross-school variation, independent of systematic bias from province-wide shifts between training and test periods. Unlike RMSE, r² is scale-invariant, always in [0, 1], and directly comparable across metrics. Higher is better. A metric whose r² is consistently higher across all models is genuinely more predictable — not just living on a narrower scale.
RMSE (pp) and R² for each model. Best = model with highest R² for that row. SD(actual) shows the cross-school standard deviation of 2025 values — metrics with smaller SD have less spread to predict, so raw RMSE will look lower even if the model explains no more of the variance.
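For concreteness, the two statistics as they might be computed over parallel arrays of predicted and actual 2025 values (a sketch, not the page's actual code):

```js
// RMSE in percentage points.
function rmse(pred, actual) {
  const mse = pred.reduce((s, p, i) => s + (p - actual[i]) ** 2, 0) / pred.length;
  return Math.sqrt(mse);
}

// Pearson r² between predicted and actual: invariant to a uniform shift,
// e.g. a province-wide move between the training and test years.
function pearsonR2(pred, actual) {
  const n = pred.length;
  const mx = pred.reduce((a, b) => a + b, 0) / n;
  const my = actual.reduce((a, b) => a + b, 0) / n;
  let sxy = 0, sxx = 0, syy = 0;
  for (let i = 0; i < n; i++) {
    sxy += (pred[i] - mx) * (actual[i] - my);
    sxx += (pred[i] - mx) ** 2;
    syy += (actual[i] - my) ** 2;
  }
  const r = sxy / Math.sqrt(sxx * syy);
  return r * r;
}
```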
Section 6: Methodology Notes
- Why own-history smoothing: The school's own trajectory is the most relevant prior. School administrators trust their own data more than comparisons to board averages.
- Shrunk level EB: Blends last year's result toward the school's own weighted-historical mean. The blend weight (λ) is calibrated from the empirical year-to-year drift variance: large schools have small measurement noise, so λ → 1 (rely on last year); small schools have large measurement noise, so λ → 0 (rely on history). This is the James-Stein / empirical-Bayes optimal predictor for a random-walk level with known noise; see the first sketch after this list.
- EMA (half-life 2yr): Exponentially time-decayed weighted mean. Gives recent years more weight than older ones without projecting any slope forward. A useful complement when the question is "what does recent data suggest?" rather than "what is the school's long-run level?" See the second sketch after this list.
- The noise model: 100/√n (not binomial MoE). See Measurement Reliability for the full rationale.
- Limitations: A genuine sudden shift (new principal, program change) in a small school will be partially smoothed away. The raw observed value is always shown alongside the model. The model assumes some year-to-year persistence in a school's underlying quality.
- Out-of-sample validation: In-sample fit always looks good. The train-on-3/predict-4 test shows whether the model genuinely reduces uncertainty or just overfits.
- Adding new models: Register a new entry in MODELS in school-models.js — it will appear automatically in Section 5's validation table with no page changes needed. A hypothetical example closes this section.
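A minimal sketch of the Shrunk Level (EB) blend, assuming driftVar is the empirical drift variance (σ²_true) and noiseVar the school's sampling variance, both in squared percentage points:

```js
// λ = drift / (drift + noise): → 1 for large schools (small sampling noise),
// → 0 for small schools, where measurement noise dominates real drift.
const shrunkLevel = (lastYear, historyMean, driftVar, noiseVar) => {
  const lambda = driftVar / (driftVar + noiseVar);
  return lambda * lastYear + (1 - lambda) * historyMean;
};
```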
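And a sketch of the EMA (half-life 2yr) estimator; the {year, value} history shape is an assumption:

```js
// Weight halves every `halfLife` years looking back from the latest year.
function emaHalfLife(history, halfLife = 2) {
  const latest = Math.max(...history.map((d) => d.year));
  let num = 0, den = 0;
  for (const { year, value } of history) {
    const w = 0.5 ** ((latest - year) / halfLife);
    num += w * value;
    den += w;
  }
  return num / den; // a level estimate only; no slope is extrapolated
}
```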
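Finally, a hypothetical registration sketch; the actual entry shape is defined in school-models.js, and the id, label, and predict fields below (and MODELS being an array) are assumptions for illustration:

```js
// Hypothetical: MODELS is assumed to be an array whose entries carry
// id / label / predict; check school-models.js for the real shape.
MODELS.push({
  id: "median-3yr",
  label: "3-year median",
  predict(history) {
    // history assumed as [{year, value}], oldest first.
    const vals = history.slice(-3).map((d) => d.value).sort((a, b) => a - b);
    return vals[Math.floor(vals.length / 2)];
  },
});
```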