Metric Stability

How much do school-level achievement metrics vary from year to year? Separating statistical noise from genuine change is essential for interpreting year-over-year comparisons.


Section 1: Year-to-Year Stability by Lag

All (school, year₁, year₂) pairs are built for lags 1, 2, and 3, and the observed variability of the change (MAD, RMSD) is compared to the variability expected from sampling noise alone.
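A minimal sketch of how such pairs and the two variability summaries might be built. The field names `school`, `year`, and `value` are assumptions for illustration, not the dashboard's actual schema:

```javascript
// Build all (school, year1, year2) pairs at a given lag, then summarize
// the year-over-year changes with MAD (median absolute difference) and
// RMSD (root-mean-square difference).
function lagPairs(rows, lag) {
  const bySchool = new Map();
  for (const r of rows) {
    if (!bySchool.has(r.school)) bySchool.set(r.school, new Map());
    bySchool.get(r.school).set(r.year, r.value);
  }
  const pairs = [];
  for (const [school, years] of bySchool) {
    for (const [year, v1] of years) {
      const v2 = years.get(year + lag);
      if (v2 !== undefined) {
        pairs.push({ school, year1: year, year2: year + lag, change: v2 - v1 });
      }
    }
  }
  return pairs;
}

const median = (xs) => {
  const s = [...xs].sort((a, b) => a - b);
  const m = s.length >> 1;
  return s.length % 2 ? s[m] : (s[m - 1] + s[m]) / 2;
};

const mad = (pairs) => median(pairs.map((p) => Math.abs(p.change)));
const rmsd = (pairs) =>
  Math.sqrt(pairs.reduce((s, p) => s + p.change ** 2, 0) / pairs.length);
```

With four years of data a school contributes three lag-1 pairs and two lag-2 pairs, so the pair counts shrink as the lag grows.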

Two noise estimates are shown. Conservative (100/√n) is the upper bound used elsewhere in the dashboard — it equals 2× the binomial worst case and is deliberately wide to penalise small schools. Binomial uses each school's actual observed proportion, and is typically 40–50% of the conservative value. Signal fraction is computed from each.

MAD and RMSD bars per lag; red dashed = conservative noise ceiling (100/√n); orange dotted = binomial noise using actual observed proportions. The conservative line often exceeds both bars because it is 2× the binomial worst-case; signal fraction relative to binomial noise is the more realistic signal measure.

Signal fraction = max(0, 1 − median_noise² / RMSD²). A value of 0 means noise alone could account for all observed variability; higher values indicate genuine signal. The conservative signal fraction will often be 0 (noise ceiling exceeds RMSD); the binomial signal fraction is a more realistic estimate of how much real change is present.
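The two noise estimates and the resulting signal fraction can be sketched directly from the definitions above (all SDs in percentage points; `p` is the school's observed L3/L4 proportion on the 0–1 scale):

```javascript
// Conservative noise: 100/sqrt(n) pp, twice the binomial worst case (p = 0.5).
const conservativeSD = (n) => 100 / Math.sqrt(n);

// Binomial noise at the school's actual observed proportion p.
const binomialSD = (n, p) => 100 * Math.sqrt((p * (1 - p)) / n);

// Signal fraction: share of observed variance not attributable to noise.
const signalFraction = (medianNoise, rmsd) =>
  Math.max(0, 1 - medianNoise ** 2 / rmsd ** 2);
```

For example, at n = 100 the conservative SD is 10 pp while the binomial SD at p = 0.7 is about 4.6 pp, roughly 46% of the conservative value, consistent with the "typically 40–50%" figure above.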


Section 2: Stability vs School Size

Smaller schools have more sampling noise. The scatter below shows lag-1 pairs only, with the theoretical noise curve overlaid.

Each dot is one school's year-over-year transition. The black line is a smoothed empirical median (rolling log-space window). The dashed red curve is the theoretical expected |change| under pure sampling noise — it uses 100/√n (conservative, p = 50% assumed) and therefore sits well above the typical data.
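One way to derive such a curve, assuming equal n in both years, the conservative per-year SD of 100/√n from the text, and the half-normal identity E|X| = σ·√(2/π) (whether the dashboard's dashed curve uses exactly this identity is an assumption):

```javascript
// Expected absolute year-over-year change under pure sampling noise.
// The difference of two independent readings, each with SD 100/sqrt(n),
// has SD = sqrt(2) * 100/sqrt(n); for a normal variable E|X| = SD * sqrt(2/PI).
const expectedAbsChange = (n) =>
  Math.sqrt(2 / Math.PI) * Math.sqrt(2) * (100 / Math.sqrt(n));
```

At n = 100 this gives about 11.3 pp of expected absolute change from noise alone, and quadrupling n halves it, which is the 1/√n shape of the dashed curve.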

The noise curve sits above most of the data because 100/√n is ~2× the actual binomial SD at typical L3/L4% values. The empirical smooth shows that actual median change is much lower than the conservative noise ceiling — consistent with the table in Section 1 where conservative noise exceeds RMSD. Both reflect a real phenomenon: observed year-to-year change at most schools is modest, and the conservative noise model is intentionally pessimistic.

By size bin

Box plots show distribution compression as n increases — the IQR narrows substantially from the smallest to largest schools.


Section 3: Surprise Scores

Sections 1–2 show aggregate variability. But the useful question is: for this school, was this change real? A "surprise score" (z-score) compares each school's observed change to what we'd expect from noise alone, using an empirically calibrated noise model — not just the binomial formula.

Method. For each lag-1 transition, the expected SD of change has two components:

  1. Sampling noise — known from n₁ and n₂ (binomial)
  2. Cohort replacement + baseline drift — estimated from the data as the residual variance after subtracting sampling noise

We estimate the "true" within-school SD (σ_true) from the full population of lag-1 pairs:

σ²_observed = σ²_true + σ²_sampling
σ²_true = max(0, σ²_observed − median(σ²_sampling))

Then each school's expected SD is √(σ²_true + σ²_sampling_i), and z = Δ / σ_expected.
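A sketch of this calibration, simplified to a single median over sampling variances (the dashboard's actual estimator may smooth by size or year; changes and SDs in percentage points):

```javascript
// Binomial sampling SD of the *change* between two years:
// var(change) = var(year1) + var(year2).
const samplingSD = (n1, n2, p) =>
  100 * Math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2));

// Estimate sigma^2_true from the full population of lag-1 pairs.
// pairs: [{change, samplingSD}]; changes are treated as mean-zero.
function calibrate(pairs) {
  const obsVar = pairs.reduce((s, p) => s + p.change ** 2, 0) / pairs.length;
  const sampVars = pairs.map((p) => p.samplingSD ** 2).sort((a, b) => a - b);
  const m = sampVars.length >> 1;
  const medSampVar =
    sampVars.length % 2 ? sampVars[m] : (sampVars[m - 1] + sampVars[m]) / 2;
  return Math.max(0, obsVar - medSampVar); // sigma^2_true
}

// Surprise score for one school's transition.
const surprise = (change, sigma2True, sdSamplingI) =>
  change / Math.sqrt(sigma2True + sdSamplingI ** 2);
```

With changes of ±3 and ±4 pp and a common sampling SD of 2 pp, the observed variance is 12.5, the median sampling variance is 4, so σ²_true = 8.5 and a 6 pp move scores z ≈ 1.7.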

Variance decomposition (lag-1 pairs, schools with n ≥ 10):

Z-score histogram. If the noise model is well-calibrated, this should resemble a standard normal (dashed curve). Fat tails mean the model underestimates true variation; a too-narrow distribution means it overestimates noise.

Of all transitions, (%) are normal (|z| < 1.5), (%) are notable (1.5 ≤ |z| < 2.5), and (%) are surprising (|z| ≥ 2.5). Under a well-calibrated model, ~12% should be notable and ~1.2% surprising.

Surprise score vs school size. Small schools rarely produce surprising z-scores because their wide expected SD absorbs most changes. Large schools with big moves are the ones that stand out.

Most surprising changes

The 30 most surprising year-over-year transitions (|z| ≥ 1.5). These are the changes most likely to reflect genuine school-level shifts rather than noise.


Section 4: Model-Implied Estimates

The Shrunk Level (EB) model is run on all school histories. Its estimate for each school is a blend of last year's result and the school's own weighted-historical mean, with more weight on last year for larger schools. Model-implied estimates are joined back to the observed data for visual comparison.
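A minimal sketch of that blend. The actual λ calibration in school-models.js may differ; here λ is formed from the empirical drift variance versus the conservative sampling variance, which reproduces the size behaviour described above:

```javascript
// Shrunk-level (empirical-Bayes) estimate: blend last year's observed value
// toward the school's own historical mean, trusting last year more when
// measurement noise is small (large n). Values in percentage points.
function shrunkLevel(lastYear, historyMean, n, driftVar) {
  const samplingVar = (100 / Math.sqrt(n)) ** 2; // conservative noise, pp^2
  const lambda = driftVar / (driftVar + samplingVar); // weight on last year
  return lambda * lastYear + (1 - lambda) * historyMean;
}
```

With driftVar = 20 pp², a very large school (n = 10⁶) gets λ ≈ 1 and the estimate sits essentially on last year's value, while a tiny school (n = 4) gets λ ≈ 0.008 and is pulled almost entirely to its historical mean.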

Observed vs model-implied (2025). Color = assessed students (log, viridis). Large schools (small measurement noise) cluster on the diagonal — the model trusts last year's result. Small schools are pulled toward their long-run weighted mean and scatter further from the diagonal.

4-year trajectories for a small (5th-percentile), median, and large (95th-percentile) school. Solid black = observed; dashed = all four model estimates. For the small school, history-based models pull the estimate toward the long-run mean; for the large school all models converge toward last year's result (less shrinkage).

Residuals vs school size. Larger schools have smaller in-sample residuals, confirming that the noise model tracks school size.


Section 5: Predictive Validation

Train on 2022–2024, predict 2025, compare to actual. Every registered model in MODELS is evaluated automatically — adding a new model to school-models.js makes it appear here with no page changes.
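The registry shape below is a hypothetical illustration of this pattern; school-models.js's actual interface is not reproduced here. The point is that each model is a named entry mapping a school's history to a prediction, so the validation page can iterate over MODELS without knowing any model individually:

```javascript
// Hypothetical MODELS registry: each entry takes a school's history
// (array of {year, value, n}, oldest first) and returns a prediction
// for the next year.
const MODELS = {
  "Last year": (history) => history[history.length - 1].value,
  "Weighted mean": (history) => {
    const wsum = history.reduce((s, h) => s + h.n, 0);
    return history.reduce((s, h) => s + h.n * h.value, 0) / wsum;
  },
};

// Adding a model is one new entry; no page changes needed.
MODELS["EMA half-life 2yr"] = (history) => {
  const last = history[history.length - 1].year;
  let wsum = 0;
  let vsum = 0;
  for (const h of history) {
    const w = Math.pow(0.5, (last - h.year) / 2); // half-life of 2 years
    wsum += w;
    vsum += w * h.value;
  }
  return vsum / wsum;
};
```

On a rising history the EMA prediction lands between the flat weighted mean and last year's value, since recent years carry more weight but no slope is extrapolated.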

Prediction error by size bin and model (train 2022–2024, predict 2025).

RMSE by school size. Models: Last year (random walk), Weighted mean (flat history), EMA half-life 2yr (time-decay mean — recent years weighted more, no trend extrapolation), Shrunk level EB (James-Stein blend of last year toward school's own mean — weight on last year scales with school size).

Calibration scatter (Shrunk level EB, 2025). Well-calibrated predictions scatter symmetrically around the diagonal. Systematic bow or offset would indicate model bias.


Section 5b: All-Combination Model Comparison

Predictive accuracy for all 4 models across every combination of grade (3/6), subject (Reading / Writing / Math), and metric (L3/L4%, Worst-case L3/L4%, L4%) — 18 combinations, no clicking needed. Train 2022–2024, predict 2025.

Each line is one model; x-axis shows grade × subject combinations. Facets split by metric. Lower is better — a model that sits consistently below the others is the best predictor. Note: raw RMSE is not comparable across metrics because they have different scales (e.g. L4% lives in a much narrower range than L3/L4%). Use the R² chart below for cross-metric comparison.

R² by metric — correlation R² (Pearson r²) between predicted and actual 2025 values. Measures how well the model tracks cross-school variation, independent of systematic bias from province-wide shifts between training and test periods. Unlike RMSE, r² is scale-invariant, always in [0, 1], and directly comparable across metrics. Higher is better. A metric whose r² is consistently higher across all models is genuinely more predictable — not just living on a narrower scale.
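The scale-invariance claim is easy to see in code: Pearson r is unchanged by adding a constant to, or rescaling, either series, so a uniform province-wide shift between training and test years leaves r² alone. A sketch of the computation:

```javascript
// Correlation R^2 (Pearson r squared) between predicted and actual values.
function r2(pred, actual) {
  const mean = (xs) => xs.reduce((s, x) => s + x, 0) / xs.length;
  const mp = mean(pred);
  const ma = mean(actual);
  let cov = 0;
  let vp = 0;
  let va = 0;
  for (let i = 0; i < pred.length; i++) {
    cov += (pred[i] - mp) * (actual[i] - ma);
    vp += (pred[i] - mp) ** 2;
    va += (actual[i] - ma) ** 2;
  }
  const r = cov / Math.sqrt(vp * va);
  return r * r;
}
```

A prediction that is uniformly 10 pp too low still scores r² = 1, whereas its RMSE would be 10 pp; that is exactly the bias-insensitivity described above.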

RMSE (pp) and R² for each model. Best = model with highest R² for that row. SD(actual) shows the cross-school standard deviation of 2025 values — metrics with smaller SD have less spread to predict, so raw RMSE will look lower even if the model explains no more of the variance.


Section 6: Methodology Notes

  • Why own-history smoothing: The school's own trajectory is the most relevant prior. School administrators trust their own data more than comparisons to board averages.
  • Shrunk level EB: Blends last year's result toward the school's own weighted-historical mean. The blend weight (λ) is calibrated from the empirical year-to-year drift variance: large schools have small measurement noise so λ → 1 (rely on last year); small schools have large measurement noise so λ → 0 (rely on history). This is the James-Stein / empirical-Bayes optimal predictor for a random-walk level with known noise.
  • EMA (half-life 2yr): Exponentially time-decayed weighted mean. Gives recent years more weight than older ones without projecting any slope forward. A useful complement when the question is "what does recent data suggest?" rather than "what is the school's long-run level?"
  • The noise model: 100/√n (not binomial MoE). See Measurement Reliability for full rationale.
  • Limitations: A genuine sudden shift (new principal, program change) in a small school will be partially smoothed away. The raw observed value is always shown alongside the model. The model assumes some year-to-year persistence in a school's underlying quality.
  • Out-of-sample validation: In-sample fit always looks good. The train-on-3/predict-4 test shows whether the model genuinely reduces uncertainty or just overfits.
  • Adding new models: Register a new entry in MODELS in school-models.js — it will appear automatically in Section 5's validation table with no page changes needed.