Measurement Reliability

How confident can we be in a school's reported achievement percentage? This page documents the reliability framework used across the dashboard.

Achievement numbers are less reliable when based on fewer assessments (small sample) or when the students assessed are not representative of the cohort (selection bias from low participation). Interpret small-school results and year-over-year changes with caution.

Why This Matters

Three situations compound the problem:

  1. Small sample size — a school with 20 assessed students has far more sampling noise than one with 200. Percentages from small schools bounce around from year to year even if nothing real changed.

  2. Low participation rate — if only 75% of registered students were assessed, the missing 25% may not be random. Exemption and absence patterns can systematically exclude lower-performing students, inflating the reported achievement percentage.

  3. Derived quantities amplify error — rankings, year-over-year changes, and subgroup gaps are all differences or comparisons. The noise in a difference is larger than the noise in either snapshot. A 5 pp "improvement" at a school with 30 students is well within noise.
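(Using the noise formulas introduced below: a year-over-year change at a school with n = 30 has noise of √(18.3² + 18.3²) ≈ 25.8 pp, about five times the reported 5 pp change.)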

Schools at the extremes of any ranking are especially likely to be there partly because of measurement error: selecting the most extreme results preferentially selects favorable noise, which is why extreme performers tend to regress toward the mean on the next measurement.


Noise from Sample Size

We quantify sampling noise as:

noise (pp) = 100 / √n

where n is the number of fully participating (assessed) students. This gives the scale of random fluctuation in percentage points:

Assessed students   Noise (pp)   Interpretation
 10                 31.6         Essentially uninformative — a school at 60% could easily be anywhere from 30% to 90%
 25                 20.0         Very noisy — a 20 pp swing is expected by chance
 50                 14.1         Still substantial — ranking these schools is unreliable
100                 10.0         Moderate — broad patterns are visible, fine distinctions are not
200                  7.1         Reasonable for most comparisons
500                  4.5         Board-level precision — most boards are here or above
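
A minimal sketch of the noisePP helper in reliability.js, assuming its implementation mirrors the formula above (the real code may differ):

    // Sampling noise in percentage points for n assessed students: 100 / √n.
    function noisePP(n) {
      if (!Number.isFinite(n) || n <= 0) return null; // no assessed students, no estimate
      return 100 / Math.sqrt(n);
    }

    noisePP(25);  // 20.0
    noisePP(200); // ≈ 7.1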

Why not the binomial margin of error?

The standard formula √(p(1−p)/n) gives a tighter interval when the percentage is near 0% or 100%. But this is misleading in our context: if the average school is at 50% and one school reports 90%, the binomial formula says that extreme value is more precise. In reality, the extreme school is more likely to be an outlier inflated by noise. Using 100/√n, which is roughly the 95% binomial margin of error at p = 50%, avoids placing false confidence at the tails.
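
To make the contrast concrete, an illustrative comparison (not part of reliability.js) at n = 50:

    // Binomial standard error in pp versus the flat 100 / √n rule.
    const n = 50;
    const binomialSE = (p) => 100 * Math.sqrt((p * (1 - p)) / n);

    binomialSE(0.5);    // ≈ 7.1 pp at a 50% achievement rate
    binomialSE(0.9);    // ≈ 4.2 pp at 90%: tightest exactly where outliers live
    100 / Math.sqrt(n); // ≈ 14.1 pp regardless of the reported percentage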

For differences and gaps

When comparing two independent estimates (gender gap, ELL gap, year-over-year change), the noise of the difference is:

noise_diff = √(noise_a² + noise_b²)

A gender gap at a school with 40 female and 35 male assessed students has noise = √(15.8² + 16.9²) = 23.1 pp. A reported gap of 15 pp is well within noise.
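
A sketch of the gapNoise helper, assuming it takes the two assessed counts and combines the per-group noise in quadrature (the signature is an assumption):

    // Noise of a difference between two independent percentage estimates.
    function gapNoise(nA, nB) {
      const a = 100 / Math.sqrt(nA); // noisePP(nA)
      const b = 100 / Math.sqrt(nB); // noisePP(nB)
      return Math.sqrt(a * a + b * b);
    }

    gapNoise(40, 35); // ≈ 23.1 pp, the gender-gap example above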


Selection Bias from Low Participation

Participation rates below 100% introduce a risk of non-representative sampling that cannot be quantified from the data alone. However, the degree of departure from 100% is informative:

Participation penalty function

We model the penalty as a power function:

weight = (participation rate)^k

where the participation rate is expressed as a fraction in [0, 1] and k controls the severity of the penalty (default k=3). The penalty compounds multiplicatively: small departures from 100% incur a mild discount, but larger departures are penalized much more heavily than a linear discount would imply — reflecting the intuition that a 4% absence rate is mostly random, while a 20% rate likely reflects systematic exclusion.

Participation    k=2      k=3      k=4
100%             1.000    1.000    1.000
 96%             0.922    0.885    0.849
 90%             0.810    0.729    0.656
 80%             0.640    0.512    0.410
 70%             0.490    0.343    0.240
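
A sketch of participationWeight, assuming the rate is passed as a fraction and k defaults to 3 as documented above:

    // Penalty weight for participation below 100%; rate is a fraction in [0, 1].
    function participationWeight(rate, k = 3) {
      const clamped = Math.min(Math.max(rate, 0), 1); // guard against bad inputs
      return Math.pow(clamped, k);
    }

    participationWeight(0.96); // ≈ 0.885
    participationWeight(0.80); // 0.512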

Composite Reliability Score

Combining both dimensions into a single number:

reliability = √n × participation^k

This can be interpreted as an "effective √n" — the precision you'd expect after accounting for both sample size and selection bias risk. Higher is better.

School profile                      Score (k=3)
n=100, 98% participation             9.4
n=25,  98% participation             4.7
n=100, 80% participation             5.1
n=25,  80% participation             2.6
n=400, 95% participation (board)    17.1
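
A sketch of reliabilityScore under the same assumptions:

    // Composite "effective √n": sample-size precision discounted by participation risk.
    function reliabilityScore(n, rate, k = 3) {
      return Math.sqrt(n) * Math.pow(rate, k);
    }

    reliabilityScore(100, 0.98); // ≈ 9.4, the first profile above
    reliabilityScore(400, 0.95); // ≈ 17.1, the board-level profile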

Subgroup Reliability

For demographic subgroups (gender, ELL, special needs), the relevant sample size is the subgroup count, not the school total. EQAO suppresses data when the subgroup has fewer than 10 assessed students, so the minimum reported subgroup count is 10 — giving a maximum noise of ~31.6 pp.

ELL subgroups are particularly sparse: the median school-level ELL count is around 19 students, meaning the typical ELL percentage has noise of ~23 pp. Most school-level ELL gaps are within noise.
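
For example, using the gapNoise sketch above with hypothetical counts of 19 ELL and 160 non-ELL assessed students:

    gapNoise(19, 160); // ≈ 24.3 pp; a reported ELL gap of 10 pp is well within noise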

Per-subgroup participation rates are not published by EQAO. The overall school participation rate is used as a proxy for subgroup participation risk.


Applying Reliability to the Dashboard

The reliability infrastructure supports several approaches, which can be applied page by page:

Visual encoding — less reliable data points rendered with lower opacity or smaller marks, so they visually recede while remaining available for inspection.

Threshold filtering — interactive controls (sliders) to set a minimum sample size, minimum participation rate, or minimum composite reliability score. Points below the threshold are hidden or dimmed (a SQL sketch appears at the end of this section).

Annotation — asterisks or symbols next to data points that fall below reliability thresholds, with tooltip details showing n, participation rate, and noise.

Exclusion — for analyses like rankings or trend detection, dropping data points below a reliability threshold before computing ranks or fits, with the threshold clearly stated.

The specific thresholds are deliberately not fixed — different analyses may warrant different cutoffs, and exploring the sensitivity of conclusions to threshold choice is itself informative.
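
As one illustration of threshold filtering, a sketch of a DuckDB predicate assembled in JavaScript. The column names n_assessed and participation_rate, and the table name, are illustrative assumptions, not the actual parquet schema:

    // WHERE-clause fragment keeping rows at or above a minimum composite score.
    function reliabilityPredicate(minScore, k = 3) {
      return `sqrt(n_assessed) * pow(participation_rate, ${k}) >= ${minScore}`;
    }

    // e.g. `SELECT * FROM g6_math WHERE ${reliabilityPredicate(5)}`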


Data Available for Reliability Assessment

The dashboard parquets include the following columns for reliability computation:

G3/G6 (per subject: Read, Write, Math):

G9 (math only):

Shared JavaScript utilities are in reliability.js: noisePP(n), gapNoise(), participationWeight(), reliabilityScore(), and DuckDB SQL helpers.