Measurement Reliability
How confident can we be in a school's reported achievement percentage? This page documents the reliability framework used across the dashboard.
Why This Matters
Three situations undermine confidence in a reported achievement percentage:
- Small sample size — a school with 20 assessed students has far more sampling noise than one with 200. Percentages from small schools bounce around from year to year even if nothing real changed.
- Low participation rate — if only 75% of registered students were assessed, the missing 25% may not be random. Exemption and absence patterns can systematically exclude lower-performing students, inflating the reported achievement percentage.
- Derived quantities amplify error — rankings, year-over-year changes, and subgroup gaps are all differences or comparisons. The noise in a difference is larger than the noise in either snapshot. A 5 pp "improvement" at a school with 30 students is well within noise.
Schools at the extremes of any ranking are especially likely to be there partly because of measurement error — regression to the mean guarantees that the most extreme results are disproportionately noisy.
Noise from Sample Size
We quantify sampling noise as:
noise (pp) = 100 / √n
where n is the number of fully participating (assessed) students. This gives the scale of random fluctuation in percentage points:
| Assessed students | Noise (pp) | Interpretation |
|---|---|---|
| 10 | 31.6 | Essentially uninformative — a school at 60% could easily be anywhere from 30% to 90% |
| 25 | 20.0 | Very noisy — a 20 pp swing is expected by chance |
| 50 | 14.1 | Still substantial — ranking these schools is unreliable |
| 100 | 10.0 | Moderate — broad patterns are visible, fine distinctions are not |
| 200 | 7.1 | Reasonable for most comparisons |
| 500 | 4.5 | Board-level precision — most boards are here or above |
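The table values follow directly from the formula. A plausible sketch of the `noisePP` helper mentioned in `reliability.js` (the implementation shown here is illustrative, assuming the signature `noisePP(n)` from the utilities list):

```javascript
// Sampling-noise scale in percentage points for n assessed students.
// Returns Infinity for missing or non-positive counts so callers can
// treat such rows as maximally unreliable.
function noisePP(n) {
  if (!Number.isFinite(n) || n <= 0) return Infinity;
  return 100 / Math.sqrt(n);
}

noisePP(100); // 10 pp
noisePP(25);  // 20 pp
```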
Why not the binomial margin of error?
The standard margin-of-error formula, 100 × √(p(1−p)/n) with p as a proportion, gives a tighter interval when the percentage is near 0% or 100%. But this is misleading in our context: if the average school is at 50% and one school reports 90%, the binomial formula says the extreme value is more precise. In reality, the extreme school is more likely to be an outlier inflated by noise. Using 100/√n avoids placing false confidence at the tails.
For differences and gaps
When comparing two independent estimates (gender gap, ELL gap, year-over-year change), the noise of the difference is:
noise_diff = √(noise_a² + noise_b²)
A gender gap at a school with 40 female and 35 male assessed students has noise = √(15.8² + 16.9²) = 23.1 pp. A reported gap of 15 pp is well within noise.
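The worked gap example can be reproduced with a sketch of the `gapNoise` helper (signature assumed; the utilities list gives only the name):

```javascript
// Noise of a difference between two independent estimates, e.g. a gender
// gap between a group of nA and a group of nB assessed students.
// Noise terms add in quadrature: sqrt(a^2 + b^2).
function gapNoise(nA, nB) {
  const a = 100 / Math.sqrt(nA);
  const b = 100 / Math.sqrt(nB);
  return Math.sqrt(a * a + b * b);
}

gapNoise(40, 35); // ≈ 23.1 pp — a reported 15 pp gap is within noise
```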
Selection Bias from Low Participation
Participation rates below 100% introduce a risk of non-representative sampling that cannot be quantified from the data alone. However, the degree of departure from 100% is informative:
- 96–100% — a few random absences. Low risk of bias.
- 90–95% — some exemptions, possibly systematic. Moderate risk.
- 80–89% — enough missing students to materially shift results. High risk.
- Below 80% — results should be interpreted with significant caution.
Participation penalty function
We model the penalty as a power function:
weight = (participation rate)^k
where k controls the severity of the penalty (default k=3). Raising the rate to the k-th power compounds the discount: with k=3, a 4% absence rate costs about 11.5% of the weight, while a 20% absence rate costs nearly half, reflecting the intuition that a 4% absence rate is mostly random, while a 20% rate likely reflects systematic exclusion.
| Participation | k=2 | k=3 | k=4 |
|---|---|---|---|
| 100% | 1.000 | 1.000 | 1.000 |
| 96% | 0.922 | 0.885 | 0.849 |
| 90% | 0.810 | 0.729 | 0.656 |
| 80% | 0.640 | 0.512 | 0.410 |
| 70% | 0.490 | 0.343 | 0.240 |
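A minimal sketch of the `participationWeight` helper consistent with the table above (the default `k = 3` and the clamping are assumptions about the implementation):

```javascript
// Participation-rate weight: rate^k, with rate a fraction in [0, 1]
// (e.g. 0.96 for 96% participation) and k the severity exponent.
function participationWeight(rate, k = 3) {
  const clamped = Math.min(Math.max(rate, 0), 1);
  return Math.pow(clamped, k);
}

participationWeight(0.96);    // ≈ 0.885
participationWeight(0.80, 4); // ≈ 0.410
```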
Composite Reliability Score
Combining both dimensions into a single number:
reliability = √n × participation^k
This can be interpreted as an "effective √n" — the precision you'd expect after accounting for both sample size and selection bias risk. Higher is better.
| School profile | Score (k=3) |
|---|---|
| n=100, 98% participation | 9.4 |
| n=25, 98% participation | 4.7 |
| n=100, 80% participation | 5.1 |
| n=25, 80% participation | 2.6 |
| n=400, 95% participation (board) | 17.1 |
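Combining the two pieces gives a sketch of `reliabilityScore` that reproduces the table (signature and guard behavior are assumptions):

```javascript
// Composite reliability: sqrt(n) discounted by the participation weight.
// Interpretable as an "effective sqrt(n)"; higher is better.
function reliabilityScore(n, rate, k = 3) {
  if (!Number.isFinite(n) || n <= 0) return 0;
  const weight = Math.pow(Math.min(Math.max(rate, 0), 1), k);
  return Math.sqrt(n) * weight;
}

reliabilityScore(100, 0.98); // ≈ 9.4
reliabilityScore(25, 0.80);  // ≈ 2.6
```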
Subgroup Reliability
For demographic subgroups (gender, ELL, special needs), the relevant sample size is the subgroup count, not the school total. EQAO suppresses data when the subgroup has fewer than 10 assessed students, so the minimum reported subgroup count is 10 — giving a maximum noise of ~31.6 pp.
ELL subgroups are particularly sparse: the median school-level ELL count is around 19 students, meaning the typical ELL percentage has noise of ~23 pp. Most school-level ELL gaps are within noise.
Per-subgroup participation rates are not published by EQAO. The overall school participation rate is used as a proxy for subgroup participation risk.
Applying Reliability to the Dashboard
The reliability infrastructure supports several approaches, which can be applied page by page:
- Visual encoding — less reliable data points are rendered with lower opacity or smaller marks, so they visually recede while remaining available for inspection.
- Threshold filtering — interactive controls (sliders) set a minimum sample size, minimum participation rate, or minimum composite reliability score. Points below the threshold are hidden or dimmed.
- Annotation — asterisks or symbols next to data points that fall below reliability thresholds, with tooltip details showing n, participation rate, and noise.
- Exclusion — for analyses like rankings or trend detection, data points below a reliability threshold are dropped before computing ranks or fits, with the threshold clearly stated.
The specific thresholds are deliberately not fixed — different analyses may warrant different cutoffs, and exploring the sensitivity of conclusions to threshold choice is itself informative.
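Threshold filtering and visual encoding can be combined in a single pass over the data. A minimal sketch, using hypothetical `assessed` and `participation` record fields (not the actual parquet column names) and the composite score formula:

```javascript
// Annotate each school record with a reliability score, a visibility
// flag against a user-chosen threshold, and a fade-out opacity.
// Field names here are illustrative, not the real parquet columns.
function encodeReliability(schools, minScore, k = 3) {
  return schools.map((s) => {
    const score = Math.sqrt(s.assessed) * Math.pow(s.participation, k);
    return {
      ...s,
      reliability: score,
      visible: score >= minScore,
      // Fade low-reliability points instead of hiding them outright.
      opacity: Math.min(1, score / 10),
    };
  });
}
```

Keeping the threshold as a parameter (rather than a constant) matches the note above: different analyses may warrant different cutoffs.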
Data Available for Reliability Assessment
The dashboard parquets include the following columns for reliability computation:
G3/G6 (per subject: Read, Write, Math):
- `cntStudents_*` — registered students (sample frame)
- `cntFullyParticipating_*` — assessed students (MoE denominator)
- `pctFullyParticipating_*` — participation rate (%)
- `cntOverall{R,W,M}_{G1,G2,E1,S1}_Part` — subgroup assessed counts
G9 (math only):
- `cntStudents_All` — registered students
- `cntFullyParticipating` — assessed students
- `pctFullyParticipating` — participation rate
- `cntFullyParticipating_{G1,G2,E1,S1}` — subgroup assessed counts
Shared JavaScript utilities are in `reliability.js`: `noisePP(n)`, `gapNoise()`, `participationWeight()`, `reliabilityScore()`, and DuckDB SQL helpers.