METHODOLOGY · v1
Brier score, calibration plot, reliability diagram, and per-sub-area small multiples — Opus-4.7-graded ground truth with founder spot-check.
Ground-truth disclosure. Opus-4.7-graded ground truth with founder spot-check, not founder-graded.
BRIER SCORE · 30-DAY
Lower is better. Brier score is mean squared error between the model’s predicted probability and the eventual binary outcome. Our public methodology gate is ≤ 0.30; current value is 0.224 with 95% Wilson CI [0.176, 0.281].
CALIBRATION
Each dot is one of ten predicted-probability deciles. The dashed grey line is perfect calibration; the solid line is the OLS fit through the deciles (slope 0.82, intercept 0.09).
PER-SUB-AREA · 7 CS.AI AREAS
Aggregate scores hide regression in any one sub-area. We track each of the seven cs.AI sub-areas separately so a regression in one can’t get diluted by gains elsewhere. The bar marks the 95% Wilson interval; the dot marks the point estimate.
CITATION VELOCITY
The Top-10 Daily Dashboard composes its rank from a base Signal-Fusion score plus a small per-paper velocity term:
signal_fusion_score = base_signal_fusion + velocity_weight * normalized_velocity_30d
Current velocity_weight= 0.30. The value is versioned in config/signal_fusion_weights.yaml and any change is routed through ShadowPromoter (kind=weight_promotion); BrierGate auto-reverts on calibration drop and starts the 7-day cooldown. velocity_weight=0 reproduces the prior P3 ranking byte-for-byte (zero-regression invariant).
AUDIT TRAIL
apps/web/lib/methodology/brier.ts and re-run the math against your own holdout.apps/api/cron/methodology_recompute.py (06:30 UTC daily, after the 06:00 scorecard sweep).RELIABILITY
Bar height is the empirical observed-positive rate within each decile. Bar opacity scales with the number of predictions that landed in the bucket.