Updated on
-
AI doesn't identify exceptional or clearly weak papers. → Figure 1
Humans use the full 1–5 range, giving the top score (1) to 12% of papers and the bottom score (5) to 7%. The best frontier AI models give the top score to fewer than 4% of papers and the bottom score to fewer than 2%, and allocate 74–80% of their scores to 2 or 3. Prompting AI to be more or less critical does not fix the compression problem: it shifts the whole distribution without singling out exceptional or weak papers.
-
Frontier AI's per-paper agreement with human reviewers is weak — but matches, or modestly exceeds, the agreement between two humans. → Figure 2
Opus 4.7's score correlates with a randomly chosen human reviewer's score at 0.26. Two human reviewers' scores correlate at 0.19, comparable to the reviewer-pair correlations Welch (2014) reports for the Journal of Finance (0.15), the Review of Financial Studies (0.14), and the SFS Cavalcade conference (0.20). So the best AI today is at or modestly above human-reviewer level on per-paper agreement. However, disagreement does not necessarily mean poor review quality. The human-to-human correlation of 0.19 translates to reviewers placing roughly 0.33 weight on a common (unobserved) signal and 0.67 weight on their own idiosyncratic signal, such as taste, priors, or what they happen to find compelling. The 0.33 figure is again comparable to JF (0.32), RFS (0.31), and Cavalcade (0.36) in Welch (2014). Idiosyncrasy does not mean wrong. The more important question is not who agrees with whom, but whose evaluations predict the papers' future impact.
-
Where humans commit to a strong call — top or bottom — they predict future citations better than AI. In the middle, AI does better. → Figure 3
Among papers that their two human reviewers rated near the top (mean ≤ 1.5) or near the bottom (mean ≥ 4.0), human scores predict future citations more strongly than AI scores do. Identifying the top and bottom of the distribution matters for the conference program, since it determines which papers to include and which to clearly reject, and human reviewers outperform AI in these regions. In the middle, the precise score has little bearing on the program outcome, and human scores carry essentially no signal for future citations. In this region, AI outperforms humans.
-
AI is improving fast. → Figures 4–6
From November 2025 to April 2026, OpenAI released GPT‑5.1 → 5.4 → 5.5 and Anthropic released Opus 4.5 → 4.6 → 4.7. Over those five months, the distributional similarity to human reference, per-paper agreement with human reviewers, and predictability of future citations all improved significantly.
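The step from a 0.19 reviewer-pair correlation to a weight of roughly 0.33 on the common signal can be reproduced under one simple model. The sketch below is an assumption, not a method stated in the text: each reviewer's score is taken to be `w * common + (1 - w) * idiosyncratic`, with the two signals independent and of equal variance, so that two reviewers' scores correlate at `w² / (w² + (1 - w)²)`. The function names are hypothetical. The weights quoted for JF, RFS, and Cavalcade may rest on a different estimation, so they need not match this formula exactly.

```python
import math


def pair_correlation(weight: float) -> float:
    """Correlation between two reviewers' scores under the assumed model.

    Assumption: score = weight * common + (1 - weight) * idiosyncratic,
    with independent, equal-variance signals.
    """
    common_var = weight ** 2
    idio_var = (1.0 - weight) ** 2
    return common_var / (common_var + idio_var)


def implied_common_weight(pair_corr: float) -> float:
    """Invert pair_correlation: the weight on the common signal implied by
    an observed reviewer-pair correlation (same assumed model)."""
    if not 0.0 <= pair_corr <= 1.0:
        raise ValueError("pair_corr must lie in [0, 1]")
    root_r = math.sqrt(pair_corr)
    return root_r / (root_r + math.sqrt(1.0 - pair_corr))


if __name__ == "__main__":
    w = implied_common_weight(0.19)
    # Prints the implied weight on the common signal and its complement.
    print(f"common weight: {w:.2f}, idiosyncratic weight: {1 - w:.2f}")
```

Under this assumed model a pair correlation of 0.19 implies a common-signal weight of about 0.33, which is why a low correlation between reviewers is compatible with each reviewer still carrying substantial shared information.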
-
Opus 4.8 shows little to no improvement compared to Opus 4.7.
-
Fable 5 is the closest model yet to the human score distribution, breaking Opus 4.8's plateau.
-
Fable 5 agrees with a human reviewer clearly more than two human reviewers agree with each other.
-
Humans still hold the tails, on which accept/reject decisions hinge; Fable 5's predictive edge in the middle region is the largest yet.
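Distributional similarity to the human reference, mentioned in the findings above, can be quantified in several ways, and the text does not say which measure was used. As a minimal sketch, assuming total variation distance as the measure, the code below computes the share of scores at each point of the 1–5 scale and compares two distributions. The function names and the sample scores are hypothetical: only the human tail shares (12% top, 7% bottom) and the rough AI shape (tails under 4% and 2%, 74–80% at 2 or 3) follow the text, and the remaining counts are made up for illustration.

```python
from collections import Counter

SCALE = (1, 2, 3, 4, 5)  # 1 = top score, 5 = bottom score, as in the text


def score_shares(scores):
    """Fraction of scores at each point of the 1-5 scale."""
    if not scores:
        raise ValueError("scores must be non-empty")
    counts = Counter(scores)
    return {s: counts.get(s, 0) / len(scores) for s in SCALE}


def total_variation(p, q):
    """Total variation distance between two share dictionaries
    (0 = identical distributions, 1 = no overlap)."""
    return 0.5 * sum(abs(p[s] - q[s]) for s in SCALE)


# Illustrative, made-up samples of 100 scores each (not the study's data).
human_scores = [1] * 12 + [2] * 30 + [3] * 31 + [4] * 20 + [5] * 7
ai_scores = [1] * 3 + [2] * 40 + [3] * 38 + [4] * 18 + [5] * 1

human = score_shares(human_scores)
ai = score_shares(ai_scores)

if __name__ == "__main__":
    print("AI share at 2 or 3:", round(ai[2] + ai[3], 2))
    print("total variation distance:", round(total_variation(human, ai), 2))
```

A measure of this kind falls as the AI places more mass in the tails, whereas shifting every score up or down by a prompt change moves the distribution without necessarily bringing it closer to the human one.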