- AI struggles to identify exceptional papers. Human reviewers use the full 1–5 range and give the top score to 12% of papers. AI models give middling, uncontroversial scores of 2 or 3 to 74–80% of papers and almost never award the top score. Prompting AI to be more or less critical does not fix this: it shifts the whole distribution without singling out the exceptional papers.
- AI's rankings track human rankings only weakly. The highest correlation between AI and human scores is just 0.31, achieved by Opus 4.7 (0 = random ranking, 1 = complete agreement; see the sketch after this list). Who is right when they disagree? Using future citations as a quality proxy, humans clearly beat AI at flagging both exceptional and clearly weak papers; AI's predictive advantage is concentrated in the middle of the distribution, where human reviewers tend to give middling scores and the human ranking is largely flat with respect to citations.
- The gap is closing fast. From November 2025 to April 2026, OpenAI released GPT‑5.1 → 5.4 → 5.5 and Anthropic released Opus 4.5 → 4.6 → 4.7. Across the two lineages, the distributional distance to the human reference (the L1 distance between the two score histograms, computed as in the sketch below) fell by an average of 47%, and the correlation rose by an average of 55%.
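
For concreteness, here is a minimal sketch of the two comparison metrics, assuming integer review scores on a 1–5 scale and assuming the correlation is a Spearman rank correlation (consistent with the "0 = random, 1 = complete agreement" gloss above, though the exact statistic is an assumption). The variable names and toy data are illustrative, not the study's actual scores.

```python
# Minimal sketch of the two metrics, assuming integer scores on a 1-5
# scale. The paired scores below are toy data, not the study's.
import numpy as np
from scipy.stats import spearmanr

def score_histogram(scores, lo=1, hi=5):
    """Normalized histogram: the share of papers at each score."""
    counts = np.bincount(scores, minlength=hi + 1)[lo:hi + 1]
    return counts / counts.sum()

def l1_distance(scores_a, scores_b):
    """L1 distance between two score histograms.

    0 means identical score distributions; 2 means fully disjoint.
    """
    return float(np.abs(score_histogram(scores_a) - score_histogram(scores_b)).sum())

# Hypothetical paired scores for the same 200 papers.
rng = np.random.default_rng(0)
human = rng.integers(1, 6, size=200)   # humans use the full 1-5 range
ai = rng.integers(2, 4, size=200)      # AI clusters on 2s and 3s

print(f"L1 distance between histograms: {l1_distance(ai, human):.2f}")
rho, _ = spearmanr(ai, human)          # ~0 here: the toy scores are independent
print(f"Rank correlation: {rho:.2f}")
```

Note that shifting every AI score up or down changes the histogram (and hence the L1 distance) but leaves the rank correlation unchanged, which is why a prompt that merely makes the model harsher or kinder cannot close the ranking gap.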