AI is being deployed at speed to review scientific work for conferences, journals, and grant panels. But can AI review? To find out, we use submissions to the Utah Winter Finance Conference (UWFC), one of the most selective, longest-running, and highest-quality boutique finance conferences (Reinartz, S. J., and D. Urban, 2017, Finance conference quality and publication success: A conference ranking, Journal of Empirical Finance 42, 155–174). Each UWFC paper is scored 1 (best) to 5 (worst) by two members of the program committee, a rotating pool of 120 leading finance researchers from 2017 to 2026.[†] We ask AI to score the same papers the same way, then compare its scores against the humans'. We repeat the study for each new frontier flagship release from OpenAI and Anthropic, starting with the November 2025 releases. The methodology is described in the companion paper.

[†] Only the anonymized 1–5 scores are used in the analysis — without reviewer identity or written comments.

Findings


  1. AI doesn't identify exceptional or clearly weak papers. → Figure 1

  2. Frontier AI's per-paper agreement with human reviewers is weak — but matches, or modestly exceeds, the agreement between two humans. → Figure 2

  3. Where humans commit to a strong call — top or bottom — they predict future citations better than AI. In the middle, AI does better. → Figure 3

  4. AI is improving fast. → Figures 4–6

  1. Opus 4.8 shows little to no improvement compared to Opus 4.7.

  1. Fable 5 is the closest model yet to the human score distribution, breaking Opus 4.8's plateau.

  2. Fable 5 agrees with a human reviewer clearly more than two human reviewers agree with each other.

  3. Humans still hold the tails, where accept/reject decisions hinge; Fable 5's middle-region predictive edge is the largest yet.
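
Finding 2 compares two kinds of per-paper agreement: AI versus a human reviewer, and one human reviewer versus the other. The sketch below shows one way such a comparison could be set up. It is not the method of the companion paper: the Pearson correlation is an assumed agreement measure chosen for illustration, the `pearson` helper and the variable names are hypothetical, and the six papers are made-up toy data, not UWFC scores.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical toy data: (human_a, human_b, ai) scores per paper,
# on the 1 (best) to 5 (worst) scale. Not real conference data.
papers = [
    (1, 3, 2),
    (2, 2, 3),
    (4, 3, 3),
    (5, 3, 4),
    (3, 4, 3),
    (2, 4, 3),
]

human_a = [p[0] for p in papers]
human_b = [p[1] for p in papers]
ai = [p[2] for p in papers]

# Human-human agreement: correlation between the two reviewers' scores.
human_human = pearson(human_a, human_b)

# AI-human agreement: average of the AI's correlation with each reviewer.
ai_human = mean([pearson(ai, human_a), pearson(ai, human_b)])

print(f"human-human agreement: {human_human:.2f}")
print(f"AI-human agreement:    {ai_human:.2f}")
```

Because reviewers come from a rotating pool, the labels "human A" and "human B" are arbitrary for each paper. A real analysis would need a measure that does not depend on that ordering, which this sketch ignores.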

Findings 1–3 · Look across all models

Figure 1 · Finding 1

Score Distributions: All-model average vs. human reference

Figure 2 · Finding 2

AI vs. Human Score: Each paper scored by two humans and one AI model

Figure 3 · Finding 3

Citation Prediction by Region: All-model average