Cambridge study shows why AI marking in higher education still needs human judgement

By Student Voice AI

Updated May 23, 2026

AI marking in higher education has just hit a clearer evidence limit. On 22 May 2026, the University of Cambridge published "AI not yet good enough to mark university essays, rewarding 'style over substance'", a summary of a Cambridge-led study of 761 undergraduate psychology essays from three UK universities. For teams already following Jisc's AI marking and feedback pilot, the practical point is hard to miss: current frontier models are not yet reliable enough for primary marking, even when prompts are calibrated carefully.

What the Cambridge AI marking study found

The study tested three frontier models, Claude Opus 4.6, GPT-5.4, and Gemini 3 Flash, on 761 long-form essays from 125 students at the Universities of Cambridge, Nottingham, and Manchester Metropolitan. The essays came from 50 modules and 87 assignments completed between 2022 and 2025, spanning coursework, open-book assessments, and invigilated exams. The key result is stark: agreement on UK degree classification bands ranged from 35% to 63%, with the best performance at Cambridge and the weakest at Manchester Metropolitan.

What changed the conversation is not only the average accuracy, but the pattern of error. The Cambridge report says the models showed a central tendency bias, marking the strongest essays too low and weaker essays too high. It also found the systems were oversensitive to linguistic features such as length, vocabulary range, and sentence complexity. In practice, that means AI was often rewarding writing style more than academic substance, which is precisely the failure mode institutions need to understand before they attach AI to live marking workflows.
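A team running its own pilot could look for the same central tendency pattern by comparing AI and human marks within each human-awarded band. The sketch below is illustrative only and is not the method used in the Cambridge study: the function names, the sample mark pairs, and the 40/50/60/70 thresholds (typical UK classification boundaries, which vary by institution) are all assumptions.

```python
from collections import defaultdict
from statistics import mean

# Assumed UK-style band thresholds; adjust to local regulations.
BANDS = [(70, "First"), (60, "2:1"), (50, "2:2"), (40, "Third")]


def band(mark):
    """Map a percentage mark to a degree classification band."""
    for threshold, label in BANDS:
        if mark >= threshold:
            return label
    return "Fail"


def mean_signed_error_by_band(pairs):
    """Average (AI mark - human mark), grouped by the human marker's band.

    Negative values in the top bands alongside positive values in the
    lower bands would be consistent with a central tendency bias.
    """
    errors = defaultdict(list)
    for human, ai in pairs:
        errors[band(human)].append(ai - human)
    return {label: mean(values) for label, values in errors.items()}


# Hypothetical (human, AI) mark pairs, invented for illustration.
sample = [(78, 68), (72, 66), (65, 64), (55, 58), (45, 55), (42, 52)]
bias = mean_signed_error_by_band(sample)
print(bias)
```

With real data, a per-band breakdown like this is more informative than a single average error, because under-marking at the top and over-marking at the bottom can cancel out in the overall mean.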

Cambridge's staff and student focus groups add a second warning, about legitimacy rather than pure performance:

"Many students said they would feel cheated if AI marked their work."

The feedback findings are more nuanced. The report says AI-generated feedback was typically three to eight times longer than human feedback, and that when responses were shortened to a similar length, staff and students often struggled to tell the difference. Even so, the report does not endorse AI as a sole marker. Instead, it points to narrower uses such as error detection, consistency checks, and flagging scripts that need human review.

What this means for AI marking in higher education

The first implication is that universities should separate three scenarios that often get collapsed together: AI as quality assurance, AI as a marking assistant, and AI as a primary marker. The Cambridge evidence suggests the first two may be workable with safeguards, especially where AI acts as a second pair of eyes or helps triage scripts for review. The third is a much harder case to defend. If the strongest and weakest work is where AI is least accurate, then the risk lands exactly where assessment decisions matter most.

The second implication is that local validation is non-negotiable. The same broad research design produced 63% band accuracy at Cambridge, 53% at Nottingham, and 35% at Manchester Metropolitan. That variation means universities cannot rely on a vendor demo, or even another institution's pilot, as evidence that a model will behave acceptably on their own assessments. Although Ofqual's January 2026 "Principles of AI use in marking" was written for regulated qualifications rather than university essays, its tests of validity, reliability, fairness, and transparency are still a useful benchmark for HE pilots.
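A local validation exercise can start with a figure comparable to the band agreement rates quoted above. The following is a minimal sketch, assuming agreement means the AI mark and the human mark fall in the same classification band. The function name, the default thresholds, and the sample marks are hypothetical, and the Cambridge report may define agreement differently.

```python
def band_agreement(human_marks, ai_marks, thresholds=(40, 50, 60, 70)):
    """Share of scripts where AI and human marks fall in the same band.

    The default thresholds are assumed UK-style boundaries; replace them
    with the institution's own classification rules.
    """
    if len(human_marks) != len(ai_marks) or not human_marks:
        raise ValueError("need two non-empty mark lists of equal length")

    def band_index(mark):
        # Number of thresholds reached: 0 = fail, 4 = first class.
        return sum(mark >= t for t in thresholds)

    matches = sum(
        band_index(h) == band_index(a) for h, a in zip(human_marks, ai_marks)
    )
    return matches / len(human_marks)


# Hypothetical marks for five scripts, invented for illustration.
human = [72, 65, 58, 48, 61]
ai = [68, 66, 59, 52, 63]
rate = band_agreement(human, ai)
print(f"Band agreement: {rate:.0%}")
```

Running the same calculation separately by module, assessment type, and human band helps show whether an acceptable headline rate hides weak agreement on particular kinds of work.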

The third implication is about trust. If institutions introduce AI into marking or feedback, they will need to explain where human judgement still sits, how borderline cases are handled, and what students can do if they believe a decision is wrong. This is not only a technical design issue. It is a student voice issue. Universities will need evidence on whether students see AI-assisted feedback as helpful, generic, fair, or credible, not just whether the workflow is quicker.

How student feedback analysis connects

If universities pilot AI-supported marking or AI-expanded feedback, the comments they collect will not come back as one simple AI theme. Students are likely to talk about fairness, how generic the feedback feels, speed, tone, criteria clarity, and whether the feedback still feels owned by staff. A governed workflow such as our NSS open-text analysis methodology helps teams separate those themes before they are flattened into a single headline about AI.

The Cambridge findings are also a useful reminder that generic frontier models and governed analysis workflows solve different problems. One is being asked to generate marks or feedback. The other is being used to interpret what students then say about the experience. That distinction matters when institutions compare generic LLM workflows with specialist student feedback analysis approaches, and it is why our student comment analysis governance checklist is a practical starting point before any AI marking pilot scales. If an institution wants to monitor whether students experience AI-assisted assessment as fair and useful, Student Voice Analytics can help track those themes across module evaluations with a reproducible method.

FAQ

Q: What should institutions do now if they are considering AI marking?

A: Pause any move towards sole or primary AI marking. Start with a local validation exercise using your own assessment materials, define exactly where humans review output, and collect a short round of student feedback on trust and usefulness before expanding the pilot.

Q: What is the timeline and scope of the Cambridge study?

A: Cambridge published the announcement on 22 May 2026, and the linked OpRaise report was released the same day. The study covers 761 undergraduate psychology essays from 125 students at Cambridge, Nottingham, and Manchester Metropolitan, using assessments completed between 2022 and 2025. It is a UK higher education evidence point, not a regulatory change, and its direct evidence is strongest for long-form written assessment.

Q: What is the broader implication for student voice?

A: The broader implication is that AI marking raises the evidential bar for student voice. Universities will need to distinguish faster feedback from better feedback, and automation from legitimacy, if they want assessment changes to hold up in quality review and in student trust.

References

[University of Cambridge]: "AI not yet good enough to mark university essays, rewarding 'style over substance'" Published: 2026-05-22

[University of Cambridge / OpRaise]: "AI in University Assessment: Evaluating the Opportunities and Risks of Automated Marking" Published: 2026-05-22

[Ofqual]: "Principles of AI use in marking" Published: 2026-01-14

