Updated Mar 08, 2026
At Student Voice AI, we work with universities that want student feedback to function as evidence, not just as sentiment. That makes one methodological question hard to avoid: how reliable are student evaluations of teaching in the first place? In a 2025 paper in Assessment & Evaluation in Higher Education, Erin M. Buchanan, Jacob F. Miranda and Christian Stephens examine that question directly in "The reliability of student evaluations of teaching". For UK institutions relying on module evaluations, NSS-style instruments, and free-text comments to judge teaching quality, the paper is a useful warning against treating any single score as more stable than it really is.
Student evaluations of teaching, often shortened to SETs, are still widely used to guide instructor feedback, quality processes, and administrative decisions. Yet much of the sector debate focuses on bias and validity, while spending less time on a more basic measurement issue: if the same instructor is evaluated again, how similar are the results likely to be?
Buchanan, Miranda and Stephens address that gap by analysing more than 30 years of SET data from a large public university in the United States. Using correlation coefficients and linear mixed-effects modelling, they examine how reliability varies across instructors, courses, and time. They also test a practical question that many institutions might assume has an obvious answer: if students think an instructor is fair, does that make the resulting evaluation data more reliable?
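The authors' full analysis lives in the paper itself, but a rough sense of the modelling family can help. Below is a minimal sketch (not the authors' code) of fitting a linear mixed-effects model to repeated SET scores with statsmodels, assuming a hypothetical long-format export with columns instructor_id, course_id, year, and mean_set.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format export: one row per course offering.
df = pd.read_csv("set_scores.csv")  # columns: instructor_id, course_id, year, mean_set

# Repeated offerings are grouped by instructor (random intercepts), with year
# as a fixed effect to pick up any drift in scores over time. A course-level
# term could be added via statsmodels' vc_formula argument.
model = smf.mixedlm("mean_set ~ year", data=df, groups=df["instructor_id"])
result = model.fit()
print(result.summary())
```

This only illustrates the general approach; the paper's reliability estimates come from its own specification, not from this sketch.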
The highest reliability appeared in the most tightly controlled comparison: the same instructor teaching the same course multiple times within the same semester. Even there, the reported reliability was only moderate, around r = 0.50. That matters because this is close to a best-case scenario. If reliability is only moderate when the context is highly similar, institutions should be cautious about making stronger comparisons across different modules, years, or delivery conditions.
The paper's broader point is that reliability is not a fixed property of SETs as a whole. It changes depending on the context in which the data is collected and compared. That is important for UK higher education teams, because module evaluation dashboards often imply a level of comparability that the underlying data may not fully support.
"SET reliability is context-dependent and may diminish over time."
Reliability also declined slightly over the 30-year period studied, with the authors reporting b = -0.004, 95% CI [-0.005, -0.003]. The decline is modest, but the signal matters. It suggests institutions should not assume that an evaluation instrument behaves the same way year after year simply because the questions look familiar. Changes in survey administration, student expectations, grading culture, and teaching context can all affect how stable the results are.
Perhaps the most counterintuitive finding is that perceived fairness did not predict reliability. In other words, students rating an instructor as fair did not make the evaluation scores more stable. That does not mean fairness is unimportant. It clearly matters for trust in the student experience. But this paper suggests fairness should not be treated as evidence that a survey instrument is methodologically sound.
The study also sharpens a distinction universities sometimes blur: reliability is not the same as validity. A measure has to be reasonably stable before it can be trusted at all, but stability alone still would not prove that SET scores capture teaching effectiveness accurately. This paper focuses on the first hurdle, and shows that even that hurdle is not straightforward.
For UK universities, the first implication is to stop using SET scores as if they were universally comparable. Module evaluations are most defensible when they are compared like with like: the same module structure, similar cohorts, the same instrument, and repeated delivery in a short time window. Once comparisons stretch across different modules or long periods, confidence should fall accordingly.
Second, teaching review processes should treat SETs as one strand of evidence rather than the deciding metric. If a department is reviewing teaching quality, promotion cases, or enhancement priorities, student evaluations should sit alongside peer review, assessment design evidence, continuation data, and qualitative student feedback. This paper supports a more cautious, portfolio-based approach.
Third, institutions should put more weight on recurring patterns in open-text feedback than on small fluctuations in summary scores. If comments repeatedly point to unclear explanations, poor organisation, inconsistent marking, or weak communication, that is often more actionable than arguing over whether a mean score has shifted by a few tenths. This is where Student Voice Analytics fits naturally: it helps teams analyse repeated themes in free-text comments at scale, benchmark them across similar contexts, and avoid over-reading unstable headline ratings.
Q: How can a university test whether its own module evaluations are reliable enough to use?
A: Start by checking stability in repeated like-for-like situations. Compare the same module taught multiple times with the same or very similar question set, and calculate consistency measures such as correlations or intraclass reliability. Review results at module level, not only across the whole institution, because this paper shows reliability is context-dependent rather than uniform.
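As a concrete starting point, here is a minimal sketch of that like-for-like check in Python, assuming a hypothetical export with one row per module offering and columns module_id, offering, and mean_score; adapt the names to your own evaluation data.

```python
import pandas as pd
from scipy.stats import pearsonr

df = pd.read_csv("module_evaluations.csv")  # columns: module_id, offering, mean_score

# Put the first and second offering of each module side by side.
wide = (
    df[df["offering"].isin([1, 2])]
    .pivot(index="module_id", columns="offering", values="mean_score")
    .dropna()
)

# Test-retest style correlation: how similar are mean scores for the same
# module across two consecutive deliveries of the same instrument?
r, p = pearsonr(wide[1], wide[2])
print(f"Like-for-like reliability: r = {r:.2f} across {len(wide)} modules (p = {p:.3f})")
```

Where each module has more than two repeated offerings, an intraclass correlation (for example via the pingouin package's intraclass_corr function) gives a similar consistency check.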
Q: What should we make of the finding that fairness did not predict reliability?
A: It means fairness and reliability are different questions. Students can feel that an instructor grades fairly, and that still does not make the evaluation scores more stable over time. Universities should continue to collect and act on fairness concerns, but they should not assume those ratings validate the survey instrument itself.
Q: Does this mean student voice should play a smaller role in teaching enhancement?
A: No. It means student voice should be used more carefully. SET scores on their own are a weak foundation for high-stakes judgements, but student comments, survey responses, and other feedback remain valuable when they are triangulated with other evidence. The goal is not less student voice; it is better interpretation of student voice.
[Paper Source]: Erin M. Buchanan, Jacob F. Miranda and Christian Stephens (2025). "The reliability of student evaluations of teaching". Assessment & Evaluation in Higher Education. DOI: 10.1080/02602938.2025.2504618