Updated Mar 18, 2026
At Student Voice AI, we work with universities that want to benchmark student feedback responsibly, not just quickly. That makes Edgar Valencia's 2025 paper in Assessment & Evaluation in Higher Education, "On the comparability of SET scores: measurement invariance across programs, departments, and time", especially timely. For UK universities using module evaluations, internal teaching surveys, and free-text comments to judge teaching quality, the paper asks a deceptively simple question: when two groups give the same score, does that score actually mean the same thing?
Student evaluations of teaching, often shortened to SETs, are routinely aggregated and compared as if the meaning of the scale were self-evident. Institutions benchmark departments, track trends over years, and sometimes fold those numbers into workload, recognition, or personnel decisions. But that only works if the questionnaire functions equivalently across those contexts.
Valencia approaches this as a measurement problem rather than a culture-war argument about whether student voice matters. Using data from a graduate school of education, he tests measurement invariance across three common administrative lenses: programme type, department, and academic term. In practice, measurement invariance asks whether students in different parts of the institution interpret the same survey items in the same way. For UK higher education teams, that is highly relevant. A module evaluation score of 4.2 may look neat in a dashboard, but it is not a fair benchmark if different departments read the questions through different disciplinary expectations.
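To make that concrete, the standard factor-analytic framing (textbook notation, not taken from the paper itself) treats each survey item as a noisy indicator of an underlying teaching-quality factor, and asks which parameters can be held equal across groups:

```latex
% Item i, student s, group g: x is the observed rating, \eta the latent factor.
x_{is}^{(g)} = \tau_i^{(g)} + \lambda_i^{(g)} \, \eta_s + \varepsilon_{is}
% Configural invariance: the same items indicate the same factor in every group.
% Metric invariance:     \lambda_i^{(g)} = \lambda_i  (equal loadings across groups).
% Scalar invariance:     \tau_i^{(g)} = \tau_i as well (equal item intercepts).
```

Comparing raw averages across departments presupposes at least the scalar level: if item intercepts differ, two departments with identical teaching can still post different scores.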
The core finding is not that SETs are useless, but that comparability has to be tested rather than assumed. The paper argues that students' beliefs about teaching and learning can differ across disciplines, and that their understanding of high-quality teaching can shift over time. That matters because most institutional reporting still treats score differences as straightforward evidence of better or worse teaching.
Valencia finds no support for the strict comparability of SET scores across programme types, departments, and academic terms. That is the most important result for decision-makers. If strict comparability is not established, leaders should be cautious about comparing averages as though they were directly interchangeable across organisational units or time periods.
"The results reveal no support for the strict comparability of SET scores"
Department-level benchmarking looks especially exposed. The source abstract highlights disciplinary differences in beliefs about teaching and learning, which helps explain why cross-department comparison can be fragile. A reasonable inference is that a common institutional survey may still produce scores that reflect local norms and interpretations, not only teaching quality itself.
The paper also reframes survey maintenance as an ongoing methodological responsibility. Comparability checks are not just for researchers building a new instrument. The article presents them as part of responsible institutional analysis, because they affect faculty development, curriculum review, and any decision that depends on interpreting SET data credibly. That is particularly relevant in UK settings where module evaluations can feed into enhancement plans, school reviews, and narratives about teaching quality.
The positive message is that institutions can use comparability analysis to improve their surveys over time. Rather than abandoning student evaluations, the paper points towards better governance: test the instrument, understand where it behaves differently, and refine how results are reported and used. That is a more rigorous version of student voice, not a smaller one.
For UK universities, the first implication is to stop treating SET and module evaluation scores as automatically benchmarkable across departments. Before publishing league-table-style dashboards or drawing conclusions about relative performance, institutions should test whether the survey instrument is functioning equivalently across the groups being compared.
Second, evaluation governance should separate descriptive monitoring from high-stakes judgement. A department can still use its own results to track local improvement, but cross-department comparisons and personnel decisions need a stronger evidential standard. This paper suggests that the technical question of comparability belongs much closer to the centre of quality assurance practice than it usually does.
Third, universities should rely more heavily on open-text evidence when score comparability is uncertain. If two departments receive similar scaled scores but students write about very different problems, the comments are often a better guide to action. This is where Student Voice Analytics fits naturally: it helps institutions compare themes in free-text comments within like-for-like contexts, rather than overinterpreting headline averages that may not be measuring the same thing in the first place.
Q: How can a university test whether its own module evaluation scores are genuinely comparable across departments?
A: The most defensible route is to test measurement invariance on the survey instrument before using the scores for benchmarking. In practice, that means checking whether the question set has the same underlying structure, item meaning, and scaling behaviour across departments or cohorts. Institutions do not need to run a full psychometric review every term, but they do need periodic checks whenever a survey is being used for cross-department comparison or high-stakes reporting.
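As a rough screening sketch only, not the formal multi-group procedure the paper applies, an analyst could fit a crude one-factor solution per department and compare item loadings. The file name and column names below are hypothetical; adjust to your own survey export:

```python
import numpy as np
import pandas as pd

def first_factor_loadings(items: pd.DataFrame) -> pd.Series:
    """Crude one-factor loadings: each item's weight on the first
    principal component of the item correlation matrix."""
    corr = items.corr().to_numpy()
    eigvals, eigvecs = np.linalg.eigh(corr)   # eigenvalues in ascending order
    pc1 = eigvecs[:, -1]                      # direction of the largest eigenvalue
    loadings = pc1 * np.sqrt(eigvals[-1])     # scale to a loading-like metric
    if loadings.sum() < 0:                    # the eigenvector's sign is arbitrary
        loadings = -loadings
    return pd.Series(loadings, index=items.columns)

# Hypothetical export: one row per response; q1..q5 are the Likert items and
# "department" is the grouping column (names assumed, not from the paper).
df = pd.read_csv("module_eval_responses.csv")
item_cols = ["q1", "q2", "q3", "q4", "q5"]
report = pd.DataFrame({
    dept: first_factor_loadings(group[item_cols])
    for dept, group in df.groupby("department")
})
print(report.round(2))
# Items whose loading differs markedly between departments are a warning that
# the questionnaire is not functioning equivalently, so raw averages should
# not be benchmarked across those departments without further checks.
```

A formal test would go further: estimate nested multi-group confirmatory factor models and compare their fit at the configural, metric, and scalar levels, which is usually done in dedicated SEM software.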
Q: What is the difference between a reliable survey and a comparable survey?
A: Reliability asks whether the scores are reasonably stable and consistent. Comparability asks whether the same score means the same thing across groups or time periods. A survey can look internally consistent within a department and still be unsafe to compare across departments if students interpret the items differently. That is why this paper adds something important to the broader SET debate: it focuses on whether benchmarking is valid, not only whether the questionnaire is stable.
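A toy simulation (illustrative numbers only, not data from the paper) shows the gap between the two properties: both departments below yield a healthy Cronbach's alpha, yet a shifted intercept on one item inflates the second department's average on its own:

```python
import numpy as np

rng = np.random.default_rng(0)

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents, n_items) array."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

def simulate(n: int, intercept_shift: float) -> np.ndarray:
    """Five items driven by one 'teaching quality' factor; item 0 gets a
    group-specific intercept shift, which breaks scalar invariance."""
    quality = rng.normal(0, 1, size=n)
    noise = rng.normal(0, 0.6, size=(n, 5))
    items = quality[:, None] + noise
    items[:, 0] += intercept_shift   # same construct, different item meaning
    return items

dept_a = simulate(500, intercept_shift=0.0)
dept_b = simulate(500, intercept_shift=0.8)  # e.g. a more lenient reading of item 1

print(f"alpha A: {cronbach_alpha(dept_a):.2f}, alpha B: {cronbach_alpha(dept_b):.2f}")
print(f"mean A:  {dept_a.mean():.2f}, mean B:  {dept_b.mean():.2f}")
# Both alphas look strong, but department B's average is higher purely because
# of the intercept shift: reliable within groups, not comparable across them.
```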
Q: Does this mean universities should trust student comments more than scaled questions?
A: It means the two forms of evidence should play different roles. Scaled items can be useful for local monitoring if the instrument is well designed and used carefully. But when institutions want to understand why scores differ, or whether two units with similar scores are facing the same problems, open-text feedback becomes essential. Student voice is strongest when numbers and comments are interpreted together, not when a single average is treated as the whole story.
[Paper Source]: Edgar Valencia (2025), "On the comparability of SET scores: measurement invariance across programs, departments, and time", Assessment & Evaluation in Higher Education. DOI: 10.1080/02602938.2025.2508754