Published Feb 15, 2026 · Updated Feb 15, 2026
At Student Voice AI, we are deeply interested in how the design of evaluation instruments shapes the feedback institutions receive — and whether that feedback is fair. A new paper published in Assessment & Evaluation in Higher Education by Cornejo Happel, Barbeau, Rohrbacher, Korentsides and Keebler (2025) offers a compelling and practical answer to one of the most persistent questions in student evaluation research: can the way we frame evaluation questions reduce gender bias?
Gender bias in student evaluations of teaching (SET) is well documented. Study after study has shown that women faculty receive lower ratings than men, even when objective indicators such as student learning gains are equivalent. The mechanisms are subtle: when students are asked to rate subjective traits — "How knowledgeable is this instructor?" or "How effective is this lecturer?" — they draw on stereotypes and heuristics rather than observed classroom behaviour. The result is an evaluation system that penalises women, particularly in disciplines where teaching is not stereotypically associated with femininity.
This matters not only for individual careers but for institutional governance. If the instruments we use to measure teaching quality are contaminated by bias, the decisions we make on the basis of those instruments — promotions, workload allocation, course assignments — will perpetuate inequity. For UK universities navigating the Teaching Excellence Framework and NSS results, the question of whether evaluation data can be trusted is not academic; it is operational.
Cornejo Happel and colleagues took a different approach from most bias studies. Rather than documenting that bias exists, they asked whether a specific instrument design could mitigate it. They analysed midterm evaluations from 967 students across 27 faculty members at multiple institutions and disciplines in the United States. Crucially, the evaluation instrument they used was behaviourally anchored: students were asked to assess specific, observable instructional behaviours rather than subjective impressions of the instructor.
The evaluation questions focused on concrete teaching practices — things a student could directly observe in the classroom, such as how the instructor organised material, whether they checked for understanding, and how they managed class time. This stands in contrast to typical end-of-term instruments that ask students to rate nebulous constructs like "overall effectiveness" or "enthusiasm".
When feedback was structured around specific, observable instructional behaviours, the study found no statistically significant gender differences in ratings. This is the headline finding, and it is striking. In a literature saturated with evidence of gender bias in traditional SETs, the absence of gender effects in a behaviourally anchored instrument suggests that question framing is a key lever for fairness.
Instead of gender, the significant variations that emerged related to instructor age and teaching experience. Less experienced instructors received lower ratings on certain behavioural dimensions, which the authors argue is an expected and constructive finding — precisely the kind of developmental signal that formative feedback should provide.
"These results suggest that behaviorally anchored evaluation instruments can produce fairer, more constructive student feedback."
The study also found that the behaviourally anchored format produced feedback that was more actionable. Because students were commenting on specific practices rather than general impressions, the resulting data pointed to concrete areas for development. This aligns with a broader shift in the literature away from summative, high-stakes uses of student evaluations toward formative, developmental applications.
The practical implication is clear: the structure of the questions matters as much as who is being evaluated. By anchoring evaluations in observable behaviours, institutions can reduce the influence of stereotypes and produce feedback that is both fairer and more useful for teaching development.
For UK institutions, this research has immediate relevance. Module evaluations, the NSS open-text question, and internal teaching quality reviews all rely on student feedback — yet few institutions have systematically audited their evaluation instruments for susceptibility to bias.
Three actions follow from this research:
1. Audit existing evaluation instruments. Review the questions used in module evaluations and mid-semester feedback. Are they asking students to rate subjective traits, or to report on observable teaching behaviours? Where questions are trait-based, consider replacing them with behaviourally anchored alternatives.
2. Separate formative from summative uses. The evidence increasingly suggests that student evaluations work best as formative, developmental tools rather than as summative judgements tied to career consequences. Institutions should consider decoupling evaluation scores from promotion and appraisal decisions, and instead use them within a broader evidence portfolio that includes peer observation and learning outcome data.
3. Analyse free-text comments for bias patterns. Even when scaled questions are redesigned, bias can persist in open-text responses. Systematic text analysis of student comments — looking for gendered language, differential use of personal versus professional descriptors, and sentiment variation by instructor demographics — can reveal patterns that aggregate scores obscure.
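To make the third action concrete, here is a minimal sketch of how an institution might begin auditing comments for differential use of personal versus professional descriptors. The word lists and group labels are illustrative placeholders only, not taken from the paper or from any validated lexicon; a real audit would draw descriptor lists from published studies of gendered language in SETs and pair counts with appropriate statistical tests.

```python
from collections import Counter

# Illustrative (hypothetical) descriptor lists -- a real audit would use
# validated lexicons from the research literature, not these examples.
PERSONAL_DESCRIPTORS = {"nice", "friendly", "caring", "funny", "warm", "rude"}
PROFESSIONAL_DESCRIPTORS = {"knowledgeable", "organised", "clear", "rigorous", "prepared"}

def descriptor_counts(comment: str) -> Counter:
    """Count personal vs professional descriptors appearing in one comment."""
    words = {w.strip(".,!?;:").lower() for w in comment.split()}
    return Counter({
        "personal": len(words & PERSONAL_DESCRIPTORS),
        "professional": len(words & PROFESSIONAL_DESCRIPTORS),
    })

def compare_groups(comments_by_group: dict[str, list[str]]) -> dict[str, dict[str, float]]:
    """Mean descriptor counts per comment, broken down by instructor group.

    Divergent means across groups flag where closer (human) review and
    formal statistical testing are warranted -- counts alone prove nothing.
    """
    result = {}
    for group, comments in comments_by_group.items():
        totals = Counter()
        for c in comments:
            totals += descriptor_counts(c)
        n = max(len(comments), 1)  # guard against empty groups
        result[group] = {k: totals[k] / n for k in ("personal", "professional")}
    return result
```

The design choice here is deliberate: the sketch surfaces descriptive differences for human review rather than issuing automated judgements, which matches the article's framing of text analysis as a tool to reveal patterns that aggregate scores obscure.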
Q: How can universities practically redesign their evaluation instruments to be behaviourally anchored?
A: The key principle is to replace questions about subjective impressions with questions about specific, observable teaching practices. For example, instead of asking "How effective is this lecturer?" an institution might ask "How well does the lecturer check whether students have understood the material before moving on?" The Critical Teaching Behaviors framework developed by Barbeau and Cornejo Happel (2023) provides a taxonomy of observable practices that can serve as a starting point. Universities should pilot revised instruments alongside existing ones to compare the resulting data and build confidence in the new approach.
Q: Does removing gender bias from scaled questions solve the problem entirely, or can bias still appear in free-text comments?
A: Redesigning scaled questions is an important step, but it does not eliminate bias from open-text feedback. Research on gender stereotyping in student nominations for teaching awards (Kwok and Potter, 2021) has shown that gendered language and differential expectations persist in qualitative comments. This is precisely where automated text analysis adds value: by categorising and benchmarking comment themes and sentiment at scale, institutions can identify where free-text feedback diverges by instructor demographics and investigate whether those differences reflect genuine variation in teaching practice or the operation of bias.
Q: Is there a risk that behaviourally anchored instruments are less useful for capturing the broader student experience?
A: Not necessarily. Behaviourally anchored questions are particularly well suited to formative, mid-semester feedback where the goal is to give instructors actionable information they can act on while the course is still running. For broader measures of the student experience — belonging, engagement, the quality of learning resources — institutions should continue to use instruments designed for those purposes, such as the NSS or bespoke experience surveys. The point is not to replace all evaluation with behavioural checklists, but to ensure that the specific task of evaluating teaching is done with instruments that are fair and fit for purpose.
[Paper Source]: Cornejo Happel, C., Barbeau, L., Rohrbacher, C., Korentsides, J. and Keebler, J. R. (2025) "Rethinking bias in student evaluations: a multivariate analysis of observable instructional behaviors", Assessment & Evaluation in Higher Education. DOI: 10.1080/02602938.2025.2548923
© Student Voice Systems Limited, All rights reserved.