Updated Mar 31, 2026
Small wording changes in evaluation forms can affect whether student feedback is fair enough to trust. A new paper published in Assessment & Evaluation in Higher Education by Cornejo Happel, Barbeau, Rohrbacher, Korentsides and Keebler (2025) suggests that asking about observable teaching behaviours, rather than subjective impressions, may be one practical way to reduce gender bias in student evaluations.
At Student Voice AI, we pay close attention to how evaluation design shapes both the feedback institutions receive and the decisions they make from it. This paper offers a useful, concrete intervention rather than another reminder that bias exists.
Gender bias in student evaluations of teaching (SET) is well documented. Study after study has shown that women faculty can receive lower ratings than men, even when objective indicators such as student learning gains are equivalent. One reason is that subjective prompts, such as "How knowledgeable is this instructor?" or "How effective is this lecturer?", invite students to lean on stereotypes and heuristics rather than report what they actually observed in the classroom. The result is an evaluation system that can penalise women, particularly in disciplines where teaching authority is still gendered.
This matters not only for individual careers but for institutional governance. If the instruments we use to measure teaching quality are contaminated by bias, the decisions we make on the basis of those instruments, from promotions and workload allocation to course assignments, will perpetuate inequity. For UK universities navigating the Teaching Excellence Framework and NSS results, the question of whether evaluation data can be trusted is not academic; it is operational. Better question design is therefore not a cosmetic tweak; it affects how confidently institutions can use teaching feedback.
Cornejo Happel and colleagues took a different approach from most bias studies. Rather than documenting that bias exists, they asked whether a specific instrument design could mitigate it. They analysed midterm evaluations from 967 students across 27 faculty members at multiple institutions and disciplines in the United States. Crucially, the evaluation instrument they used was behaviourally anchored, so students were asked to assess specific, observable instructional behaviours rather than subjective impressions of the instructor.
The evaluation questions focused on concrete teaching practices that a student could directly observe in the classroom, such as how the instructor organised material, whether they checked for understanding, and how they managed class time. This stands in contrast to typical end-of-term instruments that ask students to rate nebulous constructs like "overall effectiveness" or "enthusiasm", the same weakness seen in module evaluation forms that drift towards likability rather than teaching quality. That distinction matters because it shifts the task from judging the person to reporting the practice.
No statistically significant gender differences were found when feedback was structured around specific instructional behaviours. This is the headline finding, and it is striking. In a literature saturated with evidence of gender bias in traditional SETs, the absence of gender effects in a behaviourally anchored instrument suggests that question framing is a lever institutions can actually pull.
Instead of gender, the significant variations that emerged related to instructor age and teaching experience. Less experienced instructors received lower ratings on certain behavioural dimensions, which the authors argue is an expected and constructive finding, precisely the kind of developmental signal that formative feedback should provide. That makes the instrument more useful, not less, because it points towards coaching and support rather than stereotype-driven judgements.
"These results suggest that behaviorally anchored evaluation instruments can produce fairer, more constructive student feedback."
The study also found that the behaviourally anchored format produced feedback that was more actionable. Because students were commenting on specific practices rather than general impressions, the resulting data pointed to concrete areas for development. This aligns with a broader shift in the literature away from summative, high-stakes uses of student evaluations and towards formative, developmental applications.
The practical implication is clear: if you want fairer, more useful teaching feedback, the structure of the questions matters as much as who is being evaluated. By anchoring evaluations in observable behaviours, institutions can reduce the influence of stereotypes and produce feedback that is more credible for review processes and more useful for teaching development.
For UK institutions, this research has immediate relevance. Module evaluations, the NSS open-text question, and internal teaching quality reviews all rely on student feedback, yet few institutions have systematically audited their evaluation instruments for susceptibility to bias. That leaves a blind spot at the point where evidence is collected.
Three actions follow from this research:
Audit existing evaluation instruments. Review the questions used in module evaluations and mid-semester feedback. Are they asking students to rate subjective traits, or to report on observable teaching behaviours? Where questions are trait-based, consider replacing them with behaviourally anchored alternatives, and co-designing teaching evaluation surveys with students and staff where possible. This gives you a cleaner evidence base before the data reaches dashboards or committees.
Separate formative from summative uses. The evidence increasingly suggests that student evaluations work best as formative, developmental tools rather than as summative judgements tied to career consequences. Institutions should consider decoupling evaluation scores from promotion and appraisal decisions, and instead use them within a broader evidence portfolio that includes peer observation and learning outcome data. That makes the feedback more useful for improvement and less likely to harden unfair patterns.
Analyse free-text comments for bias patterns. Even when scaled questions are redesigned, bias can persist in open-text responses. Systematic text analysis of student comments can reveal patterns that aggregate scores obscure: gendered language, differential use of personal versus professional descriptors, and variation in sentiment by instructor demographics. Where NSS open-text data is involved, ground the work in a defensible open-text analysis methodology. This helps teams spot where an instrument may be fairer on paper than it is in practice.
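To make that kind of check concrete, here is a minimal sketch of one possible analysis: counting how often personal versus professional descriptors appear in comments about men and women instructors. The comment records, field names, and wordlists below are illustrative only; a real audit would use validated lexicons, much larger samples, and proper significance testing.

```python
# A minimal, illustrative sketch of a descriptor audit on free-text comments.
# The comments, wordlists, and field names are invented for demonstration.
from collections import Counter
import re

comments = [
    {"instructor_gender": "woman", "text": "She is so nice and approachable."},
    {"instructor_gender": "man",   "text": "Explains derivations clearly and paces the lectures well."},
    {"instructor_gender": "woman", "text": "Organised seminars, clear marking criteria, prompt feedback."},
    {"instructor_gender": "man",   "text": "Brilliant guy, really funny in lectures."},
]

# Toy wordlists; a real audit would use validated lexicons.
PERSONAL = {"nice", "funny", "approachable", "brilliant", "guy", "lovely"}
PROFESSIONAL = {"organised", "clear", "clearly", "paces", "explains",
                "marking", "prompt", "feedback"}

def descriptor_counts(text):
    """Count personal and professional descriptors in one comment."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return (sum(t in PERSONAL for t in tokens),
            sum(t in PROFESSIONAL for t in tokens))

totals = Counter()
for c in comments:
    personal, professional = descriptor_counts(c["text"])
    totals[(c["instructor_gender"], "personal")] += personal
    totals[(c["instructor_gender"], "professional")] += professional

for gender in ("woman", "man"):
    p = totals[(gender, "personal")]
    q = totals[(gender, "professional")]
    share = p / (p + q) if (p + q) else 0.0
    print(f"{gender}: {p} personal vs {q} professional descriptors "
          f"({share:.0%} personal)")
```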
Q: How can universities practically redesign their evaluation instruments to be behaviourally anchored?
A: The key principle is to replace questions about subjective impressions with questions about specific, observable teaching practices. For example, instead of asking "How effective is this lecturer?", an institution might ask "How well does the lecturer check whether students have understood the material before moving on?" The Critical Teaching Behaviors framework developed by Barbeau and Cornejo Happel (2023) provides a taxonomy of observable practices that can serve as a starting point. Universities should pilot revised instruments alongside existing ones to compare the resulting data and build confidence in the new approach. The benefit is straightforward: teachers get feedback they can act on, and leaders get evidence they can trust.
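As a purely illustrative sketch of that pilot comparison, the snippet below computes the gap between mean ratings for men and women instructors on a trait-based item and on a behaviourally anchored item side by side. The data and field names are invented, and a real analysis would use the kind of multivariate testing reported in the paper rather than a raw difference in means.

```python
# A minimal sketch, not a prescribed method: compare the gender gap on an
# existing trait-based item with the gap on a behaviourally anchored item
# from the same pilot. All records and field names below are hypothetical.
from statistics import mean

pilot = [  # ratings on a 1-5 scale, grouped by instructor gender
    {"gender": "woman", "trait_item": 3.8, "behaviour_item": 4.2},
    {"gender": "woman", "trait_item": 3.6, "behaviour_item": 4.1},
    {"gender": "man",   "trait_item": 4.3, "behaviour_item": 4.2},
    {"gender": "man",   "trait_item": 4.1, "behaviour_item": 4.0},
]

def gender_gap(records, item):
    """Mean score for men minus mean score for women on a single item."""
    men = [r[item] for r in records if r["gender"] == "man"]
    women = [r[item] for r in records if r["gender"] == "woman"]
    return mean(men) - mean(women)

for item in ("trait_item", "behaviour_item"):
    print(f"{item}: gender gap = {gender_gap(pilot, item):+.2f}")
```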
Q: Does removing gender bias from scaled questions solve the problem entirely, or can bias still appear in free-text comments?
A: Redesigning scaled questions is an important step, but it does not eliminate bias from open-text feedback. Research on gender stereotyping in student nominations for teaching awards (Kwok and Potter, 2021) has shown that gendered language and differential expectations persist in qualitative comments. This is precisely where automated text analysis adds value: by categorising and benchmarking comment themes and sentiment at scale, institutions can identify where free-text feedback diverges by instructor demographics and investigate whether those differences reflect genuine variation in teaching practice or the operation of bias. In other words, redesigned forms still need monitoring once the comments start coming in.
Q: Is there a risk that behaviourally anchored instruments are less useful for capturing the broader student experience?
A: Not necessarily. Behaviourally anchored questions are particularly well suited to formative, mid-semester feedback, where the goal is to give instructors information they can act on while the course is still running. For broader measures of the student experience, such as belonging, engagement, or the quality of learning resources, institutions should continue to use instruments designed for those purposes, such as the NSS or bespoke experience surveys. The point is not to replace all evaluation with behavioural checklists, but to ensure that the specific task of evaluating teaching is done with instruments that are fair and fit for purpose.
[Paper Source]: Cornejo Happel, C., Barbeau, L., Rohrbacher, C., Korentsides, J. and Keebler, J. R. (2025) "Rethinking bias in student evaluations: a multivariate analysis of observable instructional behaviors", Assessment & Evaluation in Higher Education. DOI: 10.1080/02602938.2025.2548923