When scores all move together: halo effects, bias, and what to do with student feedback

Audio briefing based on Student Voice Weekly issue #4.

This Week

This week's episode covers halo effects, gender bias, wellbeing surveys, and MyGrades. Halo effects flatten evaluation scores, but they do not make the data useless. The main topics are grouped below by student voice practice, research, sector developments, archive context, and practical application.

Main Topics Discussed

Student Voice Practice

First, a warm welcome to everyone who has signed up over the past few weeks.

Research Spotlight

Sector Watch

From the Archive

Practical Application

One pattern we see regularly is that a single analyst or small team codes thousands of student comments each cycle, and the results depend on who had time and how they interpreted the categories that year.

Subscribe to The Student Voice Weekly: https://www.studentvoice.ai/blog/newsletter/

Transcript

Hi, this is Student Voice Weekly. I'm Dr Stuart Grey, founder of Student Voice, and this week's theme is when feedback looks too neat: why evaluation scores often move together, and what that means for decisions in universities.

Today I'd like to talk about halo effects in student evaluations, and two common mistakes I see. Mistake one is assuming that if the scores are correlated, the data is useless. Mistake two is the opposite: assuming that because the dashboard is detailed, you can use tiny differences between items to diagnose exactly what is wrong.

I'm also going to connect this to two other signals from this week's issue. One is research on gender bias in teaching excellence nominations, which is a reminder that what students notice and reward is not evenly distributed. The other is a sector shift towards systems, not just surveys: King's running a wellbeing survey alongside NSS and PTES, and Glasgow rolling out MyGrades.

Let's start with halo effects.

You've seen the pattern. Overall satisfaction is high, and then everything else is high too: organisation, clarity, assessment, feedback, support. Or there's one bad experience and the whole block drops. Then someone says, "students are just rating their mood, it's all one question".

The Cannon and Cipriani paper is useful because it tests that intuition instead of just arguing about it. They look for evidence of a global response tendency by adding something that should not be about teaching quality at all: satisfaction with the lecture room. If that ends up correlating with your teaching items, that's a sign students are responding in a more general way.
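To make that check concrete, here is a minimal sketch of the idea, not the paper's actual analysis. It correlates each teaching item with a control item that should have little to do with teaching quality. The item names, the 1 to 5 scale, and the response values are all made up for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when an item has no variance")
    return cov / sqrt(var_x * var_y)

# Hypothetical responses on a 1-5 scale, one entry per student.
responses = {
    "organisation": [5, 4, 2, 5, 3, 1, 4, 5],
    "feedback":     [5, 4, 1, 4, 3, 2, 4, 5],
    "lecture_room": [4, 4, 2, 5, 3, 1, 3, 5],  # control item, not about teaching
}

def halo_check(responses, control_item):
    """Correlate every other item with the control item."""
    control = responses[control_item]
    return {
        item: round(pearson(scores, control), 2)
        for item, scores in responses.items()
        if item != control_item
    }

halo = halo_check(responses, "lecture_room")
print(halo)
```

High correlations between the control item and the teaching items would be consistent with a global response tendency. They would not prove it, since a well-run module might also get a better room.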

They do find strong correlations across questions, consistent with halo effects. The key thing, though, is what you do next. This does not automatically invalidate student feedback. It changes what the data can reliably tell you.

A practical way to think about it is this. Halo effects make fine-grained, item-by-item diagnosis within a module or a lecturer look more precise than it is. If you're treating a 0.2 gap between "organisation" and "feedback" as a clean diagnosis, make sure you slow down. That difference might be noise, or it might be a general positive or negative feeling leaking across every item.
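One rough way to see why a 0.2 gap can be noise is to look at the per-student differences between the two items and their standard error. This is a sketch under simple assumptions: the ten responses are invented, the scale is treated as numeric, and the "two standard errors" band is only a rule of thumb (with so few responses a proper interval would be wider).

```python
from math import sqrt
from statistics import mean, stdev

def paired_gap(item_a, item_b):
    """Mean per-student difference between two items, with its standard error."""
    diffs = [a - b for a, b in zip(item_a, item_b)]
    gap = mean(diffs)
    se = stdev(diffs) / sqrt(len(diffs))
    return gap, se

# Hypothetical 1-5 responses from the same ten students.
organisation = [4, 5, 3, 4, 5, 4, 3, 5, 4, 4]
feedback     = [4, 4, 3, 5, 4, 4, 3, 5, 4, 3]

gap, se = paired_gap(organisation, feedback)
low, high = gap - 2 * se, gap + 2 * se  # rough band, not a formal test
print(f"gap={gap:.2f}, se={se:.2f}, rough band=({low:.2f}, {high:.2f})")
```

In this made-up example the gap is 0.2 but the rough band runs from about -0.2 to 0.6, so the data cannot distinguish "organisation is better than feedback" from no difference at all.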

But broader signals can still hold. If a module is consistently low across multiple runs, or consistently stands out compared with similar modules, that is still information. It becomes much more usable when you pair it with comments and operational data.
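As an illustration of what "consistently low" could mean in practice, here is a small sketch that flags modules sitting below a comparator benchmark on every run. The module codes, scores, benchmark, margin, and minimum number of runs are all assumptions you would set locally.

```python
def flag_consistently_low(module_runs, benchmark, margin=0.3, min_runs=3):
    """Return modules whose overall score is below benchmark - margin on every run.

    module_runs: dict mapping module code -> list of overall scores, one per run.
    Modules with fewer than min_runs runs are skipped rather than judged.
    """
    flagged = []
    for module, scores in module_runs.items():
        if len(scores) < min_runs:
            continue  # not enough history to call it a pattern
        if all(score <= benchmark - margin for score in scores):
            flagged.append(module)
    return sorted(flagged)

# Hypothetical overall scores across successive runs.
module_runs = {
    "MOD101": [3.1, 3.0, 3.2],  # low every time
    "MOD102": [4.1, 3.2, 4.3],  # one bad run
    "MOD103": [3.0, 2.9],       # too few runs to judge
}

flagged = flag_consistently_low(module_runs, benchmark=3.9)
print(flagged)
```

The output is a list of places to look, not a list of problems. A flagged module is a prompt to read the comments and check the operational data.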

So what should you do with evaluation scores when you suspect halo effects?

First, treat quantitative scores as a routing tool, not a verdict. The score tells you where to look. It does not tell you what to do.

Second, make sure decisions match the precision of the data. Use scores for triage and trend, not micro-management. If you want to do diagnosis, you need supporting evidence: comments, assessment turnaround times, VLE analytics, complaint themes, extension patterns, and other indicators that describe what students actually experienced.

Third, be careful about how you use item breakdowns in performance processes. A halo effect is not just a technical detail. If you build a narrative about a staff member based on over-interpreted item differences, you can end up with something that looks objective but is not robust.

Now bring in the second research item, because it adds an important layer. Kwok and Potter look at teaching excellence nominations and show that gender shapes both who gets nominated and what qualities students associate with excellence.

The point here is not the unhelpful line, "student feedback is biased, so ignore it". That doesn't help anyone run a better university. The useful framing is: student judgements are patterned, and those patterns can reflect stereotypes. Your job is to detect those patterns and design processes that do not amplify them.

They use shifting standards theory, which basically means the bar can move depending on who is being judged. The same behaviour can be read differently. Some colleagues may need to be exceptional to be described as good, while others get described as exceptional for being good. That matters when nominations, quotes, and praise get reused in promotion, probation, and workload narratives.

So, a few practical implications.

If you run awards, do not treat nominations as a clean measure of teaching quality. They are a measure of noticed excellence in your local culture. That distinction is important.

If you use student comments to build "excellent teaching" case studies, make sure you check who is getting recognised, and how. Look at themes and thresholds, not just counts. Make sure your definition of excellence does not quietly narrow to one type of "excellent" that maps onto stereotypes.

And link this back to halo effects. If student responses are partly global, and what they notice is socially patterned, then neat dashboards are sitting on top of a messy perception system. That doesn't mean you stop listening. It means you interpret with care and you triangulate.

Let me shift to sector watch, because both items point to a direction of travel.

First, King's launching a wellbeing survey alongside NSS and PTES. The key thing here is not the exact question set. It's the move towards multiple instruments with different jobs, and the attempt to link feedback to action. If you do this, make sure you avoid survey sprawl. Be clear what each instrument is for, what decisions it informs, and what the student-visible follow-through is.

Second, Glasgow rolling out MyGrades after student feedback. This is a good example of turning "the student experience" into infrastructure. Students don't experience assessment as a policy document. They experience it as: where are my deadlines, where is my grade, where is my feedback, what happens next, and why do I have to check multiple systems to find it.

MyGrades matters because it responds to a concrete friction point by changing the system, not by asking students to be more patient or by adding another comms page.

And it links back to halo effects again. If a student spends weeks chasing feedback across disconnected systems, that frustration can spill into every evaluation item. Sometimes the best way to improve "feedback scores" is not to rewrite a question or run a workshop. It is to fix the process students are actually living through.

That brings me to comments, because this is where you can often separate mood from mechanism.

When you review free text in a context where scores move together, try a simple split.

One set of comments is global judgements: "great module", "awful organisation", "loved the lecturer". They are real, but they rarely tell you what to change.

The other set are mechanism comments, the ones that describe what happened: feedback arrived after the next deadline, the brief was unclear, the rubric didn't match what was taught, the lectures and seminars didn't connect, assessment information was hard to find. Those are the comments that point to interventions.

Make sure your analysis doesn't collapse both into a single theme and then act as if you have a diagnosis. A good process distinguishes dissatisfaction with an outcome, dissatisfaction with a process, and dissatisfaction with communication.
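For a first pass, the global versus mechanism split can be roughed out with a keyword heuristic. This is a deliberately crude sketch: the cue list is an assumption built from the examples above, it will misclassify plenty of real comments, and it is a way to pre-sort before a human reads them, not a replacement for reading them.

```python
# Hypothetical cue phrases suggesting a comment describes what happened.
MECHANISM_CUES = (
    "deadline", "rubric", "brief", "arrived",
    "hard to find", "didn't", "did not",
)

def split_comments(comments, cues=MECHANISM_CUES):
    """Sort comments into 'global' judgements and 'mechanism' descriptions."""
    split = {"global": [], "mechanism": []}
    for comment in comments:
        # Normalise case and curly apostrophes before matching cues.
        text = comment.lower().replace("\u2019", "'")
        kind = "mechanism" if any(cue in text for cue in cues) else "global"
        split[kind].append(comment)
    return split

comments = [
    "great module",
    "feedback arrived after the next deadline",
    "loved the lecturer",
    "the rubric didn't match what was taught",
    "assessment information was hard to find",
]

result = split_comments(comments)
print(result)
```

If the "mechanism" list comes back nearly empty on a real report, that is worth noting in itself.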

On the gender point, do something similar with positive comments. Separate praise for care and support from praise for expertise and clarity, then check whether those forms of praise are being applied differently across staff groups. The aim is not to police students' language. It's to make sure institutional recognition doesn't become skewed.
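Once positive comments have been coded by type of praise, the comparison across staff groups is a simple share calculation. The sketch below assumes the coding has already been done by a person; the group labels, praise categories, and data are placeholders, and with small numbers the shares will be unstable.

```python
from collections import Counter, defaultdict

def praise_shares(coded_comments):
    """Share of each praise type within each staff group.

    coded_comments: iterable of (staff_group, praise_type) pairs.
    """
    counts = defaultdict(Counter)
    for group, praise_type in coded_comments:
        counts[group][praise_type] += 1
    shares = {}
    for group, counter in counts.items():
        total = sum(counter.values())
        shares[group] = {ptype: n / total for ptype, n in counter.items()}
    return shares

# Hypothetical hand-coded positive comments.
coded = [
    ("group_a", "care"), ("group_a", "care"),
    ("group_a", "expertise"), ("group_a", "care"),
    ("group_b", "expertise"), ("group_b", "expertise"),
    ("group_b", "care"), ("group_b", "expertise"),
]

shares = praise_shares(coded)
print(shares)
```

A large difference in shares between groups is a prompt to look at how recognition is being awarded, not a finding about any individual.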

One practical thing to try this week.

Take a module evaluation report where the scores all move together, high or low. Draw two columns.

Column one: "what students felt". Column two: "what happened".

Put a handful of comments into each. Then ask: are we about to take action based on feelings, or based on what happened? You need both, but only one reliably tells you what to fix.

If you cannot fill the "what happened" column, that is also a finding. It suggests your questions are not eliciting diagnostic feedback, or students don't believe detail leads to change, or both.

That's it for this week.