Co-design your evaluations or you will not trust the data

Audio briefing based on Student Voice Weekly issue #6.

This Week

This week, the episode discusses better evaluations, OfS evidence trails, and PTES design. A six-year redesign study shows better evaluation data starts with co-design. The main topics are grouped below by student voice practice, research, sector developments, archive context, and practical application.

Main Topics Discussed

Student Voice Practice

I kept coming back to a simple point this week: if the feedback instrument is badly designed, the data it produces will never be as trustworthy as the dashboard suggests.

Research Spotlight

Sector Watch

From the Archive

Practical Application

A low-risk pilot is often the sensible way to test whether comment analysis will actually change how your team works.

Subscribe to The Student Voice Weekly: https://www.studentvoice.ai/blog/newsletter/

Transcript

Hello, and welcome to Student Voice Weekly. I'm Dr Stuart Grey, founder of Student Voice, and today's theme is evaluation design: why co-designed questions produce better data, and why your nice clean dashboard might be comparing things that are not comparable.

Today I'd like to talk about something that sounds a bit dry, but it quietly drives whether student voice work helps, or just creates noise. The key thing is simple: if the feedback instrument is badly designed, the data will never be as trustworthy as the dashboard suggests.

Most teams have seen this. You pull the module evaluation results, or a PTES cut, or you are looking ahead to NSS. The numbers look precise. Then you sit down with staff or student reps and the first part of the meeting is not "What do we change?" It is "What did students think this question meant?"

And because I still teach part-time at the University of Glasgow, I see this at ground level. Students do not experience "teaching quality" as one thing. They experience clarity, organisation, responsiveness, assessment design, feedback, and whether they feel able to ask questions. If your survey item bundles multiple ideas together, you can get a stable score that is impossible to act on.

So the first research story this week is a six-year redesign of teaching evaluations where staff and students co-designed the evaluation, iteratively, over time. The takeaway is not just that co-design is nice. It is that when you co-design questions, you get closer to what staff and students actually mean by good teaching, and you expose where you think you agree, but you do not.

In that study, there was agreement on core features like communication, respect, commitment, and course organisation. That is the baseline. Where things diverged more was around concepts like rigour and the learning environment.

Make sure you notice what that means for your own questionnaires. "Rigour" can mean "this is challenging in a fair way" or it can mean "this is unclear and I am being set up to fail". "Learning environment" can mean the room, the culture, accessibility, online spaces, and group dynamics. If you ask one vague item and expect it to do diagnostic work, it will not.

My judgement is that the sector spends a lot of time debating whether student evaluations are valid, whether students are biased, whether the scores measure popularity. Some of that matters. But a lot of the day-to-day problem is more basic. We ask muddled questions, we get muddled data, then we act surprised when colleagues do not trust it, or when students feel ignored because nothing changes.

Co-design is not a magic wand. It is a practical way to make sure the questions match the lived reality of teaching and learning, and to separate issues that require different actions.

Now, the second research item is the one I want planning and quality teams to sit with. It tests whether evaluation scores are comparable across departments, programme types, and over time. In plain terms: if two areas both get a 4.2, can you treat that as the same thing?

The answer is no, not automatically. The questionnaire can be identical, but the scores may not mean the same thing in different contexts.

This is where dashboards can mislead. Dashboards assume comparability because comparison is what they are built for. But comparability is an assumption, not a fact. If the instrument behaves differently in different contexts, ranking departments off those numbers can push you toward the wrong conclusion.

And the time dimension matters too. A course changes, the cohort changes, assessment changes, or expectations shift. You can see movement and assume it is teaching quality, when it might be context. So I am not saying "never compare anything". I am saying be disciplined about which comparisons you allow people to make casually.

A safer approach is like-for-like comparisons, and reading the comments alongside the scores. Comments are where you find what students thought the item was asking.

That brings me to sector watch, because this is where the operational detail becomes real.

The Office for Students published a quality assessment about a small provider where they could not find evidence that planned module evaluations and student surveys had actually been run, even though the processes were described in quality documentation.

I am not interested in piling on one provider. The wider lesson is about evidence trails. If your institution states, in policy or handbooks, "we run module evaluations, we review them, we act", you need to be able to demonstrate it. That means being able to retrieve the questionnaire used, the fieldwork dates, outputs, and records of discussion and action. Not for bureaucracy's sake, but because without it you cannot evidence that the feedback loop exists.

The risk, otherwise, is performative student voice. It looks good on paper, but when you need to show it is real and that it drives change, it is not there.

The other sector signal was Westminster's approach to PTES, pairing the survey with a proof-of-completion process for an incentive, and being explicit about confidentiality and identifiable comments.

This is a practical reminder that response-rate tactics are part of survey design, and they only work if you protect trust.

First, the incentive workflow must not compromise anonymity. If students think responses can be linked back to them, the most useful comments disappear.

Second, make sure you have the capacity to analyse and respond quickly after fieldwork closes. If you work hard to get responses and then nothing happens for months, students notice, staff notice, and trust drops for the next cycle.

So what does all of this mean when you are looking at what students are actually saying in free text?

In practice, I would separate comments into three buckets.

One: comments about the teaching and learning reality, things like clarity, organisation, feedback, assessment, learning resources. Those are for teaching teams and course teams to act on.

Two: comments about the meaning of the questions. Students will tell you when an item is vague or unanswerable, for example, "it depends which tutor" or "I do not know what you mean by support". Those are not nuisance comments. They are instrument-design data.

Three: comments about fairness and trust, things like "nothing changes" or "I do not want to be identified". Those are system issues. If you ignore them, response rates and candour will drop even if teaching improves.

The key thing is that each bucket has a different action owner. Teaching teams can act on experience. Survey and quality leads can act on question design. Leaders and governance need to act on trust, closing the loop, and confidentiality.
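As a rough illustration of that routing, here is a minimal Python sketch of a first-pass triage. The cue phrases, function names, and the "unsorted" fallback are hypothetical assumptions made for this example; they are not taken from any Student Voice tool, and a real cue list would come from reading your own comments. Keyword matching is only a starting point before a person reads the text.

```python
# Action owners for the three buckets described above.
OWNERS = {
    "experience": "teaching and course teams",
    "instrument": "survey and quality leads",
    "trust": "leaders and governance",
}

# Hypothetical cue phrases, for illustration only.
CUES = {
    "trust": ("nothing changes", "identified", "anonymous"),
    "instrument": ("it depends", "what you mean", "not sure what"),
    "experience": ("feedback", "assessment", "organis", "clear", "resources"),
}


def triage(comment: str) -> list[str]:
    """Return every bucket whose cue phrases appear in the comment."""
    text = comment.lower()
    hits = [
        bucket
        for bucket, cues in CUES.items()
        if any(cue in text for cue in cues)
    ]
    # Anything with no match is left for a person to read and sort.
    return hits or ["unsorted"]


def owners_for(comment: str) -> list[str]:
    """Map a comment to the owners who should see it."""
    return [OWNERS.get(bucket, "manual review") for bucket in triage(comment)]


if __name__ == "__main__":
    for example in (
        "It depends which tutor",
        "I do not want to be identified",
        "Feedback on the assessment was slow",
    ):
        print(example, "->", owners_for(example))
```

A comment can land in more than one bucket, which is deliberate: "feedback is slow and nothing changes" is both a teaching issue and a trust issue, so both owners should see it.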

So one thing to try this week, in a real meeting.

Take your current module evaluation instrument and pick five items. For each item ask two questions.

First: what decision is this item meant to support? Not what it measures, but what decision it feeds. For example, "do we need to change assessment guidance" or "do we need to review feedback turnaround".

Second: could two reasonable students interpret this item differently? If yes, write down the two interpretations.

Then take a small sample of recent comments that sit near that item and check which interpretation students are using. That gives you an evidence-based way to rewrite questions, rather than rewriting by committee preference.

Do that for five items and you are already moving toward an instrument people can trust. And it changes the conversation with staff. You are not saying, "the score says you are bad". You are saying, "this question is doing too much work, here is what students mean by it, and here is how we can make the next cycle more actionable".

That is it for this week. The full set of links and summaries is in Student Voice Weekly.