Can Machine Learning Help Instructors Make Sense of Student Evaluation Comments?

Updated Apr 02, 2026

At Student Voice AI, we believe that the qualitative comments students write in teaching evaluations are among the most valuable — and most underused — data universities collect. A new study published in Assessment & Evaluation in Higher Education tackles a question that sits at the heart of our work: what happens when you apply machine learning to those open-text comments, and how do the instructors who receive the results actually respond?

The paper, "Understanding university instructor responses to machine-learning analysis of mid-semester teaching evaluations", is authored by Valeriya Minakova, Ryan Patterson, Benjamin Potter, Shentong Wang, Lai Wei, Yang Xiao, Caroline Junkins and Sharonna Greenberg, a cross-disciplinary team spanning global development, computer science, mathematics, and chemistry at Queen's University and McMaster University in Canada. Published online on 27 January 2026, it reports on a qualitative study that brought together topic modelling, data visualisation, and in-depth instructor interviews to explore whether machine analysis of mid-semester evaluation comments can genuinely improve how teaching staff engage with student feedback.

The challenge: qualitative feedback at scale

Mid-semester evaluations of teaching (MSET) give students a chance to provide formative feedback while a course is still running, in time for instructors to make adjustments (much like Westminster's Mid-Module Check-ins). Unlike end-of-term surveys, they are designed to be developmental rather than summative. The problem, particularly in large-enrolment courses generating hundreds or thousands of free-text responses, is that reading and making sense of all those comments is time-consuming: many instructors simply do not read them, or skim only a selection. The researchers set out to test whether automated topic modelling could extract meaningful themes from MSET comments and present them to instructors in a way that was useful and actionable.

Methodology: topic modelling meets instructor interviews

The study recruited eleven instructors from different departments at a Canadian research university, all teaching large-enrolment courses. Students in those courses completed mid-semester evaluations, and the resulting open-text comments were analysed using topic modelling — a family of machine-learning techniques that identifies clusters of frequently co-occurring words to surface latent themes in a corpus of text. The team experimented with multiple approaches, including BERTopic, a neural topic modelling method that uses contextual language embeddings to group semantically similar comments.
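
For readers who want to experiment with their own evaluation data, a minimal BERTopic pipeline looks roughly like the sketch below. This is illustrative rather than the study's actual code: load_survey_comments stands in for however you load your survey export, and the min_topic_size value is an assumption, not the researchers' configuration.

    # Minimal BERTopic sketch (pip install bertopic). Illustrative only:
    # load_survey_comments is a hypothetical loader, and min_topic_size
    # is an assumed value, not the study's configuration.
    from bertopic import BERTopic

    comments = load_survey_comments()  # list of free-text MSET responses

    # BERTopic embeds each comment with a sentence-transformer, clusters
    # the embeddings, and labels each cluster with its top keywords.
    topic_model = BERTopic(language="english", min_topic_size=10)
    topics, probs = topic_model.fit_transform(comments)

    # One row per theme, with its size and keywords; topic -1 is the
    # outlier bucket of comments that matched no cluster.
    print(topic_model.get_topic_info())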

The machine-generated topics were then presented to instructors through data visualisations, and each instructor participated in an in-depth interview to share their reactions. The interviews explored whether the topic summaries matched instructors' own reading of the comments, whether the visualisations were clear and useful, and what improvements instructors would suggest.

Key findings

Instructors were broadly positive about the machine-learning output. Most found that the automatically generated topics aligned well with what they already knew from reading comments manually, but the structured format made the feedback easier to digest and act upon. Several noted that seeing themes ranked by prevalence helped them prioritise which issues to address first — something that is difficult when scrolling through hundreds of individual comments.
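
With a fitted BERTopic model, as in the sketch above, that prevalence ordering is straightforward to produce; the report format here is our own illustration, not the study's.

    # Rank themes by how many comments they cover, so the most widespread
    # issues surface first (continues the fitted topic_model above).
    info = topic_model.get_topic_info()
    ranked = info[info.Topic != -1].sort_values("Count", ascending=False)

    for _, row in ranked.head(5).iterrows():
        print(f"{row['Count']:4d} comments  {row['Name']}")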

Data visualisation mattered as much as the analysis itself. A significant portion of the interview feedback focused not on the accuracy of the topic modelling but on how the results were presented. Instructors offered detailed suggestions about how to improve the layout, labelling, and interactivity of the reports. This finding underscores a point often overlooked in text-analytics projects: even the most sophisticated NLP pipeline will fail to drive change if the output is not presented in a format that busy academics can quickly understand and trust.
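
Whatever the study's own reports looked like, BERTopic ships with interactive Plotly views that make a reasonable baseline for teams starting out; a sketch, continuing the example above:

    # Built-in interactive bar chart of top keywords per theme; a generic
    # baseline, not the study's custom visualisations.
    fig = topic_model.visualize_barchart(top_n_topics=5)
    fig.write_html("mset_topics.html")  # shareable with teaching staff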

The approach revealed the limits of response rates. The researchers initially targeted courses with 300 or more students, assuming that large enrolments would yield sufficient data even with low completion rates. In practice, some large courses had response rates as low as 1%, while smaller courses with proactive instructor promotion reached 50%. At a 1% rate, a 300-student course yields only around three comments, far too few for topic modelling to detect any pattern. This is an important reminder that the volume of text available for analysis depends as much on survey culture and promotion as on class size, a point explored in our summary of non-response bias in student evaluations.

Instructors saw the tool as a complement to, not a replacement for, their own reading. Rather than wanting the machine to replace human interpretation, participants valued it as a way to surface patterns they might otherwise miss and to confirm their existing impressions with evidence. This aligns with the broader literature on instructor engagement with student evaluations, which emphasises that feedback is most effective when it supports reflective practice rather than delivering verdicts.

Practical implications for UK higher education

Although the study took place in Canada, its findings transfer directly to the UK context. British universities routinely collect free-text comments through module evaluations, the National Student Survey (NSS), the Postgraduate Taught Experience Survey (PTES), the Postgraduate Research Experience Survey (PRES), and the UK Engagement Survey (UKES). The challenge of processing these comments at scale is, if anything, greater in the UK given the sector's emphasis on using survey results for quality enhancement and regulatory accountability.

Several implications stand out:

  • Mid-semester feedback deserves the same analytical attention as end-of-year surveys. Formative feedback loops are only valuable if the data they generate can be turned around quickly enough for instructors to act. Machine-learning analysis can dramatically compress the time between data collection and insight.
  • Visualisation and report design should be co-designed with academic staff. The study shows that instructor buy-in depends heavily on how results are presented. Institutions investing in text analytics should involve end-users from the start, not just data scientists.
  • Low response rates remain a barrier to reliable analysis. Institutions need to invest in survey promotion strategies for teaching evaluations, including giving students clear evidence that their comments lead to change, if they want enough qualitative data for meaningful machine analysis.

"Overall feedback from instructors was positive, suggesting that machine learning can enhance instructor engagement with qualitative student feedback."

The study also highlights a broader strategic point for Pro-Vice-Chancellors and Student Experience teams: the value of student comments is not just in what they say, but in whether anyone reads and acts on them. If machine analysis can lower the barrier to engagement, it has the potential to close the feedback loop more effectively than simply publishing raw comments ever could.

FAQ

Q: How could UK universities apply mid-semester machine-learning analysis of student comments in practice?

A: Institutions could integrate topic modelling into their existing module evaluation platforms, then use text analysis software designed for education feedback to scale reporting across larger modules or departments. The key is to deliver results back to teaching staff within days, not weeks, so there is still time to make adjustments. Pairing machine-generated theme reports with brief guidance on interpretation — as the study recommends — would help instructors translate findings into concrete changes for the remainder of the term.

Q: What are the limitations of using topic modelling on student evaluation comments?

A: Topic modelling works best when there is a sufficient volume of comments to detect patterns. Very short responses (e.g. "good" or "fine") provide little information for the algorithm, and low response rates can skew the themes towards a vocal minority. The study also notes that machine-generated topic labels can sometimes be opaque and require human refinement. These are known limitations, and combining automated analysis with human review — as the researchers recommend — helps mitigate them.
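
One common mitigation is to screen out one-word answers before modelling. A minimal pre-filter might look like the sketch below; the four-word threshold is an assumed cut-off, not one taken from the study.

    # Drop very short, low-information responses ("good", "fine") before
    # topic modelling; the 4-word threshold is an illustrative assumption.
    def filter_comments(comments, min_words=4):
        return [c for c in comments if len(c.split()) >= min_words]

    usable = filter_comments(
        ["good", "fine", "More worked examples in tutorials would help"]
    )
    # usable == ["More worked examples in tutorials would help"]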

Q: Does automated analysis risk reducing the richness of student feedback to a set of simplified themes?

A: This is a valid concern, and the study's instructors raised it themselves. The consensus was that automated analysis should complement, not replace, direct reading of comments. Machine-generated themes provide a useful overview and help prioritise attention, but individual comments — particularly outliers or detailed narratives — still need to be read in context. The most effective approach combines both: a high-level thematic summary for quick orientation, with the ability to drill down into the underlying comments for nuance and detail.
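
If the model behind the summary is BERTopic, that drill-down can be sketched directly from the fitted model; as before, the output format is our illustration rather than anything from the paper.

    # Summary plus drill-down: per-theme keywords for orientation, then a
    # couple of representative comments for context.
    for topic_id in topic_model.get_topics():
        if topic_id == -1:  # skip the outlier bucket
            continue
        keywords = [word for word, _ in topic_model.get_topic(topic_id)][:5]
        print("Theme:", ", ".join(keywords))
        for doc in topic_model.get_representative_docs(topic_id)[:2]:
            print("  e.g.", doc)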

References

Minakova, V., Patterson, R., Potter, B., Wang, S., Wei, L., Xiao, Y., Junkins, C., & Greenberg, S. (2026). Understanding university instructor responses to machine-learning analysis of mid-semester teaching evaluations. Assessment & Evaluation in Higher Education. https://doi.org/10.1080/02602938.2026.2619428

Request a walkthrough

Book a free Student Voice Analytics demo

See all-comment coverage, sector benchmarks, and reporting designed for OfS quality and NSS requirements.

  • All-comment coverage with HE-tuned taxonomy and sentiment.
  • Versioned outputs with TEF-ready reporting.
  • Benchmarks and BI-ready exports for boards and Senate.

Prefer email? info@studentvoice.ai

UK-hosted · No public LLM APIs · Same-day turnaround
