Published Apr 25, 2022 · Updated Mar 01, 2026
How do you spot what has changed across thousands of comments, without reading them all? Humans are naturally slow at recognising patterns in text, yet this ability is vitally important for many applications, from monitoring political discourse to assessing the performance of machine learning programs.
To address this challenge, Zhong et al. (2022) recently published a paper outlining methods to recognise and describe differences between two distributions of text. In short, they sought to create criteria that apply more often to one text distribution than the other. To illustrate this, the authors used a simple example: social media comments about the SARS-CoV-2 pandemic. If you take publicly posted comments from two consecutive years, each year forms a distinct distribution. A differentiating criterion might be that one year’s comments contain more optimistic language. In this way, machine learning could be used to efficiently gauge public opinion on the pandemic, or more specifically, on a government’s response to it. Crucially, the goal is to describe the difference in natural language, not just label it.
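To make the idea of a differentiating criterion concrete, here is a minimal Python sketch (ours, not the authors' code): a candidate description is useful to the extent that it applies more often to one corpus than to the other. The applies function and the toy comments are hypothetical stand-ins for whatever procedure, human or model, judges a comment against a description.

```python
# A criterion that "applies more often to one distribution than the other" can
# be scored by comparing how often it holds in samples from each corpus.

def applies(comment: str, hypothesis: str) -> bool:
    # Hypothetical stand-in: in practice a model (or annotator) judges whether
    # the hypothesis, e.g. "is optimistic about the pandemic", fits the comment.
    return "optimis" in comment.lower()

def applicability_gap(corpus_a, corpus_b, hypothesis):
    """Fraction of corpus A matching the hypothesis, minus the fraction of B."""
    rate_a = sum(applies(c, hypothesis) for c in corpus_a) / len(corpus_a)
    rate_b = sum(applies(c, hypothesis) for c in corpus_b) / len(corpus_b)
    return rate_a - rate_b

year_one = ["cases keep climbing", "no end in sight", "more restrictions again"]
year_two = ["vaccines make me optimistic", "optimistic about reopening", "cases falling"]

# A large positive gap suggests the hypothesis distinguishes year two from year one.
print(applicability_gap(year_two, year_one, "is optimistic about the pandemic"))  # ~0.67
```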
The method developed by Zhong et al. (2022) uses Generative Pre-trained Transformer 3 (GPT-3; Brown et al., 2020), a machine learning tool capable of producing human-like text in response to prompts. Here, the prompts were samples from the two text distributions being compared.
Because GPT-3 can only attend to a limited amount of text at once, entire distributions cannot be used as inputs. Instead, a sample is drawn from each distribution, and natural language hypotheses are proposed from these samples. Each of these “candidate hypotheses”, as the authors call them, should describe a property that distinguishes the sample’s source distribution from the other. In the social media example above, a candidate hypothesis might be “is optimistic about the pandemic” (Zhong et al., 2022).
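The proposal step can be pictured as prompt construction. The sketch below assumes a hypothetical complete() wrapper around a GPT-3 call, and the prompt wording is illustrative rather than the authors’ exact template:

```python
import random

def build_proposal_prompt(sample_a, sample_b, k=3):
    """Show k comments from each distribution and invite the model to finish a
    sentence describing how Group A differs from Group B."""
    lines = ["Group A comments:"]
    lines += [f"- {c}" for c in random.sample(sample_a, k)]
    lines += ["", "Group B comments:"]
    lines += [f"- {c}" for c in random.sample(sample_b, k)]
    lines += ["", "Compared with Group B, each comment in Group A ..."]
    return "\n".join(lines)

# candidate = complete(build_proposal_prompt(year_two, year_one))
# e.g. "is optimistic about the pandemic"
```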
Although GPT-3 produces human-like text, it was not specifically designed to generate hypotheses in this manner. As a result, there was no existing corpus (a body of written or spoken material used to train machine learning systems) for the authors to draw on.
To overcome this challenge, they developed a system to fine-tune the hypothesis proposals. This involved three stages:

1. A list of candidate hypotheses was assembled, combining some written manually with others generated by GPT-3, such as the example above, “is optimistic about the pandemic”.

2. GPT-3 was then used to generate both positive and negative samples for each candidate hypothesis. Positive samples are those that fulfil the criteria of a given hypothesis; negative samples are those that do not. A positive sample comment might be the optimistic statement “hospitalisations are reducing”, while a negative sample comment could be “hospitalisations are increasing”.

3. The generated samples were then manually screened to confirm that positive samples truly fulfilled the criteria of the respective hypothesis and that negative samples truly did not (a sketch of stages two and three follows this list).
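Here is a hedged sketch of stages two and three, assuming a hypothetical generate_samples() wrapper around GPT-3 and a judge callable standing in for the human screener:

```python
def generate_samples(hypothesis: str, positive: bool, n: int) -> list[str]:
    # Hypothetical stand-in for a GPT-3 call that writes n comments which do
    # (positive=True) or do not (positive=False) satisfy the hypothesis.
    canned = {
        True: ["hospitalisations are reducing", "vaccines are working well"],
        False: ["hospitalisations are increasing", "restrictions keep tightening"],
    }
    return canned[positive][:n]

def screen(hypothesis, candidates, expect_match, judge):
    """Stage three: keep only samples whose judged label matches the intent."""
    return [text for text in candidates if judge(hypothesis, text) == expect_match]

hypothesis = "is optimistic about the pandemic"
# In the paper the judging is manual; a trivial keyword check stands in here.
judge = lambda h, text: "reducing" in text or "working" in text
positives = screen(hypothesis, generate_samples(hypothesis, True, 50), True, judge)
negatives = screen(hypothesis, generate_samples(hypothesis, False, 50), False, judge)
```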
At this point, for each hypothesis, the authors had a large selection of samples that definitively did or did not meet its criteria. These samples were then used to fine-tune GPT-3, so it could generate a hypothesis in natural language in response to samples from two distinct distributions.
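The screened samples can then be packaged as training records. The exact format the authors used is not reproduced here; this sketch follows the generic prompt/completion JSONL layout used for GPT-3 fine-tuning at the time:

```python
import json

def to_record(hypothesis, group_a, group_b):
    """One training example: screened samples from two groups in the prompt,
    the verified hypothesis as the target completion."""
    prompt = (
        "Group A comments:\n" + "\n".join(f"- {c}" for c in group_a) + "\n\n"
        "Group B comments:\n" + "\n".join(f"- {c}" for c in group_b) + "\n\n"
        "Compared with Group B, each comment in Group A ..."
    )
    return {"prompt": prompt, "completion": " " + hypothesis}

record = to_record(
    "is optimistic about the pandemic",
    ["hospitalisations are reducing", "vaccines are working well"],
    ["hospitalisations are increasing", "restrictions keep tightening"],
)
with open("finetune.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```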
In tests on 54 binary classification datasets (Zhong et al., 2021), the classifier developed by Zhong et al. (2022) produced outputs similar to manual human annotations in 76% of cases after fine-tuning.
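As a back-of-the-envelope check on that headline figure (illustrative numbers, not the paper’s data):

```python
# 41 of 54 dataset descriptions judged similar to human annotations is ~76%.
judgements = [True] * 41 + [False] * 13
print(f"{sum(judgements) / len(judgements):.0%}")  # -> 76%
```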
There are several challenges to classifying text in this manner, as the authors stress. Natural language is inherently ambiguous and open to interpretation: both the language within the distributions and the hypotheses themselves can be imprecise or biased, and cultural and social differences can exacerbate this. Validation is a further practical challenge: the results from the 54 classification datasets used to test this system were checked by hand using manual annotation, which is time- and resource-intensive, yet at present there is no alternative.
The significance of these challenges pales, however, when the range of potential applications for this form of automated language analysis is considered. The authors stress that although general text distributions were the focus of this work, the method could be applied to anything that involves a language output. This could include analysing and comparing a vast range of human experiences, from tastes (Nozaki & Nakamoto, 2018) to traumatic events (Demszky et al., 2019).
Furthermore, it could even be applied to psychological profiling through the identification of writing patterns associated with specific psychological signatures (Boyd & Pennebaker, 2015). The authors suggest that, in effect, the list of potential applications is endless.
Key takeaway: Zhong et al. (2022) outline a workflow where a language model proposes natural-language hypotheses that distinguish one set of text from another. This can make large, messy text corpora easier to summarise and discuss, but it still relies on careful sampling and manual screening to manage ambiguity and bias.
Q: How does the approach of using GPT-3 for text analysis compare to traditional methods of text analysis in terms of efficiency and accuracy, especially in educational settings where Student Voice is paramount?
A: The approach described by Zhong et al. (2022) shows how GPT-3 can speed up text analysis compared with traditional manual coding. Manual approaches can be time-consuming and inconsistent, especially at scale. GPT-3 can process large volumes of text quickly and propose patterns and hypotheses for people to review, which can help educational teams surface themes in Student Voice sooner and support more timely decisions.
Q: What implications does the method developed by Zhong et al. (2022) have for enhancing Student Voice in educational research and policy-making?
A: The method developed by Zhong et al. (2022) suggests a way to analyse student feedback at scale, without losing everything to summary statistics. By using GPT-3 to compare text distributions, researchers and policymakers can more easily identify emerging themes, concerns, and changes over time. Used carefully, this can support more student-centred policies and practices because decisions can be grounded in a richer picture of what students are actually saying.
Q: How does the method address potential biases in text analysis, and what are the implications for analysing texts that represent diverse Student Voices?
A: Zhong et al. (2022) stress that natural language is ambiguous and open to interpretation, which means bias and imprecision are real risks. Their approach includes manual screening during fine-tuning to check whether generated samples truly match a hypothesis. In practice, reducing bias also depends on how you sample text and who is involved in screening and annotation, especially when analysing feedback from different cultural and social groups. This matters in education, because you want the outputs to reflect diverse Student Voices rather than just the most common language patterns.
[Source Paper] Zhong, R., Snell, C., Klein, D., Steinhardt, J. 2022. Summarizing differences between text distributions with natural language (Preprint).
DOI: 10.48550/arXiv.2201.12323
[1] Boyd, R. L. and Pennebaker, J. W. 2015. Did Shakespeare write double falsehood? Identifying individuals by creating psychological signatures with text analysis. Psychological science, 26(5):570–582.
DOI: 10.1177/0956797614566658
[2] Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems 33 (NeurIPS 2020).
DOI: 10.48550/arXiv.2005.14165
[3] Demszky, D., Garg, N., Voigt, R., Zou, J. Y., Gentzkow, M., Shapiro, J. M., Jurafsky, D. 2019. Analyzing polarization in social media: Method and application to tweets on 21 mass shootings. Proceedings of NAACL-HLT 2019.
DOI: 10.48550/arXiv.1904.01596
[4] Nozaki, Y., Nakamoto, T. 2018. Predictive modeling for odor character of a chemical using machine learning combined with natural language processing. PLoS ONE, 13(6): e0198475.
DOI: 10.1371/journal.pone.0198475
[5] Zhong, R., Lee, K., Zhang, Z., Klein, D. 2021. Adapting language models for zero-shot learning by meta-tuning on dataset and prompt collections. Findings of the Association for Computational Linguistics: EMNLP 2021, pp. 2856–2878. Punta Cana, Dominican Republic.
DOI: 10.48550/arXiv.2104.04670