Research on Education Text Analysis and Teaching Best Practice
By David Griffin
Humans are naturally slow to recognise patterns in text. However, this ability is vitally important for many applications, from monitoring political discourse to assessing the functionality of machine learning programs.
To address this challenge, Zhong et al. (2022) recently published a paper outlining the methods they used to recognise and describe differences between two distributions of text. In short, they sought to create a criterion that applies more often to one text distribution than to the other. To illustrate this, the authors used the simple example of social media comments related to the current SARS-CoV-2 pandemic. If publicly posted comments from two consecutive years were considered, each year would form its own distinct distribution. The criterion used to differentiate between the two distributions might be that one contains more optimistic language. In this way, machine learning could be employed to efficiently gauge public opinion on the pandemic or, more specifically, on a government’s response to it.
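The idea of a criterion that applies "more often" to one distribution can be made concrete with a small sketch. This is not the authors' implementation: the keyword check below is a hypothetical stand-in for a real judgement of optimism, and the comments and function names are invented for illustration.

```python
from typing import Callable, Sequence


def fraction_satisfying(samples: Sequence[str], criterion: Callable[[str], bool]) -> float:
    """Share of samples for which the criterion holds (0.0 for an empty input)."""
    if not samples:
        return 0.0
    return sum(1 for s in samples if criterion(s)) / len(samples)


def separation(dist_a: Sequence[str], dist_b: Sequence[str],
               criterion: Callable[[str], bool]) -> float:
    """How much more often the criterion holds in dist_a than in dist_b."""
    return fraction_satisfying(dist_a, criterion) - fraction_satisfying(dist_b, criterion)


# Assumption: a crude keyword list stands in for "is optimistic about the pandemic".
OPTIMISTIC_WORDS = ("reducing", "recovering", "hope")


def is_optimistic(comment: str) -> bool:
    text = comment.lower()
    return any(word in text for word in OPTIMISTIC_WORDS)


# Invented example comments for two consecutive years.
year_one = ["Hospitalisations are increasing", "Cases keep rising", "There is hope at last"]
year_two = ["Hospitalisations are reducing", "People are recovering", "Cases keep rising"]

# A positive score means the criterion fits year_two more often than year_one.
score = separation(year_two, year_one, is_optimistic)
```

A score near zero would suggest the criterion does not distinguish the two years, while a score near 1 or -1 would suggest it applies almost exclusively to one of them.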
The method developed by Zhong et al. (2022) uses Generative Pre-trained Transformer 3 (GPT-3) (Brown et al., 2020), a machine learning tool capable of producing human-like text in response to inputs. In this case, the inputs were the two text distributions to be differentiated between.
Due to the way GPT-3 operates, if the distributions are large, they cannot be used as inputs in their entirety. Consequently, a sample from each distribution must be used instead. From these samples, natural language hypotheses are proposed about each sample. These ‘candidate hypotheses’, as the authors described them, should describe an aspect of the sample which enables differentiation between its source distribution and the other. In the social media comment example previously outlined, the candidate hypothesis could be “is optimistic about the pandemic” (Zhong et al., 2022).
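The sampling step described above might be sketched as follows. The prompt wording, the sample size and the function names are assumptions made for illustration; the sketch only assembles the text that would be sent to a language model and does not call GPT-3.

```python
import random
from typing import List, Sequence


def sample_texts(distribution: Sequence[str], k: int, seed: int = 0) -> List[str]:
    """Draw up to k texts without replacement, reproducibly for a given seed."""
    rng = random.Random(seed)
    k = min(k, len(distribution))
    return rng.sample(list(distribution), k)


def build_proposer_prompt(group_a: Sequence[str], group_b: Sequence[str]) -> str:
    """Lay out both samples and end with an open sentence for the model to complete.

    Assumption: this layout is hypothetical, not the authors' exact prompt.
    """
    lines = [f"Group A: {text}" for text in group_a]
    lines.append("")
    lines.extend(f"Group B: {text}" for text in group_b)
    lines.append("")
    lines.append("Compared to sentences from Group B, each sentence from Group A")
    return "\n".join(lines)


# Invented comments standing in for two large distributions.
year_two = [f"optimistic comment {i}" for i in range(1000)]
year_one = [f"pessimistic comment {i}" for i in range(1000)]

prompt = build_proposer_prompt(sample_texts(year_two, 5), sample_texts(year_one, 5))
```

The model's completion of the final open sentence, for example “is optimistic about the pandemic”, would then serve as a candidate hypothesis.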
Though GPT-3 produces human-like text, it was not specifically designed to generate hypotheses in this manner. As a result, there was no existing corpus (a library of written and spoken material which can be used for the training of machine learning technologies) for the authors to use.
To overcome this challenge, they developed a system to fine-tune the hypothesis proposals generated. This involved three stages:
First, a list of candidate hypotheses was made. This combined hypotheses written by hand with those generated by GPT-3. Using the social media example, a hypothesis might be “is optimistic about the pandemic”.
Second, GPT-3 was used to generate both positive and negative samples based on each candidate hypothesis. Positive samples are those which fulfil the criteria of a given hypothesis; negative samples are those which do not. A positive sample comment in the given example might be the optimistic statement, “hospitalisations are reducing”. A negative sample comment could be “hospitalisations are increasing”.
Third, the generated samples were manually screened to ensure that positive samples truly fulfilled the criteria of the respective hypothesis and that negative samples truly did not.
At this stage in the process, for each given hypothesis, the authors had a large selection of samples which definitively fulfilled or did not fulfil it. These samples were then used to train GPT-3 to develop its own hypothesis. In this way, GPT-3 was fine-tuned to produce a hypothesis in natural language in response to input samples from two distinct distributions.
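One way to picture how screened samples could become fine-tuning data is sketched below. The record layout (a prompt paired with a completion), the class and the field names are hypothetical and are not taken from Zhong et al. (2022); the sketch only shows the hypothesis being used as the target text for a prompt built from its approved positive and negative samples.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ScreenedSample:
    text: str
    is_positive: bool  # was it generated to fulfil the hypothesis?
    approved: bool     # did the human screener confirm that label?


def build_training_record(hypothesis: str,
                          samples: List[ScreenedSample]) -> Optional[Dict[str, str]]:
    """Pair approved positive/negative samples (the input) with the hypothesis (the target)."""
    positives = [s.text for s in samples if s.is_positive and s.approved]
    negatives = [s.text for s in samples if not s.is_positive and s.approved]
    if not positives or not negatives:
        return None  # nothing to contrast, so no usable training example
    lines = [f"Group A: {text}" for text in positives]
    lines.extend(f"Group B: {text}" for text in negatives)
    lines.append("Compared to sentences from Group B, each sentence from Group A")
    return {"prompt": "\n".join(lines), "completion": " " + hypothesis}


# Invented samples for the social media example; the third one failed screening.
samples = [
    ScreenedSample("hospitalisations are reducing", is_positive=True, approved=True),
    ScreenedSample("hospitalisations are increasing", is_positive=False, approved=True),
    ScreenedSample("the weather is nice today", is_positive=True, approved=False),
]

record = build_training_record("is optimistic about the pandemic", samples)
```

Samples rejected during manual screening are simply left out, so only examples with a confirmed label contribute to the training record.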
In testing using 54 binary classification datasets (Zhong et al., 2021), when compared with human annotations, the classifier developed by Zhong et al. (2022) produced similar outputs in a promising 76% of cases after fine-tuning.
There are several challenges to classifying text in this manner, however, as stressed by the authors. Natural language is inherently ambiguous and open to interpretation. Both the language used within distributions and the hypotheses themselves have the potential to be imprecise or hold biases. This can be further exacerbated by cultural and social differences. The results of the 54 classification datasets used to test this system were also validated by hand using manual annotation.
This is time- and resource-intensive; however, at present there is no alternative. Yet the significance of these challenges pales when the range of potential applications for this form of automated language analysis is considered. The authors stress that while general text distribution was the focus of this work, applications could include analysis of anything which involves a language output. This could include analysing and comparing a vast range of human experiences, from tastes (Nozaki & Nakamoto, 2018) to traumatic events (Demszky et al., 2019).
Furthermore, it could even be applied in forms of psychological profiling, through the identification of writing patterns associated with specific psychological signatures (Boyd & Pennebaker, 2015). The authors suggest that, in effect, the list of potential applications is endless.
Boyd, R. L. and Pennebaker, J. W. 2015. Did Shakespeare write Double Falsehood? Identifying individuals by creating psychological signatures with text analysis. Psychological Science, 26(5): 570–582.
Demszky, D., Garg, N., Voigt, R., Zou, J. Y., Gentzkow, M., Shapiro, J. M. and Jurafsky, D. 2019. Analyzing polarization in social media: Method and application to tweets on 21 mass shootings. NAACL, arXiv 2019.
Nozaki, Y. and Nakamoto, T. 2018. Predictive modeling for odor character of a chemical using machine learning combined with natural language processing. PLoS ONE, 13(6): e0198475.
Zhong, R., Lee, K., Zhang, Z. and Klein, D. 2021. Adapting language models for zero-shot learning by meta-tuning on dataset and prompt collections. Findings of the Association for Computational Linguistics, 2856–2878. Dominican Republic.