By David Griffin
Natural language processing (NLP) sits at the intersection of linguistics and machine learning, dealing with the automated interpretation and production of language. Over the last decade, the field has become ever more present in our daily lives. Concerning, however, are the numerous examples of NLP system failures. These include grading systems biased against minorities (Feathers, 2019) and mistranslations resulting in inappropriate police responses (Hern, 2017).
Such failures are often built into the technology through insufficient or inappropriate system training and testing. A common cause is the underrepresentation of minorities in training data, which introduces inherent bias into the NLP system. Other reliability concerns include sensitivity to noise (Goodfellow et al., 2015) and poor out-of-distribution generalisation, where the data encountered in testing or deployment differs from the distribution the system was trained on (Fisch et al., 2019).
With these concerns in mind, Singaporean researchers recently published a paper proposing changes to the manner in which NLP systems are developed and evaluated (Tan et al., 2021). Their work considers systems which take text as an input and either categorise it through labelling or produce a text response as an output.
According to Tan et al. (2021), most approaches to NLP system reliability (robustness) testing produce unrealistically pessimistic estimates of worst-case performance. This is because they often rely on the evaluation technique known as the adversarial attack, in which knowledge of how a system was designed is used to exploit its weaknesses. In the context of NLP systems, such attacks often rely on language models or word embeddings (representations in which words with similar meanings lie close together, allowing synonyms to be substituted). The authors argue that these attacks generally perturb multiple ‘dimensions’ (or variables), such as wording, grammar and syntax, concurrently. Consequently, the adversarial examples may be unnatural or represent distant outliers, resulting in vanishingly improbable failures being flagged.
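To illustrate why stacking perturbation dimensions can produce unnatural inputs, consider a toy sketch (all names and rules here are hypothetical, invented for illustration; this is not the attack procedure of any cited system):

```python
import random

# Toy synonym table for the 'wording' dimension (illustrative only).
SYNONYMS = {"good": "beneficial", "movie": "motion picture"}

def perturb_wording(tokens):
    """Substitute dictionary synonyms (wording dimension)."""
    return [SYNONYMS.get(t, t) for t in tokens]

def perturb_spelling(tokens, rng):
    """Randomly drop the last character of longer tokens (spelling dimension)."""
    return [t[:-1] if len(t) > 4 and rng.random() < 0.5 else t for t in tokens]

def multi_dimension_attack(sentence, seed=0):
    """Perturb several dimensions at once, as unconstrained attacks do.

    Stacking dimensions this way can yield text no real user would write,
    so any failure it triggers may be vanishingly improbable in practice.
    """
    rng = random.Random(seed)
    tokens = sentence.lower().split()
    return " ".join(perturb_spelling(perturb_wording(tokens), rng))
```

Even in this tiny example, the combined output drifts away from anything a real user would plausibly type, which is exactly the authors’ objection.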
Reliability testing is also often done using uniquely created challenge datasets covering differing writing styles or domains. Alternatively, a human-versus-model method of dataset construction may be used, in which humans (either experts or non-experts in a field) repeatedly craft examples intended to fool the model, and the successful ones are added to the dataset. However, developing a unique challenge dataset for every new NLP system is impractical. As a result, free-to-use crowdsourced datasets are often utilised instead, but these introduce their own inherent biases into the tested system.
To counter these concerns, the authors propose a method of NLP reliability testing which is dimension-specific and uses quantitative test measures to help ensure a system’s safety and equity for users. Their reliability tests fall into two categories: average-case tests and worst-case tests. The former estimate performance within the normal bounds of a system’s use; the latter consider scenarios less likely to occur. Together, the two groups complement one another and ensure a system’s ability to deal with out-of-distribution generalisation is tested. According to the authors, implementing reliability tests in this manner can improve quality control of a system, while enabling more nuance in the discussion and interpretation of system failure. In the published paper, the authors provide example code: from the average-case and worst-case tests, real-world and worst-case reliability scores, respectively, can be generated, giving a given NLP system quantitative measures of reliability.
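The two scores can be roughly illustrated as below. This is a simplified sketch, not the authors’ exact formulation; `metric` and `perturb` are placeholder callables, and the toy model and data are invented here:

```python
def average_case_reliability(model, examples, metric):
    """Mean performance over examples drawn from a realistic
    input distribution (average-case test)."""
    scores = [metric(model(x), gold) for x, gold in examples]
    return sum(scores) / len(scores)

def worst_case_reliability(model, examples, perturb, metric):
    """Mean, over examples, of the minimum performance across
    dimension-specific perturbations (worst-case test)."""
    worst = []
    for x, gold in examples:
        candidates = [x] + list(perturb(x))
        worst.append(min(metric(model(c), gold) for c in candidates))
    return sum(worst) / len(worst)

# Toy demonstration: a keyword 'sentiment model' and a spelling perturbation.
toy_model = lambda text: "pos" if "good" in text else "neg"
accuracy = lambda pred, gold: 1.0 if pred == gold else 0.0
examples = [("good movie", "pos"), ("bad movie", "neg")]
misspell = lambda text: [text.replace("good", "gud")]
```

On this toy data the average-case score is 1.0 while the worst-case score drops to 0.5, quantifying exactly the kind of reliability gap the authors want tests to expose.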
The next section outlines how the authors propose to apply these methods within a repeatable framework.
There are six steps for implementation based on the acronym DOCTOR.
The demographics of stakeholders and their values must be understood before testing is devised, and the potential impact of the NLP system on the lives of the people who will interact with it must be considered. To guide this, the authors suggest asking three questions, which are set out in their paper.
Due to the sheer number of potential dimensions, stakeholder advocates and NLP experts should be involved in this step, and ethicists should be consulted to determine acceptable thresholds for worst-case tests. To further explain this step, the authors use the exemplar scenario of an Automated Text Scoring (ATS) system, which is used to grade exams and essays. The stakeholders here are the students and their schools. Relevant demographic factors include stakeholders’ use of a given language and the fluency of particular social or socioeconomic groups in that language.
Dimensions must be defined in terms of the operations used to describe them; that is, chosen dimensions need to be tested in the context in which they are intended to exist. This allows perturbed examples to be sampled and recorded as tests. Average-case tests need datasets that reflect the real-world distributions the NLP system is likely to encounter. Worst-case tests depend less on likely scenarios, since they are intended to define possible perturbations beyond the probable. In the exemplar ATS scenario, misspellings in a submitted text should be discounted as errors where grammar alone is being tested. In this manner, the vital dimensions and their contexts are defined.
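For instance, the ‘discount misspellings when grading grammar’ operationalisation might be sketched as follows (helper names are hypothetical, and a real ATS system would use a proper spell-checker rather than this fuzzy-matching stand-in):

```python
import difflib

def normalise_spelling(token, vocabulary, cutoff=0.6):
    """Map a likely misspelling to its closest in-vocabulary word,
    so spelling noise does not leak into a grammar-only test."""
    if token in vocabulary:
        return token
    matches = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token

def grammar_test_input(text, vocabulary):
    """Prepare text for the operationalised 'grammar' dimension:
    spelling perturbations are removed before the grader sees it."""
    return " ".join(normalise_spelling(t, vocabulary) for t in text.split())
```

With this in place, a grammar test can perturb word order or agreement while holding the spelling dimension fixed.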
Average-case tests, according to the authors, can be constructed either manually or using model-based perturbation generators like Polyjuice (Wu et al., 2021). Worst-case tests can be rule- or model-based, using systems like Morpheus or BERT-Attack, respectively (Tan et al., 2020; Li et al., 2020). The authors argue that some tests should be conducted under the black-box assumption, that is, without knowledge of the parameters used to design the NLP model. This ensures stakeholders, and potentially regulators, can trust the results.
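A rule-based worst-case generator can be sketched as below. This is a toy stand-in for illustration only: the rules are invented here, and the real Morpheus searches inflectional variants far more systematically.

```python
# Toy inflection-swap rules (illustrative; not Morpheus's actual rules).
INFLECTION_RULES = {"is": "are", "was": "were", "has": "have"}

def rule_based_perturbations(sentence):
    """Yield one candidate sentence per applicable rule, for use as
    worst-case test inputs against the model."""
    tokens = sentence.split()
    candidates = []
    for i, tok in enumerate(tokens):
        if tok in INFLECTION_RULES:
            perturbed = tokens[:i] + [INFLECTION_RULES[tok]] + tokens[i + 1:]
            candidates.append(" ".join(perturbed))
    return candidates
```

Because such a generator needs only the input text and the model’s outputs, it is compatible with the black-box assumption the authors recommend.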
Testing can be divided according to levels within and beyond the company producing the NLP. The authors use a three-level example, made up of the internal development team, the company and the industry at large (or regulator).
The reliability tests devised by the development team are meant to identify weaknesses in how target users will use the NLP system. Combining worst-case test examples, constrained to the specific dimensions chosen earlier, with average-case test examples will, the authors suggest, give the development team insight into how different design factors affect reliability.
At the company level, internal ‘red teams’ should be utilised, using reliability tests to identify concerns with the NLP system’s safety and security. These tests are likely to be broader in scope than those used by the development team. Furthermore, reliability testing standards can be developed to ensure compliance across the multiple NLP systems created within the company. Publishing the standards developed and adhered to allows public critique and can foster trust between the company and the users of its products.
Beyond the company level, industry regulators should require further reliability testing, similar to the ISO testing and auditing required of comparable industries. These requirements would be more stringent for higher-risk NLP systems and potentially less so for lower-risk ones. Within these standards, both average-case and worst-case test results could be included and published.
The authors stress the importance of monitoring the impact of the NLP system beyond its launch, enabling reliability tests, dimensions and their accepted thresholds to be updated in response. Stakeholders and users can be encouraged to provide feedback, raise issues or seek remediation through online resources or, for the products of larger companies, via community juries.
NLP systems can make positive contributions to technologies and our lives. However, those tasked with their development have a responsibility to avoid perpetuating dangerous stereotypes and damaging or underserving minority communities. To help with this charge, Tan et al. (2021) have outlined the potential benefits of replacing Adversarial Attacks with their suggested reliability tests. They have also presented a framework for their implementation. Furthermore, they have outlined the need for company and industry standards to ensure accountability in the field.
Goodfellow IJ, Shlens J, Szegedy C. 2015. Explaining and harnessing adversarial examples. In 3rd International Conference on Learning Representations, San Diego, California.
Fisch A, Talmor A, Jia R, Seo M, Choi E, Chen D. 2019. MRQA 2019 shared task: Evaluating generalization in reading comprehension. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering, pages 1–13, Hong Kong, China. Association for Computational Linguistics.
Tan S, Joty S, Kan MY, Socher R. 2020. It’s morphin’ time! Combating linguistic discrimination with inflectional perturbations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2920–2935, Online. Association for Computational Linguistics.
Li L, Ma R, Guo Q, Xue X, Qiu X. 2020. BERT-ATTACK: Adversarial attack against BERT using BERT. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, Online. Association for Computational Linguistics.
Tan S, Joty S, Baxter K, Taeihagh A, Bennett GA, Kan MY. 2021. Reliability testing for natural language processing systems. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Online. Association for Computational Linguistics.
Wu T, Ribeiro MT, Heer J, Weld DS. 2021. Polyjuice: Generating counterfactuals for explaining, evaluating, and improving models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Online. Association for Computational Linguistics.