Reducing Bias in Natural Language Processing Systems

By David Griffin

Natural language processing (NLP) sits at the intersection of linguistics and machine learning. It deals with the automated interpretation and production of language. Within the last decade, the field has become ever more present in our daily lives. Concerning, however, are the numerous examples of NLP system failures. These include grading systems biased against minorities (Feathers, 2019) and mistranslations resulting in inappropriate police responses (Hern, 2017).

Such failures are often built into the technology through insufficient or inappropriate system training and testing. This often includes the underrepresentation of minorities in the data, which in turn introduces inherent bias to the NLP system. Other concerns with the reliability of NLP systems include their sensitivity to noise (Goodfellow et al., 2015) and how they deal with out-of-distribution generalisation, a problem which arises when the data a system meets in testing or in real-world use differs from the data it was trained on (Fisch et al., 2019).

With these concerns in mind, Singaporean researchers recently published a paper proposing changes to the manner in which NLP systems are developed and evaluated (Tan et al., 2021). Their work considers systems which take text as an input and either categorise it through labelling or produce a text response as an output.

According to Tan et al. (2021), most approaches to NLP system reliability (robustness) testing overestimate the worst-case performance of the system. This is because they often rely on the evaluation technique known as Adversarial Attack, in which knowledge of how a system has been designed is used to exploit its weaknesses. In the context of NLP systems, this technique is often reliant on language models or language embeddings (used, for example, to find synonyms to substitute). The authors argue that this technique generally tests multiple ‘dimensions’ (or variables), such as wording, grammar or syntax, concurrently. Consequently, the Adversarial Attacks may be unnatural or represent distant outliers of the real-world distribution, resulting in highly improbable failures being flagged.

Reliability testing is also often done using purpose-built challenge datasets which include differing writing styles or domains. Alternatively, a human-versus-model method of dataset construction may be used, in which human experts or non-experts in a field are asked to write examples that challenge a model. However, developing a unique challenge dataset for every new NLP system is impractical. As a result, free-to-use crowdsourced datasets are often used instead, but these introduce their own inherent biases to the tested system.

To counter these concerns, the authors of the paper have proposed a method of NLP reliability testing which is dimension-specific and uses quantitative test measures to help ensure a system's safety and equity for users. They propose the use of reliability tests which can be subdivided into two categories: average-case tests and worst-case tests. The former estimate performance in scenarios within the normal bounds of a system's use; the latter consider scenarios less likely to occur. The two groups complement one another and ensure the abilities of a system to deal with out-of-distribution generalisation are tested. According to the authors, implementing reliability tests in this manner can improve quality control of a system, while enabling more nuance in the discussion and interpretation of system failure. Within the published paper, the authors provide examples of their proposed code. Based on the average-case test and worst-case test, respective real-world and worst-case reliability scores can be generated, providing a given NLP system with quantitative measures of reliability.
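As a rough illustration of how two such scores might be computed, the sketch below uses Python. It is not the authors' code: the toy classifier, the perturbed inputs and their probabilities are all hypothetical placeholders, and the scoring rules are one plausible reading of the idea. Here the average-case score is the probability-weighted accuracy over perturbed versions of each input, while the worst-case score only credits an example if every perturbed version is handled correctly.

```python
from typing import Callable, List, Tuple

# One test case: the gold label plus (perturbed_text, probability) pairs.
# The probabilities are a stand-in for the real-world distribution of a
# single dimension (here, spelling) and should sum to 1 for each case.
TestCase = Tuple[str, List[Tuple[str, float]]]


def average_case_score(model: Callable[[str], str], cases: List[TestCase]) -> float:
    """Probability-weighted accuracy, averaged over all test cases."""
    total = 0.0
    for gold, variants in cases:
        total += sum(prob for text, prob in variants if model(text) == gold)
    return total / len(cases)


def worst_case_score(model: Callable[[str], str], cases: List[TestCase]) -> float:
    """Fraction of test cases where every perturbed variant is handled correctly."""
    passed = sum(
        all(model(text) == gold for text, _ in variants)
        for gold, variants in cases
    )
    return passed / len(cases)


def toy_model(text: str) -> str:
    """Hypothetical classifier that only recognises the exact string 'good'."""
    return "positive" if "good" in text.lower() else "negative"


cases: List[TestCase] = [
    ("positive", [("This is good", 0.7), ("This is goood", 0.2), ("This is gud", 0.1)]),
    ("negative", [("This is bad", 0.9), ("This is baad", 0.1)]),
]

print("average-case:", round(average_case_score(toy_model, cases), 2))
print("worst-case:", worst_case_score(toy_model, cases))
```

In this toy setting the two numbers diverge: the model copes with the most likely spellings but fails on rarer ones, which is the kind of nuance a single accuracy figure would hide.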

The next section outlines how the authors propose to apply these methods within a repeatable framework.

The DOCTOR Framework

Implementation follows six steps, whose initials spell out the acronym DOCTOR:

  • Define reliability requirements.
  • Operationalize dimensions as distributions.
  • Construct tests.
  • Test the system and report results.
  • Observe deployed system’s behaviour.
  • Refine reliability requirements and tests.

Define Reliability Requirements

The demographics of stakeholders and their values must be understood before testing is devised. The potential impact of the NLP system on the lives of the people who will interact with it must be considered. To do this, the authors suggest asking three questions:

  • What dimensions need to be tested?
  • How will system performance be measured?
  • What are acceptable performance thresholds for the chosen dimensions?

Due to the sheer number of potential dimensions, stakeholder advocates and NLP experts should be involved in this step. To determine acceptable thresholds for worst-case tests, ethicists should be consulted. To further explain this step, the authors use the exemplar scenario of an Automated Text Scoring (ATS) system, which is used to grade exams and essays. The stakeholders here are the students and their schools. Demographic factors which may need to be considered include the use of a given language by stakeholders and the fluency of particular societal or socioeconomic groups in that language.
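The three questions above map naturally onto a small record per dimension. The following Python sketch shows one way such requirements could be written down and checked for the ATS scenario. The class, its field names, the dimensions listed and every threshold value are hypothetical illustrations, not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ReliabilityRequirement:
    dimension: str           # what is being tested
    metric: str              # how system performance is measured
    average_case_min: float  # acceptable threshold for average-case tests
    worst_case_min: float    # acceptable threshold for worst-case tests

    def is_met(self, average_case: float, worst_case: float) -> bool:
        """True only if both measured scores reach their thresholds."""
        return (average_case >= self.average_case_min
                and worst_case >= self.worst_case_min)


# Hypothetical requirements for an automated text scoring system.
requirements: List[ReliabilityRequirement] = [
    ReliabilityRequirement("spelling variation", "agreement with human grader", 0.90, 0.75),
    ReliabilityRequirement("non-native grammar", "agreement with human grader", 0.90, 0.80),
]

for req in requirements:
    print(req.dimension, "->", req.is_met(average_case=0.92, worst_case=0.78))
```

Writing the thresholds down explicitly in this way gives stakeholder advocates and ethicists something concrete to review and revise.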

Operationalize Dimensions as Distributions

Dimensions must be defined in terms of the operations used to describe them. That is to say, chosen dimensions need to be tested in the context in which they are intended to exist. This allows perturbed examples to be sampled from the results and recorded as tests. For average-case tests, test datasets are needed which reflect the likely real-world distributions the NLP system will encounter. For worst-case tests, datasets of likely scenarios are less of a requirement, since these tests only need to define the possible perturbations, including those beyond the likely. In the exemplar scenario of the ATS system, misspellings in a submitted text should be discounted as errors where grammar alone is being tested. In this manner, the vital dimensions and their contexts are defined.
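To make the distinction concrete, the Python sketch below treats one simple dimension, typing errors that swap two adjacent letters, in both ways. The function names and the typo rate are assumptions made for illustration only. For the average case, a perturbation is sampled according to an assumed real-world rate; for the worst case, every possible perturbation is enumerated regardless of how likely it is.

```python
import random
from typing import List


def swap_variants(word: str) -> List[str]:
    """All variants of `word` with one pair of differing adjacent letters swapped."""
    return [
        word[:i] + word[i + 1] + word[i] + word[i + 2:]
        for i in range(len(word) - 1)
        if word[i] != word[i + 1]
    ]


def sample_average_case(word: str, typo_rate: float, rng: random.Random) -> str:
    """Draw one input as a user might plausibly type it.

    `typo_rate` is an assumed probability that a word contains a swap;
    in practice it would be estimated from real user data.
    """
    variants = swap_variants(word)
    if variants and rng.random() < typo_rate:
        return rng.choice(variants)
    return word


def enumerate_worst_case(word: str) -> List[str]:
    """Every input the dimension allows, however unlikely."""
    return [word] + swap_variants(word)


rng = random.Random(0)  # seeded so the sampling is repeatable
print([sample_average_case("essay", typo_rate=0.1, rng=rng) for _ in range(5)])
print(enumerate_worst_case("essay"))
```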

Construct Tests

Average-case tests, according to the authors, can be constructed either manually or using model-based perturbation generators like Polyjuice (Wu et al., 2021). Worst-case tests can be rule- or model-based, using systems like Morpheus or BERT-Attack, respectively (Tan et al., 2020; Li et al., 2020). The authors argue that some tests should be conducted under the black-box assumption, that is, without knowledge of the parameters of the NLP model. This ensures stakeholders, and potentially regulators, can trust the results.
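A rule-based, black-box worst-case test can be sketched in a few lines of Python. The example below is loosely inspired by the idea of inflectional perturbation but is not the Morpheus algorithm; the rule table, the function names and the toy grader are all hypothetical. The key property is that the test only ever calls the system's prediction function and never inspects its parameters.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical rule table mapping a word to alternative inflections.
INFLECTIONS: Dict[str, List[str]] = {
    "writes": ["write", "wrote", "writing"],
    "was": ["were", "is"],
}


def perturb(sentence: str) -> List[str]:
    """All sentences obtained by changing the inflection of a single word."""
    tokens = sentence.split()
    candidates = []
    for i, token in enumerate(tokens):
        for alternative in INFLECTIONS.get(token, []):
            candidates.append(" ".join(tokens[:i] + [alternative] + tokens[i + 1:]))
    return candidates


def find_failure(predict: Callable[[str], str], sentence: str, gold: str) -> Optional[str]:
    """Return the first perturbed sentence the system gets wrong, or None.

    Black box: the system is only queried through `predict`.
    """
    for candidate in perturb(sentence):
        if predict(candidate) != gold:
            return candidate
    return None


def brittle_grader(essay: str) -> str:
    """Toy system that passes an essay only if it contains the exact word 'writes'."""
    return "pass" if "writes" in essay.split() else "fail"


print(find_failure(brittle_grader, "She writes clearly", gold="pass"))
```

Because nothing here depends on the internals of the system, the same test could in principle be run by a third party such as a regulator.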

Test the System and Report Results

Testing can be divided according to levels within and beyond the company producing the NLP system. The authors use a three-level example, made up of the internal development team, the company and the industry at large (or regulator).

The reliability tests devised by the development team are meant to identify weaknesses in the use of the NLP system by the target users. The authors suggest that using worst-case test examples, constrained to the specific dimensions chosen earlier, alongside average-case test examples will provide the development team with insight into how different design factors affect reliability.

At the company level, internal ‘red teams’ should be utilised, which can use reliability tests to identify concerns with the NLP system’s safety and security. These tests are likely to be broader in scope than those used by the development team. Furthermore, reliability testing standards can be developed to ensure compliance across multiple NLP systems created within the company. Publishing the standards developed and adhered to allows public critique and can foster trust between the company and users of its products.

Beyond the company level, industry regulators should provide further reliability testing, similar to the ISO testing and auditing required of other industries. These requirements would be more stringent for higher-risk NLP systems and potentially less so for lower-risk ones. Within these standards, it is possible for both average-case and worst-case test results to be included and published.

Observation and Refinement

The authors stress the importance of monitoring the impact of the NLP system beyond its launch, enabling reliability tests, dimensions and their accepted thresholds to be updated in response. Stakeholders and users can be encouraged to provide feedback, raise issues or seek remediation through online resources or, for the products of larger companies, via community juries.
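One simple way to connect such feedback to the refinement step is sketched below in Python. The function name, the report format and the threshold are assumptions made for illustration; the paper does not prescribe this mechanism. Each user report is tagged with the dimension it concerns, and any dimension whose reported failure rate exceeds an agreed limit is flagged so that its tests and thresholds can be revisited.

```python
from collections import Counter
from typing import List


def dimensions_to_review(reports: List[str], interactions: int,
                         max_failure_rate: float) -> List[str]:
    """Return dimensions whose reported failure rate exceeds the agreed limit.

    `reports` holds one dimension label per user-reported failure;
    `interactions` is the total number of uses over the same period.
    """
    counts = Counter(reports)
    return sorted(
        dimension for dimension, n in counts.items()
        if n / interactions > max_failure_rate
    )


# Hypothetical feedback gathered after launch.
reports = ["dialect", "dialect", "spelling", "dialect"]
print(dimensions_to_review(reports, interactions=100, max_failure_rate=0.02))
```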


NLP systems can make positive contributions to technologies and our lives. However, those tasked with their development have a responsibility to avoid perpetuating dangerous stereotypes and damaging or underserving minority communities. To help with this charge, Tan et al. (2021) have outlined the potential benefits of replacing Adversarial Attacks with their suggested reliability tests. They have also presented a framework for their implementation. Furthermore, they have outlined the need for company and industry standards to ensure accountability in the field.


Q: How do the proposed testing methods specifically address the underrepresentation of minority voices in NLP training datasets?

A: The proposed testing methods aim to address the underrepresentation of minority voices in NLP training datasets by focusing on dimension-specific reliability tests. These tests assess how well an NLP system can handle a variety of linguistic features, including those unique to the languages or dialects of minority groups. By operationalising dimensions as distributions, the framework ensures that the diversity of student voices, including those from minority backgrounds, is considered during the testing phase. This approach helps identify and mitigate biases that might arise from the underrepresentation of certain groups in the training data. By incorporating a wide range of linguistic variations reflective of the student population, these methods strive to make text analysis more equitable and inclusive.

Q: What specific measures are recommended to ensure that student feedback and concerns are effectively incorporated into the iterative refinement of NLP systems?

A: To ensure that student feedback and concerns are effectively incorporated into the iterative refinement of NLP systems, the DOCTOR framework recommends a continuous cycle of observation and refinement post-deployment. This involves actively soliciting feedback from students and other stakeholders through online platforms, surveys, and community forums. By observing the deployed system's behaviour and gathering direct input from users, developers can identify areas where the NLP system may not adequately represent or understand student voices. This feedback loop allows for the refinement of reliability requirements and tests to better accommodate the diverse needs and perspectives of students, particularly those from underrepresented groups. Engaging with student voice in this way ensures that text analysis tools remain fair, accurate, and responsive to the evolving linguistic landscape of the student body.

Q: How can educators and policymakers ensure that the application of these NLP reliability testing methods leads to equitable educational outcomes?

A: Educators and policymakers can ensure that the application of NLP reliability testing methods leads to equitable educational outcomes by actively involving a broad spectrum of stakeholders in the development and evaluation process. This includes consulting with students, educators, linguists, ethicists, and representatives from minority communities to define reliability requirements and acceptable performance thresholds. By prioritising the inclusion of diverse student voices in the operationalisation of dimensions and the construction of tests, the framework can address potential biases and ensure that text analysis tools are evaluated against the real-world linguistic diversity they will encounter. Additionally, implementing standards for transparency and accountability, such as publishing test results and adhering to industry or regulatory guidelines, can foster trust and ensure that NLP systems are used responsibly in educational contexts. Through these collaborative and transparent practices, the equitable treatment of all students in text analysis and automated scoring can be better guaranteed.


[Source Paper] Tan S, Joty S, Baxter K, Taeihagh A, Bennett GA, Kan MY. 2021. Reliability testing for natural language processing systems. Computers and Society.
DOI: 10.48550/arXiv.2105.02590

[1] Feathers T. 2019. Flawed algorithms are grading millions of students’ essays. Vice.
Retrieved Here

[2] Hern A. 2017. Facebook translates ‘good morning’ into ‘attack them’, leading to arrest. The Guardian.
Retrieved Here

[3] Goodfellow IJ, Shlens J, Szegedy C. 2015. Explaining and harnessing adversarial examples. In 3rd International Conference on Learning Representations, San Diego, California.
DOI: 10.48550/arXiv.1412.6572

[4] Fisch A, Talmor A, Jia R, Seo M, Choi E, Chen D. 2019. MRQA 2019 shared task: Evaluating generalization in reading comprehension. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering, pages 1–13, Hong Kong, China. Association for Computational Linguistics.
DOI: 10.48550/arXiv.1910.09753

[5] Tan S, Joty S, Kan MY, Socher R. 2020. It’s morphin’ time! Combating linguistic discrimination with inflectional perturbations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2920–2935, Online. Association for Computational Linguistics.
DOI: 10.48550/arXiv.2005.04364

[6] Li L, Ma R, Guo Q, Xue X, Qiu X. 2020. BERT-Attack: Adversarial attack against BERT using BERT. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, Online. Association for Computational Linguistics.
DOI: 10.48550/arXiv.2004.09984

[7] Wu T, Ribeiro MT, Heer J, Weld DS. 2021. Polyjuice: Automated, general-purpose counterfactual generation. arXiv preprint arXiv:2101.00288.
DOI: 10.48550/arXiv.2101.00288
