Reducing Bias in Natural Language Processing Systems

By David Griffin

Updated Mar 03, 2026

When an algorithm grades an essay or translates a message, small language errors can have big consequences. Natural language processing (NLP), the intersection of linguistics and machine learning, automates how computers interpret and generate language, and it is increasingly present in daily life. There are, however, numerous examples of NLP system failures, including biased grading systems that disadvantage minorities (Feathers, 2019) and mistranslation issues that have led to inappropriate police responses (Hern, 2017).

These issues are often built into the technology through insufficient or inappropriate testing and training. Training data can underrepresent minority groups, which in turn introduces bias into an NLP system. Other reliability concerns include sensitivity to noise (Goodfellow et al., 2015) and out-of-distribution generalisation, where a model encounters text that differs from its training and test data and performance becomes less predictable in real-world use (Fisch et al., 2019).

With these concerns in mind, researchers in Singapore published a paper proposing changes to how NLP systems are developed and evaluated (Tan et al., 2021). Their work considers systems that take text as input and either classify it (by assigning labels) or produce a text response as output. The paper is particularly focused on testing reliability in ways that better reflect real-world use.

Tan et al. (2021) argue that many robustness tests overestimate worst-case failure because they rely on adversarial attacks. In adversarial attacks, inputs are crafted to exploit a model’s weaknesses, often using language models or embeddings (for example, swapping words for synonyms). The authors note that attacks are often generated across multiple ‘dimensions’ (or variables), such as wording, grammar, and syntax, at the same time. This can produce unnatural text or extreme outliers, so tests may flag failures that are vanishingly unlikely in normal use.
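
As a rough illustration (not the authors' method), a synonym-swap perturbation can be sketched in a few lines of Python. The synonym table and swap rate here are invented stand-ins; real attacks typically draw substitutes from embeddings or a language model:

```python
import random

# Toy synonym table: a hypothetical stand-in for embedding- or
# language-model-based substitute selection.
SYNONYMS = {
    "good": ["great", "fine"],
    "bad": ["poor", "awful"],
    "big": ["large", "huge"],
}

def synonym_swap(text: str, rate: float = 0.5, seed: int = 0) -> str:
    """Replace words found in the synonym table with a random synonym,
    each with probability `rate`."""
    rng = random.Random(seed)
    out = []
    for word in text.split():
        if word.lower() in SYNONYMS and rng.random() < rate:
            out.append(rng.choice(SYNONYMS[word.lower()]))
        else:
            out.append(word)
    return " ".join(out)

print(synonym_swap("the food was good but the room was bad", rate=1.0))
```

Applying several such perturbations at once, across wording, grammar, and syntax, is exactly what can push generated text outside anything a real user would write.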

Reliability testing is also often done using bespoke datasets that include differing writing styles or domains. Alternatively, a human-versus-model approach may be used, where either human experts or non-experts help construct a dataset and its expected outputs. However, developing a unique challenge dataset for every new NLP system is impractical. As a result, free-to-use crowdsourced datasets are often used instead, but they introduce their own biases into the evaluation.

To address these concerns, the authors propose dimension-specific reliability testing using quantitative measures designed to improve safety and equity for users. This approach makes it easier to evaluate how a system behaves on the kinds of variation that matter in practice. They subdivide tests into two categories: average-case tests and worst-case tests. The former estimate scenarios within the normal bounds of system use, while the latter consider less likely, but still important, situations.

Together, these tests provide a more nuanced view of system failure and of a model’s ability to handle out-of-distribution generalisation. Within the paper, the authors provide code examples and show how to generate real-world and worst-case reliability scores, giving teams quantitative measures they can track over time.
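
The scoring machinery in the paper is more involved, but the core contrast can be sketched: average a metric over sampled perturbations for an average-case score, and take the minimum for a worst-case score. The toy classifier, metric, and perturbation below are hypothetical stand-ins:

```python
from statistics import mean

def reliability_scores(model, metric, examples, perturb, n_samples=4):
    """Average-case score: mean metric over perturbed inputs.
    Worst-case score: minimum metric over the same perturbed set."""
    scores = []
    for text, label in examples:
        for i in range(n_samples):
            pred = model(perturb(text, seed=i))
            scores.append(metric(pred, label))
    return mean(scores), min(scores)

def toy_model(text):
    # Brittle keyword classifier: case-sensitive on purpose.
    return "positive" if "good" in text else "negative"

def exact_match(pred, label):
    return 1.0 if pred == label else 0.0

def toy_perturb(text, seed=0):
    # Deterministic perturbation: upper-case one word chosen by the seed.
    words = text.split()
    i = seed % len(words)
    words[i] = words[i].upper()
    return " ".join(words)

avg, worst = reliability_scores(
    toy_model, exact_match,
    [("the food was good", "positive"), ("service was slow", "negative")],
    toy_perturb,
)
print(avg, worst)  # 0.875 0.0
```

Here the brittle classifier passes most perturbed inputs (average-case 0.875) but fails outright on one (worst-case 0), which is exactly the gap the two measures are meant to expose.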

In the next section, the authors outline a repeatable framework for applying these methods.

The DOCTOR Framework

The DOCTOR framework summarises this approach in a practical six-step checklist:

  • Define reliability requirements.
  • Operationalize dimensions as distributions.
  • Construct tests.
  • Test the system and report results.
  • Observe the deployed system’s behaviour.
  • Refine reliability requirements and tests.

Define Reliability Requirements

Before designing tests, developers need to understand stakeholders, their demographics, and their values. They should also consider the impact an NLP system could have on the lives of the people who interact with it. This step anchors reliability in real-world impact, not just technical metrics. To do this, the authors suggest asking three questions:

  • What dimensions need to be tested?
  • How will system performance be measured?
  • What are acceptable performance thresholds for the chosen dimensions?

Because there are many potential dimensions, stakeholder advocates and NLP experts should be involved in this step. To determine acceptable thresholds for worst-case tests, ethicists should also be consulted.

To further explain this step, the authors use the exemplar scenario of an Automated Text Scoring (ATS) system, which is used to grade exams and essays. The stakeholders here are the students and their schools. Relevant demographic considerations include the use of a given language by stakeholders and the fluency of particular societal or socioeconomic groups in that language.
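
To make this concrete, a team might record such requirements as a simple checkable configuration. Everything below (the dimension names, the "score drift" metric, and the thresholds) is hypothetical, invented for illustration rather than taken from the paper:

```python
# Hypothetical reliability requirements for an ATS system: each dimension
# pairs a measurement with acceptable thresholds for both test types.
REQUIREMENTS = {
    "spelling_noise":    {"metric": "score_drift", "avg_case_max": 0.02, "worst_case_max": 0.10},
    "dialect_variation": {"metric": "score_drift", "avg_case_max": 0.03, "worst_case_max": 0.10},
    "grammar_edits":     {"metric": "score_drift", "avg_case_max": 0.05, "worst_case_max": 0.15},
}

def meets_requirements(results, requirements):
    """Compare measured (average-case, worst-case) drifts per dimension
    against the agreed thresholds; return the dimensions that fail."""
    failures = []
    for dim, req in requirements.items():
        avg, worst = results[dim]
        if avg > req["avg_case_max"] or worst > req["worst_case_max"]:
            failures.append(dim)
    return failures

print(meets_requirements(
    {"spelling_noise": (0.01, 0.08),
     "dialect_variation": (0.04, 0.09),
     "grammar_edits": (0.02, 0.12)},
    REQUIREMENTS,
))  # ['dialect_variation']
```

Writing thresholds down in this form is one way to make the answers to the three questions above auditable by stakeholder advocates and ethicists.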

Operationalize Dimensions as Distributions

Dimensions must be defined in terms of the operations used to describe them. Put differently, chosen dimensions need to be tested in the context in which they are intended to exist. This allows perturbed examples to be sampled and recorded as tests.

For average-case tests, test datasets are needed that reflect the likely real-world distributions the NLP system will encounter. For worst-case tests, realistic datasets matter less, since these tests deliberately probe perturbations beyond the typical. In the exemplar ATS scenario, misspellings in a submitted text should be discounted as errors when grammar alone is being tested. Defining the vital dimensions and their contexts in this way helps isolate the dimension being tested rather than measuring unrelated noise.
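
One reading of "operationalize as distributions" is that perturbation severity should itself be sampled from a plausible range rather than fixed at an extreme. The typo generator and the 0–15% rate range below are illustrative assumptions, not values from the paper:

```python
import random

def add_typos(text: str, rate: float, rng: random.Random) -> str:
    """Drop one character from longer words with probability `rate`
    (a crude stand-in for realistic misspellings)."""
    out = []
    for w in text.split():
        if len(w) > 3 and rng.random() < rate:
            i = rng.randrange(len(w))
            out.append(w[:i] + w[i + 1:])
        else:
            out.append(w)
    return " ".join(out)

def sample_average_case(text: str, n: int = 5, seed: int = 0):
    """Draw each variant's typo rate from a plausible range, so the test
    set reflects a distribution of severities rather than one extreme."""
    rng = random.Random(seed)
    return [add_typos(text, rate=rng.uniform(0.0, 0.15), rng=rng) for _ in range(n)]

for variant in sample_average_case("students submitted thoughtful essays"):
    print(variant)
```

A worst-case generator would instead push `rate` toward its upper bound, within whatever limit the requirements step deemed still worth handling.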

Construct Tests

According to the authors, average-case tests can be constructed manually or using model-based perturbation generators like PolyJuice (Wu et al., 2021). Worst-case tests can be rule- or model-based, using systems like Morpheus or BERT-Attack, respectively (Tan et al., 2020; Li et al., 2020). They argue that some tests should be conducted under a black-box assumption, without access to the parameters used to design the NLP model. This mix of approaches helps cover both typical variation and stress cases, while making results easier for stakeholders, and potentially regulators, to trust.
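
In that spirit, a rule-based worst-case test can be run fully black-box: candidate perturbations are scored through the prediction API alone, with no access to model internals. The inflection-toggling rule and toy classifier below are crude invented stand-ins for what systems like Morpheus do far more carefully:

```python
def worst_case_search(predict, text, label, candidates, metric):
    """Greedy black-box search: score each candidate perturbation via the
    prediction function alone and keep the one that hurts the metric most."""
    worst_text, worst_score = text, metric(predict(text), label)
    for cand in candidates(text):
        score = metric(predict(cand), label)
        if score < worst_score:
            worst_text, worst_score = cand, score
    return worst_text, worst_score

def candidates(text):
    # Hypothetical rule: toggle a trailing 's' on each word in turn,
    # a crude stand-in for inflectional perturbations.
    words = text.split()
    for i, w in enumerate(words):
        edited = w[:-1] if w.endswith("s") else w + "s"
        yield " ".join(words[:i] + [edited] + words[i + 1:])

# Toy black-box classifier to attack.
predict = lambda t: "plural" if any(w.endswith("s") for w in t.split()) else "singular"
metric = lambda pred, gold: 1.0 if pred == gold else 0.0

worst_input, worst_score = worst_case_search(
    predict, "the cats sleep", "plural", candidates, metric,
)
print(worst_input, worst_score)  # the cat sleep 0.0
```

Because only `predict` is called, the same harness works whether the tester is the development team, a red team, or an external auditor without model access.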

Test the System and Report Results

Testing can be divided across levels within and beyond the company producing the NLP system. The authors use a three-level example, made up of the internal development team, the company, and the industry at large (or regulator). This framing treats reliability as a shared responsibility, not just a developer check.

The reliability tests devised by the development team are meant to surface weaknesses the NLP system may show for its target users. The authors suggest combining worst-case test examples, constrained by the specific dimensions chosen earlier, with average-case test examples. This gives the development team insight into how different design factors affect reliability.

At the company level, internal ‘red teams’ can use reliability tests to identify concerns with the NLP system’s safety and security. These tests are likely to be broader in scope than those used by the development team. Furthermore, reliability testing standards can be developed to ensure compliance across multiple NLP systems created within the company. Publishing those standards, and showing that they are followed, allows public critique and can foster trust between the company and users of its products.

Beyond the company level, industry regulators should provide further reliability testing, similar to the ISO testing and auditing required in other industries. These requirements would be more stringent for high-risk NLP systems and less so for low-risk ones. Both average-case and worst-case test results could be included and published under such standards.

Observation and Refinement

The authors stress the importance of monitoring the impact of the NLP system beyond its launch. This enables reliability tests, dimensions, and accepted thresholds to be updated in response. Stakeholders and users can be encouraged to provide feedback, raise issues, or seek remediation through online resources or, for the products of larger companies, via community juries.

Conclusions

NLP systems can make positive contributions to technology and our lives. However, those tasked with their development have a responsibility to avoid perpetuating dangerous stereotypes and harming or underserving minority communities. Tan et al. (2021) outline the potential benefits of replacing adversarial attacks with their suggested reliability tests, alongside a framework for implementation. They also highlight the need for company and industry standards to ensure accountability in the field. Taken together, DOCTOR provides a practical way to define reliability, measure it quantitatively, and refine it as systems are deployed.

FAQ

Q: How do the proposed testing methods specifically address the underrepresentation of minority voices in NLP training datasets?

A: The proposed testing methods aim to address underrepresentation by focusing on dimension-specific reliability tests. These tests assess how well an NLP system can handle a variety of linguistic features, including those unique to the languages or dialects of minority groups. By operationalising dimensions as distributions, the framework ensures that the diversity of student voices, including those from minority backgrounds, is considered during testing. This approach helps identify and mitigate biases that might arise when certain groups are underrepresented in the training data, making text analysis more equitable and inclusive.

Q: What specific measures are recommended to ensure that student feedback and concerns are effectively incorporated into the iterative refinement of NLP systems?

A: The DOCTOR framework recommends a continuous cycle of observation and refinement post-deployment. This involves actively soliciting feedback from students and other stakeholders through online platforms, surveys, and community forums. By observing the deployed system’s behaviour and gathering direct input from users, developers can identify areas where an NLP system may not adequately represent or understand student voices. This feedback loop allows reliability requirements and tests to be refined to better accommodate diverse needs and perspectives, particularly for underrepresented groups.

Q: How can educators and policymakers ensure that the application of these NLP reliability testing methods leads to equitable educational outcomes?

A: Educators and policymakers can ensure that the application of these methods leads to equitable educational outcomes by involving a broad spectrum of stakeholders in development and evaluation. This includes consulting with students, educators, linguists, ethicists, and representatives from minority communities to define reliability requirements and acceptable performance thresholds. Prioritising the inclusion of diverse student voices in the operationalisation of dimensions and the construction of tests helps address potential biases and ensures that text analysis tools are evaluated against real-world linguistic diversity. Transparency and accountability also matter, such as publishing test results and adhering to industry or regulatory guidelines.

References

[Source Paper] Tan S, Joty S, Baxter K, Taeihagh A, Bennett GA, Kan MY. 2021. Reliability testing for natural language processing systems. arXiv preprint arXiv:2105.02590.
DOI: 10.48550/arXiv.2105.02590

[1] Feathers T. 2019. Flawed algorithms are grading millions of students’ essays. Vice.

[2] Hern A. 2017. Facebook translates ‘good morning’ into ‘attack them’, leading to arrest. The Guardian.

[3] Goodfellow IJ, Shlens J, Szegedy C. 2015. Explaining and harnessing adversarial examples. In 3rd International Conference on Learning Representations, San Diego, California.
DOI: 10.48550/arXiv.1412.6572

[4] Fisch A, Talmor A, Jia R, Seo M, Choi E, Chen D. 2019. MRQA 2019 shared task: Evaluating generalization in reading comprehension. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering, pages 1–13, Hong Kong, China. Association for Computational Linguistics.
DOI: 10.48550/arXiv.1910.09753

[5] Tan S, Joty S, Kan MY, Socher R. 2020. It’s morphin’ time! Combating linguistic discrimination with inflectional perturbations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2920–2935, Online. Association for Computational Linguistics.
DOI: 10.48550/arXiv.2005.04364

[6] Li L, Ma R, Guo Q, Xue X, Qiu X. 2020. BERT-ATTACK: Adversarial attack against BERT using BERT. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, Online. Association for Computational Linguistics.
DOI: 10.48550/arXiv.2004.09984

[7] Wu T, Ribeiro MT, Heer J, Weld DS. 2021. Polyjuice: Automated, general-purpose counterfactual generation. arXiv preprint arXiv:2101.00288.
DOI: 10.48550/arXiv.2101.00288
