Detecting hate speech online using machine learning models

By David Griffin

Published May 09, 2022 · Updated Mar 07, 2026

Can machine learning reliably spot hate speech before it fuels real-world harm? Toraman et al. (2022) show that the answer depends not just on the model, but on the language, hate domain, and training data behind it.

Social media can bring people together. It creates connections between people who might otherwise never meet and strengthens communities separated by geography, gender, religion, or politics. Yet the same platforms can also spread hate speech, which can contribute to offline harm and violence against individuals and communities (Byman, 2021).

Hate speech detection on social media still relies heavily on manually labelled datasets. Researchers use these datasets to train and test machine learning systems for automated language analysis, but the labelling work is slow and labour-intensive. As a result, many datasets remain relatively small. Models often perform well on the dataset they were trained and tested on, yet struggle when applied to new data beyond the original test scenario (Arango et al., 2019). This problem is even sharper for languages with fewer labelled resources.

Online hate speech spans many topics. In this literature, these topics are often described as "hate domains". Most studies have treated hate speech as a single category rather than asking whether detection performance varies by domain (Toraman et al., 2022).

Toraman et al. (2022) address that gap by building large hate speech datasets in both English, a globally prevalent language, and Turkish, a less-resourced language in this context. Each dataset contains 100,000 Twitter posts. The authors also label posts by hate domain, focusing on five frequently observed categories: religion, racism, gender, sports, and politics.

The work described by Toraman et al. (2022) has three main objectives:

  1. To construct two large-scale, manually labelled datasets for detecting hate speech on social media, one in English and one in Turkish.
  2. To benchmark a range of existing machine learning models for hate speech detection.
  3. To test whether models trained on one hate domain can still recognise hate speech in another.

To build the datasets, the researchers retrieved about 20,000 tweets for each hate domain in each language. They selected tweets using keywords and additional criteria, including limits on the number taken from the same Twitter account and the exclusion of tweets shorter than five words.
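To make those criteria concrete, here is a minimal sketch of this kind of filtering step. The field names, the keyword matching, and the cap of three tweets per account are illustrative assumptions rather than the authors' actual pipeline; only the five-word minimum comes from the paper.

```python
from collections import Counter

def filter_tweets(tweets, keywords, max_per_account=3, min_words=5):
    """Select tweets using criteria similar in spirit to Toraman et al. (2022).

    Each tweet is assumed to be a dict like {"account": str, "text": str}.
    The max_per_account value is a made-up placeholder, not the paper's.
    """
    per_account = Counter()
    selected = []
    for tweet in tweets:
        text = tweet["text"]
        if len(text.split()) < min_words:
            continue  # exclude tweets shorter than five words (as in the paper)
        if not any(kw in text.lower() for kw in keywords):
            continue  # keep only tweets that match a domain keyword
        if per_account[tweet["account"]] >= max_per_account:
            continue  # limit how many tweets come from the same account
        per_account[tweet["account"]] += 1
        selected.append(tweet)
    return selected
```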

Students manually labelled the tweets (a form of manual coding), following written guidelines to assign each one to one of three categories: Hate, Offensive, or Normal. Hate tweets were defined as those inciting violence, threatening harm, or targeting an individual or group because of a characteristic or trait. Offensive tweets included insults, humiliation, discrimination, or taunting. All remaining tweets were labelled Normal.

The authors evaluated the models using three common metrics: precision, recall, and F1-score. Here, precision measures how many tweets flagged as Hate really were Hate. Recall measures how many of the Hate tweets in the dataset were correctly identified. F1-score combines both measures, giving a more balanced view of performance.
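As a concrete illustration, with invented counts rather than figures from the paper, the three metrics can be computed like this:

```python
# Toy confusion counts for the Hate class (invented for illustration):
# 40 true positives, 10 false positives, 20 false negatives.
tp, fp, fn = 40, 10, 20

precision = tp / (tp + fp)  # 0.80: of tweets flagged Hate, how many really were
recall = tp / (tp + fn)     # 0.67: of actual Hate tweets, how many were flagged
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean, ~0.73

print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")
```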

Using these metrics, they tested three types of models: a bag-of-words model, two neural models, and a transformer-based model.
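For readers unfamiliar with the simplest of those families, the sketch below shows a generic bag-of-words classifier built with scikit-learn. It illustrates the approach only: the toy data is invented, and this is not the authors' exact configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented stand-ins for labelled tweets, using the paper's three-way scheme:
# 0 = Normal, 1 = Offensive, 2 = Hate.
texts = [
    "great match last night",
    "you people disgust me and should suffer",
    "what an idiot that referee is",
    "lovely weather in town today",
]
labels = [0, 2, 1, 0]

# Bag-of-words pipeline: TF-IDF features fed to a linear classifier.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)

print(model.predict(["that referee is an idiot"]))  # -> one of [0], [1], or [2]
```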

Results varied considerably by model, language, and hate domain, but four findings stand out:

  • Transformer-based models delivered the strongest overall results.
  • Performance depended on language, which means a model that works well in one language may not transfer cleanly to another.
  • More training data improved detection, leaving languages with fewer labelled resources at a disadvantage.
  • Models trained on one hate domain generally transferred well to others, but sports- and gender-related hate speech remained harder to generalise across domains (see the sketch after this list).
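Here is a minimal sketch of what such a cross-domain check could look like. The `train_fn` and `f1_fn` helpers and the per-domain dataset splits are hypothetical placeholders, not names from the paper:

```python
from itertools import product

DOMAINS = ["religion", "racism", "gender", "sports", "politics"]

def cross_domain_scores(datasets, train_fn, f1_fn):
    """Train on each source domain and report F1 on every target domain.

    `datasets` maps a domain name to a (train_split, test_split) pair;
    `train_fn` and `f1_fn` stand in for a real training/evaluation pipeline.
    """
    scores = {}
    for source, target in product(DOMAINS, repeat=2):
        model = train_fn(datasets[source][0])  # fit on the source domain only
        scores[(source, target)] = f1_fn(model, datasets[target][1])
    return scores  # off-diagonal entries measure cross-domain transfer
```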

The authors also highlight several reasons hate speech detection remains difficult:

  • Models still struggle with nuance and context. The phrase "I hate my life", for example, was often interpreted as hate speech even though it more commonly signals frustration with one's own situation.
  • Offensive words and slang are often used for emphasis or humour in online conversation. The vocabulary may sound aggressive even when the intended meaning is not, which leads models to over-flag content.
  • Words that are usually positive, such as "nice", can appear inside hateful or offensive messages. That can mislead models into classifying harmful posts as Normal.

For anyone using machine learning to analyse open text, this study offers a practical warning: performance depends on the language, domain, and data you train on, which echoes wider work on reducing bias in natural language processing systems. Toraman et al. (2022) provide useful evidence on what current models can do, where they fail, and why cross-domain results should be treated carefully. As online communication keeps expanding, more reliable approaches to detecting harmful language will matter for protecting people and for making text analysis systems more dependable.

Key takeaway: Large hate speech datasets improve model testing, but they do not remove the need for domain-specific evaluation. If you are comparing text analysis software for education, test it on the language, topics, and edge cases your teams actually see.

FAQ

Q: How were students involved in the manual labelling of tweets, and what ethical considerations were taken into account during this process? Considering the potentially distressing nature of hate speech, what measures were put in place to support the students' well-being during their participation in the research?

A: Students labelled tweets into hate, offensive, and normal categories using written guidelines. The study summary here focuses on the annotation workflow and model performance, not on participant support procedures in detail, so it would be risky to assume safeguards that are not reported. In any similar project, exposure to hateful content should be treated as a welfare issue, with clear protocols, informed consent, debriefing, and access to support for student annotators.

Q: How effective are these models in adapting to evolving online slang and expressions, particularly those that might emerge from student or youth cultures? Are there ongoing efforts to update the models to better understand the context and subtext of language used in social media?

A: These models can struggle with evolving slang because meaning changes quickly and often depends on context, community, and irony. Toraman et al. (2022) already show that models misread phrases when offensive words are used humorously or ordinary words appear inside hateful posts. Keeping models useful therefore means updating training data regularly and testing them on current language, especially language used by students or youth communities.

Q: Beyond detecting hate speech, how can the findings and methodologies from this study be applied to other areas of text analysis and online behavior monitoring, especially in educational settings where student voice is critical? For instance, could these models be adapted for monitoring cyberbullying among students or analyzing student sentiments in online learning environments?

A: The methods can extend to other safeguarding and feedback-analysis tasks, but only if the categories and training data match the use case. In educational settings, that could include screening for cyberbullying, identifying wellbeing concerns, or tracking themes in online learning comments, often alongside sentiment analysis for UK universities. The key lesson is not that one hate-speech model fits every problem, but that domain-specific data and careful human review remain essential.

References

[Source] Toraman, C., Şahinuç, F., and Yilmaz, E. H. (2022). Large-scale hate speech detection with cross-domain transfer. arXiv preprint.
DOI: 10.48550/arXiv.2203.01111

Arango, A., Pérez, J., and Poblete, B. (2019). Hate speech detection is not as easy as you may think: A closer look at model validation. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 45–54. New York, NY, USA: Association for Computing Machinery.
DOI: 10.1145/3331184.3331262

Byman, D. L. (2021). How hateful rhetoric connects to real-world violence. Brookings Institution.
