How well can we detoxify comments online? 🙊

Laura Hanu

Toxicity/hate speech classification with one line of code

Example of results using Detoxify 🙊

Hate speech online is here to stay

The issue of online toxicity is one of the most challenging problems on the internet today. We know it can amplify discord and discrimination, from racism and anti-semitism, to misogyny and homophobia. In some cases, toxic comments online can result in real life violence [1,2].

Human moderators are struggling to keep up with the increasing volume of harmful content, and exposure to it can often lead to PTSD. It should therefore come as no surprise that the AI community has been trying for years to build models that detect such toxicity.


Toxicity detection is difficult

In 2017, Alphabet’s Perspective API, an AI solution to detect toxic comments online, was met with criticism when users found examples of bias relating to race, sexual orientation, or disability. In particular, they found a positive correlation between toxic labels and comments containing identity terms relating to race, religion, or gender. Because of this skew in the data, models were likely to associate these neutral identity terms with toxicity. For example, “I am a gay black woman” received a toxicity score of 87%, while “I am a woman who is deaf” received 77%.

Examples of bias in toxicity models, Source:

This led Jigsaw, the Alphabet unit behind Perspective API, to create 3 Kaggle challenges in the following years aimed at building better toxicity models:

  • Toxic Comment Classification Challenge: the goal of this challenge was to build a multi-headed model that can detect different types of toxicity like threats, obscenity, insults, or identity-based hate in Wikipedia comments.
  • Jigsaw Unintended Bias in Toxicity Classification: the 2nd challenge tried to address the unintended bias observed in the previous challenge by introducing identity labels and a special bias metric aimed at minimising this bias on Civil Comments data.
  • Jigsaw Multilingual Toxic Comment Classification: the 3rd challenge combined data from the previous 2 challenges and encouraged developers to find an effective way to build a multilingual model out of English training data only.

The Jigsaw challenges on Kaggle have pushed things forward and encouraged developers to build better toxicity detection models using recent breakthroughs in natural language processing.

What is detoxify 🙊?

Example using detoxify 🙊

Detoxify is a simple Python library designed to easily predict whether a comment contains toxic language. It can automatically load one of 3 trained models: original, unbiased, and multilingual. Each model was trained on data from one of the 3 Jigsaw Toxic Comment Classification challenges using the 🤗 transformers library.

Quick Prediction

The library can be easily installed in a terminal and imported using Python.

$ pip install detoxify
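
A minimal sketch of what a quick prediction can look like. It assumes the `Detoxify(model_name).predict(...)` call described in the unitaryai/detoxify README, which takes either a string or a list of strings and returns a dictionary mapping each label to a score (or to a list of scores, one per comment). The helper names `score_comments` and `flag_toxic` and the 0.5 threshold are illustrative assumptions, not part of the library.

```python
def score_comments(comments, model_name="original"):
    """Score a list of comments with one of the Detoxify models.

    model_name is one of 'original', 'unbiased' or 'multilingual'.
    The import is done lazily so that flag_toxic below can be used
    (and tested) without the library or its checkpoints being available.
    """
    from detoxify import Detoxify
    return Detoxify(model_name).predict(comments)


def flag_toxic(scores, threshold=0.5):
    """Return, for each comment, the labels whose score reaches the threshold.

    `scores` is assumed to map label -> list of floats, one per comment,
    which is the shape returned when predict() is given a list of strings.
    The threshold is a hypothetical default; tune it on your own data.
    """
    if not scores:
        return []
    n_comments = len(next(iter(scores.values())))
    flagged = [[] for _ in range(n_comments)]
    for label, values in scores.items():
        for i, value in enumerate(values):
            if value >= threshold:
                flagged[i].append(label)
    return flagged
```

With the library installed, `flag_toxic(score_comments(["example text 1", "example text 2"], "unbiased"))` should return one list of flagged labels per comment. The exact label names depend on the model and library version, so the helper does not hard-code them.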

The multilingual model has been trained on 7 different languages so it should only be tested on: English, French, Spanish, Italian, Portuguese, Turkish, and Russian.

You can find more details about the training and prediction code on unitaryai/detoxify.

Training Details

During the experimentation phase, we tried a few transformer variations from 🤗 HuggingFace. However, the best ones turned out to be those already suggested in the Kaggle top solutions discussions.


Originally introduced in 2018 by Google AI, BERT is a deep bidirectional transformer pre-trained on unlabelled text from the internet, which achieved state-of-the-art results on a variety of NLP tasks, like Question Answering and Natural Language Inference. The bidirectional approach resulted in a deeper understanding of context compared to previous unidirectional (left-to-right or right-to-left) approaches.


Built by Facebook AI in July 2019, RoBERTa is an optimised way of pre-training BERT. What they found was that removing BERT’s next sentence objective, training with much larger mini-batches and learning rates, and training for an order of magnitude longer resulted in better performance on the masked language modelling objective, as well as on downstream tasks.


Proposed in late 2019 by Facebook AI, XLM-RoBERTa is a multilingual model built on top of RoBERTa and pre-trained on 2.5TB of filtered CommonCrawl data. Although trained on 100 different languages, it did not sacrifice per-language performance and remained competitive with strong monolingual models.

Bias Loss and Metric

The 2nd challenge required thinking about the training process more carefully. With additional identity labels (only present for a fraction of the training data), the question was how to incorporate them in a way that would minimise bias.

Our loss function was inspired by the 2nd solution, which combined a weighted toxicity loss function with an identity loss function to ensure the model learns to distinguish between the 2 types of labels. Additionally, the toxicity labels are weighted more if identity labels are present for a specific comment.
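
As a rough illustration of this idea, and not the training code used for Detoxify, the sketch below combines a weighted toxicity loss with an identity loss in plain Python. The function names, the `identity_boost` factor, and the equal weighting of the two terms are hypothetical choices made for the example; the identity term is simply skipped for comments that have no identity annotations.

```python
import math


def bce(prob, target, eps=1e-7):
    """Binary cross-entropy for a single predicted probability."""
    prob = min(max(prob, eps), 1.0 - eps)  # avoid log(0)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


def combined_loss(tox_prob, tox_label, id_probs=None, id_labels=None,
                  identity_boost=2.0):
    """Toxicity loss plus identity loss for one comment.

    id_probs / id_labels are lists over identity classes, or None/empty
    when the comment has no identity annotations. identity_boost is a
    hypothetical factor that up-weights the toxicity term for annotated
    comments.
    """
    has_identity = bool(id_labels)
    tox_weight = identity_boost if has_identity else 1.0
    loss = tox_weight * bce(tox_prob, tox_label)
    if has_identity:
        # Mean binary cross-entropy over the identity classes.
        id_losses = [bce(p, t) for p, t in zip(id_probs, id_labels)]
        loss += sum(id_losses) / len(id_losses)
    return loss
```

In a real training loop the same structure would be expressed with batched tensor operations and a mask over the rows that carry identity labels; the per-comment version here only shows how the two terms fit together.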

This challenge also introduced a new bias metric, which calculated the ROC-AUC on 3 specific test subsets for each identity:

  • Subgroup AUC: only keep the examples that mention an identity subgroup
  • BPSN (Background Positive, Subgroup Negative) AUC: only keep non-toxic examples that mention the identity subgroup and the toxic examples that do not
  • BNSP (Background Negative, Subgroup Positive) AUC: only keep toxic examples that mention the identity subgroup and the non-toxic examples that do not

These are then combined into the Generalised mean of BIAS AUCs to get an overall measure.

Generalised mean of BIAS AUCs, Source: Kaggle

The final score combines the overall AUC with the generalised mean of BIAS AUCs.

Final bias metric, Source: Kaggle
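
The metric described above can be sketched in plain Python as follows. This is an illustrative reimplementation, not the official scoring code: the function names are made up for the example, and the defaults `p=-5` and `weight=0.25` are the values recalled from the Kaggle competition's evaluation description, so treat them as assumptions.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the probability that a toxic example outranks a
    non-toxic one (ties count as half). Returns None if a class is missing."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bias_aucs(labels, scores, in_subgroup):
    """Subgroup, BPSN and BNSP AUCs for one identity subgroup."""
    rows = list(zip(labels, scores, in_subgroup))

    def auc_where(keep):
        kept = [(y, s) for y, s, g in rows if keep(y, g)]
        return roc_auc([y for y, _ in kept], [s for _, s in kept])

    return {
        # only examples that mention the subgroup
        "subgroup": auc_where(lambda y, g: g),
        # non-toxic subgroup examples + toxic background examples
        "bpsn": auc_where(lambda y, g: (g and y == 0) or (not g and y == 1)),
        # toxic subgroup examples + non-toxic background examples
        "bnsp": auc_where(lambda y, g: (g and y == 1) or (not g and y == 0)),
    }


def power_mean(values, p=-5):
    """Generalised mean; a negative p pulls the result towards the
    worst-performing subgroup. Assumes every value is strictly positive."""
    return (sum(v ** p for v in values) / len(values)) ** (1.0 / p)


def final_score(overall_auc, per_subgroup, p=-5, weight=0.25):
    """Combine the overall AUC with the generalised mean of each bias AUC.

    per_subgroup is a list of dicts as returned by bias_aucs, one per identity.
    """
    score = weight * overall_auc
    for name in ("subgroup", "bpsn", "bnsp"):
        score += weight * power_mean([aucs[name] for aucs in per_subgroup], p)
    return score
```

Because the generalised mean uses a negative exponent, a single identity with a poor AUC lowers the final score much more than an arithmetic mean would, which is what pushes a model towards treating all subgroups evenly.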

The combination of these resulted in less biased predictions on non-toxic sentences that mention identity terms.

'Unbiased' Detoxify model scores

Limitations and ethical considerations

If words associated with swearing, insults, hate speech, or profanity are present in a comment, it is likely to be classified as toxic (even by the unbiased model), regardless of the tone or intent of the author, e.g. humorous or self-deprecating. For example, ‘I am tired of writing this stupid essay’ gets a toxicity score of 99.70%, while the same sentence without the word ‘stupid’, ‘I am tired of writing this essay’, gets 0.05%.

However, this doesn’t necessarily mean that the absence of such words will result in a low toxicity score. For example, a common sexist stereotype such as ‘Women are not as smart as men.’ gives a toxicity score of 91.41%.

Moreover, since these models were tested mostly on the test sets provided by the Jigsaw competitions, they are likely to behave in unexpected ways on data in the wild, which will have a different distribution from the Wikipedia and Civil Comments data in the training sets.

Last but not least, the definition of toxicity is itself subjective. Perhaps due to our own biases, both conscious and unconscious, it is difficult to come to a shared understanding of what should or should not be considered toxic. We encourage users to see this library as a way of identifying the potential for toxicity. We hope it can help researchers, developers, or content moderators flag extreme cases more quickly and fine-tune the models on their own datasets.

What the future holds

While it seems that current hate speech models are sensitive to particular toxic words and phrases, we still have a long way to go until algorithms can capture actual meanings without being easily fooled.

For now, diverse datasets that reflect the real world and full context (e.g. accompanying image/video) are one of our best shots at improving toxicity models.

About Unitary

At Unitary, we build visual understanding AI capable of interpreting visual content in context. Our mission is to stop online harm.

You can find more about our mission and motivation in our previous post.