While AI has progressed in classifying content, human moderators remain key for online platforms. AI lacks nuanced understanding of context and culture needed for some moderation decisions. Laws also mandate human oversight. However, AI will likely take a larger role, requiring responsible implementation with transparency and accountability.
Among the notable, if not unique, features of the UK’s recently passed Online Safety Bill is the duty imposed on platforms to take “proportionate measures” to prevent users from ever encountering certain categories of content, including fraudulent advertising, some categories of illegal content, and, when the user is a child, some categories of “harmful” content. As humans cannot be expected to review every piece of content, the bill specifies (in section 137) considerations that Ofcom, the British regulator, must take into account when imposing the use of “proactive technology” on a platform.
As other automated means such as simple keyword or hash matching are of limited utility, these imposed technologies are likely to be AI classifiers. (Freedom House’s recent report identified 22 countries requiring the use of automated technology to remove banned content, but, despite the report’s framing, it is less clear how many of these mandates have AI classifiers in mind.) Such classifiers are already in use by many platforms, but they perform only a portion of content moderation today. Despite recent advances in AI content classification technologies such as natural language processing and computer vision, the widespread availability of foundation models, and a growing infrastructure to support data labeling and model training, much of content moderation is still performed by tens of thousands of full-time human moderators. The question is: why?
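Before turning to that question, the limited utility of hash matching noted above is easy to illustrate. A minimal sketch (the content strings and blocklist below are hypothetical placeholders, not any platform’s real system): an exact-hash lookup catches only byte-identical copies of known banned material, so even trivial alterations slip through.

```python
import hashlib

# Hypothetical blocklist of hashes of known-banned content (illustrative only).
banned_hashes = {
    hashlib.sha256(b"known banned content").hexdigest(),
}

def is_known_banned(content: bytes) -> bool:
    """Exact-match lookup: catches only byte-identical copies of banned
    material, which is why hash matching alone has limited utility."""
    return hashlib.sha256(content).hexdigest() in banned_hashes

print(is_known_banned(b"known banned content"))   # True: exact copy is caught
print(is_known_banned(b"known  banned content"))  # False: one extra space
                                                  # and the hash no longer matches
```

Perceptual-hashing schemes tolerate some alteration, but they too are routinely defeated by more substantive edits, which is where classifiers come in.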
The most obvious reason why platforms still need human moderation is that even when classifiers are used, they won’t catch everything that violates platform policy (imperfect recall), and they will flag or remove content that does not in fact violate it (imperfect precision). In particular, there are a number of ways in which classifiers have historically been more likely to be inaccurate:
Even the best classifiers will not be 100% reliable on both metrics, and some policy categories are particularly challenging because they require a lot of context to adjudicate accurately. For some categories, over 99% of moderated content items at top platforms are picked up by automated means, including AI classifiers; but for violation areas that typically require far more context about the relationships between users and their history of communications, such as bullying and harassment, a much greater proportion of moderation actions are taken by humans. (It is also worth noting that the spam category, which should be reasonably susceptible to automated detection, occurs in such astronomical quantities that even a well-tuned battery of automated detection tools still leaves hundreds of millions of actions requiring human intervention every year at the largest platforms.) For images, the distinction between male and female nipples, and between breastfeeding and pornographic contexts, has been a long-standing trial for both policy and detection teams.
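The two error modes, imperfect recall and imperfect precision, can be made concrete with a toy calculation. The counts below are hypothetical, chosen only to show how each metric is computed from a classifier’s hits and misses:

```python
def precision_recall(true_positives: int, false_positives: int,
                     false_negatives: int) -> tuple[float, float]:
    """Precision: of the content flagged, how much actually violated policy.
    Recall: of the content that violated policy, how much was flagged."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall

# Hypothetical run: 90 violating posts flagged correctly,
# 10 benign posts wrongly flagged, 30 violating posts missed.
p, r = precision_recall(90, 10, 30)
print(f"precision={p:.2f}, recall={r:.2f}")  # precision=0.90, recall=0.75
```

Every missed item (recall shortfall) is a potential harm left online, and every wrong flag (precision shortfall) is a potential appeal, so both error modes generate work for human moderators.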
A very large platform like Facebook has huge amounts of data that can be used to train classifiers: content posted on its own surfaces that do and do not violate its own policies. The models that are trained on this data will more accurately classify future Facebook posts than a general-purpose classifier, or one trained on thin data from a platform that is much smaller and has a shorter history.
Studies have shown that some classifiers are more likely to identify content written in African American Vernacular English (AAVE or “Black English”) as violative, and both activists and platforms have long recognized that attempts to make hate speech classifiers color-blind cause them to miss the crucial element of asymmetry in racism. Additionally, machine learning systems tend to perpetuate any biases present in the training data (not only those following traditional prejudices), meaning that tendencies to be overly strict or lenient in the past moderation decisions used to train classifiers will persist, and may be harder to correct than simply giving feedback to human moderators.
While it may be hard to find human moderators who are proficient in all the languages in which platform users contribute content, and in the relevant cultural context, relying on multilingual foundation models to moderate this content poses its own challenges. So-called “low-resource languages” have a limited body of text on which to train, often from very particular sources such as the Bible and Wikipedia, which leaves the resulting classifiers less adept at handling conversational text and unable to pick up on nuances in specific dialects. They may also suffer from “translationese,” the defects introduced when machine-translated texts are included to fill out the training data for these languages. Multilingual models also have to trade off accuracy between their different languages, and benchmarking and correcting classifiers across all languages can be very difficult.
Users can be quick to adapt to enforcement approaches based on specific keywords by adopting what has become known as “algospeak.” If platforms only have access to basic pre-trained models, even simple evasion efforts can undermine the accuracy of their classification.
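How easily keyword-based enforcement is evaded can be sketched in a few lines. The blocklist term below is a hypothetical placeholder, not a real platform’s list; a single character substitution, a classic algospeak move, defeats the filter entirely:

```python
# Hypothetical blocklist; real platforms maintain far larger, curated lists.
BLOCKED_KEYWORDS = {"forbiddenword"}

def violates(text: str) -> bool:
    """Naive keyword filter: flags text only if a token exactly
    matches a blocklisted keyword."""
    tokens = text.lower().split()
    return any(tok in BLOCKED_KEYWORDS for tok in tokens)

print(violates("this contains forbiddenword here"))  # True: exact match caught
print(violates("this contains f0rbiddenw0rd here"))  # False: swapping o -> 0
                                                     # evades the filter
```

Expanding the blocklist with known substitutions becomes an endless cat-and-mouse game, which is one reason platforms turn to classifiers that generalize beyond exact strings, and to human moderators when those classifiers are evaded in turn.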
While large platforms have highly trained teams dedicated to developing and implementing AI classifiers, smaller platforms have to divert limited engineering resources to bring externally developed models online, whether off-the-shelf or custom built for the platform. In cases where the platform already has an infrastructure for small-scale human moderation, the switching cost may be significant, especially as some level of human moderation will almost certainly need to be retained. Those running primitive forums may have no ability to introduce non-native moderation tools at all. Additionally, as policies evolve and needs change, models will need to be tweaked or switched, which is a different proposition than simply issuing new guidelines to human moderators.
As media becomes more sophisticated, from text to audio to images, video, livestream audio/video and even VR content, or niche formats like 3D printing files, classifiers become harder to develop or to obtain, whether in-house, as open-source products or from vendors. This constitutes another barrier to adoption for those platforms including these media.
The EU Digital Services Act (DSA) requires many platforms (above 50 employees / 10 million EUR annual turnover) to include in their transparency reports the details of any classifiers used for content moderation purposes, including accuracy scores and numbers of decisions made by them. It also requires that the statement of reasons for content moderation actions provided to users and to the EU’s database note the involvement of automated means in detection or removal of the content. India’s Information Technology (Intermediary Guidelines and Digital Media Ethics Code) Rules also require some details of automated content moderation in regular transparency reports. The Online Safety Bill similarly requires “proactive technologies” to be detailed in risk assessments and transparency reports.
Some human oversight is already required for complaint handling under the DSA, and the Online Safety Bill includes automated decisions as a complaints category that must be appropriately handled. India’s regulations also require human oversight and review of automated content moderation systems. As the EU and other jurisdictions develop general AI regulations, platforms will have to keep aware of any obligations that may fall on them due to their use of AI classifiers for content moderation. (Canada appears to be considering including content moderation as a high impact category in forthcoming legislation.) In particular, users may have specific rights regarding appeal of any automated decisions to human adjudication, preventing the wholesale removal of humans from the moderation process.
Smaller-scale platforms, and those with largely autonomous communities, whether the communities constitute the entire platform (as on Discord or Reddit) or form a component of a larger one (like Facebook Groups), may also rely more heavily on human moderation. While the scale of larger forums requires detailed internal policies so that decisions can be replicated as consistently as possible, smaller platforms may appropriately decide to retain more discretion, as moderation is performed by a small number of insiders with a deep understanding of the platform’s values and community. All of these contexts are also more likely to include idiosyncratic elements in their community guidelines that are less suited to off-the-shelf solutions, and they are rarely resourced well enough to develop custom classifiers.
Lastly, some have raised philosophical objections to leaving something as intrinsically human as content moderation to automated means. Tarleton Gillespie, a pioneer of the academic study of content moderation, raises this in a 2020 paper, writing,
“Calling something hate speech [for example] is not an act of classification, that is either accurate or mistaken. It is a social and performative assertion that something should be treated as hate speech, and by implication, about what hate speech is.”
However, he does not call for the complete exclusion of AI classifiers, but for these tools to “support” and not “supplant” human moderators.
Many of the weaknesses of AI for content moderation above have been described in a series of reports authored between 2017 and 2020. However, in the last few years, the relevant technologies have seen significant progress. The recent huge steps forward in large language models have facilitated the production of more easily fine-tuned and customizable classifiers for content moderation. Larger platforms have invested in improving their models in a variety of directions.
These years have also seen a proliferation of vendors (including Unitary) dedicated to developing more accurate content classification models and making them readily available to platforms of all types and sizes, sometimes even customizing the models for their clients’ context and platform. This includes significant strides in improving classifier performance with multiple languages, including low-resource languages. Social media sites relying on community moderation have rolled out more tools to help volunteer moderators, and third party tools have also emerged. With the speed of progress in this area, platforms will need to constantly re-evaluate their options to judge if they are making best use of the content classifier technology that is now available.
Even as AI tools for moderation and other aspects of trust and safety work improve and platforms increasingly adopt them, it is clear that humans will never be eliminated from the process of policy development and that human review will still play an important role, for practical and regulatory compliance purposes—and for values-led reasons too. It is easy to think of user-generated content as a low-stakes area when compared with the AI application categories that are typically defined as high-risk, like healthcare, hiring and criminal justice. However, many people depend on online platforms for their livelihood, and political speech and human connection are foundational to our societies. For this reason, those implementing any level of automated decision making into their processes should consider incorporating responsible AI principles, such as transparency, accountability and inclusivity into their ethical impact planning.