Applying Goodhart’s law to AI generated content detection: Misuse & Innovation

Deepfake detection often feels like a vicious cycle. As methods are developed to detect deepfakes, the AI systems generating content evolve and bad actors change tactics, and it's not long before the metrics and thresholds for detecting deepfakes become outdated. In this article we explore how Goodhart's law, 'when a measure becomes a target, it ceases to be a good measure', can help us see things in a different light.

Ippolita Magrone

What can Goodhart's law teach us about deepfake detection?

Goodhart’s law was originally formulated in 1975 by British economist Charles Goodhart as “any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.” Today it is commonly stated as:

 When a measure becomes a target, it ceases to be a good measure.

Applying Goodhart’s timeless adage to artificial intelligence (AI) misuse and innovation in the context of AI generated content (AGC) detection is valuable for several reasons. Oftentimes, it seems like we are stuck in a cat-and-mouse dynamic. As methods are developed to detect AGC, the AI systems generating content evolve and bad actors change tactics. Soon the metrics and thresholds for detecting AGC become outdated. 

A multi-pronged or multi-faceted approach, common in the cybersecurity industry, could help us break free from this arms race. What Goodhart’s law implies is that instead of solely relying on one measure, we must diversify our strategies of attack. Ideally, this would mean having multiple prongs (i.e. detection tools) simultaneously approaching the deepfake from different angles. This should be further fortified by a wider regulatory, policy and ethical regime that is supportive of the processes in place.   
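
To make the idea of multiple prongs concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the signal names (`facial_inconsistency`, `lip_sync_error`, `plausibility`), the 0.5 thresholds and the `evaluate` helper are hypothetical and do not describe any real detection tool. Each prong looks at a different signal, and the media is flagged if enough prongs fire.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical media record: each value stands in for a signal that a real
# detection tool would have to compute. Names and scales are assumptions.
Media = Dict[str, float]


@dataclass
class Prong:
    name: str
    score: Callable[[Media], float]  # suspicion score in [0, 1]
    threshold: float                 # prong fires at or above this score


def evaluate(media: Media, prongs: List[Prong], min_flags: int = 1) -> dict:
    """Run every prong and flag the media if at least `min_flags` of them fire."""
    fired = [p.name for p in prongs if p.score(media) >= p.threshold]
    return {"flagged": len(fired) >= min_flags, "fired": fired}


PRONGS = [
    Prong("facial_motion", lambda m: m["facial_inconsistency"], 0.5),
    Prong("audio_sync", lambda m: m["lip_sync_error"], 0.5),
    # Low contextual plausibility is treated as high suspicion.
    Prong("context_plausibility", lambda m: 1.0 - m["plausibility"], 0.5),
]

# A fake tuned to beat only the facial-motion check still trips the others.
tuned_fake = {"facial_inconsistency": 0.1, "lip_sync_error": 0.7, "plausibility": 0.2}
result = evaluate(tuned_fake, PRONGS)
print(result)
# {'flagged': True, 'fired': ['audio_sync', 'context_plausibility']}
```

The design choice worth noting is that the prongs are independent of each other: an adversary who learns to defeat one of them has not automatically defeated the rest.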

How does Goodhart’s law apply to deepfake detection? 

For the sake of this discussion, let’s define ‘AI-generated content’ (AGC) as any media content (text, audio, image, video etc.) that is produced by an AI system with little to no substantial human intervention. The role of humans is typically to provide initial prompts, training data and guidance to the AI system, which then autonomously generates the content. This effectively means that the AI system takes an active ‘generative’ role. 

Deepfakes are a form of AGC created using deep learning techniques where neural networks are trained on large datasets of visual and audio content. This type of synthetic media mimics existing content, often so convincingly that viewers or listeners cannot tell deepfakes apart from authentic human-created content.

Goodhart’s law, “when a measure becomes a target, it ceases to be a good measure”, is quite relevant when considering the current state of deepfake detection.

For example, inconsistencies in facial movements are often regarded as a measure for detecting deepfakes. Initially this measure may prove effective in distinguishing real from fake. However, once this measure becomes a target, bad actors can optimise their algorithms to specifically address and minimise those inconsistencies, effectively bypassing the detection method.

In other words, once the measure becomes a target, the people hardening a detection tool against bad actors may end up optimising only for that measure, treating it as a proxy for detecting fake content. By focusing on the proxy alone, they can lose sight of the bigger problem of detecting generated content, which makes it easier for bad actors to circumvent detection.
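
The facial-movement example can be illustrated with a toy simulation. This is a sketch under strong assumptions, not a model of any real detector: each fake is reduced to two made-up artifact scores (`facial` and `lighting`), the adversary is assumed to tune only the facial score, and real (non-fake) media and false positives are left out entirely.

```python
import random

random.seed(0)  # reproducible toy data


def make_fake(tuned: bool) -> dict:
    """Toy deepfake with two artifact scores in [0, 1]; higher = more visible."""
    facial = random.uniform(0.6, 1.0)
    lighting = random.uniform(0.6, 1.0)
    if tuned:
        # Assumption: the generator has been optimised against the facial
        # measure only, so that artifact drops while the other is untouched.
        facial = random.uniform(0.0, 0.3)
    return {"facial": facial, "lighting": lighting}


def single_measure(sample: dict) -> bool:
    """Detector that relies on one measure: facial inconsistency."""
    return sample["facial"] >= 0.5


def broader_measure(sample: dict) -> bool:
    """Detector that also checks a second, independent artifact."""
    return sample["facial"] >= 0.5 or sample["lighting"] >= 0.5


def catch_rate(detector, samples) -> float:
    """Fraction of fakes the detector flags."""
    return sum(detector(s) for s in samples) / len(samples)


naive_fakes = [make_fake(tuned=False) for _ in range(1000)]
tuned_fakes = [make_fake(tuned=True) for _ in range(1000)]

print("single measure, naive fakes:", catch_rate(single_measure, naive_fakes))   # 1.0
print("single measure, tuned fakes:", catch_rate(single_measure, tuned_fakes))   # 0.0
print("broader measure, tuned fakes:", catch_rate(broader_measure, tuned_fakes)) # 1.0
```

The numbers are deliberately clean because the score ranges are hard-coded. In practice the second artifact could become the next target, which is why the set of measures has to keep changing.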

Using Large Language Models to detect deepfakes?

Large Language Models (LLMs) are a class of natural language processing systems that are trained on huge datasets of natural language in order to learn nuanced patterns and statistical relationships between words and concepts. Commercially available examples include OpenAI’s ChatGPT, Anthropic’s Claude and Google’s Bard.

We ran a little experiment to see whether LLM-based systems like ChatGPT-4 and BLIP-2 could classify “an image of Pope Francis wearing a large fashionable puffy coat” as real or AI-generated. Although the Pope’s deepfake went viral in March 2023, ChatGPT’s training data only goes up to 2021, meaning that it lacks knowledge about the actual event. Therefore, all it can use is its learnt world model, through which it predicts what comes next in different scenarios. In this case, by having a sense of 'how the world works' and drawing on large volumes of past information, ChatGPT-4 is able to determine that “it is not common to see the Pope in casual or trendy fashion wear.” ChatGPT-4 makes this claim without even ‘seeing’ the picture, relying only on contextual knowledge of past events and our description of the picture.

Screenshot of conversation with ChatGPT-4 when asked "Is an image of Pope Francis wearing a large fashionable puffy coat likely to be real or generated?"

It is not surprising that BLIP-2, a tool that enables LLMs to understand images, is not as reliable in classifying the same image as a deepfake. Arguably, this is because it is built on smaller and less powerful LLMs such as Flan-T5 and OPT, and consequently its world model is much smaller than that of ChatGPT-4. Even though it allows the image to be uploaded, its knowledge of how the world works is limited, which is why its answers are less reliable (it first answered yes and then no).

Although LLMs can be used to analyse a description of visual content, using them does present challenges. LLMs don’t actually ‘understand’ visual content: since they use textual inputs and outputs, all content needs to be described in text. This raises two problems: a) the description step is an opportunity for biases and errors to creep in and influence the results, and b) if the reasoning capabilities of LLMs become a target, they may no longer be a good measure.

However, this doesn’t mean that LLMs should be discarded in the battle against deepfakes. They could be used, for example, as part of a multi-pronged approach, combining AI reasoning with other detection techniques. Overall, it is precisely this process of experimenting with and combining different strategies that leads to adversarial progress (i.e. innovation) and a better understanding of AI technology as a whole.

Could AI misuse actually lead to innovation?

Could AI misuse by bad actors actually lead to innovation? An intriguing question to say the least, and a controversial one at best. On the one hand, one could argue that it doesn’t, because detection systems are simply being subverted under the same pressure each time. If you are stuck in an infinite loop, the outcome is survival, not innovation.

What’s for certain is that AI misuse is not something we should hope for in order to innovate. Improving technological systems is good, but it shouldn’t happen at the expense of humans. Misguiding models with malicious inputs can have real consequences for people’s lives and disproportionately impact minorities. Critics have suggested that the “move fast and break things” approach used in tech does not quite apply here. We can’t release models irresponsibly and see what happens. Worst of all, we can’t release them irresponsibly and hope for innovation.

On the other hand, the more optimistic view is based on the idea of adversarial progress. That is, as deepfake generators become more advanced, so do the detection algorithms. With each iteration, both systems learn from each other, contributing to the overall development of AI technologies as a whole (even outside the realm of AGC detection). 

There is no clear-cut answer and arguments exist on both sides. Ultimately, a nuanced view is that some adversarial stress testing can be productive to foster innovation, but too much focus on adversarial dynamics as the status quo can be counterproductive. 

Goodhart’s law: key takeaways? 

If there is one thing Goodhart’s law emphasises, it is that systems relying solely on narrow measures are more vulnerable than those relying on a broader set of measures. Imagine a state building with a very rigid security system, equipped with both guards and technical infrastructure. The human, adaptive nature of guards makes them to an extent unpredictable, whereas the technical security system (alarms, clocks, surveillance, etc.) is more predictable and therefore more prone to exploitation.

Ultimately, a mix of both is more secure than sole reliance on either. When dealing with bad actors, having multiple detection tools is essential. They are trying to fool you, which is why you need to try to fool them. One cannot over-rely on a single measure. Instead, continuously adapting and updating detection methods as the technology evolves is key. Of course, this process cannot exist in isolation; it should be supported by online policies, regulatory bodies and ethical frameworks that facilitate the undertaking.