GenAI Watermarking: A Trust & Safety Primer

The implementation of watermarking for AI-generated content poses several challenges. This Deep Dive explores the complexities of watermarking different media types, including images, audio, video, and text, and highlights the difficulty of creating robust watermarking techniques that can withstand various attacks.

Tim Bernard

Those concerned with misinformation and other forms of deception have been worrying for some time about generative AI (GenAI) blurring the lines between genuine and ersatz. And many systems promising to detect AI-generated content are just not very good (which is why OpenAI recently shut down their own tool that was meant to detect ChatGPT output). As the models improve, detection will continue to get harder and harder. Seven of the most prominent companies involved with generative AI recently committed to a number of responsible AI measures in coordination with the White House. Under the rubric of “trust”, these included a commitment to:

Develop and deploy mechanisms that enable users to understand if audio or visual content is AI-generated, including robust provenance, watermarking, or both, for AI-generated audio or visual content.

Watermarking here is sure to refer to invisible watermarking, designed to be machine-detectable, rather than an embedded visual marker as commonly found on stock photos, which either obscures the image or can be removed very easily. That is, these companies are committing to embed their GenAI outputs with an identifier that, when processed by a detection system (potentially before a platform user ever encounters it), will reveal that the content was produced by an AI system, rather than by a human.

A commitment like this is unsurprising: firms like these have been open about the risks of AI, and eager to show themselves to be good actors, attending meetings with lawmakers and administrators—perhaps also hoping to fend off regulation that could stunt their own business. Watermarking is now required for generative AI content in China, and some US lawmakers have also been proposing such a regulation, though it is unclear if that would be compatible with the American free speech legal regime. 

Despite this growing enthusiasm for watermarking GenAI content, and a flurry of related research over the last few years, the details of how this will be implemented are unclear. Experts such as Sam Gregory from the non-profit WITNESS and Dr. Florian Kerschbaum, professor of computer science and a member of the Cybersecurity and Privacy Institute at the University of Waterloo, have expressed skepticism at whether these systems will be as robust as some expect.

What Does this Mean for Trust & Safety?

Discussion of detecting or watermarking GenAI content usually takes the perspective of an individual looking at a particular item of content: how can this person tell if it was created by a model? For images, audio, and video, this is typically a journalist verifying a source or a layperson trying to understand if a piece of media truly represents something that has happened. For text, we most often think about educators evaluating if a class submission was truly written by the student, a use-case called out in the release note for OpenAI’s now-deprecated detector.

For Trust & Safety purposes, the problem is somewhat different. Some specific-purpose websites (perhaps art marketplaces or review sites) or surfaces (perhaps profile photos) may choose to ban GenAI content outright, but many other platforms would seek to clearly label it so that other users can then easily understand it for what it is without being misled. (The EU Commission has begun calling on platforms to do just this.) There may also be contexts where this content should be algorithmically demoted in a feed or recommendation section as it is more likely to be low-quality; in other cases it may be flagged for further investigation as evidence of spam or inauthentic behavior. Whatever the purpose, the classification of the content will have to be done via automated means and at scale. 

Watermarking by Media Type

Watermarking technology is entirely dependent on the type of media in question, and so each category must be considered separately.


Images

Images are perhaps the locus of most attention for watermarking, and are a key issue for GenAI misinformation concerns, especially as systems like Midjourney can already produce reasonably convincing photorealistic images. In the T&S world, AI-generated profile images have been a known issue for some time. A study published in 2022 reviews watermarking techniques that have been proposed over the last several years and assesses how robust they are in the face of a range of possible attacks, concluding that there are several promising schemes, though some challenges remain.

More recently, and focusing specifically on the application of watermarks to GenAI images, a preprint from researchers at Duke suggests that current watermarking technology is insufficient. The paper divides watermarking schemes into two categories. The first is the more traditional kind, in which an algorithm embeds the watermark information by altering certain image pixels; this kind of watermarking is already in use by Stable Diffusion. The second is more sophisticated: the encoding and decoding are themselves performed by machine learning systems. These can benefit from adversarial training against common post-processing methods used in attacks (as may be familiar from hash-matching evasion).
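
To make the pixel-altering category concrete, here is a minimal least-significant-bit (LSB) sketch. This is a toy illustration of the traditional approach, not the scheme any production system actually uses; the function names are our own.

```python
import numpy as np

def embed_lsb(image: np.ndarray, bits: np.ndarray) -> np.ndarray:
    """Hide watermark bits in the least-significant bit of the first pixels."""
    flat = image.flatten().copy()
    flat[: len(bits)] = (flat[: len(bits)] & 0xFE) | bits
    return flat.reshape(image.shape)

def extract_lsb(image: np.ndarray, n_bits: int) -> np.ndarray:
    """Read the watermark back out of the least-significant bits."""
    return image.flatten()[:n_bits] & 1
```

The marked image differs from the original by at most one intensity level per pixel, so the change is invisible; but the watermark is equally fragile, and any re-encoding or added noise destroys it, which is one reason production schemes spread the signal more robustly through the image.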

However, the authors were generally successful in obscuring the watermarks in both categories without significant visible damage to the images. They developed an improved version of established attack algorithms to home in on a successful post-processing treatment by checking repeated iterations against the decoder—in real life this would mean querying an API or submitting the image to a platform and seeing if it was rejected. This confirms the comments of the experts cited earlier that there are known attacks that can overcome current watermarking techniques.
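
The general shape of such a query-based attack can be sketched as a loop: perturb the image, ask the black-box detector whether the watermark survives, and escalate until it does not. The toy LSB detector and parameters below are illustrative assumptions, not the Duke paper's actual algorithm.

```python
import numpy as np

def lsb_detect(image: np.ndarray, bits: np.ndarray) -> bool:
    """Toy detector: does the image still carry the expected LSB watermark?"""
    return np.array_equal(image.flatten()[: len(bits)] & 1, bits)

def evasion_attack(image, oracle, rng, max_queries=50):
    """Escalate random perturbations, querying the black-box detector each
    round, until the watermark is no longer found (or we give up)."""
    for strength in range(1, max_queries + 1):
        noise = rng.integers(-strength, strength + 1, size=image.shape)
        candidate = np.clip(image.astype(int) + noise, 0, 255).astype(np.uint8)
        if not oracle(candidate):
            return candidate, strength
    return None, max_queries
```

Because the attacker only needs a yes/no signal per query, even a detector kept behind an API leaks enough information for this search, which is why rate-limiting and query monitoring matter for any watermark-checking service.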


Audio

Audio watermarking is also well-established, used primarily in digital rights management contexts, though another interesting current use-case is Zoom’s option to embed an attendee’s email address as a watermark in their meeting audio. Watermarking for audio often uses a technique similar to traditional image watermarking, known as spread-spectrum watermarking, where the data is encoded throughout the normal audible range, with efforts to keep it imperceptible to human hearing. Echo modulation, where the placement of unnoticeable echoes encodes a watermark, is an alternative approach.
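
A minimal time-domain sketch of the spread-spectrum idea: a key-seeded pseudo-random ±1 "chip" sequence is added at low amplitude across the whole waveform, and detection correlates the signal against the same sequence. Real schemes shape the signal psychoacoustically in the frequency domain; the parameters and names here are illustrative only.

```python
import numpy as np

def embed_spread_spectrum(signal: np.ndarray, key: int, strength: float = 0.05) -> np.ndarray:
    """Add a key-seeded pseudo-random +/-1 chip sequence, scaled to stay
    quiet relative to the signal, across the entire waveform."""
    rng = np.random.default_rng(key)
    chip = rng.choice([-1.0, 1.0], size=len(signal))
    return signal + strength * chip

def detect_spread_spectrum(signal: np.ndarray, key: int,
                           strength: float = 0.05, threshold: float = 0.5) -> bool:
    """Correlate against the same chip sequence; only a signal watermarked
    with the matching key correlates strongly."""
    rng = np.random.default_rng(key)
    chip = rng.choice([-1.0, 1.0], size=len(signal))
    score = float(np.dot(signal, chip)) / (strength * len(signal))
    return score > threshold
```

Because the chip sequence is pseudo-random, an unwatermarked signal (or one marked with a different key) correlates to roughly zero, while the marked signal scores near one, giving a clean statistical separation.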

Recent efforts have been made to increase the robustness of audio watermarks to protect them from attack, including, as with images, machine learning-based systems. At least one GenAI audio producer, Resemble AI, has announced the rollout of such a system to mark the content that it generates, though it remains to be seen how the watermark will be decoded to enable its identification by platforms or individuals.


Video

Video straddles the history of well-known GenAI: “deepfake” video emerged into public consciousness with the establishment of the eponymous subreddit in 2017, but realistic fully-generated content is still hard to access for most people. As with images and audio, there has been considerable research in this area, even some time ago, with the film industry famously concerned with piracy. A notable difference from images and audio is the increased importance of techniques that are not too computationally expensive to apply or to decode.

A further complication is that video is often a container for still images and audio, some of which may well be AI-generated, as in the recent US political ad where Donald Trump’s “voice” was played reading one of his social media text posts. This makes tracing any possible watermarks on the original content all the more difficult.


Text

Traditional filters have a harder time catching LLM-generated spam, as it uses far more variable wording, and a new study from the Observatory on Social Media at Indiana University exposed and examined a Twitter botnet (associated with crypto-related scamming) that seemingly made use of ChatGPT to “supercharge” its operation. One of the challenges of plain text is that, unlike images, video, and audio, there is much less opportunity to slip identifying data into the encoding.

A method of watermarking LLM outputs has been developed by University of Maryland researchers, where certain words and letter combinations are used with a different frequency (the New York Times has published a well-illustrated explanation of the approach). OpenAI is working on a watermarking implementation along similar lines. This system is far more probabilistic than the digital watermarking techniques for other media, and some, such as Berkeley’s Hany Farid, believe it will not have much utility for short comments or social media posts, which are perhaps the most common surfaces for trust and safety concerns (the Times quotes one of the Maryland paper’s co-authors claiming that a typical tweet is long enough).
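
The core idea can be caricatured in a few lines: hash the previous token to pseudo-randomly split the vocabulary into a “green” and a “red” half, bias generation toward green tokens, and detect by measuring how far above chance the green hits run. The sketch below is drastically simplified (real implementations bias the model’s logits rather than picking tokens deterministically), and the vocabulary and function names are invented for illustration.

```python
import hashlib
import math
import random

def green_list(prev_token: str, vocab: list, fraction: float = 0.5) -> set:
    """Pseudo-randomly partition the vocabulary, seeded by the previous
    token, and return the 'green' portion a watermarking sampler favors."""
    seed = int(hashlib.sha256(prev_token.encode()).hexdigest(), 16)
    shuffled = sorted(vocab)
    random.Random(seed).shuffle(shuffled)
    return set(shuffled[: int(len(vocab) * fraction)])

def watermark_z_score(tokens: list, vocab: list, fraction: float = 0.5) -> float:
    """How many token transitions landed in the green list, versus chance?
    Large positive z-scores indicate watermarked text."""
    n = len(tokens) - 1
    hits = sum(
        tok in green_list(prev, vocab, fraction)
        for prev, tok in zip(tokens, tokens[1:])
    )
    return (hits - fraction * n) / math.sqrt(n * fraction * (1 - fraction))
```

This also makes Farid's length objection concrete: the z-score grows with the square root of the token count, so a short post simply may not contain enough transitions to distinguish watermarked text from chance.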

As David Cox of the MIT-IBM Watson AI Lab notes in the Times article, the prospects of a “magic tool” that can reliably identify LLM output are poor. Demonstrating the intrinsic difficulty of detecting watermarks in mere words, a second team from the University of Maryland has shown that this sort of watermarking is vulnerable to attack by putting the output text through five recursive rounds of a paraphrasing model.

Platforms that host “single” items of content consisting of combinations of media types (e.g. TikTok videos with separate sound, video, and text elements) will have to develop approaches for internal detection and external labeling that account for this complexity.

Provenance Standards

Some of these approaches to watermarking media will require a secret key or a proprietary model to decode. This may be a cause for trepidation on the part of platform teams, who would need to run each item in scope against the APIs of every producer of GenAI content to check if it contains any of their watermarks. And these APIs would need to process the massive quantities of media that are uploaded to large platforms every day in a timely manner without being overwhelmed. Thankfully, there are also emerging standards for digital media provenance, which present an alternative approach to watermarking, including:

IPTC Metadata

The IPTC is a body that establishes technical standards for news media. Their photo metadata standards (relying on the XMP technology for embedding metadata into image files) are used well beyond the news industry, by many photographers and others who work with images. This metadata can be integrated into platforms, as can be seen in Google image search results for licensable images. The IPTC has recently introduced terms and guidance for describing a range of images that were wholly or partially generated by AI under its Digital Source Type category. Google and Midjourney have both committed to include this metadata in GenAI outputs.
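
At detection time, a platform pipeline might check uploads for the IPTC Digital Source Type term for fully AI-generated media. The sketch below is a deliberately naive byte scan; real tooling should parse the embedded XMP packet properly, and (as discussed below) this metadata is trivially strippable, so its absence proves nothing.

```python
# IPTC Digital Source Type URI for media created by a generative model.
TRAINED_ALGORITHMIC_MEDIA = (
    b"http://cv.iptc.org/newscodes/digitalsourcetype/trainedAlgorithmicMedia"
)

def looks_ai_generated(file_bytes: bytes) -> bool:
    """Naive scan of a media file's raw bytes for the IPTC source-type URI
    inside an embedded XMP packet."""
    return TRAINED_ALGORITHMIC_MEDIA in file_bytes
```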


C2PA

XMP metadata is freely editable using widely available software; anyone can write or edit it to say anything. That is where the C2PA standard, created by the Coalition for Content Provenance and Authenticity, comes into play. C2PA is a means to cryptographically bind a metadata manifest to a piece of media (image, audio, video) to establish who created it, along with any additional information, including, as of now, whether it was AI-generated. Further editing or processing of the media is then recorded as an additional layer of the manifest. The C2PA specifications now contain guidance for adding manifests to GenAI output, specifying the source and type of output and recommending other information such as the timestamp and prompt.
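
The "cryptographic binding" can be illustrated in miniature: hash the media bytes into the manifest body, then sign the whole body, so that neither the media nor the assertions can be altered without breaking verification. Real C2PA uses X.509 certificate chains and JUMBF containers rather than a shared HMAC key; the code below is a conceptual stand-in, not the actual format.

```python
import hashlib
import hmac
import json

def sign_manifest(media: bytes, assertions: dict, signing_key: bytes) -> dict:
    """Bind assertions to the media: hash the bytes into the manifest body,
    then sign the whole body (HMAC stands in for certificate signing)."""
    body = {"assertions": assertions,
            "media_sha256": hashlib.sha256(media).hexdigest()}
    payload = json.dumps(body, sort_keys=True).encode()
    sig = hmac.new(signing_key, payload, hashlib.sha256).hexdigest()
    return {"body": body, "signature": sig}

def verify_manifest(media: bytes, manifest: dict, signing_key: bytes) -> bool:
    """Check the signature over the body, then check the media hash."""
    payload = json.dumps(manifest["body"], sort_keys=True).encode()
    expected = hmac.new(signing_key, payload, hashlib.sha256).hexdigest()
    return (hmac.compare_digest(manifest["signature"], expected)
            and manifest["body"]["media_sha256"] == hashlib.sha256(media).hexdigest())
```

Note what this does and does not protect: editing the media or the assertions breaks verification, but nothing stops someone from simply discarding the manifest altogether, which is exactly the stripping problem discussed below.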

This technology was designed to affirmatively prove the authenticity of, for example, a photo taken by a photojournalist at a particular time, and show that it hasn’t been significantly edited. Even the GenAI-specific metadata essentially just proves which system generated the media and how it classifies it. Disproving that media is authentic is a somewhat different use-case: it seems that current implementations of C2PA allow for the manifests to be stripped from the media, and simply taking a screenshot will also remove the manifest. Nevertheless, Microsoft reportedly intends to use C2PA to add metadata to the media produced by their GenAI systems.

A Long Road Ahead

For examples like the Pope and Donald Trump’s arrest images from Midjourney linked earlier, which were not intended to deceive, it may be sufficient for mainstream GenAI models to append IPTC standard metadata (with or without C2PA) and for platforms to build features that read it and label the images accordingly. And this may satisfy many regulators. 

However, much of Trust and Safety work involves dealing with determined bad actors, who may be using their own implementations of open-source models. Even if the developers of these models embed watermarking into the generation process (unlike Stable Diffusion, for which one can apparently remove the watermarking code with ease), and platforms and GenAI developers can create a workable system to check media for all watermarks at scale (perhaps through a centralized trusted system), there are, as we have seen, measures that can be used to avoid detection. One further concern is the potential for watermarks or provenance statements to be faked, which could be used to undermine authentication writ large and sow distrust. 

There is clearly a lot of energy and hard work going into this endeavor, from GenAI companies, government bodies, researchers, and standards bodies. Several of the examples in this article, on both watermarking and watermark evasion, are still at the initial research stage, though we can surmise that the development and application of these methods at scale are likely to proceed more or less in step with each other. It is therefore clear that there is still a long way to go, and platforms will still have much to do if they are to deal effectively with the challenges posed by identifying GenAI media.

After this article was finalized, Google announced the beta release of SynthID, a new GenAI invisible watermarking tool for their Imagen image-generating product (though they hope to expand its use in the future). This is a model-generated watermark that Google claims is robust against common manipulations, though they admit that it is not impossible to remove the watermark by more extreme transformations. This aligns with the strengths and vulnerabilities described above for the best available technologies.