Why do we need multimodal algorithms?
Algorithms have always been integral to IT evolution, but as the volume and complexity of data increase, so does the difficulty of interpreting and managing it. This is particularly true in the era of big data operations that rely on drawing actionable insights from unstructured data sets.
Traditional algorithms are no longer capable of dealing with the contextual nature of data applications – and nowhere is this more apparent than in the area of brand and platform safety. This is where multimodal algorithms can assist.
What are modalities?
Consider an average web page – it contains a body of text with images that help to simplify or reinforce the overall message of the page. It is likely that each image has a caption to help explain its purpose. And behind the scenes, descriptive tags attached to each image provide further context for screen readers and other accessibility tools.
Each of these elements – the text, images, captions and meta descriptions – is a ‘modality’. Traditional single-mode algorithms are very good at analysing and processing a specific element – but they cannot provide any context in relation to the other modalities.
What does this look like in practice? A single-mode algorithm designed to analyse and categorise text cannot be applied to assessing the contents of an image, for instance. It may successfully identify harmful wording on a webpage, but it cannot do the same for a harmful image. Nor can it address the context of the image in relation to the accompanying text.
A multimodal algorithm takes a more nuanced approach, accepting multiple modalities for analysis. In this way, each mode can be considered in isolation and in context simultaneously.
Not all multimodal algorithms are equal
Like data analysis itself, multimodal algorithms are constantly evolving and improving. As a result, not all multimodal algorithms are equally effective or efficient.
Indeed, some multimodal algorithms actually operate like a collection of single-mode algorithms. Taking the example of a simple webpage again, these systems have one algorithm for text and another for images. The text on the page is analysed and assigned a weighted score that indicates the likelihood of the content being considered acceptable. A second algorithm performs the same test on each image on the page, again generating a weighted score. Finally, a third algorithm assesses both weighted scores to provide an overall pass/fail assessment for the complete page.
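The three-step scoring described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of any real product: the blocklists, the `Page` type, the equal weights and the 0.7 threshold are all hypothetical, and the image labels are assumed to come from some upstream image classifier.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Hypothetical blocklists, for illustration only.
BLOCKED_TERMS = {"scam", "violence"}
BLOCKED_LABELS = {"weapon", "gore"}


@dataclass
class Page:
    text: str
    # One list of labels per image, assumed to be produced by an
    # upstream image classifier (not shown here).
    image_labels: list[list[str]] = field(default_factory=list)


def score_text(text: str) -> float:
    """Algorithm 1: score the text alone (1.0 = acceptable, 0.0 = not)."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    hits = sum(w in BLOCKED_TERMS for w in words)
    return max(0.0, 1.0 - 0.5 * hits)


def score_images(image_labels: list[list[str]]) -> float:
    """Algorithm 2: score each image alone and keep the worst score."""
    scores = [0.0 if BLOCKED_LABELS & set(labels) else 1.0
              for labels in image_labels]
    return min(scores, default=1.0)


def assess_page(page: Page, text_weight: float = 0.5,
                image_weight: float = 0.5, threshold: float = 0.7) -> bool:
    """Algorithm 3: combine both weighted scores into a pass/fail result."""
    combined = (text_weight * score_text(page.text)
                + image_weight * score_images(page.image_labels))
    return combined >= threshold


print(assess_page(Page("A friendly guide to gardening.", [["flower"], ["tree"]])))  # True
print(assess_page(Page("A friendly guide.", [["flower"], ["weapon"]])))  # False
```

Note that the final step only ever sees two numbers. Whatever relationship existed between the wording and the images has already been discarded by the time the scores are combined.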
Although these algorithms work, the disjointed nature of the approach means that the nuances of context cannot be properly assessed. The weighted scores can be tweaked and enhanced, but the risk of content being categorised incorrectly remains higher than when all modalities are assessed together. It is also likely that this type of algorithm will be slower, more resource intensive and therefore more costly to operate over time.
A more effective approach is to apply a truly multimodal algorithm that accepts any input and assesses it in relation to all the other modalities. This would see all text, images, captions and descriptive tags being analysed together to provide a more accurate overall understanding of each element and how they exist in context. With an understanding of context, the multimodal algorithm can make categorisation decisions with greater precision, reducing the risk of harmful or unwanted content ‘sneaking through’.
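To make the value of context concrete, the toy sketch below contrasts checking each modality on its own with checking them together. Everything in it is an assumption made for illustration: the blocklists, the `RISKY_PAIRS` table and the function names are hypothetical, and a real multimodal system would be expected to learn such cross-modal interactions from data rather than rely on a hand-written list.

```python
# Hypothetical per-modality blocklists, for illustration only.
BLOCKED_TERMS = {"scam", "violence"}
BLOCKED_LABELS = {"weapon", "gore"}

# Hypothetical (text term, image label) pairs that are harmless alone
# but may be insulting or harmful when they appear together.
RISKY_PAIRS = {("smell", "skunk")}


def tokens(text: str) -> set[str]:
    """Lower-case the text and strip basic punctuation from each word."""
    return {w.strip(".,!?").lower() for w in text.split()}


def assess_separately(text: str, image_labels: list[str]) -> bool:
    """Pass only if each modality is acceptable when checked alone."""
    text_ok = not (tokens(text) & BLOCKED_TERMS)
    image_ok = not (set(image_labels) & BLOCKED_LABELS)
    return text_ok and image_ok


def assess_jointly(text: str, image_labels: list[str]) -> bool:
    """Pass only if each modality is acceptable alone AND in combination."""
    if not assess_separately(text, image_labels):
        return False
    words, labels = tokens(text), set(image_labels)
    return not any(term in words and label in labels
                   for term, label in RISKY_PAIRS)


caption = "Love the way you smell today"
print(assess_separately(caption, ["skunk"]))  # True: nothing is flagged in isolation
print(assess_jointly(caption, ["skunk"]))     # False: the combination is flagged
print(assess_jointly(caption, ["rose"]))      # True: same text, different context
```

The same caption passes or fails depending on the image it accompanies, which is the kind of decision that separate per-modality scores cannot express.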
Multimodal algorithms will continue to increase in importance in line with the need to process vast amounts of data. But because not all multimodal algorithms are the same, businesses will need to seriously consider the underlying architecture and its potential implications for their brand protection strategies.
Read more about how multimodal algorithms can ensure brand safety.


