
Multimodal AI and Unimodal AI

[University of Pennsylvania]
 
 

- Overview

The field of artificial intelligence has made tremendous progress over the past decade. While traditional AI models and techniques primarily focused on analyzing one kind of data at a time, current technologies such as deep learning, natural language processing, and generative AI take a broader approach to processing data.

To this end, developers and data scientists have developed a variety of techniques to enhance large-scale data processing. One such technology is multimodal AI, which integrates information from disparate sources to better understand the data at hand, allowing organizations to unlock new insights and support a wider range of applications.

To understand what multimodal AI is, you first need to understand the concept of modality. Modality, in its simplest form, refers to how something happens or is experienced. From this perspective, anything that involves multiple modalities can be described as multimodal.

Multimodal models are machine learning (ML) models that can process information from different modalities, including images, videos, and text. For example, Google's multimodal model Gemini can receive a photo of a plate of cookies and produce a written recipe in response, or vice versa.
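
To make this concrete, here is a minimal sketch (in Python) of what a multimodal model's interface might look like. The class, method, and file names are invented for illustration; they do not correspond to Gemini or any real API.

```python
# Hypothetical sketch of a multimodal model interface.
# MultimodalModel, Prompt, and generate() are invented names,
# not a real library API.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Prompt:
    text: Optional[str] = None        # optional text input
    image_path: Optional[str] = None  # optional image input

class MultimodalModel:
    def generate(self, prompt: Prompt) -> str:
        """Produce a text response from any mix of modalities."""
        parts = []
        if prompt.image_path is not None:
            parts.append(f"[analysis of {prompt.image_path}]")
        if prompt.text is not None:
            parts.append(f"[reply to '{prompt.text}']")
        return " ".join(parts)  # stand-in for real model inference

# Image in, text out -- e.g., a photo of cookies yielding a recipe.
model = MultimodalModel()
print(model.generate(Prompt(image_path="cookies.jpg",
                            text="Write a recipe for these.")))
```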

 

- Multimodality

Multimodality is a relatively new term for something extremely old: how people have made sense of the world since the dawn of humanity. People receive information from a variety of sources through their senses, including sight, sound, and touch, and the human brain combines these different forms of data into a highly detailed picture of overall reality.

Communication between people is also multimodal. We use words, sounds, emotions, facial expressions, and sometimes photos, and these are just some of the more obvious ways to share information. In view of this, it is safe to assume that future communication between humans and machines will be multimodal as well.

We're not there yet. The greatest progress in this direction has occurred in the fledgling field of multimodal AI. The problem is not a lack of vision: while technology that can move between modes is clearly valuable, executing it is far more complex than building unimodal AI.

 

- Multimodal Models vs. Unimodal Models

Multimodal and unimodal models represent two different approaches to developing AI systems. Unimodal models focus on training a system to perform a single task using a single source of data, whereas multimodal models seek to integrate multiple sources of data to comprehensively analyze a given problem.

Here is a detailed comparison of the two approaches (a short code sketch contrasting them follows the list):

  • Scope of data: Unimodal AI systems are designed to process a single data type, such as images, text, or audio. In contrast, multimodal AI systems are designed to integrate multiple data sources, including images, text, audio, and video.
  • Complexity: Unimodal AI systems are generally less complex than multimodal AI systems since they only need to process one type of data. On the other hand, multimodal AI systems require a more complex architecture to integrate and analyze multiple data sources simultaneously.
  • Context: Since unimodal AI systems focus on processing a single type of data, they lack the context and supporting information that can be crucial in making accurate predictions. Multimodal AI systems integrate data from multiple sources and can provide more context and supporting information, leading to more accurate predictions.
  • Performance: While unimodal AI systems can perform well on tasks within their specific domain, they may struggle with tasks that require a broader understanding of context. Multimodal AI systems integrate multiple data sources and can offer a more comprehensive and nuanced analysis.
  • Data requirements: Unimodal AI systems rely on a single type of data and typically require large amounts of it to be trained effectively. Multimodal AI systems, by contrast, can often be trained with smaller amounts of data per source, as the different modalities reinforce one another, resulting in a more robust and adaptable system.
  • Technical complexity: The added architectural complexity of multimodal systems means they demand more technical expertise and resources to develop and maintain than unimodal AI systems.
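
To make the architectural difference concrete, below is a minimal sketch in Python (PyTorch) of a unimodal classifier next to a multimodal one that fuses two encoders. The layer sizes and fusion-by-concatenation design are illustrative assumptions, not a description of any particular system.

```python
# Contrast sketch: a unimodal classifier vs. a multimodal one.
# Dimensions and layer choices are illustrative assumptions.
import torch
import torch.nn as nn

class UnimodalClassifier(nn.Module):
    """Single modality: text features in, class scores out."""
    def __init__(self, text_dim=128, n_classes=10):
        super().__init__()
        self.head = nn.Linear(text_dim, n_classes)

    def forward(self, text_feats):
        return self.head(text_feats)

class MultimodalClassifier(nn.Module):
    """Two modalities: encode each, then fuse by concatenation."""
    def __init__(self, text_dim=128, image_dim=256, n_classes=10):
        super().__init__()
        self.text_enc = nn.Linear(text_dim, 64)
        self.image_enc = nn.Linear(image_dim, 64)
        self.head = nn.Linear(64 + 64, n_classes)  # fused width

    def forward(self, text_feats, image_feats):
        t = torch.relu(self.text_enc(text_feats))
        v = torch.relu(self.image_enc(image_feats))
        fused = torch.cat([t, v], dim=-1)  # simple early fusion
        return self.head(fused)

text = torch.randn(4, 128)    # batch of 4 text feature vectors
image = torch.randn(4, 256)   # batch of 4 image feature vectors
print(UnimodalClassifier()(text).shape)           # torch.Size([4, 10])
print(MultimodalClassifier()(text, image).shape)  # torch.Size([4, 10])
```

Concatenation is only the simplest fusion strategy; production systems often use cross-attention or learned gating instead, which adds exactly the kind of complexity the list above describes.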

 

- Multimodal: AI's New Frontier

In practice, generative AI tools use different strategies for different types of data when building large models - the complex neural networks that organize vast amounts of information.

For example, models that draw on textual sources break the text into individual tokens, usually words or pieces of words. Each token is assigned an “embedding” or “vector”: an array of numbers representing how and where the token is used compared to others.

Collectively, these vectors create a mathematical representation of the token’s meaning. An image model, on the other hand, might use pixels as its tokens for embedding, and an audio model might use sound frequencies.
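
As a rough illustration of tokens and embeddings, the sketch below (in PyTorch) maps words to vectors with a learned lookup table. The tiny vocabulary and embedding size are invented, and the vectors here are randomly initialized; a real model learns them during training so that similar usage yields similar vectors.

```python
# Sketch: tokens -> embedding vectors via a learned lookup table.
# Vocabulary and embedding size are invented for illustration.
import torch
import torch.nn as nn

vocab = {"the": 0, "oak": 1, "tree": 2, "leaves": 3, "rustle": 4}
embed = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)

tokens = "the oak tree".split()                 # crude tokenization
ids = torch.tensor([vocab[t] for t in tokens])  # token -> integer id
vectors = embed(ids)                            # id -> 8-dim vector
print(vectors.shape)  # torch.Size([3, 8]): one vector per token

# After training, tokens used in similar contexts end up with
# similar vectors; cosine similarity is one common way to compare.
sim = torch.cosine_similarity(vectors[1], vectors[2], dim=0)
print(float(sim))  # similarity of 'oak' and 'tree' (random here)
```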

A multimodal AI model typically relies on several unimodal ones, “almost stringing together” the various contributing models. Doing so involves various techniques to align the elements of each unimodal model, in a process called fusion.

For example, the word “tree”, an image of an oak tree, and audio in the form of rustling leaves might be fused in this way. This allows the model to create a multifaceted description of reality.
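
As a sketch of what fusion might look like under assumed dimensions, the example below projects text, image, and audio embeddings into one shared space and averages them; after training, the word “tree”, the oak-tree image, and the rustling-leaves audio would land near one another in that space. The projection sizes are invented, and real systems use more sophisticated alignment objectives (e.g., contrastive losses).

```python
# Sketch: late fusion of three unimodal embeddings into a shared space.
# Dimensions and projections are illustrative assumptions.
import torch
import torch.nn as nn

shared_dim = 32
text_proj = nn.Linear(8, shared_dim)     # from a text embedding space
image_proj = nn.Linear(512, shared_dim)  # from an image encoder
audio_proj = nn.Linear(64, shared_dim)   # from an audio encoder

# Stand-ins for the outputs of three unimodal models for "tree":
text_emb = torch.randn(8)     # embedding of the word "tree"
image_emb = torch.randn(512)  # embedding of an oak-tree photo
audio_emb = torch.randn(64)   # embedding of rustling leaves

# Project each modality into the shared space, then fuse by averaging.
aligned = torch.stack([
    text_proj(text_emb),
    image_proj(image_emb),
    audio_proj(audio_emb),
])
fused = aligned.mean(dim=0)  # one multifaceted "tree" representation
print(fused.shape)  # torch.Size([32])

# Training would pull matching modalities together in this space,
# making cross-modal lookups (text -> image, audio -> text) possible.
```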

 

[More to come ...]

