Unlocking the Potential of Multimodal AI: Advancements, Applications, and Future Directions

September 9, 2024

Executive Summary

Artificial Intelligence (AI) has made significant strides with the introduction of Large Language Models (LLMs), leading to the development of multimodal AI. This advanced technology integrates various input types—text, speech, images, and video—to enhance interaction and deliver more accurate and contextually relevant outputs.

Multimodal AI can now process and respond to multiple forms of data simultaneously, improving applications in customer support, multimedia content creation, security, and healthcare. However, it faces challenges such as the need for extensive high-quality data and concerns about privacy and data misuse.

Looking ahead, multimodal AI is expected to drive further innovations in autonomous systems, user interaction, and robotics. Organizations are encouraged to explore these advancements to improve efficiency and stay competitive.

Introduction

Driven by the inherent human need to streamline processes and become more productive, artificial intelligence (AI) has emerged as the most advanced tool yet for helping people accomplish tasks more efficiently than ever before. Users can now interact with these tools the same way we interact with each other: in natural language. One might then wonder, how exactly does artificial intelligence understand human language? An AI assistant understands natural language by implementing what is known as a Large Language Model (LLM).

A Large Language Model (LLM) is a type of artificial intelligence that has been trained on an extraordinary amount of text to identify the relationships between words and sentences. LLMs use a structure called a neural network, inspired by the learning mechanisms of neurons in the human brain. Using this structure, an LLM can identify meaningful semantic relationships in sentences instead of memorizing and storing millions of examples of written text. LLMs represent a milestone in the field of AI because AI tools and assistants can now comprehend vast amounts of data and learn from them, addressing specific user demands more robustly. However, the evolution toward a more versatile and comprehensive AI has demanded the integration of non-traditional inputs, such as images, voice, and video. Hence, multimodal generative artificial intelligence was created to interpret and generate coherent responses from different types of inputs. This analysis will explore its characteristics, applications, challenges, and prospects.
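The semantic relationships an LLM learns are commonly represented as embedding vectors, where related words end up close together. The sketch below illustrates the idea with tiny hand-made vectors; the values and dimensions are invented for illustration, and real models learn vectors with hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional embeddings (illustrative values only).
embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

# Semantically related words have more similar vectors.
print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # high
print(cosine_similarity(embeddings["king"], embeddings["apple"]))  # low
```

This is the core intuition behind "identifying relationships instead of memorizing text": meaning is encoded as geometry, and similarity becomes a distance calculation.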

 

How Does Multimodal AI Work?

As mentioned, multimodal AI accepts multiple types of input modalities. These input methods go beyond written language and consider other forms of communication, much like the way ordinary people communicate. This is precisely the main advantage of multimodal AI, as it allows the user to describe their objective more comprehensively. After all, some forms of communication are better suited for certain tasks: instead of describing an object with words, one can simply show a picture of the object itself for a more complete, and usually faster, description. Before going too deep into the various combinations of inputs, let’s first go over some primary inputs independently.

  • Text – many standard AI tools use a text-based approach in which the user can provide a text document or an on-demand text input to interact with the model. These models use Natural Language Processing (NLP) to make sense of the request and produce an output. Text input can include different languages as well as code.
  • Speech – spoken words are converted into text using Automatic Speech Recognition (ASR), after which they are processed like text input. More sophisticated models can also distinguish differences in tone and identify specific speakers.
  • Images – a key input type, especially for object recognition and image classification.
  • Video – similar to image input, in the sense that a video is simply a sequence of changing images. To understand video input, AI models must be able to identify moving objects and recognize their actions and activities.
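These primary inputs can be sketched as a simple dispatcher that normalizes each modality into text before it reaches the model. The function names and placeholder outputs below are hypothetical stand-ins for real ASR and vision models:

```python
def transcribe_speech(audio_bytes):
    """Placeholder for an ASR step (converting spoken words into text).
    A real system would call a speech-recognition model here."""
    return "<transcribed speech>"

def describe_image(image_bytes):
    """Placeholder for an image-understanding step (object recognition,
    classification). A real system would call a vision model here."""
    return "<image description>"

def normalize_input(modality, payload):
    """Route each input modality to the right preprocessing step so the
    downstream model sees a single, unified representation."""
    if modality == "text":
        return payload
    if modality == "speech":
        return transcribe_speech(payload)  # ASR first, then treated like text
    if modality == "image":
        return describe_image(payload)
    if modality == "video":
        # A video is a sequence of frames; describe each sampled frame.
        return " ".join(describe_image(frame) for frame in payload)
    raise ValueError(f"Unsupported modality: {modality}")

print(normalize_input("text", "Find me a desk lamp"))
```

Real multimodal models fuse modalities at a deeper level than this text-only normalization, but the routing idea is the same: each input type needs its own preprocessing path.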

Exhibit 1:
Multimodal generative AI can interpret and generate coherent responses from different types of inputs.
Differences between unimodal and multimodal generative AI


This list is not exhaustive but summarizes how multimodal AI can better address different use cases and applications, specifically when considering the combinations of input and output methods.

One such application of combined inputs is the file-attachment functionality in ChatGPT 4o. Attaching a file, such as an image, and supplementing it with a text prompt allows users to place the image in a particular context and ask specific questions about it. This combination of files and text prompts benefits entire sectors like healthcare, as discussed in more detail below. Other multimodal AI services accept images and text as inputs and generate new images as outputs. One such example is OpenArt, which allows you to create images from simple sketches or even from simple text prompts. Multimodal tools can also generate text and audio for captions and subtitles for videos. In the following example, we use Synthesia to create a video: we simply describe what we want through a text input, and it returns an avatar saying what we wanted to express, in the language in which we wrote it.
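A request that combines a file attachment with a text prompt can be sketched as a single structured message. The schema below is illustrative only; each provider defines its own request format:

```python
import base64

def build_multimodal_message(prompt, image_path):
    """Package a text prompt and an image into one request body.
    The field names used here are hypothetical, chosen to mirror the
    general shape of multimodal chat APIs."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image", "data": image_b64},
        ],
    }
```

The key point is that both modalities travel in the same message, so the model can answer questions about the image in the context the text provides.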

Exhibit 2:
Users can now create realistic videos from just a text input.
Video created by simply inputting the following prompt: “Hi there I’m Alex and I’m an avatar created entirely by artificial intelligence. This is the power of the multimodal generative AI.”

 

Multimodal Applications

With an understanding of how multimodal generative AI functions and the distinctions between multimodal and unimodal AI, it is time to review some current applications of multimodal AI.

Customer Assistance and Support Systems: Most companies have benefited from artificial intelligence recently, but those offering operational services such as customer service now have a robust and well-developed tool to assist their consumers. Increasingly sophisticated multimodal chatbots and virtual assistants allow users to interact more easily with an institution’s systems and resources. An example of how this technology is used in customer service is the Amazon search box, which allows users to search for items using pictures. For example, if a customer sees a lamp she likes at a restaurant, she can take a picture and use it as the search prompt in the Amazon search box, and Amazon will find that lamp and similar lamps in its database.
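Image-based product search of this kind can be sketched as nearest-neighbor retrieval over image feature vectors. The catalog and vectors below are invented for illustration; in a real system, the vectors would come from a vision model applied to each product photo:

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical catalog: product name -> image feature vector.
catalog = {
    "brass desk lamp":    [0.9, 0.8, 0.1],
    "ceramic table lamp": [0.8, 0.9, 0.2],
    "office chair":       [0.1, 0.2, 0.9],
}

def search_by_image(query_vector, top_k=2):
    """Rank catalog items by visual similarity to the query photo and
    return the closest matches."""
    ranked = sorted(catalog,
                    key=lambda name: squared_distance(query_vector, catalog[name]))
    return ranked[:top_k]

# A photo of a lamp (represented by its feature vector) retrieves the
# most visually similar lamps, not the chair.
print(search_by_image([0.85, 0.85, 0.15]))
```

Production systems index millions of such vectors with approximate nearest-neighbor structures rather than a full sort, but the retrieval principle is the same.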

Exhibit 3:
Users can use images to search indexed databases instead of having to explain what they want.
Amazon example of AI search, where the customer inputs a picture of a lamp, and Amazon finds the lamp and similar lamps in the store

 

 

Image and Video Recognition and Processing: Today, multimodal AI is widely used for generating images and videos, as it not only allows the creation of this content through simple commands but also makes it easy to customize how changes are made. For example, a marketing company that needs to update a client’s logo can use a multimodal program: through commands, the user can ask the AI to generate various ideas for renewing the logo based on the context provided in the text. This type of application is used in multiple industries, such as:

  • Marketing: AI can enhance marketing by personalizing customer experiences, optimizing ad campaigns, and providing deep insights through predictive analytics. It also automates tasks such as content creation and customer service, improving efficiency and effectiveness. Companies like Nike already use multimodal AI to generate AI-assisted advertising. For example, their campaign “Nike: Never Done Evolving feat. Serena” [1] features an AI-generated match between Serena Williams’s younger self, from her first Grand Slam in 1999, and a more modern version of herself from the 2017 Australian Open.
  • Multimedia: AI can revolutionize multimedia by automating content creation, enhancing image and video quality, and enabling advanced editing techniques. It also powers personalized recommendations and improves accessibility through features like automatic captioning and translation. An example of this is DALL-E, which uses a text input provided by the user. The model interprets, through natural language, what the user wants and generates an image based on the context of what is expressed.

Exhibit 4:
Companies can use AI to create first drafts for their clients and save time before committing to final creations.
A home designer uses DALL-E to create a design in a setting based on a customer’s interests. The client received the initial design but then requested a change in colors, which the designer incorporated into a second prompt.

  • Security: Creating surveillance and prevention systems assisted by image and video recognition to detect faces or threats. Advanced surveillance systems use multimodal AI to analyze video feeds combined with audio data. A real example is Angelcam’s Suspicious Behavior and Activity Detection solution [2], which uses methods such as facial recognition, license plate identification, and the detection of falls and accidents through cameras to identify risks and generate alerts, all powered by artificial intelligence.
  • Translation and Transcription Systems: By accepting inputs such as video, images, documents, and voice, and combining these elements, it is possible to translate content faster, as well as to analyze and summarize its context. This type of application is used in industries like education, for video and document transcription, and media, for content translation. At V2A, we developed an innovative tool to transcribe songs and generate their lyrics.

Exhibit 5:
The V2A song transcription tool receives a WAV audio file input and outputs the text lyrics.
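The flow in Exhibit 5 can be sketched as follows: read the WAV file with Python's standard `wave` module, then hand the raw audio to a speech-recognition backend. The `asr_model` callable here is a hypothetical stand-in; the model behind the actual tool is not shown.

```python
import wave

def transcribe_wav(path, asr_model):
    """Read a WAV file and pass its raw audio to an ASR backend.

    `asr_model` is any callable mapping (audio_bytes, sample_rate) -> text,
    so different recognition models can be plugged in without changing
    the file-handling code."""
    with wave.open(path, "rb") as wav:
        sample_rate = wav.getframerate()
        audio = wav.readframes(wav.getnframes())
    return asr_model(audio, sample_rate)
```

Separating file handling from recognition this way keeps the pipeline modular: the same reader works whether the backend transcribes speech, lyrics, or any other audio.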

 

Furthermore, in the healthcare industry, companies like Nuance (DAX), Abridge, and Suki employ multimodal AI to capture doctor-patient conversations in real time. Tools like these allow for hands-free operation, eliminating the need for doctors to divide their time between taking notes and summarizing patient information.

Real-Time Monitoring and Diagnostics [3]: Implementing image recognition and document intelligence can greatly benefit industries like healthcare, finance, and security.

  • Healthcare: AI is already improving cancer diagnosis. For example, Northwestern Medicine’s Feinberg School of Medicine has developed an AI model that enhances breast cancer detection [4].
  • Finance: Using multimodal tools to analyze large amounts of historical data and economic news and to perform market analyses is an excellent help for financial entities. Large banks use these tools to carry out actions that typically require extensive processing time. For example, JP Morgan developed a multimodal AI tool called IndexGPT [5] to read large amounts of financial data and documents.
  • Security: Multimodality allows risk management companies to handle advanced threat detection systems almost autonomously. An example is an intrusion detection system for networks that can even carry out automated responses to attempted exploits.
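A threshold-based sketch of the intrusion-detection idea, using hypothetical traffic numbers: flag time windows whose request volume deviates sharply from the norm, then trigger an automated response. A production system would use far richer features and models than a single z-score.

```python
from statistics import mean, stdev

def detect_anomalies(request_counts, z_threshold=2.0):
    """Flag time windows whose request volume deviates sharply from the
    norm -- a minimal stand-in for an ML-based intrusion detector."""
    mu = mean(request_counts)
    sigma = stdev(request_counts)
    return [i for i, c in enumerate(request_counts)
            if sigma > 0 and abs(c - mu) / sigma > z_threshold]

def automated_response(window_index):
    """Hypothetical automated mitigation: rate-limit the offending source."""
    return f"rate-limit applied for window {window_index}"

# Requests per minute; the spike at index 5 simulates an attack.
traffic = [102, 98, 101, 97, 103, 950, 99, 100]
for w in detect_anomalies(traffic):
    print(automated_response(w))
```

Chaining detection directly into a response function is what makes the system "almost autonomous": once a window is flagged, mitigation runs without a human in the loop.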

Risks and Challenges in Multimodal AI

To get to where it stands today, this emerging AI industry has had to overcome major challenges in risk management. These risks are mostly a function of the growing complexity and effectiveness of its use cases, which continue to accelerate. In other words, the more advanced a model gets, the more challenges it must overcome. This is especially true for multimodal AI, because it has a significantly larger number of use cases than unimodal models. Leaving aside the risks and challenges of text-only generative AI, the following are new risks and challenges that multimodal AI faces.

  • Large amount of quality data: The most crucial phase in deploying any AI model is the training phase. In this phase, the AI model takes in training data and learns by establishing relationships between the input and output portions. Having learned these “relationships,” the model can create its outputs from just a set of inputs. This is an important detail because the accuracy of a model is related to the quality and quantity of data used to train it. Below are a few examples of how these challenges can manifest:
    • Higher Data Requirements: Multimodal AI systems require large and diverse datasets that cover all the modalities they are designed to handle. Collecting and labeling such large datasets can be expensive and time-consuming.
    • Data Fusion Complexity: Effectively integrating multimodal data is challenging due to varying noise levels, temporal misalignment, and the diverse nature of the data. Each modality may require different preprocessing and feature extraction techniques.
    • Data Availability and Skew: Obtaining sufficient high-quality data across multiple modalities can be difficult. Moreover, certain modalities might be overrepresented, leading to biased models.
  • Risk Management and AI oversight: Like all recent technologies, it is essential to create risk mitigation and oversight plans to ensure proper use. This is especially true when it comes to Artificial Intelligence. AI in some industries poses legal and moral implications that must be dealt with accordingly. Below are some challenges and risks that can arise from improper training, implementation, or oversight.
    • Privacy violations: Multimodal AI can create privacy concerns from two distinct sources. The first and most evident is the use of sensitive training data. There must be adequate regulations and strict oversight so that AI models are not trained with classified information, and models trained using such information must comply with strict rules to ensure proper use and implementation. The second, less intuitive concern is that highly sophisticated AI models can identify patterns and create outputs that can be considered a breach of privacy even if trained with non-sensitive data. Multimodality amplifies this risk because increasing the input modes allows the AI model to identify “hidden” relationships that unimodality would not have uncovered.
      “… information being collected and used may extend beyond what was originally knowingly disclosed by an individual. Part of the promise of predictive technologies is that deductions can be made from other (seemingly unrelated and innocuous) pieces of data.”
      One example can come from an AI system created to analyze college applications more efficiently. A sufficiently sophisticated AI assistant could infer a candidate’s political or ideological tendencies from the information supplied. Whether or not it affects the admissions process, this represents a breach of privacy, as that specific information was not intentionally disclosed [6].
    • Deepfakes: A popular yet extremely unsettling risk of adopting multimodal AI is deepfakes and fully AI-generated images and video. Deepfakes are ultra-realistic synthetic media generated by artificial intelligence, used primarily to manipulate original content. This content typically involves real people, modifying or creating material to make it appear as though they said or did something they did not. The risk is that ill-intentioned people can create deepfake videos to influence public opinion or to scam people into sending money [7]. For example, see the deepfake video “Episode 13: How The Prime Minister Stole Freedom, presented by Justin Trudeau” [8], in which the Prime Minister of Canada is depicted reading a book about how he stole freedom.

Although not exhaustive, this list of risks and challenges shows that technological advancement is not a smooth route to success. In fact, some of these risks will only grow more difficult as AI advances. It is important to remember that appropriate oversight and regulation can help maximize the benefits of artificial intelligence while minimizing its risks.
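The data-fusion challenge described in this section can be illustrated with a minimal late-fusion sketch: each modality's features are normalized separately before concatenation, so that a modality with larger raw magnitudes does not dominate the fused representation. The feature values below are invented for illustration.

```python
import math

def l2_normalize(vec):
    """Scale a feature vector to unit length so modalities with larger
    raw magnitudes do not dominate the fused representation."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

def late_fusion(text_features, image_features):
    """Late fusion: preprocess each modality separately, then concatenate
    into one combined feature vector."""
    return l2_normalize(text_features) + l2_normalize(image_features)

# Hypothetical features on very different scales (a common source of
# fusion trouble): the raw image features here are far larger than the
# text features, yet both contribute equally after normalization.
fused = late_fusion([0.2, 0.1], [150.0, 80.0])
print(fused)
```

This per-modality preprocessing step is one concrete reason multimodal training pipelines are more complex than unimodal ones: every input type needs its own cleaning, alignment, and scaling before the modalities can be combined.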

Next steps for multimodal AI

Although generative multimodal AI already has extensive use within some companies, it remains a costly and customized technology. As multimodal AI continues to develop and organizations rely more on its benefits, particularly in their daily activities, we expect it to focus on the following areas and use cases:

  • Autonomous healthcare systems: Although multimodal AI is already used to improve processes, the trend is to develop an autonomous healthcare system that, through images, analysis, and clinical notes, offers an accurate and rapid diagnosis of a patient’s symptoms.
  • User interaction: Multimodal AI will help systems understand and handle natural language in a friendlier and more accurate way. One implementation is for AI to recognize and create visual scenarios through voice commands.
  • Robotics: In the manufacturing world, the goal is for machines to become increasingly automated. AI will enable these machines to perform their tasks better by combining multiple modalities and integrating them into an autonomous system with greater precision and efficiency. Additionally, with the proliferation of sensors and multimodal AI, it will be possible to modernize component-failure prevention methods, thereby avoiding damage to production machines.
  • Intelligent assistance: Virtual assistants can offer greater help by combining all modalities through cameras and sensors, allowing them to understand less conventional forms of communication such as sign language, drawings, and videos.

Conclusion

As the next significant advancement in artificial intelligence tools takes shape, it is important that firms begin evaluating how they can leverage artificial intelligence to streamline processes, increase operational efficiency, and gain a competitive advantage. From simple machine learning algorithms to multimodal applications, V2A can assist in integrating AI tools into business workstreams and help propel our clients toward the next steps of their digital transformation.

Learn more about our AI services at https://v2aconsulting.com/artificial-intelligence/ or email us for a personal consultation at info@v2aconsulting.com.


References:

  1. Luna, J. C. (2024, February 22). What is Multimodal AI? DataCamp. https://www.datacamp.com/blog/what-is-multimodal-ai
  2. Suspicious Behavior and Activity Detection | AI Solution from Angelcam. (n.d.). https://www.angelcam.com/ai-solution/suspicious-behaviour-and-activity-detection
  3. (2024, May 6). Multimodal AI: Working, Benefits & Use Cases. Apptunix Blog. https://www.apptunix.com/blog/multimodal-ai-working-benefits-use-cases/
  4. (2020, January 2). AI model improves breast cancer detection. News Center. https://news.feinberg.northwestern.edu/2020/01/02/ai-model-improves-breast-cancer-detection/
  5. Social, M. (2024, January 8). DocLLM: JPMorgan’s new AI for visually rich multimodal document intelligence. https://medium.com/aimonks/docllm-jpmorgans-new-ai-for-visually-rich-multimodal-document-intelligence-f981b8baebc2
  6. Office of the Victorian Information Commissioner (2018). Artificial Intelligence and Privacy – Issues and Challenges. [online] Office of the Victorian Information Commissioner. Available at: https://ovic.vic.gov.au/privacy/resources-for-organisations/artificial-intelligence-and-privacy-issues-and-challenges/.
  7. Chen, H., & Magramo, K. (2024, February 4). Finance worker pays out $25 million after video call with deepfake ‘chief financial officer.’ CNN World. https://edition.cnn.com/2024/02/04/asia/deepfake-cfo-scam-hong-kong-intl-hnk/index.html
  8. (2022, September 7). Episode 13 : How The Prime Minister Stole Freedom, presented by Justin Trudeau [Video]. YouTube. https://www.youtube.com/watch?v=xVVWK4oK93I
