All Articles
    Technical
    Published June 15, 202611 min read

    Multimodal AI Models: How AI Search Reads Images, Voice and Text

    Multimodal AI models read text, images, audio and video in one pass, and that now powers visual and voice search. Here is how it works, which models lead in 2026, and what brands should do about image quality, alt text, video and structured data.

    Matiss Katanenko

    Matiss Katanenko

    Co-founder, Honeyb

    Google Lens now handles nearly 20 billion visual searches every month, and Google calls Lens queries one of the fastest growing query types on Search (Google). That single figure tells you the input has changed. For most of the search era the input was a line of typed text and the output was a list of blue links. The models behind modern AI search no longer read only words. They look at photographs, listen to spoken questions, parse the layout of a screenshot, and watch video frames. This shift is called multimodality, and it changes both how people search and how brands need to present themselves to be found.

    This guide explains what multimodal AI models are in plain terms, how they power the visual and voice features inside AI search, which models lead in mid 2026, and what the change means for the images, audio, video and structured data on your site.

    What multimodal actually means

    A modality is simply a type of input or output. Text is one modality. An image is another. Audio and video are two more. A model is multimodal when it can take more than one of these as input, reason across them together, and often produce more than one as output.

    The important word is *together*. An older system might have run an image through one tool to generate a caption, then fed that caption as text into a separate language model. The model never saw the image. A natively multimodal model is different: it is trained from the start to represent text, pixels, audio waveforms and video frames in a shared internal space, so it can reason about the relationship between them in a single pass. Google says its Gemini models are built to be multimodal from the ground up, which lets them handle image captioning, classification and visual question answering without training a separate specialised model.

    In practice that lets a model answer questions no single modality could resolve alone. Show it a photo of a bookshelf and ask which titles are by the same author, and it has to recognise the books, read the spines, and connect that to what it knows about those authors. That is three modalities working as one. The same shared representation underpins newer multimodal embedding models, which map text, images, video, audio and PDFs into a single space so they can be searched together.

    The four modalities, briefly

    Most current models work across four input types, though support and quality vary by engine.

    • Text is the baseline: typed questions, documents, code and structured data.
    • Image covers photos, screenshots, charts, diagrams, product shots and scanned pages. This is the most widely supported non-text modality.
    • Audio covers spoken questions and, increasingly, real-time voice conversation rather than a single recorded clip.
    • Video covers uploaded or streamed footage, analysed frame by frame with the accompanying audio.

    Output modalities matter too. Some models can speak back, generate images, or produce both text and audio in one response. For brand visibility the input side matters most, because that is how an engine takes in the question and the evidence it uses to answer.

    How multimodality powers AI search

    Multimodality is not a lab curiosity. It is already the mechanism behind three everyday behaviours in AI search.

    Visual search. Google has connected Lens with Gemini inside AI Mode so a user can, in Google's words, "snap a photo or upload an image, ask a question about it and get a rich, comprehensive response with links to dive deeper". The model understands "the entire scene in an image, including the context of how objects relate to one another and their unique materials, colors, shapes and arrangements", rather than matching a single object (Google blog). The scale is no longer marginal: Lens drives those 20 billion monthly visual searches, and adoption skews young, with Google noting users aged 18 to 24 engage most with it.

    Screenshot and document understanding. People paste screenshots of error messages, spreadsheets, menus and pricing tables and ask the model to explain or act on them. Anthropic's documentation notes that multiple images, up to 100 per API request on models with a 200k-token context window, can be analysed jointly when Claude forms a response, which is well suited to comparing tables and charts side by side (Anthropic vision docs). This is why the legibility of the charts and tables you publish now affects whether a model can quote them correctly.

    Voice queries. Spoken search has moved from a single recorded clip to live conversation. The newest realtime models process and generate audio directly through one model rather than chaining speech-to-text, a language model and text-to-speech, which cuts latency and preserves tone. OpenAI's gpt-realtime-2, released in May 2026, brought GPT-5-class reasoning to live spoken conversation, so the model can pause to think before it answers. Spoken questions tend to be longer and more conversational than typed ones, which changes the phrasing your content should answer.

    Google Gemini, the multimodal model behind AI Mode visual search
    Gemini powers the visual understanding inside Google AI Mode.

    The query fan-out that sits underneath

    One detail is worth understanding because it shapes how much of your content gets seen. When a user submits an image or a complex question, Google's AI Mode does not run a single lookup. It uses a technique Google calls query fan-out, issuing "multiple queries about the image as a whole and the objects within the image, accessing more breadth and depth of information than a traditional search" (Google blog).

    The practical consequence is that a single visual query can touch many more sources than a traditional search. Breadth of coverage across the web matters more, and a brand described consistently across many pages is more likely to surface than one mentioned once. Our overview of how AI search works goes deeper into the retrieval and synthesis steps behind this, and our primer on generative engine optimization explains how to earn that breadth of coverage.

    Want to see this in action?

    Check how AI models talk about your brand — free, instant, no signup required.

    Free AI Check

    The leading multimodal models in 2026

    The major engines all support text and image input. They diverge most on audio and video, and on resolution and context length. Capabilities change frequently, so treat the table below as a mid-2026 snapshot and verify against each provider's own documentation before you rely on a specific figure.

    EngineTextImageAudioVideoNotes
    Google GeminiYesYesYesYesNative multimodal across all four inputs on the Gemini 3 family; powers AI Mode visual search and a Gemini Live API for real-time voice and vision
    OpenAI GPTYesYesYesPartialImage input across current GPT-5 models; a separate gpt-realtime line for speech-to-speech and streaming audio
    Anthropic ClaudeYesYesNoNoStrong image and document reasoning, up to 100 images per API request; no live audio or video input at present
    PerplexityYesYesVariesVariesSearch-focused; multimodal features depend on the underlying model selected for a given query
    Microsoft CopilotYesYesYesPartialVision and voice in consumer apps, built on partner foundation models

    A few points are worth drawing out. Google's Gemini is the broadest on raw input modalities and the engine most tightly integrated with visual search; its Gemini Live API is generally available on Vertex AI and streams audio and video in and out for real-time agents. OpenAI has invested heavily in live voice, with a dedicated gpt-realtime line built for speech-to-speech. Anthropic has focused its multimodal effort on still images and documents rather than live audio or video; its recent models, starting with Opus 4.7, raised the maximum image resolution to 2,576 pixels on the long edge, up from 1,568 pixels on prior models, which helps with screenshot and document analysis (Anthropic vision docs). For a side-by-side of how the engines actually rank brands, see our Perplexity vs ChatGPT brand ranking comparison.

    Google AI Mode brings visual search into the AI answer experience
    AI Mode combines Lens visual identification with Gemini's reasoning.

    Native versus bolted-on multimodality

    Not all multimodal support is equal. The distinction that matters is whether vision and audio were part of the original training, or added afterwards through a separate pipeline. Models built natively multimodal tend to reason more reliably across modalities because they were never asked to translate an image into a text caption before thinking about it. The same logic applies to voice: a single model that handles audio end to end, like OpenAI's realtime line or Gemini's Live API, holds onto tone and timing that a transcribe-then-read pipeline throws away. When you read provider documentation, look for language about a model being multimodal "from the ground up" or processing inputs in a "unified" representation. That phrasing is a signal of how deeply the modalities are integrated, not a marketing flourish.

    What multimodality means for brands

    If models can now see and hear, then the parts of your site you used to treat as decoration have become readable content. Five areas deserve attention.

    Image quality and clarity. A model that interprets a scene can only work with what is legible. Blurry product shots, low-contrast charts and screenshots with tiny text reduce what an engine can extract. Publish images at a resolution and clarity that survives downscaling, and make sure any text inside an image, such as a pricing table rendered as a graphic, is also available as real text on the page.

    Alt text and surrounding context. Alt text remains the most direct way to tell an engine what an image shows. Write it to describe what is actually in the frame, specifically and accurately, rather than stuffing keywords. The caption, heading and paragraph around an image give the model context for how it relates to the rest of the page, so treat the whole neighbourhood as part of the description.

    Video and audio content. Multimodal models can watch and listen, but they index faster and more accurately when you hand them text to anchor on. Provide transcripts for video and audio, use descriptive titles and chapter markers, and keep the spoken content consistent with the written page. Inconsistency between what your video says and what your text says weakens the signal an engine uses to judge authority.

    Structured data. Schema markup tells an engine explicitly what a media asset is, rather than leaving it to infer. Google's documentation supports `ImageObject` for image metadata and `VideoObject` to influence the description, thumbnail, upload date and duration shown in results, and uses JSON-LD as its primary example format (Google Search Central). Connect those media objects to the parent `Article` or `Product` schema so the relationships are explicit. Our guide to schema markup for AI visibility covers the implementation in detail.

    Consistency across formats. Because a visual query can fan out across many sources, the same product or claim described the same way in your text, your images, your alt text and your video transcript builds a coherent picture. Contradictions force a model to choose, and you may not like the choice it makes. The same principle that governs how AI models choose which brands to recommend applies across modalities, not just text.

    How to check what engines see

    You cannot manage what you cannot observe. Spot-checking by asking one engine one question on one day tells you little, because answers vary by model, phrasing and time, a problem we cover in why spot-checking fails. The practical approach is to monitor how engines describe your brand across modalities and over time, then trace the descriptions back to the sources feeding them. Honeyb runs scheduled scans across the major answer engines and benchmarks your brand against competitors, and our free AI visibility checker gives a first read on how engines currently describe you. If multimodality is new to you, the what is AI search primer is a good starting point before you optimise.

    The direction of travel

    The trend is clear even where specific figures are not. Input is becoming less about typing and more about showing and speaking. Twenty billion monthly Lens searches, real-time voice models with frontier reasoning, and document understanding are no longer edge features. They are how a growing share of questions get asked, and the share skews towards younger users who will define the next decade of demand. Brands that treat images, video, audio and structured data as first-class content, described consistently and marked up clearly, will be the ones these models can read, understand and recommend. The ones that leave that content unlabelled and inconsistent will be harder for a multimodal model to interpret, and harder for it to cite.

    Frequently asked questions

    What is a multimodal AI model in simple terms?

    It is an AI model that can take in more than one type of content, such as text, images, audio and video, and reason across them together rather than handling each separately. A natively multimodal model is trained on these inputs from the start, so it can connect what it sees, hears and reads in a single response.

    How does multimodality affect AI search?

    It powers three common behaviours: visual search, where a user uploads or photographs an image and asks about it; screenshot and document understanding, where people paste images of tables, charts or errors; and voice queries, where spoken questions are answered in real time. Google's AI Mode, for example, combines Lens with Gemini to understand a whole scene and then issues multiple queries about it. Lens alone now handles nearly 20 billion visual searches a month.

    Which AI models are the leading multimodal ones in 2026?

    Google's Gemini 3 family is the broadest on raw inputs, supporting text, image, audio and video natively, with a Gemini Live API for real-time voice and vision. OpenAI's GPT-5 models support text and image input, with a separate gpt-realtime line for speech. Anthropic's Claude focuses on still images and documents with strong visual reasoning but does not currently take live audio or video. Capabilities change often, so verify against each provider's documentation.

    Does alt text still matter for multimodal AI search?

    Yes. Even though models can interpret pixels directly, alt text remains the most direct way to tell an engine what an image shows. Write it to describe what is actually in the frame, accurately and specifically, and keep the surrounding caption and text consistent so the model has clear context.

    What structured data should I add for images and video?

    Use ImageObject for image metadata and VideoObject for video, which can influence the description, thumbnail, upload date and duration an engine displays. Google uses JSON-LD as its primary example format, and you should connect these media objects to the parent Article or Product schema so the relationships are explicit.

    How can I tell what multimodal engines see about my brand?

    Spot-checking a single engine on a single day is unreliable because answers vary by model, phrasing and time. Use scheduled monitoring that scans the major answer engines, benchmarks you against competitors, and traces descriptions back to their sources. Honeyb's free AI visibility checker gives a quick first read.

    Matiss Katanenko

    About the author

    Matiss Katanenko

    Co-founder, Honeyb

    My name is Matiss Katanenko and I co-founded Honeyb, the AI visibility platform that tracks how ChatGPT, Gemini, Claude, Perplexity and the other major AI engines talk about brands. I'm based in Riga, Latvia. Before Honeyb I spent years on the agency side running SEO and content programs for fast-growing brands across the US and Europe. That work is where I watched AI search start to compress the entire discovery channel into a four-brand short list, and decided to build the tool I wished agencies had. In my free time I'm in the sauna, on a padel court, or behind a drum kit.

    Connect on LinkedIn
    Honeyb

    Free, instant, no signup

    See your brand through every major AI model.

    Run a free check in 30 seconds. The picture is usually different than you'd expect.

    ChatGPTChatGPT
    ClaudeClaude
    GeminiGemini
    PerplexityPerplexity