Multimodal GEO: making images and video citable
The engines driving AI search — Gemini, GPT-4o, Llama 4 — process text, images, video, and audio together. But in retrieval they still reach your non-text content through its text scaffolding: alt text, captions, structured data, subtitle tracks, and transcripts. If that scaffolding is missing, your best media is invisible. This guide covers what to add, per medium.
Published June 2026 · 8 min read
The engines are multimodal. Retrieval is still textual.
A multimodal model can describe an image when you paste one into the chat. But when an answer engine retrieves sources for a query, it works from indexed signals — and for media, those signals are overwhelmingly text. An instructional video with no transcript is a black box at retrieval time; the same video with captions, a transcript, and VideoObject schema is quotable, attributable content. AI engines index the transcript, not the pixels.
Images: alt discipline, captions, image sitemap
- Descriptive alt text on every meaningful image. Describe what the image shows and why it matters: "GeoReady audit dashboard with GEO score and crawler access matrix", not "dashboard.png" or "image". Filename-like or one-word alts are noise to engines and screen readers alike.
- Use <figure> + <figcaption> for charts, screenshots, and diagrams. A caption is quotable text bound to the visual, exactly the unit an answer engine can lift and attribute.
- Ship an image sitemap. List your indexable visuals with title and caption so discovery doesn't depend on crawl luck. Serve modern formats (WebP with srcset) with explicit width/height: performance and layout stability are part of the crawl budget story.
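As a minimal sketch of the image items above, the markup might look like the following. The file paths, pixel dimensions, and caption wording are placeholders invented for illustration; only the alt text is taken from the example in the first bullet.

```html
<!-- Hypothetical paths, sizes, and caption: adjust to your own assets. -->
<figure>
  <img
    src="/img/audit-dashboard-800.webp"
    srcset="/img/audit-dashboard-800.webp 800w,
            /img/audit-dashboard-1600.webp 1600w"
    sizes="(max-width: 800px) 100vw, 800px"
    width="800"
    height="450"
    alt="GeoReady audit dashboard with GEO score and crawler access matrix">
  <!-- The caption is plain text bound to the image, so it can be quoted on its own. -->
  <figcaption>
    The audit dashboard shows the overall GEO score next to a matrix of
    which AI crawlers can access the site.
  </figcaption>
</figure>
```

The explicit width and height let the browser reserve space before the image loads, and srcset lets it pick the smaller file on narrow screens.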
Video: VideoObject, captions, transcript
- VideoObject JSON-LD with name, description, thumbnailUrl, and uploadDate, the minimum for engines to treat the video as a first-class content item rather than an opaque embed. This applies to YouTube/Vimeo embeds too: the schema lives on your page.
- Closed captions via <track kind="captions"> for native <video> elements. Captions are time-aligned text: retrievable, quotable, accessible.
- A text transcript on the page (or one click away, clearly labeled "transcript"). This is the single highest-leverage item: it converts N minutes of video into citable prose, and it is what answer engines actually quote.
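The three video items above could be combined on one page roughly as follows. This is a sketch under stated assumptions: every URL, title, date, and file name is a placeholder on example.com, and only the four schema fields named in the first bullet are shown.

```html
<!-- Placeholder values throughout: replace names, dates, and URLs with your own. -->
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "VideoObject",
  "name": "How to run a GEO audit",
  "description": "A walkthrough of running an audit and reading the results.",
  "thumbnailUrl": "https://example.com/media/geo-audit-thumb.jpg",
  "uploadDate": "2026-06-01T08:00:00+00:00"
}
</script>

<!-- Native video with a time-aligned captions track (WebVTT file). -->
<video controls width="1280" height="720"
       poster="https://example.com/media/geo-audit-thumb.jpg">
  <source src="https://example.com/media/geo-audit.mp4" type="video/mp4">
  <track kind="captions" src="/media/geo-audit.en.vtt"
         srclang="en" label="English" default>
</video>

<!-- Transcript one click away, clearly labeled. -->
<p><a href="/videos/geo-audit/transcript">Read the transcript</a></p>
```

For a YouTube or Vimeo embed, the script block stays the same and the video element is replaced by the provider's iframe; captions are then managed on the provider's side, so the on-page transcript matters even more.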
Audio and podcasts
Same principle, smaller surface: mark episodes up with AudioObject or PodcastEpisode schema and publish a transcript per episode. Podcast SEO and podcast GEO are the same work: the transcript page becomes the citable artifact, and it can anchor a topic cluster of its own.
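A minimal sketch of episode markup, assuming a hypothetical show hosted on example.com: the episode is a PodcastEpisode, its audio file is attached as an AudioObject, and the transcript lives on a linked page. All names, dates, and URLs below are illustrative.

```html
<!-- Placeholder show, episode, and URLs: not taken from a real feed. -->
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "PodcastEpisode",
  "name": "Episode 12: Making video citable",
  "description": "Why transcripts and captions decide whether media gets quoted.",
  "datePublished": "2026-06-01",
  "url": "https://example.com/podcast/episode-12",
  "associatedMedia": {
    "@type": "AudioObject",
    "contentUrl": "https://example.com/podcast/episode-12.mp3",
    "encodingFormat": "audio/mpeg"
  },
  "partOfSeries": {
    "@type": "PodcastSeries",
    "name": "Example GEO Podcast",
    "url": "https://example.com/podcast"
  }
}
</script>

<!-- The transcript page is the text an answer engine can quote. -->
<p><a href="/podcast/episode-12/transcript">Episode transcript</a></p>
```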
Audit it: the multimodal readiness check
The GEO Optimizer audit includes a multimodal readiness check that measures exactly this scaffolding — alt coverage, captions, VideoObject/AudioObject schema, subtitle tracks, transcripts — and grades only the media actually present on the page:
19. MULTIMODAL READINESS
    Images: 8/16 with descriptive alt (50%) | captions: 3
    Audio: present | schema: ❌ | transcript: ❌
    Readiness: basic
It's informational: it doesn't move the 100-point GEO score, but its recommendations go straight into your fix list. Run it from the free audit or with geo audit --url in the open-source CLI.
The 10-minute multimodal checklist
- Every meaningful image has a descriptive alt (at least a short sentence).
- Charts and screenshots are wrapped in figure/figcaption.
- Pages with video carry VideoObject schema with the four core fields (name, description, thumbnailUrl, uploadDate).
- Native videos have a captions track; every video page links a transcript.
- Podcast episodes have AudioObject/PodcastEpisode schema + transcript.
- An image sitemap lists your indexable visuals with titles and captions.
Further reading
- Generative Engine Optimization: the practical guide — the pillar this checklist belongs to.
- How to check if AI engines cite your brand — measure whether the work pays off.
- AI visibility checklist — all signals in one place.