Multimodal GEO: making images and video citable

Multimodal AI engines reach images, video, and audio through their text scaffolding: alt text, captions, VideoObject schema, subtitle tracks, and transcripts. A practical guide.

Juan Camilo Auriti · June 11, 2026 · Updated July 17, 2026

The engines are multimodal. Retrieval is still textual.

A multimodal model can describe an image when you paste one into the chat. But when an answer engine retrieves sources for a query, it works from indexed signals — and for media, those signals are overwhelmingly text. An instructional video with no transcript is a black box at retrieval time; the same video with captions, a transcript, and VideoObject schema is quotable, attributable content. AI engines index the transcript, not the pixels.

Figure: Multimodal content becomes retrievable when text signals connect the media to a citable answer.

Images: alt discipline, captions, image sitemap

Descriptive alt text on every meaningful image. Describe what the image shows and why it matters: "GeoReady audit dashboard with GEO score and crawler access matrix", not "dashboard.png" or "image". Filename-like and one-word alts are noise to engines and screen readers alike.
Use <figure> + <figcaption> for charts, screenshots, and diagrams. A caption is quotable text bound to the visual — exactly the unit an answer engine can lift and attribute.
Ship an image sitemap. List your indexable visuals with title and caption so discovery doesn't depend on crawl luck. Serve modern formats (WebP with srcset) with explicit width/height — performance and layout stability are part of the crawl budget story.
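One way to enforce the alt rule is a small linter run over rendered HTML. The sketch below is an illustration only, not part of any tool named in this article: the word threshold, the list of generic alts, and the function name lint_alt_text are assumptions picked for the example.

```python
import re
from html.parser import HTMLParser

# Hypothetical thresholds and word lists, chosen only for illustration.
MIN_ALT_WORDS = 4
FILENAME_RE = re.compile(r"^[\w\-. ]+\.(png|jpe?g|gif|webp|svg|avif)$", re.IGNORECASE)
GENERIC_ALTS = {"image", "photo", "picture", "graphic", "screenshot"}


class AltTextLinter(HTMLParser):
    """Collects (src, problem) pairs for <img> tags with weak alt text."""

    def __init__(self):
        super().__init__()
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        src = attr.get("src", "")
        alt = attr.get("alt")
        if alt is None:
            self.problems.append((src, "missing alt"))
            return
        alt = alt.strip()
        if alt == "":
            # An empty alt is the accepted way to mark a decorative image.
            return
        if FILENAME_RE.match(alt):
            self.problems.append((src, "filename-like alt"))
        elif alt.lower() in GENERIC_ALTS:
            self.problems.append((src, "generic alt"))
        elif len(alt.split()) < MIN_ALT_WORDS:
            self.problems.append((src, "alt too short"))


def lint_alt_text(html):
    """Return a list of (src, problem) tuples for the given HTML string."""
    linter = AltTextLinter()
    linter.feed(html)
    linter.close()
    return linter.problems


if __name__ == "__main__":
    sample = '<img src="a.png" alt="dashboard.png"><img src="b.png">'
    for src, problem in lint_alt_text(sample):
        print(f"{src}: {problem}")
```

A linter like this only catches the mechanical failures. Whether an alt explains why the image matters still needs a human read.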

Figure: A citable image is a package: visual content, descriptive alt text, a visible caption, and an image sitemap entry.

Video: VideoObject, captions, transcript

VideoObject JSON-LD with name, description, thumbnailUrl, and uploadDate — the minimum for engines to treat the video as a first-class content item rather than an opaque embed. This applies to YouTube/Vimeo embeds too: the schema lives on your page.
Closed captions via <track kind="captions"> for native <video> elements. Captions are time-aligned text — retrievable, quotable, accessible.
A text transcript on the page (or one click away, clearly labeled "transcript"). This is the single highest-leverage item: it converts N minutes of video into citable prose, and it is what answer engines actually quote.
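A minimal sketch of generating the VideoObject block from Python, assuming your templates can inject a script tag into the page. The function names and the example.com URLs are hypothetical; only the property names (name, description, thumbnailUrl, uploadDate, transcript) come from the schema.org vocabulary.

```python
import json

REQUIRED_FIELDS = ("name", "description", "thumbnailUrl", "uploadDate")


def video_object_jsonld(name, description, thumbnail_url, upload_date, transcript=None):
    """Build a VideoObject dict with the four core fields named above."""
    data = {
        "@context": "https://schema.org",
        "@type": "VideoObject",
        "name": name,
        "description": description,
        "thumbnailUrl": thumbnail_url,
        "uploadDate": upload_date,  # ISO 8601 date or datetime string
    }
    missing = [field for field in REQUIRED_FIELDS if not data[field]]
    if missing:
        raise ValueError(f"missing core fields: {missing}")
    if transcript:
        # Optional: schema.org also defines a text `transcript` property.
        data["transcript"] = transcript
    return data


def as_script_tag(data):
    """Serialize the dict as a JSON-LD script tag for the host page."""
    payload = json.dumps(data, indent=2, ensure_ascii=False)
    # Escape "</" so a stray "</script>" in a text field cannot close the tag early.
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


if __name__ == "__main__":
    video = video_object_jsonld(
        name="Running a GEO audit",  # hypothetical example values
        description="A walkthrough of auditing a page for AI citability.",
        thumbnail_url="https://example.com/thumbs/geo-audit.jpg",
        upload_date="2026-06-11",
    )
    print(as_script_tag(video))
```

The same tag works whether the player is a native video element or a third-party embed, since the markup sits on your page rather than on the host's.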

Figure: Captions and a page-level transcript turn an opaque video into text an answer engine can retrieve and cite.

Audio and podcasts

Same principle, smaller surface: mark episodes up with AudioObject or PodcastEpisode schema and publish a transcript per episode. Podcast SEO and podcast GEO are the same work — the transcript page becomes the citable artifact, and it can anchor a topic cluster of its own.
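For episodes, the markup might be assembled along these lines. This is a sketch under assumptions: the function name, argument names, and example values are hypothetical, and nesting the audio file as associatedMedia is one reasonable shape among several that schema.org allows.

```python
import json


def podcast_episode_jsonld(title, episode_url, audio_url, published,
                           series_name, transcript_text=None):
    """Build a PodcastEpisode dict with its audio file as an AudioObject."""
    audio = {"@type": "AudioObject", "contentUrl": audio_url}
    if transcript_text:
        # schema.org defines `transcript` as a text property on AudioObject.
        audio["transcript"] = transcript_text
    return {
        "@context": "https://schema.org",
        "@type": "PodcastEpisode",
        "name": title,
        "url": episode_url,
        "datePublished": published,  # ISO 8601 date string
        "partOfSeries": {"@type": "PodcastSeries", "name": series_name},
        "associatedMedia": audio,
    }


if __name__ == "__main__":
    episode = podcast_episode_jsonld(
        title="Episode 12: Transcripts as content",  # hypothetical values
        episode_url="https://example.com/podcast/12",
        audio_url="https://example.com/audio/12.mp3",
        published="2026-06-11",
        series_name="Example Podcast",
        transcript_text="Welcome to the show...",
    )
    print(json.dumps(episode, indent=2))
```

The structured transcript does not replace the visible one. The transcript page remains the artifact a reader, or an engine, actually lands on.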

Audit it: the multimodal readiness check

The GEO Optimizer audit includes a multimodal readiness check that measures exactly this scaffolding (alt coverage, captions, VideoObject/AudioObject schema, subtitle tracks, and transcripts) and grades only the media actually present on the page.

It's informational — it doesn't move the 100-point GEO score — but its recommendations go straight into your fix list. Run it from the free audit or with geo audit --url in the open-source CLI.

The 10-minute multimodal checklist

Every meaningful image has a descriptive alt (at least a short sentence).
Charts and screenshots are wrapped in figure/figcaption.
Pages with video carry VideoObject schema with the four core fields (name, description, thumbnailUrl, uploadDate).
Native videos have a captions track; every video page links a transcript.
Podcast episodes have AudioObject/PodcastEpisode schema + transcript.
An image sitemap lists your indexable visuals with titles and captions.
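Most of this checklist can be spot-checked from a page's HTML. The sketch below is a rough stand-in for illustration, not the GEO Optimizer check itself: the class and function names are hypothetical, it only sees native video elements (not iframe embeds), it ignores JSON-LD nested under @graph, and it treats any link whose text contains "transcript" as a transcript link. Like the audit described above, it passes a check vacuously when the relevant media is absent.

```python
import json
from html.parser import HTMLParser

CORE_VIDEO_FIELDS = {"name", "description", "thumbnailUrl", "uploadDate"}


class MediaScan(HTMLParser):
    """Counts the text scaffolding around images and native videos."""

    def __init__(self):
        super().__init__()
        self.images = self.images_with_alt = 0
        self.figures = self.figcaptions = 0
        self.videos = self.captioned_videos = 0
        self.transcript_link = False
        self.jsonld_blocks = []
        self._in_jsonld = self._in_link = False
        self._in_video = self._video_captioned = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "img":
            self.images += 1
            # Presence only; alt quality needs a separate check.
            self.images_with_alt += attr.get("alt") is not None
        elif tag == "figure":
            self.figures += 1
        elif tag == "figcaption":
            self.figcaptions += 1
        elif tag == "video":
            self.videos += 1
            self._in_video, self._video_captioned = True, False
        elif tag == "track" and attr.get("kind") == "captions":
            if self._in_video and not self._video_captioned:
                self.captioned_videos += 1
                self._video_captioned = True
        elif tag == "a":
            self._in_link = True
        elif tag == "script" and attr.get("type") == "application/ld+json":
            self._in_jsonld = True
            self.jsonld_blocks.append("")

    def handle_endtag(self, tag):
        if tag == "video":
            self._in_video = False
        elif tag == "a":
            self._in_link = False
        elif tag == "script":
            self._in_jsonld = False

    def handle_data(self, data):
        if self._in_jsonld:
            self.jsonld_blocks[-1] += data
        elif self._in_link and "transcript" in data.lower():
            self.transcript_link = True


def multimodal_checklist(html):
    """Return a dict of pass/fail flags for the HTML-checkable items."""
    scan = MediaScan()
    scan.feed(html)
    scan.close()
    video_schema_ok = False
    for raw in scan.jsonld_blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue
        for item in data if isinstance(data, list) else [data]:
            if isinstance(item, dict) and item.get("@type") == "VideoObject":
                video_schema_ok |= CORE_VIDEO_FIELDS.issubset(item)
    has_video = scan.videos > 0
    return {
        "images_have_alt": scan.images_with_alt == scan.images,
        "figures_have_captions": scan.figcaptions >= scan.figures,
        "video_schema": video_schema_ok or not has_video,
        "video_captions": scan.captioned_videos == scan.videos,
        "transcript_link": scan.transcript_link or not has_video,
    }
```

Calling multimodal_checklist(page_html) on a fetched page returns one boolean per item, which is enough to turn the list above into a failing build step if you want one. The image sitemap item is not covered here, since it lives outside the page.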