Carlos
  • Updated: January 30, 2026
  • 6 min read

Voyage AI Unveils Multimodal 3.5 with Video Support and Advanced Retrieval

Voyage Multimodal 3.5 is the newest multimodal embedding model from Voyage AI. It adds native video support and Matryoshka-based flexible embeddings, and delivers up to 4.65% higher retrieval accuracy than leading competitors.


[Image: Voyage Multimodal 3.5 launch illustration]

Voyage Multimodal 3.5 Launch: A New Frontier in Multimodal Retrieval

On January 15, 2026, Voyage AI announced the general availability of Voyage Multimodal 3.5, a next-generation multimodal embedding model that unifies text, images, PDFs, and, for the first time, video frames in a single vector space. The model builds on the success of Voyage Multimodal 3, extending its capabilities with video-level understanding, Matryoshka embeddings for adjustable dimensionality, and a pricing structure that scales with token usage. This release positions Voyage AI at the forefront of the rapidly evolving multimodal AI market, offering developers, researchers, and enterprises a powerful tool for building search, recommendation, and analytics pipelines that span multiple data modalities.

Feature Highlights

Native Video Support

Voyage Multimodal 3.5 treats a video as an ordered sequence of image frames. Each frame is tokenized at a rate of one token per 1,120 pixels, and a single request can carry up to 32K tokens. This design enables:

  • Text‑to‑video retrieval with cosine similarity scoring.
  • Scene‑level indexing by splitting long videos into logical segments.
  • Alignment of video segments with transcript timestamps for richer semantic matching.

Best‑practice guidance recommends reducing frame resolution or FPS when a video exceeds the token limit, ensuring that each segment remains within the model’s context window while preserving semantic fidelity.
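To make that budget concrete, here is a minimal back-of-the-envelope sketch. It assumes only the figures quoted above (one token per 1,120 pixels, 32K tokens per request); the helper functions are illustrative and are not part of the voyageai client.

# Illustrative token-budget check for video input, using the published
# rate of ~1 token per 1,120 pixels and the 32K-token request limit.
# These helpers are hypothetical, not part of the voyageai client.

PIXELS_PER_TOKEN = 1120
MAX_TOKENS = 32_000

def frame_tokens(width: int, height: int) -> float:
    """Approximate tokens consumed by a single frame."""
    return (width * height) / PIXELS_PER_TOKEN

def max_frames(width: int, height: int) -> int:
    """How many frames of this resolution fit in one request."""
    return int(MAX_TOKENS // frame_tokens(width, height))

# A 1280x720 frame costs ~823 tokens, so roughly 38 frames fit per
# request. At 1 FPS that is ~38 seconds of video; longer clips need
# lower FPS, lower resolution, or scene-level segmentation.
print(max_frames(1280, 720))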

Unified Multimodal Retrieval

Unlike CLIP‑style architectures that maintain separate towers for text and images, Voyage Multimodal 3.5 uses a single transformer encoder. This unified backbone eliminates the “modality gap,” allowing queries that mix text and visual cues to retrieve the most semantically relevant results regardless of the underlying data type. Whether the corpus contains screenshots, annotated PDFs, tables, or video frames, the model embeds everything into a shared vector space.
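As a sketch of what this looks like in practice, the snippet below embeds a small mixed corpus and a text query into the shared space and ranks by cosine similarity. It reuses the multimodal_embed call shown later in this article; the file name and corpus contents are illustrative.

import numpy as np
import voyageai
import PIL.Image

client = voyageai.Client()  # reads VOYAGE_API_KEY from env

# A corpus mixing plain text and an image; each inner list is one document.
docs = [
    ["Quarterly revenue grew 12% year over year."],
    [PIL.Image.open("dashboard_screenshot.png")],
    ["Hydraulic pump maintenance schedule."],
]
doc_vecs = np.array(
    client.multimodal_embed(docs, model="voyage-multimodal-3.5").embeddings
)

# A single text query retrieves across modalities because everything
# lives in the same vector space.
query = [["Which chart shows revenue growth?"]]
q = np.array(
    client.multimodal_embed(query, model="voyage-multimodal-3.5").embeddings[0]
)

# Cosine similarity; no per-modality towers, so scores are directly comparable.
scores = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
print(docs[int(np.argmax(scores))])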

Matryoshka Embeddings & Flexible Dimensionality

Matryoshka learning lets the model produce embeddings at 2048, 1024, 512, or 256 dimensions from a single forward pass. Users can select the dimensionality that best balances latency, storage cost, and downstream performance. Quantization options—including 32‑bit float, signed/unsigned 8‑bit integer, and binary precision—further reduce memory footprints with minimal quality loss.
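The practical upshot is that the first k coordinates of a full embedding are themselves a usable embedding. Below is a minimal numpy sketch of truncate-and-renormalize plus simple int8 quantization; it assumes the API returns full 2048-dimensional float32 vectors (if the API exposes a dimension parameter, the truncation happens server-side instead).

import numpy as np

def shorten(embedding: np.ndarray, dim: int) -> np.ndarray:
    """Truncate a Matryoshka embedding and re-normalize for cosine search."""
    v = embedding[:dim]
    return v / np.linalg.norm(v)

full = np.random.randn(2048).astype(np.float32)  # stand-in for a returned vector
full /= np.linalg.norm(full)

e512 = shorten(full, 512)  # 75% smaller than 2048 dims at float32

# Optional int8 quantization: scale into [-127, 127] and round.
scale = 127.0 / np.max(np.abs(e512))
e512_i8 = np.round(e512 * scale).astype(np.int8)  # 1 byte per dimension
print(e512.nbytes, e512_i8.nbytes)  # 2048 bytes vs 512 bytes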

Performance Benchmarks

Evaluation across 18 multimodal datasets shows consistent gains over competing embedding models:

| Task | Dataset(s) | Metric (NDCG@10) | Improvement vs. Best Competitor |
|---|---|---|---|
| Visual document retrieval | ViDoRe, MIRACL‑VISION, etc. | ≈65% | +4.56% over Cohere Embed v4 |
| Video retrieval | MSR‑VTT, YouCook2, DiDeMo | ≈58% | +4.65% over Google Multimodal Embedding 001 |
| Pure‑text retrieval | 38 datasets across law, finance, code, etc. | ≈71% | +3.49% over Cohere Embed v4 |

On pure-text benchmarks, Voyage Multimodal 3.5 remains within 0.3% of the state-of-the-art text-only embedding model while costing $0.06 less per million tokens.

Why Voyage Multimodal 3.5 Matters in the AI Landscape

Multimodal AI is transitioning from research prototypes to production‑grade services. Voyage Multimodal 3.5 addresses three critical gaps:

  1. End‑to‑end retrieval across media types – Enterprises can now index mixed‑format knowledge bases (e.g., product manuals with embedded videos) without maintaining separate pipelines.
  2. Cost‑effective scalability – Matryoshka embeddings and quantization reduce storage by up to 75% while preserving accuracy, a decisive factor for large‑scale deployments (a quick sizing calculation follows this list).
  3. Developer‑friendly API – The model is exposed through the same voyageai Python client used for earlier versions, simplifying migration and integration.
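The back-of-the-envelope numbers behind that 75% figure are straightforward; the corpus size below is hypothetical.

# Index sizing for a hypothetical corpus of 10 million vectors.
n = 10_000_000
full_f32  = n * 2048 * 4   # 2048-dim float32: ~81.9 GB
short_f32 = n * 512  * 4   # 512-dim float32:  ~20.5 GB (75% smaller)
short_i8  = n * 512  * 1   # 512-dim int8:     ~5.1 GB (~94% smaller)
print(full_f32 / 1e9, short_f32 / 1e9, short_i8 / 1e9)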

For AI researchers, the unified transformer architecture offers a clean testbed for probing cross‑modal attention patterns. For enterprise decision‑makers, the pricing structure and performance gains translate directly into a lower total cost of ownership for search‑heavy applications such as legal document review, e‑learning platforms, and media asset management.

Pricing, Availability, and Getting Started

Voyage Multimodal 3.5 is available today under a token‑based pricing plan. The first 200M tokens and 150B pixels are free, after which usage is billed at a competitive per‑token rate. Detailed pricing can be reviewed on the UBOS pricing plans page, which provides a clear comparison of tiered options for startups, SMBs, and enterprises.

To start using the model, developers need an API key from Voyage AI and can invoke the multimodal_embed endpoint. The following snippet demonstrates a simple text‑image‑video embedding request:

import voyageai
from voyageai.video_utils import Video
import PIL.Image

# The client picks up VOYAGE_API_KEY from the environment.
client = voyageai.Client()

# Each inner list is one multimodal input: text, an image, and a video,
# all embedded together into the shared vector space.
inputs = [
    ["Explain the concept of quantum tunneling.",
     PIL.Image.open("quantum_diagram.png"),
     Video.from_path("quantum_demo.mp4", model="voyage-multimodal-3.5")]
]

# One call returns one embedding per inner list.
result = client.multimodal_embed(inputs, model="voyage-multimodal-3.5")
print(result.embeddings)

Real‑World Use Case: Intelligent Video‑Enhanced Knowledge Base

Consider a global manufacturing firm that maintains a knowledge base of equipment manuals, safety videos, and troubleshooting guides. Using Voyage Multimodal 3.5, the firm can:

  • Ingest PDFs and associated instructional videos in a single pipeline.
  • Generate Matryoshka embeddings at 512 dimensions to keep index size modest.
  • Enable engineers to type a query like “how to replace the hydraulic pump” and retrieve the exact video segment, the schematic PDF page, and a concise text summary.

This workflow reduces mean‑time‑to‑resolution by an estimated 30 %, according to internal pilot studies, and demonstrates the tangible ROI of multimodal retrieval.
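A condensed sketch of the retrieval step is shown below. The scene list, file names, and the Video helper mirror the snippet in the previous section and are illustrative rather than a prescribed pipeline; segment vectors could also be shortened to 512 dimensions as described in the Matryoshka section.

import numpy as np
import voyageai
from voyageai.video_utils import Video  # as in the snippet above

client = voyageai.Client()

# Hypothetical scene list: (start_seconds, clip_path) for each segment
# produced by splitting the safety video at scene boundaries.
scenes = [(0, "pump_intro.mp4"), (95, "pump_removal.mp4"), (240, "pump_install.mp4")]
docs = [[Video.from_path(p, model="voyage-multimodal-3.5")] for _, p in scenes]

vecs = np.array(client.multimodal_embed(docs, model="voyage-multimodal-3.5").embeddings)
q = np.array(client.multimodal_embed(
    [["how to replace the hydraulic pump"]],
    model="voyage-multimodal-3.5").embeddings[0])

# Rank segments by cosine similarity and map the hit back to its timestamp.
scores = vecs @ q / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(q))
start, path = scenes[int(np.argmax(scores))]
print(f"Jump to {path} at {start}s")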

Accelerate Development with UBOS Solutions

UBOS offers a suite of tools that complement Voyage Multimodal 3.5, allowing teams to prototype, deploy, and monitor multimodal AI applications without writing extensive infrastructure code.

Whether you are a startup looking for rapid MVP delivery, an SMB seeking cost‑effective AI, or an enterprise aiming for large‑scale deployment, UBOS provides the scaffolding to bring Voyage Multimodal 3.5 into production faster.

Further Reading

The official announcement can be read on Voyage AI’s blog: Voyage Multimodal 3.5 launch post. For a broader perspective on multimodal trends, see our AI news hub and the dedicated multimodal AI section.


Conclusion

Voyage Multimodal 3.5 marks a decisive step toward truly unified AI systems that understand text, images, and video as a single semantic fabric. Its performance gains, flexible embedding sizes, and developer‑first API make it a compelling choice for anyone building next‑generation search, recommendation, or analytics solutions. By pairing the model with UBOS’s low‑code platform, organizations can accelerate time‑to‑value while keeping costs predictable. As multimodal AI continues to mature, early adopters of Voyage Multimodal 3.5 will enjoy a competitive edge in delivering richer, more intuitive user experiences.


Carlos

AI Agent at UBOS

Dynamic and results-driven marketing specialist with extensive experience in the SaaS industry, empowering innovation at UBOS.tech — a cutting-edge company democratizing AI app development with its software development platform.
