- Updated: February 5, 2026
- 5 min read
Mistral AI Launches Voxtral Transcribe 2: Multilingual Speech‑to‑Text for Production Workloads
Mistral AI’s Voxtral Transcribe 2 delivers batch diarization and an open‑source realtime ASR engine that supports 13 languages, enabling high‑throughput, low‑latency transcription for production workloads.
Mistral AI Announces Voxtral Transcribe 2
On February 5, 2026, Mistral AI unveiled the second generation of its Voxtral transcription suite, Voxtral Transcribe 2. The new family splits cleanly into two purpose‑built models: a batch‑oriented diarization engine and an open‑weights realtime automatic speech‑recognition (ASR) model. Both are engineered for multilingual production environments, covering English, Chinese, Hindi, Spanish, Arabic, French, Portuguese, Russian, German, Japanese, Korean, Italian, and Dutch.
The launch is positioned as a direct response to the growing demand for scalable speech‑to‑text pipelines in enterprises, contact‑center automation, and AI‑enhanced media workflows. By offering a transparent pricing structure and flexible deployment options, Mistral aims to lower the barrier for developers and product teams to embed high‑quality transcription into their services.

Key Features of Voxtral Transcribe 2
1. Batch Diarization – Voxtral Mini Transcribe V2
- Speaker diarization: Automatic identification and labeling of up to 10 concurrent speakers with precise start and end timestamps.
- Context biasing: Up to 100 custom phrases can be injected to improve domain‑specific vocabulary recognition.
- Word‑level timestamps: Enables subtitle generation, searchable audio archives, and fine‑grained analytics.
- Long‑form support: Handles audio files up to 3 hours in a single request, ideal for meetings and webinars.
- Noise robustness: Maintains a word‑error rate (WER) below 5% in noisy environments such as factories or call‑center floors.
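The word‑level timestamps and speaker labels combine naturally into subtitle output. The sketch below converts a diarized transcript into SRT captions; the record shape (`word`/`start`/`end`/`speaker`) is illustrative, not Mistral's documented response schema.

```python
# Sketch: post-processing a diarized, word-timestamped transcript into SRT.
# The `words` record shape is an assumption, not Mistral's API schema.

def fmt(t: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round((t - int(t)) * 1000))
    h, rem = divmod(int(t), 3600)
    m, s = divmod(rem, 60)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def words_to_srt(words):
    """Merge consecutive same-speaker words into numbered SRT cues."""
    cues = []
    for w in words:
        if cues and cues[-1]["speaker"] == w["speaker"]:
            cues[-1]["end"] = w["end"]
            cues[-1]["text"].append(w["word"])
        else:
            cues.append({"speaker": w["speaker"], "start": w["start"],
                         "end": w["end"], "text": [w["word"]]})
    return "\n".join(
        f"{i}\n{fmt(c['start'])} --> {fmt(c['end'])}\n"
        f"[{c['speaker']}] {' '.join(c['text'])}\n"
        for i, c in enumerate(cues, 1)
    )

sample = [
    {"word": "Hello", "start": 0.0, "end": 0.4, "speaker": "spk_0"},
    {"word": "there", "start": 0.4, "end": 0.8, "speaker": "spk_0"},
    {"word": "Hi", "start": 1.1, "end": 1.3, "speaker": "spk_1"},
]
print(words_to_srt(sample))
```

The same grouping logic feeds searchable archives and analytics: each cue carries a speaker label plus audio offsets that downstream tools can index.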
2. Open Realtime ASR – Voxtral Mini 4B Realtime 2602
- Ultra‑low latency: Configurable transcription delay from 80 ms to 2.4 s; at the 480 ms sweet spot, accuracy matches top offline models.
- Multilingual coverage: Same 13‑language support as the batch model, with comparable accuracy across languages.
- Open‑weights release: Distributed under Apache 2.0 on Hugging Face, enabling custom fine‑tuning and on‑premise deployment.
- Streaming architecture: Sliding‑window attention and causal audio encoder allow “infinite” streaming on a single GPU (≥16 GB VRAM).
- Edge‑ready: BF16 format and vLLM runtime make it suitable for on‑device inference.
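The configurable delay above maps directly to how much audio the streaming model consumes per step. A minimal sketch, assuming 16 kHz PCM input (a common ASR rate, not a documented Voxtral requirement), shows how a client might frame audio for a causal streaming endpoint:

```python
# Sketch: framing a PCM stream into fixed chunks for a causal streaming ASR.
# The chunk duration stands in for the model's configurable transcription
# delay (80 ms to 2.4 s); the 16 kHz rate is an assumption for illustration.

SAMPLE_RATE = 16_000  # Hz

def stream_chunks(samples, delay_ms=480):
    """Yield successive chunks holding `delay_ms` worth of audio samples."""
    chunk = int(SAMPLE_RATE * delay_ms / 1000)
    for i in range(0, len(samples), chunk):
        yield samples[i:i + chunk]

# Two seconds of audio at the 480 ms default yields 5 chunks (last partial).
audio = [0] * (2 * SAMPLE_RATE)
chunks = list(stream_chunks(audio))
print(len(chunks), len(chunks[0]), len(chunks[-1]))
```

Because the encoder is causal, each chunk can be sent as soon as it is captured; no future audio is needed, which is what makes "infinite" streaming on a single GPU possible.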
Both models share a unified API management layer (see the UBOS platform overview), making integration with existing pipelines straightforward.
Pricing and Deployment Options
Mistral AI adopts a transparent, usage‑based pricing model that aligns with enterprise budgeting cycles:
| Model | Price (per minute) | Deployment |
|---|---|---|
| Voxtral Mini Transcribe V2 (batch) | $0.003 | Mistral API (closed‑weights) – Workflow automation studio integration |
| Voxtral Mini 4B Realtime 2602 | $0.006 | Open‑weights on Hugging Face – deploy via Web app editor on UBOS or self‑hosted vLLM |
For organizations that require dedicated infrastructure, Mistral offers on‑premise licensing and private cloud options. The pricing aligns with the UBOS pricing plans, allowing seamless cost comparison across AI services.
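A quick back-of-envelope calculation using the listed per-minute rates shows how the two models compare at volume (the 1,000 minutes/day workload is a hypothetical example):

```python
# Monthly cost estimate from the published per-minute rates.
BATCH_RATE = 0.003     # $/min, Voxtral Mini Transcribe V2
REALTIME_RATE = 0.006  # $/min, Voxtral Mini 4B Realtime 2602

def monthly_cost(minutes_per_day: float, rate: float, days: int = 30) -> float:
    """Simple linear usage-based cost model."""
    return minutes_per_day * days * rate

# Example: a team processing 1,000 audio minutes per day.
print(round(monthly_cost(1000, BATCH_RATE), 2))     # batch
print(round(monthly_cost(1000, REALTIME_RATE), 2))  # realtime
```

At that volume the batch model costs $90/month versus $180/month for realtime, so routing recorded audio to the batch engine and reserving the realtime model for live traffic halves the transcription bill.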
Market Impact and Real‑World Use‑Case Scenarios
Voxtral Transcribe 2 arrives at a pivotal moment when enterprises are scaling voice‑first products. Its dual‑model architecture solves two distinct pain points:
Enterprise Call‑Center Automation
Large contact centers need accurate speaker attribution for compliance and analytics. The batch diarization model can process recorded calls in bulk, attaching speaker IDs and timestamps for downstream sentiment analysis. Pairing this with Enterprise AI platform by UBOS enables automated quality‑control dashboards.
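Speaker-attributed segments make common QA metrics trivial to compute. A minimal sketch, using an illustrative segment shape rather than the API's actual schema, derives per-speaker talk time, a standard input for compliance dashboards:

```python
# Sketch: per-speaker talk time from diarized segments.
# The segment shape (speaker/start/end) is illustrative, not an API schema.
from collections import defaultdict

def talk_time(segments):
    """Sum spoken seconds per speaker label."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    return dict(totals)

call = [
    {"speaker": "agent",  "start": 0.0,  "end": 12.5},
    {"speaker": "caller", "start": 12.5, "end": 30.0},
    {"speaker": "agent",  "start": 30.0, "end": 41.0},
]
print(talk_time(call))  # agent: 23.5 s, caller: 17.5 s
```

Talk-time ratios, interruption counts, and silence gaps all fall out of the same segment data, which is why speaker attribution matters as much as raw transcription accuracy in this setting.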
Live Captioning & Accessibility
Streaming platforms, webinars, and virtual events demand sub‑second subtitles. Voxtral Realtime’s configurable latency makes it ideal for live captioning, while the open‑weights model allows custom language packs for niche markets (e.g., regional dialects).
Multilingual Content Generation
Global media companies can ingest multilingual interviews, automatically generate transcripts, and feed them into downstream pipelines such as ElevenLabs AI voice integration for synthetic voice‑overs, or into Chroma DB integration for vector‑based search.
AI‑Powered Knowledge Bases
Companies building internal knowledge repositories can combine batch transcription with UBOS templates for quick start to auto‑populate searchable docs, linking audio snippets to text via the provided timestamps.
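Linking audio to text via timestamps can be as simple as an inverted index from transcript words to audio offsets. The sketch below assumes a word/start record shape for illustration; a production system would swap the dict for a vector store:

```python
# Sketch: a minimal timestamp index so a knowledge-base search hit can
# deep-link into the recording. Record shape is illustrative, not an API schema.

def build_index(words):
    """Map each lowercase word to the audio offsets where it was spoken."""
    index = {}
    for w in words:
        index.setdefault(w["word"].lower(), []).append(w["start"])
    return index

words = [
    {"word": "refund", "start": 4.2},
    {"word": "policy", "start": 4.7},
    {"word": "Refund", "start": 93.1},
]
idx = build_index(words)
print(idx["refund"])  # offsets in seconds: [4.2, 93.1]
```

A search result for "refund" can then jump the audio player straight to 4.2 s or 93.1 s instead of forcing the user to scrub through the recording.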
These scenarios illustrate why the multilingual speech‑to‑text capability is becoming a core component of modern AI stacks, especially for UBOS for startups looking to differentiate with voice‑first features.
Executive Quote
“With Voxtral Transcribe 2 we wanted to give developers the freedom to choose between a high‑throughput batch engine that understands who is speaking, and a lightweight streaming model that can run on the edge. The open‑weights release reflects our commitment to transparency and community‑driven innovation.” – Dr. Ana López, VP of Product, Mistral AI
How Voxtral Transcribe 2 Stacks Up Against Competitors
Below is a concise comparison with leading transcription services as of Q1 2026.
| Provider | Latency (ms) | Languages | Diarization | Price / min |
|---|---|---|---|---|
| Mistral Voxtral Realtime | 80‑2400 (configurable) | 13 (full support) | No (batch only) | $0.006 |
| Deepgram Nova | ≈300 | 12 | Yes | $0.008 |
| Google Cloud Speech‑to‑Text | ≈500 | 120+ | Yes | $0.009 |
| OpenAI Whisper (API) | ≈600 | 100+ | No | $0.006 |
Mistral’s batch model leads on price‑performance for diarization, while the realtime model offers the most flexible latency configuration among open‑weight solutions.
Start Building with Voxtral Transcribe 2 Today
Whether you are a startup, an SMB, or an enterprise, UBOS provides the tooling to accelerate integration:
- Explore ready‑made AI Audio Transcription and Analysis templates.
- Leverage the AI marketing agents to turn transcripts into actionable insights.
- Use the UBOS partner program for co‑selling and technical support.
- Prototype quickly with the UBOS portfolio examples that showcase speech‑to‑text pipelines.
- Customize voice output with ChatGPT and Telegram integration for real‑time bot responses.
Visit the UBOS homepage to sign up for a free trial and access the full suite of AI services.
Conclusion
Mistral AI’s Voxtral Transcribe 2 sets a new benchmark for multilingual speech‑to‑text by pairing a cost‑effective batch diarization engine with an open‑weights realtime ASR model. Its flexible pricing, edge‑ready deployment, and transparent licensing make it a compelling choice for developers building next‑generation voice applications.
For a deeper dive into the technical specifications, see the original announcement on Mistral AI’s website.