Voxtral Transcribe 2 by Mistral

Voxtral Transcribe 2: State-of-the-Art Speech-to-Text with Real-Time Latency and Precision Diarization

Introduction:

Voxtral Transcribe 2 is a suite of speech-to-text models from Mistral AI. It comprises Voxtral Mini Transcribe V2 for batch processing and Voxtral Realtime for live applications. The models offer speaker diarization, context biasing, and streaming latency configurable down to roughly 200 ms, with usage-based pricing, for meeting intelligence, workflow automation, and voice-first applications across 13 languages.

Added On:

2026-02-06

Monthly Visitors:

7963.5K

Voxtral Transcribe 2 by Mistral Product Information

Voxtral Transcribe 2: Revolutionizing Speech-to-Text with Next-Generation AI

Voxtral Transcribe 2 is a family of models from Mistral AI for high-precision audio transcription. It is designed to fit into AI workflows that need scalable, efficient, and accurate speech-to-text. Whether you want to automate processes or build responsive voice agents, Voxtral turns audio into structured, usable text with low latency.

What Is Voxtral Transcribe 2?

Voxtral Transcribe 2 is a suite of next-generation speech-to-text models that deliver industry-leading transcription quality, speaker diarization, and ultra-low latency. The family is divided into two primary offerings: Voxtral Mini Transcribe V2 for batch processing and Voxtral Realtime for live, streaming applications.

Unlike models that process audio in fixed chunks, Voxtral Realtime uses a streaming architecture that transcribes audio as it arrives, with a delay configurable to under 200 ms. Voxtral Mini Transcribe V2 targets efficiency: it is reported to achieve the lowest word error rate (WER) among the models it was compared against, at a price of $0.003/min. Both models are natively multilingual, supporting 13 languages including English, Chinese, French, and German.
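Since WER is the headline accuracy metric here, a short sketch of how it is usually computed may help: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the model output, divided by the number of reference words. The function name and the simple lowercase/whitespace tokenization are assumptions for illustration; published benchmarks typically apply more careful text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

For example, dropping one word from a six-word reference gives a WER of 1/6. Note that WER can exceed 1.0 when the hypothesis contains many inserted words.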

Features of Voxtral

Voxtral is packed with enterprise-ready features designed to handle complex acoustic environments and specific business needs.

1. Speaker Diarization

Voxtral includes precision diarization, which generates transcriptions with speaker labels and precise start/end times. This feature is essential for multi-party calls and interviews, ensuring that every word is attributed to the correct speaker.
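To make the idea of speaker-labeled output concrete, here is a minimal sketch that merges consecutive segments from the same speaker into turns and prints them as a readable transcript. The `Segment` shape (speaker, start, end, text) is a hypothetical stand-in for diarized output, not the actual response schema of the Voxtral API.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    # Hypothetical shape for one diarized segment; times are in seconds.
    speaker: str
    start: float
    end: float
    text: str


def merge_turns(segments: list[Segment]) -> list[Segment]:
    """Merge back-to-back segments from the same speaker into one turn."""
    turns: list[Segment] = []
    for seg in sorted(segments, key=lambda s: s.start):
        if turns and turns[-1].speaker == seg.speaker:
            last = turns[-1]
            turns[-1] = Segment(
                last.speaker, last.start, max(last.end, seg.end),
                f"{last.text} {seg.text}",
            )
        else:
            turns.append(Segment(seg.speaker, seg.start, seg.end, seg.text))
    return turns


def format_turn(turn: Segment) -> str:
    """Render a turn as '[start-end] speaker: text'."""
    return f"[{turn.start:06.2f}-{turn.end:06.2f}] {turn.speaker}: {turn.text}"
```

Merging by speaker keeps a meeting transcript compact while preserving the start and end time of each turn.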

2. Context Biasing

To ensure accuracy for technical terms or proper nouns, Voxtral allows users to provide up to 100 words or phrases. This guides the model toward the correct spelling of domain-specific vocabulary, which standard models often miss.
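Because the bias list is capped at 100 entries, it can be worth cleaning it before submitting. The sketch below trims whitespace, drops empty and duplicate entries (compared case-insensitively), and rejects lists that still exceed the cap. The helper name and the normalization rules are assumptions; only the 100-entry limit comes from the text above.

```python
MAX_BIAS_TERMS = 100  # limit stated above


def prepare_bias_terms(terms: list[str]) -> list[str]:
    """Normalize a context-biasing list and enforce the 100-entry cap."""
    seen: set[str] = set()
    cleaned: list[str] = []
    for term in terms:
        normalized = " ".join(term.split())  # collapse internal whitespace
        key = normalized.casefold()
        if not normalized or key in seen:
            continue  # skip blanks and case-insensitive duplicates
        seen.add(key)
        cleaned.append(normalized)
    if len(cleaned) > MAX_BIAS_TERMS:
        raise ValueError(
            f"{len(cleaned)} terms supplied; at most {MAX_BIAS_TERMS} are allowed"
        )
    return cleaned
```

The original spelling of the first occurrence is kept, since spelling is exactly what the bias list is meant to convey.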

3. Ultra-Low Latency

Voxtral Realtime is purpose-built for applications where speed is critical. With latency configurable down to roughly 200 ms, it is suited to voice-first applications and responsive voice interfaces.

4. Open-Weights and Privacy

Voxtral Realtime is released under the Apache 2.0 license with open weights. This allows for edge deployment, ensuring privacy-first applications and secure on-premise setups for sensitive data.

5. Robust Language Support

Both models support 13 languages with high accuracy:

  • English, Chinese, Hindi, Spanish, Arabic, French, Portuguese, Russian, German, Japanese, Korean, Italian, and Dutch.

Use Case Scenarios

Voxtral powers diverse workflows across multiple industries, helping teams cut costs and improve performance.

  • Meeting Intelligence: Transcribe multilingual recordings with clear speaker attribution. This is ideal for meeting analysis and large-scale content annotation.
  • Voice Agents and Virtual Assistants: By connecting Voxtral Realtime to an LLM and TTS pipeline, developers can build conversational AI that feels natural and responsive.
  • Contact Center Automation: Transcribe calls in real-time to analyze sentiment, suggest responses to agents, and automatically populate CRM fields.
  • Media and Broadcast: Generate live subtitles with minimal latency. Use context biasing to handle technical terminology and proper names during live broadcasts.
  • Compliance and Documentation: Create precise audit trails for regulatory compliance with word-level timestamps and diarization.

How to Use Voxtral

Getting started with Voxtral is seamless through Mistral Studio or API integration.

Using the Audio Playground

  1. Log in to Mistral Studio and navigate to the audio playground.
  2. Upload Files: Upload up to 10 audio files of up to 1 GB each (supported formats: .mp3, .wav, .m4a, .flac, .ogg).
  3. Configure Settings: Toggle speaker diarization on or off and choose your preferred timestamp granularity.
  4. Add Context: Input domain-specific vocabulary into the context biasing field.
  5. Transcribe: Run the transcription and review the resulting text.
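The upload limits in step 2 can be checked locally before opening the playground. This is a minimal sketch: the function name is hypothetical, and "1 GB" is assumed to mean 10^9 bytes, which the source does not specify.

```python
from pathlib import PurePath

ALLOWED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac", ".ogg"}
MAX_FILES = 10
MAX_BYTES = 10**9  # assumption: 1 GB taken as 10^9 bytes


def check_batch(files: list[tuple[str, int]]) -> list[str]:
    """Return a list of problems for (filename, size_in_bytes) pairs.

    An empty list means the batch fits the limits listed in step 2.
    """
    problems: list[str] = []
    if len(files) > MAX_FILES:
        problems.append(f"too many files: {len(files)} > {MAX_FILES}")
    for name, size in files:
        ext = PurePath(name).suffix.lower()
        if ext not in ALLOWED_EXTENSIONS:
            problems.append(f"{name}: unsupported format {ext or '(none)'}")
        if size > MAX_BYTES:
            problems.append(f"{name}: {size} bytes exceeds {MAX_BYTES}")
    return problems
```

With real files, sizes can be gathered with `pathlib.Path(name).stat().st_size` before calling `check_batch`.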

API Integration

  • Voxtral Mini Transcribe V2: Available at $0.003 per minute for batch processing.
  • Voxtral Realtime: Available at $0.006 per minute for live streaming applications. You can also download the open weights from the Hugging Face Hub for local deployment.
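As a quick worked example of the per-minute pricing above, the sketch below estimates the cost of a given amount of audio. The dictionary keys are illustrative labels rather than API model identifiers, and no rounding is applied because the billing granularity is not stated in the source.

```python
from decimal import Decimal

# Per-minute prices in USD, as listed above. Keys are illustrative labels only.
PRICE_PER_MINUTE = {
    "mini-transcribe-v2": Decimal("0.003"),
    "realtime": Decimal("0.006"),
}


def estimate_cost(model: str, audio_seconds: float) -> Decimal:
    """Estimate the USD cost of transcribing `audio_seconds` of audio."""
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    minutes = Decimal(str(audio_seconds)) / Decimal(60)
    return PRICE_PER_MINUTE[model] * minutes
```

Under these prices, one hour of batch audio comes to $0.18, and three hours of real-time audio comes to $1.08.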

FAQ

Q: What is the pricing for Voxtral?
A: Voxtral offers usage-based pricing. Voxtral Mini Transcribe V2 is $0.003/min, while Voxtral Realtime is $0.006/min. For enterprise-scale needs, solutions typically start around €5K/month.

Q: How does Voxtral compare to other models?
A: Voxtral Mini Transcribe V2 is reported to outperform GPT-4o mini Transcribe and Gemini 2.5 Flash on accuracy, while processing audio approximately 3x faster than ElevenLabs’ Scribe v2 at one-fifth the cost.

Q: Does Voxtral support long audio recordings?
A: Yes, Voxtral can process recordings up to 3 hours in a single request.
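For recordings longer than the 3-hour limit, one option is to split the audio into several requests. The sketch below only plans the time ranges; the function name is hypothetical, and cutting at fixed boundaries is a simplification, since a real pipeline would likely cut at silences to avoid splitting words.

```python
MAX_REQUEST_SECONDS = 3 * 60 * 60  # 3-hour per-request limit stated above


def plan_chunks(total_seconds: float,
                max_seconds: float = MAX_REQUEST_SECONDS) -> list[tuple[float, float]]:
    """Split a recording into consecutive (start, end) ranges, in seconds,
    each no longer than `max_seconds`."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    chunks: list[tuple[float, float]] = []
    start = 0
    while start < total_seconds:
        end = min(start + max_seconds, total_seconds)
        chunks.append((start, end))
        start = end
    return chunks
```

A 7-hour recording, for instance, yields two full 3-hour ranges followed by a final 1-hour range.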

Q: Is the platform secure for enterprise use?
A: Yes. Voxtral supports GDPR- and HIPAA-compliant deployments via secure on-premise or private cloud configurations.

Q: What languages are supported?
A: Voxtral natively supports 13 languages: English, Chinese, Hindi, Spanish, Arabic, French, Portuguese, Russian, German, Japanese, Korean, Italian, and Dutch.
