French AI company Mistral released a new open source text-to-speech model on Thursday. The model, called Voxtral TTS, is designed to help enterprises build voice assistants and voice agents for sales, support, and customer engagement, putting Mistral in direct competition with established players like ElevenLabs, Deepgram, and OpenAI.
Voxtral TTS supports nine languages: English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic. This multilingual capability makes it a strong fit for global businesses that need voice solutions across different markets and customer bases.
Small Enough for a Smartwatch
What sets Voxtral TTS apart from many competitors is its remarkably small footprint. Pierre Stock, Mistral's VP of science operations, told TechCrunch that customers had been requesting a speech model, so the company built one small enough to fit on a smartwatch, smartphone, laptop, or other edge devices. He added that the cost is a fraction of anything else available on the market while still offering top-tier performance.
The model is based on Ministral 3B, Mistral's compact language model, which allows it to run without heavy computing resources. For businesses looking to deploy voice AI locally on devices rather than relying on cloud servers, this could lower both cost and latency, since audio no longer has to make a round trip to a remote server.
Voice Cloning in Under Five Seconds
A headline feature of Voxtral TTS is voice adaptation. The model can clone a custom voice from a sample of less than five seconds of audio, and it captures subtle characteristics like accents, inflections, intonations, and natural irregularities in speech flow.
The model can also switch between languages seamlessly without losing the characteristics of the original voice, which opens up practical applications in areas like dubbing, real-time translation, and multilingual customer service. Stock emphasized that the company wanted the model to sound human rather than robotic.
Built for Real-Time Speed
Speed is critical for voice AI applications, and Mistral has designed Voxtral TTS with real-time performance in mind. The model has a time-to-first-audio (TTFA) of 90 milliseconds for a 10-second sample of 500 characters. That means the model starts producing speech almost instantly after receiving text input.
The model also has a real-time factor of 6x, meaning it generates audio about six times faster than playback speed and can render a 10-second audio clip in roughly 1.6 seconds. These speed metrics make it well-suited for live conversation scenarios like customer support calls, voice assistants, and interactive applications where any noticeable delay would degrade the user experience.
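The arithmetic behind those figures can be sketched in a few lines of Python. This is only an illustration of how a real-time factor translates into render time, not Mistral's benchmark code; the function name and variables are hypothetical, and the only inputs taken from the article are the 6x factor, the 10-second clip length, and the 90-millisecond TTFA.

```python
def render_time_seconds(audio_seconds: float, real_time_factor: float) -> float:
    """Return how long synthesis takes for a clip of the given length.

    A real-time factor of N is read here as "N times faster than playback",
    so render time is simply clip length divided by N.
    """
    if real_time_factor <= 0:
        raise ValueError("real_time_factor must be positive")
    return audio_seconds / real_time_factor


# Figures quoted in the article.
clip_seconds = 10.0      # length of the generated audio
speedup = 6.0            # real-time factor of 6x
ttfa_seconds = 0.090     # time-to-first-audio of 90 ms

render = render_time_seconds(clip_seconds, speedup)

# Full clip is ready well before playback would finish.
print(f"render time: {render:.2f} s for a {clip_seconds:.0f} s clip")

# Assumption: with streamed output, the listener waits only for the first
# audio chunk, not for the whole clip to finish rendering.
print(f"perceived wait when streaming: {ttfa_seconds * 1000:.0f} ms")
```

Dividing 10 seconds by 6 gives about 1.67 seconds, in line with the roughly 1.6 seconds quoted. Because generation runs faster than playback, a streaming setup would in principle keep ahead of the listener once the first audio arrives.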
Building a Complete Voice Platform
Voxtral TTS is not Mistral's first move into voice technology. Earlier this year, the company launched a pair of transcription models — one for large batch processing and another for real-time, low-latency use cases. With the addition of this text-to-speech model, Mistral is clearly building toward a full voice AI suite for enterprise customers.
Stock outlined the company's broader vision, saying Mistral plans to offer an end-to-end platform capable of handling multimodal streams of input and output, including audio, text, and images. He explained that the main advantage of such a system is the richer information you get from an agentic platform that supports audio as both input and output.
Open Source as a Competitive Edge
In a market dominated by proprietary voice AI solutions, Mistral is betting that its open source approach will be the differentiator. The company expects that enterprises will choose Voxtral TTS over competitors because they can fine-tune the model to their specific needs.
This strategy aligns with Mistral's broader philosophy across all its AI products. By giving enterprises full control over the model, Mistral appeals to organizations that prioritize data privacy, on-premise deployment, and the ability to customize AI tools without being locked into a vendor's ecosystem.
With Voxtral TTS, Mistral has made it clear that the voice AI race is no longer limited to American tech giants. A small, fast, and free open source model that runs on edge devices could reshape how businesses think about deploying voice technology at scale.