Miso TTS: Miso Labs' 8B Open-Weight Voice Model

MisoTTS is an 8-billion-parameter open-weight emotive text-to-speech model released by Miso Labs on June 3, 2026. The model was announced by co-founders Aoden Teo (CEO) and Cassidy Dalva (President) as the first public artifact of Miso Labs’ research program on "the most emotive foundation models for voice." Architecturally, MisoTTS is a hierarchical residual vector quantization (RVQ) Transformer in the spirit of Sesame’s Conversational Speech Model, pairing a 7.7B Llama 3.2-style backbone with a 300M autoregressive depth decoder over 32 codebooks of 2,048 entries each. It generates Mimi audio codes from text and optional audio context, so the model can condition on the speaker’s voice and tone, not just the words. Miso Labs claims roughly 110ms latency, against 700ms for ElevenLabs and 300ms for Sesame, all from its own internal comparisons. Weights are distributed under a modified MIT license on Hugging Face; an API is announced but not yet available.

The release matters for three reasons. Open-weight TTS at the 8B scale is a meaningful step up from the smaller open-source voice models that have dominated the public-weights landscape for the past 18 months. The RVQ architecture, while not unique to MisoTTS, is well-documented in the company’s release blog with the math written out clearly enough for builders to understand the design tradeoffs (rather than the usual marketing-driven announcement). And the one-shot voice cloning workflow (paste a 10-second clip, get persistent voice continuity across long generations) is functional today and runs entirely locally, which matters for organizations that can’t send voice data to a third-party API for compliance reasons. This post covers what MisoTTS actually is, how the RVQ approach works at a level builders can use, the architecture in plain language, the current limitations (half-duplex, single-turn, GPU-bound), the licensing and access story, where MisoTTS fits in the broader voice-AI landscape, and what voice-agent builders should do with this release today.

What MisoTTS actually is

MisoTTS is an 8-billion-parameter neural text-to-speech model that generates expressive speech from a combination of text input and optional reference audio. The model’s distinguishing features, both relative to other open-source TTS models and to the closed-source commercial options:

It conditions on prior audio, not just text. Most text-to-speech models take a text string and a speaker selector, then generate speech. MisoTTS can additionally take a reference audio clip and use it both for voice cloning (the cloned voice persists across the generation) and for tone conditioning (the model responds in a register that matches the reference). The motivation Miso Labs gives in the release blog: when humans speak, they adapt their delivery to the person they’re responding to. A model that only sees the text loses that conditioning entirely and tends toward the uncanny-valley flat delivery that most TTS systems exhibit.

It uses residual vector quantization to handle the sonic range problem. Standard Transformers generate from a fixed vocabulary, which works for text (a few hundred thousand tokens cover most languages) but breaks for speech (human vocal output varies across pitch, rhythm, emphasis, emotion, and accent in combinations that a flat vocabulary can’t capture without scaling the parameter count to impossible numbers). RVQ replaces the single-token output with a vector of tokens, each drawn from a separate codebook; the vectors those tokens select are summed to produce the final audio frame. With 32 codebooks of 2,048 entries, MisoTTS addresses an effective vocabulary of 2048³² (2³⁵², roughly 9 × 10¹⁰⁵) audio tokens, which Miso Labs notes is more than the number of atoms in the observable universe. Naive vocabulary scaling to reach that range would require a model 93 orders of magnitude larger than the largest models ever trained, so the architectural choice matters.
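The vocabulary arithmetic is easy to check for yourself. The sketch below is only a sanity check on the numbers quoted above: the codebook size and depth come from the release, while the atom count is the commonly quoted order-of-magnitude estimate and is an assumption here, not a figure from Miso Labs.

```python
import math

CODEBOOK_SIZE = 2048   # entries per codebook (from the release)
DEPTH = 32             # codebooks per audio frame (from the release)

# A frame is one independent choice per codebook, so the counts multiply.
addressable = CODEBOOK_SIZE ** DEPTH

bits_per_frame = DEPTH * math.log2(CODEBOOK_SIZE)   # 32 codebooks x 11 bits each
log10_size = math.log10(addressable)                # base-10 exponent of the total

# Assumption: a widely used rough estimate, not an exact figure.
ATOMS_IN_OBSERVABLE_UNIVERSE_LOG10 = 80

print(f"bits per frame: {bits_per_frame:.0f}")
print(f"addressable frames: about 10^{log10_size:.2f}")
print(f"decimal digits: {len(str(addressable))}")
print(f"orders of magnitude above the atom estimate: "
      f"{log10_size - ATOMS_IN_OBSERVABLE_UNIVERSE_LOG10:.0f}")
```

Python integers are arbitrary precision, so the exact value is computed directly and no floating-point overflow is involved.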

The implementation splits into two transformers. A 7.7B backbone (Llama 3.2-style) processes the interleaved text and audio sequence and predicts the first codebook index plus a hidden state. A 300M decoder runs autoregressively over depth, reusing the same parameters to predict each subsequent codebook index conditioned on the ones predicted earlier in the same frame. The decoder weights are reused across all 31 remaining positions, which is what keeps the parameter count manageable while letting the addressable vocabulary scale exponentially with depth.

For voice agent context that complements this release, our AI agents pillar covers the broader landscape of conversational AI, and our natural language processing primer covers the text-side fundamentals that voice systems build on top of.

The vocabulary problem MisoTTS solves

Voice generation is hard for Transformers in a way that text generation is not. A text vocabulary of 128,000 tokens covers most of what English speakers produce; pushing to 256,000 or 512,000 catches the long tail of code, names, and cross-language fragments. But the equivalent vocabulary for speech (the set of distinct audio frames a model might want to produce) is vastly larger because every combination of pitch, timing, emphasis, breath, and timbre is a distinct frame. A flat audio vocabulary that captured the diversity of human speech would need tens of millions of entries at minimum, and likely more.

The conventional fix is to make the audio vocabulary larger. That works in theory but breaks the model size: in a standard Transformer, both the embedding lookup and the prediction head scale linearly with vocabulary size, so doubling the vocabulary roughly doubles a significant chunk of the parameter count. Pushing to the vocabulary size required to cover human-speech-quality output produces models that are too large to train economically.
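To make the linear scaling concrete, here is a rough back-of-the-envelope sketch. The hidden width of 4,096 is a hypothetical value picked only for illustration (the release figures quoted here do not include it), and the formula assumes an untied embedding table and output head.

```python
D_MODEL = 4096  # hypothetical hidden width, chosen only for illustration


def vocab_params(vocab_size: int, d_model: int = D_MODEL) -> int:
    """Parameters in an untied input embedding plus output prediction head."""
    embedding = vocab_size * d_model   # one row per vocabulary entry
    head = d_model * vocab_size        # one output column per vocabulary entry
    return embedding + head


for vocab in (128_000, 1_000_000, 10_000_000, 100_000_000):
    billions = vocab_params(vocab) / 1e9
    print(f"{vocab:>11,} entries -> {billions:8.2f}B vocabulary parameters")
```

Under these assumptions the vocabulary-related parameters alone outgrow an 8B model well before a flat vocabulary reaches the tens of millions of entries discussed above.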

RVQ avoids this by representing each audio frame as a vector of indices rather than a single index. Each position in the vector draws from its own codebook (set of vectors). To recover the audio for a given vector token, the model looks up the corresponding vector in each codebook and sums the results. The effective addressable vocabulary is the codebook size raised to the power of the depth (the number of positions in the vector). MisoTTS uses 2,048-entry codebooks at depth 32, producing 2048³² addressable audio tokens.
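The lookup-and-sum idea can be shown with a toy example. This is a minimal sketch of residual quantization in general, not MisoTTS or Mimi code: the codebooks are hand-built grids instead of learned vectors, the frame is 2-dimensional, and the depth is 3 instead of 32. All names are hypothetical.

```python
from itertools import product

DIM = 2     # toy frame dimension (real codec latents are much wider)
SIDE = 4    # grid points per axis, so each codebook has 4**2 = 16 entries
DEPTH = 3   # toy depth; the model described here uses 32


def make_codebooks(depth=DEPTH, side=SIDE):
    """Hand-built codebooks: level k is a grid with spacing 1 / side**(k + 1)."""
    books = []
    for level in range(depth):
        step = 1.0 / side ** (level + 1)
        books.append([tuple(i * step for i in idx)
                      for idx in product(range(side), repeat=DIM)])
    return books


def encode(frame, books):
    """Quantize the running residual with each codebook in turn."""
    residual = list(frame)
    indices = []
    for book in books:
        best = min(range(len(book)),
                   key=lambda j: sum((r - c) ** 2
                                     for r, c in zip(residual, book[j])))
        indices.append(best)
        residual = [r - c for r, c in zip(residual, book[best])]
    return indices


def decode(indices, books):
    """Look up one vector per codebook and sum them."""
    out = [0.0] * DIM
    for idx, book in zip(indices, books):
        out = [o + c for o, c in zip(out, book[idx])]
    return out


books = make_codebooks()
frame = (0.37, 0.82)
codes = encode(frame, books)
approx = decode(codes, books)
print("indices:", codes)
print("reconstruction:", approx)
```

Each extra level refines what the earlier levels left over, which is why three 16-entry codebooks here can address 16³ = 4,096 distinct points while storing only 48 vectors.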

The crucial property is that adding depth adds almost nothing to the parameter count. Each extra level brings its own 2,048-entry codebook, but the same 300M-parameter depth decoder is reused at every position, predicting the next codebook index conditioned on the ones already predicted in the same frame. So MisoTTS gets an exponentially scaled audio vocabulary without an exponentially scaled model.

That mathematical move is what makes 8B parameters enough to handle high-quality emotive speech in MisoTTS. Without RVQ, the same expressive range would require a vastly larger model. With RVQ, you trade complexity in the inference pipeline (running the depth decoder 31 times per audio frame) for parameter efficiency in the model itself. For builders, this is the architectural insight worth understanding: when you read that a TTS model is "RVQ-based," you’re being told that the model gets its sonic range from codebook depth rather than from raw parameter count.

The two-transformer architecture, in plain language

The MisoTTS computation graph at inference time:

The backbone Transformer (7.7B parameters) takes the input sequence (interleaved text tokens and audio frame tokens for any prior context), runs the usual Transformer forward pass, and produces two things: the index of the first codebook for the next audio frame, and a hidden state that will be used by the decoder.

The depth decoder Transformer (300M parameters) takes that hidden state plus the embedding of the first codebook index and autoregressively predicts the second codebook index. Then it takes the hidden state plus the first two codebook embeddings and predicts the third codebook index. And so on through all 32 codebook positions.

The output of this two-stage process is a complete 32-position vector token that represents one audio frame. The frame is then converted to actual audio by the Mimi audio decoder (Mimi is the neural audio codec from Kyutai that Sesame’s CSM also uses, so it is inherited technology rather than a Miso Labs invention).
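The control flow described above can be sketched without any real model. Everything in this block is a hypothetical stand-in: `backbone_step` and `depth_decoder_step` are placeholder arithmetic, not the functions in the MisoTTS repository, and appending raw codes to the history is a simplification of however the real model embeds prior frames. The point is only the loop shape.

```python
CODEBOOK_SIZE = 2048
DEPTH = 32

calls = {"backbone": 0, "depth_decoder": 0}


def backbone_step(history):
    """Hypothetical stand-in for the 7.7B backbone: returns (code_0, hidden)."""
    calls["backbone"] += 1
    hidden = sum(history) + len(history)   # placeholder for a hidden state
    return hidden % CODEBOOK_SIZE, hidden


def depth_decoder_step(hidden, codes_so_far):
    """Hypothetical stand-in for the 300M depth decoder: returns the next code."""
    calls["depth_decoder"] += 1
    return (hidden * 31 + sum(codes_so_far) + len(codes_so_far)) % CODEBOOK_SIZE


def generate_frame(history):
    """One audio frame = one backbone pass + (DEPTH - 1) depth-decoder passes."""
    first_code, hidden = backbone_step(history)
    codes = [first_code]
    while len(codes) < DEPTH:
        codes.append(depth_decoder_step(hidden, codes))
    return codes


def generate(text_tokens, num_frames):
    history = list(text_tokens)
    frames = []
    for _ in range(num_frames):
        frame = generate_frame(history)
        frames.append(frame)
        history.extend(frame)   # the new frame becomes context for the next one
    return frames


frames = generate(text_tokens=[101, 2057, 3], num_frames=4)
print(len(frames), "frames of", len(frames[0]), "codes;", calls)
```

The call counter makes the cost structure visible: the outer loop advances through time with the large model, while the inner loop advances through codebook depth with the small one.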

For inference throughput, the backbone runs once per audio frame; the depth decoder runs 31 times per frame. As a rough parameter-count proxy, those 31 passes through 300M parameters come to about 9.3B parameter-passes per frame, the same order as the single 7.7B backbone pass, so the depth loop roughly doubles per-frame compute instead of multiplying it by 31. Keeping the decoder small is part of how the model achieves its ~110ms latency claim.

For training, the architecture lets MisoTTS learn from interleaved text and audio sequences. The backbone sees the full conversation history (text and prior audio frames together), so the model can carry context across turns and condition on the speaker’s tone. This is the architectural choice that Miso Labs frames as the answer to the second motivation for MisoTTS (after vocabulary size): most TTS models condition only on text and miss the speaker’s tone, contributing to the uncanny-valley flatness that voice agents have struggled with.

What MisoTTS does well today

The release blog includes four audio samples worth listening to as illustrations of the model’s capability ranges:

A basketball-commentary sample with fast, excited delivery and live-event pacing demonstrates that the model can sustain high-energy speech over longer durations without drifting toward monotone.

A casual conversation sample (9 seconds) shows the conversational timing, asides, and relaxed intonation that a normal-register voice agent would need.

A 28-second math explanation sample shows calm instructional speech with clear phrasing, which is the register that educational voice applications target.

A therapeutic-register sample with soft, emotionally aware delivery and longer pauses demonstrates the lower-affect, higher-presence end of the model’s range.

Beyond the demo samples, the practical capability story:

One-shot voice cloning works from a 10-second reference clip. The cloned voice stays consistent across long generations, which solves a problem that earlier voice-cloning systems struggled with (cloned voices that drift toward a generic average over the course of a long output).

The model runs locally on a capable CUDA GPU. That’s a meaningful constraint (you need a real workstation or cloud GPU instance) but it’s far less restrictive than the cloud-only commercial alternatives. For organizations with compliance constraints that prevent sending voice data to a third party, local inference is the necessary capability.

Audio is watermarked by default via SilentCipher. That’s an attribution and provenance measure rather than a quality measure, but it matters for the policy landscape: regulators have been increasingly concerned about voice cloning being used for fraud, and watermarking output by default makes MisoTTS-generated audio detectable downstream.

Inference setup uses uv with Python 3.10 and standard PyTorch in torch.bfloat16 precision. The barrier to running the model is modest if you have the GPU.
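Before downloading the weights, it can help to confirm the machine matches that setup. The following is a small preflight sketch, not part of the MisoTTS repository; the function name and report fields are made up for illustration. It only uses standard PyTorch calls and degrades gracefully when PyTorch is not installed.

```python
import sys


def preflight():
    """Report whether this machine matches the setup described above."""
    report = {
        "python": f"{sys.version_info.major}.{sys.version_info.minor}",
        "torch": None,
        "cuda": False,
        "bf16": False,
        "vram_gb": None,
    }
    try:
        import torch
    except ImportError:
        return report                      # PyTorch is not installed yet
    report["torch"] = torch.__version__
    if torch.cuda.is_available():
        report["cuda"] = True
        report["bf16"] = torch.cuda.is_bf16_supported()
        props = torch.cuda.get_device_properties(0)
        report["vram_gb"] = round(props.total_memory / 1e9, 1)
    return report


print(preflight())
```

If `cuda` or `bf16` comes back false, sort that out before spending time on a 32GB download.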

What MisoTTS can’t do yet

The release blog is explicit about the current limitations, which are worth knowing before you plan a project around MisoTTS:

The model is single-turn. It generates one response at a time and doesn’t model the turn-taking of a conversation. For a voice agent, this means MisoTTS handles the output side of "user says X, agent says Y," but you’d need a separate orchestration layer to manage the turn-taking decisions (who speaks next, when to interrupt, when to wait).

The model is half-duplex. MisoTTS cannot speak while the other party is speaking. That’s a significant constraint for natural conversation, since human dialogue involves frequent overlap (backchannels like "mm-hmm" while the other person is talking, interjections, talking over each other to make a point). A half-duplex voice agent will feel turn-based in a way that real conversation isn’t.
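In practice, half-duplex means the orchestration layer has to gate when the TTS step is allowed to speak. Here is a minimal sketch of such a gate, assuming your stack already has voice-activity detection that reports when the user starts and stops talking. The class and method names are hypothetical and are not taken from MisoTTS or any agent framework.

```python
from enum import Enum, auto


class Turn(Enum):
    LISTENING = auto()
    AGENT_SPEAKING = auto()


class HalfDuplexGate:
    """Toy turn gate: the agent may only speak while the user is silent."""

    def __init__(self):
        self.state = Turn.LISTENING
        self.user_talking = False

    def on_user_speech(self, active: bool) -> str:
        """Call with True when the user starts talking, False when they stop."""
        self.user_talking = active
        if active and self.state is Turn.AGENT_SPEAKING:
            self.state = Turn.LISTENING
            return "stop_playback"   # barge-in: cut the agent off
        return "noop"

    def request_speak(self) -> bool:
        """Ask for the floor before sending text to the TTS step."""
        if self.user_talking or self.state is Turn.AGENT_SPEAKING:
            return False
        self.state = Turn.AGENT_SPEAKING
        return True

    def on_playback_done(self) -> None:
        self.state = Turn.LISTENING


gate = HalfDuplexGate()
print(gate.request_speak())        # agent asks for the floor
print(gate.on_user_speech(True))   # user barges in mid-playback
print(gate.request_speak())        # agent asks again while the user talks
```

A production gate would also need timeouts and some debounce on the voice-activity signal; this only shows where the turn-taking decision lives.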

API access is not yet available. The release announces the API as "coming soon" but doesn’t commit to a date. For builders who don’t want to operate inference infrastructure themselves, this is a blocker for production use. The path forward is either running the open weights yourself or waiting for the API.

The model is English-only at release. The release blog doesn’t position MisoTTS as multilingual; the samples are all English and the training data is presumably English-heavy. Multilingual support, if it’s coming, isn’t part of the current release.

Latency claims are unverified by third parties. The 110ms latency number is Miso Labs’ own measurement; the comparisons to ElevenLabs at 700ms and Sesame at 300ms are also internal. Independent latency benchmarks haven’t landed yet. For latency-sensitive applications (real-time voice agents where the speech needs to start within a hard deadline), build benchmarks against your actual workload before committing.
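A benchmark for this does not need to be elaborate. The harness below is a sketch that measures time to first audio chunk for any callable that yields audio; `fake_stream` is a placeholder, and wrapping the real model (including whether it can stream at all, which the material quoted here does not confirm) is left to you. If your wrapper is not streaming, yield the finished audio once and the same harness measures total generation time.

```python
import statistics
import time


def time_to_first_chunk(stream_fn, prompt):
    """Seconds from request to the first audio chunk yielded by stream_fn."""
    start = time.perf_counter()
    for _chunk in stream_fn(prompt):
        return time.perf_counter() - start
    raise RuntimeError("stream produced no audio")


def benchmark(stream_fn, prompts, warmup=2, runs=5):
    """Return latency percentiles in milliseconds across prompts and runs."""
    for prompt in prompts[:warmup]:
        time_to_first_chunk(stream_fn, prompt)   # warm caches, discard result
    samples = [time_to_first_chunk(stream_fn, p) * 1000
               for _ in range(runs) for p in prompts]
    samples.sort()
    return {
        "n": len(samples),
        "p50_ms": statistics.median(samples),
        "p95_ms": samples[min(len(samples) - 1, int(0.95 * len(samples)))],
        "max_ms": samples[-1],
    }


def fake_stream(prompt):
    """Placeholder generator; swap in a wrapper around your real TTS call."""
    time.sleep(0.002 + 0.0001 * len(prompt))
    yield b"\x00" * 320
    yield b"\x00" * 320


print(benchmark(fake_stream, ["Hello there.", "A longer prompt to read aloud."]))
```

Use prompts drawn from your real traffic and report the tail (p95, max) as well as the median, since a hard deadline is broken by the slow cases.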

Quality is also unverified by third parties at scale. The demo samples are persuasive, but rigorous comparison against ElevenLabs, OpenAI’s voice models, Sesame CSM, Cartesia Sonic, and other current state-of-the-art TTS systems requires public benchmarks that haven’t been published yet.

Licensing and access

MisoTTS weights are available on Hugging Face at MisoLabs/MisoTTS under a "modified MIT license." The exact modifications to standard MIT aren’t published in the release blog; before using MisoTTS commercially, check the LICENSE file in the repository to confirm what restrictions apply. Common modifications for AI model licenses include use-case restrictions (no weapons, no surveillance, no impersonation), attribution requirements, and redistribution constraints. Whatever the modifications are, plan to comply with them.

The model weights are about 8B parameters in F32 precision, which is roughly 32GB to download. Inference precision defaults to torch.bfloat16, so the loaded model occupies roughly half that in GPU memory at inference time.
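The size figures follow directly from bytes per parameter. This sketch covers weights only; activations, attention caches, and the audio codec add to the real footprint, so treat the results as a floor.

```python
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weights_gb(num_params: float, dtype: str) -> float:
    """Size of the raw weights in gigabytes (10**9 bytes) for a given dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


PARAMS = 8e9   # about 8 billion parameters, as stated above
for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weights_gb(PARAMS, dtype):.0f} GB")
```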

API access is announced as coming, with no committed timeline. When the API launches, expect Miso Labs to price it competitively against ElevenLabs and OpenAI TTS (the obvious commercial comparisons). The API will presumably remove the local-GPU requirement and make MisoTTS available through a standard HTTP integration.

Source code is on GitHub at MisoLabsAI/MisoTTS. The repository contains the inference code (the model definition is in generator.py) plus setup instructions. For the technical writeup, the release blog at misolabs.ai/blog/miso-tts-8b has the full architectural details with the math written out.

Where MisoTTS fits in the TTS landscape

The current commercial-and-open landscape for high-quality TTS is denser than it was a year ago. The relevant comparison points:

ElevenLabs is the commercial default for high-quality voice cloning and emotive speech. Closed weights, cloud API, broad language support, mature product surface. Miso Labs’ own latency comparison cites ElevenLabs at 700ms (vs MisoTTS at 110ms), though that comparison is against an unspecified ElevenLabs configuration. ElevenLabs’ commercial pricing makes it the default for organizations with budget for cloud TTS and no compliance constraint against sending voice data off-premises.

Sesame’s Conversational Speech Model (CSM) is the architectural ancestor MisoTTS credits in its release blog. Sesame released CSM in early 2025 and has continued to iterate. Sesame is positioned as the voice infrastructure layer for conversational agents, with both open and closed model tiers. MisoTTS represents an explicit open-weight scaling of the CSM architecture to 8B parameters.

OpenAI TTS ships voice generation through the standard OpenAI API alongside the rest of the GPT family. The OpenAI voice mode that ships in ChatGPT and the standalone TTS API endpoint are positioned for general-purpose voice agent use. Closed weights, cloud API, broad language support, integration with OpenAI’s other models.

Cartesia Sonic is the other commercial low-latency TTS system that competes directly with the latency claim MisoTTS makes. Sonic targets sub-200ms latency for voice agent applications and is closed-source.

Google’s text-to-speech and Microsoft Azure TTS are the cloud incumbents. Both ship through their respective cloud platforms with broad language and voice support but typically aren’t the choice when expressive emotive output is the priority. Pricing is per-character.

Smaller open-source models (Kokoro, XTTS, OpenVoice, Bark, and others) have served the open-source-weights market with varying quality and capability. MisoTTS is meaningfully larger and (per Miso Labs’ positioning) meaningfully more expressive than these smaller alternatives, but the smaller models are easier to deploy on consumer-grade hardware.

MisoTTS positions itself in a specific niche: the highest-quality open-weight TTS available, with on-prem deployment as a core capability, targeting voice agent builders who can’t or don’t want to send voice data to a third-party cloud API. For that niche, MisoTTS is well-positioned. For organizations that don’t have the on-prem constraint, the commercial alternatives may still be the right answer pending Miso Labs’ API launch.

For wider AI tooling context, our OpenAI Codex on mobile coverage discusses the pattern of AI tools moving toward latency-sensitive, locally deployable architectures.

What voice agent builders should do today

Six concrete actions to consider:

Download and run the model. The Hugging Face weights and the GitHub inference code together let you evaluate MisoTTS against your actual voice agent workload in an afternoon. The setup is uv-based on Python 3.10 and requires a capable CUDA GPU. Run real prompts, not just the demo samples.
Test the one-shot voice cloning against your speaker library. The 10-second clip workflow is the most practical capability for production voice agents (you don’t want to ship a personality whose voice doesn’t match your brand or character). Run clips from your existing voice talent and assess whether the cloned voice carries the character you need.
Benchmark latency on your hardware. The 110ms number is from Miso Labs; your number will depend on your GPU, your prompt length, and your inference setup. For real-time voice agents where the speech needs to start within a hard deadline, measure before you commit.
Plan the turn-taking story. Because MisoTTS is half-duplex and single-turn, your voice agent architecture needs an orchestration layer that handles when MisoTTS speaks, when the user speaks, and how to handle overlaps. That logic is your responsibility, not Miso Labs’. Existing voice agent frameworks (LiveKit, Pipecat, Vocode) can host the orchestration; MisoTTS plugs in as the TTS step.
Watch for the API launch. If you don’t want to run inference infrastructure, the API is the right path. Until it lands, plan around the open-weight workflow. When the API ships, you can swap the local generation step for the API call without changing the rest of your agent architecture.
Confirm license compatibility for your use case. The modified MIT license is broadly permissive but the specific modifications matter. Read the LICENSE file in the GitHub repository before shipping a production product that uses MisoTTS, especially for use cases involving voice impersonation (where common AI-model license restrictions apply).
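One way to keep the later swap from local weights to a hosted API cheap is to put the TTS step behind a small interface from day one. The sketch below assumes nothing about the real MisoTTS inference code or the unreleased API: both backend classes are hypothetical placeholders that return dummy bytes, and the endpoint URL is a made-up example.

```python
from typing import Iterator, Optional, Protocol


class TTSBackend(Protocol):
    """Anything that turns text (plus optional reference audio) into audio chunks."""

    def synthesize(self, text: str,
                   reference_audio: Optional[bytes] = None) -> Iterator[bytes]:
        ...


class LocalBackend:
    """Hypothetical wrapper around locally hosted weights."""

    def synthesize(self, text, reference_audio=None):
        yield f"local:{text}".encode()   # placeholder for real audio bytes


class ApiBackend:
    """Hypothetical wrapper for a future hosted API."""

    def __init__(self, endpoint: str):
        self.endpoint = endpoint         # made-up URL, for illustration only

    def synthesize(self, text, reference_audio=None):
        yield f"api:{text}".encode()     # placeholder for real audio bytes


def speak(backend: TTSBackend, text: str) -> bytes:
    """The rest of the agent depends only on this call, not on the backend."""
    return b"".join(backend.synthesize(text))


print(speak(LocalBackend(), "hello"))
print(speak(ApiBackend("https://example.invalid/tts"), "hello"))
```

With this shape, the orchestration framework, the benchmark harness, and the turn-taking logic never need to know which backend is producing the audio.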

The deeper takeaway is that open-weight voice infrastructure has caught up to the point where serious voice agent products can be built without sending voice data to a third-party API. For organizations with compliance constraints, regulatory exposure, or commercial reasons to keep voice data on-premises, MisoTTS represents a meaningful capability that didn’t exist at this quality level a year ago. For organizations without those constraints, the commercial cloud TTS landscape (ElevenLabs, OpenAI, Cartesia, Google, Microsoft) still offers the easiest path to production. The right choice depends on your constraints, and MisoTTS adds a new, credible option to the menu.

Frequently Asked Questions

What is MisoTTS?

MisoTTS is an 8-billion-parameter open-weight emotive text-to-speech model released by Miso Labs on June 3, 2026. It generates expressive speech from text input and optional reference audio (which can be used for voice cloning and tone conditioning). Architecturally, MisoTTS is a hierarchical residual vector quantization (RVQ) Transformer built in the spirit of Sesame’s Conversational Speech Model: a 7.7B Llama 3.2-style backbone paired with a 300M autoregressive depth decoder over 32 codebooks of 2,048 entries each.

What is residual vector quantization (RVQ)?

RVQ is the architectural trick MisoTTS uses to handle the wide sonic range of human speech without scaling parameter count to impossible numbers. Instead of generating from a single flat vocabulary, the model generates a vector of indices (32 positions in MisoTTS), each drawn from its own codebook of 2,048 entries. The audio is recovered by summing the looked-up vectors. Total addressable vocabulary scales as codebook size to the power of depth, so MisoTTS gets 2048³² (roughly 9 × 10¹⁰⁵) addressable audio tokens. The depth decoder reuses the same 300M parameters at each position, so scaling depth doesn’t scale model size.

How does MisoTTS compare to ElevenLabs?

The comparison is between an open-weight model that runs on your hardware (MisoTTS) and a closed-weight cloud API (ElevenLabs). Miso Labs claims 110ms latency for MisoTTS against 700ms for ElevenLabs (using Miso Labs’ own measurements, which haven’t been independently verified). On quality, both target high-end expressive voice; rigorous third-party comparison hasn’t been published yet. The practical choice depends on your constraints: if you need to keep voice data on-premises or want to run on your own hardware, MisoTTS is the better fit; if you want a cloud API with broad language support and mature product surface, ElevenLabs remains the default. The Miso Labs API, when it launches, will compete more directly with ElevenLabs on the cloud-API axis.

What can MisoTTS not do?

Four important limitations as of the release: single-turn only (the model handles one response at a time and doesn’t model conversation turn-taking), half-duplex (it cannot speak while the user is speaking, so natural conversational overlap isn’t supported), API not yet available (you have to run the open weights yourself for now), and English-only at release (multilingual support isn’t part of the current model). Quality and latency claims also haven’t been independently verified yet; budget time for your own benchmarks.

Can MisoTTS clone voices?

Yes, with a one-shot clip of about 10 seconds. Feed the model a reference audio sample and it generates speech in the cloned voice. The cloned voice stays consistent across long generations (a problem that earlier voice-cloning systems struggled with). Output is watermarked by default via SilentCipher for provenance tracking. For use cases involving voice impersonation, check the modified MIT license restrictions before shipping a product that uses cloning.

What hardware do I need to run MisoTTS?

A capable CUDA GPU. The 8B model in F32 is about 32GB on disk; inference defaults to torch.bfloat16, so the loaded model occupies roughly half that in GPU memory. Practical inference probably wants at least 24GB VRAM, which means an RTX 4090 or better on the consumer side, or A100/H100/L40 class GPUs in the cloud. Smaller GPUs may work with additional quantization but the release doesn’t include official quantized weights.

What’s the licensing story?

MisoTTS weights are distributed under a “modified MIT license” on Hugging Face. The exact modifications to standard MIT aren’t documented in the release blog, so check the LICENSE file in the GitHub repository before commercial use. AI model license modifications typically cover use-case restrictions (no weapons, no surveillance, no impersonation), attribution requirements, and redistribution constraints. Plan to comply with whatever the modifications are.

How does MisoTTS fit a voice-agent stack?

MisoTTS handles the text-to-speech step. A complete voice agent also needs speech-to-text (for hearing the user), a language model (for understanding and reasoning), and orchestration (for managing turn-taking, latency budgets, and the conversation flow). Existing voice agent frameworks like LiveKit, Pipecat, and Vocode host the orchestration; MisoTTS plugs in as the TTS step in the same way ElevenLabs or OpenAI TTS would. Because MisoTTS is single-turn and half-duplex, the orchestration layer carries more responsibility for handling natural conversation patterns than it would with a full-duplex voice model.
