
xAI brings two-minute voice cloning to the Grok API

xAI's new Custom Voices feature lets developers create branded or personal AI voices for TTS and voice agents, with an 80+ voice library, 28 languages, and important safety limits.

By TreffikAI Editorial · 9 min read
A recording studio microphone and keyboard representing AI voice cloning

xAI has moved deeper into voice AI with Custom Voices, a new capability for creating cloned voices and using them across the company's Text to Speech and Voice Agent APIs.

The pitch is deliberately simple: record a short voice sample, create a custom voice in under two minutes, or skip cloning and choose from a library of more than 80 built-in voices across 28 languages. For developers building voice agents, audiobooks, games, education tools, support bots or creator workflows, that is a big expansion of what the Grok API can sound like.

It also lands at a sensitive moment. Voice cloning is useful because it can make AI systems feel more personal, consistent and expressive. It is risky for the same reason. A realistic synthetic voice can strengthen a brand, preserve accessibility, or narrate content at scale. It can also be abused for impersonation, fraud and misinformation.

xAI's launch therefore needs to be read in two ways: as a product move in the voice API market, and as another test of whether AI labs can make cloning convenient without making consent optional.

What xAI launched

The new feature is officially framed as Custom Voices and Voice Library. Custom Voices lets a team create its own cloned voice from a reference recording. Voice Library gives the same team a catalog for browsing, previewing and managing both custom and built-in voices in the xAI console.

xAI says a custom voice can be created in under two minutes and then used wherever a built-in voice works. In practice, that means a developer can pass a voice_id into a TTS request, use it with streaming TTS, or connect it to the real-time Voice Agent API.

That matters because the voice is not a separate toy feature. It becomes an addressable resource inside the API stack.

For a product team, that can turn voice into something closer to a design system: one approved narrator for educational content, another voice for customer support, a separate character voice for a game, and a recognizable brand voice for live agents.

The library side is also important. Not every team wants to handle consent, recordings and identity checks for custom voices. A catalog of 80+ voices across 28 languages gives builders a faster path when they need variety rather than a clone.

How the cloning flow works

xAI's public announcement emphasizes speed, but the documentation gives a more useful picture of the workflow.

Developers can create a voice in the console by recording natural speech. The docs say reference audio can be up to 120 seconds long, and xAI recommends aiming for 90 to 120 seconds for the best results. Shorter clips are accepted, but clips under 30 seconds may not capture enough vocal detail.
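The length guidance above is easy to check before uploading anything. The following is a minimal sketch, assuming the reference clip is a WAV file (the article does not say which formats xAI accepts); the function names and the category labels are hypothetical and are not part of any xAI SDK.

```python
import io
import wave

# Thresholds taken from the figures quoted in the article.
MAX_SECONDS = 120.0      # documented upper limit for reference audio
RECOMMENDED_MIN = 90.0   # lower end of the recommended 90-120 second range
SHORT_WARNING = 30.0     # below this, the clip may lack vocal detail


def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Return the duration of an in-memory WAV file in seconds."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def assess_reference_length(seconds: float) -> str:
    """Classify a clip length using hypothetical labels for the documented ranges."""
    if seconds > MAX_SECONDS:
        return "too_long"
    if seconds >= RECOMMENDED_MIN:
        return "recommended"
    if seconds >= SHORT_WARNING:
        return "accepted"
    return "likely_too_short"
```

A team could run a check like this in its own recording tool and prompt the speaker to keep going until the clip lands in the recommended range.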

The recording should contain a single speaker, no music, no background voices and as little noise as possible. xAI recommends a quiet room, a decent microphone, a pop filter if available and a recording style that matches the intended use case.

That last point is easy to overlook. If the goal is audiobook narration, record prose with narration pacing. If the goal is customer support, record natural support-style speech. If the reference sounds stiff and scripted, the clone may inherit that delivery.

The model is not only learning the sound of the voice. It is also picking up rhythm, expressiveness, pace and speaking habits.

Where the voice can be used

Once created, a custom voice is available across xAI's voice APIs. The documented paths include POST /v1/tts, WebSocket TTS and the real-time wss://api.x.ai/v1/realtime voice stack.
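To make the REST path concrete, here is a rough sketch of how a request to POST /v1/tts might be assembled. Only the path and the voice_id parameter come from the article. The https://api.x.ai host is inferred from the documented WebSocket URL, and the "text" body field and Bearer-token header are assumptions, so check the official API reference before relying on any of them. The function only builds the request and does not send it.

```python
import json
import urllib.request

# Assumed host, inferred from the documented wss://api.x.ai/v1/realtime URL.
API_BASE = "https://api.x.ai"


def build_tts_request(api_key: str, text: str, voice_id: str) -> urllib.request.Request:
    """Build (but do not send) a TTS request that uses a custom voice."""
    # "text" is a hypothetical field name; voice_id is named in the article.
    payload = {"text": text, "voice_id": voice_id}
    return urllib.request.Request(
        url=f"{API_BASE}/v1/tts",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            # Bearer authentication is an assumption, not confirmed by the article.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Because the voice is referenced by ID, swapping a built-in voice for a custom one should be a one-value change in a request like this rather than a separate integration.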

For ordinary TTS, that means turning text into audio using the custom voice. For streaming, it means delivering audio progressively rather than waiting for a full file. For voice agents, it means giving a conversational AI system a consistent voice during live interactions.

This is where the feature becomes more than a content tool. A cloned or carefully selected voice can become the interface of an agent.

Customer support is the obvious case: a company can give every automated agent the same calm, recognizable delivery. Games and interactive fiction are another strong fit, because character voices can be generated without booking a studio session for every line.

Audiobooks, podcasts, social video and training material are also natural targets. A creator or company can keep a consistent sound across large volumes of scripted content, including multilingual variants.

The language and voice library angle

The headline number is 80+ built-in voices across 28 languages.

The languages xAI lists include Arabic, Danish, German, English, Spanish, Finnish, French, Hindi, Italian, Japanese, Korean, Dutch, Polish, Portuguese, Russian, Swedish, Thai, Turkish, Vietnamese and Chinese.

That range matters because voice AI is often judged first in English, while real products need local languages, accents and use-case-specific tone.

The strongest value may be for teams that want to prototype globally before committing to custom recordings. Instead of creating a cloned voice for every market, they can start with built-in voices, test scripts and only later decide where a custom identity is worth the operational overhead.

For Polish users, the presence of Polish in the library is notable. It does not automatically mean perfect prosody, inflection or local nuance, but it suggests xAI wants Grok Voice to be more than an English-only showcase.

The safety layer

Voice cloning needs a serious safety layer, and xAI is trying to put one directly into the creation process.

Every custom voice goes through a two-stage verification flow. First, the speaker reads a verification phrase aloud. xAI's speech-to-text system transcribes and matches that phrase in real time, which is meant to confirm both intent and presence.

Second, xAI computes speaker embeddings from the verification clip and the full recording. Those embeddings are compared to confirm that the person reading the passphrase is the same speaker as the person in the reference audio.
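xAI has not published how the two embeddings are compared. A common technique for this kind of speaker check is cosine similarity against a threshold, and the sketch below illustrates that general idea only. The threshold value, the function names and the use of plain Python lists as embeddings are all assumptions for illustration, not xAI's implementation.

```python
import math
from typing import Sequence

# Hypothetical cutoff; a real system would tune this on labelled speaker data.
SAME_SPEAKER_THRESHOLD = 0.75


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def same_speaker(
    verification_emb: Sequence[float],
    reference_emb: Sequence[float],
    threshold: float = SAME_SPEAKER_THRESHOLD,
) -> bool:
    """Treat the clips as the same speaker if similarity meets the threshold."""
    return cosine_similarity(verification_emb, reference_emb) >= threshold
```

The threshold is where the trade-off lives: a stricter value rejects more legitimate speakers, while a looser one accepts more impostors. That is why published false-acceptance rates matter.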

The company's position is clear: you should not be able to clone someone else's voice from a pre-existing recording.

That is a strong design goal, but it should not be treated as the end of the safety conversation. xAI has not published detailed false-acceptance rates, spoofing tests or independent red-team results for this system. For high-risk deployments, companies should still build consent records, review policies and abuse monitoring around any voice cloning workflow.

Availability and access limits

There are two practical constraints developers should notice before planning around the feature.

First, Custom Voices is currently available only in the United States, excluding Illinois, according to the xAI documentation. That is likely tied to biometric privacy and voice-rights rules, and it matters for international teams.

Second, console-based voice creation and API-based creation are not identical. xAI says teams can create up to 30 custom voices for free in the console. The POST /v1/custom-voices endpoint is gated to Enterprise teams.

That means many developers can experiment with the console, copy a voice_id, and use that voice in TTS or voice-agent workflows. But fully automated voice creation through the API may require an Enterprise plan.

This is a sensible boundary for a risky capability. It gives developers access without making bulk automated cloning universally available on day one.

Pricing and developer economics

xAI says there is no extra charge for using custom voices with the Text to Speech or Voice Agent APIs. The normal voice API prices apply.

At the time of writing, the xAI pricing page lists realtime voice at $0.05 per minute, which works out to $3.00 per hour. Text to Speech is listed at $15 per 1 million characters. Speech to Text is priced separately, with different rates for REST and streaming.

Those numbers matter because voice agents can become expensive quickly. A text chatbot pays for tokens. A realtime voice agent pays for time, speech generation, speech recognition and often the underlying model work behind the conversation.

The lack of a separate cloning surcharge makes Custom Voices easier to adopt, but it does not make voice applications free. Teams still need to model session length, retry behavior, latency budgets, moderation and storage.
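A back-of-the-envelope model makes those prices easier to reason about. This sketch uses only the two rates quoted above and ignores Speech to Text, underlying model usage and storage, so it gives a lower bound rather than a full bill. The function names and the 30-day month are illustrative assumptions.

```python
# Rates quoted in the article from the xAI pricing page; they may change.
REALTIME_USD_PER_MINUTE = 0.05
TTS_USD_PER_MILLION_CHARS = 15.0


def realtime_cost(minutes: float) -> float:
    """Cost in USD of realtime voice for a given number of minutes."""
    return minutes * REALTIME_USD_PER_MINUTE


def tts_cost(characters: int) -> float:
    """Cost in USD of synthesizing a given number of characters."""
    return characters / 1_000_000 * TTS_USD_PER_MILLION_CHARS


def monthly_realtime_cost(
    sessions_per_day: int, avg_minutes_per_session: float, days: int = 30
) -> float:
    """Rough monthly realtime spend, assuming steady daily traffic."""
    total_minutes = sessions_per_day * avg_minutes_per_session * days
    return realtime_cost(total_minutes)
```

For example, 1,000 sessions a day at an average of four minutes each comes to 120,000 minutes over 30 days, or about $6,000 a month in realtime voice charges alone.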

The big product question is whether a more distinctive voice improves completion, retention or customer satisfaction enough to justify those costs.

Why it matters for the voice AI market

xAI is not entering an empty category. ElevenLabs, OpenAI, Google, Meta research systems, open-source voice models and many smaller audio startups have already made voice cloning a competitive space.

What makes xAI's move interesting is the packaging. It is putting voice cloning, preset voices, TTS, realtime agents and the broader Grok platform into the same API story.

That shifts the competition from "who can clone a voice" to "who can make a complete voice product stack easy to ship."

For developers, the difference is practical. A good clone is only one piece. They also need low latency, streaming, stable pricing, tool calls, memory, safety filters, observability and a way to manage voice assets across a team.

If xAI can make that stack reliable, Custom Voices becomes more than a novelty. It becomes part of how voice agents are branded and deployed.

The risk is not theoretical

The abuse cases around voice cloning are already well understood. Scammers can impersonate family members, executives or public figures. Political audio can be fabricated. Customer support systems can be spoofed. A creator's voice can be copied without permission.

That is why the consent flow matters so much. But even a well-designed system cannot solve the whole ecosystem problem.

A responsible deployment should include written consent, internal approvals for brand voices, clear user disclosure when people are speaking with an AI, logging for voice creation, deletion controls and a process for handling complaints.

For public-facing voice agents, companies should also decide whether the agent may sound like a real employee, a fictional brand character or a clearly synthetic assistant. Those choices change user expectations.

The easier voice cloning becomes, the more governance becomes part of the product surface.

The bottom line

xAI's Custom Voices launch gives developers a fast route to personalized audio inside the Grok API ecosystem. The combination of under-two-minute creation, a large multilingual voice library and compatibility with TTS and realtime voice agents makes it useful for far more than novelty demos.

The catch is that the most important work starts after the clone exists. Teams need to decide whose voice is allowed, where it can be used, how consent is recorded, how users are informed and what happens when a voice is retired.

For builders, xAI's voice cloning is a new creative lever. For companies, it is a brand and trust decision. For the wider AI market, it is another sign that voice agents are moving from experimental demos into production APIs.

(Photo: TStudio_lv / Unsplash, license.)

Tags: #xai #grok #voice-cloning #ai-agents