AI Talking Characters: Voice, Lip Sync and Facial Animation
What Makes a Character Talk
A talking character is the sum of three systems working in tight coordination: audio, visual deformation, and timing. Strip any one of them away and the effect collapses. Audio without matching lip movement feels like a dubbed foreign film. Lip movement without audio is unsettling. And both together without proper timing create an uncanny disconnect that players notice instantly, even if they cannot articulate why.
The audio layer is the voice itself. This can be pre-recorded dialogue performed by voice actors, procedurally generated speech from a text-to-speech engine, or a combination of both where key lines are recorded and ambient chatter is synthesized. Each approach carries different constraints for lip sync. Pre-recorded audio can be analyzed offline, giving you precise phoneme timing data before the game ever ships. Synthesized audio must be analyzed in real time, or the TTS engine must provide timing metadata alongside the audio stream.
The visual layer is the character's face mesh and its ability to deform. In modern 3D pipelines, this deformation happens through morph targets, sometimes called blend shapes. A morph target is a stored copy of the mesh with specific vertices displaced to form a particular shape, like lips pressed together for an "M" sound or a wide open jaw for an "AH" vowel. The engine interpolates between the neutral face and each target, blending multiple shapes simultaneously. A single frame of speech might combine a jaw-open target at 70% influence with a lip-round target at 40% influence and a slight cheek-raise at 15%.
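The blending described above amounts to adding each target's weighted displacement to the neutral mesh. The following is a minimal CPU-side sketch of that sum, for illustration only; engines do this work on the GPU. The flat position arrays and the names WeightedTarget and blendMorphTargets are assumptions made for this example.

```typescript
// Blend a neutral mesh with several morph targets on the CPU.
// Positions are flat [x0, y0, z0, x1, ...] arrays; all names are illustrative.
interface WeightedTarget {
  positions: number[]; // full displaced copy of the mesh, same length as base
  weight: number;      // influence in the range 0..1
}

function blendMorphTargets(base: number[], targets: WeightedTarget[]): number[] {
  const result = base.slice();
  for (const target of targets) {
    for (let i = 0; i < base.length; i++) {
      // Add this target's displacement from neutral, scaled by its weight.
      result[i] += target.weight * (target.positions[i] - base[i]);
    }
  }
  return result;
}
```

With a jaw-open target at weight 0.7 and a lip-round target at weight 0.4, each vertex ends up at the neutral position plus 70% of the jaw displacement plus 40% of the lip displacement.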
The timing layer is the bridge between audio and visuals. It answers the question of which mouth shape should be active at which millisecond. This mapping is not one-to-one. Human speech involves coarticulation, where the shape of the mouth for one sound is influenced by the sounds before and after it. The word "tree" does not produce three distinct mouth shapes in sequence. The lips begin rounding for the "r" while still forming the "t," and the tongue is already moving toward the "ee" position before the "r" finishes. Good lip sync systems model this overlap through interpolation curves and transition timing rather than snapping between discrete shapes.
Together, these three systems create the perception of a character that is actually speaking. The quality bar is lower than you might expect. Players are remarkably forgiving of approximate lip sync as long as the timing is close and the jaw movement roughly matches syllable stress. What breaks immersion is not imperfect mouth shapes but rather visible desynchronization, where the audio says one thing and the face does another at a noticeably different moment.
The Lip Sync Pipeline
Every lip sync implementation, regardless of engine or platform, follows the same fundamental pipeline. Audio enters one end and morph target weights come out the other. Understanding each stage of this pipeline is essential for debugging problems and making informed tradeoffs between quality, performance, and flexibility.
The first stage is audio input. Raw audio arrives as a stream of PCM samples, typically at 16kHz or 44.1kHz sample rate. This audio might come from a file loaded from disk, a network stream from a TTS service, or the output of the Web Audio API. The format matters less than the fact that you need access to the raw waveform data, either as a complete buffer for offline analysis or as a rolling window for real-time processing.
The second stage is phoneme detection. A phoneme is the smallest unit of sound that distinguishes one word from another. English has roughly 44 phonemes. The phoneme detector examines the audio and produces a time-stamped sequence: "AH" from 0ms to 80ms, "L" from 80ms to 120ms, "OH" from 120ms to 200ms, and so on. Some systems skip explicit phoneme detection and work directly with audio features like spectral envelopes or formant frequencies, but the conceptual output is the same, a timeline of speech sounds.
The third stage is viseme mapping. A viseme is the visual counterpart of a phoneme, the mouth shape associated with a particular speech sound. Multiple phonemes can map to the same viseme because some sounds look identical on the face even though they sound different. The sounds "b," "p," and "m" all produce a closed-lip viseme despite being acoustically distinct. A typical viseme set contains 12 to 15 entries, far fewer than the full phoneme inventory. The mapping table converts the phoneme timeline into a viseme timeline.
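A mapping table of this kind can be sketched in a few lines. The example below assumes ARPAbet-style phoneme labels and Oculus-style viseme names, and the table is deliberately partial; the labels your phoneme detector emits may differ. All type and function names are illustrative.

```typescript
interface PhonemeSpan { phoneme: string; startMs: number; endMs: number; }
interface VisemeSpan { viseme: string; startMs: number; endMs: number; }

// Partial, illustrative table: ARPAbet-style phoneme labels to viseme names.
const PHONEME_TO_VISEME: Record<string, string> = {
  P: "PP", B: "PP", M: "PP",
  F: "FF", V: "FF",
  T: "DD", D: "DD",
  K: "kk", G: "kk",
  S: "SS", Z: "SS",
  N: "nn", L: "nn",
  AA: "aa", OW: "oh", UW: "ou",
};

function toVisemeTimeline(phonemes: PhonemeSpan[]): VisemeSpan[] {
  const timeline: VisemeSpan[] = [];
  for (const p of phonemes) {
    // Sounds missing from the table fall back to the neutral mouth.
    const viseme = PHONEME_TO_VISEME[p.phoneme] ?? "sil";
    const last = timeline[timeline.length - 1];
    if (last && last.viseme === viseme && last.endMs === p.startMs) {
      last.endMs = p.endMs; // merge adjacent spans that look identical on the face
    } else {
      timeline.push({ viseme, startMs: p.startMs, endMs: p.endMs });
    }
  }
  return timeline;
}
```

Merging adjacent spans that share a viseme keeps the timeline short, so a "b" followed by an "m" becomes one closed-lip span rather than two.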
The fourth stage is weight calculation. Each viseme corresponds to one or more morph targets on the character mesh, and the system must calculate how strongly each target should be applied at each moment. This is where coarticulation modeling happens. Rather than jumping instantly from one viseme to the next, the system generates smooth interpolation curves. A common approach uses cosine interpolation with a short blend window, typically 60 to 100 milliseconds, so that each viseme fades in while the previous one fades out. More sophisticated systems use dominance functions that give certain visemes priority during blending, ensuring that distinctive shapes like the rounded lips of an "OO" sound are not washed out by adjacent visemes.
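As a rough illustration of the cosine crossfade described above, the sketch below computes the weights of two neighbouring visemes around the boundary between them. It is one simple way to do it, not a full coarticulation model: there is no dominance function, and the blend window is centred on the boundary. The function names and the 80 ms default are assumptions.

```typescript
// Cosine ease from 0 to 1 as x goes from 0 to 1 (clamped outside that range).
function cosineEase(x: number): number {
  const clamped = Math.min(1, Math.max(0, x));
  return (1 - Math.cos(Math.PI * clamped)) / 2;
}

// Weights for the outgoing and incoming viseme near their shared boundary.
function crossfadeAtBoundary(
  outgoing: string,
  incoming: string,
  boundaryMs: number,
  timeMs: number,
  blendWindowMs: number = 80
): Record<string, number> {
  if (outgoing === incoming) return { [incoming]: 1 };
  // progress is 0 at the start of the blend window and 1 at its end.
  const progress = (timeMs - (boundaryMs - blendWindowMs / 2)) / blendWindowMs;
  const incomingWeight = cosineEase(progress);
  return { [outgoing]: 1 - incomingWeight, [incoming]: incomingWeight };
}
```

With an 80 ms window and a boundary at 200 ms, the incoming viseme starts fading in at 160 ms, reaches half weight at 200 ms, and is fully applied at 240 ms.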
The fifth and final stage is morph target application. The calculated weights are applied to the mesh every frame. In Babylon.js this means setting the influence property on each MorphTarget in the MorphTargetManager. In Three.js it means writing to the mesh's morphTargetInfluences array. The GPU handles the vertex blending, interpolating between the base mesh and each active morph target according to the specified weights. This stage runs at the engine's frame rate, typically 60fps, regardless of how frequently the upstream stages update their data.
Understanding Visemes
Visemes are the visual vocabulary of speech. Where phonemes describe how speech sounds, visemes describe how speech looks. The term was coined by combining "visual" with "phoneme," and the concept is central to every lip sync system regardless of complexity. Understanding the standard viseme sets and how they map to morph targets is fundamental knowledge for anyone building talking characters.
The most widely adopted viseme standard in game development comes from the Oculus (now Meta) Lipsync SDK, which defines 15 visemes. This set has become a de facto standard because it strikes a practical balance between visual fidelity and implementation complexity. The 15 visemes are: sil (silence, neutral mouth), PP (lips pressed, as in "p" or "b" or "m"), FF (lower lip under upper teeth, as in "f" or "v"), TH (tongue between teeth, as in "th"), DD (tongue behind upper teeth, as in "d" or "t"), kk (tongue raised at back, as in "k" or "g"), CH (lips slightly open and rounded, as in "ch" or "j"), SS (teeth nearly closed, as in "s" or "z"), nn (lips slightly open with tongue behind teeth, as in "n" or "l"), RR (lips slightly rounded, as in "r"), aa (jaw dropped wide, as in "ah"), E (lips pulled back, as in "eh" or "ae"), ih (slight jaw drop, as in "ih"), oh (lips rounded medium, as in "oh"), and ou (lips tightly rounded, as in "oo").
Each of these visemes maps to a specific morph target on the character mesh. When a character model is created in Blender, Maya, or another 3D tool, the artist sculpts each of these 15 shapes as separate blend shape targets. The targets are named to match the viseme identifiers, and they export with the model in the glTF or GLB format. A well-authored model has clean topology around the mouth area, with enough edge loops to deform smoothly between all 15 positions without creasing or intersection artifacts.
The ARKit blend shape standard, used by Apple's face tracking and adopted by Ready Player Me avatars, takes a different approach. Instead of speech-specific visemes, ARKit defines 52 blend shapes that cover the entire face with fine-grained anatomical control. These include jawOpen, jawForward, jawLeft, jawRight, mouthClose, mouthFunnel, mouthPucker, mouthLeft, mouthRight, mouthSmileLeft, mouthSmileRight, mouthFrownLeft, mouthFrownRight, mouthDimpleLeft, mouthDimpleRight, mouthStretchLeft, mouthStretchRight, mouthRollLower, mouthRollUpper, mouthShrugLower, mouthShrugUpper, mouthPressLeft, mouthPressRight, mouthLowerDownLeft, mouthLowerDownRight, mouthUpperUpLeft, mouthUpperUpRight, and more. To do lip sync with ARKit shapes, you build a mapping table that converts each viseme into a combination of ARKit blend shapes with specific weights.
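A viseme-to-ARKit mapping table might look like the sketch below. The blend shape names are taken from the ARKit list above, but every weight is an illustrative starting value rather than a reference mapping, and only a few visemes are filled in. Expect to tune these against your own character.

```typescript
type ShapeWeights = Record<string, number>;

// Illustrative starting values only; a real table covers every viseme.
const VISEME_TO_ARKIT: Record<string, ShapeWeights> = {
  sil: {},
  PP: { mouthClose: 0.3, mouthPressLeft: 0.6, mouthPressRight: 0.6 },
  aa: { jawOpen: 0.7 },
  oh: { jawOpen: 0.4, mouthFunnel: 0.6 },
  ou: { jawOpen: 0.2, mouthPucker: 0.8 },
};

// Convert blended viseme weights into ARKit blend shape weights.
function visemesToArkit(visemeWeights: ShapeWeights): ShapeWeights {
  const out: ShapeWeights = {};
  for (const viseme of Object.keys(visemeWeights)) {
    const recipe = VISEME_TO_ARKIT[viseme] ?? {};
    for (const shape of Object.keys(recipe)) {
      // Contributions from overlapping visemes add up, clamped to 1.
      const contribution = recipe[shape] * visemeWeights[viseme];
      out[shape] = Math.min(1, (out[shape] ?? 0) + contribution);
    }
  }
  return out;
}
```

Because the table is plain data, an artist can adjust the recipes without touching the animation code.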
Choosing between the Oculus 15-viseme set and ARKit 52-blend-shape set depends on your character pipeline. If your models come from Ready Player Me or another ARKit-compatible source, you work with ARKit shapes and build the viseme-to-blend-shape mapping yourself. If you control the character creation pipeline, authoring dedicated viseme morph targets is more efficient, requires fewer active blend shapes per frame, and gives artists direct control over each speech pose. Many production pipelines use a hybrid, with dedicated viseme targets for the mouth and ARKit-style shapes for the rest of the face.
Lip Sync in Web Game Engines
Web-based game engines handle morph targets through their respective mesh systems, and the approach differs meaningfully between Babylon.js and Three.js. Both engines load morph targets from glTF/GLB files, but their APIs for manipulating blend shape weights at runtime follow different patterns.
In Babylon.js, morph targets are managed through the MorphTargetManager class. When you load a GLB model containing blend shapes, the engine automatically creates a MorphTargetManager and populates it with MorphTarget instances, one for each blend shape in the file. Each MorphTarget has an influence property that ranges from 0.0 (fully off) to 1.0 (fully active). To animate lip sync, you update these influence values every frame based on the current viseme weights from your phoneme analysis pipeline. You can access targets by index or iterate through the manager's targets list to find them by name. The Babylon.js morph target system performs blending on the GPU, which means even models with many active blend shapes maintain good frame rates. A typical lip sync implementation retrieves the manager from the mesh, builds a lookup table mapping viseme names to target indices during initialization, then sets influence values in the scene's render loop.
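The initialization and per-frame steps described above could be sketched as follows. To keep the example self-contained, it declares small structural interfaces that mirror the parts of Babylon.js it relies on (numTargets, getTarget, name, influence) instead of importing the engine; the function names are hypothetical.

```typescript
// Structural stand-ins for the Babylon.js members used here, so the sketch
// type-checks without the engine installed.
interface MorphTargetLike { name: string; influence: number; }
interface MorphTargetManagerLike {
  numTargets: number;
  getTarget(index: number): MorphTargetLike;
}

// Run once after loading: map each target's name to the target itself.
function buildTargetLookup(manager: MorphTargetManagerLike): Record<string, MorphTargetLike> {
  const lookup: Record<string, MorphTargetLike> = {};
  for (let i = 0; i < manager.numTargets; i++) {
    const target = manager.getTarget(i);
    lookup[target.name] = target;
  }
  return lookup;
}

// Run every frame: apply the given weights and reset all other targets to 0.
function applyInfluences(
  lookup: Record<string, MorphTargetLike>,
  weights: Record<string, number>
): void {
  for (const targetName of Object.keys(lookup)) {
    lookup[targetName].influence = weights[targetName] ?? 0;
  }
}
```

In a real scene you would pass mesh.morphTargetManager to buildTargetLookup after the model loads, then call applyInfluences from a callback registered with scene.onBeforeRenderObservable. Note that this sketch resets every target it knows about, so if blinks or expressions share the same manager, restrict the lookup to the viseme targets.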
In Three.js, morph targets work through the mesh's morphTargetInfluences array and morphTargetDictionary object. The morphTargetDictionary maps blend shape names to array indices, and morphTargetInfluences is a flat array of numbers where each element corresponds to one blend shape's weight. Setting morphTargetInfluences[index] to a value between 0.0 and 1.0 activates that blend shape. Three.js also computes morph targets on the GPU through its shader system. The main practical difference from Babylon.js is that you work with array indices rather than named objects, which means the initialization step of mapping viseme names to indices through morphTargetDictionary is essential. React Three Fiber wraps this system with declarative morph target props, and the useFrame hook is the standard place to update influences each frame.
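The name-to-index step for Three.js can be sketched like this. The interface below mirrors only the two mesh properties named above so that the example runs without importing three; setMeshVisemeWeights is a hypothetical helper name.

```typescript
// Structural stand-in for the two THREE.Mesh properties used here.
interface MorphMeshLike {
  morphTargetDictionary?: Record<string, number>;
  morphTargetInfluences?: number[];
}

// Write viseme weights by name; blend shapes not listed are reset to 0.
function setMeshVisemeWeights(mesh: MorphMeshLike, weights: Record<string, number>): void {
  const dictionary = mesh.morphTargetDictionary;
  const influences = mesh.morphTargetInfluences;
  if (!dictionary || !influences) return; // mesh has no morph targets
  for (const shapeName of Object.keys(dictionary)) {
    influences[dictionary[shapeName]] = weights[shapeName] ?? 0;
  }
}
```

A call such as setMeshVisemeWeights(faceMesh, { aa: 0.7, oh: 0.2 }) would sit in the render loop, or inside useFrame when using React Three Fiber.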
Both engines support the glTF 2.0 morph target specification, which means models created in Blender or any compliant DCC tool export blend shapes that both engines can read. The typical workflow is to create your character in Blender, add shape keys for each viseme, export as GLB, load the GLB in your engine, build the name-to-index mapping, then drive the weights from your audio analysis code. Ready Player Me avatars use this exact pipeline, shipping GLB models with ARKit-compatible blend shapes that both Babylon.js and Three.js can consume without modification.
Voice Synthesis for Game Characters
Voice synthesis, or text-to-speech, has reached a quality threshold where synthesized voices are viable for game characters beyond placeholder prototyping. The web platform offers several paths to voice synthesis, each with different quality levels, latency characteristics, and lip sync integration capabilities.
The Web Speech API is the zero-cost starting point. Built into modern browsers, it provides the SpeechSynthesis interface with a simple speechSynthesis.speak() call. Quality varies dramatically between browsers and operating systems, because the available voices depend on the platform. Desktop browsers often expose natural-sounding voices that work well for short utterances, while some mobile configurations fall back to older synthesis that sounds robotic. The major limitation for lip sync is that the Web Speech API provides no phoneme or viseme timing data. You get boundary events for word and sentence boundaries, but nothing at the phoneme level. The API also does not hand the synthesized audio to the page, so there is no buffer to run a separate analysis pass on. In practice this leaves approximations: estimating visemes from the text and pacing them with the boundary events, or simple timed jaw-flapping while the utterance is speaking.
Azure AI Speech (formerly Azure Cognitive Services Speech) is the strongest option for integrated lip sync. Microsoft's service provides a viseme output stream alongside the audio, delivering timestamped viseme events. Each event carries a viseme ID from Azure's own set of 22 mouth positions (IDs 0 to 21), and the service can alternatively emit SVG mouth shapes or 3D blend shape weights. The IDs do not match the Oculus 15-viseme set one-to-one, so you add a small lookup table from Azure IDs to your own viseme targets. You can request visemes in real time via WebSocket or as part of a batch synthesis response. The viseme data arrives as a sequence of objects containing the viseme ID and the audio offset. The Speech SDK reports that offset in 100-nanosecond ticks, so divide by 10,000 to get milliseconds before feeding it into your morph target system. Azure also supports SSML markup for controlling pronunciation, emphasis, and pausing, which gives you fine-grained authorial control over how characters deliver their lines.
ElevenLabs produces the most natural-sounding voices currently available for web applications. Their API returns high-quality audio with optional word-level and character-level timestamps through the alignment endpoint. However, ElevenLabs does not provide viseme data directly. To do lip sync with ElevenLabs, you use the timestamps to align a separate phoneme analysis pass over the audio, or you use a client-side library like Rhubarb (compiled to WebAssembly) to extract phonemes from the audio buffer after it arrives. The quality of ElevenLabs voices is high enough that many developers consider the extra lip sync step worthwhile.
OpenAI's TTS API delivers good voice quality with very low latency through its streaming endpoint. Like ElevenLabs, it does not provide viseme or phoneme timing data. The voices sound natural but somewhat generic compared to ElevenLabs' voice cloning capabilities. For lip sync, you face the same challenge of running client-side phoneme analysis on the returned audio. OpenAI's advantage is its tight integration with GPT models, making it straightforward to pipe LLM-generated dialogue directly into voice synthesis within a single API ecosystem.
The practical reality is that, among the providers covered here, Azure is the only one that hands you viseme data ready to use. Every other option requires a secondary analysis step. This is a meaningful architectural decision because client-side phoneme analysis adds latency, CPU load, and complexity. If lip sync quality is a priority and you want the simplest integration path, Azure is the pragmatic choice despite its voices being slightly less natural than ElevenLabs.
Facial Animation Beyond the Mouth
Lip sync handles the mouth, but a face that only moves its lips looks like a ventriloquist's dummy. Convincing talking characters need the full face to participate in the performance. Eyes, eyebrows, cheeks, and the forehead all move during natural speech, and their absence is as noticeable as bad lip sync.
Blink cycles are the most critical non-mouth animation. Humans blink every 3 to 5 seconds on average, but the rate varies with cognitive load, emotional state, and social context. A character that never blinks looks dead. A character that blinks at a perfectly regular interval looks robotic. The standard approach is a procedural blink system that uses a randomized timer, triggering a blink every 2 to 6 seconds with slight variation. Each blink is a quick close-open cycle lasting about 150 to 400 milliseconds, driven by a blend shape that closes both eyelids. Some implementations add a secondary "half blink" that occurs more frequently and looks like a natural micro-adjustment.
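A procedural blink timer along these lines might be sketched as below. The 2 to 6 second interval comes from the text above; the 200 ms blink duration, the triangle-shaped close-open profile, and the class name are assumptions chosen for illustration. The random source is injectable so the behaviour can be tested deterministically.

```typescript
class BlinkController {
  private random: () => number;
  private blinkDurationMs: number;
  private timeUntilBlinkMs: number;
  private blinkElapsedMs = -1; // negative means no blink in progress

  constructor(random: () => number = Math.random, blinkDurationMs: number = 200) {
    this.random = random;
    this.blinkDurationMs = blinkDurationMs;
    this.timeUntilBlinkMs = this.nextIntervalMs();
  }

  private nextIntervalMs(): number {
    return 2000 + this.random() * 4000; // 2 to 6 seconds between blinks
  }

  // Advance by deltaMs and return the eyelid-close weight in the range 0..1.
  update(deltaMs: number): number {
    if (this.blinkElapsedMs < 0) {
      this.timeUntilBlinkMs -= deltaMs;
      if (this.timeUntilBlinkMs > 0) return 0;
      this.blinkElapsedMs = 0; // timer expired: start a blink
    } else {
      this.blinkElapsedMs += deltaMs;
    }
    if (this.blinkElapsedMs >= this.blinkDurationMs) {
      this.blinkElapsedMs = -1;
      this.timeUntilBlinkMs = this.nextIntervalMs();
      return 0;
    }
    // Triangle profile: lids close during the first half, open during the second.
    const phase = this.blinkElapsedMs / this.blinkDurationMs;
    return 1 - Math.abs(2 * phase - 1);
  }
}
```

Each frame, the returned value would be written to the blend shape (or pair of blend shapes) that closes the eyelids.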
Eye gaze gives characters the appearance of attention and awareness. In a dialogue scene, the character should look at the player or the speaking character. Gaze is typically implemented by rotating eye bone transforms toward a target point in world space, with small saccadic movements added procedurally. Saccades are the rapid, jittering eye movements that happen between fixation points, and they are a strong cue for lifelike behavior. A character whose eyes are perfectly locked on a target without any micro-movement reads as mechanical. Adding random horizontal and vertical offsets of 1 to 3 degrees at intervals of 200 to 500 milliseconds creates the appearance of natural visual attention.
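The saccadic offsets described above can be generated with a small timer. This sketch picks a new yaw and pitch offset of 1 to 3 degrees every 200 to 500 milliseconds, as in the text; how the offset is added to the eye bone rotation depends on your rig and is left out. The class name and the injectable random source are assumptions.

```typescript
interface GazeOffset { yawDeg: number; pitchDeg: number; }

class SaccadeJitter {
  private random: () => number;
  private offset: GazeOffset = { yawDeg: 0, pitchDeg: 0 };
  private timeLeftMs = 0;

  constructor(random: () => number = Math.random) {
    this.random = random;
  }

  // Advance by deltaMs and return the offset to add to the base gaze direction.
  update(deltaMs: number): GazeOffset {
    this.timeLeftMs -= deltaMs;
    if (this.timeLeftMs <= 0) {
      this.timeLeftMs = 200 + this.random() * 300; // hold for 200 to 500 ms
      this.offset = { yawDeg: this.signedAmplitude(), pitchDeg: this.signedAmplitude() };
    }
    return this.offset;
  }

  private signedAmplitude(): number {
    const magnitude = 1 + this.random() * 2; // 1 to 3 degrees
    return this.random() < 0.5 ? -magnitude : magnitude;
  }
}
```

The offset is held constant between jumps, which matches the fixate-then-jump character of saccades better than continuous noise would.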
Eyebrow animation carries emotional weight during speech. Raised eyebrows signal surprise, emphasis, or questioning. Furrowed brows indicate concern, anger, or concentration. During normal conversation, eyebrows subtly track speech prosody, rising slightly on stressed syllables and emphasized words. This can be driven procedurally by analyzing the pitch contour of the audio, raising the brow blend shapes when the fundamental frequency increases. Even a simple correlation between audio pitch and eyebrow height adds noticeable life to a talking character.
Cheek and nostril movements are secondary details that compound into realism. When the mouth opens wide for a vowel, the cheeks pull slightly. During an emphatic consonant, the nostrils may flare subtly. These micro-expressions are typically handled through blend shape coupling, where activating a mouth viseme at high influence automatically triggers a small amount of the associated cheek or nose morph target. This coupling is set up as a data table rather than code logic, making it tunable by artists without programmer involvement.
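A coupling table of this kind could be expressed as plain data plus one small function. In the sketch below the target names cheekPull and nostrilFlare, the thresholds, and the gains are all hypothetical placeholders; substitute the morph target names your character actually exports.

```typescript
interface CouplingRule {
  source: string;    // viseme that drives the secondary motion
  target: string;    // secondary morph target to nudge
  threshold: number; // source weight above which coupling starts
  gain: number;      // target weight reached when the source is at 1
}

// Hypothetical rules; in practice this table is authored by artists.
const FACE_COUPLING_RULES: CouplingRule[] = [
  { source: "aa", target: "cheekPull", threshold: 0.5, gain: 0.2 },
  { source: "PP", target: "nostrilFlare", threshold: 0.6, gain: 0.1 },
];

function applyCoupling(
  weights: Record<string, number>,
  rules: CouplingRule[]
): Record<string, number> {
  const out: Record<string, number> = {};
  for (const key of Object.keys(weights)) out[key] = weights[key];
  for (const rule of rules) {
    const level = weights[rule.source] ?? 0;
    if (level <= rule.threshold) continue;
    // Scale from 0 at the threshold up to gain at full source weight.
    const excess = (level - rule.threshold) / (1 - rule.threshold);
    out[rule.target] = Math.min(1, (out[rule.target] ?? 0) + rule.gain * excess);
  }
  return out;
}
```

The function returns a new weight map, so the lip sync weights themselves stay untouched and the coupling layer can be toggled off for lower quality tiers.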
The ARKit blend shape standard makes full facial animation more accessible because it provides separate controls for each anatomical region. With 52 blend shapes covering the jaw, lips, cheeks, nose, eyes, eyebrows, and tongue, you can build layered animation systems where each system (lip sync, blink, gaze, emotion) writes to its own subset of blend shapes without interference. This separation of concerns is a significant architectural advantage over simpler viseme-only morph target sets that only cover the mouth.
Real-Time vs Pre-Baked Approaches
The decision between real-time and pre-baked lip sync shapes the entire architecture of your talking character system. Neither approach is universally better. Each has clear advantages that make it the right choice for specific use cases.
Pre-baked lip sync means analyzing your audio files offline, before the game ships, and storing the resulting viseme timelines as data files alongside the audio. At runtime, the engine reads the viseme data and plays it back in sync with the audio, with no analysis happening on the client. This is the standard approach for games with fully voice-acted dialogue where all lines are known at build time. The advantages are substantial: zero runtime CPU cost for phoneme analysis, timing that can be reviewed and corrected by a human before release, and the ability to hand-tune individual viseme sequences where the automatic analysis produces poor results. Tools like Rhubarb Lip Sync are designed for this workflow, processing WAV files and outputting JSON timelines that you bundle with your game.
Real-time lip sync means analyzing audio as it plays, extracting phonemes and computing viseme weights on the fly. This is required when the audio is not known in advance, specifically when using text-to-speech to voice dynamically generated dialogue from an LLM. The audio arrives as a stream, and you must begin animating the face before the full utterance has finished generating. Real-time analysis adds CPU load, introduces latency between audio and visual, and produces less consistent results than offline tools. The compensating advantage is flexibility. Any text can become spoken dialogue without a pre-processing step, enabling truly dynamic conversations.
Hybrid approaches combine both strategies. Key storyline dialogue is pre-recorded and pre-baked for maximum quality. Ambient NPC chatter, procedural barks, and AI-driven conversations use real-time synthesis and analysis. This is the pragmatic choice for games that want polished narrative moments alongside dynamic world interactions. The implementation requires your animation system to accept viseme data from either source, a pre-loaded timeline or a real-time stream, and apply it through the same morph target pipeline.
Audio amplitude fallback is the simplest form of real-time lip sync. Instead of detecting phonemes, you measure the volume of the audio signal each frame and map it to jaw openness. Loud audio opens the jaw, quiet audio closes it. This produces crude "jaw flapping" that looks passable from a distance or for minor characters. Many games use amplitude-based lip sync for background NPCs while reserving full phoneme-based sync for characters in close-up dialogue scenes. The implementation is trivial, requiring only an AnalyserNode from the Web Audio API and a single morph target for jaw open.
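The amplitude-to-jaw mapping can be sketched as a root-mean-square measurement over the current window of samples. The function name and the gain of 4 are assumptions; the gain in particular needs tuning per voice, since speech rarely reaches full scale.

```typescript
// Map a window of time-domain samples (each in -1..1) to a jaw-open weight.
function jawOpenFromSamples(samples: Float32Array, gain: number = 4): number {
  if (samples.length === 0) return 0;
  let sumOfSquares = 0;
  for (let i = 0; i < samples.length; i++) {
    sumOfSquares += samples[i] * samples[i];
  }
  const rms = Math.sqrt(sumOfSquares / samples.length);
  return Math.min(1, rms * gain); // clamp so loud peaks do not over-open the jaw
}
```

In the browser, the samples would come from an AnalyserNode: allocate a Float32Array of length analyser.fftSize once, call analyser.getFloatTimeDomainData(buffer) each frame, and pass the buffer to this function. Smoothing the result over a few frames helps avoid jitter.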
Building the Full Talking Character Pipeline
A complete talking character pipeline in a web game connects several systems end to end. The architecture looks different depending on whether you are using pre-recorded dialogue or AI-generated speech, but the core data flow follows the same pattern: text becomes audio, audio becomes timing data, timing data becomes morph target weights, and weights become visible animation.
For the AI-driven path, the pipeline starts with player input. The player types or speaks a message, which is sent to a large language model (LLM) such as GPT-4 or Claude. The LLM generates a response as text. This text is sent to a TTS service, which returns an audio stream and, if the service supports it, phoneme or viseme timing data. The audio is played through the Web Audio API, and the timing data drives the morph target system. If the TTS service does not provide timing data, a client-side phoneme detector analyzes the audio buffer and produces the timing information locally.
Latency management is the central challenge in AI-driven pipelines. The player sends a message and expects a response. The LLM takes some time to generate text, the TTS service takes some time to synthesize audio, and the audio must buffer enough to begin playback. Without optimization, the total delay can reach several seconds, which feels unresponsive. The key optimization is streaming at every stage. The LLM streams tokens as they are generated. As soon as a complete sentence arrives, it is sent to TTS while the LLM continues generating. The TTS service streams audio chunks as they are synthesized. The client begins playback and lip sync animation as soon as the first chunk arrives, while subsequent chunks are still in transit. This pipelining can reduce perceived latency from seconds to hundreds of milliseconds.
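The sentence-level hand-off between the LLM stream and the TTS request can be sketched as a small buffer that emits text once a sentence looks complete. This is a simplified illustration: it treats terminal punctuation followed by whitespace as a sentence end, which mis-splits abbreviations such as "Dr. Smith". The class name is hypothetical.

```typescript
class SentenceChunker {
  private buffer = "";

  // Add a streamed token; returns any sentences that are now complete.
  push(token: string): string[] {
    this.buffer += token;
    const sentences: string[] = [];
    // Complete means: terminal punctuation followed by whitespace.
    const boundary = /[.!?]+\s+/;
    let match = boundary.exec(this.buffer);
    while (match !== null) {
      const end = match.index + match[0].length;
      sentences.push(this.buffer.slice(0, end).trim());
      this.buffer = this.buffer.slice(end);
      match = boundary.exec(this.buffer);
    }
    return sentences;
  }

  // Call when the stream ends to retrieve any trailing text.
  flush(): string {
    const rest = this.buffer.trim();
    this.buffer = "";
    return rest;
  }
}
```

Each sentence returned by push can be sent to the TTS service immediately while the LLM keeps generating, and flush picks up the final sentence, which has no trailing whitespace to mark its end.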
For the pre-recorded path, the pipeline is simpler but requires more upfront work. Dialogue lines are recorded by voice actors and processed through an offline tool like Rhubarb Lip Sync, which outputs a JSON file containing timestamped viseme sequences for each audio clip. These JSON files are loaded alongside the audio at runtime. When a dialogue line plays, the engine reads the corresponding viseme timeline and applies weights to the morph targets in sync with the audio playback position. The dialogue system triggers lines based on game events, quest state, or proximity, and the lip sync follows automatically.
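Looking up the active mouth shape for the current playback position can be sketched as a binary search over the cue list. The example assumes Rhubarb's JSON layout, where a mouthCues array holds entries with start and end times in seconds and a value naming the mouth shape, with "X" as the idle shape. Rhubarb's shape letters are its own set, so a further table from those letters to your morph targets is still needed. The function name is illustrative.

```typescript
interface MouthCue { start: number; end: number; value: string; }

// Return the mouth shape active at timeSec, or the idle shape if none is.
// Assumes cues are sorted by start time and do not overlap.
function mouthShapeAt(cues: MouthCue[], timeSec: number): string {
  let low = 0;
  let high = cues.length - 1;
  while (low <= high) {
    const mid = (low + high) >> 1;
    const cue = cues[mid];
    if (timeSec < cue.start) {
      high = mid - 1;
    } else if (timeSec >= cue.end) {
      low = mid + 1;
    } else {
      return cue.value;
    }
  }
  return "X";
}
```

Driving the lookup from the audio clock (for example the audio element's currentTime) rather than from accumulated frame time keeps the mouth locked to the sound even when frames are dropped.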
The morph target animation loop runs every frame regardless of the source. It maintains a current viseme state consisting of weights for each morph target, and a target viseme state derived from the phoneme timeline. Each frame, it interpolates the current state toward the target using a smoothing function, typically lerp with a blend factor tuned for natural-looking transitions. The smoothing prevents harsh pops between visemes and handles timing imprecision in the upstream data. A blend factor of 0.3 to 0.5 per frame at 60fps produces natural-looking transitions for most speech rates. Because a fixed per-frame factor makes transitions slower at lower frame rates, scale it by the frame's delta time if your game does not hold a steady 60fps.
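One way to keep the smoothing above consistent across frame rates is to convert the per-frame factor into an exponential decay based on elapsed time. The sketch below is an assumption about how you might structure it, not a prescribed method; smoothToward and factorAt60Fps are illustrative names.

```typescript
// Move every weight in current toward target, independent of frame rate.
function smoothToward(
  current: Record<string, number>,
  target: Record<string, number>,
  deltaSec: number,
  factorAt60Fps: number = 0.4
): Record<string, number> {
  // Equivalent to applying factorAt60Fps once per 1/60 s of elapsed time.
  const alpha = 1 - Math.pow(1 - factorAt60Fps, deltaSec * 60);
  const next: Record<string, number> = {};
  const names = Object.keys(current).concat(Object.keys(target));
  for (const shapeName of names) {
    const from = current[shapeName] ?? 0;
    const to = target[shapeName] ?? 0; // shapes absent from target decay to 0
    next[shapeName] = from + (to - from) * alpha;
  }
  return next;
}
```

At exactly 60fps this reduces to a plain lerp with the chosen factor. At 30fps each step covers the same distance that two 60fps steps would, so the visible transition time stays the same.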
Performance and Optimization
Morph targets are computed on the GPU in both Babylon.js and Three.js, which means the per-frame cost of applying blend shapes is relatively low compared to CPU-side skeletal animation. However, the total cost scales with the number of active morph targets and the vertex count of the mesh. A character face with 5000 vertices and 15 active morph targets is cheap. A character with 50000 vertices and 52 active ARKit blend shapes is measurably more expensive, particularly on mobile GPUs.
The first optimization lever is limiting active morph targets. Most lip sync frames only need 2 to 4 visemes active simultaneously. Rather than setting all 15 (or 52) morph target weights every frame, only write to targets whose weight changed since the last frame. The engines may skip some redundant GPU work on their own, but avoiding unnecessary property writes still reduces JavaScript overhead in the animation loop.
The second lever is mesh complexity around the mouth. A morph target stores a delta (the difference between the base mesh and the target position) for every vertex of the mesh, not just for the vertices that move. If your character mesh has dense geometry across the entire head but only the mouth area deforms for lip sync, the GPU is processing deltas for many vertices that are always zero. Some pipelines optimize this by splitting the head mesh into a static portion and a deformable face portion, applying morph targets only to the face submesh. This reduces GPU work roughly in proportion to the vertex reduction.
Audio analysis, whether for real-time phoneme detection or simple amplitude measurement, should run on a separate thread when possible. The Web Audio API's AnalyserNode runs on the audio thread, so amplitude data is essentially free. More complex phoneme analysis using a WebAssembly port of Rhubarb or a custom neural network should run in a Web Worker to avoid blocking the main thread and causing frame drops. The worker receives audio buffers via transferable objects (zero-copy), processes them, and posts back viseme weights that the main thread applies during the next animation frame.
On mobile browsers, morph target performance is more constrained. Mobile GPUs have lower fill rate, fewer shader cores, and tighter thermal limits. The practical ceiling is around 8 to 12 active morph targets on mid-range mobile devices before frame rate impact becomes noticeable. This argues for using the 15-viseme set rather than the 52-shape ARKit set on mobile, and for limiting concurrent facial animations. If your game runs on both desktop and mobile, a quality tier system that adjusts the morph target count based on device capability is a worthwhile investment.
Memory is the other constraint. Each morph target stores a delta buffer proportional to the vertex count of the mesh. For a 10000-vertex face mesh with 52 morph targets, the position deltas alone occupy 10000 vertices times 3 floats times 4 bytes times 52 targets, which is 6,240,000 bytes, or roughly 6.2MB. With the 15-viseme set, that drops to roughly 1.8MB. Both figures double if each target also stores normal deltas, and they multiply again with every distinct character on screen. On memory-constrained mobile devices, this difference adds up. Compressing morph target data using quantization, a feature supported by the glTF KHR_mesh_quantization extension, can reduce this footprint by 50 to 75 percent with minimal visual quality loss.