Character AI researchers present OVI, a unified generative framework for synchronized audio-video content that employs a symmetric twin-backbone Diffusion Transformer with blockwise cross-modal fusion. The model earns clear human preference over existing open-source systems for combined audio and video quality, along with improved audio-video synchronization.
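As a rough illustration of what blockwise cross-modal fusion between twin backbones can look like, the PyTorch sketch below runs a video stream and an audio stream through parallel transformer blocks and exchanges information via cross-attention after each block. All class and module names here are hypothetical; this is a sketch of the general idea, not the OVI implementation.

```python
# Minimal sketch of blockwise cross-modal fusion between twin DiT backbones.
# Names and dimensions are illustrative placeholders.
import torch
import torch.nn as nn

class FusedTwinBlock(nn.Module):
    """One layer of a symmetric twin-backbone model: each modality runs its own
    transformer block, then the two streams exchange information via cross-attention."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.video_block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.audio_block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)  # video queries attend to audio
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)  # audio queries attend to video

    def forward(self, v: torch.Tensor, a: torch.Tensor):
        v = self.video_block(v)
        a = self.audio_block(a)
        v_fused = v + self.a2v(v, a, a, need_weights=False)[0]
        a_fused = a + self.v2a(a, v, v, need_weights=False)[0]
        return v_fused, a_fused

# Fusing at every block keeps the two streams aligned layer by layer.
v = torch.randn(2, 256, 512)   # (batch, video tokens, dim)
a = torch.randn(2, 128, 512)   # (batch, audio tokens, dim)
v, a = FusedTwinBlock(512)(v, a)
```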
Meta AI researchers developed the Massively Multilingual Speech (MMS) project, which extends automatic speech recognition, language identification, and text-to-speech technologies to over 1,000 languages. This involved creating a 44.7K-hour labeled dataset from religious texts and resulted in ASR models that support 1,107 languages and outperform Whisper (large-v2) on FLEURS-54 with a 58% relative reduction in word error rate.
Character AI researchers develop TalkingMachines, a framework that adapts large video foundation models for real-time, audio-driven character animation. It transforms WAN2.1's bidirectional architecture into an autoregressive system through asymmetric knowledge distillation, achieving real-time performance (meeting TTBC latency thresholds) with only 2 neural function evaluations versus the original 24. Infinite video streaming is enabled by a sparse causal attention pattern in which each 3-frame chunk attends only to itself, the previous chunk, and the initial reference frame. System optimizations, including Score-VAE disaggregation across separate GPUs and CUDA-stream overlapping, further reduce inference bottlenecks for FaceTime-style AI character interactions.
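The sparse causal attention pattern described above can be made concrete with a small mask-building sketch, assuming for simplicity one token per frame (real models use many tokens per frame); the function name and defaults are illustrative and not taken from the TalkingMachines code.

```python
# Sketch of a chunk-wise sparse attention mask: each chunk of 3 frames may attend
# to itself, the immediately preceding chunk, and the initial reference frame(s).
import torch

def sparse_causal_mask(num_frames: int, chunk: int = 3, ref_frames: int = 1) -> torch.Tensor:
    """Boolean mask (True = may attend), built frame by frame."""
    mask = torch.zeros(num_frames, num_frames, dtype=torch.bool)
    for f in range(num_frames):
        c = f // chunk
        start = max(0, (c - 1) * chunk)   # start of the previous chunk
        end = (c + 1) * chunk             # end of the current chunk
        mask[f, start:end] = True
        mask[f, :ref_frames] = True       # always attend to the reference frame
    return mask

# 9 frames = 3 chunks of 3; note the block-banded structure plus the first column.
print(sparse_causal_mask(9).int())
```

Because each chunk only depends on a fixed, local window plus the reference frame, the attention cost per generated chunk stays constant, which is what makes indefinitely long streaming feasible.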
Researchers developed Direct Principle Feedback (DPF), a simplified AI feedback method, to enable large language models to reliably avoid specific "forbidden" topics when prompted. This approach, trained on a synthetically generated dataset, achieved avoidance rates comparable to GPT-4 while maintaining general model performance, addressing a common failure mode where LLMs paradoxically mention what they are told to avoid.
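DPF is described as a simplified AI-feedback method trained on synthetic pairs; one common way to train on such pairs is a DPO-style preference loss that favors a revised response avoiding the forbidden topic over the original response that mentions it. The sketch below shows such a loss purely as an illustration of that general recipe under this assumption; it is not the authors' training code, and all names are placeholders.

```python
# Hedged sketch: a DPO-style preference loss over (avoids-topic, mentions-topic) pairs.
import torch
import torch.nn.functional as F

def preference_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta: float = 0.1):
    """logp_* are summed token log-probabilities of each full response under the
    policy / frozen reference model; 'chosen' avoids the forbidden topic,
    'rejected' mentions it."""
    logits = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    return -F.logsigmoid(logits).mean()

# Toy usage with made-up log-probabilities for a batch of 4 pairs.
loss = preference_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
```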
This work introduces Sample-Efficient Speech Diffusion (SESD), an algorithm for effective speech synthesis in modest data regimes through latent diffusion. It is built on a novel diffusion architecture, the U-Audio Transformer (U-AT), which efficiently scales to long sequences and operates in the latent space of a pre-trained audio autoencoder. Conditioned on character-aware language-model representations, SESD achieves impressive results despite training on less than 1k hours of speech, far less than current state-of-the-art systems. In fact, it synthesizes more intelligible speech than the state-of-the-art autoregressive model VALL-E while using less than 2% of its training data.
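To make the latent-diffusion setup concrete, here is a minimal, self-contained training-step sketch: latents from a frozen pretrained audio autoencoder are noised, and a transformer denoiser that cross-attends to character-aware text representations is trained to predict that noise. The toy modules, dimensions, and noise schedule are stand-ins, not the actual U-AT, autoencoder, or text encoder from SESD.

```python
# Toy sketch of one conditional latent-diffusion training step for TTS.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDenoiser(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.block = nn.TransformerDecoderLayer(dim, nhead=4, batch_first=True)
        self.t_embed = nn.Embedding(1000, dim)
    def forward(self, z_t, t, cond):
        # add the timestep embedding, then cross-attend to the text condition
        return self.block(z_t + self.t_embed(t)[:, None, :], cond)

B, T, L, D = 2, 100, 32, 64
z0 = torch.randn(B, T, D)          # latents from a frozen pretrained audio autoencoder
cond = torch.randn(B, L, D)        # character-aware language-model representations
alphas = torch.linspace(0.999, 0.01, 1000)   # toy cumulative noise schedule

t = torch.randint(0, 1000, (B,))
noise = torch.randn_like(z0)
a = alphas[t].view(-1, 1, 1)
z_t = a.sqrt() * z0 + (1 - a).sqrt() * noise   # forward diffusion in latent space

model = ToyDenoiser(D)
loss = F.mse_loss(model(z_t, t, cond), noise)  # denoiser learns to predict the noise
loss.backward()
```

Operating on compact autoencoder latents rather than raw waveforms is what lets a diffusion model of this kind train effectively on a relatively small speech corpus.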