Character AI researchers present OVI, a unified generative framework for synchronized audio-video content that employs a symmetric twin-backbone Diffusion Transformer with blockwise cross-modal fusion. The model earns clear human preference over existing open-source systems for combined audio and video quality, along with improved audio-video synchronization.
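As a rough illustration of what blockwise cross-modal fusion between twin backbones can look like, the PyTorch sketch below runs a video stream and an audio stream through parallel transformer blocks and exchanges information via cross-attention after each block. All class and module names here are hypothetical; this is a sketch of the general idea, not the OVI implementation.

```python
# Minimal sketch of blockwise cross-modal fusion between twin DiT backbones.
# Names and dimensions are illustrative placeholders.
import torch
import torch.nn as nn

class FusedTwinBlock(nn.Module):
    """One layer of a symmetric twin-backbone model: each modality runs its own
    transformer block, then the two streams exchange information via cross-attention."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.video_block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.audio_block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)  # video queries attend to audio
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)  # audio queries attend to video

    def forward(self, v: torch.Tensor, a: torch.Tensor):
        v = self.video_block(v)
        a = self.audio_block(a)
        v_fused = v + self.a2v(v, a, a, need_weights=False)[0]
        a_fused = a + self.v2a(a, v, v, need_weights=False)[0]
        return v_fused, a_fused

# Fusing at every block keeps the two streams aligned layer by layer.
v = torch.randn(2, 256, 512)   # (batch, video tokens, dim)
a = torch.randn(2, 128, 512)   # (batch, audio tokens, dim)
v, a = FusedTwinBlock(512)(v, a)
```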
Meta AI researchers developed the Massively Multilingual Speech (MMS) project, which extends automatic speech recognition, language identification, and text-to-speech technologies to over 1,000 languages. This involved creating a 44.7K-hour labeled dataset from religious texts and resulted in ASR models that support 1,107 languages and outperform Whisper (large-v2) on FLEURS-54 with a 58% relative reduction in word error rate.
Character AI researchers develop TalkingMachines, a framework that adapts large video foundation models for real-time, audio-driven character animation. It transforms WAN2.1's bidirectional architecture into an autoregressive system through asymmetric knowledge distillation, achieving real-time performance (meeting TTBC latency thresholds) with only 2 neural function evaluations versus the original 24. Infinite video streaming is enabled by a sparse causal attention pattern in which each 3-frame chunk attends only to itself, the previous chunk, and the initial reference frame. System optimizations, including Score-VAE disaggregation across separate GPUs and CUDA-stream overlapping, further reduce inference bottlenecks for FaceTime-style AI character interactions.
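The sparse causal attention pattern described above can be made concrete with a small mask-building sketch, assuming for simplicity one token per frame (real models use many tokens per frame); the function name and defaults are illustrative and not taken from the TalkingMachines code.

```python
# Sketch of a chunk-wise sparse attention mask: each chunk of 3 frames may attend
# to itself, the immediately preceding chunk, and the initial reference frame(s).
import torch

def sparse_causal_mask(num_frames: int, chunk: int = 3, ref_frames: int = 1) -> torch.Tensor:
    """Boolean mask (True = may attend), built frame by frame."""
    mask = torch.zeros(num_frames, num_frames, dtype=torch.bool)
    for f in range(num_frames):
        c = f // chunk
        start = max(0, (c - 1) * chunk)   # start of the previous chunk
        end = (c + 1) * chunk             # end of the current chunk
        mask[f, start:end] = True
        mask[f, :ref_frames] = True       # always attend to the reference frame
    return mask

# 9 frames = 3 chunks of 3; note the block-banded structure plus the first column.
print(sparse_causal_mask(9).int())
```

Because each chunk only depends on a fixed, local window plus the reference frame, the attention cost per generated chunk stays constant, which is what makes indefinitely long streaming feasible.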
Researchers developed Direct Principle Feedback (DPF), a simplified AI feedback method, to enable large language models to reliably avoid specific "forbidden" topics when prompted. This approach, trained on a synthetically generated dataset, achieved avoidance rates comparable to GPT-4 while maintaining general model performance, addressing a common failure mode where LLMs paradoxically mention what they are told to avoid.
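DPF is described as a simplified AI-feedback method trained on synthetic pairs; one common way to train on such pairs is a DPO-style preference loss that favors a revised response avoiding the forbidden topic over the original response that mentions it. The sketch below shows such a loss purely as an illustration of that general recipe under this assumption; it is not the authors' training code, and all names are placeholders.

```python
# Hedged sketch: a DPO-style preference loss over (avoids-topic, mentions-topic) pairs.
import torch
import torch.nn.functional as F

def preference_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta: float = 0.1):
    """logp_* are summed token log-probabilities of each full response under the
    policy / frozen reference model; 'chosen' avoids the forbidden topic,
    'rejected' mentions it."""
    logits = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    return -F.logsigmoid(logits).mean()

# Toy usage with made-up log-probabilities for a batch of 4 pairs.
loss = preference_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
```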
This work introduces Sample-Efficient Speech Diffusion (SESD), an algorithm for effective speech synthesis in modest data regimes through latent diffusion. It is built on a novel diffusion architecture, the U-Audio Transformer (U-AT), which efficiently scales to long sequences and operates in the latent space of a pre-trained audio autoencoder. Conditioned on character-aware language-model representations, SESD achieves impressive results despite training on less than 1k hours of speech, far less than current state-of-the-art systems. In fact, it synthesizes more intelligible speech than the state-of-the-art autoregressive model VALL-E while using less than 2% of its training data.
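To make the latent-diffusion setup concrete, here is a minimal, self-contained training-step sketch: latents from a frozen pretrained audio autoencoder are noised, and a transformer denoiser that cross-attends to character-aware text representations is trained to predict that noise. The toy modules, dimensions, and noise schedule are stand-ins, not the actual U-AT, autoencoder, or text encoder from SESD.

```python
# Toy sketch of one conditional latent-diffusion training step for TTS.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDenoiser(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.block = nn.TransformerDecoderLayer(dim, nhead=4, batch_first=True)
        self.t_embed = nn.Embedding(1000, dim)
    def forward(self, z_t, t, cond):
        # add the timestep embedding, then cross-attend to the text condition
        return self.block(z_t + self.t_embed(t)[:, None, :], cond)

B, T, L, D = 2, 100, 32, 64
z0 = torch.randn(B, T, D)          # latents from a frozen pretrained audio autoencoder
cond = torch.randn(B, L, D)        # character-aware language-model representations
alphas = torch.linspace(0.999, 0.01, 1000)   # toy cumulative noise schedule

t = torch.randint(0, 1000, (B,))
noise = torch.randn_like(z0)
a = alphas[t].view(-1, 1, 1)
z_t = a.sqrt() * z0 + (1 - a).sqrt() * noise   # forward diffusion in latent space

model = ToyDenoiser(D)
loss = F.mse_loss(model(z_t, t, cond), noise)  # denoiser learns to predict the noise
loss.backward()
```

Operating on compact autoencoder latents rather than raw waveforms is what lets a diffusion model of this kind train effectively on a relatively small speech corpus.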