IndexTTS2, developed by bilibili, is an autoregressive zero-shot text-to-speech (TTS) system that achieves precise, token-count-based speech duration control and robust emotional expression from reference audio or natural language. It integrates GPT latent representations to enhance speech clarity, particularly in emotional speech, and outperforms state-of-the-art baselines across objective and subjective metrics.
Bilibili Inc. researchers developed AniSora, a comprehensive AI system designed for animation video generation that includes a curated 10-million-clip dataset, a controllable diffusion transformer model, and animation-specific evaluation metrics. The system achieved superior performance in generating high-quality, controllable animation videos, particularly excelling in visual smoothness and character consistency compared to existing general video models.
Researchers at BILIBILI Inc. developed MX-Font++, a few-shot font generation model that enhances content-style disentanglement and feature extraction using Heterogeneous Aggregation Experts and a content-style homogeneity loss. The model achieves state-of-the-art visual quality for complex Chinese characters and substantially improves scene text recognition accuracy for low-resource languages like Cyrillic when trained with its generated fonts.