Suzhou Municipal Key Laboratory of Multimodal Intelligent Systems
Diarization-Aware Multi-Speaker Automatic Speech Recognition via Large Language Models

This paper presents a framework that integrates speaker diarization with Large Language Models for multi-speaker automatic speech recognition. Using a novel diarization-aware triplet enrollment mechanism, the system generates speaker-attributed transcriptions with precise temporal alignment, achieving competitive performance on Mandarin meeting datasets and strong multilingual capability.
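The end product described above is a speaker-attributed, time-aligned transcript. As a minimal illustration of that output format (the segment fields and rendering below are assumptions for the sketch, not the paper's actual data structures), diarization segments and per-segment ASR hypotheses can be merged like this:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start: float   # segment start time in seconds
    end: float     # segment end time in seconds
    speaker: str   # diarization label, e.g. "spk1"
    text: str      # ASR hypothesis for this span

def to_speaker_attributed(segments):
    """Render diarized ASR segments as speaker-attributed, time-aligned lines."""
    lines = []
    for seg in sorted(segments, key=lambda s: s.start):
        lines.append(f"[{seg.start:.2f}-{seg.end:.2f}] {seg.speaker}: {seg.text}")
    return "\n".join(lines)

demo = [
    Segment(3.10, 5.40, "spk2", "I agree with the proposal."),
    Segment(0.00, 2.75, "spk1", "Let's begin the meeting."),
]
print(to_speaker_attributed(demo))
```

This only shows the target output shape; the paper's contribution is producing such transcripts jointly via an LLM rather than by post-hoc merging.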

WhisperVC: Target Speaker-Controllable Mandarin Whisper-to-Speech Conversion
Whispered speech lacks vocal-fold excitation and exhibits reduced energy and shifted formant frequencies, making natural and intelligible voice reconstruction highly challenging. To address this issue, we propose WhisperVC, a three-stage framework for Mandarin whisper-to-speech (W2S) conversion. Stage 1 employs a fine-tuned Content Encoder based on the OpenAI Whisper large-v3 model and a Conformer-based variational autoencoder with soft-DTW alignment to learn domain-invariant and temporally consistent representations. Stage 2 introduces a deterministic Length-Channel Aligner and a duration-free FastSpeech 2 model conditioned on speaker embeddings for controllable timbre and stable prosody. Stage 3 fine-tunes a HiFi-GAN vocoder on predicted mel-spectrograms to synthesize high-fidelity waveforms. Experiments on the AISHELL6-Whisper corpus demonstrate that WhisperVC achieves near ground-truth quality (DNSMOS 3.11, UTMOS 2.52, CER 18.67%), while maintaining speaker similarity (cosine 0.76) and robust performance under whisper-only inference.
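The three-stage data flow in the abstract can be sketched as a simple pipeline. Every function body below is a placeholder (the real modules are neural networks), and the frame hop and vocoder upsampling factor are assumptions for illustration, not figures from the paper; only the interfaces and stage ordering follow the description above:

```python
# Sketch of the three-stage WhisperVC data flow described in the abstract.
# All bodies are stand-ins; only the stage interfaces and ordering are real.

HOP = 160        # assumed frame hop in samples (not stated in the abstract)
UPSAMPLE = 256   # assumed vocoder upsampling factor (also an assumption)

def stage1_content(whisper_wav):
    """Stage 1: Content Encoder + Conformer VAE map whispered audio
    to domain-invariant, temporally consistent content frames."""
    n_frames = max(1, len(whisper_wav) // HOP)
    return [[0.0] * 256 for _ in range(n_frames)]

def stage2_acoustic(content, spk_emb):
    """Stage 2: Length-Channel Aligner + duration-free FastSpeech 2,
    conditioned on a speaker embedding for timbre control, predict a
    mel-spectrogram (80 bins per content frame)."""
    bias = sum(spk_emb) / len(spk_emb)
    return [[bias] * 80 for _ in content]

def stage3_vocoder(mel):
    """Stage 3: fine-tuned HiFi-GAN turns the mel-spectrogram into samples."""
    return [0.0] * (len(mel) * UPSAMPLE)

def whisper_to_speech(wav, spk_emb):
    return stage3_vocoder(stage2_acoustic(stage1_content(wav), spk_emb))

wav = whisper_to_speech([0.0] * 16000, [1.0] * 192)
print(len(wav))  # 25600 samples: 100 content frames x 256 upsampling
```

The point of the decomposition is that Stage 2 is conditioned only on content frames and a speaker embedding, which is what makes whisper-only inference possible: no voiced reference audio enters the pipeline at test time.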