alphaXiv


01 Jun 2024

computer-science sound audio-and-speech-processing

DiaPer: End-to-End Neural Diarization with Perceiver-Based Attractors

Brno University of Technology Athens University of Economics and Business Omilia

The paper introduces DiaPer, an end-to-end neural diarization model that replaces the LSTM-based attractor generation of EEND-EDA with a Perceiver-based architecture. This improves diarization performance, particularly in multi-speaker scenarios, and speeds up inference on long recordings, while keeping the model lightweight, as evaluated on various telephone and wide-band datasets.


22 Aug 2025

audio-and-speech-processing electrical-engineering

Hybrid Pruning: In-Situ Compression of Self-Supervised Speech Models for Speaker Verification and Anti-Spoofing

Brno University of Technology

Nanjing University

Johns Hopkins University Athens University of Economics and Business Archimedes/Athena RC Omilia

Although large-scale self-supervised learning (SSL) models like WavLM have achieved state-of-the-art performance in speech processing, their significant size impedes deployment on resource-constrained devices. While structured pruning is a key technique for model compression, existing methods typically separate it from task-specific fine-tuning. This multi-stage approach struggles to create optimal architectures tailored for diverse downstream tasks. In this work, we introduce a unified framework that integrates structured pruning into the downstream fine-tuning process. Our framework unifies these steps, jointly optimizing for task performance and model sparsity in a single stage. This allows the model to learn a compressed architecture specifically for the end task, eliminating the need for complex multi-stage pipelines and knowledge distillation. Our pruned models achieve up to a 70% parameter reduction with negligible performance degradation on large-scale datasets, achieving equal error rates of 0.7%, 0.8%, and 1.6% on Vox1-O, -E, and -H, respectively. Furthermore, our approach demonstrates improved generalization in low-resource scenarios, reducing overfitting and achieving a state-of-the-art 3.7% EER on ASVspoof5.
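
For readers unfamiliar with the metric, the equal error rate (EER) quoted above is the operating point at which the false-accept rate and the false-reject rate coincide. The following is a minimal sketch of that definition, not the paper's evaluation code: the function name `equal_error_rate` and the toy scores are assumptions made for illustration, and real scoring toolkits typically interpolate between thresholds rather than picking the closest observed one.

```python
from typing import Sequence, Tuple


def equal_error_rate(
    target_scores: Sequence[float],
    nontarget_scores: Sequence[float],
) -> Tuple[float, float]:
    """Return (eer, threshold) at the observed score where FAR and FRR are closest.

    Hypothetical helper for illustration; higher scores mean "more likely genuine".
    """
    if not target_scores or not nontarget_scores:
        raise ValueError("need at least one target and one non-target score")

    best_gap = float("inf")
    eer = 1.0
    best_threshold = 0.0
    # Candidate thresholds: every distinct observed score, in ascending order.
    for threshold in sorted(set(target_scores) | set(nontarget_scores)):
        # False reject: a genuine trial scored below the threshold.
        frr = sum(s < threshold for s in target_scores) / len(target_scores)
        # False accept: an impostor trial scored at or above the threshold.
        far = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap = gap
            eer = (far + frr) / 2  # average the two rates at the closest point
            best_threshold = threshold
    return eer, best_threshold


# Toy trial scores (made up): one genuine trial scores below one impostor trial.
eer, threshold = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
print(f"EER = {eer:.2%} at threshold {threshold}")
```

On these toy scores one of four genuine trials is rejected and one of four impostor trials is accepted at the same threshold, so both error rates meet at 25%.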


03 Oct 2024

audio-and-speech-processing electrical-engineering

State-of-the-art Embeddings with Video-free Segmentation of the Source VoxCeleb Data

Brno University of Technology Athens University of Economics and Business Omilia Universidad Autónoma de Madrid

Researchers from Brno University of Technology and collaborators introduce a video-free, weakly-supervised method for training speaker embedding extractors using only audio and recording-level labels. This two-stage approach achieves state-of-the-art speaker verification performance on VoxCeleb1 while being robust to initial diarization quality and significantly reducing reliance on video data and precise annotations.

08 Nov 2025

audio-and-speech-processing electrical-engineering

Hybrid Pruning: In-Situ Compression of Self-Supervised Speech Models for Speaker Verification and Anti-Spoofing

Brno University of Technology

Nanjing University

Johns Hopkins University Athens University of Economics and Business Omilia Archimedes/Athena R.C.

Although large-scale self-supervised learning (SSL) models like WavLM have achieved state-of-the-art performance in speech processing, their significant size impedes deployment on resource-constrained devices. While structured pruning is a key technique for model compression, existing methods typically separate it from task-specific fine-tuning. This multi-stage approach struggles to create optimal architectures tailored for diverse downstream tasks. In this work, we introduce a unified framework that integrates structured pruning into the downstream fine-tuning process. Our framework unifies these steps, jointly optimizing for task performance and model sparsity in a single stage. This allows the model to learn a compressed architecture specifically for the end task, eliminating the need for complex multi-stage pipelines and knowledge distillation. Our pruned models achieve up to a 70% parameter reduction with negligible performance degradation on large-scale datasets, achieving equal error rates of 0.7%, 0.8%, and 1.6% on Vox1-O, -E, and -H, respectively. Furthermore, our approach demonstrates improved generalization in low-resource scenarios, reducing overfitting and achieving a state-of-the-art 3.7% EER on ASVspoof5.
