National Institute of Information and Communications Technology, Kyoto

Generating Speakers by Prompting Listener Impressions for Pre-trained Multi-Speaker Text-to-Speech Systems

13 Jun 2024

Shanghai Jiao Tong University National Institute of Informatics

A multi-speaker text-to-speech system allows users to generate custom voices from natural language prompts derived from listener impressions. The system employs Low-rank Adaptation (LoRA) for efficient language model fine-tuning and a hybrid discriminative-generative approach with Flow Matching to synthesize speaker embeddings, yielding high-fidelity and controllable speech.

#computer-science #sound #audio-and-speech-processing

Turning Whisper into Real-Time Transcription System

21 Sep 2023

Charles University National Institute of Information and Communications Technology

Researchers from Charles University and NICT developed 'Whisper-Streaming,' an adaptation of OpenAI's Whisper model, to provide real-time, low-latency automatic speech recognition and translation, achieving average latencies of 3.3 seconds for English and 4.4-4.8 seconds for German/Czech ASR in live settings.

#computer-science #computation-and-language

The AudioMOS Challenge 2025

01 Sep 2025

Nagoya University Meta

This is the summary paper for the AudioMOS Challenge 2025, the very first challenge for automatic subjective quality prediction for synthetic audio. The challenge consists of three tracks. The first track aims to assess text-to-music samples in terms of overall quality and textual alignment. The second track is based on the four evaluation dimensions of Meta Audiobox Aesthetics, and the test set consists of text-to-speech, text-to-audio, and text-to-music samples. The third track focuses on synthetic speech quality assessment in different sampling rates. The challenge attracted 24 unique teams from both academia and industry, and improvements over the baselines were confirmed. The outcome of this challenge is expected to facilitate development and progress in the field of automatic evaluation for audio generation systems.

#computer-science #sound #audio-and-speech-processing

An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios

13 Jun 2024

Tianjin University University of Edinburgh

Self-supervised learning (SSL) representations from massively multilingual models offer a promising solution for low-resource language speech tasks. Despite advancements, language adaptation in TTS systems remains an open problem. This paper explores the language adaptation capability of ZMM-TTS, a recent SSL-based multilingual TTS system proposed in our previous work. We conducted experiments on 12 languages using limited data with various fine-tuning configurations. We demonstrate that the similarity in phonetics between the pre-training and target languages, as well as the language category, affects the target language's adaptation performance. Additionally, we find that the fine-tuning dataset size and number of speakers influence adaptability. Surprisingly, we also observed that using paired data for fine-tuning is not always optimal compared to audio-only data. Beyond speech intelligibility, our analysis covers speaker similarity, language identification, and predicted MOS.

#computer-science #computation-and-language #audio-and-speech-processing

RomanSetu: Efficiently unlocking multilingual capabilities of Large Language Models via Romanization

23 Jun 2024

Microsoft National Institute of Information and Communications Technology

This study addresses the challenge of extending Large Language Models (LLMs) to non-English languages that use non-Roman scripts. We propose an approach that utilizes the romanized form of text as an interface for LLMs, hypothesizing that its frequent informal use and shared tokens with English enhance cross-lingual alignment. Our approach involves the continual pretraining of an English LLM like Llama 2 on romanized text of non-English, non-Roman script languages, followed by instruction tuning on romanized data. The results indicate that romanized text not only reduces token fertility by 2x-4x but also matches or outperforms native script representation across various NLU, NLG, and MT tasks. Moreover, the embeddings computed on romanized text exhibit closer alignment with their English translations than those from the native script. Our approach presents a promising direction for leveraging the power of English LLMs in languages traditionally underrepresented in NLP. Our code is available on this https URL.

#computer-science #artificial-intelligence #computation-and-language

RomanLens: The Role Of Latent Romanization In Multilinguality In LLMs

09 Jun 2025

IT University of Copenhagen Microsoft

Researchers from AI4Bharat and other institutions explored how large language models (LLMs) process non-Roman script languages, revealing that LLMs implicitly leverage Romanization as an intermediate step. This internal Romanization facilitates consistent semantic encoding across native and Romanized scripts and enables target language representations to emerge earlier in the model's layers.

#computer-science #artificial-intelligence #computation-and-language

TikZero: Zero-Shot Text-Guided Graphics Program Synthesis

14 Aug 2025

University of Technology Nuremberg University of Mannheim

Automatically synthesizing figures from text captions is a compelling capability. However, achieving high geometric precision and editability requires representing figures as graphics programs in languages like TikZ, and aligned training data (i.e., graphics programs with captions) remains scarce. Meanwhile, large amounts of unaligned graphics programs and captioned raster images are more readily available. We reconcile these disparate data sources by presenting TikZero, which decouples graphics program generation from text understanding by using image representations as an intermediary bridge. It enables independent training on graphics programs and captioned images and allows for zero-shot text-guided graphics program synthesis during inference. We show that our method substantially outperforms baselines that can only operate with caption-aligned graphics programs. Furthermore, when leveraging caption-aligned graphics programs as a complementary training signal, TikZero matches or exceeds the performance of much larger models, including commercial systems like GPT-4o. Our code, datasets, and select models are publicly available.

#computer-science #computation-and-language #computer-vision-and-pattern-recognition

Triple Phase Transitions: Understanding the Learning Dynamics of Large Language Models from a Neuroscience Perspective

29 Mar 2025

Osaka University National Institute of Informatics

Large language models (LLMs) often exhibit abrupt emergent behavior, whereby new abilities arise at certain points during their training. This phenomenon, commonly referred to as a "phase transition", remains poorly understood. In this study, we conduct an integrative analysis of such phase transitions by examining three interconnected perspectives: the similarity between LLMs and the human brain, the internal states of LLMs, and downstream task performance. We propose a novel interpretation for the learning dynamics of LLMs that vary in both training data and architecture, revealing that three phase transitions commonly emerge across these models during training: (1) alignment with the entire brain surges as LLMs begin adhering to task instructions (Brain Alignment and Instruction Following), (2) unexpectedly, LLMs diverge from the brain during a period in which downstream task accuracy temporarily stagnates (Brain Detachment and Stagnation), and (3) alignment with the brain reoccurs as LLMs become capable of solving the downstream tasks (Brain Realignment and Consolidation). These findings illuminate the underlying mechanisms of phase transitions in LLMs, while opening new avenues for interdisciplinary research bridging AI and neuroscience.

#ai-for-health #computer-science #artificial-intelligence

Which Way Does Time Flow? A Psychophysics-Grounded Evaluation for Vision-Language Models

05 Nov 2025

Kyoto University The University of Osaka

Modern vision-language models (VLMs) excel at many multimodal tasks, yet their grasp of temporal information in video remains weak and, crucially, under-evaluated. We probe this gap with a deceptively simple but revealing challenge: judging the arrow of time (AoT), that is, whether a short clip is played forward or backward. We introduce AoT-PsyPhyBENCH, a psychophysically validated benchmark that tests whether VLMs can infer temporal direction in natural videos using the same stimuli and behavioral baselines established for humans. Our comprehensive evaluation of open-weight and proprietary, reasoning and non-reasoning VLMs reveals that most models perform near chance, and even the best lag far behind human accuracy on physically irreversible processes (e.g., free fall, diffusion/explosion) and causal manual actions (division/addition) that humans recognize almost instantly. These results highlight a fundamental gap in current multimodal systems: while they capture rich visual-semantic correlations, they lack the inductive biases required for temporal continuity and causal understanding. We release the code and data for AoT-PsyPhyBENCH to encourage further progress in the physical and temporal reasoning capabilities of VLMs.

#causal-inference #computer-science #computation-and-language

Privacy in continuous-variable distributed quantum sensing

15 Sep 2025

Keio University Sorbonne Université

Can a distributed network of quantum sensors estimate a global parameter while protecting every locally encoded value? We answer this question affirmatively by introducing and analysing a protocol for distributed quantum sensing in the continuous-variable regime. We consider a multipartite network in which each node encodes a local phase into a shared entangled Gaussian state. We show that the average phase can be estimated with high precision, exhibiting Heisenberg scaling in the total photon number, while individual phases are inaccessible. Although complete privacy, where all other combinations of phases remain entirely hidden, is unattainable for finite squeezing in multi-party settings, it emerges in the large-squeezing limit. We further investigate the impact of displacements and optical losses, revealing trade-offs between estimation accuracy and privacy. Finally, we benchmark the protocol against other continuous-variable resource states.

#physics #quantum-physics

Propagating Gottesman-Kitaev-Preskill states encoded in an optical oscillator

05 Sep 2023

the University of Tokyo Kobe University

A quantum computer with low-error, high-speed quantum operations and capability for interconnections is required for useful quantum computations. A logical qubit called the Gottesman-Kitaev-Preskill (GKP) qubit, encoded in a single bosonic harmonic oscillator, is efficient for mitigating errors in a quantum computer. A particularly intriguing prospect of GKP qubits is that entangling gates as well as syndrome measurements for quantum error correction only require efficient, noise-robust linear operations. To date, however, GKP qubits have only been demonstrated at mechanical and microwave frequencies in highly nonlinear physical systems. The physical platform that naturally provides the scalable linear toolbox is optics, including near-ideal loss-free beam splitters and near-unit efficiency homodyne detectors that make it possible to obtain the complete analog syndrome for optimized quantum error correction. Additional optical linear amplifiers and specifically designed GKP qubit states are then all that is needed for universal quantum computing. In this work, we realize a GKP state in propagating light at the telecommunication wavelength and demonstrate homodyne measurements on the GKP states for the first time without any loss corrections. Our GKP states not only show non-classicality and non-Gaussianity at room temperature and atmospheric pressure, but, unlike the existing schemes with stationary qubits, they are realizable in a propagating wave system. This property permits large-scale quantum computation and interconnections, with strong compatibility to optical fibers and 5G telecommunication technology.

#physics #quantum-physics

HighRateMOS: Sampling-Rate Aware Modeling for Speech Quality Assessment

27 Jun 2025

Academia Sinica National Taiwan University

Modern speech quality prediction models are trained on audio data resampled to a specific sampling rate. When faced with higher-rate audio at test time, these models can produce biased scores. We introduce HighRateMOS, the first non-intrusive mean opinion score (MOS) model that explicitly considers sampling rate. HighRateMOS ensembles three model variants that exploit the following information: (i) a learnable embedding of speech sampling rate, (ii) Wav2vec 2.0 self-supervised embeddings, (iii) multi-scale CNN spectral features, and (iv) MFCC features. In AudioMOS 2025 Track3, HighRateMOS ranked first in five out of eight metrics. Our experiments confirm that modeling the sampling rate directly leads to more robust and sampling-rate-agnostic speech quality predictions.

#audio-and-speech-processing #electrical-engineering

Comprehensive Evaluation on Lexical Normalization: Boundary-Aware Approaches for Unsegmented Languages

01 Dec 2025

National Institute of Information and Communications Technology

Lexical normalization research has sought to tackle the challenge of processing informal expressions in user-generated text, yet the absence of comprehensive evaluations leaves it unclear which methods excel across multiple perspectives. Focusing on unsegmented languages, we make three key contributions: (1) creating a large-scale, multi-domain Japanese normalization dataset, (2) developing normalization methods based on state-of-the-art pretrained models, and (3) conducting experiments across multiple evaluation perspectives. Our experiments show that both encoder-only and decoder-only approaches achieve promising results in both accuracy and efficiency.

#computer-science #computation-and-language #data-curation

MetricGAN+: An Improved Version of MetricGAN for Speech Enhancement

04 Jun 2021

Academia Sinica Mila - Quebec AI Institute

The discrepancy between the cost function used for training a speech enhancement model and human auditory perception usually makes the quality of enhanced speech unsatisfactory. Objective evaluation metrics that consider human perception can hence serve as a bridge to reduce the gap. Our previously proposed MetricGAN was designed to optimize objective metrics by connecting the metric with a discriminator. Because only the scores of the target evaluation functions are needed during training, the metrics can even be non-differentiable. In this study, we propose MetricGAN+, which incorporates three training techniques informed by domain knowledge of speech processing. With these techniques, experimental results on the VoiceBank-DEMAND dataset show that MetricGAN+ can increase the PESQ score by 0.3 compared to the previous MetricGAN and achieve state-of-the-art results (PESQ score = 3.15).

#computer-science #artificial-intelligence #sound

Emergence of Human-Like Attention in Self-Supervised Vision Transformers: an eye-tracking study

30 Oct 2024

The University of Osaka National Institute of Information and Communications Technology

Many models of visual attention have been proposed so far. Traditional bottom-up models, like saliency models, fail to replicate human gaze patterns, and deep gaze prediction models lack biological plausibility due to their reliance on supervised learning. Vision Transformers (ViTs), with their self-attention mechanisms, offer a new approach but often produce dispersed attention patterns if trained with supervised learning. This study explores whether self-supervised DINO (self-DIstillation with NO labels) training enables ViTs to develop attention mechanisms resembling human visual attention. Using video stimuli to capture human gaze dynamics, we found that DINO-trained ViTs closely mimic human attention patterns, while those trained with supervised learning deviate significantly. An analysis of self-attention heads revealed three distinct clusters: one focusing on foreground objects, one on entire objects, and one on the background. DINO-trained ViTs offer insight into how human overt attention and figure-ground separation develop in visual perception.

#neurons-and-cognition #quantitative-biology

IndicNLG Benchmark: Multilingual Datasets for Diverse NLG Tasks in Indic Languages

27 Oct 2022

University of Edinburgh Microsoft

Natural Language Generation (NLG) for non-English languages is hampered by the scarcity of datasets in these languages. In this paper, we present the IndicNLG Benchmark, a collection of datasets for benchmarking NLG for 11 Indic languages. We focus on five diverse tasks, namely, biography generation using Wikipedia infoboxes, news headline generation, sentence summarization, paraphrase generation, and question generation. We describe the created datasets and use them to benchmark the performance of several monolingual and multilingual baselines that leverage pre-trained sequence-to-sequence models. Our results exhibit the strong performance of multilingual language-specific pre-trained models, and the utility of models trained on our dataset for other related NLG tasks. Our dataset creation methods can be easily applied to modest-resource languages as they involve simple steps such as scraping news articles and Wikipedia infoboxes, light cleaning, and pivoting through machine translation data. To the best of our knowledge, the IndicNLG Benchmark is the first NLG benchmark for Indic languages and the most diverse multilingual NLG dataset, with approximately 8M examples across 5 tasks and 11 languages. The datasets and models are publicly available at this https URL.

#computer-science #artificial-intelligence #computation-and-language

Spoof Diarization: "What Spoofed When" in Partially Spoofed Audio

12 Jun 2024

Brno University of Technology Sokendai

This paper defines Spoof Diarization as a novel task in the Partial Spoof (PS) scenario. It aims to determine what spoofed when, which includes not only locating spoof regions but also clustering them according to different spoofing methods. As a pioneering study in spoof diarization, we focus on defining the task, establishing evaluation metrics, and proposing a benchmark model, namely the Countermeasure-Condition Clustering (3C) model. Utilizing this model, we first explore how to effectively train countermeasures to support spoof diarization using three labeling schemes. We then utilize spoof localization predictions to enhance the diarization performance. This first study reveals the high complexity of the task, even in restricted scenarios where only a single speaker per audio file and an oracle number of spoofing methods are considered. Our code is available at this https URL.

#clustering-algorithms #computer-science #computation-and-language

Hybridization of pulse and continuous-wave based optical quantum computation

29 Nov 2025

the University of Tokyo NTT Corporation

We propose a pulse and continuous wave (CW) hybrid architecture of continuous-variable measurement-based optical quantum computation utilizing the strengths of both pulsed and CW light. In this architecture, input and ancillary non-Gaussian quantum states necessary for fault-tolerance and universality of quantum computing are generated with pulsed light, whereas quantum processors including continuous-variable cluster states and homodyne measurement systems are operated with CW light. This architecture is expected to enable both generation of quantum states with shorter optical wavepackets and low-loss manipulation and measurement of these states, and is thus compatible with ultrafast and low-loss quantum information processing. In this study, as a proof of principle, an ultrafast homodyne measurement using a CW local oscillator was performed on single-photon states generated with pulsed light. The measured single-photon state's temporal width was around 70 ps and the value of the Wigner function at the origin was W(0,0) = -0.153 +/- 0.003, which is highly non-classical. This will be a core technology for realizing high-speed optical quantum information processing.

#physics #quantum-physics

MOS-Bench: Benchmarking Generalization Abilities of Subjective Speech Quality Assessment Models

06 Nov 2024

Nagoya University National Institute of Information and Communications Technology

Subjective speech quality assessment (SSQA) is critical for evaluating speech samples as perceived by human listeners. While model-based SSQA has enjoyed great success thanks to the development of deep neural networks (DNNs), generalization remains a key challenge, especially for unseen, out-of-domain data. To benchmark the generalization abilities of SSQA models, we present MOS-Bench, a diverse collection of datasets. We also introduce SHEET, an open-source toolkit containing complete recipes to conduct SSQA experiments. We provide benchmark results for MOS-Bench and explore multi-dataset training to enhance generalization. Additionally, we propose a new performance metric, best score difference/ratio, and use latent space visualizations to explain model behavior, offering valuable insights for future research.

#computer-science #sound #audio-and-speech-processing

CP conditions for GKSL-like master equations

17 Jun 2024

Waseda University National Institute of Information and Communications Technology

The complete positivity (CP) of a quantum dynamical map (QDM) is, in general, difficult to show when its master equation (ME) does not conform to the Gorini-Kossakowski-Sudarshan-Lindblad (GKSL) form. The GKSL ME describes Markovian dynamics, comprising a unitary component with time-independent Hermitian operators and a non-unitary component with time-independent Lindblad operators and positive time-independent damping rates. Recently, non-Markovian dynamics has received growing attention, and various types of GKSL-like MEs with time-dependent operators are widely discussed; however, rigorous discussions on their CP conditions remain limited. This paper presents conditions for QDMs to be CP, whose MEs take the GKSL-like form with arbitrary time dependence. One case considered is where the ME takes the time-local integro-differential GKSL-like form, which includes CP-divisible cases. Another case considered is where the ME is time-non-local but can be approximated to be time-local in the weak-coupling regime. As a special case of the time-non-local case, the same discussion holds for the time-convoluted GKSL-like form, which should be compared to previous studies.

#physics #quantum-physics
