Universidad de Buenos Aires (UBA)
The goal of universal audio representation learning is to obtain foundational models that can be used for a variety of downstream tasks involving speech, music and environmental sounds. To approach this problem, methods inspired by works on self-supervised learning for NLP, like BERT, or computer vision, like masked autoencoders (MAE), are often adapted to the audio domain. In this work, we propose masking representations of the audio signal, and training a MAE to reconstruct the masked segments. The reconstruction is done by predicting the discrete units generated by EnCodec, a neural audio codec, from the unmasked inputs. We evaluate this approach, which we call EnCodecMAE, on a wide range of tasks involving speech, music and environmental sounds. Our best model outperforms various state-of-the-art audio representation models in terms of global performance. Additionally, we evaluate the resulting representations in the challenging task of automatic speech recognition (ASR), obtaining decent results and paving the way for a universal audio representation.
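The training objective described above — predicting discrete codec units, but only at masked positions — can be sketched as follows. This is a toy illustration with random features and random logits standing in for the encoder/decoder, not the actual EnCodecMAE architecture; the shapes, vocabulary size, and 75% mask ratio are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

T, D, V = 100, 64, 1024   # frames, feature dim, codec vocabulary size (assumed)
mask_ratio = 0.75         # fraction of frames to mask (assumed)

features = rng.normal(size=(T, D))    # stand-in for audio-frame embeddings
targets = rng.integers(0, V, size=T)  # stand-in for EnCodec discrete units

# Randomly select frames to mask; the model only sees the unmasked frames.
mask = rng.random(T) < mask_ratio

# A real model would encode the unmasked frames and decode logits over the
# codec vocabulary for every position; random logits here show where the
# loss is computed, not how it is predicted.
logits = rng.normal(size=(T, V))

# Cross-entropy (log-softmax + negative log-likelihood) at masked positions only.
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
loss = -log_probs[mask, targets[mask]].mean()
print(f"masked frames: {mask.sum()}, loss: {loss:.3f}")
```

Restricting the loss to masked positions is what forces the model to infer the missing segments from the surrounding context.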
Self-supervised representations of speech are currently being widely used for a large number of applications. Recently, some efforts have been made in trying to analyze the type of information present in each of these representations. Most such work uses downstream models to test whether the representations can be successfully used for a specific task. The downstream models, though, typically perform nonlinear operations on the representation extracting information that may not have been readily available in the original representation. In this work, we analyze the spatial organization of phone and speaker information in several state-of-the-art speech representations using methods that do not require a downstream model. We measure how different layers encode basic acoustic parameters such as formants and pitch using representation similarity analysis. Further, we study the extent to which each representation clusters the speech samples by phone or speaker classes using non-parametric statistical testing. Our results indicate that models represent these speech attributes differently depending on the target task used during pretraining.
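Representation similarity analysis, as used above, can be sketched as follows: build a dissimilarity matrix over the same set of samples for a representation and for an acoustic parameter (here pitch), then rank-correlate their upper triangles. The data is synthetic and the dimensions are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 50  # number of speech samples (illustrative)
pitch = rng.uniform(80, 300, size=n)  # stand-in acoustic parameter (Hz)
# A layer that partially encodes pitch: one noisy pitch dimension plus noise dims.
layer = np.column_stack([pitch + rng.normal(0, 20, size=n),
                         rng.normal(size=(n, 15))])

def rdm(x):
    # Representational dissimilarity: condensed pairwise Euclidean distances.
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    return d[np.triu_indices(len(x), k=1)]

def spearman(a, b):
    # Spearman correlation: Pearson correlation of the rank-transformed vectors.
    ranks = lambda v: v.argsort().argsort()
    return np.corrcoef(ranks(a), ranks(b))[0, 1]

# RSA score: do the two dissimilarity structures agree?
rho = spearman(rdm(layer), rdm(pitch[:, None]))
print(f"RSA score (Spearman rho): {rho:.3f}")
```

Because the score is computed directly between distance structures, no downstream model is trained, which is exactly the point of the analysis described above.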
Higher-derivative interactions and transformation rules of the fields in the effective field theories of the massless string states are strongly constrained by space-time symmetries and dualities. Here we use an exact formulation of ten-dimensional N=1 supergravity coupled to Yang-Mills with manifest T-duality symmetry to construct the first order α'-corrections of the heterotic string effective action. The theory contains a supersymmetric and T-duality covariant generalization of the Green-Schwarz mechanism that determines the modifications to the leading order supersymmetry transformation rules of the fields. We compute the resulting field-dependent deformations of the coefficients in the supersymmetry algebra and construct the invariant action, with up to and including four-derivative terms of all the massless bosonic and fermionic fields of the heterotic string spectrum.
We investigate a shape optimization problem for a heat-conducting fluid governed by a Boussinesq system. The main goal is to determine an optimal domain shape that yields a temperature distribution as uniform as possible. Initially, we analyze the state problem, prove its well-posedness and establish a local boundary regularity result for the weak solution. We then demonstrate the existence of an optimal shape and derive a first-order optimality condition. This requires the derivation and analysis of the adjoint system associated with the Boussinesq model, as well as a rigorous treatment of the directional derivatives of the objective functional under appropriate domain perturbations. Finally, we present numerical experiments that illustrate and support the theoretical findings.
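For reference, a standard stationary form of the Boussinesq system referred to above couples Navier-Stokes flow with heat transport; the paper's exact formulation, boundary conditions, and coefficients may differ from this common textbook version:

```latex
\begin{aligned}
-\nu \Delta u + (u \cdot \nabla)u + \nabla p &= \beta \theta \, g && \text{in } \Omega,\\
\operatorname{div} u &= 0 && \text{in } \Omega,\\
-\kappa \Delta \theta + u \cdot \nabla \theta &= f && \text{in } \Omega,
\end{aligned}
```

where $u$ is the velocity, $p$ the pressure, $\theta$ the temperature, $\nu$ the viscosity, $\kappa$ the thermal diffusivity, $\beta$ the thermal expansion coefficient, $g$ the gravity direction, and $f$ a heat source. The buoyancy term $\beta \theta g$ is what couples the temperature back into the flow, which is why the adjoint system mentioned above involves both the fluid and heat equations.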
We use a moduli space exploration algorithm to produce a complete list of maximally enhanced gauge groups that are realized in the heterotic string in 7d, encompassing the usual Narain component, and five other components with rank reduction realized via nontrivial holonomy triples. Using lattice embedding techniques we find an explicit match with the mechanism of singularity freezing in M-theory on K3. The complete global data for each gauge group is explicitly given.
Recent observations have revealed remarkable insights into the gas reservoir in the circumgalactic medium (CGM) of galaxy haloes. In this paper, we characterise the gas in the vicinity of Milky Way and Andromeda analogues in the HESTIA (High resolution Environmental Simulations of The Immediate Area) suite of constrained Local Group (LG) simulations. The HESTIA suite comprises three high-resolution AREPO-based simulations of the LG, run using the Auriga galaxy formation model. For this paper, we focus only on the z = 0 simulation datasets and generate mock skymaps along with a power spectrum analysis to show that the distributions of ions tracing low-temperature gas (HI and SiIII) are clumpier than those of warmer gas tracers (OVI, OVII and OVIII). We compare to the spectroscopic CGM observations of M31 and low-redshift galaxies. HESTIA under-produces the column densities of the M31 observations, but the simulations are consistent with the observations of low-redshift galaxies. A possible explanation for these findings is that the spectroscopic observations of M31 are contaminated by gas residing in the CGM of the Milky Way.
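The power-spectrum comparison of clumpiness described above can be sketched as follows: take the 2D FFT of a mock column-density map and average the squared modulus in radial wavenumber bins; a clumpier tracer retains relatively more power at high wavenumbers. The maps here are Gaussian-smoothed white noise at two scales standing in for the ion maps; all sizes and smoothing scales are illustrative, not HESTIA parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 128  # map size in pixels (illustrative)

def radial_power_spectrum(field):
    # 2D power spectrum, azimuthally averaged into integer radial k bins.
    power = np.abs(np.fft.fftshift(np.fft.fft2(field))) ** 2
    ky, kx = np.indices(field.shape) - N // 2
    k = np.hypot(kx, ky).astype(int)
    ps = np.bincount(k.ravel(), weights=power.ravel()) / np.bincount(k.ravel())
    return ps[1:N // 2]  # drop the k=0 mean; keep modes up to Nyquist

def smoothed_noise(sigma):
    # White noise smoothed with a Gaussian of width sigma pixels (in Fourier space).
    noise = rng.normal(size=(N, N))
    ky, kx = np.indices((N, N)) - N // 2
    kernel = np.exp(-0.5 * (np.hypot(kx, ky) * sigma * 2 * np.pi / N) ** 2)
    return np.real(np.fft.ifft2(np.fft.fft2(noise) * np.fft.ifftshift(kernel)))

clumpy = smoothed_noise(sigma=1.0)  # small-scale structure (HI-like, say)
smooth = smoothed_noise(sigma=4.0)  # large-scale structure (OVI-like, say)

ps_clumpy = radial_power_spectrum(clumpy)
ps_smooth = radial_power_spectrum(smooth)
# The clumpier map keeps relatively more power at high wavenumbers.
ratio_high_k = ps_clumpy[-10:].mean() / ps_smooth[-10:].mean()
print(f"high-k power ratio (clumpy/smooth): {ratio_high_k:.3g}")
```

The same diagnostic applied to the simulated ion maps is what distinguishes the clumpy low-temperature tracers (HI, SiIII) from the smoother warm-gas tracers (OVI, OVII, OVIII).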
In recent years, self-supervised learning (SSL) models have produced promising results in a variety of speech-processing tasks, especially in contexts of data scarcity. In this paper, we study the use of SSL models for the task of mispronunciation detection for second language learners. We compare two downstream approaches: 1) training the model for phone recognition (PR) using native English data, and 2) training a model directly for the target task using non-native English data. We compare the performance of these two approaches for various SSL representations as well as a representation extracted from a traditional DNN-based speech recognition model. We evaluate the models on L2Arctic and EpaDB, two datasets of non-native speech annotated with pronunciation labels at the phone level. Overall, we find that using a downstream model trained for the target task gives the best performance and that most upstream models perform similarly for the task.
Argentina has a large yet little-known Indigenous linguistic diversity, encompassing at least 40 different languages. The majority of these languages are at risk of disappearing, resulting in a significant loss of world heritage and cultural knowledge. Currently, unified information on speakers and computational tools is lacking for these languages. In this work, we present a systematization of the Indigenous languages spoken in Argentina, classifying them into seven language families: Mapuche, Tupí-Guaraní, Guaycurú, Quechua, Mataco-Mataguaya, Aymara, and Chon. For each one, we present an estimation of the national Indigenous population size, based on the most recent Argentinian census. We discuss potential reasons why the census questionnaire design may underestimate the actual number of speakers. We also provide a concise survey of computational resources available for these languages, whether or not they were specifically developed for Argentinian varieties.