Tampere University
This comprehensive survey consolidates a decade of research (2015-2024) in deep learning for camera calibration, providing a systematic taxonomy for over 200 papers, outlining key trends and challenges, and introducing a holistic public benchmark dataset. It offers a structured understanding of various learning-based methods and their applications across different camera models.
Researchers from Tampere University and Nokia Technologies developed the Attractor-based Joint Diarization, Counting, and Separation System (A-DCSS), which addresses the challenge of separating speech from an unknown number of speakers, each contributing multiple non-contiguous utterances. The system achieved superior speech separation performance and high accuracy in speaker diarization and counting across various challenging acoustic conditions.
While the direction of arrival (DOA) of sound events is generally estimated from multichannel audio data recorded with a microphone array, sound events usually originate from visually perceptible source objects, e.g., the sound of footsteps comes from the feet of a walker. This paper proposes an audio-visual sound event localization and detection (SELD) task, which uses multichannel audio and video information to estimate the temporal activation and DOA of target sound events. Audio-visual SELD systems can detect and localize sound events using signals from a microphone array together with audio-visual correspondence. We also introduce an audio-visual dataset, Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23), which consists of multichannel audio data recorded with a microphone array, video data, and spatiotemporal annotation of sound events. Sound scenes in STARSS23 are recorded following instructions that guide participants to ensure adequate activity and occurrences of target sound events. STARSS23 also provides human-annotated temporal activation labels and human-confirmed DOA labels based on the tracking results of a motion capture system. Our benchmark results demonstrate the benefits of using visual object positions in audio-visual SELD tasks. The data is available at this https URL.
Recent advances in visually-induced audio generation are based on sampling short, low-fidelity, and one-class sounds. Moreover, sampling 1 second of audio from the state-of-the-art model takes minutes on a high-end GPU. In this work, we propose a single model capable of generating visually relevant, high-fidelity sounds prompted with a set of frames from open-domain videos in less time than it takes to play it on a single GPU. We train a transformer to sample a new spectrogram from the pre-trained spectrogram codebook given the set of video features. The codebook is obtained using a variant of VQGAN trained to produce a compact sampling space with a novel spectrogram-based perceptual loss. The generated spectrogram is transformed into a waveform using a window-based GAN that significantly speeds up generation. Considering the lack of metrics for automatic evaluation of generated spectrograms, we also build a family of metrics called FID and MKL. These metrics are based on a novel sound classifier, called Melception, and designed to evaluate the fidelity and relevance of open-domain samples. Both qualitative and quantitative studies are conducted on small- and large-scale datasets to evaluate the fidelity and relevance of generated samples. We also compare our model to the state-of-the-art and observe a substantial improvement in quality, size, and computation time. Code, demo, and samples: v-iashin.github.io/SpecVQGAN
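As a rough illustration of the conditional autoregressive sampling described above (a transformer drawing spectrogram codebook indices given video features), the following sketch uses illustrative sizes and a generic transformer decoder. It is not the released SpecVQGAN code; the dimensions, codebook size, and token count are assumptions.

import torch
import torch.nn as nn

# Illustrative sizes only -- not the actual SpecVQGAN configuration.
CODEBOOK_SIZE = 1024      # number of VQGAN codebook entries (assumed)
VIDEO_DIM = 512           # dimensionality of the video features (assumed)
EMBED_DIM = 256
NUM_SPEC_TOKENS = 265     # length of the spectrogram token sequence (assumed)

class CondSpectrogramSampler(nn.Module):
    """Autoregressively samples spectrogram codebook indices conditioned on video features."""
    def __init__(self):
        super().__init__()
        self.video_proj = nn.Linear(VIDEO_DIM, EMBED_DIM)
        self.token_emb = nn.Embedding(CODEBOOK_SIZE + 1, EMBED_DIM)  # +1 for a start token
        layer = nn.TransformerDecoderLayer(d_model=EMBED_DIM, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.head = nn.Linear(EMBED_DIM, CODEBOOK_SIZE)

    @torch.no_grad()
    def sample(self, video_feats):
        # video_feats: (batch, num_frames, VIDEO_DIM)
        memory = self.video_proj(video_feats)
        tokens = torch.full((video_feats.size(0), 1), CODEBOOK_SIZE, dtype=torch.long)  # start token
        for _ in range(NUM_SPEC_TOKENS):
            h = self.decoder(self.token_emb(tokens), memory)
            logits = self.head(h[:, -1])                        # predict the next codebook index
            next_tok = torch.multinomial(logits.softmax(-1), 1)
            tokens = torch.cat([tokens, next_tok], dim=1)
        return tokens[:, 1:]  # token indices to be decoded into a spectrogram by the VQGAN decoder

sampler = CondSpectrogramSampler()
indices = sampler.sample(torch.randn(1, 16, VIDEO_DIM))   # (1, NUM_SPEC_TOKENS)

In the actual model, the returned indices would be mapped back to codebook vectors, decoded to a spectrogram, and converted to a waveform by the window-based GAN vocoder mentioned above.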
This paper offers a practical guide for developing Retrieval Augmented Generation (RAG) systems that utilize PDF documents as a knowledge source. It details step-by-step implementations using both OpenAI's Assistant API and open-source models like Llama, alongside insights into common technical challenges and a comparative analysis of each approach. The work includes an evaluation based on participant feedback, which highlighted the value of hands-on exercises and practical concerns such as data security.
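To make the RAG workflow above concrete, here is a minimal, generic sketch of the retrieval side over a PDF knowledge source. It is not the paper's implementation: the chunking scheme, embedding model ("all-MiniLM-L6-v2"), and file name are assumptions, and the assembled prompt would then be passed to whichever LLM (an OpenAI Assistant or a local Llama model) the reader chooses.

# A minimal, generic sketch of PDF-based retrieval for RAG -- illustrative only.
import numpy as np
from pypdf import PdfReader
from sentence_transformers import SentenceTransformer

def load_chunks(pdf_path, chunk_size=800):
    """Extract text from a PDF and split it into fixed-size character chunks."""
    text = " ".join(page.extract_text() or "" for page in PdfReader(pdf_path).pages)
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def top_k_chunks(question, chunks, embedder, k=3):
    """Rank chunks by cosine similarity to the question embedding."""
    emb = embedder.encode([question] + chunks)
    q, docs = emb[0], emb[1:]
    scores = docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q) + 1e-9)
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # assumed embedding model
chunks = load_chunks("knowledge.pdf")                # hypothetical document
question = "What does section 2 cover?"
context = "\n\n".join(top_k_chunks(question, chunks, embedder))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
# `prompt` would now be sent to the chosen LLM backend.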
The MEMERAG benchmark introduces the first native multilingual meta-evaluation dataset for Retrieval Augmented Generation (RAG) systems, providing human expert annotations for faithfulness and relevance across five languages. It enables the development of more reliable automatic evaluation methods for RAG, demonstrating that detailed prompting significantly improves LLM-as-a-judge alignment with human judgments.
Researchers from Tampere University and other European institutions developed a multi-agent Large Language Model system to enhance code review by providing comprehensive, contextually-aware feedback. The system identifies potential bugs, code smells, and inefficiencies, offering actionable recommendations and educational guidance across various programming languages and application domains.
Researchers introduced TactfulToM, a new benchmark designed to evaluate Large Language Models' (LLMs) understanding of "white lies" in realistic social conversations. The evaluation revealed that current LLMs significantly underperform human baselines in discerning prosocial motivations behind these subtle deceptions, even with advanced reasoning techniques.
Metaverse technologies demand accurate, real-time modeling on consumer-grade hardware for both non-human perception (e.g., drone/robot/autonomous-car navigation) and immersive technologies such as AR/VR, requiring both structural accuracy and photorealism. However, there is a knowledge gap in how to apply geometric reconstruction and photorealistic modeling (novel view synthesis) within a unified framework. To address this gap and promote the development of robust and immersive modeling and rendering with consumer-grade devices, we propose a real-world Multi-Sensor Hybrid Room Dataset (MuSHRoom). Our dataset poses challenging conditions and requires state-of-the-art methods to be cost-effective, robust to noisy data and devices, and able to jointly learn 3D reconstruction and novel view synthesis rather than treating them as separate tasks, making them well suited to real-world applications. We benchmark several well-known pipelines on our dataset for joint 3D mesh reconstruction and novel view synthesis. Our dataset and benchmark show strong potential for advancing the fusion of 3D reconstruction and high-quality rendering in a robust and computationally efficient end-to-end fashion. The dataset and code are available at the project website: this https URL
Affective computing has garnered the attention and interest of researchers in recent years, as there is a need for AI systems to better understand and react to human emotions. However, analyzing human emotions, such as mood or stress, is quite complex. While various stress studies use facial expressions and wearables, most existing datasets rely on processing data from a single modality. This paper presents EmpathicSchool, a novel dataset that captures facial expressions and the associated physiological signals, such as heart rate, electrodermal activity, and skin temperature, under different stress levels. The data was collected from 30 participants during different sessions for about ninety minutes each (for a total of 40 hours). The data includes seven different signal types, including both computer vision and physiological features that can be used to detect stress. In addition, various experiments were conducted to validate the signal quality.
This work enhances 3D Gaussian Splatting (3DGS) by integrating a physical image formation model directly into the differentiable rendering pipeline to compensate for motion blur and rolling shutter distortion. The approach leverages Visual-Inertial Odometry (VIO) to regularize pose and velocity optimization, yielding significantly sharper and more accurate 3D reconstructions from casually captured handheld video data, outperforming prior 2D and 3D compensation techniques.
We introduce V-AURA, the first autoregressive model to achieve high temporal alignment and relevance in video-to-audio generation. V-AURA uses a high-framerate visual feature extractor and a cross-modal audio-visual feature fusion strategy to capture fine-grained visual motion events and ensure precise temporal alignment. Additionally, we propose VisualSound, a benchmark dataset with high audio-visual relevance. VisualSound is based on VGGSound, a video dataset consisting of in-the-wild samples extracted from YouTube. During the curation, we remove samples where auditory events are not aligned with the visual ones. V-AURA outperforms current state-of-the-art models in temporal alignment and semantic relevance while maintaining comparable audio quality. Code, samples, VisualSound and models are available at this https URL
Opinion dynamics, the study of how individual beliefs and collective public opinion evolve, is a fertile domain for applying statistical physics to complex social phenomena. Like physical systems, societies exhibit macroscopic regularities arising from localized interactions, leading to outcomes such as consensus or fragmentation. This field has grown significantly, attracting interdisciplinary methods and driven by a surge in large-scale behavioral data. This review covers that rapid progress and bridges the dispersed literature. We begin with essential concepts and definitions, covering the nature of opinions and both microscopic and macroscopic dynamics. This foundation leads to an overview of empirical research, from lab experiments to large-scale data analysis, which informs and validates models of opinion dynamics. We then present individual-based models, categorized by their macroscopic phenomena (e.g., consensus, polarization, echo chambers) and microscopic mechanisms (e.g., homophily, assimilation). We also review social contagion phenomena, highlighting their connection to opinion dynamics. Furthermore, the review covers common analytical and computational tools, including stochastic processes, analytical treatments, simulations, and optimization. Finally, we explore emerging frontiers, such as connecting empirical data to models and using AI agents as testbeds for novel social phenomena. By systematizing terminology and emphasizing analogies with traditional physics, this review aims to consolidate knowledge, provide a robust theoretical foundation, and shape future research in opinion dynamics.
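As a concrete example of the individual-based models surveyed above, the following sketch implements a classic bounded-confidence (Deffuant-style) rule, in which agents interact only when their opinions lie within a confidence threshold (homophily) and then move toward each other (assimilation). The parameter values are illustrative defaults, not prescriptions from the review.

import numpy as np

# Minimal bounded-confidence (Deffuant-style) opinion dynamics sketch.
rng = np.random.default_rng(0)
N, EPSILON, MU, STEPS = 200, 0.2, 0.5, 50_000

opinions = rng.uniform(0.0, 1.0, N)   # continuous opinions in [0, 1]
for _ in range(STEPS):
    i, j = rng.integers(0, N, 2)      # pick a random pair of agents
    if i != j and abs(opinions[i] - opinions[j]) < EPSILON:   # homophily: interact only if close enough
        shift = MU * (opinions[j] - opinions[i])              # assimilation: move toward each other
        opinions[i] += shift
        opinions[j] -= shift

# The number of surviving opinion clusters depends on EPSILON (consensus vs. fragmentation).
clusters = np.unique(np.round(opinions, 2))
print(f"{len(clusters)} distinct opinion values remain (rounded to 0.01)")

Sweeping EPSILON in such a model reproduces the transition from fragmentation to consensus that the review's taxonomy of macroscopic phenomena describes.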
Conventional Fourier-domain Optical Coherence Tomography (FD-OCT) systems depend on resampling into the wavenumber (k) domain to extract the depth profile. This either necessitates additional hardware resources or amplifies the existing computational complexity. Moreover, OCT images also suffer from speckle noise due to the system's reliance on low-coherence interferometry. We propose a streamlined and computationally efficient deep-learning (DL) approach that reconstructs speckle-reduced OCT images directly from the wavelength domain. For reconstruction, two encoder-decoder style networks, namely a Spatial Domain Convolutional Neural Network (SD-CNN) and a Fourier Domain CNN (FD-CNN), are used sequentially. The SD-CNN takes the highly degraded images obtained by Fourier transforming the wavelength-domain fringes and reconstructs the deteriorated morphological structures while suppressing unwanted noise. The FD-CNN leverages this output to further enhance image quality through optimization in the Fourier domain (FD). We quantitatively and visually demonstrate the efficacy of the method in obtaining high-quality OCT images. Furthermore, we illustrate the reduction in computational complexity achieved by harnessing DL models. We believe this work lays the framework for further innovations in OCT image reconstruction.
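A minimal sketch of the two-stage idea above, using small stand-in encoder-decoder networks and simple domain transforms; the actual SD-CNN/FD-CNN architectures and the exact Fourier-domain processing are not specified in the abstract and are assumptions here.

import torch
import torch.nn as nn

def encoder_decoder(in_ch=1, base=16):
    """A deliberately tiny encoder-decoder standing in for SD-CNN / FD-CNN (assumed architecture)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, base, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(base, base * 2, 3, stride=2, padding=1), nn.ReLU(),
        nn.ConvTranspose2d(base * 2, base, 4, stride=2, padding=1), nn.ReLU(),
        nn.ConvTranspose2d(base, in_ch, 4, stride=2, padding=1),
    )

sd_cnn = encoder_decoder()   # spatial-domain refinement stage
fd_cnn = encoder_decoder()   # Fourier-domain refinement stage

fringes = torch.randn(1, 1, 256, 256)                 # dummy wavelength-domain interferogram
degraded = torch.fft.fft(fringes, dim=-1).abs()       # naive depth profile without k-resampling
stage1 = sd_cnn(degraded)                             # suppress artifacts in the spatial domain
spectrum = torch.fft.fft2(stage1).abs()
stage2 = fd_cnn(spectrum)                             # further refinement in the Fourier domain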
In this study, we present a comprehensive public dataset for driver drowsiness detection, integrating multimodal signals of facial, behavioral, and biometric indicators. Our dataset includes 3D facial video from a depth camera, IR camera footage, posterior videos, and biometric signals such as heart rate, electrodermal activity, blood oxygen saturation, skin temperature, and accelerometer data. The dataset also provides grip-sensor data from the steering wheel and telemetry data from the American Truck Simulator game, offering additional information about driver behavior while alert and drowsy. Drowsiness levels were self-reported every four minutes using the Karolinska Sleepiness Scale (KSS). The simulation environment uses a three-monitor setup, and the driving conditions closely resemble driving a real car. Data were collected from 19 subjects (15 M, 4 F) in two conditions: when they were fully alert and when they exhibited signs of sleepiness. Unlike other datasets, our multimodal dataset has a continuous duration of 40 minutes for each data collection session per subject, for a total length of 1,400 minutes, and records gradual changes in driver state rather than discrete alert/drowsy labels. This study aims to create a comprehensive multimodal dataset of driver drowsiness that captures a wide range of physiological, behavioral, and driving-related signals. The dataset will be available upon request to the corresponding author.
Existing multimodal audio generation models often lack precise user control, which limits their applicability in professional Foley workflows. In particular, these models attend to the entire video and lack precise mechanisms for prioritizing a specific object within a scene, often generating unnecessary background sounds or focusing on the wrong objects. To address this gap, we introduce the novel task of video object segmentation-aware audio generation, which explicitly conditions sound synthesis on object-level segmentation maps. We present SAGANet, a new multimodal generative model that enables controllable audio generation by leveraging visual segmentation masks along with video and textual cues. Our model provides users with fine-grained and visually localized control over audio generation. To support this task and further research on segmentation-aware Foley, we propose Segmented Music Solos, a benchmark dataset of musical instrument performance videos with segmentation information. Our method demonstrates substantial improvements over current state-of-the-art methods and sets a new standard for controllable, high-fidelity Foley synthesis. Code, samples, and Segmented Music Solos are available at this https URL
Systematic literature review (SLR) is foundational to evidence-based research, enabling scholars to identify, classify, and synthesize existing studies to address specific research questions. Conducting an SLR is, however, largely a manual process. In recent years, researchers have made significant progress in automating portions of the SLR pipeline to reduce the effort and time required for high-quality reviews; nevertheless, there remains a lack of AI-agent-based systems that automate the entire SLR workflow. To this end, we introduce a novel multi-AI-agent system designed to fully automate SLRs. Leveraging large language models (LLMs), our system streamlines the review process to enhance efficiency and accuracy. Through a user-friendly interface, researchers specify a topic; the system then generates a search string to retrieve relevant academic papers. Next, an inclusion/exclusion filtering step is applied to retain titles relevant to the research area. The system subsequently summarizes paper abstracts and retains only those directly related to the field of study. In the final phase, it conducts a thorough analysis of the selected papers with respect to predefined research questions. This paper presents the system, describes its operational framework, and demonstrates how it substantially reduces the time and effort traditionally required for SLRs while maintaining comprehensiveness and precision. The code for this project is available at: this https URL .
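The stages listed above map naturally onto a small pipeline of LLM-backed functions. The sketch below is a hypothetical skeleton only: call_llm is a placeholder stub, and the prompts, agent roles, and retrieval APIs of the actual system will differ.

# Hypothetical skeleton of an LLM-driven SLR pipeline -- a sketch, not the project's code.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("Plug in your LLM provider here.")

def generate_search_string(topic: str) -> str:
    return call_llm(f"Write a Boolean search string for a systematic review on: {topic}")

def filter_by_title(titles: list[str], topic: str) -> list[str]:
    kept = []
    for title in titles:
        verdict = call_llm(f"Topic: {topic}\nTitle: {title}\nAnswer INCLUDE or EXCLUDE.")
        if verdict.strip().upper().startswith("INCLUDE"):
            kept.append(title)
    return kept

def screen_abstract(abstract: str, topic: str) -> bool:
    answer = call_llm(f"Is this abstract directly relevant to '{topic}'? Answer YES or NO.\n{abstract}")
    return answer.strip().upper().startswith("YES")

def analyze_paper(paper_text: str, research_questions: list[str]) -> str:
    rq = "\n".join(research_questions)
    return call_llm(f"Analyze the paper below with respect to these research questions:\n{rq}\n\n{paper_text}")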
Self-supervised learning (SSL) has made significant advances in speech representation learning. Models like wav2vec 2.0 and HuBERT have achieved state-of-the-art results in tasks such as speech recognition, particularly in monolingual settings. However, multilingual SSL models tend to underperform their monolingual counterparts on each individual language, especially in multilingual scenarios with few languages such as the bilingual setting. In this work, we investigate a novel approach to reduce this performance gap by introducing limited visual grounding into bilingual speech SSL models. Our results show that visual grounding benefits both monolingual and bilingual models, with especially pronounced gains for the latter, reducing the multilingual performance gap on zero-shot phonetic discrimination from 31.5% for audio-only models to 8.04% with grounding.
The research by Tampere University and University of Jyväskylä develops an AI-based multiagent system that uses Large Language Models (LLMs) to automate and improve requirements elicitation, analysis, and prioritization in software development. This system, which simulates different team roles, generated more comprehensive and higher-quality user stories with GPT-4o, while Mixtral-8x7B offered the fastest response times for initial drafts.
Audio question answering (AQA) is a multimodal translation task where a system analyzes an audio signal and a natural language question to generate a desirable natural language answer. In this paper, we introduce Clotho-AQA, a dataset for audio question answering consisting of 1,991 audio files, each between 15 and 30 seconds in duration, selected from the Clotho dataset. For each audio file, we collect six different questions and corresponding answers by crowdsourcing using Amazon Mechanical Turk. The questions and answers are produced by different annotators. Out of the six questions for each audio file, two are designed to have 'yes' as the answer and two to have 'no', while the remaining two have other single-word answers. For each question, we collect answers from three different annotators. We also present two baseline experiments to demonstrate the usage of our dataset for the AQA task: an LSTM-based multimodal binary classifier for 'yes'/'no' answers and an LSTM-based multimodal multi-class classifier for 828 single-word answers. The binary classifier achieved an accuracy of 62.7%, and the multi-class classifier achieved a top-1 accuracy of 54.2% and a top-5 accuracy of 93.7%. The Clotho-AQA dataset is freely available online at this https URL
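For the binary 'yes'/'no' baseline described above, a minimal sketch of an LSTM-based multimodal classifier might look as follows; the feature dimensions, single-layer LSTMs, and fusion by concatenating final hidden states are assumptions rather than the paper's exact configuration.

import torch
import torch.nn as nn

class BinaryAQABaseline(nn.Module):
    """Sketch of an LSTM-based multimodal yes/no classifier for audio question answering."""
    def __init__(self, audio_dim=64, word_dim=300, hidden=128):
        super().__init__()
        self.audio_lstm = nn.LSTM(audio_dim, hidden, batch_first=True)
        self.text_lstm = nn.LSTM(word_dim, hidden, batch_first=True)
        self.classifier = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, audio_feats, question_embs):
        # audio_feats: (B, T_audio, audio_dim); question_embs: (B, T_words, word_dim)
        _, (a_h, _) = self.audio_lstm(audio_feats)
        _, (q_h, _) = self.text_lstm(question_embs)
        fused = torch.cat([a_h[-1], q_h[-1]], dim=-1)   # concatenate final hidden states
        return torch.sigmoid(self.classifier(fused))    # probability that the answer is 'yes'

model = BinaryAQABaseline()
prob_yes = model(torch.randn(2, 500, 64), torch.randn(2, 12, 300))   # dummy batch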