Center for Machine Vision and Signal Analysis (CMVS), University of Oulu

Extended multi-stream temporal-attention module for skeleton-based human action recognition (HAR)

10 Nov 2024

LUT University, University of Oulu

An Extended Multi-stream Temporal-attention Adaptive GCN (EMS-TAGCN) is presented to enhance skeleton-based human action recognition by integrating adaptive graph topology learning, processing diverse skeletal data streams, and employing spatial-temporal-channel attention. The model achieved state-of-the-art performance, with accuracy gains of up to 2.34% on UCF-101 and 1.4% on NTU-RGBD cross-view over existing methods.

#attention-mechanisms #computer-science #computer-vision-security

AffectGPT: A New Dataset, Model, and Benchmark for Emotion Understanding with Multimodal Large Language Models

07 May 2025

孙海洋

Chinese Academy of Sciences

Shanghai Jiao Tong University

AffectGPT introduces a new dataset, model, and benchmark to advance multimodal large language models in generative, descriptive emotion understanding. The proposed AffectGPT model, utilizing a specialized pre-fusion architecture and trained on the MER-Caption dataset, achieves over a 9% performance improvement compared to existing MLLMs on a unified evaluation framework.

#computer-science #human-computer-interaction

Reloc3r: Large-Scale Training of Relative Camera Pose Regression for Generalizable, Fast, and Accurate Visual Localization

21 Mar 2025

ETH Zurich, The University of Hong Kong

Reloc3r presents a visual localization framework leveraging a large-scale trained relative camera pose regression network built on a Vision Transformer backbone and a minimalist motion averaging module. This approach achieves state-of-the-art accuracy, exhibits robust generalization across diverse unseen scenes, and maintains real-time inference speeds.

#computer-science #computer-vision-security #computer-vision-and-pattern-recognition

Efficient Reinforcement Learning by Guiding Generalist World Models with Non-Curated Data

18 May 2025

Max Planck Institute for Intelligent Systems, University of Edinburgh

Leveraging offline data is a promising way to improve the sample efficiency of online reinforcement learning (RL). This paper expands the pool of usable data for offline-to-online RL by leveraging abundant non-curated data that is reward-free, of mixed quality, and collected across multiple embodiments. Although learning a world model appears promising for utilizing such data, we find that naive fine-tuning fails to accelerate RL training on many tasks. Through careful investigation, we attribute this failure to the distributional shift between offline and online data during fine-tuning. To address this issue and effectively use the offline data, we propose two essential techniques: i) experience rehearsal and ii) execution guidance. With these modifications, the non-curated offline data substantially improves RL's sample efficiency. Under limited sample budgets, our method achieves a 102.8% relative improvement in aggregate score over learning-from-scratch baselines across 72 visuomotor tasks spanning 6 embodiments. On challenging tasks such as locomotion and robotic manipulation, it outperforms prior methods that utilize offline data by a decent margin.

#computer-science #machine-learning #robotics

Chem3DLLM: 3D Multimodal Large Language Models for Chemistry

14 Aug 2025

Shanghai Artificial Intelligence Laboratory

University of Science and Technology of China

In the real world, a molecule is a 3D geometric structure. Compared to 1D SMILES sequences and 2D molecular graphs, 3D molecules represent the most informative molecular modality. Despite the rapid progress of autoregressive-based language models, they cannot handle the generation of 3D molecular conformation due to several challenges: 1) 3D molecular structures are incompatible with LLMs' discrete token space, 2) integrating heterogeneous inputs like proteins, ligands, and text remains difficult within a unified model, and 3) LLMs lack essential scientific priors, hindering the enforcement of physical and chemical constraints during generation. To tackle these issues, we present Chem3DLLM, a unified protein-conditioned multimodal large language model. Our approach designs a novel reversible text encoding for 3D molecular structures using run-length compression, achieving 3x size reduction while preserving complete structural information. This enables seamless integration of molecular geometry with protein pocket features in a single LLM architecture. We employ reinforcement learning with stability-based rewards to optimize chemical validity and incorporate a lightweight protein embedding projector for end-to-end training. Experimental results on structure-based drug design demonstrate state-of-the-art performance with a Vina score of -7.21, validating our unified multimodal approach for practical drug discovery applications.

#computer-science #computational-engineering-finance-and-science
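The Chem3DLLM abstract above mentions a reversible text encoding built on run-length compression. As a rough illustration of why such an encoding can be lossless, here is a minimal generic run-length sketch in Python. It is not the paper's encoding: the token stream, `rle_encode`, and `rle_decode` are hypothetical names chosen for this example.

```python
from itertools import groupby


def rle_encode(tokens):
    """Collapse each run of identical tokens into a (token, count) pair."""
    return [(tok, sum(1 for _ in run)) for tok, run in groupby(tokens)]


def rle_decode(pairs):
    """Invert rle_encode by expanding every (token, count) pair into a run."""
    return [tok for tok, count in pairs for _ in range(count)]


# Hypothetical token stream with long repeated runs (an assumption for
# illustration only, not the representation used by Chem3DLLM).
tokens = ["0"] * 6 + ["C"] + ["0"] * 3 + ["N", "N"]
encoded = rle_encode(tokens)

# The round trip recovers the input exactly, which is what "reversible" means.
assert rle_decode(encoded) == tokens
```

Here the 12 input tokens collapse into 4 pairs. How much a real structure compresses depends entirely on how repetitive its serialization is.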

SoftCFG: Uncertainty-guided Stable Guidance for Visual Autoregressive Model

02 Oct 2025

KU Leuven, University of Oulu

Autoregressive (AR) models have emerged as powerful tools for image generation by modeling images as sequences of discrete tokens. While Classifier-Free Guidance (CFG) has been adopted to improve conditional generation, its application in AR models faces two key issues: guidance diminishing, where the conditional-unconditional gap quickly vanishes as decoding progresses, and over-guidance, where strong conditions distort visual coherence. To address these challenges, we propose SoftCFG, an uncertainty-guided inference method that distributes adaptive perturbations across all tokens in the sequence. The key idea behind SoftCFG is to let each generated token contribute certainty-weighted guidance, ensuring that the signal persists across steps while resolving conflicts between text guidance and visual context. To further stabilize long-sequence generation, we introduce Step Normalization, which bounds cumulative perturbations of SoftCFG. Our method is training-free, model-agnostic, and seamlessly integrates with existing AR pipelines. Experiments show that SoftCFG significantly improves image quality over standard CFG and achieves state-of-the-art FID on ImageNet 256×256 among autoregressive models.

#computer-science #computer-vision-and-pattern-recognition #generative-models
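For context on the SoftCFG abstract above, the following Python sketch shows the standard classifier-free guidance step that it takes as its baseline, applied to next-token logits. It does not implement SoftCFG or Step Normalization. The function name `cfg_logits` and the toy logits are assumptions made for illustration.

```python
import numpy as np


def cfg_logits(cond_logits, uncond_logits, scale):
    """Standard CFG: start at the unconditional logits and move toward the
    conditional ones by a factor of `scale` (scale=1 gives the conditional
    logits unchanged, scale=0 gives the unconditional ones)."""
    cond = np.asarray(cond_logits, dtype=float)
    uncond = np.asarray(uncond_logits, dtype=float)
    return uncond + scale * (cond - uncond)


# Toy two-token vocabulary (hypothetical values).
cond = [2.0, 0.0]
uncond = [1.0, 1.0]
guided = cfg_logits(cond, uncond, scale=3.0)  # equals [4.0, -2.0]
```

The guided logits depend on the conditioning only through the gap `cond - uncond`, so when that gap shrinks toward zero the guidance term vanishes regardless of `scale`. That is the "guidance diminishing" issue the abstract describes.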

OV-MER: Towards Open-Vocabulary Multimodal Emotion Recognition

07 May 2025

孙海洋

Chinese Academy of Sciences

Shanghai Jiao Tong University

Multimodal Emotion Recognition (MER) is a critical research area that seeks to decode human emotions from diverse data modalities. However, existing machine learning methods predominantly rely on predefined emotion taxonomies, which fail to capture the inherent complexity, subtlety, and multi-appraisal nature of human emotional experiences, as demonstrated by studies in psychology and cognitive science. To overcome this limitation, we advocate for introducing the concept of open vocabulary into MER. This paradigm shift aims to enable models to predict emotions beyond a fixed label space, accommodating a flexible set of categories to better reflect the nuanced spectrum of human emotions. To achieve this, we propose a novel paradigm: Open-Vocabulary MER (OV-MER), which enables emotion prediction without being confined to predefined spaces. However, constructing a dataset that encompasses the full range of emotions for OV-MER is practically infeasible; hence, we present a comprehensive solution including a newly curated database, novel evaluation metrics, and a preliminary benchmark. By advancing MER from basic emotions to more nuanced and diverse emotional states, we hope this work can inspire the next generation of MER, enhancing its generalizability and applicability in real-world scenarios. Code and dataset are available at: this https URL

#computer-science #human-computer-interaction

MER 2025: When Affective Computing Meets Large Language Models

29 Apr 2025

Tianjin University, Chinese Academy of Sciences

MER2025 is the third year of our MER series of challenges, aiming to bring together researchers in the affective computing community to explore emerging trends and future directions in the field. Previously, MER2023 focused on multi-label learning, noise robustness, and semi-supervised learning, while MER2024 introduced a new track dedicated to open-vocabulary emotion recognition. This year, MER2025 centers on the theme "When Affective Computing Meets Large Language Models (LLMs)". We aim to shift the paradigm from traditional categorical frameworks reliant on predefined emotion taxonomies to LLM-driven generative methods, offering innovative solutions for more accurate and reliable emotion understanding. The challenge features four tracks: MER-SEMI focuses on fixed categorical emotion recognition enhanced by semi-supervised learning; MER-FG explores fine-grained emotions, expanding recognition from basic to nuanced emotional states; MER-DES incorporates multimodal cues (beyond emotion words) into predictions to enhance model interpretability; MER-PR investigates whether emotion prediction results can improve personality recognition performance. For the first three tracks, baseline code is available at MERTools, and datasets can be accessed via Hugging Face. For the last track, the dataset and baseline code are available on GitHub.

#computer-science #human-computer-interaction

Pixel Difference Networks for Efficient Edge Detection

16 Aug 2021

Xidian University, National University of Defense Technology

Deep Convolutional Neural Networks (CNNs) can now achieve human-level performance in edge detection thanks to their rich and abstract edge representation capacities. However, the high performance of CNN-based edge detection is achieved with a large pretrained CNN backbone, which is memory- and energy-consuming. In addition, it is surprising that the wisdom of traditional edge detectors, such as Canny, Sobel, and LBP, is rarely investigated in the rapidly developing deep learning era. To address these issues, we propose a simple, lightweight yet effective architecture named Pixel Difference Network (PiDiNet) for efficient edge detection. Extensive experiments on BSDS500, NYUD, and Multicue are provided to demonstrate its effectiveness and its high training and inference efficiency. Surprisingly, when trained from scratch with only the BSDS500 and VOC datasets, PiDiNet can surpass the recorded result of human perception (0.807 vs. 0.803 in ODS F-measure) on the BSDS500 dataset with 100 FPS and less than 1M parameters. A faster version of PiDiNet with less than 0.1M parameters can still achieve performance comparable to the state of the art at 200 FPS. Results on the NYUD and Multicue datasets show similar observations. The codes are available at this https URL.

#computer-science #computer-vision-security #computer-vision-and-pattern-recognition

Towards Consistent and Controllable Image Synthesis for Face Editing

26 Nov 2025

Southeast University, University of Oulu

Face editing methods, essential for tasks like virtual avatars, digital human synthesis and identity preservation, have traditionally been built upon GAN-based techniques, while recent focus has shifted to diffusion-based models due to their success in image reconstruction. However, diffusion models still face challenges in controlling specific attributes and preserving the consistency of other unchanged attributes, especially the identity characteristics. To address these issues and facilitate more convenient editing of face images, we propose a novel approach that leverages the power of Stable-Diffusion (SD) models and crude 3D face models to control the lighting, facial expression and head pose of a portrait photo. We observe that this task essentially involves combining the target background, the identity, and the face attributes to be edited. We strive to sufficiently disentangle the control of these factors to enable consistency of face editing. Specifically, our method, coined RigFace, contains: 1) A Spatial Attribute Encoder that provides precise and decoupled conditions of background, pose, expression and lighting; 2) A high-consistency FaceFusion method that transfers identity features from the Identity Encoder to the denoising UNet of a pre-trained SD model; 3) An Attribute Rigger that injects those conditions into the denoising UNet. Our model achieves comparable or even superior performance in both identity preservation and photorealism compared to existing face editing models.

#computer-science #computer-vision-and-pattern-recognition #generative-models

IA-VLA: Input Augmentation for Vision-Language-Action models in settings with semantically complex tasks

29 Sep 2025

Aalto University, University of Oulu

The IA-VLA framework enables Vision-Language-Action (VLA) models to interpret complex semantic instructions by offloading object identification to a larger Vision-Language Model (VLM) for input augmentation. This approach significantly improves VLA generalization, especially for tasks involving visually indistinguishable duplicate objects and novel instructions.

#computer-science #robotics

MagicFace: High-Fidelity Facial Expression Editing with Action-Unit Control

20 Nov 2025

Southeast University, University of Oulu

We address the problem of facial expression editing by controlling the relative variation of facial action units (AUs) from the same person. This enables us to edit this specific person's expression in a fine-grained, continuous and interpretable manner, while preserving their identity, pose, background and detailed facial attributes. Key to our model, which we dub MagicFace, is a diffusion model conditioned on AU variations and an ID encoder to preserve facial details of high consistency. Specifically, to preserve the facial details with the input identity, we leverage the power of pretrained Stable-Diffusion models and design an ID encoder to merge appearance features through self-attention. To keep background and pose consistency, we introduce an efficient Attribute Controller by explicitly informing the model of current background and pose of the target. By injecting AU variations into a denoising UNet, our model can animate arbitrary identities with various AU combinations, yielding superior results in high-fidelity expression editing compared to other facial expression editing works. Code is publicly available at this https URL.

#computer-science #computer-vision-and-pattern-recognition #facial-recognition

MagicPortrait: Temporally Consistent Face Reenactment with 3D Geometric Guidance

29 Oct 2025

University of Oulu

In this study, we propose a method for video face reenactment that integrates a 3D face parametric model into a latent diffusion framework, aiming to improve shape consistency and motion control in existing video-based face generation approaches. Our approach employs the FLAME (Faces Learned with an Articulated Model and Expressions) model as the 3D face parametric representation, providing a unified framework for modeling face expressions and head pose. This not only enables precise extraction of motion features from driving videos, but also contributes to the faithful preservation of face shape and geometry. Specifically, we enhance the latent diffusion model with rich 3D expression and detailed pose information by incorporating depth maps, normal maps, and rendering maps derived from FLAME sequences. These maps serve as motion guidance and are encoded into the denoising UNet through a specifically designed Geometric Guidance Encoder (GGE). A multi-layer feature fusion module with integrated self-attention mechanisms is used to combine facial appearance and motion latent features within the spatial domain. By utilizing the 3D face parametric model as motion guidance, our method enables parametric alignment of face identity between the reference image and the motion captured from the driving video. Experimental results on benchmark datasets show that our method excels at generating high-quality face animations with precise expression and head pose variation modeling. In addition, it demonstrates strong generalization performance on out-of-domain images. Code is publicly available at this https URL.

#computer-science #computer-vision-and-pattern-recognition #facial-recognition

KV Cache Compression for Inference Efficiency in LLMs: A Review

08 Aug 2025

University of Oulu, Shandong University of Science and Technology

With the rapid advancement of large language models (LLMs), the context length for inference has been continuously increasing, leading to an exponential growth in the demand for Key-Value (KV) caching. This has resulted in a significant memory bottleneck, limiting the inference efficiency and scalability of the models. Therefore, optimizing the KV cache during inference is crucial for enhancing performance and efficiency. This review systematically examines current KV cache optimization techniques, including compression strategies such as selective token strategies, quantization, and attention compression. We evaluate the effectiveness, trade-offs, and application scenarios of these methods, providing a comprehensive analysis of their impact on memory usage and inference speed. We focus on identifying the limitations and challenges of existing methods, such as compatibility issues with different models and tasks. Additionally, this review highlights future research directions, including hybrid optimization techniques, adaptive dynamic strategies, and software-hardware co-design. These approaches aim to improve inference efficiency and promote the practical application of large language models.

#computer-science #distributed-parallel-and-cluster-computing

Identifying the Best Machine Learning Algorithms for Brain Tumor Segmentation, Progression Assessment, and Overall Survival Prediction in the BRATS Challenge

23 Apr 2019

Weichung Wang

Arlindo Oliveira

ETH Zurich, University of Washington

Gliomas are the most common primary brain malignancies, with different degrees of aggressiveness, variable prognosis and various heterogeneous histologic sub-regions, i.e., peritumoral edematous/invaded tissue, necrotic core, active and non-enhancing core. This intrinsic heterogeneity is also portrayed in their radio-phenotype, as their sub-regions are depicted by varying intensity profiles disseminated across multi-parametric magnetic resonance imaging (mpMRI) scans, reflecting varying biological properties. Their heterogeneous shape, extent, and location are some of the factors that make these tumors difficult to resect, and in some cases inoperable. The amount of resected tumor is a factor also considered in longitudinal scans, when evaluating the apparent tumor for potential diagnosis of progression. Furthermore, there is mounting evidence that accurate segmentation of the various tumor sub-regions can offer the basis for quantitative image analysis towards prediction of patient overall survival. This study assesses the state-of-the-art machine learning (ML) methods used for brain tumor image analysis in mpMRI scans, during the last seven instances of the International Brain Tumor Segmentation (BraTS) challenge, i.e., 2012-2018. Specifically, we focus on i) evaluating segmentations of the various glioma sub-regions in pre-operative mpMRI scans, ii) assessing potential tumor progression by virtue of longitudinal growth of tumor sub-regions, beyond use of the RECIST/RANO criteria, and iii) predicting the overall survival from pre-operative mpMRI scans of patients that underwent gross total resection. Finally, we investigate the challenge of identifying the best ML algorithms for each of these tasks, considering that apart from being diverse on each instance of the challenge, the multi-institutional mpMRI BraTS dataset has also been a continuously evolving/growing dataset.

#ai-for-health #computer-science #artificial-intelligence

Massive Discovery of Low-Dimensional Materials from Universal Computational Strategy

24 Sep 2025

Chalmers University of Technology, University of Oulu

Low-dimensional materials have attractive properties that drive intense efforts for novel materials discovery. However, experiments are tedious for systematic discovery, and present computational methods are often tuned to two-dimensional (2D) materials, overlooking other low-dimensional materials. Here, we combined universal machine-learning interatomic potentials (UMLIPs) and an advanced, interatomic force constant (FC) -based dimensionality classification method to make a massive discovery of novel low-dimensional materials. We first benchmarked UMLIPs' first-principles-level accuracy in quantifying FCs and calculated phonons for 35,689 materials from the Materials Project database. We then used the FC-based method for dimensionality classification to discover 9139 low-dimensional materials, including 1838 0D clusters, 1760 1D chains, 3057 2D sheets/layers, and 2484 mixed-dimensionality materials, all of which conventional geometric descriptors have not recognized. By calculating the binding energies for the discovered 2D materials, we also identified 960 sheets that could be easily or potentially exfoliated from their parent bulk structures.

#materials-science #physics #computational-physics

Rapid Salient Object Detection with Difference Convolutional Neural Networks

01 Jul 2025

Nankai University, Intel Labs

This paper addresses the challenge of deploying salient object detection (SOD) on resource-constrained devices with real-time performance. While recent advances in deep neural networks have improved SOD, existing top-leading models are computationally expensive. We propose an efficient network design that combines traditional wisdom on SOD and the representation power of modern CNNs. Like biologically-inspired classical SOD methods relying on computing contrast cues to determine saliency of image regions, our model leverages Pixel Difference Convolutions (PDCs) to encode the feature contrasts. Differently, PDCs are incorporated in a CNN architecture so that the valuable contrast cues are extracted from rich feature maps. For efficiency, we introduce a difference convolution reparameterization (DCR) strategy that embeds PDCs into standard convolutions, eliminating computation and parameters at inference. Additionally, we introduce SpatioTemporal Difference Convolution (STDC) for video SOD, enhancing the standard 3D convolution with spatiotemporal contrast capture. Our models, SDNet for image SOD and STDNet for video SOD, achieve significant improvements in efficiency-accuracy trade-offs. On a Jetson Orin device, our models with <1M parameters operate at 46 FPS and 150 FPS on streamed images and videos, surpassing the second-best lightweight models in our experiments by more than 2× and 3× in speed with superior accuracy. Code will be available at this https URL.

#computer-science #computer-vision-and-pattern-recognition #hardware-aware-algorithms
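The abstract above says that pixel difference convolutions can be embedded into standard convolutions at inference time. The sketch below illustrates that idea in NumPy for one simple case, a central-difference convolution, where each weight multiplies the difference between a pixel and the patch center. The choice of the central-difference variant, the single-channel setting, and all function names are assumptions for illustration. The abstract does not say which variants the authors use, and the paper's DCR strategy may differ.

```python
import numpy as np


def conv2d_valid(x, w):
    """Plain single-channel 2D cross-correlation with 'valid' padding."""
    kh, kw = w.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w)
    return out


def central_pdc(x, w):
    """Central pixel-difference convolution: weights act on (pixel - center).
    Assumes an odd-sized kernel so the patch has a well-defined center."""
    kh, kw = w.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i:i + kh, j:j + kw]
            center = patch[kh // 2, kw // 2]
            out[i, j] = np.sum((patch - center) * w)
    return out


def reparameterize(w):
    """Fold the differencing into the kernel: sum_k w_k (x_k - x_c) equals
    sum_k w_k x_k - (sum_k w_k) x_c, so subtract sum(w) from the center weight."""
    w_std = w.copy()
    w_std[w.shape[0] // 2, w.shape[1] // 2] -= w.sum()
    return w_std


rng = np.random.default_rng(0)
x = rng.normal(size=(6, 6))
w = rng.normal(size=(3, 3))

# The difference convolution and the reparameterized standard convolution agree.
assert np.allclose(central_pdc(x, w), conv2d_valid(x, reparameterize(w)))
```

Because the folded kernel has the same shape as the original, the differencing adds no inference-time cost in this simplified setting.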

A Review of Uncertainty Quantification in Deep Learning: Techniques, Applications and Challenges

06 Jan 2021

University of Waterloo, Google Research

Uncertainty quantification (UQ) plays a pivotal role in the reduction of uncertainties during both optimization and decision-making processes. It can be applied to solve a variety of real-world problems in science and engineering. Bayesian approximation and ensemble learning techniques are the two most widely used UQ methods in the literature. In this regard, researchers have proposed different UQ methods and examined their performance in a variety of applications such as computer vision (e.g., self-driving cars and object detection), image processing (e.g., image restoration), medical image analysis (e.g., medical image classification and segmentation), natural language processing (e.g., text classification, social media texts and recidivism risk-scoring), bioinformatics, etc. This study reviews recent advances in UQ methods used in deep learning. Moreover, we also investigate the application of these methods in reinforcement learning (RL). Then, we outline a few important applications of UQ methods. Finally, we briefly highlight the fundamental research challenges faced by UQ methods and discuss the future research directions in this field.

#ai-for-health #bayesian-deep-learning #computer-science

PSScreen: Partially Supervised Multiple Retinal Disease Screening

01 Oct 2025

University of Oulu, Center for Machine Vision and Signal Analysis (CMVS)

A new model, PSScreen, was developed to screen for multiple retinal diseases using partially labeled datasets from diverse medical sites, addressing challenges like domain shifts and incomplete annotations. The method established a new benchmark in partially supervised learning, achieving superior domain generalization on unseen data compared to prior state-of-the-art approaches and outperforming leading vision-language foundation models in zero-shot screening.

#ai-for-health #computer-science #computer-vision-and-pattern-recognition

Power-Dominance in Estimation Theory: A Third Pathological Axis

22 Sep 2025

University of Oulu

Researchers from the University of Oulu introduce "power-dominance" as a third pathological axis in estimation theory, expanding beyond the classical bias-variance decomposition. This framework reveals that estimators exceeding the true signal's mean power incur an unavoidable mean-squared error penalty, and establishes that optimal estimators inherently operate in a power-conservative regime.

#signal-processing #electrical-engineering #mathematics
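For reference, the classical bias-variance decomposition that the power-dominance summary above builds on can be written as follows for an estimator \hat{\theta} of a fixed (non-random) parameter \theta. This is the textbook identity only. The power-dominance term itself is not reproduced here because the summary does not state its form.

```latex
\mathbb{E}\big[(\hat{\theta} - \theta)^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{\theta}] - \theta\big)^2}_{\text{squared bias}}
  + \underbrace{\mathbb{E}\big[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2\big]}_{\text{variance}}
```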
