An Extended Multi-stream Temporal-attention Adaptive GCN (EMS-TAGCN) is presented to enhance skeleton-based human action recognition by integrating adaptive graph topology learning, processing diverse skeletal data streams, and employing spatial-temporal-channel attention. The model achieved state-of-the-art performance, with accuracy gains of up to 2.34% on UCF-101 and 1.4% on NTU-RGBD cross-view over existing methods.
AffectGPT introduces a new dataset, model, and benchmark to advance multimodal large language models in generative, descriptive emotion understanding. The proposed AffectGPT model, utilizing a specialized pre-fusion architecture and trained on the MER-Caption dataset, achieves over a 9% performance improvement compared to existing MLLMs on a unified evaluation framework.
Reloc3r presents a visual localization framework leveraging a large-scale trained relative camera pose regression network built on a Vision Transformer backbone and a minimalist motion averaging module. This approach achieves state-of-the-art accuracy, exhibits robust generalization across diverse unseen scenes, and maintains real-time inference speeds.
The IA-VLA framework enables Vision-Language-Action (VLA) models to interpret complex semantic instructions by offloading object identification to a larger Vision-Language Model (VLM) for input augmentation. This approach significantly improves VLA generalization, especially for tasks involving visually indistinguishable duplicate objects and novel instructions.
A new model, PSScreen, was developed to screen for multiple retinal diseases using partially labeled datasets from diverse medical sites, addressing challenges like domain shifts and incomplete annotations. The method established a new benchmark in partially supervised learning, achieving superior domain generalization on unseen data compared to prior state-of-the-art approaches and outperforming leading vision-language foundation models in zero-shot screening.
Researchers from the University of Oulu introduce "power-dominance" as a third pathological axis in estimation theory, expanding beyond the classical bias-variance decomposition. This framework reveals that estimators exceeding the true signal's mean power incur an unavoidable mean-squared error penalty, and establishes that optimal estimators inherently operate in a power-conservative regime.