West China Biomedical Big Data Center, West China Hospital, Sichuan University
From Code Foundation Models to Agents and Applications: A Comprehensive Survey and Practical Guide to Code Intelligence

A comprehensive synthesis of Large Language Models for automated software development covers the entire model lifecycle, from data curation to autonomous agents, and offers practical guidance derived from empirical experiments on pre-training, fine-tuning, and reinforcement learning, alongside a detailed analysis of challenges and future directions.

Accelerating Diffusion Transformers with Token-wise Feature Caching

Token-wise Feature Caching (ToCa) accelerates Diffusion Transformers without requiring model retraining by adaptively caching intermediate features at a granular token and layer level. This method achieves up to 2.75x speedup while preserving or improving generation quality across various text-to-image and text-to-video models.
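
As a rough illustration of the caching idea, the toy sketch below recomputes a block only for the tokens whose inputs have drifted most since they were last refreshed and reuses cached outputs for the rest; the ToyBlock, the drift-based scoring, and the 25% recompute ratio are illustrative assumptions, not ToCa's actual algorithm (which also handles attention layers and layer-wise schedules).

```python
# Minimal sketch of token-wise feature caching across diffusion steps.
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    """Stand-in for one Diffusion Transformer block (per-token MLP only)."""
    def __init__(self, dim: int):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):                       # x: (tokens, dim)
        return x + self.ff(x)

def cached_step(block, x, cache, recompute_ratio=0.25):
    """Recompute the block only for the tokens whose inputs drifted most,
    reusing cached outputs for all other tokens."""
    if cache is None:                           # first step: full computation
        out = block(x)
        return out, (x.detach(), out.detach())
    cached_x, cached_out = cache
    drift = (x - cached_x).norm(dim=-1)         # per-token change since last refresh
    k = max(1, int(recompute_ratio * x.shape[0]))
    idx = drift.topk(k).indices                 # tokens to recompute this step
    out = cached_out.clone()
    out[idx] = block(x[idx])                    # partial recomputation
    cached_x = cached_x.clone()
    cached_x[idx] = x[idx].detach()             # remember when each token was refreshed
    return out, (cached_x, out.detach())

torch.manual_seed(0)
block, x, cache = ToyBlock(64), torch.randn(256, 64), None
for step in range(4):                           # pretend these are denoising steps
    out, cache = cached_step(block, x, cache)
    x = x + 0.01 * torch.randn_like(x)          # token inputs drift slowly between steps
print("final output shape:", tuple(out.shape))
```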

Video Compression Commander: Plug-and-Play Inference Acceleration for Video Large Language Models

Video Compression Commander (VidCom²) introduces a plug-and-play inference acceleration framework for Video Large Language Models (VideoLLMs) that intelligently prunes redundant visual tokens. The approach reduces LLM generation latency by 70.8% and peak GPU memory usage by approximately 9.6%, while retaining 99.6% of original performance at 25% token retention.
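
The sketch below illustrates the general flavor of visual-token pruning for VideoLLMs: score each frame's tokens and keep only the most distinctive ones. The frame-mean redundancy heuristic and the 25% retention are placeholders, not VidCom²'s actual scoring.

```python
# Minimal sketch of per-frame visual-token pruning for a VideoLLM.
import torch
import torch.nn.functional as F

def prune_video_tokens(frames, keep_ratio=0.25):
    """frames: (num_frames, tokens_per_frame, dim) visual features.
    Returns a list of (kept_indices, kept_tokens) per frame."""
    kept = []
    for frame in frames:                                    # (tokens, dim)
        mean = frame.mean(dim=0, keepdim=True)              # frame-level summary
        # Tokens very similar to the frame mean are treated as redundant.
        uniqueness = 1.0 - F.cosine_similarity(frame, mean, dim=-1)
        k = max(1, int(keep_ratio * frame.shape[0]))
        idx = uniqueness.topk(k).indices.sort().values       # keep original token order
        kept.append((idx, frame[idx]))
    return kept

video = torch.randn(8, 576, 1024)                            # e.g. 8 frames of ViT patch tokens
pruned = prune_video_tokens(video)
print(f"kept {pruned[0][1].shape[0]} of {video.shape[1]} tokens per frame")
```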

FNSPID: A Comprehensive Financial News Dataset in Time Series

FNSPID is a comprehensive financial dataset integrating 24 years of time-aligned stock prices and financial news for 4,775 S&P500 companies, featuring LLM-derived sentiment scores and multiple summarization methods. Experiments using FNSPID showed that Transformer models achieved a 0.988 R-squared for stock prediction, with LLM-based sentiment consistently improving accuracy across various deep learning architectures.
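
As a purely illustrative sketch of how such a dataset can be used, the snippet below joins daily LLM sentiment scores onto time-aligned prices to build features for a forecasting model; the column names and values are assumptions, not FNSPID's actual schema or loader.

```python
# Toy example: align LLM-derived news sentiment with daily prices as model features.
import pandas as pd

prices = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-04"]),
    "symbol": ["AAPL"] * 3,
    "close": [185.6, 184.3, 181.9],
})
news = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-02", "2024-01-04"]),
    "symbol": ["AAPL", "AAPL"],
    "sentiment": [0.7, -0.4],          # hypothetical LLM sentiment score in [-1, 1]
})

# Left-join sentiment onto prices; days without news get a neutral score.
merged = prices.merge(news, on=["date", "symbol"], how="left").fillna({"sentiment": 0.0})
merged["next_close"] = merged.groupby("symbol")["close"].shift(-1)   # prediction target
print(merged)
```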

LSceneLLM: Enhancing Large 3D Scene Understanding Using Adaptive Visual Preferences

LSceneLLM, developed by researchers from South China University of Technology, Tencent Robotics X, and others, presents an adaptive framework for large 3D scene understanding that mimics human visual processing by focusing on task-relevant regions. The approach achieves state-of-the-art performance across large indoor (XR-Scene), single-room indoor (ScanQA), and large outdoor (NuscenesQA) benchmarks, significantly improving fine-grained detail recognition and setting a new standard for embodied AI applications.

Instant4D: 4D Gaussian Splatting in Minutes

A novel system, Instant4D, reconstructs high-quality dynamic 3D scenes from uncalibrated monocular videos within minutes. It achieves a 30x speed-up over existing methods, reducing training time to as little as 2 minutes and increasing PSNR by 7.15 dB on challenging datasets compared to concurrent baselines.

Retrieval Augmented Generation and Understanding in Vision: A Survey and New Outlook

A comprehensive survey examines the integration of retrieval-augmented generation (RAG) techniques in computer vision tasks, analyzing applications across visual understanding, generation, and embodied AI while mapping key challenges and opportunities for extending RAG beyond text-based retrieval into multimodal frameworks.

FinBen: A Holistic Financial Benchmark for Large Language Models

FinBen introduces the first comprehensive open-source evaluation benchmark for large language models in finance, integrating 36 datasets across 24 tasks and seven critical financial aspects. Its extensive evaluation of 15 LLMs reveals strong performance in basic NLP but significant deficiencies in complex reasoning, quantitative forecasting, and robust decision-making, while showcasing advanced models' capabilities in areas like stock trading.

Global Compression Commander: Plug-and-Play Inference Acceleration for High-Resolution Large Vision-Language Models

The "Global Compression Commander" (GlobalCom²) framework accelerates inference for High-Resolution Large Vision-Language Models (HR-LVLMs) and VideoLLMs by intelligently compressing visual tokens using a global-to-local guidance strategy. It achieves a 90.9% reduction in FLOPs, a 40.0% decrease in peak GPU memory, and a 1.8x inference throughput boost at 10% token retention, while maintaining over 90% of original model performance.

Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications
07 Jun 2025
Financial LLMs hold promise for advancing financial tasks and domain-specific applications. However, they are limited by scarce corpora, weak multimodal capabilities, and narrow evaluations, making them less suited for real-world application. To address this, we introduce Open-FinLLMs, the first open-source multimodal financial LLMs designed to handle diverse tasks across text, tabular, time-series, and chart data, excelling in zero-shot, few-shot, and fine-tuning settings. The suite includes FinLLaMA, pre-trained on a comprehensive 52-billion-token corpus; FinLLaMA-Instruct, fine-tuned with 573K financial instructions; and FinLLaVA, enhanced with 1.43M multimodal tuning pairs for strong cross-modal reasoning. We comprehensively evaluate Open-FinLLMs across 14 financial tasks, 30 datasets, and 4 multimodal tasks in zero-shot, few-shot, and supervised fine-tuning settings, introducing two new multimodal evaluation datasets. Our results show that Open-FinLLMs outperform advanced financial and general LLMs such as GPT-4 across financial NLP, decision-making, and multimodal tasks, highlighting their potential to tackle real-world challenges. To foster innovation and collaboration across academia and industry, we release all code (this https URL) and models under OSI-approved licenses.
SaFeR-VLM: Toward Safety-aware Fine-grained Reasoning in Multimodal Models
Multimodal Large Reasoning Models (MLRMs) demonstrate impressive cross-modal reasoning but often amplify safety risks under adversarial or unsafe prompts, a phenomenon we call the Reasoning Tax. Existing defenses mainly act at the output level and do not constrain the reasoning process, leaving models exposed to implicit risks. In this paper, we propose SaFeR-VLM, a safety-aligned reinforcement learning framework that embeds safety directly into multimodal reasoning. The framework integrates four components: (I) QI-Safe-10K, a curated dataset emphasizing safety-critical and reasoning-sensitive cases; (II) safety-aware rollout, where unsafe generations undergo reflection and correction instead of being discarded; (III) structured reward modeling with multi-dimensional weighted criteria and explicit penalties for hallucinations and contradictions; and (IV) GRPO optimization, which reinforces both safe and corrected trajectories. This unified design shifts safety from a passive safeguard to an active driver of reasoning, enabling scalable and generalizable safety-aware reasoning. SaFeR-VLM further demonstrates robustness against both explicit and implicit risks, supporting dynamic and interpretable safety decisions beyond surface-level filtering. SaFeR-VLM-3B achieves average scores of 70.13 and 78.97 on safety and helpfulness across six benchmarks, surpassing both same-scale and >10x larger models such as Skywork-R1V3-38B, Qwen2.5VL-72B, and GLM4.5V-106B. Remarkably, SaFeR-VLM-7B benefits from its increased scale to surpass GPT-5-mini and Gemini-2.5-Flash by 6.47 and 16.76 points respectively on safety metrics, achieving this improvement without any degradation in helpfulness performance. Our code is available at this https URL.
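
To make the reward-modeling component concrete, here is a hedged sketch of a multi-dimensional weighted reward with explicit hallucination and contradiction penalties; the dimensions, weights, and penalty values are illustrative placeholders, not the paper's actual reward function.

```python
# Toy structured reward: weighted criteria plus explicit penalties.
from dataclasses import dataclass

@dataclass
class RewardScores:
    safety: float            # did the reasoning avoid unsafe content? (0..1)
    helpfulness: float       # did the answer address the request? (0..1)
    reasoning_quality: float # is the chain of thought coherent? (0..1)
    hallucinated: bool       # explicit penalty flags
    contradictory: bool

def structured_reward(s: RewardScores,
                      weights=(0.4, 0.3, 0.3),
                      hallucination_penalty=0.5,
                      contradiction_penalty=0.5) -> float:
    base = (weights[0] * s.safety
            + weights[1] * s.helpfulness
            + weights[2] * s.reasoning_quality)
    penalty = (hallucination_penalty * s.hallucinated
               + contradiction_penalty * s.contradictory)
    return base - penalty

# A corrected-after-reflection trajectory can still earn positive reward,
# which is what makes "reflect and fix" preferable to discarding the sample.
print(structured_reward(RewardScores(0.9, 0.8, 0.7, hallucinated=False, contradictory=False)))
print(structured_reward(RewardScores(0.2, 0.9, 0.6, hallucinated=True, contradictory=False)))
```
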
ForensicHub: A Unified Benchmark & Codebase for All-Domain Fake Image Detection and Localization
The field of Fake Image Detection and Localization (FIDL) is highly fragmented, encompassing four domains: deepfake detection (Deepfake), image manipulation detection and localization (IMDL), artificial intelligence-generated image detection (AIGC), and document image manipulation localization (Doc). Although individual benchmarks exist in some domains, a unified benchmark covering all FIDL domains is still missing. The absence of a unified benchmark results in significant domain silos, where each domain independently constructs its datasets, models, and evaluation protocols without interoperability, preventing cross-domain comparisons and hindering the development of the entire FIDL field. To break down these domain silos, we propose ForensicHub, the first unified benchmark & codebase for all-domain fake image detection and localization. Considering the drastic variations in dataset, model, and evaluation configurations across domains, as well as the scarcity of open-sourced baseline models and the lack of individual benchmarks in some domains, ForensicHub: i) proposes a modular and configuration-driven architecture that decomposes forensic pipelines into interchangeable components across datasets, transforms, models, and evaluators, allowing flexible composition across all domains; ii) fully implements 10 baseline models, 6 backbones, and 2 new benchmarks for AIGC and Doc, and integrates the 2 existing benchmarks DeepfakeBench and IMDLBenCo through an adapter-based design; iii) conducts in-depth analysis based on ForensicHub, offering 8 key actionable insights into FIDL model architecture, dataset characteristics, and evaluation standards. ForensicHub represents a significant leap toward breaking the domain silos in the FIDL field and inspiring future breakthroughs.
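
The snippet below sketches what a configuration-driven, modular forensic pipeline of this kind can look like, with datasets, transforms, models, and evaluators registered by name and composed from a config dict; the registry API and toy components are illustrative, not ForensicHub's actual codebase.

```python
# Toy registry-based, config-driven pipeline in the spirit of a modular forensic benchmark.
from typing import Callable, Dict

REGISTRY: Dict[str, Dict[str, Callable]] = {"dataset": {}, "transform": {}, "model": {}, "evaluator": {}}

def register(kind: str, name: str):
    def deco(fn):
        REGISTRY[kind][name] = fn
        return fn
    return deco

@register("dataset", "toy_deepfake")
def toy_deepfake():
    # (score-like feature, label) pairs; 1 = fake. Real pipelines would load images.
    return [(0.2, 0), (0.9, 1), (0.8, 1), (0.1, 0)]

@register("transform", "identity")
def identity(x):
    return x

@register("model", "threshold")
def threshold(thr: float = 0.5):
    return lambda x: int(x > thr)

@register("evaluator", "accuracy")
def accuracy(preds, labels):
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

def run(config):
    data = REGISTRY["dataset"][config["dataset"]]()
    tf = REGISTRY["transform"][config.get("transform", "identity")]
    model = REGISTRY["model"][config["model"]](**config.get("model_args", {}))
    preds = [model(tf(x)) for x, _ in data]
    labels = [y for _, y in data]
    return REGISTRY["evaluator"][config["evaluator"]](preds, labels)

print(run({"dataset": "toy_deepfake", "model": "threshold", "evaluator": "accuracy"}))
```
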
Integrating spoken instructions into flight trajectory prediction to optimize automation in air traffic control
The booming air transportation industry inevitably increases air traffic controllers' workload, leading to unexpected human-factor-related incidents. Current air traffic control systems fail to consider spoken instructions for traffic prediction, posing significant challenges for detecting human errors during real-time operations. Here, we present an automation paradigm that integrates controller intent into the information-processing loop through a spoken-instruction-aware flight trajectory prediction framework. A three-stage progressive multimodal learning paradigm is proposed to bridge the modality gap between trajectories and spoken instructions while minimizing data requirements. Experiments on a real-world dataset show that the proposed framework achieves flight trajectory prediction with high predictability and timeliness, obtaining over a 20% relative reduction in mean deviation error. Moreover, the generalizability of the proposed framework is confirmed across various model architectures. The proposed framework enables fully automated information processing in real-world air traffic applications, supporting human-error detection and enhancing aviation safety.
Conditional Representation Learning for Customized Tasks

Conditional Representation Learning (CRL) introduces an efficient framework that leverages LLMs and VLMs to generate image representations specifically tailored to arbitrary user-specified criteria, moving beyond universal embeddings. The framework demonstrates substantial performance gains across customized classification and retrieval tasks, including notable improvements of up to 40% in few-shot accuracy and 75% in clustering for non-dominant criteria compared to baseline VLMs.
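
A dependency-free sketch of the core idea follows: describe the user's criterion as a set of text attributes (as an LLM might propose) and represent each image by its similarity to those attributes (as a CLIP-style VLM would compute). The attribute list and the toy encoders are placeholders for the real LLM and VLM calls.

```python
# Toy criterion-conditioned representation: image similarity to criterion-specific text attributes.
import numpy as np

def encode_text(phrase: str) -> np.ndarray:        # stand-in for a VLM text encoder
    rng = np.random.default_rng(sum(map(ord, phrase)))
    v = rng.normal(size=512)
    return v / np.linalg.norm(v)

def encode_image(image_id: int) -> np.ndarray:     # stand-in for a VLM image encoder
    rng = np.random.default_rng(image_id)
    v = rng.normal(size=512)
    return v / np.linalg.norm(v)

# Attributes an LLM might propose for the user criterion "fabric texture".
attributes = ["smooth silk surface", "coarse woven wool", "ribbed corduroy", "glossy leather"]
text_basis = np.stack([encode_text(a) for a in attributes])          # (4, 512)

def conditional_representation(image_id: int) -> np.ndarray:
    img = encode_image(image_id)
    return text_basis @ img            # similarity to each criterion-specific attribute

print(conditional_representation(7))   # a 4-dim representation tailored to the criterion
```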

SR-Eval: Evaluating LLMs on Code Generation under Stepwise Requirement Refinement
Large language models (LLMs) have achieved remarkable progress in code generation. However, existing benchmarks mainly formalize the task as a static, single-turn problem, overlooking the stepwise requirement changes and iterative workflows of real-world software development. This mismatch limits our understanding of how well LLMs can support real-world development workflows. Constructing such iterative benchmarks is challenging due to the lack of public interaction traces and the difficulty of creating discriminative, turn-specific test cases. To bridge this gap, we present SR-Eval, a benchmark specifically designed to assess LLMs on iterative code generation under Stepwise Requirement Refinement. SR-Eval spans both function-level and repository-level tasks in Python and Java, enabling fine-grained and progressive evaluation across evolving requirements. The construction of SR-Eval follows a carefully designed pipeline that first leverages a multi-agent requirement generation method to simulate the development process and recover the multi-round interaction from final requirements, then employs a semantic-aware discriminative test case generation component to ensure discriminative and consistent evaluation at each turn. SR-Eval comprises 443 multi-turn tasks and 1,857 questions at both the function and repository levels. Using SR-Eval, we evaluate 11 representative LLMs with three prompting strategies that simulate different usage patterns. Results show that iterative code generation under stepwise requirement refinement remains highly challenging: the best-performing model achieves only a 22.67% completion rate on function-level tasks and 20.00% on repository-level tasks. We further observe that prompting strategies substantially influence performance, highlighting the need for more advanced methods.
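
As a rough illustration (not SR-Eval's actual harness), the sketch below evaluates a model under stepwise refinement: each turn adds a requirement, the model regenerates the code from the full history, and turn-specific tests decide whether that turn passes.

```python
# Toy multi-turn evaluation loop under stepwise requirement refinement.
turns = [
    {"requirement": "Write add(a, b) returning the sum.",
     "tests": [lambda f: f(2, 3) == 5]},
    {"requirement": "Now add(a, b, *rest) must accept extra addends.",
     "tests": [lambda f: f(1, 2, 3, 4) == 10]},
]

def fake_model(history):
    # Stand-in for an LLM call; a real harness would prompt with the full history.
    return "def add(a, b, *rest):\n    return a + b + sum(rest)"

def evaluate(model, turns):
    history, passed = [], 0
    for turn in turns:
        history.append(turn["requirement"])
        namespace = {}
        exec(model(history), namespace)          # run the generated code
        fn = namespace["add"]                     # toy assumption: function is named `add`
        passed += all(test(fn) for test in turn["tests"])
    return passed / len(turns)

print(f"per-turn pass rate: {evaluate(fake_model, turns):.2f}")
```
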
Towards an AI Musician: Synthesizing Sheet Music Problems for Musical Reasoning
Enhancing the ability of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) to interpret sheet music is a crucial step toward building AI musicians. However, current research lacks both evaluation benchmarks and training data for sheet music reasoning. Inspired by mathematics, where simple operations yield infinite verifiable problems, we introduce a novel approach that treats core music theory rules, such as those governing beats and intervals, as programmatic functions to systematically synthesize a vast and diverse corpus of sheet music reasoning problems. This approach allows us to introduce a data synthesis framework that generates verifiable sheet music questions in both textual and visual modalities, leading to the Synthetic Sheet Music Reasoning Benchmark (SSMR-Bench) and a complementary training set. Evaluation results on SSMR-Bench highlight the key role reasoning plays in interpreting sheet music, while also pointing out the ongoing challenges in understanding sheet music in a visual format. By leveraging synthetic data for RLVR, all models show significant improvements on the SSMR-Bench. Additionally, they also demonstrate considerable advancements on previously established human-crafted benchmarks, such as MusicTheoryBench and the music subset of MMMU. Finally, our results show that the enhanced reasoning ability can also facilitate music composition.
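
The snippet below sketches the paper's core recipe in miniature: encode a music-theory rule as a function and use it both to synthesize questions and to verify answers. The semitone-interval rule and question template are simplified illustrations of the approach.

```python
# Toy synthesis of verifiable sheet-music reasoning problems from a programmatic rule.
import random

NOTE_TO_SEMITONE = {"C": 0, "C#": 1, "D": 2, "D#": 3, "E": 4, "F": 5,
                    "F#": 6, "G": 7, "G#": 8, "A": 9, "A#": 10, "B": 11}

def interval_in_semitones(low: str, high: str) -> int:
    """Music-theory 'rule' used both to pose the question and to verify answers."""
    return (NOTE_TO_SEMITONE[high] - NOTE_TO_SEMITONE[low]) % 12

def synthesize_question(rng: random.Random):
    low, high = rng.sample(list(NOTE_TO_SEMITONE), 2)
    question = f"How many semitones are there from {low} up to {high}?"
    answer = interval_in_semitones(low, high)    # verifiable ground truth
    return question, answer

rng = random.Random(0)
for _ in range(3):
    q, a = synthesize_question(rng)
    print(q, "->", a)
```
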
GEMeX: A Large-Scale, Groundable, and Explainable Medical VQA Benchmark for Chest X-ray Diagnosis

Researchers from multiple Asian institutions introduce GEMeX, the largest chest X-ray VQA dataset with 1.6 million question-answer pairs, featuring both textual explanations and visual grounding for answers. Benchmarking shows existing large vision-language models perform poorly, while a fine-tuned model achieves substantial performance gains and improved visual grounding, highlighting the dataset's utility for developing explainable medical AI.

MCTrack: A Unified 3D Multi-Object Tracking Framework for Autonomous Driving

MCTrack introduces a unified 3D multi-object tracking framework designed to achieve state-of-the-art performance across KITTI, nuScenes, and Waymo datasets. It also proposes a standardized data format and novel motion-centric evaluation metrics, enhancing generalizability and comprehensive assessment for autonomous driving.

Variation-aware Vision Token Dropping for Faster Large Vision-Language Models

Researchers from Sichuan University and collaborators developed V2Drop, a method that accelerates Large Vision-Language Models by progressively dropping redundant visual tokens based on their representational variation across LLM layers. This approach significantly enhances inference speed and reduces memory usage while preserving performance, addressing limitations of prior attention-guided compression techniques.
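
A hedged sketch of variation-aware dropping follows: between consecutive LLM layers, visual tokens whose hidden states barely change are treated as uninformative and removed. The norm-of-difference metric and the fixed keep ratio are illustrative, not V2Drop's exact procedure.

```python
# Toy variation-based dropping of visual tokens between two LLM layers.
import torch

def drop_low_variation_tokens(hidden_prev, hidden_curr, keep_ratio=0.5):
    """hidden_prev/hidden_curr: (num_visual_tokens, dim) states of the same
    visual tokens before and after one LLM layer."""
    variation = (hidden_curr - hidden_prev).norm(dim=-1)       # per-token change
    k = max(1, int(keep_ratio * hidden_curr.shape[0]))
    keep = variation.topk(k).indices.sort().values             # preserve token order
    return hidden_curr[keep], keep

torch.manual_seed(0)
prev, curr = torch.randn(576, 4096), torch.randn(576, 4096)
kept, idx = drop_low_variation_tokens(prev, curr)
print(f"kept {kept.shape[0]} of 576 visual tokens after this layer")
```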

MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA-based Mixture of Experts

MixLoRA integrates Low-Rank Adaptation (LoRA) with a Mixture-of-Experts (MoE) architecture to improve Large Language Model fine-tuning for multi-task learning. The method delivers an average accuracy increase of 9.8% over LoRA in multi-task scenarios and reduces GPU memory consumption by 40% with its high-throughput optimization framework.
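
A compact sketch of the LoRA-as-MoE idea appears below: a token-wise router selects among several low-rank adapters, and only the selected experts' deltas are added to the frozen base layer's output. The dimensions, top-k routing, and dense expert evaluation are simplifications rather than MixLoRA's exact configuration.

```python
# Toy mixture of LoRA experts over a frozen linear layer.
import torch
import torch.nn as nn

class LoRAExpert(nn.Module):
    def __init__(self, dim, rank=8):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)          # standard LoRA init: adapter starts as a no-op

    def forward(self, x):
        return self.up(self.down(x))

class MoELoRALayer(nn.Module):
    def __init__(self, base: nn.Linear, num_experts=4, top_k=2):
        super().__init__()
        self.base = base
        for p in self.base.parameters():        # base weights stay frozen
            p.requires_grad_(False)
        self.experts = nn.ModuleList(LoRAExpert(base.in_features) for _ in range(num_experts))
        self.router = nn.Linear(base.in_features, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x):                       # x: (batch, tokens, dim)
        out = self.base(x)
        weights = self.router(x).softmax(dim=-1)                      # (b, t, E)
        top_w, top_i = weights.topk(self.top_k, dim=-1)               # route each token
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (top_i[..., slot] == e).unsqueeze(-1)           # tokens sent to expert e
                out = out + mask * top_w[..., slot:slot + 1] * expert(x)
        return out

layer = MoELoRALayer(nn.Linear(64, 64))
print(layer(torch.randn(2, 10, 64)).shape)
```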
