West China Biomedical Big Data Center, West China Hospital, Sichuan University
From Code Foundation Models to Agents and Applications: A Comprehensive Survey and Practical Guide to Code Intelligence

A comprehensive synthesis of Large Language Models for automated software development covers the entire model lifecycle, from data curation to autonomous agents, and offers practical guidance derived from empirical experiments on pre-training, fine-tuning, and reinforcement learning, alongside a detailed analysis of challenges and future directions.

Accelerating Diffusion Transformers with Token-wise Feature Caching

Token-wise Feature Caching (ToCa) accelerates Diffusion Transformers without requiring model retraining by adaptively caching intermediate features at a granular token and layer level. This method achieves up to 2.75x speedup while preserving or improving generation quality across various text-to-image and text-to-video models.
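
As a rough illustration of the caching idea, the toy sketch below recomputes a block only for the tokens whose inputs have drifted most since they were last refreshed and reuses cached outputs for the rest; the ToyBlock, the drift-based scoring, and the 25% recompute ratio are illustrative assumptions, not ToCa's actual algorithm (which also handles attention layers and layer-wise schedules).

```python
# Minimal sketch of token-wise feature caching across diffusion steps.
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    """Stand-in for one Diffusion Transformer block (per-token MLP only)."""
    def __init__(self, dim: int):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):                       # x: (tokens, dim)
        return x + self.ff(x)

def cached_step(block, x, cache, recompute_ratio=0.25):
    """Recompute the block only for the tokens whose inputs drifted most,
    reusing cached outputs for all other tokens."""
    if cache is None:                           # first step: full computation
        out = block(x)
        return out, (x.detach(), out.detach())
    cached_x, cached_out = cache
    drift = (x - cached_x).norm(dim=-1)         # per-token change since last refresh
    k = max(1, int(recompute_ratio * x.shape[0]))
    idx = drift.topk(k).indices                 # tokens to recompute this step
    out = cached_out.clone()
    out[idx] = block(x[idx])                    # partial recomputation
    cached_x = cached_x.clone()
    cached_x[idx] = x[idx].detach()             # remember when each token was refreshed
    return out, (cached_x, out.detach())

torch.manual_seed(0)
block, x, cache = ToyBlock(64), torch.randn(256, 64), None
for step in range(4):                           # pretend these are denoising steps
    out, cache = cached_step(block, x, cache)
    x = x + 0.01 * torch.randn_like(x)          # token inputs drift slowly between steps
print("final output shape:", tuple(out.shape))
```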

Video Compression Commander: Plug-and-Play Inference Acceleration for Video Large Language Models

Video Compression Commander (VidCom²) introduces a plug-and-play inference acceleration framework for Video Large Language Models (VideoLLMs) that intelligently prunes redundant visual tokens. The approach reduces LLM generation latency by 70.8% and peak GPU memory usage by approximately 9.6%, while retaining 99.6% of original performance at 25% token retention.
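
The sketch below illustrates the general flavor of visual-token pruning for VideoLLMs: score each frame's tokens and keep only the most distinctive ones. The frame-mean redundancy heuristic and the 25% retention are placeholders, not VidCom²'s actual scoring.

```python
# Minimal sketch of per-frame visual-token pruning for a VideoLLM.
import torch
import torch.nn.functional as F

def prune_video_tokens(frames, keep_ratio=0.25):
    """frames: (num_frames, tokens_per_frame, dim) visual features.
    Returns a list of (kept_indices, kept_tokens) per frame."""
    kept = []
    for frame in frames:                                    # (tokens, dim)
        mean = frame.mean(dim=0, keepdim=True)              # frame-level summary
        # Tokens very similar to the frame mean are treated as redundant.
        uniqueness = 1.0 - F.cosine_similarity(frame, mean, dim=-1)
        k = max(1, int(keep_ratio * frame.shape[0]))
        idx = uniqueness.topk(k).indices.sort().values       # keep original token order
        kept.append((idx, frame[idx]))
    return kept

video = torch.randn(8, 576, 1024)                            # e.g. 8 frames of ViT patch tokens
pruned = prune_video_tokens(video)
print(f"kept {pruned[0][1].shape[0]} of {video.shape[1]} tokens per frame")
```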

FNSPID: A Comprehensive Financial News Dataset in Time Series

FNSPID is a comprehensive financial dataset integrating 24 years of time-aligned stock prices and financial news for 4,775 S&P500 companies, featuring LLM-derived sentiment scores and multiple summarization methods. Experiments using FNSPID showed that Transformer models achieved a 0.988 R-squared for stock prediction, with LLM-based sentiment consistently improving accuracy across various deep learning architectures.
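
As a purely illustrative sketch of how such a dataset can be used, the snippet below joins daily LLM sentiment scores onto time-aligned prices to build features for a forecasting model; the column names and values are assumptions, not FNSPID's actual schema or loader.

```python
# Toy example: align LLM-derived news sentiment with daily prices as model features.
import pandas as pd

prices = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-04"]),
    "symbol": ["AAPL"] * 3,
    "close": [185.6, 184.3, 181.9],
})
news = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-02", "2024-01-04"]),
    "symbol": ["AAPL", "AAPL"],
    "sentiment": [0.7, -0.4],          # hypothetical LLM sentiment score in [-1, 1]
})

# Left-join sentiment onto prices; days without news get a neutral score.
merged = prices.merge(news, on=["date", "symbol"], how="left").fillna({"sentiment": 0.0})
merged["next_close"] = merged.groupby("symbol")["close"].shift(-1)   # prediction target
print(merged)
```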

LSceneLLM: Enhancing Large 3D Scene Understanding Using Adaptive Visual Preferences

LSceneLLM, developed by researchers from South China University of Technology, Tencent Robotics X, and others, presents an adaptive framework for large 3D scene understanding that mimics human visual processing by focusing on task-relevant regions. The approach achieves state-of-the-art performance across large indoor (XR-Scene), single-room indoor (ScanQA), and large outdoor (NuscenesQA) benchmarks, significantly improving fine-grained detail recognition and setting a new standard for embodied AI applications.

Instant4D: 4D Gaussian Splatting in Minutes

A novel system, Instant4D, reconstructs high-quality dynamic 3D scenes from uncalibrated monocular videos within minutes. It achieves a 30x speed-up over existing methods, reducing training time to as little as 2 minutes and increasing PSNR by 7.15 dB on challenging datasets compared to concurrent baselines.

Retrieval Augmented Generation and Understanding in Vision: A Survey and New Outlook

A comprehensive survey examines the integration of retrieval-augmented generation (RAG) techniques in computer vision tasks, analyzing applications across visual understanding, generation, and embodied AI while mapping key challenges and opportunities for extending RAG beyond text-based retrieval into multimodal frameworks.

FinBen: A Holistic Financial Benchmark for Large Language Models

FinBen introduces the first comprehensive open-source evaluation benchmark for large language models in finance, integrating 36 datasets across 24 tasks and seven critical financial aspects. Its extensive evaluation of 15 LLMs reveals strong performance in basic NLP but significant deficiencies in complex reasoning, quantitative forecasting, and robust decision-making, while showcasing advanced models' capabilities in areas like stock trading.

Global Compression Commander: Plug-and-Play Inference Acceleration for High-Resolution Large Vision-Language Models

The "Global Compression Commander" (GlobalCom²) framework accelerates inference for High-Resolution Large Vision-Language Models (HR-LVLMs) and VideoLLMs by intelligently compressing visual tokens using a global-to-local guidance strategy. It achieves a 90.9% reduction in FLOPs, a 40.0% decrease in peak GPU memory, and a 1.8x inference throughput boost at 10% token retention, while maintaining over 90% of original model performance.

Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications
07 Jun 2025
Financial LLMs hold promise for advancing financial tasks and domain-specific applications. However, they are limited by scarce corpora, weak multimodal capabilities, and narrow evaluations, making them less suited for real-world application. To address this, we introduce Open-FinLLMs, the first open-source multimodal financial LLMs designed to handle diverse tasks across text, tabular, time-series, and chart data, excelling in zero-shot, few-shot, and fine-tuning settings. The suite includes FinLLaMA, pre-trained on a comprehensive 52-billion-token corpus; FinLLaMA-Instruct, fine-tuned with 573K financial instructions; and FinLLaVA, enhanced with 1.43M multimodal tuning pairs for strong cross-modal reasoning. We comprehensively evaluate Open-FinLLMs across 14 financial tasks, 30 datasets, and 4 multimodal tasks in zero-shot, few-shot, and supervised fine-tuning settings, introducing two new multimodal evaluation datasets. Our results show that Open-FinLLMs outperform advanced financial and general LLMs such as GPT-4 across financial NLP, decision-making, and multimodal tasks, highlighting their potential to tackle real-world challenges. To foster innovation and collaboration across academia and industry, we release all code (this https URL) and models under OSI-approved licenses.
SaFeR-VLM: Toward Safety-aware Fine-grained Reasoning in Multimodal Models
Multimodal Large Reasoning Models (MLRMs) demonstrate impressive cross-modal reasoning but often amplify safety risks under adversarial or unsafe prompts, a phenomenon we call the Reasoning Tax. Existing defenses mainly act at the output level and do not constrain the reasoning process, leaving models exposed to implicit risks. In this paper, we propose SaFeR-VLM, a safety-aligned reinforcement learning framework that embeds safety directly into multimodal reasoning. The framework integrates four components: (I) QI-Safe-10K, a curated dataset emphasizing safety-critical and reasoning-sensitive cases; (II) safety-aware rollout, where unsafe generations undergo reflection and correction instead of being discarded; (III) structured reward modeling with multi-dimensional weighted criteria and explicit penalties for hallucinations and contradictions; and (IV) GRPO optimization, which reinforces both safe and corrected trajectories. This unified design shifts safety from a passive safeguard to an active driver of reasoning, enabling scalable and generalizable safety-aware reasoning. SaFeR-VLM further demonstrates robustness against both explicit and implicit risks, supporting dynamic and interpretable safety decisions beyond surface-level filtering. SaFeR-VLM-3B achieves average scores of 70.13 and 78.97 on safety and helpfulness across six benchmarks, surpassing both same-scale and >10x larger models such as Skywork-R1V3-38B, Qwen2.5VL-72B, and GLM4.5V-106B. Remarkably, SaFeR-VLM-7B benefits from its increased scale to surpass GPT-5-mini and Gemini-2.5-Flash by 6.47 and 16.76 points respectively on safety metrics, achieving this improvement without any degradation in helpfulness performance. Our code is available at this https URL.
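
To make the reward-modeling component concrete, here is a hedged sketch of a multi-dimensional weighted reward with explicit hallucination and contradiction penalties; the dimensions, weights, and penalty values are illustrative placeholders, not the paper's actual reward function.

```python
# Toy structured reward: weighted criteria plus explicit penalties.
from dataclasses import dataclass

@dataclass
class RewardScores:
    safety: float            # did the reasoning avoid unsafe content? (0..1)
    helpfulness: float       # did the answer address the request? (0..1)
    reasoning_quality: float # is the chain of thought coherent? (0..1)
    hallucinated: bool       # explicit penalty flags
    contradictory: bool

def structured_reward(s: RewardScores,
                      weights=(0.4, 0.3, 0.3),
                      hallucination_penalty=0.5,
                      contradiction_penalty=0.5) -> float:
    base = (weights[0] * s.safety
            + weights[1] * s.helpfulness
            + weights[2] * s.reasoning_quality)
    penalty = (hallucination_penalty * s.hallucinated
               + contradiction_penalty * s.contradictory)
    return base - penalty

# A corrected-after-reflection trajectory can still earn positive reward,
# which is what makes "reflect and fix" preferable to discarding the sample.
print(structured_reward(RewardScores(0.9, 0.8, 0.7, hallucinated=False, contradictory=False)))
print(structured_reward(RewardScores(0.2, 0.9, 0.6, hallucinated=True, contradictory=False)))
```
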
ForensicHub: A Unified Benchmark & Codebase for All-Domain Fake Image Detection and Localization
The field of Fake Image Detection and Localization (FIDL) is highly fragmented, encompassing four domains: deepfake detection (Deepfake), image manipulation detection and localization (IMDL), artificial intelligence-generated image detection (AIGC), and document image manipulation localization (Doc). Although individual benchmarks exist in some domains, a unified benchmark covering all FIDL domains is still missing. The absence of a unified benchmark results in significant domain silos, where each domain independently constructs its datasets, models, and evaluation protocols without interoperability, preventing cross-domain comparisons and hindering the development of the entire FIDL field. To break down these domain silos, we propose ForensicHub, the first unified benchmark & codebase for all-domain fake image detection and localization. Considering the drastic variations in dataset, model, and evaluation configurations across domains, as well as the scarcity of open-sourced baseline models and the lack of individual benchmarks in some domains, ForensicHub: i) proposes a modular and configuration-driven architecture that decomposes forensic pipelines into interchangeable components across datasets, transforms, models, and evaluators, allowing flexible composition across all domains; ii) fully implements 10 baseline models, 6 backbones, and 2 new benchmarks for AIGC and Doc, and integrates the 2 existing benchmarks DeepfakeBench and IMDLBenCo through an adapter-based design; iii) conducts in-depth analysis based on ForensicHub, offering 8 key actionable insights into FIDL model architecture, dataset characteristics, and evaluation standards. ForensicHub represents a significant leap toward breaking the domain silos in the FIDL field and inspiring future breakthroughs.
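
The snippet below sketches what a configuration-driven, modular forensic pipeline of this kind can look like, with datasets, transforms, models, and evaluators registered by name and composed from a config dict; the registry API and toy components are illustrative, not ForensicHub's actual codebase.

```python
# Toy registry-based, config-driven pipeline in the spirit of a modular forensic benchmark.
from typing import Callable, Dict

REGISTRY: Dict[str, Dict[str, Callable]] = {"dataset": {}, "transform": {}, "model": {}, "evaluator": {}}

def register(kind: str, name: str):
    def deco(fn):
        REGISTRY[kind][name] = fn
        return fn
    return deco

@register("dataset", "toy_deepfake")
def toy_deepfake():
    # (score-like feature, label) pairs; 1 = fake. Real pipelines would load images.
    return [(0.2, 0), (0.9, 1), (0.8, 1), (0.1, 0)]

@register("transform", "identity")
def identity(x):
    return x

@register("model", "threshold")
def threshold(thr: float = 0.5):
    return lambda x: int(x > thr)

@register("evaluator", "accuracy")
def accuracy(preds, labels):
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

def run(config):
    data = REGISTRY["dataset"][config["dataset"]]()
    tf = REGISTRY["transform"][config.get("transform", "identity")]
    model = REGISTRY["model"][config["model"]](**config.get("model_args", {}))
    preds = [model(tf(x)) for x, _ in data]
    labels = [y for _, y in data]
    return REGISTRY["evaluator"][config["evaluator"]](preds, labels)

print(run({"dataset": "toy_deepfake", "model": "threshold", "evaluator": "accuracy"}))
```
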
Integrating spoken instructions into flight trajectory prediction to optimize automation in air traffic control
The booming air transportation industry inevitably increases air traffic controllers' workload, leading to unexpected human-factor-related incidents. Current air traffic control systems fail to consider spoken instructions for traffic prediction, posing significant challenges for detecting human errors during real-time operations. Here, we present an automation paradigm that integrates controller intent into the information-processing loop through a spoken-instruction-aware flight trajectory prediction framework. A three-stage progressive multimodal learning paradigm is proposed to bridge the modality gap between trajectories and spoken instructions while minimizing data requirements. Experiments on a real-world dataset show that the proposed framework achieves flight trajectory prediction with high predictability and timeliness, obtaining over a 20% relative reduction in mean deviation error. Moreover, the generalizability of the proposed framework is confirmed across various model architectures. The proposed framework enables fully automated information processing in real-world air traffic applications, supporting human-error detection and enhancing aviation safety.
Conditional Representation Learning for Customized Tasks

Conditional Representation Learning (CRL) introduces an efficient framework that leverages LLMs and VLMs to generate image representations specifically tailored to arbitrary user-specified criteria, moving beyond universal embeddings. The framework demonstrates substantial performance gains across customized classification and retrieval tasks, including notable improvements of up to 40% in few-shot accuracy and 75% in clustering for non-dominant criteria compared to baseline VLMs.
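
A dependency-free sketch of the core idea follows: describe the user's criterion as a set of text attributes (as an LLM might propose) and represent each image by its similarity to those attributes (as a CLIP-style VLM would compute). The attribute list and the toy encoders are placeholders for the real LLM and VLM calls.

```python
# Toy criterion-conditioned representation: image similarity to criterion-specific text attributes.
import numpy as np

def encode_text(phrase: str) -> np.ndarray:        # stand-in for a VLM text encoder
    rng = np.random.default_rng(sum(map(ord, phrase)))
    v = rng.normal(size=512)
    return v / np.linalg.norm(v)

def encode_image(image_id: int) -> np.ndarray:     # stand-in for a VLM image encoder
    rng = np.random.default_rng(image_id)
    v = rng.normal(size=512)
    return v / np.linalg.norm(v)

# Attributes an LLM might propose for the user criterion "fabric texture".
attributes = ["smooth silk surface", "coarse woven wool", "ribbed corduroy", "glossy leather"]
text_basis = np.stack([encode_text(a) for a in attributes])          # (4, 512)

def conditional_representation(image_id: int) -> np.ndarray:
    img = encode_image(image_id)
    return text_basis @ img            # similarity to each criterion-specific attribute

print(conditional_representation(7))   # a 4-dim representation tailored to the criterion
```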

SR-Eval: Evaluating LLMs on Code Generation under Stepwise Requirement Refinement
Large language models (LLMs) have achieved remarkable progress in code generation. However, existing benchmarks mainly formalize the task as a static, single-turn problem, overlooking the stepwise requirement changes and iterative workflows of real-world software development. This mismatch limits our understanding of how well LLMs can support real-world development workflows. Constructing such iterative benchmarks is challenging due to the lack of public interaction traces and the difficulty of creating discriminative, turn-specific test cases. To bridge this gap, we present SR-Eval, a benchmark specifically designed to assess LLMs on iterative code generation under Stepwise Requirement Refinement. SR-Eval spans both function-level and repository-level tasks in Python and Java, enabling fine-grained and progressive evaluation across evolving requirements. The construction of SR-Eval follows a carefully designed pipeline that first leverages a multi-agent requirement generation method to simulate the development process and recover the multi-round interaction from final requirements, then employs a semantic-aware discriminative test case generation component to ensure discriminative and consistent evaluation at each turn. SR-Eval comprises 443 multi-turn tasks and 1,857 questions at both the function and repository levels. Using SR-Eval, we evaluate 11 representative LLMs with three prompting strategies that simulate different usage patterns. Results show that iterative code generation under stepwise requirement refinement remains highly challenging: the best-performing model achieves only a 22.67% completion rate on function-level tasks and 20.00% on repository-level tasks. We further observe that prompting strategies substantially influence performance, highlighting the need for more advanced methods.
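
As a rough illustration (not SR-Eval's actual harness), the sketch below evaluates a model under stepwise refinement: each turn adds a requirement, the model regenerates the code from the full history, and turn-specific tests decide whether that turn passes.

```python
# Toy multi-turn evaluation loop under stepwise requirement refinement.
turns = [
    {"requirement": "Write add(a, b) returning the sum.",
     "tests": [lambda f: f(2, 3) == 5]},
    {"requirement": "Now add(a, b, *rest) must accept extra addends.",
     "tests": [lambda f: f(1, 2, 3, 4) == 10]},
]

def fake_model(history):
    # Stand-in for an LLM call; a real harness would prompt with the full history.
    return "def add(a, b, *rest):\n    return a + b + sum(rest)"

def evaluate(model, turns):
    history, passed = [], 0
    for turn in turns:
        history.append(turn["requirement"])
        namespace = {}
        exec(model(history), namespace)          # run the generated code
        fn = namespace["add"]                     # toy assumption: function is named `add`
        passed += all(test(fn) for test in turn["tests"])
    return passed / len(turns)

print(f"per-turn pass rate: {evaluate(fake_model, turns):.2f}")
```
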
Towards an AI Musician: Synthesizing Sheet Music Problems for Musical Reasoning
Enhancing the ability of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) to interpret sheet music is a crucial step toward building AI musicians. However, current research lacks both evaluation benchmarks and training data for sheet music reasoning. Inspired by mathematics, where simple operations yield infinite verifiable problems, we introduce a novel approach that treats core music theory rules, such as those governing beats and intervals, as programmatic functions to systematically synthesize a vast and diverse corpus of sheet music reasoning problems. This approach allows us to introduce a data synthesis framework that generates verifiable sheet music questions in both textual and visual modalities, leading to the Synthetic Sheet Music Reasoning Benchmark (SSMR-Bench) and a complementary training set. Evaluation results on SSMR-Bench highlight the key role reasoning plays in interpreting sheet music, while also pointing out the ongoing challenges in understanding sheet music in a visual format. By leveraging synthetic data for RLVR, all models show significant improvements on the SSMR-Bench. Additionally, they also demonstrate considerable advancements on previously established human-crafted benchmarks, such as MusicTheoryBench and the music subset of MMMU. Finally, our results show that the enhanced reasoning ability can also facilitate music composition.
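
The snippet below sketches the paper's core recipe in miniature: encode a music-theory rule as a function and use it both to synthesize questions and to verify answers. The semitone-interval rule and question template are simplified illustrations of the approach.

```python
# Toy synthesis of verifiable sheet-music reasoning problems from a programmatic rule.
import random

NOTE_TO_SEMITONE = {"C": 0, "C#": 1, "D": 2, "D#": 3, "E": 4, "F": 5,
                    "F#": 6, "G": 7, "G#": 8, "A": 9, "A#": 10, "B": 11}

def interval_in_semitones(low: str, high: str) -> int:
    """Music-theory 'rule' used both to pose the question and to verify answers."""
    return (NOTE_TO_SEMITONE[high] - NOTE_TO_SEMITONE[low]) % 12

def synthesize_question(rng: random.Random):
    low, high = rng.sample(list(NOTE_TO_SEMITONE), 2)
    question = f"How many semitones are there from {low} up to {high}?"
    answer = interval_in_semitones(low, high)    # verifiable ground truth
    return question, answer

rng = random.Random(0)
for _ in range(3):
    q, a = synthesize_question(rng)
    print(q, "->", a)
```
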
GEMeX: A Large-Scale, Groundable, and Explainable Medical VQA Benchmark for Chest X-ray Diagnosis

Researchers from multiple Asian institutions introduce GEMeX, the largest chest X-ray VQA dataset with 1.6 million question-answer pairs, featuring both textual explanations and visual grounding for answers. Benchmarking shows existing large vision-language models perform poorly, while a fine-tuned model achieves substantial performance gains and improved visual grounding, highlighting the dataset's utility for developing explainable medical AI.

MCTrack: A Unified 3D Multi-Object Tracking Framework for Autonomous Driving

MCTrack introduces a unified 3D multi-object tracking framework designed to achieve state-of-the-art performance across KITTI, nuScenes, and Waymo datasets. It also proposes a standardized data format and novel motion-centric evaluation metrics, enhancing generalizability and comprehensive assessment for autonomous driving.

Variation-aware Vision Token Dropping for Faster Large Vision-Language Models

Researchers from Sichuan University and collaborators developed V2Drop, a method that accelerates Large Vision-Language Models by progressively dropping redundant visual tokens based on their representational variation across LLM layers. This approach significantly enhances inference speed and reduces memory usage while preserving performance, addressing limitations of prior attention-guided compression techniques.
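
A hedged sketch of variation-aware dropping follows: between consecutive LLM layers, visual tokens whose hidden states barely change are treated as uninformative and removed. The norm-of-difference metric and the fixed keep ratio are illustrative, not V2Drop's exact procedure.

```python
# Toy variation-based dropping of visual tokens between two LLM layers.
import torch

def drop_low_variation_tokens(hidden_prev, hidden_curr, keep_ratio=0.5):
    """hidden_prev/hidden_curr: (num_visual_tokens, dim) states of the same
    visual tokens before and after one LLM layer."""
    variation = (hidden_curr - hidden_prev).norm(dim=-1)       # per-token change
    k = max(1, int(keep_ratio * hidden_curr.shape[0]))
    keep = variation.topk(k).indices.sort().values             # preserve token order
    return hidden_curr[keep], keep

torch.manual_seed(0)
prev, curr = torch.randn(576, 4096), torch.randn(576, 4096)
kept, idx = drop_low_variation_tokens(prev, curr)
print(f"kept {kept.shape[0]} of 576 visual tokens after this layer")
```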

MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA-based Mixture of Experts

MixLoRA integrates Low-Rank Adaptation (LoRA) with a Mixture-of-Experts (MoE) architecture to improve Large Language Model fine-tuning for multi-task learning. The method delivers an average accuracy increase of 9.8% over LoRA in multi-task scenarios and reduces GPU memory consumption by 40% with its high-throughput optimization framework.
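
A compact sketch of the LoRA-as-MoE idea appears below: a token-wise router selects among several low-rank adapters, and only the selected experts' deltas are added to the frozen base layer's output. The dimensions, top-k routing, and dense expert evaluation are simplifications rather than MixLoRA's exact configuration.

```python
# Toy mixture of LoRA experts over a frozen linear layer.
import torch
import torch.nn as nn

class LoRAExpert(nn.Module):
    def __init__(self, dim, rank=8):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)          # standard LoRA init: adapter starts as a no-op

    def forward(self, x):
        return self.up(self.down(x))

class MoELoRALayer(nn.Module):
    def __init__(self, base: nn.Linear, num_experts=4, top_k=2):
        super().__init__()
        self.base = base
        for p in self.base.parameters():        # base weights stay frozen
            p.requires_grad_(False)
        self.experts = nn.ModuleList(LoRAExpert(base.in_features) for _ in range(num_experts))
        self.router = nn.Linear(base.in_features, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x):                       # x: (batch, tokens, dim)
        out = self.base(x)
        weights = self.router(x).softmax(dim=-1)                      # (b, t, E)
        top_w, top_i = weights.topk(self.top_k, dim=-1)               # route each token
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (top_i[..., slot] == e).unsqueeze(-1)           # tokens sent to expert e
                out = out + mask * top_w[..., slot:slot + 1] * expert(x)
        return out

layer = MoELoRALayer(nn.Linear(64, 64))
print(layer(torch.randn(2, 10, 64)).shape)
```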
