State Key Lab of AI Safety, Institute of Computing Technology, CAS
Is Sora a World Simulator? A Comprehensive Survey on General World Models and Beyond
General world models represent a crucial pathway toward achieving Artificial General Intelligence (AGI), serving as the cornerstone for various applications ranging from virtual environments to decision-making systems. Recently, the emergence of the Sora model has attracted significant attention due to its remarkable simulation capabilities, exhibiting an incipient comprehension of physical laws. In this survey, we embark on a comprehensive exploration of the latest advancements in world models. Our analysis navigates through the forefront of generative methodologies in video generation, where world models stand as pivotal constructs facilitating the synthesis of highly realistic visual content. Additionally, we scrutinize the burgeoning field of autonomous-driving world models, meticulously delineating their indispensable role in reshaping transportation and urban mobility. Furthermore, we delve into the intricacies inherent in world models deployed within autonomous agents, shedding light on their profound significance in enabling intelligent interactions within dynamic environmental contexts. Finally, we examine the challenges and limitations of world models and discuss their potential future directions. We hope this survey can serve as a foundational reference for the research community and inspire continued innovation. This survey will be regularly updated at: this https URL.
Large Language Models as Computable Approximations to Solomonoff Induction

This paper establishes a foundational theoretical link between Large Language Models (LLMs) and Algorithmic Information Theory (AIT), demonstrating that LLM training and inference can be viewed as computable approximations of the Solomonoff prior and Solomonoff induction. Leveraging this understanding, the work proposes a few-shot example selection strategy based on model confidence that improved classification accuracy on tested datasets, for instance, increasing Qwen2.5 3B's accuracy from 76.62% to 90.07% on the SMS dataset.
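A minimal sketch of what confidence-based few-shot selection could look like, assuming the confidence signal is the model's average per-token log-probability; the scoring rule, prompt format, and helper names below are illustrative, not the paper's exact procedure:

```python
from typing import Callable, Dict, List

def avg_token_logprob(prompt: str, completion: str,
                      logprob_fn: Callable[[str, str], List[float]]) -> float:
    """Mean per-token log-probability of `completion` given `prompt`,
    used as a rough proxy for model confidence."""
    logprobs = logprob_fn(prompt, completion)
    return sum(logprobs) / max(len(logprobs), 1)

def select_few_shot(candidates: List[Dict[str, str]], query: str, k: int,
                    logprob_fn: Callable[[str, str], List[float]]) -> List[Dict[str, str]]:
    """Keep the k candidate (input, label) pairs the model is most confident
    about when conditioned on the query; `logprob_fn` stands in for whatever
    API returns token log-probabilities for a completion."""
    scored = [(avg_token_logprob(f"{query}\n{c['input']} ->", c["label"], logprob_fn), c)
              for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]
```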

Chain-of-Action: Trajectory Autoregressive Modeling for Robotic Manipulation
11 Jun 2025

Chain-of-Action (CoA) proposes a visuo-motor policy that generates robot trajectories autoregressively in reverse, starting from a task goal and reasoning backward to the current state. This approach addresses compounding errors and enhances spatial generalization, achieving an average success rate of 0.552 on 60 RLBench tasks and demonstrating improved performance on real-world Fetch robot manipulation.
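A rough sketch of reverse, goal-conditioned trajectory rollout in the spirit of CoA; the `step_fn` policy stub, tolerance, and pose format are assumptions for illustration, not the actual model:

```python
import numpy as np

def reverse_trajectory(goal_pose: np.ndarray, current_pose: np.ndarray,
                       step_fn, max_steps: int = 50, tol: float = 1e-2) -> np.ndarray:
    """Roll out waypoints autoregressively *backwards* from the task goal
    toward the current observation, then flip the sequence for execution.
    `step_fn(waypoints_so_far, current_pose)` stands in for the learned
    policy head that predicts the next earlier-in-time waypoint."""
    waypoints = [goal_pose]
    for _ in range(max_steps):
        nxt = step_fn(np.stack(waypoints), current_pose)
        waypoints.append(nxt)
        if np.linalg.norm(nxt - current_pose) < tol:   # reached the start state
            break
    return np.stack(waypoints[::-1])                   # execute start -> goal

# Toy usage with a stand-in "policy" that interpolates toward the current pose.
goal, cur = np.array([1.0, 1.0, 1.0]), np.zeros(3)
traj = reverse_trajectory(goal, cur, lambda wps, c: wps[-1] + 0.2 * (c - wps[-1]))
print(traj.shape)   # (num_waypoints, 3), ordered from current pose to goal
```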

Game-theoretic LLM: Agent Workflow for Negotiation Games

Researchers developed game-theory-inspired workflows to systematically enhance the strategic decision-making capabilities of large language models (LLMs) in various negotiation and strategic games. Integrating classical game theory principles, these workflows enabled LLM agents to achieve near-optimal allocations in incomplete-information negotiations, with up to 100% agreement and envy-freeness, and significantly improved adherence to Nash Equilibria in complete-information games compared to baseline LLM performance.

A Theory for Token-Level Harmonization in Retrieval-Augmented Generation

Researchers at CAS Key Laboratory of AI Safety developed a theoretical framework to quantify the benefit and detriment of retrieved information in Retrieval-Augmented Generation (RAG) at the token level. This framework formalizes benefit as distribution completion and detriment as distribution contradiction, enabling a practical method (Tok-RAG) that improves RAG robustness and performance across diverse tasks with minimal computational overhead.
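For intuition, here is a toy token-level gate between the retrieval-augmented and parametric distributions; the confidence-based criterion used below is a simplified stand-in for the paper's benefit/detriment formalism, not Tok-RAG's actual rule:

```python
import torch

def token_level_fusion(logits_rag: torch.Tensor,
                       logits_llm: torch.Tensor) -> torch.Tensor:
    """Toy token-level gate between RAG and parametric predictions.
    logits_rag / logits_llm: (seq_len, vocab) logits with and without the
    retrieved passage in context. For each position, keep whichever
    distribution the model is more confident in."""
    p_rag = logits_rag.softmax(dim=-1)
    p_llm = logits_llm.softmax(dim=-1)
    prefer_rag = p_rag.max(dim=-1).values >= p_llm.max(dim=-1).values
    return torch.where(prefer_rag.unsqueeze(-1), p_rag, p_llm)
```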

AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models
Audio Large Language Models (ALLMs) have gained widespread adoption, yet their trustworthiness remains underexplored. Existing evaluation frameworks, designed primarily for text, fail to address unique vulnerabilities introduced by audio's acoustic properties. We identify significant trustworthiness risks in ALLMs arising from non-semantic acoustic cues, including timbre, accent, and background noise, which can manipulate model behavior. We propose AudioTrust, a comprehensive framework for systematic evaluation of ALLM trustworthiness across audio-specific risks. AudioTrust encompasses six key dimensions: fairness, hallucination, safety, privacy, robustness, and authentication. The framework implements 26 distinct sub-tasks using a curated dataset of over 4,420 audio samples from real-world scenarios, including daily conversations, emergency calls, and voice assistant interactions. We conduct comprehensive evaluations across 18 experimental configurations using human-validated automated pipelines. Our evaluation of 14 state-of-the-art open-source and closed-source ALLMs reveals significant limitations when confronted with diverse high-risk audio scenarios, providing insights for secure deployment of audio models. Code and data are available at this https URL.
Gradient-Adaptive Policy Optimization: Towards Multi-Objective Alignment of Large Language Models

Researchers from the Chinese Academy of Sciences developed Gradient-Adaptive Policy Optimization (GAPO), a fine-tuning method for Large Language Models that employs gradient rescaling to balance multiple, potentially conflicting objectives such as helpfulness and harmlessness. GAPO (p=1) consistently outperformed existing multi-objective alignment baselines in both model-based and GPT-4o evaluations, achieving superior average helpfulness and harmlessness scores, while P-GAPO enabled the generation of a better Pareto front for user-preferred trade-offs.
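A hedged sketch of the core idea of per-objective gradient rescaling; the norm-balancing rule, the role of `p`, and the function name below are assumptions for illustration rather than GAPO's exact update:

```python
import torch

def gradient_adaptive_update(params, losses, optimizer, p: float = 1.0):
    """Compute each objective's gradient separately, rescale the gradients
    to a common norm, and apply their average."""
    per_objective_grads = []
    for loss in losses:                  # e.g. [helpfulness_loss, harmlessness_loss]
        optimizer.zero_grad()
        loss.backward(retain_graph=True)
        per_objective_grads.append(
            [prm.grad.detach().clone() if prm.grad is not None
             else torch.zeros_like(prm) for prm in params])

    norms = [sum(g.norm() ** 2 for g in grads).sqrt()
             for grads in per_objective_grads]
    target = sum(norms) / len(norms)     # common gradient scale

    optimizer.zero_grad()
    for i, prm in enumerate(params):
        combined = sum((target / (n + 1e-12)) ** p * grads[i]
                       for grads, n in zip(per_objective_grads, norms))
        prm.grad = combined / len(losses)
    optimizer.step()
```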

UncertaintyRAG: Span-Level Uncertainty Enhanced Long-Context Modeling for Retrieval-Augmented Generation
03 Oct 2024

This work introduces UncertaintyRAG, a lightweight and unsupervised retrieval model for long-context Retrieval-Augmented Generation (RAG). It leverages Signal-to-Noise Ratio (SNR)-based span uncertainty to estimate semantic similarity between text chunks, enhancing robustness to distribution shifts and achieving state-of-the-art average performance on long-context QA and summarization benchmarks while utilizing only 4% of the training data compared to baseline models.
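One possible reading of the SNR-based span uncertainty, computed from per-token log-probabilities; the exact estimator in the paper may differ:

```python
import math
from typing import List

def snr_span_uncertainty(token_logprobs: List[float]) -> float:
    """SNR-style score for a text span: magnitude of the mean token
    log-probability relative to its standard deviation. Spans whose
    log-probabilities fluctuate strongly around the mean score low."""
    n = len(token_logprobs)
    mean = sum(token_logprobs) / n
    var = sum((lp - mean) ** 2 for lp in token_logprobs) / n
    return abs(mean) / (math.sqrt(var) + 1e-8)
```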

DETree: DEtecting Human-AI Collaborative Texts via Tree-Structured Hierarchical Representation Learning
Detecting AI-involved text is essential for combating misinformation, plagiarism, and academic misconduct. However, AI text generation includes diverse collaborative processes (AI-written text edited by humans, human-written text edited by AI, and AI-generated text refined by other AI), where various or even new LLMs could be involved. Texts generated through these varied processes exhibit complex characteristics, presenting significant challenges for detection. Current methods model these processes rather crudely, primarily employing binary classification (purely human vs. AI-involved) or multi-classification (treating human-AI collaboration as a new class). We observe that representations of texts generated through different processes exhibit inherent clustering relationships. Therefore, we propose DETree, a novel approach that models the relationships among different processes as a Hierarchical Affinity Tree structure, and introduces a specialized loss function that aligns text representations with this tree. To facilitate this learning, we developed RealBench, a comprehensive benchmark dataset that automatically incorporates a wide spectrum of hybrid texts produced through various human-AI collaboration processes. Our method improves performance in hybrid text detection tasks and significantly enhances robustness and generalization in out-of-distribution scenarios, particularly in few-shot learning conditions, further demonstrating the promise of training-based approaches in OOD settings. Our code and dataset are available at this https URL.
VL-SAE: Interpreting and Enhancing Vision-Language Alignment with a Unified Concept Set

The VL-SAE framework introduces a Sparse Autoencoder architecture that interprets and enhances vision-language alignment in Vision-Language Models by mapping both modalities to a unified concept set. This approach improves the interpretability of cross-modal reasoning and demonstrates performance gains in zero-shot classification and hallucination reduction.
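A minimal sparse-autoencoder sketch of the unified-concept idea, assuming image and text embeddings share a common dimension; the layer sizes and the ReLU-plus-L1 sparsity scheme are illustrative choices, not necessarily the paper's design:

```python
import torch
import torch.nn as nn

class UnifiedConceptSAE(nn.Module):
    """Sparse autoencoder over a shared concept dictionary: image and text
    embeddings (assumed to share dimension d_model) are encoded into the
    same n_concepts-dimensional non-negative sparse code, so that aligned
    inputs should activate similar concepts."""
    def __init__(self, d_model: int = 768, n_concepts: int = 8192):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_concepts)
        self.decoder = nn.Linear(n_concepts, d_model, bias=False)

    def forward(self, x: torch.Tensor):
        z = torch.relu(self.encoder(x))   # sparse concept activations
        return self.decoder(z), z

def sae_loss(x, recon, z, l1_weight: float = 1e-3):
    """Reconstruction plus an L1 sparsity penalty on the concept code."""
    return ((recon - x) ** 2).mean() + l1_weight * z.abs().mean()
```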

MotionVerse: A Unified Multimodal Framework for Motion Comprehension, Generation and Editing
This paper proposes MotionVerse, a unified framework that harnesses the capabilities of Large Language Models (LLMs) to comprehend, generate, and edit human motion in both single-person and multi-person scenarios. To efficiently represent motion data, we employ a motion tokenizer with residual quantization, which converts continuous motion sequences into multi-stream discrete tokens. Furthermore, we introduce a Delay Parallel modeling strategy, which temporally staggers the encoding of residual token streams. This design enables LLMs to effectively capture inter-stream dependencies while maintaining computational efficiency comparable to single-stream modeling. Moreover, to alleviate modality interference between motion and language, we design a dual-tower architecture with modality-specific parameters, ensuring stable integration of motion information for both comprehension and generation tasks. Comprehensive ablation studies demonstrate the effectiveness of each component in MotionVerse, and extensive experiments showcase its superior performance across a wide range of motion-relevant tasks.
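A small sketch of what the delay-parallel staggering of residual token streams might look like; the shapes and padding conventions are assumptions about the described scheme:

```python
import torch

def delay_parallel_stagger(streams: torch.Tensor, pad_id: int = 0) -> torch.Tensor:
    """Temporally stagger residual token streams: stream k is shifted right
    by k steps, so at any decoding position the coarser residual levels are
    produced before the finer ones.
    streams: (n_streams, seq_len) discrete token ids."""
    n_streams, seq_len = streams.shape
    out = torch.full((n_streams, seq_len + n_streams - 1), pad_id,
                     dtype=streams.dtype, device=streams.device)
    for k in range(n_streams):
        out[k, k:k + seq_len] = streams[k]
    return out

# Example: 3 residual streams of length 5 -> staggered sequences of length 7.
tokens = torch.arange(15).reshape(3, 5)
print(delay_parallel_stagger(tokens))
```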
Continual Forgetting for Pre-trained Vision Models
Driven by privacy and security concerns, the need to erase unwanted information from pre-trained vision models is becoming increasingly evident. In real-world scenarios, erasure requests may originate at any time from both users and model owners, and they usually form a sequence. Under such a setting, selective information is expected to be continuously removed from a pre-trained model while the rest is maintained. We define this problem as continual forgetting and identify two key challenges. (i) For unwanted knowledge, efficient and effective deletion is crucial. (ii) For remaining knowledge, the impact brought by the forgetting procedure should be minimal. To address them, we propose Group Sparse LoRA (GS-LoRA). Specifically, towards (i), we use LoRA modules to fine-tune the FFN layers in Transformer blocks for each forgetting task independently, and towards (ii), a simple group sparse regularization is adopted, enabling automatic selection of specific LoRA groups and zeroing out the others. GS-LoRA is effective, parameter-efficient, data-efficient, and easy to implement. We conduct extensive experiments on face recognition, object detection, and image classification, and demonstrate that GS-LoRA manages to forget specific classes with minimal impact on other classes. Code will be released at this https URL.
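To make the mechanism concrete, here is a minimal sketch of a LoRA-adapted linear layer together with a group-lasso-style penalty over whole LoRA modules; the rank, initialization, and penalty weight are illustrative, not the paper's settings:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer with a trainable low-rank (LoRA) update."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + x @ self.lora_a.T @ self.lora_b.T

def group_sparse_penalty(lora_layers, weight: float = 1e-2) -> torch.Tensor:
    """Group-lasso-style regularizer: each LoRA module is penalized as one
    group via its Frobenius norm, so modules that are not needed for the
    current forgetting task are pushed toward exactly zero."""
    return weight * sum(
        torch.sqrt((m.lora_a ** 2).sum() + (m.lora_b ** 2).sum() + 1e-12)
        for m in lora_layers)
```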
FakingRecipe: Detecting Fake News on Short Video Platforms from the Perspective of Creative Process
As short-form video-sharing platforms become a significant channel for news consumption, fake news in short videos has emerged as a serious threat to the online information ecosystem, making the development of detection methods for this new scenario an urgent need. Compared with fake news in text and image formats, fake news on short video platforms contains rich but heterogeneous information across multiple modalities, posing a challenge for effective feature utilization. Unlike existing works that mostly focus on analyzing what is presented, we introduce a novel perspective that considers how it might be created. Through the lens of the creative process behind news video production, our empirical analysis uncovers the unique characteristics of fake news videos in material selection and editing. Based on the obtained insights, we design FakingRecipe, a creative-process-aware model for detecting fake news short videos. It captures fake news preferences in material selection from sentimental and semantic aspects and considers the traits of material editing from spatial and temporal aspects. To improve evaluation comprehensiveness, we construct FakeTT, an English dataset for this task, and conduct experiments on both FakeTT and the existing Chinese FakeSV dataset. The results show FakingRecipe's superiority in detecting fake news on short video platforms.
Improving Molecular Graph Generation with Flow Matching and Optimal Transport

Researchers from the Chinese Academy of Sciences and King Abdullah University of Science and Technology introduced GGFlow, the first discrete flow matching generative model that incorporates optimal transport for molecular graphs. This model achieves nearly perfect chemical validity and state-of-the-art performance in both unconditional and property-guided molecule generation with significantly fewer inference steps.

Music Style Transfer with Time-Varying Inversion of Diffusion Models

A method for music style transfer is introduced that leverages diffusion models with time-varying textual inversion, allowing users to transfer styles from any audio example, including non-musical sounds, to existing melodies while preserving structural content. This approach demonstrates superior performance in both content preservation and style fit compared to existing state-of-the-art techniques.

KnowCoder: Coding Structured Knowledge into LLMs for Universal Information Extraction

KnowCoder, developed by ICT, CAS, enhances Universal Information Extraction (UIE) by introducing a code-style schema representation that leverages LLMs' inherent code understanding. The model demonstrates superior generalization across diverse information extraction tasks, achieving a 12.5% relative improvement in zero-shot NER F1 over leading baselines and outperforming prior state-of-the-art models in Relation and Event Extraction after fine-tuning.
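A toy example of what a code-style schema might look like; the class layout and docstrings below are invented for illustration and need not match KnowCoder's actual schema conventions:

```python
from dataclasses import dataclass

@dataclass
class Entity:
    """Base class for all entity types."""
    span: str

@dataclass
class Person(Entity):
    """A person entity, e.g. 'Marie Curie'."""

@dataclass
class AwardEvent:
    """An award event with typed argument roles."""
    winner: Person
    prize: str
    year: str

# The LLM is prompted with such class definitions as the schema and asked to
# emit instantiations for a sentence, e.g.:
# AwardEvent(winner=Person("Marie Curie"), prize="Nobel Prize", year="1903")
```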

DIVERSIFY: A General Framework for Time Series Out-of-distribution Detection and Generalization

DIVERSIFY is a framework that tackles out-of-distribution detection and generalization for time series data by explicitly identifying and characterizing latent distributions without relying on predefined domain labels. It consistently outperforms baseline methods on OOD detection across seven diverse datasets, demonstrating its ability to learn robust representations for non-stationary time series.

D⁴M: Dataset Distillation via Disentangled Diffusion Model
Dataset distillation offers a lightweight synthetic dataset for fast network training with promising test accuracy. To imitate the performance of the original dataset, most approaches employ bi-level optimization, and the distillation space relies on the matching architecture. Nevertheless, these approaches either incur significant computational costs on large-scale datasets or suffer performance decline across architectures. We advocate for designing an economical dataset distillation framework that is independent of the matching architectures. Based on empirical observations, we argue that constraining the consistency of the real and synthetic image spaces will enhance cross-architecture generalization. Motivated by this, we introduce Dataset Distillation via Disentangled Diffusion Model (D⁴M), an efficient framework for dataset distillation. Compared to architecture-dependent methods, D⁴M employs a latent diffusion model to guarantee consistency and incorporates label information into category prototypes. The distilled datasets are versatile, eliminating the need for repeated generation of distinct datasets for various architectures. Through comprehensive experiments, D⁴M demonstrates superior performance and robust generalization, surpassing SOTA methods across most aspects.
MEC-Quant: Maximum Entropy Coding for Extremely Low Bit Quantization-Aware Training
Quantization-Aware Training (QAT) has attracted much attention as a way to produce efficient neural networks, yet current QAT still yields inferior performance compared with the Full Precision (FP) counterpart. In this work, we argue that quantization inevitably introduces biases into the learned representation, especially under the extremely low-bit setting. To cope with this issue, we propose Maximum Entropy Coding Quantization (MEC-Quant), a more principled objective that explicitly optimizes the structure of the representation, so that the learned representation is less biased and thus generalizes better to unseen in-distribution samples. To make the objective end-to-end trainable, we propose to leverage the minimal coding length in lossy data coding as a computationally tractable surrogate for the entropy, and further derive a scalable reformulation of the objective based on a Mixture of Experts (MoE) that not only allows fast computation but also handles the long-tailed distribution of weight or activation values. Extensive experiments on various computer vision tasks prove its superiority. With MEC-Quant, the limit of QAT is pushed to the x-bit activation for the first time, and the accuracy of MEC-Quant is comparable to or even surpasses the FP counterpart. Without bells and whistles, MEC-Quant establishes a new state of the art for QAT.
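For intuition, here is a sketch of a log-determinant coding-length surrogate of the kind such an objective builds on; the normalization constants are illustrative assumptions, and the MoE reformulation used in MEC-Quant is omitted:

```python
import torch
import torch.nn.functional as F

def coding_length(z: torch.Tensor, eps: float = 0.05) -> torch.Tensor:
    """Log-determinant coding-length surrogate for the entropy of a batch of
    representations z with shape (n, d). Maximizing it spreads the learned
    representation out, counteracting the representational bias attributed
    here to extremely low-bit quantization."""
    n, d = z.shape
    z = F.normalize(z, dim=1)                      # work with unit-norm features
    gram = z @ z.T                                 # (n, n) similarity matrix
    scale = d / (n * eps ** 2)
    identity = torch.eye(n, device=z.device, dtype=z.dtype)
    return 0.5 * (n + d) * torch.logdet(identity + scale * gram)
```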
Sycophancy in Vision-Language Models: A Systematic Analysis and an Inference-Time Mitigation Framework
Large Vision-Language Models (LVLMs) have shown significant capability in vision-language understanding. However, one critical issue that persists in these models is sycophancy, where models are unduly influenced by leading or deceptive prompts, resulting in biased outputs and hallucinations. Despite the rapid development of LVLMs, evaluating and mitigating sycophancy remains largely under-explored. In this work, we fill this gap by systematically analyzing sycophancy across multiple vision-language benchmarks and proposing an inference-time mitigation framework. We curate leading queries and quantify the susceptibility of state-of-the-art LVLMs to prompt-induced bias, revealing consistent performance degradation and instability across models and tasks. Our analysis further uncovers model-specific behavioral traits, such as sentiment sensitivity and prediction polarity shifts under sycophancy. To mitigate these issues, we propose a training-free, model-agnostic framework that operates entirely at inference time. Our approach first employs a query neutralizer, leveraging a language model to suppress implicit sycophantic bias in user queries. We then introduce a sycophancy-aware contrastive decoding mechanism that dynamically recalibrates token-level output distributions by contrasting responses to neutralized and leading queries. Finally, an adaptive logits refinement module further modifies the contrasted logits by integrating both an adaptive plausibility filter and a query sentiment scaler, ensuring coherent and robust generation. Extensive experiments demonstrate that this framework effectively mitigates sycophancy across all evaluated models while maintaining performance on neutral prompts. Our results suggest that sycophancy in LVLMs is a general and urgent challenge, and that inference-time strategies offer a promising path toward trustworthy multimodal reasoning.
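A minimal single-step sketch of sycophancy-aware contrastive decoding with an adaptive plausibility filter; the combination rule, thresholds, and function name are assumptions, not the paper's exact formulation:

```python
import math
import torch

def sycophancy_aware_logits(logits_neutral: torch.Tensor,
                            logits_leading: torch.Tensor,
                            alpha: float = 1.0,
                            plausibility: float = 0.1) -> torch.Tensor:
    """One decoding step: amplify what the model predicts under the
    neutralized query relative to the leading query, restricted to tokens
    the neutral run already finds plausible (an adaptive plausibility
    filter). Inputs are next-token logits of shape (vocab,)."""
    logp_neutral = logits_neutral.log_softmax(dim=-1)
    logp_leading = logits_leading.log_softmax(dim=-1)
    contrast = (1 + alpha) * logp_neutral - alpha * logp_leading
    cutoff = logp_neutral.max() + math.log(plausibility)   # keep near-max tokens
    return torch.where(logp_neutral >= cutoff, contrast,
                       torch.full_like(contrast, float("-inf")))
```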