Chain-of-Action (CoA) proposes a visuo-motor policy that generates robot trajectories autoregressively in reverse, starting from a task goal and reasoning backward to the current state. This approach addresses compounding errors and enhances spatial generalization, achieving an average success rate of 0.552 on 60 RLBench tasks and demonstrating improved performance on real-world Fetch robot manipulation.
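To make the reverse-decoding idea concrete, here is a minimal sketch of goal-to-current autoregressive trajectory generation. The `policy_step` callable, the horizon, and the action shapes are placeholders for illustration, not the authors' actual CoA architecture.

```python
import torch

# Minimal sketch of goal-conditioned reverse autoregressive decoding
# (illustrative only; `policy_step` and the action dimensionality are
# placeholders, not the CoA model itself).

def generate_reverse_trajectory(policy_step, goal_action, current_obs, horizon=16):
    """Decode a trajectory backward from the task goal toward the current state.

    policy_step: callable(obs, partial_trajectory) -> preceding action (torch.Tensor)
    goal_action: keyframe action at the task goal, shape (action_dim,)
    current_obs: encoding of the current observation
    """
    trajectory = [goal_action]
    for _ in range(horizon - 1):
        # Each step predicts the action *preceding* those generated so far,
        # conditioned on the current observation, so errors stay anchored to the goal.
        prev_action = policy_step(current_obs, torch.stack(trajectory))
        trajectory.append(prev_action)
    # Reverse so the robot executes from the current state toward the goal.
    return torch.stack(trajectory[::-1])
```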
Researchers developed game-theory-inspired workflows to systematically enhance the strategic decision-making capabilities of large language models (LLMs) in various negotiation and strategic games. Integrating classical game theory principles, these workflows enabled LLM agents to achieve near-optimal allocations in incomplete-information negotiations, with up to 100% agreement and envy-freeness, and significantly improved adherence to Nash Equilibria in complete-information games compared to baseline LLM performance.
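For readers unfamiliar with the envy-freeness metric reported above, the small check below shows what it tests: no agent would prefer another agent's bundle to its own. It assumes additive valuations over items, which is an illustrative simplification rather than the paper's exact setup.

```python
# Illustrative envy-freeness check, assuming additive valuations over items
# (an assumption; the paper's negotiation setting may differ).

def is_envy_free(valuations, allocation):
    """valuations[i][j]: agent i's value for item j.
    allocation[i]: list of item indices assigned to agent i."""
    def bundle_value(agent, bundle):
        return sum(valuations[agent][item] for item in bundle)
    n = len(valuations)
    for i in range(n):
        my_value = bundle_value(i, allocation[i])
        for j in range(n):
            if i != j and bundle_value(i, allocation[j]) > my_value:
                return False  # agent i envies agent j's bundle
    return True

# Example: two agents splitting three items.
valuations = [[6, 1, 3], [2, 5, 4]]
allocation = [[0], [1, 2]]                    # agent 0 gets item 0; agent 1 gets items 1 and 2
print(is_envy_free(valuations, allocation))   # True: neither agent prefers the other's bundle
```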
Researchers at CAS Key Laboratory of AI Safety developed a theoretical framework to quantify the benefit and detriment of retrieved information in Retrieval-Augmented Generation (RAG) at the token level. This framework formalizes benefit as distribution completion and detriment as distribution contradiction, enabling a practical method (Tok-RAG) that improves RAG robustness and performance across diverse tasks with minimal computational overhead.
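The sketch below illustrates the general idea of token-level collaboration between a pure-LLM path and a retrieval-augmented path. The routing rule used here (comparing the two next-token distributions and keeping the more confident one when they contradict) is an illustrative stand-in, not the benefit/detriment criterion derived in the paper.

```python
import torch
import torch.nn.functional as F

# Hedged sketch of token-level routing between pure-LLM and retrieval-augmented
# generation; the overlap/confidence rule is illustrative, not Tok-RAG's exact criterion.

def route_next_token(logits_llm, logits_rag, agreement_threshold=0.5):
    p_llm = F.softmax(logits_llm, dim=-1)
    p_rag = F.softmax(logits_rag, dim=-1)
    # Overlap between the two distributions: high overlap ~ retrieval "completes"
    # the LLM's distribution, low overlap ~ the two contradict each other.
    overlap = torch.minimum(p_llm, p_rag).sum()
    if overlap >= agreement_threshold:
        return int(p_rag.argmax())            # benefit outweighs detriment: trust the RAG path
    # Contradiction: fall back to whichever path is more confident on this token.
    return int(p_rag.argmax()) if p_rag.max() >= p_llm.max() else int(p_llm.argmax())
```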
This work introduces UncertaintyRAG, a lightweight and unsupervised retrieval model for long-context Retrieval-Augmented Generation (RAG). It leverages Signal-to-Noise Ratio (SNR)-based span uncertainty to estimate semantic similarity between text chunks, enhancing robustness to distribution shifts and achieving state-of-the-art average performance on long-context QA and summarization benchmarks while utilizing only 4% of the training data compared to baseline models.
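One plausible way to turn per-token log-probabilities into an SNR-style span uncertainty score between two chunks is sketched below. The assumed `token_logprobs` helper and the mean-over-std estimator are an illustrative reading, not necessarily the paper's exact formulation.

```python
import torch

# Hedged sketch: an SNR-style span score between chunks. `token_logprobs(context, target)`
# is assumed to return the LLM's per-token log-probabilities of `target` given `context`;
# treating SNR as mean/std of those log-probs is an illustrative assumption.

def snr_span_score(token_logprobs, chunk_a, chunk_b):
    logps = token_logprobs(context=chunk_a, target=chunk_b)   # shape: (num_tokens,)
    mean, std = logps.mean(), logps.std()
    return (mean / (std + 1e-8)).item()   # higher (less negative) => more stable span

def chunk_similarity(token_logprobs, chunk_a, chunk_b):
    # Symmetrize so similarity does not depend on which chunk conditions the other.
    return 0.5 * (snr_span_score(token_logprobs, chunk_a, chunk_b)
                  + snr_span_score(token_logprobs, chunk_b, chunk_a))
```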
The VL-SAE framework introduces a Sparse Autoencoder architecture that interprets and enhances vision-language alignment in Vision-Language Models by mapping both modalities to a unified concept set. This approach improves the interpretability of cross-modal reasoning and demonstrates performance gains in zero-shot classification and hallucination reduction.
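A minimal sparse-autoencoder sketch in this spirit is shown below: image and text embeddings are encoded into a single overcomplete "concept" activation space and reconstructed from a shared decoder dictionary. The dimensions and the L1 sparsity penalty are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Shared-concept sparse autoencoder sketch: both modalities pass through one concept set.

class SharedConceptSAE(nn.Module):
    def __init__(self, embed_dim=768, num_concepts=8192):
        super().__init__()
        self.encoder = nn.Linear(embed_dim, num_concepts)
        self.decoder = nn.Linear(num_concepts, embed_dim, bias=False)

    def forward(self, x):
        concepts = F.relu(self.encoder(x))        # sparse, non-negative concept activations
        recon = self.decoder(concepts)
        return recon, concepts

def sae_loss(model, image_emb, text_emb, l1_weight=1e-3):
    loss = 0.0
    for emb in (image_emb, text_emb):             # both modalities share one concept set
        recon, concepts = model(emb)
        loss = loss + F.mse_loss(recon, emb) + l1_weight * concepts.abs().mean()
    return loss
```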
Researchers from the Chinese Academy of Sciences and King Abdullah University of Science and Technology introduced GGFlow, the first discrete flow matching generative model that incorporates optimal transport for molecular graphs. This model achieves nearly perfect chemical validity and state-of-the-art performance in both unconditional and property-guided molecule generation with significantly fewer inference steps.
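For orientation, here is a hedged sketch of one discrete flow-matching training step on categorical node types: each type is kept with probability t and otherwise resampled from a uniform prior, and the network learns to recover the clean types. The optimal-transport coupling and graph-specific components that GGFlow adds are omitted, and the `denoiser` interface is an assumption.

```python
import torch
import torch.nn.functional as F

# Basic discrete flow-matching step on node types (illustrative; not GGFlow's full model).

def dfm_training_step(denoiser, node_types, num_classes, optimizer):
    """node_types: LongTensor of clean categorical labels, shape (num_nodes,)."""
    t = torch.rand(())                                     # single time value for the graph
    keep = torch.rand(node_types.shape) < t                # keep the clean label w.p. t
    noise = torch.randint(0, num_classes, node_types.shape)
    noisy_types = torch.where(keep, node_types, noise)     # corrupted sample at time t

    logits = denoiser(noisy_types, t)                      # predict clean categories
    loss = F.cross_entropy(logits, node_types)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```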
A method for music style transfer is introduced that leverages diffusion models with time-varying textual inversion, allowing users to transfer styles from any audio example, including non-musical sounds, to existing melodies while preserving structural content. This approach demonstrates superior performance in both content preservation and style fit compared to existing state-of-the-art techniques.
Researchers from CAS Key Laboratory of AI Security and Kuaishou Technology propose TEA (Test-time Energy Adaptation), a novel method that reinterprets a pre-trained classifier as an energy-based model to address distribution shifts. This approach directly aligns the model's perception of the data distribution with the incoming test data, leading to state-of-the-art generalization performance and improved confidence calibration across various image corruption and domain generalization benchmarks.
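The sketch below shows the energy-based reinterpretation in its simplest form: the classifier's logits define an energy E(x) = -logsumexp(f(x)), and the marginal energy is adapted to a test batch with a contrastive-divergence-style update using a few Langevin (SGLD) steps for negative samples. Step sizes, noise scale, and which parameters are updated are illustrative assumptions rather than TEA's exact recipe.

```python
import torch

def energy(model, x):
    return -torch.logsumexp(model(x), dim=-1)          # marginal energy per sample

def tea_adaptation_step(model, test_x, optimizer, sgld_steps=10, step_size=1.0, noise=0.01):
    # Draw negative samples by running SGLD downhill in energy, starting from noise.
    x_neg = torch.rand_like(test_x, requires_grad=True)
    for _ in range(sgld_steps):
        e = energy(model, x_neg).sum()
        grad = torch.autograd.grad(e, x_neg)[0]
        x_neg = (x_neg - step_size * grad + noise * torch.randn_like(x_neg)).detach().requires_grad_(True)

    # Contrastive divergence: pull energy down on test data, push it up on negatives,
    # aligning the model's perceived density with the incoming test distribution.
    loss = energy(model, test_x).mean() - energy(model, x_neg.detach()).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```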
KnowCoder, developed by ICT, CAS, enhances Universal Information Extraction (UIE) by introducing a code-style schema representation that leverages LLMs' inherent code understanding. The model demonstrates superior generalization across diverse information extraction tasks, achieving a 12.5% relative improvement in zero-shot NER F1 over leading baselines and outperforming prior state-of-the-art models in Relation and Event Extraction after fine-tuning.
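To give a flavor of what a code-style schema looks like, the snippet below exposes entity and relation types to the LLM as Python classes, with extraction amounting to instantiating them from text. The specific class layout is a simplified assumption, not KnowCoder's exact schema library.

```python
# Illustrative code-style schema: types as classes, extraction as instantiation code.

class Entity:
    def __init__(self, mention: str):
        self.mention = mention

class Organization(Entity):
    """An organization entity, e.g. a company, agency, or institution."""

class FoundedBy:
    """Relation linking an organization to the person who founded it."""
    def __init__(self, organization: Organization, founder: Entity):
        self.organization = organization
        self.founder = founder

# The extraction prompt asks the LLM to emit instantiation code such as:
# results = [FoundedBy(Organization("OpenAI"), Entity("Sam Altman"))]
```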
DIVERSIFY is a framework that tackles out-of-distribution detection and generalization for time series data by explicitly identifying and characterizing latent distributions without relying on predefined domain labels. It consistently outperforms baseline methods on OOD detection across seven diverse datasets, demonstrating its ability to learn robust representations for non-stationary time series.