Westlake University
Researchers from Beijing University of Posts and Telecommunications, Westlake University, and Zhejiang University, along with the OpenHelix Team, introduce VLA-Adapter, an efficient method for bridging vision-language representations to robotic actions. The approach achieves state-of-the-art-level performance with a tiny-scale 0.5B-parameter backbone and no robotic-data pre-training, reaching a 97.3% average success rate on the LIBERO benchmark with 3x faster inference (219.2 Hz) than comparable methods.
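To make the bridging idea concrete, here is a minimal sketch of a lightweight head that cross-attends learnable action queries over frozen VLM hidden states. The module name, dimensions, and attention form are illustrative assumptions, not the paper's exact architecture.

```python
# Hypothetical sketch of a VLM-to-action bridge; not VLA-Adapter's real code.
import torch
import torch.nn as nn

class ActionBridge(nn.Module):
    """Cross-attends learnable action queries over frozen VLM hidden states."""
    def __init__(self, vlm_dim=896, act_dim=7, horizon=8, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(horizon, vlm_dim))
        self.attn = nn.MultiheadAttention(vlm_dim, n_heads, batch_first=True)
        self.head = nn.Linear(vlm_dim, act_dim)  # one continuous action per step

    def forward(self, vlm_hidden):  # vlm_hidden: (B, T, vlm_dim)
        q = self.queries.unsqueeze(0).expand(vlm_hidden.size(0), -1, -1)
        fused, _ = self.attn(q, vlm_hidden, vlm_hidden)
        return self.head(fused)     # (B, horizon, act_dim)

actions = ActionBridge()(torch.randn(2, 64, 896))  # -> (2, 8, 7)
```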
TTT3R improves the length generalization of recurrent 3D reconstruction models by integrating a test-time training (TTT) approach with a confidence-aware state update rule. This method maintains constant memory and real-time inference, achieving a 2x improvement in global pose estimation accuracy for long sequences on datasets like ScanNet and TUM-D compared to prior RNN-based baselines.
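A toy illustration of a confidence-aware state update, under the assumption (ours, for illustration) that confidence acts as a soft write gate on a fixed-size memory: low-confidence frames barely overwrite the state, which is what keeps long sequences from drifting while memory stays constant.

```python
import torch

def update_state(state, proposal, confidence):
    """state, proposal: (D,) memory vectors; confidence in [0, 1]."""
    gate = confidence.clamp(0.0, 1.0)
    return (1.0 - gate) * state + gate * proposal  # gated, constant-memory write

# Dummy stream of (frame feature, confidence) pairs for illustration.
stream = [(torch.randn(256), torch.rand(1)) for _ in range(100)]
state = torch.zeros(256)
for frame_feat, conf in stream:
    state = update_state(state, frame_feat, conf)
```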
LUFFY introduces a framework that enhances Large Reasoning Models (LRMs) by integrating off-policy guidance into Reinforcement Learning with Verifiable Rewards (RLVR). This approach enables LRMs to acquire new reasoning capabilities from stronger external policies, achieving state-of-the-art performance on math benchmarks, superior generalization on out-of-distribution tasks, and successfully training weaker foundation models where on-policy methods fail.
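As a rough sketch of the mixing idea, one can add the stronger external policy's traces to the on-policy rollout group and weight everything by a group-relative advantage; LUFFY's actual policy-shaping objective is more refined, so treat the following as illustrative only.

```python
import torch

def mixed_group_loss(logp_on, r_on, logp_off, r_off):
    """logp_*: per-sequence log-probs under the current policy.
    r_*: verifiable rewards (e.g., 1.0 if the final answer checks out)."""
    rewards = torch.cat([r_on, r_off])
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)  # group-relative
    logps = torch.cat([logp_on, logp_off])
    return -(adv.detach() * logps).mean()

loss = mixed_group_loss(
    logp_on=torch.randn(4, requires_grad=True),
    r_on=torch.tensor([0.0, 1.0, 0.0, 0.0]),
    logp_off=torch.randn(2, requires_grad=True),
    r_off=torch.tensor([1.0, 1.0]),  # the stronger policy is usually correct
)
loss.backward()
```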
DeepScientist, developed by WestlakeNLP, is an autonomous AI system that successfully surpassed human state-of-the-art on three frontier AI tasks within a month-long cycle, including a 142.8% improvement in agent failure attribution and a 7.9% higher AUROC for AI text detection. The system employs a goal-driven Bayesian optimization approach and a multi-agent architecture to conduct large-scale, iterative scientific exploration and validate findings.
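A compact stand-in for the goal-driven Bayesian-optimization loop: a GP surrogate scores untried hypotheses, the highest upper-confidence candidate is validated, and the outcome refits the surrogate. The idea embeddings and evaluate() below are placeholders, not DeepScientist's actual components.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)
ideas = rng.normal(size=(50, 8))         # stand-in embeddings of 50 hypotheses

def evaluate(x):                         # stand-in for running an experiment
    return -float(np.sum((x - 0.5) ** 2))

tried, scores = [0], [evaluate(ideas[0])]
gp = GaussianProcessRegressor()
for _ in range(10):
    gp.fit(ideas[tried], scores)
    mu, sigma = gp.predict(ideas, return_std=True)
    ucb = mu + 1.0 * sigma               # explore where the surrogate is unsure
    ucb[tried] = -np.inf                 # never re-run a validated idea
    best = int(np.argmax(ucb))
    tried.append(best)
    scores.append(evaluate(ideas[best]))
print(max(scores))
```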
This survey synthesizes the extensive and fragmented field of robot manipulation, providing a comprehensive overview that unifies diverse methodologies and challenges under novel classification systems. It structures the landscape by introducing new taxonomies for high-level planning, low-level learning-based control, and key bottlenecks, while outlining future research directions.
HUMAN3R presents a unified, feed-forward framework for online 4D human-scene reconstruction from monocular video. The system jointly estimates multi-person global human motions, dense 3D scene geometry, and camera parameters in real-time at 15 FPS, outperforming or matching prior methods on various reconstruction benchmarks while consuming only 8 GB of GPU memory.
The notion of chiral-induced spin selectivity (CISS) has attracted intensive research interest recently. However, practical applications of CISS effects face challenges due to relatively low spin polarization. In this Letter, we propose a non-perturbative theory illustrating how circularly polarized (CP) light enhances CISS effects through strong light-matter interactions. We introduce a Floquet electronic friction model to study the nonadiabatic dynamics and spin transport through a chiral molecule in a molecular junction subjected to external driving. Our results show that the interplay of nonadiabatic effects and light-matter interactions can significantly (>90%) enhance electron spin polarization under CP light. Our predictions can be very useful in experiments for using CP light to control spin current in chiral molecular junctions.
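For reference, the spin polarization quoted above is conventionally defined from the spin-resolved currents through the junction:

```latex
% Conventional definition of electron spin polarization P, with
% I_\uparrow and I_\downarrow the spin-up and spin-down currents;
% the Letter reports P > 90% under CP driving.
P \;=\; \frac{I_{\uparrow} - I_{\downarrow}}{I_{\uparrow} + I_{\downarrow}}
```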
Researchers from Westlake and Zhejiang Universities introduced VLA-RFT, a framework for fine-tuning Vision-Language-Action policies by interacting with a learned world model to generate verified rewards. This method enhanced robustness to environmental perturbations and improved average task success rates by 4.5 percentage points on the LIBERO benchmark with significantly fewer training iterations.
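A self-contained toy of the idea, policy gradients computed entirely inside a learned world model with a verifier supplying the reward; the toy dynamics and verifier below are stand-ins, not VLA-RFT's actual models.

```python
import torch
import torch.nn as nn

policy = nn.Linear(2, 2)                        # maps observation -> action mean
world_model = lambda obs, act: obs + 0.1 * act  # learned-dynamics stand-in
verifier = lambda obs, goal: -((obs - goal) ** 2).sum()  # verified reward
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

goal = torch.tensor([1.0, -1.0])
for _ in range(200):
    obs, logps = torch.zeros(2), []
    for _ in range(5):                          # imagined rollout, no real robot
        dist = torch.distributions.Normal(policy(obs), 0.1)
        act = dist.sample()
        logps.append(dist.log_prob(act).sum())
        obs = world_model(obs, act)
    reward = verifier(obs, goal)
    loss = -reward.detach() * torch.stack(logps).sum()  # REINFORCE
    opt.zero_grad(); loss.backward(); opt.step()
```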
Researchers at Westlake University, Emory University, Dalian University of Technology, University of Surrey, and University of Oxford investigated the 'Curse of Depth' in large language models, demonstrating that Pre-Layer Normalization leads to exponential output variance growth, rendering deep layers ineffective. They propose LayerNorm Scaling (LNS), a hyperparameter-free method that reduces variance growth to a polynomial rate, leading to improved pre-training perplexity and an average 1.8% gain on downstream tasks across various LLM scales.
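The fix itself is a one-liner: as the paper describes it, LNS scales each layer's normalization output by the inverse square root of its (1-indexed) depth. A minimal sketch of that rule:

```python
import math
import torch
import torch.nn as nn

class ScaledLayerNorm(nn.Module):
    """LayerNorm Scaling: damp deep layers so output variance grows
    polynomially with depth instead of exponentially."""
    def __init__(self, dim, layer_index):
        super().__init__()
        self.ln = nn.LayerNorm(dim)
        self.scale = 1.0 / math.sqrt(layer_index)  # hyperparameter-free

    def forward(self, x):
        return self.scale * self.ln(x)

blocks = [ScaledLayerNorm(512, l) for l in range(1, 25)]  # a 24-layer stack
```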
SimKO is a method that improves Large Language Models trained with Reinforcement Learning with Verifiable Rewards by mitigating a phenomenon called "probability over-concentration" during token generation. The approach employs asymmetric gradient redistribution to enhance `pass@K` performance while also improving `pass@1` on various math and logical reasoning tasks, consistently outperforming existing RLVR techniques.
WORLDFORGE presents a training-free guidance framework that enables video diffusion models to achieve precise 3D/4D camera trajectory control while preserving their pre-trained generative priors. This framework consistently outperforms existing baselines, reducing FID to 96.08 for 3D static scenes and FVD to 93.17 for 4D dynamic scenes, and supports various video post-production tasks.
A comprehensive survey systematically reviews advancements in feed-forward 3D reconstruction and view synthesis since 2020, categorizing methods by underlying scene representations such as NeRF, pointmaps, and 3D Gaussian Splatting. It details how deep learning has enabled significantly faster and more generalizable 3D vision, highlighting diverse applications and critical open research challenges.
DeepReview proposes a multi-stage framework that simulates a human expert's deep thinking process for LLM-based paper reviews, decomposing the task into novelty assessment, multi-dimensional evaluation, and reliability verification. The resulting DeepReviewer-14B model, trained on a synthesized 13K dataset, achieves notable improvements in review rating accuracy and ranking quality over larger baseline models.
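The stage decomposition maps naturally onto chained LLM calls; a schematic sketch, where ask is a hypothetical completion function and the prompts are placeholders, not DeepReview's actual pipeline:

```python
def deep_review(paper, ask):
    """Three-stage review: novelty -> multi-dimensional scores -> verification."""
    novelty = ask(f"Assess the novelty of this paper:\n{paper}")
    scores = ask(f"Given this novelty assessment:\n{novelty}\n"
                 f"Evaluate the paper on soundness, clarity, and significance.")
    verdict = ask(f"Verify the reliability of these judgments and flag any "
                  f"unsupported claims:\n{scores}")
    return {"novelty": novelty, "evaluation": scores, "verification": verdict}
```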
This survey provides the first systematic review of multimodal long-context token compression, categorizing techniques across images, videos, and audio by both modality and algorithmic mechanism. It reveals how diverse compression strategies address the quadratic complexity of self-attention in Multimodal Large Language Models (MLLMs), improving efficiency and enabling new applications like real-time robotic perception and high-resolution medical image analysis.
WebVoyager is an end-to-end web agent framework that utilizes Large Multimodal Models to perform complex tasks by interacting directly with real-world websites. It achieved a 59.1% task success rate on a new benchmark of real-world web tasks, outperforming text-only and "All Tools" baselines, and introduces a reliable GPT-4V-powered automatic evaluation protocol.
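Schematically, such an agent is an observe-decide-act loop over live pages; the browser and lmm interfaces below are hypothetical stand-ins, not WebVoyager's code.

```python
def run_task(task, browser, lmm, max_steps=15):
    """Drive a browser with a large multimodal model until it answers."""
    history = []
    for _ in range(max_steps):
        screenshot, elements = browser.observe()  # image + labeled elements
        action = lmm.decide(task, screenshot, elements, history)
        if action.kind == "answer":               # agent declares completion
            return action.text
        browser.execute(action)                   # click / type / scroll ...
        history.append(action)
    return None                                   # budget exhausted
```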
TrustJudge introduces a probabilistic framework to systematically mitigate two fundamental inconsistencies—score-comparison and pairwise transitivity—within LLM-as-a-judge evaluation. The method significantly reduces conflict ratios and non-transitivity rates by employing distribution-sensitive scoring and likelihood-aware aggregation, while maintaining or enhancing evaluation accuracy across various large language models and tasks.
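One concrete reading of "distribution-sensitive scoring" is to replace the argmax rating token with the expectation under the judge's distribution over the rating scale; a minimal sketch under that assumption:

```python
import math

def expected_score(rating_logprobs):
    """rating_logprobs: dict mapping rating (e.g. 1..5) -> token log-prob."""
    probs = {r: math.exp(lp) for r, lp in rating_logprobs.items()}
    z = sum(probs.values())                  # renormalize over the rating scale
    return sum(r * p / z for r, p in probs.items())

print(expected_score({1: -4.0, 2: -2.5, 3: -0.9, 4: -0.7, 5: -2.0}))
```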
Researchers from Zhejiang University, Ant Group, and others introduced ENVIRONMENT TUNING, a training paradigm for Large Language Model (LLM) agents that focuses on modifying the learning environment itself. This method enables agents to achieve robust generalization and stability in complex, multi-turn tool-use tasks despite extreme data scarcity, significantly boosting performance on benchmarks like BFCL V3 by up to 18.50% and improving out-of-distribution generalization where supervised fine-tuning models collapse.
Diffusion-based large language models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs, leveraging denoising-based generation to enable inherent parallelism. More and more open-sourced dLLMs are emerging, yet their widespread adoption remains constrained by the lack of a standardized and efficient inference framework. We present dInfer, an efficient and extensible framework for dLLM inference. dInfer decomposes the inference pipeline into four modular components--model, diffusion iteration manager, decoding strategy, and KV-cache manager--and integrates novel algorithms for each component alongside system-level optimizations. Through this combination of algorithmic innovations and system enhancements, dInfer achieves substantial efficiency gains without compromising output quality on LLaDA-MoE. At batch size 1, it surpasses 1,100 tokens per second on HumanEval and averages over 800 tokens per second across six benchmarks on 8× H800 GPUs. Compared to prior systems, dInfer delivers a 10× speedup over Fast-dLLM while maintaining similar model performance. Even compared to the AR model Qwen2.5-3B (with a comparable number of activated parameters and similar performance), which is highly optimized with the latest vLLM inference engine, dInfer still delivers a 2-3× speedup. The implementation of dInfer is open-sourced at this https URL.
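The four-component decomposition above maps naturally onto small interfaces; the sketch below mirrors the abstract's component names but is not dInfer's actual API.

```python
from typing import Protocol

class Model(Protocol):
    def denoise(self, tokens, step): ...       # one diffusion denoising pass

class IterationManager(Protocol):
    def steps(self, seq_len): ...              # schedule of diffusion steps

class DecodingStrategy(Protocol):
    def commit(self, logits, tokens): ...      # which positions to finalize

class KVCacheManager(Protocol):
    def refresh(self, tokens, step): ...       # when/what to (re)cache

def generate(model, iters, decoder, cache, tokens):
    for step in iters.steps(len(tokens)):
        cache.refresh(tokens, step)
        logits = model.denoise(tokens, step)
        tokens = decoder.commit(logits, tokens)  # parallel multi-token commit
    return tokens
```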
HoliTom introduces a training-free, holistic token merging framework for Video Large Language Models, synergistically combining outer-LLM spatio-temporal compression with inner-LLM merging. This approach reduces computational costs to 6.9% of original FLOPs while retaining 99.1% of performance and accelerating inference by 2.28 times.
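The core merging primitive can be pictured with a toy similarity-based reducer (HoliTom's actual outer/inner spatio-temporal scheme is more involved): repeatedly average the most cosine-similar adjacent token pair until the budget is met.

```python
import torch
import torch.nn.functional as F

def merge_tokens(tokens, keep):
    """tokens: (N, D) visual tokens; merge until only `keep` remain."""
    tokens = tokens.clone()
    while tokens.size(0) > keep:
        sims = F.cosine_similarity(tokens[:-1], tokens[1:], dim=-1)
        i = int(sims.argmax())                 # most redundant adjacent pair
        merged = (tokens[i] + tokens[i + 1]) / 2
        tokens = torch.cat([tokens[:i], merged[None], tokens[i + 2:]])
    return tokens

out = merge_tokens(torch.randn(100, 64), keep=7)  # ~93% token reduction
```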
Easi3R introduces a training-free framework that extracts disentangled motion from dynamic videos by interpreting the attention mechanisms of a pre-trained static 3D foundation model like DUSt3R. The method achieves state-of-the-art performance in dynamic object segmentation, camera pose estimation, and 4D reconstruction without requiring any additional training or fine-tuning on dynamic datasets.
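In spirit, dynamics fall out of where cross-view attention fails to find support; the aggregation and threshold below are our illustrative assumptions, not Easi3R's exact rule.

```python
import torch

def dynamic_mask(attn_maps, thresh=0.3):
    """attn_maps: (L, H, W) per-layer cross-frame attention strengths."""
    support = attn_maps.mean(dim=0)            # aggregate over layers
    support = (support - support.min()) / (support.max() - support.min() + 1e-8)
    return support < thresh                    # low support => dynamic pixel

mask = dynamic_mask(torch.rand(12, 32, 32))    # boolean (32, 32) mask
```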