Tsinghua University and Li Auto developed DriveVLM and DriveVLM-Dual, a VLM-integrated autonomous driving system that enhances scene understanding and planning, especially in complex "long-tail" scenarios. The hybrid DriveVLM-Dual system achieved state-of-the-art performance on the nuScenes planning task and demonstrated real-time asynchronous operation on production vehicle hardware, with an average inference latency of 410 ms.
Street Gaussians presents an explicit 3D Gaussian Splatting framework tailored for modeling dynamic urban scenes, achieving real-time novel view synthesis and drastically reduced training times compared to prior methods. The approach renders at 135 FPS and trains in just half an hour on the Waymo Open Dataset, while producing sharper, more detailed views, especially of moving objects, reaching PSNRs of 34.61 and 30.23 on dynamic elements.
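A minimal sketch of the explicit scene layout this kind of method relies on (shapes and names are our own, not the paper's code): a static background Gaussian set in world coordinates plus per-vehicle Gaussian sets in object-local coordinates, moved each frame by tracked object poses before rasterization.

```python
import torch

def compose_scene(static_mu, objects, frame_idx):
    """static_mu: (N,3) background Gaussian centers; objects: list of dicts
    holding 'mu' (M,3) object-local centers and 'poses' (T,4,4) tracked
    object-to-world transforms (illustrative interface)."""
    parts = [static_mu]
    for obj in objects:
        T = obj["poses"][frame_idx]                          # this frame's pose
        mu_h = torch.cat([obj["mu"], torch.ones(len(obj["mu"]), 1)], dim=1)
        parts.append((mu_h @ T.T)[:, :3])                    # local -> world
    return torch.cat(parts, dim=0)                           # hand to rasterizer
```

Keeping dynamic objects as separate, rigidly transformed Gaussian sets is what lets the scene be edited and re-simulated without retraining the background.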
World4Drive presents an end-to-end autonomous driving framework that generates planning trajectories directly from raw sensor data, eliminating the need for manual perception annotations by leveraging intention-aware physical latent world models. This system achieves a 46.7% relative reduction in collision rate and 3.75x faster training convergence compared to previous self-supervised methods on the nuScenes dataset.
A framework synthesizes photorealistic street view videos with precise camera control by conditioning a video diffusion model on LiDAR point clouds. This method enables superior view extrapolation in dynamic urban scenes and facilitates real-time rendering through distillation into a 3D Gaussian Splatting representation.
Training expressive flow-based policies with off-policy reinforcement learning is notoriously unstable due to gradient pathologies in the multi-step action sampling process. We trace this instability to a fundamental connection: the flow rollout is algebraically equivalent to a residual recurrent computation, making it susceptible to the same vanishing and exploding gradients as RNNs. To address this, we reparameterize the velocity network using principles from modern sequential models, introducing two stable architectures: Flow-G, which incorporates a gated velocity, and Flow-T, which utilizes a decoded velocity. We then develop a practical SAC-based algorithm, enabled by a noise-augmented rollout, that facilitates direct end-to-end training of these policies. Our approach supports both from-scratch and offline-to-online learning and achieves state-of-the-art performance on continuous control and robotic manipulation benchmarks, eliminating the need for common workarounds like policy distillation or surrogate objectives.
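The claimed RNN equivalence is easy to see in code. Below is a minimal sketch (illustrative dimensions, plain MLP velocity network) of a K-step Euler rollout of a flow policy: each step is a residual update a ← a + dt·v(a, s, t), so backpropagating through the rollout behaves like backpropagating through a residual RNN. The gated (Flow-G) and decoded (Flow-T) reparameterizations are not shown.

```python
import torch
import torch.nn as nn

ACT_DIM, OBS_DIM = 6, 17  # illustrative dimensions

class VelocityNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(ACT_DIM + OBS_DIM + 1, 256), nn.SiLU(),
            nn.Linear(256, ACT_DIM),
        )

    def forward(self, a, s, t):
        return self.net(torch.cat([a, s, t], dim=-1))

def flow_rollout(velocity_net, s, K=8):
    a = torch.randn(s.shape[0], ACT_DIM)      # a_0 ~ N(0, I)
    dt = 1.0 / K
    for k in range(K):
        t = torch.full((s.shape[0], 1), k * dt)
        a = a + dt * velocity_net(a, s, t)    # residual recurrence, as in an RNN
    return a

actions = flow_rollout(VelocityNet(), torch.randn(4, OBS_DIM))
```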
Researchers from Li Auto provide a comprehensive survey and categorization of memory mechanisms in Large Language Models, mapping them to human cognitive memory types. The work details current approaches for memory acquisition, management, and utilization, while also identifying key limitations when compared to human cognitive abilities like generalization and adaptive forgetting.
This paper introduces a Gaussian-centric framework for end-to-end autonomous driving that uses 3D Gaussian scene representations to balance scene comprehensiveness with computational efficiency.
DETR3D, developed by researchers from MIT, Toyota Research Institute, and others, introduces an end-to-end framework for 3D object detection directly from multi-view images. It achieves state-of-the-art performance among camera-only methods on the nuScenes dataset by leveraging 3D-to-2D projection and multi-view feature aggregation with learnable object queries.
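A hedged sketch of DETR3D's core step (function and variable names are our own): each learnable query decodes a 3D reference point, which is projected into every camera view to bilinearly sample image features; the per-view samples are averaged to refine the query. The real model additionally masks points that fall outside a view and refines queries over several decoder layers.

```python
import torch
import torch.nn.functional as F

def sample_multiview(ref_pts, feats, cam_proj):
    """ref_pts: (Q,3) 3D reference points; feats: (V,C,H,W) per-view feature
    maps; cam_proj: (V,3,4) world-to-feature-pixel projections (assumed)."""
    V, C, H, W = feats.shape
    Q = ref_pts.shape[0]
    pts_h = torch.cat([ref_pts, torch.ones(Q, 1)], dim=-1)      # homogeneous (Q,4)
    agg = torch.zeros(Q, C)
    for v in range(V):
        uvw = pts_h @ cam_proj[v].T                             # project to view v
        uv = uvw[:, :2] / uvw[:, 2:3].clamp(min=1e-5)           # perspective divide
        grid = 2 * uv / torch.tensor([[W - 1.0, H - 1.0]]) - 1  # to [-1, 1]
        sampled = F.grid_sample(feats[v:v + 1], grid.view(1, Q, 1, 2),
                                align_corners=True)             # bilinear sample
        agg += sampled.view(C, Q).T
    return agg / V                                              # average over views
```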
PosePilot introduces a lightweight, plug-and-play module that enhances camera pose controllability in generative world models by explicitly integrating self-supervised depth and ego-motion estimation. This approach significantly improves geometric consistency and accuracy in generated videos, particularly for autonomous driving simulations, and demonstrates strong cross-domain adaptability.
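The self-supervised depth and ego-motion signal PosePilot builds on is typically a photometric-consistency loss; the sketch below shows that standard formulation (not the paper's exact losses): warp the next frame into the current one using predicted depth and pose, then penalize the reconstruction error.

```python
import torch
import torch.nn.functional as F

def photometric_loss(img_t, img_t1, depth_t, pose, K, K_inv):
    """img_*: (B,3,H,W); depth_t: (B,1,H,W); pose: (B,4,4) frame t -> t+1;
    K, K_inv: (3,3) camera intrinsics and inverse (illustrative interface)."""
    B, _, H, W = img_t.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], 0).float().view(3, -1)
    cam = (K_inv @ pix).unsqueeze(0) * depth_t.view(B, 1, -1)   # back-project
    cam_h = torch.cat([cam, torch.ones(B, 1, H * W)], 1)        # (B,4,HW)
    proj = K @ (pose @ cam_h)[:, :3]                            # into frame t+1
    uv = proj[:, :2] / proj[:, 2:].clamp(min=1e-5)
    grid = torch.stack([2 * uv[:, 0] / (W - 1) - 1,
                        2 * uv[:, 1] / (H - 1) - 1], -1).view(B, H, W, 2)
    warped = F.grid_sample(img_t1, grid, align_corners=True)
    return (warped - img_t).abs().mean()                        # photometric error
```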
ChunkLLM introduces a lightweight, pluggable framework to accelerate large language model inference, achieving up to 4.48x speedup and reducing KV cache usage by nearly 50% on long texts. This is accomplished while maintaining high performance across various long- and short-text benchmarks through dynamic semantic chunking, attention distillation, and an Intra-Chunk Attention Consistency mechanism.
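A hedged sketch of chunk-wise KV selection (our reading of the idea, not the paper's exact mechanism): cached keys are grouped into semantic chunks, each chunk is scored for relevance against the current query, and only tokens in the top-k chunks are kept for attention, which is what shrinks the KV cache on long inputs.

```python
import torch

def select_chunks(q, keys, chunk_ids, k=4):
    """q: (d,) current query; keys: (T,d) cached keys; chunk_ids: (T,) long
    tensor assigning each cached token to a semantic chunk (assumed names)."""
    scores = keys @ q                                        # token relevance
    n_chunks = int(chunk_ids.max()) + 1
    chunk_scores = torch.zeros(n_chunks).index_add_(0, chunk_ids, scores)
    counts = torch.zeros(n_chunks).index_add_(0, chunk_ids,
                                              torch.ones_like(scores))
    keep = (chunk_scores / counts).topk(min(k, n_chunks)).indices
    return torch.isin(chunk_ids, keep)   # bool mask to index the KV cache
```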
HDMapNet introduces an online framework for constructing local, vectorized high-definition (HD) semantic maps in real-time using only onboard camera and/or LiDAR sensors. The system achieves state-of-the-art performance, with a camera-LiDAR fusion model yielding an IoU of 44.5% and an mAP of 30.6% on the nuScenes dataset, significantly outperforming previous methods and single-modal approaches.
Real-time, high-fidelity reconstruction of dynamic driving scenes is challenged by complex dynamics and sparse views, with prior methods struggling to balance quality and efficiency. We propose DrivingScene, an online, feed-forward framework that reconstructs 4D dynamic scenes from only two consecutive surround-view images. Our key innovation is a lightweight residual flow network that predicts the non-rigid motion of dynamic objects per camera on top of a learned static scene prior, explicitly modeling dynamics via scene flow. We also introduce a coarse-to-fine training paradigm that circumvents the instabilities common to end-to-end approaches. Experiments on nuScenes dataset show our image-only method simultaneously generates high-quality depth, scene flow, and 3D Gaussian point clouds online, significantly outperforming state-of-the-art methods in both dynamic reconstruction and novel view synthesis.
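A minimal sketch of the residual-flow idea (interface and names assumed, not the paper's code): the static scene prior is warped rigidly by ego-motion, and a lightweight network predicts only the per-point residual motion of dynamic objects on top of it, so the full scene flow is the sum of the two.

```python
import torch

def warp_points(pts, ego_T):
    """pts: (N,3) static 3D points; ego_T: (4,4) ego-motion between frames."""
    pts_h = torch.cat([pts, torch.ones(len(pts), 1)], dim=-1)
    return (pts_h @ ego_T.T)[:, :3]            # rigid, ego-induced motion

def dynamic_positions(pts, ego_T, residual_flow_net, feats):
    static_next = warp_points(pts, ego_T)      # where static points would go
    residual = residual_flow_net(feats)        # (N,3) non-rigid object motion
    return static_next + residual              # full scene flow
```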
A new fine-tuning framework called QR-LoRA was developed, employing QR decomposition to enable disentangled control of visual attributes in text-to-image generative models. It produced higher quality images with independent content and style manipulation while reducing trainable parameters by 50% compared to traditional LoRA, consistently performing well on SDXL, SD3, and FLUX.1-dev.
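A hedged sketch of the QR-based update (our reading of the structural idea; the paper's exact parameterization may differ): take an orthonormal basis from a QR decomposition of the pretrained weight, freeze it, and train only a single delta on R, roughly halving trainable parameters versus LoRA's two matrices A and B.

```python
import torch

d, r = 512, 16
W = torch.randn(d, d)                   # frozen pretrained weight (toy size)
Q, R = torch.linalg.qr(W)               # W = Q R, Q has orthonormal columns
Q_r, R_r = Q[:, :r], R[:r, :]           # rank-r slices, both kept frozen
delta_R = torch.zeros_like(R_r, requires_grad=True)  # the only trainable tensor

def adapted_forward(x):
    # Effective weight: W + Q_r @ delta_R. Because Q_r is orthonormal and
    # shared, separately trained deltas (e.g. content vs. style) can be
    # merged by simple addition -- the disentanglement the paper targets.
    return x @ (W + Q_r @ delta_R).T
```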
Researchers at Li Auto's Code Intelligence Team developed SIADAFIX, an adaptive framework for automated program repair that integrates "fast" and "slow thinking" modes to intelligently orchestrate repair workflows. By analyzing issue descriptions and dynamically adjusting the repair strategy, SIADAFIX achieves a 60.7% Pass@1 on the SWE-bench Lite benchmark with Claude-4 Sonnet, improving upon existing LLM-based approaches.
Li Auto researchers introduce TokenFLEX, a framework enabling flexible visual token counts in Vision-Language Models through dynamic token training and an adaptive projector architecture, reducing visual token usage by 28% and training time by 13% while maintaining competitive performance across benchmarks when compared to fixed-token approaches.
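One simple way to realize a flexible visual token count, sketched below under our own assumptions (module layout and names are illustrative, not TokenFLEX's exact architecture): adaptively pool the visual feature grid to an arbitrary target token count before projecting into the LLM embedding space.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FlexProjector(nn.Module):
    def __init__(self, vis_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(vis_dim, llm_dim), nn.GELU(),
                                  nn.Linear(llm_dim, llm_dim))

    def forward(self, grid_feats, n_tokens):
        """grid_feats: (B,C,H,W) vision features; n_tokens: target token
        count (a perfect square in this simplified sketch)."""
        side = int(n_tokens ** 0.5)
        pooled = F.adaptive_avg_pool2d(grid_feats, side)      # (B,C,side,side)
        return self.proj(pooled.flatten(2).transpose(1, 2))   # (B,n_tokens,llm_dim)
```

Training with the token count sampled per batch is what lets a single projector serve multiple budgets at inference time.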
Researchers at UNSW Sydney developed Class-aware Contrastive Learning (CCL), a modular framework that mitigates inter-class confusion in multi-class anomaly detection by integrating local and global contrastive losses with existing reconstruction-based models. This approach achieved an Image-level AUROC of 90.6% across 60 object categories and demonstrated comparable performance using pseudo-class labels, making it suitable for truly unsupervised scenarios.
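An illustrative class-aware contrastive term (not the authors' exact local/global losses): features from the same object class, real or pseudo-labeled, are pulled together while classes are pushed apart, and the term is added alongside the existing reconstruction objective.

```python
import torch
import torch.nn.functional as F

def class_contrastive(z, labels, tau=0.1):
    """z: (B,d) features; labels: (B,) class or pseudo-class indices."""
    z = F.normalize(z, dim=-1)
    sim = z @ z.T / tau                                      # pairwise similarity
    eye = torch.eye(len(z), dtype=torch.bool)
    sim = sim.masked_fill(eye, float("-inf"))                # drop self-pairs
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~eye
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    loss_i = -log_prob.masked_fill(~pos, 0.0).sum(1) / pos.sum(1).clamp(min=1)
    return loss_i.mean()
```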
Researchers at Li Auto developed GeoGramBench, a new benchmark to evaluate Large Language Models' ability to derive spatial geometric understanding from procedural code. The evaluation revealed that current LLMs struggle with constructing global spatial representations from symbolic instructions, particularly on complex tasks where even top models achieved less than 50% accuracy.
The MANTA dataset, developed by researchers from UNSW Sydney and collaborators, introduces the first large-scale, multi-view, visual-text resource for anomaly detection in tiny objects (4-20 mm³), featuring over 137,000 multi-view images across 38 categories. This dataset addresses challenges like object heterogeneity and unpredictable poses, demonstrating that multi-view approaches significantly improve anomaly detection, with an average I-AUROC of 91% on its visual tasks.
Driving scene generation is a critical domain for autonomous driving, enabling downstream applications including perception and planning evaluation. Occupancy-centric methods have recently achieved state-of-the-art results by offering consistent conditioning across frames and modalities; however, their performance heavily depends on annotated occupancy data, which remains scarce. To overcome this limitation, we curate Nuplan-Occ, the largest semantic occupancy dataset to date, constructed from the widely used Nuplan benchmark. Its scale and diversity facilitate not only large-scale generative modeling but also downstream autonomous driving applications. Based on this dataset, we develop a unified framework that jointly synthesizes high-quality semantic occupancy, multi-view videos, and LiDAR point clouds. Our approach incorporates a spatio-temporal disentangled architecture to support high-fidelity spatial expansion and temporal forecasting of 4D dynamic occupancy. To bridge modal gaps, we further propose two novel techniques: a Gaussian splatting-based sparse point map rendering strategy that enhances multi-view video generation, and a sensor-aware embedding strategy that explicitly models LiDAR sensor properties for realistic multi-LiDAR simulation. Extensive experiments demonstrate that our method achieves superior generation fidelity and scalability compared to existing approaches, and validates its practical value in downstream tasks. Repo: this https URL
MV-VTON, developed by researchers from Harbin Institute of Technology, Peking University, and Li Auto, introduces and solves the Multi-View Virtual Try-On task by generating realistic images of a person wearing a garment from various angles. The work leverages diffusion models with novel view-adaptive feature selection and joint attention blocks, outperforming existing methods on both multi-view and frontal-view tasks and demonstrating superior garment detail preservation. A new Multi-View Garment (MVG) dataset was also collected and released to support this novel research direction.