alphaXiv

History

Papers Benchmarks

Pika Labs

4,241

08 Nov 2025

computer-science computer-vision-and-pattern-recognition machine-learning

SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models

UC Berkeley

NVIDIA CMU

MIT SJTU Princeton Pika Labs

SVDQuant introduces a 4-bit post-training quantization method for diffusion models that absorbs outliers using a high-precision low-rank component, coupled with the Nunchaku inference engine. This approach enables state-of-the-art image quality while achieving up to 3.6x memory reduction and 10.1x end-to-end inference speedup across various models and hardware.

682

603

02 Jul 2024

computer-science computer-vision-and-pattern-recognition generative-models

Consistency Flow Matching: Defining Straight Flows with Velocity Consistency

Stanford University

Peking University

University of Texas at Austin Pika Labs

Minkai Xu

Consistency Flow Matching introduces a new paradigm for training generative ODE models, defining straight flows by enforcing velocity consistency along trajectories. This approach achieves superior sample quality and significantly faster convergence compared to existing methods, particularly on high-resolution image generation tasks.

192

230

05 May 2024

computer-science artificial-intelligence computer-vision-and-pattern-recognition

Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs

Stanford University

Peking University Pika Labs

Minkai Xu

Diffusion models have exhibit exceptional performance in text-to-image generation and editing. However, existing methods often face challenges when handling complex text prompts that involve multiple objects with multiple attributes and relationships. In this paper, we propose a brand new training-free text-to-image generation/editing framework, namely Recaption, Plan and Generate (RPG), harnessing the powerful chain-of-thought reasoning ability of multimodal LLMs to enhance the compositionality of text-to-image diffusion models. Our approach employs the MLLM as a global planner to decompose the process of generating complex images into multiple simpler generation tasks within subregions. We propose complementary regional diffusion to enable region-wise compositional generation. Furthermore, we integrate text-guided image generation and editing within the proposed RPG in a closed-loop fashion, thereby enhancing generalization ability. Extensive experiments demonstrate our RPG outperforms state-of-the-art text-to-image diffusion models, including DALL-E 3 and SDXL, particularly in multi-category object composition and text-image semantic alignment. Notably, our RPG framework exhibits wide compatibility with various MLLM architectures (e.g., MiniGPT-4) and diffusion backbones (e.g., ControlNet). Our code is available at: https://github.com/YangLing0818/RPG-DiffusionMaster

1,751

139

20 Oct 2025

attention-mechanisms computer-science computer-vision-and-pattern-recognition

DynVFX: Augmenting Real Videos with Dynamic Content

Weizmann Institute of Science Pika Labs

We present a method for augmenting real-world videos with newly generated dynamic content. Given an input video and a simple user-provided text instruction describing the desired content, our method synthesizes dynamic objects or complex scene effects that naturally interact with the existing scene over time. The position, appearance, and motion of the new content are seamlessly integrated into the original footage while accounting for camera motion, occlusions, and interactions with other dynamic objects in the scene, resulting in a cohesive and realistic output video. We achieve this via a zero-shot, training-free framework that harnesses a pre-trained text-to-video diffusion transformer to synthesize the new content and a pre-trained vision-language model to envision the augmented scene in detail. Specifically, we introduce a novel inference-based method that manipulates features within the attention mechanism, enabling accurate localization and seamless integration of the new content while preserving the integrity of the original scene. Our method is fully automated, requiring only a simple user instruction. We demonstrate its effectiveness on a wide range of edits applied to real-world videos, encompassing diverse objects and scenarios involving both camera and object motion.

There are no more papers matching your filters at the moment.