Sony Group Corporation
Video understanding represents the most challenging frontier in computer vision, requiring models to reason about complex spatiotemporal relationships, long-term dependencies, and multimodal evidence. The recent emergence of Video-Large Multimodal Models (Video-LMMs), which integrate visual encoders with powerful decoder-based language models, has demonstrated remarkable capabilities in video understanding tasks. However, post-training, the critical phase that transforms these models from basic perception systems into sophisticated reasoning engines, remains fragmented across the literature. This survey provides the first comprehensive examination of post-training methodologies for Video-LMMs, encompassing three fundamental pillars: supervised fine-tuning (SFT) with chain-of-thought, reinforcement learning (RL) from verifiable objectives, and test-time scaling (TTS) through increased inference-time computation. We present a structured taxonomy that clarifies the roles, interconnections, and video-specific adaptations of these techniques, addressing unique challenges such as temporal localization, spatiotemporal grounding, long video efficiency, and multimodal evidence integration. Through systematic analysis of representative methods, we synthesize key design principles, insights, and evaluation protocols while identifying critical open challenges in reward design, scalability, and cost-performance optimization. We further curate essential benchmarks, datasets, and metrics to facilitate rigorous assessment of post-training effectiveness. This survey aims to provide researchers and practitioners with a unified framework for advancing Video-LMM capabilities. Additional resources and updates are maintained at: this https URL
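As an illustration of the RL-from-verifiable-objectives pillar, the sketch below shows a rule-based reward for temporal localization: temporal IoU between a predicted and an annotated segment plus a small format bonus. All function names and the response format are hypothetical, not taken from the survey.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def verifiable_reward(response, gt_segment):
    """Hypothetical verifiable reward: segment accuracy plus a format bonus,
    the kind of objective RL post-training can optimize without a learned reward model."""
    format_bonus = 0.1 if "<answer>" in response["text"] else 0.0
    return temporal_iou(response["segment"], gt_segment) + format_bonus

# Example: a model response localizing an event at 12.0-18.5 s
response = {"text": "<answer>12.0-18.5</answer>", "segment": (12.0, 18.5)}
print(verifiable_reward(response, gt_segment=(11.0, 19.0)))  # ~0.91 (tIoU 0.81 + 0.1 bonus)
```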
Consistency Trajectory Models (CTM), developed by Sony AI in collaboration with Carnegie Mellon and Stanford, unify score-based diffusion models and distillation methods to enable high-quality image generation with very few steps. CTM achieves new state-of-the-art FID scores on CIFAR-10 (1.63 at 2 NFEs) and ImageNet 64x64 (1.73 at 2 NFEs), while also ensuring that sample quality improves as the number of sampling steps increases, addressing a limitation of prior distillation models.
MMAudio introduces a multimodal joint training paradigm for high-quality video-to-audio synthesis, training a unified transformer network from scratch on both audio-visual and large-scale audio-text datasets. This approach achieves state-of-the-art performance in audio quality, semantic alignment, and temporal synchronization on public benchmarks, while being competitive with larger proprietary models.
Researchers at Sony Group Corporation, Sony AI, and The University of Tokyo developed Di4C, a method for distilling discrete diffusion models that explicitly learns dimensional correlations. This approach enables substantial speed-ups, achieving a 2x acceleration on ImageNet VQ-space image generation and over 2x for masked diffusion language models, while maintaining or improving sample quality and diversity.
A video reasoning agent, Video-R4, is presented, enabling large multimodal models to perform "visual rumination" by iteratively selecting and refining visual evidence in text-rich videos. This approach achieves state-of-the-art results on video question answering benchmarks and exhibits robust zero-shot generalization across diverse multimodal tasks.
The SCoT (Streaming Chain-of-Thought) framework integrates Chain-of-Thought reasoning into streaming full-duplex End-to-End Spoken Dialogue Systems, enabling low-latency, simultaneous listening and speaking with enhanced semantic coherence. This approach, particularly the SCoT-Response variant, produces more human-like turn-taking behavior and stronger emotional alignment while outperforming existing baselines in dialogue quality.
We present 3DScenePrompt, a framework that generates the next video chunk from arbitrary-length input while enabling precise camera control and preserving scene consistency. Unlike methods conditioned on a single image or a short clip, we employ dual spatio-temporal conditioning that reformulates context-view referencing across the input video. Our approach conditions on both temporally adjacent frames for motion continuity and spatially adjacent content for scene consistency. However, when generating beyond temporal boundaries, directly using spatially adjacent frames would incorrectly preserve dynamic elements from the past. We address this by introducing a 3D scene memory that represents exclusively the static geometry extracted from the entire input video. To construct this memory, we leverage dynamic SLAM with our newly introduced dynamic masking strategy that explicitly separates static scene geometry from moving elements. The static scene representation can then be projected to any target viewpoint, providing geometrically consistent warped views that serve as strong 3D spatial prompts while allowing dynamic regions to evolve naturally from temporal context. This enables our model to maintain long-range spatial coherence and precise camera control without sacrificing computational efficiency or motion realism. Extensive experiments demonstrate that our framework significantly outperforms existing methods in scene consistency, camera controllability, and generation quality. Project page: this https URL
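To illustrate how a static 3D scene memory can serve as a spatial prompt, here is a minimal sketch (not the paper's implementation) that projects static world-frame points into a target camera with a pinhole model and no z-buffering; all function and variable names are assumptions.

```python
import numpy as np

def warp_static_scene(points_w, colors, K, T_w2c, height, width):
    """Project static 3D points (world frame) into a target view, returning a
    sparsely warped RGB image and a validity mask usable as a 3D spatial prompt.
    Dynamic regions are intentionally absent and left to temporal conditioning."""
    pts_c = (T_w2c[:3, :3] @ points_w.T + T_w2c[:3, 3:4]).T     # world -> camera frame
    in_front = pts_c[:, 2] > 1e-6                               # keep points in front of the camera
    uvw = (K @ pts_c[in_front].T).T
    uv = np.round(uvw[:, :2] / uvw[:, 2:3]).astype(int)
    image = np.zeros((height, width, 3), dtype=np.float32)
    mask = np.zeros((height, width), dtype=bool)
    valid = (uv[:, 0] >= 0) & (uv[:, 0] < width) & (uv[:, 1] >= 0) & (uv[:, 1] < height)
    image[uv[valid, 1], uv[valid, 0]] = colors[in_front][valid]
    mask[uv[valid, 1], uv[valid, 0]] = True
    return image, mask
```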
In this work, we address the task of 3D reconstruction in dynamic scenes, where object motions frequently degrade the quality of previous 3D pointmap regression methods, such as DUSt3R, that were originally designed for static 3D scene reconstruction. Although these methods provide an elegant and powerful solution in static settings, they struggle in the presence of dynamic motions that disrupt alignment based solely on camera poses. To overcome this, we propose D²USt3R, which directly regresses Static-Dynamic Aligned Pointmaps (SDAP) that simultaneously capture both static and dynamic 3D scene geometry. By explicitly incorporating both spatial and temporal aspects, our approach successfully encapsulates dense 3D correspondence in the proposed pointmaps, enhancing downstream tasks. Extensive experimental evaluations demonstrate that our proposed approach consistently achieves superior 3D reconstruction performance across various datasets featuring complex motions.
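For context, pointmap regression methods in this family are typically trained with a confidence-weighted per-pixel 3D regression objective; the following is a DUSt3R-style sketch under that assumption, not the exact loss used for SDAP.

```python
import torch

def pointmap_regression_loss(pred_pts, pred_conf, gt_pts, alpha=0.2):
    """Confidence-weighted pointmap regression (a sketch): per-pixel Euclidean error
    between predicted and ground-truth 3D points, scaled by the predicted confidence,
    with a log-confidence term that discourages trivially low confidence."""
    err = torch.linalg.norm(pred_pts - gt_pts, dim=-1)      # (B, H, W) per-pixel 3D error
    return (pred_conf * err - alpha * torch.log(pred_conf)).mean()
```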
This study aims to construct an audio-video generative model with minimal computational cost by leveraging pre-trained single-modal generative models for audio and video. To achieve this, we propose a novel method that guides single-modal models to cooperatively generate well-aligned samples across modalities. Specifically, given two pre-trained base diffusion models, we train a lightweight joint guidance module to adjust scores separately estimated by the base models so that they match the score of the joint distribution over audio and video. We show that this guidance can be computed using the gradient of the optimal discriminator, which distinguishes real audio-video pairs from fake ones independently generated by the base models. Based on this analysis, we construct a joint guidance module by training this discriminator. Additionally, we adopt a loss function to stabilize the discriminator's gradient and make it work as a noise estimator, as in standard diffusion models. Empirical evaluations on several benchmark datasets demonstrate that our method improves both single-modal fidelity and multimodal alignment with relatively few parameters. The code is available at: this https URL
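A minimal sketch of the guidance idea follows, under the assumption that the discriminator outputs a logit so that log D/(1-D) equals that logit; the module and argument names are hypothetical, not the paper's code.

```python
import torch

def jointly_guided_scores(base_score_a, base_score_v, discriminator, a_t, v_t, t, scale=1.0):
    """Adjust independently estimated audio/video scores with the gradient of the
    discriminator's log-density ratio, steering samples toward the joint distribution."""
    with torch.no_grad():                                   # frozen pre-trained base diffusion models
        s_a, s_v = base_score_a(a_t, t), base_score_v(v_t, t)
    a_t = a_t.detach().requires_grad_(True)
    v_t = v_t.detach().requires_grad_(True)
    logit = discriminator(a_t, v_t, t)                      # D = sigmoid(logit), so log(D/(1-D)) == logit
    grad_a, grad_v = torch.autograd.grad(logit.sum(), (a_t, v_t))
    return s_a + scale * grad_a, s_v + scale * grad_v
```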
A differentiable model named DiffVox captures and analyzes vocal effect parameter distributions from hundreds of professionally produced music tracks. The model successfully extracts vocal presets, revealing non-Gaussian parameter distributions and strong correlations, which provides data-driven priors for developing more realistic AI-powered music production tools and reveals connections to perceptual attributes like spaciousness.
This paper presents CAT-V, a training-free framework that generates fine-grained, object-centric video captions based on user-selected objects via spatiotemporal multimodal prompting. The framework effectively combines pre-trained segmentation, temporal analysis, and multimodal language models to produce detailed, temporally-aware narratives, demonstrating its versatility across various object types and user interactions.
We introduce the Robust Audio Watermarking Benchmark (RAW-Bench), a benchmark for evaluating deep learning-based audio watermarking methods with standardized and systematic comparisons. To simulate real-world usage, we introduce a comprehensive audio attack pipeline with various distortions such as compression, background noise, and reverberation, along with a diverse test dataset including speech, environmental sounds, and music recordings. Evaluating four existing watermarking methods on RAW-Bench reveals two main insights: (i) neural compression techniques pose the most significant challenge, even when algorithms are trained with such compressions; and (ii) training with audio attacks generally improves robustness, although it is insufficient in some cases. Furthermore, we find that specific distortions, such as polarity inversion, time stretching, or reverb, seriously affect certain methods. The evaluation framework is accessible at this http URL.
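To make the attack-pipeline idea concrete, below is a small hedged sketch of waveform-level distortions (additive noise at a target SNR, random gain, polarity inversion) composed before detection; the functions are illustrative and not the benchmark's API.

```python
import numpy as np

def add_noise(x, snr_db=20.0, seed=0):
    """Add white noise at the requested signal-to-noise ratio (dB)."""
    noise = np.random.default_rng(seed).standard_normal(x.shape)
    scale = np.sqrt(np.mean(x**2) / (np.mean(noise**2) * 10 ** (snr_db / 10) + 1e-12))
    return x + scale * noise

def random_gain(x, low_db=-6.0, high_db=6.0, seed=1):
    """Scale the waveform by a random gain drawn in dB."""
    gain_db = np.random.default_rng(seed).uniform(low_db, high_db)
    return x * 10 ** (gain_db / 20)

def polarity_inversion(x):
    return -x

def attack_pipeline(x, attacks):
    """Apply a sequence of distortions to a watermarked signal before detection."""
    for attack in attacks:
        x = attack(x)
    return x

# Example: distort a placeholder watermarked signal (1 s sine tone at 16 kHz)
watermarked = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
attacked = attack_pipeline(watermarked, [add_noise, random_gain, polarity_inversion])
```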
We introduce MMAudioSep, a generative model for video/text-queried sound separation that is founded on a pretrained video-to-audio model. By leveraging knowledge about the relationship between video/text and audio learned through a pretrained audio generative model, we can train the model more efficiently, i.e., the model does not need to be trained from scratch. We evaluate the performance of MMAudioSep by comparing it to existing separation models, including models based on both deterministic and generative approaches, and find it is superior to the baseline models. Furthermore, we demonstrate that even after acquiring functionality for sound separation via fine-tuning, the model retains the ability for original video-to-audio generation. This highlights the potential of foundational sound generation models to be adopted for sound-related downstream tasks. Our code is available at this https URL.
HQ-VAE, developed by Sony AI, presents a novel framework for learning hierarchical discrete latent representations using a unified variational Bayes approach. This method effectively mitigates codebook and layer collapse issues in VQ-VAE variants, leading to improved reconstruction accuracy and simplified training for generative models on diverse data like images and audio.
TalkHier, a framework from Sony Group Corporation, enhances LLM multi-agent systems by introducing a structured communication protocol and a hierarchical refinement system. This approach achieved state-of-the-art performance across MMLU, WikiQA, and Camera datasets.
Fine-grained control over voice impressions (e.g., making a voice brighter or calmer) is a key frontier for creating more controllable text-to-speech. However, this nascent field faces two key challenges. The first is the problem of impression leakage, where the synthesized voice is undesirably influenced by the speaker's reference audio rather than the separately specified target impression. The second is the lack of a public, annotated corpus. To mitigate impression leakage, we propose two methods: 1) a training strategy that separately uses one utterance for speaker identity and another utterance of the same speaker for the target impression, and 2) a novel reference-free model that generates a speaker embedding solely from the target impression, combining improved robustness against leakage with the convenience of reference-free generation. Objective and subjective evaluations demonstrate a significant improvement in controllability. Our best method reduced the mean squared error of 11-dimensional voice impression vectors from 0.61 to 0.41 objectively and from 1.15 to 0.92 subjectively, while maintaining high fidelity. To foster reproducible research, we introduce LibriTTS-VI, the first public voice impression dataset released with clear annotation standards, built upon the LibriTTS-R corpus.
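The first mitigation can be illustrated with a small hedged sketch of training-pair construction: speaker identity is taken from one utterance while the target impression label comes from a different utterance of the same speaker. The data layout and field names are assumptions, not the paper's implementation.

```python
import random

def build_training_example(speaker_to_utts, impression_labels, rng=random.Random(0)):
    """Hypothetical pair construction to curb impression leakage: the reference used
    for speaker identity is never the utterance whose 11-dim impression vector is the
    conditioning target (each speaker is assumed to have at least two utterances)."""
    speaker = rng.choice(sorted(speaker_to_utts))
    identity_utt, target_utt = rng.sample(speaker_to_utts[speaker], 2)  # two distinct utterances
    return {
        "speaker_reference": identity_utt,                    # provides speaker identity only
        "target_impression": impression_labels[target_utt],   # 11-dim voice impression vector
        "synthesis_target": target_utt,                       # utterance the model learns to reproduce
    }
```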
Researchers from the University of Rochester and Sony Group Corporation developed PU-VALOR, a dataset of pseudo-untrimmed audio-visual videos with precise temporal annotations, and AVicuna, a multimodal large language model. This work enhances fine-grained temporal understanding in complex video-language tasks, achieving state-of-the-art performance in video QA and audio-visual event dense localization.
The research from Sony AI develops a multi-track contrastive learning approach to automatically identify music samples, achieving a 15% improvement in mean Average Precision on the Sample100 dataset. This method generates realistic training data by mixing instrument stems from different songs, allowing the system to robustly identify source material even when heavily transformed and embedded within new compositions.
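A hedged sketch of the data-generation idea: a stem from one song is transformed (here, crudely pitch-shifted by resampling) and mixed into another song's stems, yielding an (anchor, positive) pair for contrastive training; all names and the transformation are simplifications of what a real pipeline would use.

```python
import numpy as np

def make_contrastive_pair(sample_stem, backing_stems, semitones=2.0, gain=0.5):
    """Create an artificial 'sampling' example: pitch-shift a source stem via naive
    resampling and bury it in a mix of another song's stems. The source stem is the
    anchor; the synthetic mixture is the positive for a contrastive objective."""
    rate = 2.0 ** (semitones / 12.0)                        # resampling factor for the pitch shift
    idx = np.clip((np.arange(len(sample_stem)) * rate).astype(int), 0, len(sample_stem) - 1)
    transformed = sample_stem[idx]
    backing = np.sum(backing_stems, axis=0)                 # mix of the other song's instrument stems
    n = min(len(transformed), len(backing))
    positive = backing[:n] + gain * transformed[:n]
    return sample_stem, positive                            # (anchor, positive) pair
```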