Sony Group Corporation
Video understanding represents the most challenging frontier in computer vision, requiring models to reason about complex spatiotemporal relationships, long-term dependencies, and multimodal evidence. The recent emergence of Video-Large Multimodal Models (Video-LMMs), which integrate visual encoders with powerful decoder-based language models, has demonstrated remarkable capabilities in video understanding tasks. However, post-training, the critical phase that transforms these models from basic perception systems into sophisticated reasoning engines, remains fragmented across the literature. This survey provides the first comprehensive examination of post-training methodologies for Video-LMMs, encompassing three fundamental pillars: supervised fine-tuning (SFT) with chain-of-thought, reinforcement learning (RL) from verifiable objectives, and test-time scaling (TTS) through increased inference-time computation. We present a structured taxonomy that clarifies the roles, interconnections, and video-specific adaptations of these techniques, addressing unique challenges such as temporal localization, spatiotemporal grounding, long video efficiency, and multimodal evidence integration. Through systematic analysis of representative methods, we synthesize key design principles, insights, and evaluation protocols while identifying critical open challenges in reward design, scalability, and cost-performance optimization. We further curate essential benchmarks, datasets, and metrics to facilitate rigorous assessment of post-training effectiveness. This survey aims to provide researchers and practitioners with a unified framework for advancing Video-LMM capabilities. Additional resources and updates are maintained at: this https URL
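As an illustration of the RL-from-verifiable-objectives pillar, the sketch below shows a rule-based reward for temporal localization: temporal IoU between a predicted and an annotated segment plus a small format bonus. All function names and the response format are hypothetical, not taken from the survey.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def verifiable_reward(response, gt_segment):
    """Hypothetical verifiable reward: segment accuracy plus a format bonus,
    the kind of objective RL post-training can optimize without a learned reward model."""
    format_bonus = 0.1 if "<answer>" in response["text"] else 0.0
    return temporal_iou(response["segment"], gt_segment) + format_bonus

# Example: a model response localizing an event at 12.0-18.5 s
response = {"text": "<answer>12.0-18.5</answer>", "segment": (12.0, 18.5)}
print(verifiable_reward(response, gt_segment=(11.0, 19.0)))  # ~0.91 (tIoU 0.81 + 0.1 bonus)
```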
Consistency Trajectory Models (CTM), developed by Sony AI in collaboration with Carnegie Mellon and Stanford, unify score-based diffusion models and distillation methods to enable high-quality image generation with very few steps. CTM achieves new state-of-the-art FID scores on CIFAR-10 (1.63 at 2 NFEs) and ImageNet 64x64 (1.73 at 2 NFEs), while also ensuring that sample quality improves as the number of sampling steps increases, addressing a limitation of prior distillation models.
MMAudio introduces a multimodal joint training paradigm for high-quality video-to-audio synthesis, training a unified transformer network from scratch on both audio-visual and large-scale audio-text datasets. This approach achieves state-of-the-art performance in audio quality, semantic alignment, and temporal synchronization on public benchmarks, while being competitive with larger proprietary models.
Researchers at Sony Group Corporation, Sony AI, and The University of Tokyo developed Di4C, a method for distilling discrete diffusion models that explicitly learns dimensional correlations. This approach enables substantial speed-ups, achieving a 2x acceleration on ImageNet VQ-space image generation and over 2x for masked diffusion language models, while maintaining or improving sample quality and diversity.
A video reasoning agent, Video-R4, is presented, enabling large multimodal models to perform "visual rumination" by iteratively selecting and refining visual evidence in text-rich videos. This approach achieves state-of-the-art results on video question answering benchmarks and exhibits robust zero-shot generalization across diverse multimodal tasks.
The SCoT (Streaming Chain-of-Thought) framework integrates Chain-of-Thought reasoning into streaming full-duplex End-to-End Spoken Dialogue Systems, enabling low-latency, simultaneous listening and speaking with enhanced semantic coherence. This approach, particularly the SCoT-Response variant, produces more human-like turn-taking behavior and stronger emotional alignment while outperforming existing baselines in dialogue quality.
We present 3DScenePrompt, a framework that generates the next video chunk from arbitrary-length input while enabling precise camera control and preserving scene consistency. Unlike methods conditioned on a single image or a short clip, we employ dual spatio-temporal conditioning that reformulates context-view referencing across the input video. Our approach conditions on both temporally adjacent frames for motion continuity and spatially adjacent content for scene consistency. However, when generating beyond temporal boundaries, directly using spatially adjacent frames would incorrectly preserve dynamic elements from the past. We address this by introducing a 3D scene memory that represents exclusively the static geometry extracted from the entire input video. To construct this memory, we leverage dynamic SLAM with our newly introduced dynamic masking strategy that explicitly separates static scene geometry from moving elements. The static scene representation can then be projected to any target viewpoint, providing geometrically consistent warped views that serve as strong 3D spatial prompts while allowing dynamic regions to evolve naturally from temporal context. This enables our model to maintain long-range spatial coherence and precise camera control without sacrificing computational efficiency or motion realism. Extensive experiments demonstrate that our framework significantly outperforms existing methods in scene consistency, camera controllability, and generation quality. Project page: this https URL
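To illustrate how a static 3D scene memory can serve as a spatial prompt, here is a minimal sketch (not the paper's implementation) that projects static world-frame points into a target camera with a pinhole model and no z-buffering; all function and variable names are assumptions.

```python
import numpy as np

def warp_static_scene(points_w, colors, K, T_w2c, height, width):
    """Project static 3D points (world frame) into a target view, returning a
    sparsely warped RGB image and a validity mask usable as a 3D spatial prompt.
    Dynamic regions are intentionally absent and left to temporal conditioning."""
    pts_c = (T_w2c[:3, :3] @ points_w.T + T_w2c[:3, 3:4]).T     # world -> camera frame
    in_front = pts_c[:, 2] > 1e-6                               # keep points in front of the camera
    uvw = (K @ pts_c[in_front].T).T
    uv = np.round(uvw[:, :2] / uvw[:, 2:3]).astype(int)
    image = np.zeros((height, width, 3), dtype=np.float32)
    mask = np.zeros((height, width), dtype=bool)
    valid = (uv[:, 0] >= 0) & (uv[:, 0] < width) & (uv[:, 1] >= 0) & (uv[:, 1] < height)
    image[uv[valid, 1], uv[valid, 0]] = colors[in_front][valid]
    mask[uv[valid, 1], uv[valid, 0]] = True
    return image, mask
```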
In this work, we address the task of 3D reconstruction in dynamic scenes, where object motions frequently degrade the quality of previous 3D pointmap regression methods, such as DUSt3R, that were originally designed for static 3D scene reconstruction. Although these methods provide an elegant and powerful solution in static settings, they struggle in the presence of dynamic motions that disrupt alignment based solely on camera poses. To overcome this, we propose D²USt3R, which directly regresses Static-Dynamic Aligned Pointmaps (SDAP) that simultaneously capture both static and dynamic 3D scene geometry. By explicitly incorporating both spatial and temporal aspects, our approach successfully encapsulates dense 3D correspondence in the proposed pointmaps, enhancing downstream tasks. Extensive experimental evaluations demonstrate that our proposed approach consistently achieves superior 3D reconstruction performance across various datasets featuring complex motions.
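For context, pointmap regression methods in this family are typically trained with a confidence-weighted per-pixel 3D regression objective; the following is a DUSt3R-style sketch under that assumption, not the exact loss used for SDAP.

```python
import torch

def pointmap_regression_loss(pred_pts, pred_conf, gt_pts, alpha=0.2):
    """Confidence-weighted pointmap regression (a sketch): per-pixel Euclidean error
    between predicted and ground-truth 3D points, scaled by the predicted confidence,
    with a log-confidence term that discourages trivially low confidence."""
    err = torch.linalg.norm(pred_pts - gt_pts, dim=-1)      # (B, H, W) per-pixel 3D error
    return (pred_conf * err - alpha * torch.log(pred_conf)).mean()
```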
This study aims to construct an audio-video generative model with minimal computational cost by leveraging pre-trained single-modal generative models for audio and video. To achieve this, we propose a novel method that guides single-modal models to cooperatively generate well-aligned samples across modalities. Specifically, given two pre-trained base diffusion models, we train a lightweight joint guidance module to adjust scores separately estimated by the base models so that they match the score of the joint distribution over audio and video. We show that this guidance can be computed using the gradient of the optimal discriminator, which distinguishes real audio-video pairs from fake ones independently generated by the base models. Based on this analysis, we construct a joint guidance module by training this discriminator. Additionally, we adopt a loss function to stabilize the discriminator's gradient and make it work as a noise estimator, as in standard diffusion models. Empirical evaluations on several benchmark datasets demonstrate that our method improves both single-modal fidelity and multimodal alignment with relatively few parameters. The code is available at: this https URL
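A minimal sketch of the guidance idea follows, under the assumption that the discriminator outputs a logit so that log D/(1-D) equals that logit; the module and argument names are hypothetical, not the paper's code.

```python
import torch

def jointly_guided_scores(base_score_a, base_score_v, discriminator, a_t, v_t, t, scale=1.0):
    """Adjust independently estimated audio/video scores with the gradient of the
    discriminator's log-density ratio, steering samples toward the joint distribution."""
    with torch.no_grad():                                   # frozen pre-trained base diffusion models
        s_a, s_v = base_score_a(a_t, t), base_score_v(v_t, t)
    a_t = a_t.detach().requires_grad_(True)
    v_t = v_t.detach().requires_grad_(True)
    logit = discriminator(a_t, v_t, t)                      # D = sigmoid(logit), so log(D/(1-D)) == logit
    grad_a, grad_v = torch.autograd.grad(logit.sum(), (a_t, v_t))
    return s_a + scale * grad_a, s_v + scale * grad_v
```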
A differentiable model named DiffVox captures and analyzes vocal effect parameter distributions from hundreds of professionally produced music tracks. The model successfully extracts vocal presets, revealing non-Gaussian parameter distributions and strong correlations, which provides data-driven priors for developing more realistic AI-powered music production tools and reveals connections to perceptual attributes like spaciousness.
This paper presents CAT-V, a training-free framework that generates fine-grained, object-centric video captions based on user-selected objects via spatiotemporal multimodal prompting. The framework effectively combines pre-trained segmentation, temporal analysis, and multimodal language models to produce detailed, temporally-aware narratives, demonstrating its versatility across various object types and user interactions.
We introduce the Robust Audio Watermarking Benchmark (RAW-Bench), a benchmark for evaluating deep learning-based audio watermarking methods with standardized and systematic comparisons. To simulate real-world usage, we introduce a comprehensive audio attack pipeline with various distortions such as compression, background noise, and reverberation, along with a diverse test dataset including speech, environmental sounds, and music recordings. Evaluating four existing watermarking methods on RAW-Bench reveals two main insights: (i) neural compression techniques pose the most significant challenge, even when algorithms are trained with such compressions; and (ii) training with audio attacks generally improves robustness, although it is insufficient in some cases. Furthermore, we find that specific distortions, such as polarity inversion, time stretching, or reverb, seriously affect certain methods. The evaluation framework is accessible at this http URL.
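To make the attack-pipeline idea concrete, below is a small hedged sketch of waveform-level distortions (additive noise at a target SNR, random gain, polarity inversion) composed before detection; the functions are illustrative and not the benchmark's API.

```python
import numpy as np

def add_noise(x, snr_db=20.0, seed=0):
    """Add white noise at the requested signal-to-noise ratio (dB)."""
    noise = np.random.default_rng(seed).standard_normal(x.shape)
    scale = np.sqrt(np.mean(x**2) / (np.mean(noise**2) * 10 ** (snr_db / 10) + 1e-12))
    return x + scale * noise

def random_gain(x, low_db=-6.0, high_db=6.0, seed=1):
    """Scale the waveform by a random gain drawn in dB."""
    gain_db = np.random.default_rng(seed).uniform(low_db, high_db)
    return x * 10 ** (gain_db / 20)

def polarity_inversion(x):
    return -x

def attack_pipeline(x, attacks):
    """Apply a sequence of distortions to a watermarked signal before detection."""
    for attack in attacks:
        x = attack(x)
    return x

# Example: distort a placeholder watermarked signal (1 s sine tone at 16 kHz)
watermarked = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
attacked = attack_pipeline(watermarked, [add_noise, random_gain, polarity_inversion])
```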
We introduce MMAudioSep, a generative model for video/text-queried sound separation that is founded on a pretrained video-to-audio model. By leveraging knowledge about the relationship between video/text and audio learned through a pretrained audio generative model, we can train the model more efficiently, i.e., the model does not need to be trained from scratch. We evaluate the performance of MMAudioSep by comparing it to existing separation models, including models based on both deterministic and generative approaches, and find it is superior to the baseline models. Furthermore, we demonstrate that even after acquiring functionality for sound separation via fine-tuning, the model retains the ability for original video-to-audio generation. This highlights the potential of foundational sound generation models to be adopted for sound-related downstream tasks. Our code is available at this https URL.
HQ-VAE, developed by Sony AI, presents a novel framework for learning hierarchical discrete latent representations using a unified variational Bayes approach. This method effectively mitigates codebook and layer collapse issues in VQ-VAE variants, leading to improved reconstruction accuracy and simplified training for generative models on diverse data like images and audio.
TalkHier, a framework from Sony Group Corporation, enhances LLM multi-agent systems by introducing a structured communication protocol and a hierarchical refinement system. This approach achieved state-of-the-art performance across MMLU, WikiQA, and Camera datasets.
Fine-grained control over voice impressions (e.g., making a voice brighter or calmer) is a key frontier for creating more controllable text-to-speech. However, this nascent field faces two key challenges. The first is the problem of impression leakage, where the synthesized voice is undesirably influenced by the speaker's reference audio rather than the separately specified target impression. The second is the lack of a public, annotated corpus. To mitigate impression leakage, we propose two methods: 1) a training strategy that separately uses one utterance for speaker identity and another utterance of the same speaker for the target impression, and 2) a novel reference-free model that generates a speaker embedding solely from the target impression, combining improved robustness against leakage with the convenience of reference-free generation. Objective and subjective evaluations demonstrate a significant improvement in controllability. Our best method reduced the mean squared error of 11-dimensional voice impression vectors from 0.61 to 0.41 objectively and from 1.15 to 0.92 subjectively, while maintaining high fidelity. To foster reproducible research, we introduce LibriTTS-VI, the first public voice impression dataset released with clear annotation standards, built upon the LibriTTS-R corpus.
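The first mitigation can be illustrated with a small hedged sketch of training-pair construction: speaker identity is taken from one utterance while the target impression label comes from a different utterance of the same speaker. The data layout and field names are assumptions, not the paper's implementation.

```python
import random

def build_training_example(speaker_to_utts, impression_labels, rng=random.Random(0)):
    """Hypothetical pair construction to curb impression leakage: the reference used
    for speaker identity is never the utterance whose 11-dim impression vector is the
    conditioning target (each speaker is assumed to have at least two utterances)."""
    speaker = rng.choice(sorted(speaker_to_utts))
    identity_utt, target_utt = rng.sample(speaker_to_utts[speaker], 2)  # two distinct utterances
    return {
        "speaker_reference": identity_utt,                    # provides speaker identity only
        "target_impression": impression_labels[target_utt],   # 11-dim voice impression vector
        "synthesis_target": target_utt,                       # utterance the model learns to reproduce
    }
```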
Researchers from the University of Rochester and Sony Group Corporation developed PU-VALOR, a dataset of pseudo-untrimmed audio-visual videos with precise temporal annotations, and AVicuna, a multimodal large language model. This work enhances fine-grained temporal understanding in complex video-language tasks, achieving state-of-the-art performance in video QA and audio-visual event dense localization.
The research from Sony AI develops a multi-track contrastive learning approach to automatically identify music samples, achieving a 15% improvement in mean Average Precision on the Sample100 dataset. This method generates realistic training data by mixing instrument stems from different songs, allowing the system to robustly identify source material even when heavily transformed and embedded within new compositions.
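A hedged sketch of the data-generation idea: a stem from one song is transformed (here, crudely pitch-shifted by resampling) and mixed into another song's stems, yielding an (anchor, positive) pair for contrastive training; all names and the transformation are simplifications of what a real pipeline would use.

```python
import numpy as np

def make_contrastive_pair(sample_stem, backing_stems, semitones=2.0, gain=0.5):
    """Create an artificial 'sampling' example: pitch-shift a source stem via naive
    resampling and bury it in a mix of another song's stems. The source stem is the
    anchor; the synthetic mixture is the positive for a contrastive objective."""
    rate = 2.0 ** (semitones / 12.0)                        # resampling factor for the pitch shift
    idx = np.clip((np.arange(len(sample_stem)) * rate).astype(int), 0, len(sample_stem) - 1)
    transformed = sample_stem[idx]
    backing = np.sum(backing_stems, axis=0)                 # mix of the other song's instrument stems
    n = min(len(transformed), len(backing))
    positive = backing[:n] + gain * transformed[:n]
    return sample_stem, positive                            # (anchor, positive) pair
```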