PICO
VideoSSM: Autoregressive Long Video Generation with Hybrid State-Space Memory
04 Dec 2025

VideoSSM, developed by researchers at The University of Hong Kong and PICO (ByteDance), introduces a hybrid state-space memory architecture for autoregressive long video generation. The model maintains temporal consistency and dynamism over minute-scale durations, achieving superior quality while avoiding motion drift and content repetition, all at linear computational complexity.
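The linear complexity claimed above is characteristic of state-space models: each frame updates a fixed-size hidden state rather than attending over all past frames. A rough, hypothetical sketch of such a recurrence (not VideoSSM's actual architecture; all shapes and matrices are illustrative):

```python
import numpy as np

def ssm_scan(A, B, C, x):
    """Linear state-space recurrence h_t = A h_{t-1} + B x_t, y_t = C h_t.
    Cost grows linearly with sequence length, unlike attention's
    quadratic cost, because the memory is a fixed-size state h."""
    d_state = A.shape[0]
    h = np.zeros(d_state)
    ys = []
    for x_t in x:                # one step per frame token
        h = A @ h + B @ x_t      # update compressed memory state
        ys.append(C @ h)         # read out from the state
    return np.stack(ys)

rng = np.random.default_rng(0)
d_state, d_in, T = 8, 4, 16
A = 0.9 * np.eye(d_state)        # stable (decaying) state transition
B = rng.normal(size=(d_state, d_in))
C = rng.normal(size=(d_in, d_state))
x = rng.normal(size=(T, d_in))
y = ssm_scan(A, B, C, x)
print(y.shape)                   # (16, 4)
```

The decaying transition matrix is what lets the state summarize arbitrarily long history in constant memory.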

XRoboToolkit: A Cross-Platform Framework for Robot Teleoperation
The rapid advancement of Vision-Language-Action (VLA) models has created an urgent need for large-scale, high-quality robot demonstration datasets. Although teleoperation is the predominant method for data collection, current approaches suffer from limited scalability, complex setup procedures, and suboptimal data quality. This paper presents XRoboToolkit, a cross-platform framework for extended-reality-based robot teleoperation built on the OpenXR standard. The system features low-latency stereoscopic visual feedback, optimization-based inverse kinematics, and support for diverse tracking modalities including head, controller, hand, and auxiliary motion trackers. XRoboToolkit's modular architecture enables seamless integration across robotic platforms and simulation environments, spanning precision manipulators, mobile robots, and dexterous hands. We demonstrate the framework's effectiveness through precision manipulation tasks and validate data quality by training VLA models that exhibit robust autonomous performance.
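The optimization-based inverse kinematics mentioned above can be illustrated with a standard damped-least-squares update; the 2-link planar arm below is a toy stand-in for illustration, not XRoboToolkit's actual solver:

```python
import numpy as np

def dls_ik_step(q, target, fk, jac, damping=0.05):
    """One damped-least-squares IK update: reduces ||fk(q) - target||
    with Tikhonov damping for stability near singular configurations."""
    e = target - fk(q)                       # task-space error
    J = jac(q)
    dq = J.T @ np.linalg.solve(J @ J.T + damping**2 * np.eye(J.shape[0]), e)
    return q + dq

# Hypothetical 2-link planar arm, unit link lengths.
def fk(q):
    return np.array([np.cos(q[0]) + np.cos(q[0] + q[1]),
                     np.sin(q[0]) + np.sin(q[0] + q[1])])

def jac(q):
    s1, c1 = np.sin(q[0]), np.cos(q[0])
    s12, c12 = np.sin(q[0] + q[1]), np.cos(q[0] + q[1])
    return np.array([[-s1 - s12, -s12],
                     [ c1 + c12,  c12]])

q = np.array([0.3, 0.5])
target = np.array([1.2, 0.8])      # reachable point in the workspace
for _ in range(100):
    q = dls_ik_step(q, target, fk, jac)
# end effector converges to the target within numerical tolerance
```

In a teleoperation loop, `target` would come from the tracked controller or hand pose each frame, and the damping term keeps the arm well-behaved as it passes near singularities.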
EX-4D: EXtreme Viewpoint 4D Video Synthesis via Depth Watertight Mesh
05 Jun 2025

EX-4D, developed by PICO (ByteDance), introduces a method for synthesizing high-quality, camera-controllable 4D videos from monocular input, particularly excelling under extreme viewpoints. The framework leverages a novel Depth Watertight Mesh and a simulated masking strategy, consistently outperforming state-of-the-art baselines in FID and FVD metrics and achieving a 70.70% user preference for physical consistency.

4K4DGen: Panoramic 4D Generation at 4K Resolution
03 Oct 2024

A framework named 4K4DGen enables the creation of dynamic, immersive 4D panoramic scenes at 4K resolution (4096x2048) from a single static input image. It achieves superior visual and video quality compared to existing methods, with user studies showing 81% preference and an FID of 16.59 versus competing methods' scores in the high 50s.

EMHI: A Multimodal Egocentric Human Motion Dataset with HMD and Body-Worn IMUs
06 Nov 2025
Egocentric human pose estimation (HPE) using wearable sensors is essential for VR/AR applications. Most methods rely solely on either egocentric-view images or sparse Inertial Measurement Unit (IMU) signals, leading to inaccuracies due to self-occlusion in images or the sparseness and drift of inertial sensors. Most importantly, the lack of real-world datasets containing both modalities is a major obstacle to progress in this field. To overcome this barrier, we propose EMHI, a multimodal Egocentric human Motion dataset with Head-Mounted Display (HMD) and body-worn IMUs, with all data collected using a real VR product suite. Specifically, EMHI provides synchronized stereo images from downward-sloping cameras on the headset and IMU data from body-worn sensors, along with pose annotations in SMPL format. This dataset consists of 885 sequences captured by 58 subjects performing 39 actions, totaling about 28.5 hours of recording. We evaluate the annotations by comparing them with optical marker-based SMPL fitting results. To substantiate the reliability of our dataset, we introduce MEPoser, a new baseline method for multimodal egocentric HPE, which employs a multimodal fusion encoder, temporal feature encoder, and MLP-based regression heads. The experiments on EMHI show that MEPoser outperforms existing single-modal methods and demonstrates the value of our dataset in solving egocentric HPE. We believe the release of EMHI and the method could advance research on egocentric HPE and expedite the practical implementation of this technology in VR/AR products.
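The MEPoser-style pipeline (multimodal fusion, temporal encoding, regression head) can be sketched in miniature; all dimensions, the smoothing rule, and the weight shapes below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def fuse_and_regress(img_feats, imu_feats, W_fuse, W_head, alpha=0.8):
    """Toy two-stream pipeline: per-frame fusion of image and IMU
    embeddings, a simple recurrent temporal smoother standing in for a
    temporal feature encoder, and a linear head regressing SMPL-style
    pose parameters."""
    h = np.zeros(W_fuse.shape[0])
    poses = []
    for z_img, z_imu in zip(img_feats, imu_feats):
        z = np.concatenate([z_img, z_imu])     # multimodal fusion input
        f = np.tanh(W_fuse @ z)                # fusion encoder
        h = alpha * h + (1 - alpha) * f        # temporal encoding (EMA)
        poses.append(W_head @ h)               # regression head
    return np.stack(poses)

rng = np.random.default_rng(0)
T, d_img, d_imu, d_hid = 10, 32, 12, 64
d_pose = 72                                    # e.g. 24 joints x 3 (axis-angle)
img = rng.normal(size=(T, d_img))
imu = rng.normal(size=(T, d_imu))
W_fuse = 0.1 * rng.normal(size=(d_hid, d_img + d_imu))
W_head = 0.1 * rng.normal(size=(d_pose, d_hid))
pose_seq = fuse_and_regress(img, imu, W_fuse, W_head)
print(pose_seq.shape)   # (10, 72)
```

The point of the sketch is the data flow: the two modalities are fused per frame before any temporal reasoning, which is what lets each stream compensate for the other's failure modes (occlusion vs. drift).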
HMD-Poser: On-Device Real-time Human Motion Tracking from Scalable Sparse Observations
06 Mar 2024
It is especially challenging to achieve real-time human motion tracking on a standalone VR Head-Mounted Display (HMD) such as the Meta Quest and PICO. In this paper, we propose HMD-Poser, the first unified approach to recover full-body motions using scalable sparse observations from an HMD and body-worn IMUs. In particular, it can support a variety of input scenarios, such as HMD, HMD+2IMUs, HMD+3IMUs, etc. This input scalability accommodates users' preferences, balancing tracking accuracy against ease of wear. A lightweight temporal-spatial feature learning network is proposed in HMD-Poser to guarantee that the model runs in real-time on HMDs. Furthermore, HMD-Poser presents online body shape estimation to improve the position accuracy of body joints. Extensive experimental results on the challenging AMASS dataset show that HMD-Poser achieves new state-of-the-art results in both accuracy and real-time performance. We also build a new free-dancing motion dataset to evaluate HMD-Poser's on-device performance and investigate the performance gap between synthetic data and real-captured sensor data. Finally, we demonstrate our HMD-Poser with a real-time Avatar-driving application on a commercial HMD. Our code and free-dancing motion dataset are available at this https URL
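One common way to realize "scalable sparse observations" (a single model serving HMD-only, HMD+2IMUs, HMD+3IMUs, and so on) is to pack sensors into fixed input slots, zero-filling absent ones and appending a validity mask. The slot layout and feature size below are hypothetical, not the paper's exact encoding:

```python
import numpy as np

# Hypothetical fixed sensor slots; absent sensors are zero-filled and
# flagged off in a validity mask, so one network handles all setups.
SLOTS = ["hmd", "left_controller", "right_controller",
         "pelvis_imu", "left_leg_imu", "right_leg_imu"]
D = 6  # e.g. 3D rotation (axis-angle) + 3D acceleration per sensor

def pack_observations(obs: dict) -> np.ndarray:
    feats, mask = [], []
    for name in SLOTS:
        if name in obs:
            feats.append(np.asarray(obs[name], dtype=float))
            mask.append(1.0)
        else:
            feats.append(np.zeros(D))   # absent sensor -> zeros
            mask.append(0.0)            # ... and mask bit off
    return np.concatenate(feats + [np.array(mask)])

x = pack_observations({"hmd": np.ones(D), "pelvis_imu": np.full(D, 0.5)})
print(x.shape)   # (42,) = 6 slots x 6 dims + 6 mask bits
```

Because the input dimensionality never changes, adding or removing an IMU at runtime requires no architectural change, only a different mask.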
Relative Pose Estimation through Affine Corrections of Monocular Depth Priors
Monocular depth estimation (MDE) models have undergone significant advancements over recent years. Many MDE models aim to predict affine-invariant relative depth from monocular images, while recent developments in large-scale training and vision foundation models enable reasonable estimation of metric (absolute) depth. However, effectively leveraging these predictions for geometric vision tasks, in particular relative pose estimation, remains relatively underexplored. While depths provide rich constraints for cross-view image alignment, the intrinsic noise and ambiguity of monocular depth priors present practical challenges to improving upon classic keypoint-based solutions. In this paper, we develop three solvers for relative pose estimation that explicitly account for independent affine (scale and shift) ambiguities, covering both calibrated and uncalibrated conditions. We further propose a hybrid estimation pipeline that combines our proposed solvers with classic point-based solvers and epipolar constraints. We find that the affine correction modeling is beneficial to not only the relative depth priors but also, surprisingly, the "metric" ones. Results across multiple datasets demonstrate large improvements of our approach over classic keypoint-based baselines and PnP-based solutions, under both calibrated and uncalibrated setups. We also show that our method improves consistently with different feature matchers and MDE models, and can further benefit from very recent advances in both modules. Code is available at this https URL
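The affine (scale and shift) ambiguity the solvers account for can be made concrete with a simple least-squares alignment of a relative-depth prior to reference depths. This is a simplified stand-in for intuition, not one of the paper's pose solvers:

```python
import numpy as np

def fit_affine_depth(d_pred, d_ref):
    """Least-squares scale s and shift t such that s * d_pred + t ~ d_ref,
    resolving the affine ambiguity of relative monocular depth."""
    A = np.stack([d_pred, np.ones_like(d_pred)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, d_ref, rcond=None)
    return s, t

rng = np.random.default_rng(1)
d_true = rng.uniform(1.0, 5.0, size=200)   # reference ("metric") depths
d_pred = 0.5 * d_true - 0.2                # affine-distorted relative prior
s, t = fit_affine_depth(d_pred, d_true)
print(round(s, 3), round(t, 3))            # recovers s = 2.0, t = 0.4
```

In the paper's setting the unknown per-image scale and shift are estimated jointly with the relative pose inside the solvers, rather than against ground-truth depth as in this toy recovery.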
OpenTrack3D: Towards Accurate and Generalizable Open-Vocabulary 3D Instance Segmentation
03 Dec 2025
Generalizing open-vocabulary 3D instance segmentation (OV-3DIS) to diverse, unstructured, and mesh-free environments is crucial for robotics and AR/VR, yet remains a significant challenge. We attribute this to two key limitations of existing methods: (1) proposal generation relies on dataset-specific proposal networks or mesh-based superpoints, rendering them inapplicable in mesh-free scenarios and limiting generalization to novel scenes; and (2) the weak textual reasoning of CLIP-based classifiers, which struggle to recognize compositional and functional user queries. To address these issues, we introduce OpenTrack3D, a generalizable and accurate framework. Unlike methods that rely on pre-generated proposals, OpenTrack3D employs a novel visual-spatial tracker to construct cross-view consistent object proposals online. Given an RGB-D stream, our pipeline first leverages a 2D open-vocabulary segmenter to generate masks, which are lifted to 3D point clouds using depth. Mask-guided instance features are then extracted using DINO feature maps, and our tracker fuses visual and spatial cues to maintain instance consistency. The core pipeline is entirely mesh-free, yet we also provide an optional superpoint refinement module to further enhance performance when a scene mesh is available. Finally, we replace CLIP with a multi-modal large language model (MLLM), significantly enhancing compositional reasoning for complex user queries. Extensive experiments on diverse benchmarks, including ScanNet200, Replica, ScanNet++, and SceneFun3D, demonstrate state-of-the-art performance and strong generalization capabilities.
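The depth-lifting step (2D masks "lifted to 3D point clouds using depth") follows standard pinhole back-projection. The intrinsics and mask below are made up for illustration:

```python
import numpy as np

def lift_mask_to_points(depth, mask, K):
    """Back-project masked pixels to camera-frame 3D points using the
    pinhole model: X = depth * K^{-1} [u, v, 1]^T."""
    v, u = np.nonzero(mask)               # pixel rows/cols inside the mask
    z = depth[v, u]
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    x = (u - cx) / fx * z
    y = (v - cy) / fy * z
    return np.stack([x, y, z], axis=1)    # (N, 3) instance point cloud

K = np.array([[500.0,   0.0, 320.0],      # hypothetical intrinsics
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])
depth = np.full((480, 640), 2.0)          # flat plane 2 m from the camera
mask = np.zeros((480, 640), dtype=bool)
mask[238:242, 318:322] = True             # a small 4x4 instance mask
pts = lift_mask_to_points(depth, mask, K)
print(pts.shape)                          # (16, 3)
```

A camera-to-world transform (from the RGB-D stream's poses) would then map these camera-frame points into a common frame, which is what lets the tracker associate the same instance across views.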
Scalable multilayer diffractive neural network with all-optical nonlinear activation
All-optical diffractive neural networks (DNNs) offer a promising alternative to electronics-based neural network processing due to their low latency, high throughput, and inherent spatial parallelism. However, the lack of reconfigurability and nonlinearity limits existing all-optical DNNs to handling only simple tasks. In this study, we present a folded optical system that enables a multilayer reconfigurable DNN using a single spatial light modulator. This platform not only enables dynamic weight reconfiguration for diverse classification challenges but crucially integrates a mirror-coated silicon substrate exhibiting instantaneous χ(3) nonlinearity. The incorporation of all-optical nonlinear activation yields substantial accuracy improvements across benchmark tasks, with performance gains becoming increasingly significant as both network depth and task complexity escalate. Our system represents a critical advancement toward realizing scalable all-optical neural networks with complex architectures, potentially achieving computational capabilities that rival their electronic counterparts while maintaining photonic advantages.
Multi-modal Relation Distillation for Unified 3D Representation Learning
18 Sep 2024
Recent advancements in multi-modal pre-training for 3D point clouds have demonstrated promising results by aligning heterogeneous features across 3D shapes and their corresponding 2D images and language descriptions. However, current straightforward solutions often overlook intricate structural relations among samples, potentially limiting the full capabilities of multi-modal learning. To address this issue, we introduce Multi-modal Relation Distillation (MRD), a tri-modal pre-training framework designed to effectively distill large Vision-Language Models (VLMs) into 3D backbones. MRD aims to capture both intra-relations within each modality and cross-relations between different modalities, producing more discriminative 3D shape representations. Notably, MRD achieves significant improvements in downstream zero-shot classification tasks and cross-modality retrieval tasks, delivering new state-of-the-art performance.
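One common way to realize relation distillation is to match intra-batch similarity structure between teacher (VLM) and student (3D) features via a row-wise KL divergence. The sketch below illustrates that general technique under assumed shapes and temperature; it is not MRD's exact loss:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def relation_kl(student_feats, teacher_feats, tau=0.1):
    """Distill intra-batch relations: KL divergence between row-wise
    softmax similarity distributions of the student (3D backbone) and
    teacher (VLM) features. Zero when the relational structure matches."""
    def rel(f):
        f = f / np.linalg.norm(f, axis=1, keepdims=True)   # cosine sims
        return softmax((f @ f.T) / tau, axis=1)
    p, q = rel(teacher_feats), rel(student_feats)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=1)))

rng = np.random.default_rng(0)
teacher = rng.normal(size=(8, 16))   # batch of 8 teacher embeddings
print(relation_kl(teacher.copy(), teacher))   # 0.0 when relations match
```

Note the loss compares pairwise similarity patterns rather than the features themselves, so the student is free to live in a different embedding space as long as it preserves which samples are alike.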