ViPRA transforms video prediction models into control policies for physical robots, leveraging large-scale unlabeled human and robot videos for pretraining. This framework achieves an average success rate of 69.8% in simulation with discrete actions and 54.1% in real-world single-arm tasks with continuous control, significantly outperforming existing baselines.
Robots Imitating Generated Videos (RIGVid) enables robots to perform complex manipulation tasks by learning purely from AI-generated videos, bypassing the need for any physical demonstrations. The system employs a Vision-Language Model to filter synthetic videos for quality and then extracts 6D object pose trajectories for robot execution, achieving performance comparable to learning from real human data.