One-Shot Transfer of Long-Horizon Extrinsic Manipulation Through Contact Retargeting

BibTeX
@misc{liu2024oneshottransferlonghorizon,
      title={One-Shot Transfer of Long-Horizon Extrinsic Manipulation Through Contact Retargeting}, 
      author={C. Karen Liu and Sirui Chen and Ruocheng Wang and Clemens Eppner and Albert Wu},
      year={2024},
      eprint={2404.07468},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2404.07468}, 
}
AI Audio Lecture + Q&A
Transcript
John: Welcome to Advanced Robotics and Manipulation. Today's lecture is on 'One-Shot Transfer of Long-Horizon Extrinsic Manipulation Through Contact Retargeting.' We've seen a trend with papers like 'Sequential Dexterity' focusing on chaining policies, but this work from researchers at Stanford and NVIDIA tackles generalization from a different angle, focusing on the physics of contact. It pushes back against the idea that massive datasets are always the answer for complex manipulation.

John: Yes, Noah?

Noah: Hi Professor. So when you say 'extrinsic manipulation', you mean using the environment itself as a tool, right? Like pushing an object against a wall to reorient it?

John: Exactly. Instead of relying solely on the gripper, the robot strategically uses surfaces like walls and tables to manipulate objects. This is powerful, but it's incredibly difficult to generalize. A policy trained in one scene with specific contact points will likely fail if you move the wall or change the object's shape. This paper's main contribution is a framework to solve that specific problem.

Noah: And it does this from a single demonstration?

John: Correct. The core idea is to decompose a long-horizon task into a sequence of simpler, short-horizon primitives, such as Push, Pull, or Pivot. The single human demonstration isn't used to mimic the exact trajectory. Instead, it's used to extract the high-level sequence of these primitives and the approximate object states at which transitions occur.

Noah: So, are these primitives learned from scratch for each new task, or are they part of a pre-existing library?

John: They are part of a pre-existing library. The authors prepared four robust, goal-conditioned primitives. The framework is actually agnostic to how you obtain them; some were built with standard controllers, and one was trained with reinforcement learning. The crucial insight isn't about the primitives themselves, but about ensuring the correct initial conditions for each one. The key to making the whole sequence work is a process they call 'contact retargeting'.

Noah: Can you elaborate on what 'contact retargeting' actually does?

John: Certainly. It's a two-stage process that happens before executing each primitive in the sequence. First, it solves for the object's pose. It asks: 'Given that the next action is a pivot against the wall, what is the precise object pose that satisfies the contact requirements for that pivot?' It formulates this as an inverse kinematics (IK) problem, using the demonstration only as a rough guide. Once it finds that target object pose, it then solves a second IK problem for the robot's configuration, asking: 'Now, where does my gripper need to be to actually perform that pivot?' This ensures each primitive starts from a valid physical state, which is why it generalizes so well to new objects and environments.
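To make the two-stage idea concrete, here is a minimal, self-contained sketch in a toy planar setting. The box geometry, wall location, contact residuals, and two-link arm are all illustrative assumptions, not the paper's implementation: stage one solves for an object pose satisfying a pivot's contact requirements, and stage two solves for an arm configuration that reaches the resulting grasp point.

```python
# Toy planar sketch of two-stage contact retargeting (illustrative only).
import numpy as np
from scipy.optimize import least_squares

WALL_X = 0.5                        # wall plane at x = WALL_X
BOX_HALF = np.array([0.10, 0.05])   # half-extents of a planar box

def box_corners(pose):
    """World-frame corners of the box; pose = (x, y, theta)."""
    x, y, th = pose
    R = np.array([[np.cos(th), -np.sin(th)],
                  [np.sin(th),  np.cos(th)]])
    local = np.array([[1, 1], [1, -1], [-1, -1], [-1, 1]]) * BOX_HALF
    return (R @ local.T).T + np.array([x, y])

def contact_residuals(pose):
    """Stage 1 residuals: the contact configuration a wall pivot needs.
    Corner 0 touches the wall; corner 1 rests on the table a small
    stand-off distance away from the wall."""
    c = box_corners(pose)
    return np.array([c[0, 0] - WALL_X,            # corner 0 on the wall
                     c[1, 1],                     # corner 1 on the table
                     c[1, 0] - (WALL_X - 0.06)])  # 6 cm stand-off

# Stage 1: object-pose IK, seeded with the rough pose from the demo.
demo_pose = np.array([0.35, 0.08, -0.3])
obj_pose = least_squares(contact_residuals, demo_pose).x

# Stage 2: robot IK. Here the "robot" is a toy two-link planar arm whose
# end-effector must reach a grasp point fixed in the object frame.
L1 = L2 = 0.4
def fk(q):
    return np.array([L1 * np.cos(q[0]) + L2 * np.cos(q[0] + q[1]),
                     L1 * np.sin(q[0]) + L2 * np.sin(q[0] + q[1])])

grasp_world = box_corners(obj_pose)[3]   # grasp near the free top corner
q = least_squares(lambda q: fk(q) - grasp_world, np.array([0.5, 0.5])).x

print("retargeted object pose (x, y, theta):", obj_pose)
print("arm configuration (q1, q2):", q)
```

In the real system the residuals would encode each primitive's actual contact specification and the second solve would run on the full robot model, but the structure, object pose first and robot configuration second, is the same.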
Noah: Wait, so this relies heavily on an IK solver and very accurate 6D pose estimation of the object. Doesn't that introduce its own set of challenges and potential points of failure, especially with sim-to-real transfer?

John: That's a very sharp observation, and you're right. The authors are transparent about this. In their hardware experiments, the main causes of failure were indeed the perception system providing a noisy pose estimate, or the IK solver failing to find a solution for a complex contact configuration. However, what this demonstrates is the robustness of the core framework. The logic of retargeting holds up; the implementation is simply limited by the state of its underlying components, like perception.

Noah: So how did it perform overall?

John: The hardware results were quite strong. They achieved an overall success rate of over 80% across four different long-horizon tasks, using 10 different test objects that varied in shape and size, all from a single demonstration with one specific object. To prove the importance of their method, they ran an ablation study where they removed the contact enforcement step and simply used the rough pose from the demonstration. The success rate plummeted to 37%. This really isolates contact retargeting as the critical component for success.

Noah: This hierarchical approach seems more structured than, say, an end-to-end diffusion planner like we saw in 'DexHandDiff'. Is the trade-off here flexibility versus robustness and data efficiency?

John: That's an excellent way to frame it. End-to-end methods offer the promise of learning complex behaviors without human-specified structure, but they often require vast amounts of data and can be brittle. This paper makes a compelling case for a modular approach. By injecting domain knowledge, namely the importance of contact states, the authors achieve remarkable data efficiency and generalization. It shifts the problem from learning low-level motor control to learning the high-level task structure, which is defined by these contact configurations.

Noah: So a key assumption seems to be that the robot can reposition itself freely between primitives, when the object is in a stable, 'freestanding' state. How limiting is that for more dynamic tasks where contact is maintained throughout?

John: You've pinpointed a key simplification and a direction for future work. The current framework assumes contact switches happen at these stable states. This prevents it from handling more fluid motions, like sliding an object along a wall while simultaneously reorienting the gripper. Generalizing the retargeting formulation to handle continuous contact would be the next major step, allowing for even more dynamic and coupled interactions.

John: So to wrap up, this paper presents a framework for one-shot transfer of extrinsic manipulation by decomposing tasks into primitives and using 'contact retargeting' to robustly chain them in new scenes. The key lesson here is that for complex physical interaction, explicitly reasoning about contact states can be more powerful and generalizable than trying to learn everything end-to-end. It's about finding the right level of abstraction for the problem.

John: Thanks for listening. If you have any further questions, ask our AI assistant or drop a comment.
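To tie the pieces of the lecture together, here is a compact, hypothetical sketch of the overall control flow described above: extract a primitive sequence from the single demonstration, then retarget contacts before each primitive executes. The type names and callbacks are illustrative assumptions, not the authors' code.

```python
# Hypothetical skeleton of the primitive-chaining logic from the lecture.
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

Pose = Tuple[float, ...]    # object pose (a full 6D pose in practice)
Config = Tuple[float, ...]  # robot joint configuration

@dataclass
class PrimitiveStep:
    name: str                # e.g., "push", "pull", or "pivot"
    demo_pose: Pose          # rough object pose at this transition in the demo
    execute: Callable[[Config, Pose], bool]  # goal-conditioned primitive

def run_task(steps: Sequence[PrimitiveStep],
             retarget_object: Callable[[str, Pose], Pose],
             retarget_robot: Callable[[Pose], Config]) -> bool:
    """Chain the demonstrated primitives, enforcing valid contacts between them."""
    for step in steps:
        # Stage 1: object-pose IK, using the demo pose only as an initial guess.
        obj_pose = retarget_object(step.name, step.demo_pose)
        # Stage 2: robot IK, placing the gripper so the primitive can start.
        robot_conf = retarget_robot(obj_pose)
        # Run the goal-conditioned primitive; abort the sequence on failure.
        if not step.execute(robot_conf, obj_pose):
            return False
    return True
```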