Transcript
John: Welcome back to our seminar on Advanced Multimodal Reasoning. Today's lecture is on 'Thinking with Programming Vision: Towards a Unified View for Thinking with Images' from researchers at Zhejiang University and ByteDance. We've seen a lot of recent work exploring how MLLMs can use tools, like in 'Thyme: Think Beyond Images' or 'OpenThinkIMG'. This paper pushes that idea further by treating visual interaction not just as tool use, but as a form of programming. It addresses a fundamental weakness in current models. Yes, Noah?
Noah: Hi Professor. So, is the core motivation that even top models like GPT-4o fail when an image is just rotated or flipped, and this paper aims to fix that with tools?
John: Precisely. The authors start by demonstrating this surprising brittleness. They show that state-of-the-art models suffer significant performance degradation, sometimes up to 80 percent, from simple orientation changes that a human would instantly correct. This establishes a clear, undeniable need for tools, in contrast to tasks where tools offer only marginal gains.
Noah: So what's their solution? Just giving the model a 'rotate' and 'flip' tool?
John: That's the interesting part. Instead of a fixed set of tools, they propose a framework called CodeVision. The core idea is to have the model generate code, specifically Python, to manipulate the image. This 'code-as-tool' approach is universal. The model isn't limited to a pre-registered list of functions; it can theoretically invoke any operation available in a library like OpenCV or Pillow by writing the corresponding code. This makes the system far more flexible and scalable than previous methods that relied on brittle, hand-specified tool names and arguments.
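John: To make that concrete, here's a rough sketch of the kind of single-turn code such a model might emit. The Pillow calls are standard, but the function and file names are my own illustration, not taken from the paper.
```python
# Illustrative sketch of the kind of code an MLLM might emit in one turn.
# Library calls are standard Pillow; function and variable names are illustrative, not the paper's.
from PIL import Image

def restore_and_zoom(image_path: str) -> Image.Image:
    """Undo a suspected 90-degree clockwise rotation, then crop a region of interest."""
    img = Image.open(image_path)

    # The model decides the image looks rotated and writes the corrective call itself.
    img = img.rotate(90, expand=True)  # counter-clockwise rotation undoes a clockwise one

    # Zoom into the region it wants to inspect more closely (left, upper, right, lower).
    w, h = img.size
    region = img.crop((0, 0, w // 2, h // 2))
    return region

# An executor would run this and feed the resulting image back into the model's context.
fixed_view = restore_and_zoom("input.png")
fixed_view.save("step_1_output.png")
```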
Noah: That makes sense. It avoids the need to retrain the model every time you want to add a new tool. But how do you train a model to do this? Generating correct and useful code seems like a much harder problem than just picking a tool from a list.
John: It is a harder problem, and their methodology is key. They use a two-stage training process. First, they perform Supervised Fine-Tuning, or SFT. They built a high-quality dataset of about five thousand examples covering different scenarios: single-tool use, complex multi-tool chains, and even error handling. For instance, they would take an image, rotate it, and then the SFT data would show the model the correct code to fix the orientation before answering a question about it. This 'cold start' teaches the model the basic syntax and patterns of programmatic tool use.
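John: As a quick aside, you can picture how one such training example might be synthesized. This is a simplified sketch of the idea, with names and structure of my own choosing rather than the authors' actual pipeline.
```python
# Simplified sketch of synthesizing one orientation-repair SFT example.
# The real CodeVision data pipeline is not shown here; names and record format are illustrative.
import random
from PIL import Image

def make_sft_example(clean_path: str, question: str, answer: str) -> dict:
    angle = random.choice([90, 180, 270])  # known corruption applied to the clean image
    corrupted = Image.open(clean_path).rotate(angle, expand=True)
    corrupted_path = clean_path.replace(".png", f"_rot{angle}.png")
    corrupted.save(corrupted_path)

    # Target trajectory: the code that undoes the corruption, followed by the final answer.
    target_code = (
        "from PIL import Image\n"
        f"img = Image.open('{corrupted_path}')\n"
        f"img = img.rotate({-angle}, expand=True)  # undo the {angle}-degree rotation\n"
        "img.save('restored.png')"
    )
    return {"image": corrupted_path, "question": question,
            "target_code": target_code, "target_answer": answer}
```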
Noah: So SFT teaches it the rules. What's the second stage?
John: The second stage is Reinforcement Learning. After the SFT gives the model a foundation, they use RL to teach it strategy. This moves the model beyond simple imitation and encourages it to explore, make efficient decisions, and generalize. To do this, they created a larger dataset and, critically, designed a very dense, multi-component reward function.
Noah: A quick question on the RL part. Why the complex reward function? Wouldn't a simple reward for getting the final answer correct be sufficient?
John: The ablation studies show it's not sufficient. A simple outcome-based reward can lead to 'reward hacking,' where the model gets the right answer for the wrong reasons. Their dense reward has three parts. First, an outcome reward for accuracy. Second, a 'strategy shaping' reward that gives positive signals for using necessary tools correctly, like rewarding a crop operation that has a high IoU with the ground truth region. And third, constraint penalties that punish inefficient or irrelevant actions, like using too many steps or cropping a completely irrelevant part of the image. This process-level guidance is crucial for learning robust strategies.
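John: If you want a mental model of how those three signals combine, something like the following sketch captures the spirit. The weights, thresholds, and exact penalty terms here are placeholders, not the paper's actual values.
```python
# Hedged sketch of a three-part reward in the spirit described above.
# Weights, thresholds, and penalty terms are illustrative placeholders.

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def reward(answer_correct: bool, crop_box, gt_box, num_steps: int, max_steps: int = 4) -> float:
    r = 1.0 if answer_correct else 0.0                # 1) outcome reward for final accuracy
    if crop_box is not None and gt_box is not None:   # 2) strategy shaping: reward a well-placed crop
        r += 0.5 * iou(crop_box, gt_box)
    if num_steps > max_steps:                         # 3) constraint penalty for inefficient trajectories
        r -= 0.1 * (num_steps - max_steps)
    return r
```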
Noah: That sounds computationally intensive, but I can see how it would lead to better policies. Does this approach allow for capabilities that weren't explicitly trained?
John: It does. One of the most significant findings is the emergence of novel behaviors. The model learns to use tools that weren't in its 'must-use' list during training, like adjusting contrast or converting an image to grayscale to make text more legible. It also learns to chain multiple operations efficiently in a single turn and, importantly, to recover from errors by reading runtime feedback and generating corrected code. This is a step towards more genuine reasoning and agency.
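John: That error-recovery behavior implies an execute-and-retry loop around the model. Here's a minimal sketch of what such a loop could look like; the `model.generate_code` interface and the `result` convention are hypothetical, and the paper's actual agent loop may differ.
```python
# Minimal sketch of an execute-and-recover loop for model-written tool code.
# `model.generate_code` is a hypothetical interface, not an API from the paper.
import traceback

def run_with_feedback(model, prompt: str, max_turns: int = 3):
    feedback = ""
    for _ in range(max_turns):
        code = model.generate_code(prompt + feedback)  # hypothetical call
        try:
            namespace = {}
            exec(code, namespace)                      # execute the model-written tool code
            return namespace.get("result")             # convention: the code stores its output in `result`
        except Exception:
            # Feed the runtime traceback back so the model can emit corrected code next turn.
            feedback = "\nPrevious code failed with:\n" + traceback.format_exc()
    return None
```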
Noah: This seems to align with the trend we see in works like 'PyVision' and others, where generating code is becoming the preferred interface for external tools. How does this shift our view of MLLMs?
John: I think it solidifies that view. It repositions the MLLM not just as a passive observer that describes images, but as an active agent that can programmatically interact with its visual input to solve problems. By addressing the fundamental brittleness to simple transformations, it makes these models more reliable for real-world use. Furthermore, the framework is a blueprint for integrating a potentially unlimited ecosystem of visual tools, from simple filters to complex computational photography algorithms. It's a conceptual shift towards building more adaptable and genuinely intelligent visual agents.
John: So, to wrap up, this paper makes two key contributions. First, it identifies and provides a robust solution to the overlooked fragility of MLLMs to common image corruptions. Second, it pioneers a flexible and scalable 'code-as-tool' paradigm that fosters advanced reasoning and emergent problem-solving capabilities. The main takeaway is that by teaching models to programmatically interact with images, we can make them significantly more robust, capable, and agent-like. Thanks for listening. If you have any further questions, ask our AI assistant or drop a comment.