Researchers from Carnegie Mellon University and Apple developed a method to enhance Vision-Language Models' chain-of-thought reasoning by distilling 193k detailed rationales from GPT-4o, then applying supervised fine-tuning and Direct Preference Optimization. This approach yielded an average gain of 11.7 points on CoT prediction benchmarks and improved generalization to direct-answer tasks, while also enabling the model to act as a reasoning verifier.
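As a rough illustration of the preference-optimization stage, here is a minimal PyTorch sketch of the standard DPO objective over (chosen, rejected) rationale pairs; the function name and batching conventions are assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss over (chosen, rejected) pairs.

    Each argument is a tensor of summed token log-probabilities for a batch
    of responses under the policy or the frozen reference model.
    """
    # Implicit reward: log-ratio of policy to reference for each response
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected via a logistic loss
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```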
This research introduces a method for directly optimizing video large multimodal models (LMMs) for factual consistency, using a reward from a language model that judges responses against detailed video captions. The approach significantly improves factual alignment, achieving an 8.1% average accuracy gain on video question-answering benchmarks while drastically reducing the cost of collecting alignment data.
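The core data-construction idea is to score candidate answers with a language model that sees the detailed caption, then keep the best and worst as a preference pair for DPO. A minimal sketch, assuming a generic scalar-scoring judge callable (all names here are hypothetical, not the paper's API):

```python
def build_preference_pair(question, candidates, detailed_caption, judge):
    """Rank candidate answers by a language-model reward computed against a
    detailed video caption; return a (chosen, rejected) pair for DPO.

    `judge` is any callable returning a factual-consistency score; its exact
    prompt and output format are an assumption, not the paper's setup.
    """
    scored = [(judge(question=question, answer=a, reference=detailed_caption), a)
              for a in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # best score first
    chosen, rejected = scored[0][1], scored[-1][1]
    return {"prompt": question, "chosen": chosen, "rejected": rejected}
```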
Researchers from MIT, CMU, UMass Amherst, and the MIT-IBM Watson AI Lab introduced the Scientific Generative Agent (SGA), a bilevel optimization framework that couples large language models with differentiable physical simulations for scientific discovery. The framework quantitatively outperformed baseline methods on tasks such as constitutive-law discovery and molecular design, yielding novel, expert-validated solutions.
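The bilevel structure pairs an outer loop, in which the LLM proposes discrete structure such as a symbolic constitutive law, with an inner loop that fits the law's continuous parameters by gradient descent through a differentiable simulator. A minimal sketch under those assumptions, with `llm_propose` and `build_simulator` as hypothetical callables:

```python
import torch
import torch.nn.functional as F

def inner_optimize(simulate, init_params, observations, steps=200, lr=1e-2):
    """Inner loop: fit the continuous parameters of a proposed law by
    differentiating through the physical simulation."""
    params = init_params.clone().requires_grad_(True)
    opt = torch.optim.Adam([params], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.mse_loss(simulate(params), observations)
        loss.backward()
        opt.step()
    final_loss = F.mse_loss(simulate(params), observations).item()
    return params.detach(), final_loss

def outer_loop(llm_propose, build_simulator, observations, iterations=10):
    """Outer loop: the LLM proposes a discrete structure; the fitted
    parameters and loss feed back into the next proposal."""
    history = []
    for _ in range(iterations):
        expr, init_params = llm_propose(history)  # discrete proposal from the LLM
        simulate = build_simulator(expr)          # differentiable simulator for this law
        params, loss = inner_optimize(simulate, init_params, observations)
        history.append((expr, params, loss))
    return min(history, key=lambda h: h[2])       # best (expr, params, loss)
```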
A multi-agent reinforcement learning framework from researchers at AI2, MIT, CMU, and Boston University trains a smaller language model to generate natural-language critiques, which then guide fixed, black-box large language models in refining their outputs. The method consistently improves the fixed model's performance, with gains such as a 27-point absolute increase in exact-match accuracy on a synthetic alphabetization task over a supervised critique-generation baseline.
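At inference time, the trained critic and the fixed model interact in a simple refinement loop: the critic writes a critique, the black-box model revises, and the best-scoring output is kept. A sketch assuming generic callables for the critic, the black-box LLM, and the task metric (all interfaces are illustrative, and the reward used to train the critic, the improvement in the metric after refinement, is not shown):

```python
def refine_with_critique(task_input, blackbox_llm, critic, metric, rounds=2):
    """Refine a fixed black-box LLM's output using natural-language critiques
    from a smaller trained critic model; return the best output seen."""
    output = blackbox_llm(task_input)
    trajectory = [(output, metric(output))]
    for _ in range(rounds):
        critique = critic(task_input, output)                 # small model writes feedback
        output = blackbox_llm(task_input, critique=critique)  # fixed LLM revises its answer
        trajectory.append((output, metric(output)))
    return max(trajectory, key=lambda t: t[1])[0]
```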