Researchers from AI2, OpenLocus, and UMass Amherst introduce DISCOVERYBENCH, a new benchmark designed to evaluate large language models' ability to perform multi-step data-driven scientific discovery. The benchmark, comprising 264 real-world tasks and 903 synthetic tasks, reveals that current state-of-the-art LLMs achieve a maximum Hypothesis Matching Score of 25%, indicating significant limitations in autonomous discovery.
This position paper explores the use of Large Generative Models (LGMs) for end-to-end data-driven scientific discovery, proposing a hybrid system that combines LGM capabilities with robust external tools and active human feedback. The authors' proof-of-concept, DATAVOYAGER, demonstrated the potential for automated hypothesis generation and verification from existing datasets while also highlighting the necessity of human oversight and tool integration to mitigate LGM limitations.