Large Multimodal Models often produce semantically plausible but visually inaccurate outputs when interpreting scene text, particularly with non-semantic content. This work introduces a training-free framework, combining a coarse-to-fine attention-based text region estimator with adaptive internal layer correction, which enhances visual grounding and improves LMM performance on scene text tasks by up to 5.5% F1 score on the challenging TextHalu-Bench.
ProJudge introduces a comprehensive framework, benchmark, and dataset for evaluating and improving Multi-Modal Large Language Models' ability to judge the step-by-step reasoning processes in scientific problem-solving. Fine-tuning open-source models on the ProJudge-173k dataset significantly enhanced their process evaluation capabilities, bringing their performance closer to proprietary models like GPT-4o.
Visual text embedded in videos carries rich semantic information, which is crucial both for holistic video understanding and for fine-grained reasoning about local human actions. However, existing video understanding benchmarks largely overlook textual information, while OCR-specific benchmarks are constrained to static images, limiting their ability to capture the interaction between text and dynamic visual contexts. To address this gap, we propose VidText, a new benchmark designed for comprehensive and in-depth evaluation of video text understanding. VidText offers the following key features: 1) It covers a wide range of real-world scenarios and supports multilingual content, encompassing diverse settings where video text naturally appears. 2) It introduces a hierarchical evaluation framework with video-level, clip-level, and instance-level tasks, enabling assessment of both global summarization and local retrieval capabilities. 3) The benchmark also introduces a set of paired perception-reasoning tasks, ranging from visual text perception to cross-modal reasoning between textual and visual information. Extensive experiments on 18 state-of-the-art Large Multimodal Models (LMMs) reveal that current models struggle across most tasks, with significant room for improvement. Further analysis highlights the impact of both model-intrinsic factors, such as input resolution and OCR capability, and external factors, including the use of auxiliary information and Chain-of-Thought reasoning strategies. We hope VidText will fill the current gap in video understanding benchmarks and serve as a foundation for future research on multimodal reasoning with video text in dynamic environments.
SDSS-V will be an all-sky, multi-epoch spectroscopic survey of over six million objects. It is designed to decode the history of the Milky Way, trace the emergence of the chemical elements, reveal the inner workings of stars, and investigate the origin of planets. It will also create an integral-field spectroscopic map of the gas in the Galaxy and the Local Group that is 1,000x larger than the current state of the art and at high enough spatial resolution to reveal the self-regulation mechanisms of galactic ecosystems. SDSS-V will pioneer systematic, spectroscopic monitoring across the whole sky, revealing changes on timescales from 20 minutes to 20 years. The survey will thus track the flickers, flares, and radical transformations of the most luminous persistent objects in the universe: massive black holes growing at the centers of galaxies. The scope and flexibility of SDSS-V will be unique among extant and future spectroscopic surveys: it is all-sky, with matched survey infrastructures in both hemispheres; it provides near-IR and optical multi-object fiber spectroscopy that is rapidly reconfigurable to serve high target densities, targets of opportunity, and time-domain monitoring; and it provides optical, ultra-wide-field integral field spectroscopy. SDSS-V, with its programs anticipated to start in 2020, will be well-timed to multiply the scientific output from major space missions (e.g., TESS, Gaia, eROSITA) and ground-based projects. SDSS-V builds on the 25-year heritage of SDSS's advances in data analysis, collaboration infrastructure, and product deliverables. The project is now refining its science scope, optimizing the survey strategies, and developing new hardware that builds on the SDSS-IV infrastructure. We present here an overview of the current state of these developments as we seek to build our worldwide consortium of institutional and individual members.