test-time-inference
Learning Unmasking Policies for Diffusion Language Models
09 Dec 2025

Lightweight reinforcement learning policies were trained to automate the unmasking process for Diffusion Large Language Models (dLLMs), improving inference efficiency without sacrificing generation quality. These policies consistently outperformed heuristic methods, particularly in full-diffusion generation settings, and demonstrated transferability across different dLLM architectures and sequence lengths.
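To make the idea concrete, here is a minimal numpy sketch (not the paper's implementation) of the general shape: at each denoising step a scorer ranks the still-masked positions and the top-k are unmasked. The baseline scorer is raw model confidence, while the learned policy is a lightweight scorer over per-position features; the features and weights below are illustrative placeholders.

```python
# Illustrative sketch (not the paper's implementation): replacing a
# confidence heuristic with a small learned scorer for choosing which
# masked positions to unmask at each dLLM denoising step.
import numpy as np

rng = np.random.default_rng(0)

def heuristic_scores(confidence):
    # Baseline heuristic: unmask the positions the model is most sure about.
    return confidence

def policy_scores(features, weights):
    # Hypothetical lightweight policy: a linear scorer over per-position
    # features (confidence, entropy, relative position). In practice the
    # weights would be trained with RL against generation quality and speed.
    return features @ weights

def select_positions(scores, masked, k):
    # Unmask the k highest-scoring positions that are still masked.
    candidates = np.where(masked)[0]
    return candidates[np.argsort(scores[candidates])[::-1][:k]]

# Toy state for a length-16 sequence with 10 positions still masked.
seq_len = 16
masked = np.zeros(seq_len, dtype=bool)
masked[rng.choice(seq_len, size=10, replace=False)] = True
confidence = rng.uniform(size=seq_len)
entropy = rng.uniform(size=seq_len)
position = np.arange(seq_len) / seq_len
features = np.stack([confidence, entropy, position], axis=1)
weights = np.array([1.0, -0.5, 0.1])  # placeholder, not trained

print("heuristic:", select_positions(heuristic_scores(confidence), masked, k=3))
print("policy:   ", select_positions(policy_scores(features, weights), masked, k=3))
```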

View blog
Resources
Asynchronous Reasoning: Training-Free Interactive Thinking LLMs
11 Dec 2025

Researchers from Yandex and academic partners introduce AsyncReasoning, a training-free framework that enables existing Large Language Models to concurrently reason, process new inputs, and generate responses. The method reduces user-perceived delays by 6-11x, cutting Time to First Token from minutes to seconds, while preserving most of the reasoning accuracy and allowing real-time safety checks.
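A toy asyncio sketch of the general pattern only, not the AsyncReasoning implementation: a background reasoning task keeps refining a shared scratchpad while the responder starts streaming a provisional answer immediately, so the first token is not gated on the full reasoning trace.

```python
# Toy asyncio sketch of the pattern (not the AsyncReasoning implementation):
# reasoning and responding run concurrently instead of sequentially.
import asyncio

async def reason(scratchpad):
    for step in range(5):
        await asyncio.sleep(0.2)          # stands in for slow chain-of-thought decoding
        scratchpad.append(f"thought {step}")

async def respond(scratchpad):
    for chunk in range(5):
        await asyncio.sleep(0.1)          # stands in for answer-token decoding
        latest = scratchpad[-1] if scratchpad else "none yet"
        print(f"answer chunk {chunk} (conditioned on: {latest})")

async def main():
    scratchpad = []
    # The user sees output right away instead of waiting for the full trace.
    await asyncio.gather(reason(scratchpad), respond(scratchpad))

asyncio.run(main())
```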

View blog
Resources
SegEarth-OV3: Exploring SAM 3 for Open-Vocabulary Semantic Segmentation in Remote Sensing Images
09 Dec 2025

SegEarth-OV3 introduces a training-free adaptation of the Segment Anything Model 3 (SAM 3) for open-vocabulary semantic segmentation in remote sensing images. The method establishes a new state-of-the-art for training-free approaches, achieving a 53.4% average mIoU across eight remote sensing benchmarks, an improvement of 12.7% mIoU over previous methods.

View blog
Resources
SAM-Body4D: Training-Free 4D Human Body Mesh Recovery from Videos
09 Dec 2025

SAM-Body4D introduces a training-free framework for 4D human body mesh recovery from videos, synergistically combining promptable video object segmentation and image-based human mesh recovery models with an occlusion-aware mask refinement module. The system produces temporally consistent and robust mesh trajectories, effectively handling occlusions and maintaining identity across frames.

View blog
Resources
DuetSVG: Unified Multimodal SVG Generation with Internal Visual Guidance
11 Dec 2025
Recent vision-language model (VLM)-based approaches have achieved impressive results on SVG generation. However, because they generate only text and lack visual signals during decoding, they often struggle with complex semantics and fail to produce visually appealing or geometrically coherent SVGs. We introduce DuetSVG, a unified multimodal model that jointly generates image tokens and corresponding SVG tokens in an end-to-end manner. DuetSVG is trained on both image and SVG datasets. At inference, we apply a novel test-time scaling strategy that leverages the model's native visual predictions as guidance to improve SVG decoding quality. Extensive experiments show that our method outperforms existing methods, producing visually faithful, semantically aligned, and syntactically clean SVGs across a wide range of applications.
View blog
Resources
Algorithmic Thinking Theory
04 Dec 2025

Researchers from Google, NYU, ETH Zurich, and Stanford present a theoretical framework to formalize how large language models perform complex, iterative reasoning. The framework characterizes reasoning "oracles" and algorithms, proving that branching and genetic algorithms can achieve optimal success probabilities for models where oracle accuracy can decay with context size, and explains phenomena like "overthinking."
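A toy Monte Carlo sketch of the branching idea under an assumed accuracy-decay model (the numbers and decay law are illustrative, not the paper's formal setup): each reasoning step is checked by an oracle whose accuracy falls with context length, and branching samples several candidate steps, succeeding if any candidate is correct.

```python
# Toy Monte Carlo sketch of branching over a decaying oracle (illustrative
# numbers only, not the paper's formal analysis).
import random

def oracle_accuracy(context_len, base=0.95, decay=0.01):
    # Hypothetical accuracy model: step correctness degrades as context grows.
    return max(0.5, base - decay * context_len)

def run_chain(steps, branch=1):
    context_len = 0
    for _ in range(steps):
        p = oracle_accuracy(context_len)
        # With branching, sample `branch` candidates and keep any correct one
        # (this assumes a reliable verifier to pick the good candidate).
        if not any(random.random() < p for _ in range(branch)):
            return False
        context_len += 1
    return True

def success_rate(steps, branch, trials=20000):
    return sum(run_chain(steps, branch) for _ in range(trials)) / trials

for b in (1, 2, 4):
    print(f"branching factor {b}: success rate {success_rate(steps=30, branch=b):.3f}")
```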

View blog
Resources
Exploring Test-time Scaling via Prediction Merging on Large-Scale Recommendation
Inspired by the success of language models (LMs), scaling up deep learning recommendation systems (DLRS) has become a recent trend in the community. Previous methods scale up model parameters at training time; how to efficiently use and scale computational resources at test time remains underexplored, even though test-time scaling has proven scaling-efficient and delivered orthogonal gains in the LM domain. The key to applying test-time scaling to DLRS is generating diverse yet meaningful outputs for the same instance. We propose two approaches: one exploits the heterogeneity of different model architectures, the other the randomness of model initialization under a homogeneous architecture. Evaluations across eight models, spanning classic and state-of-the-art designs, on three benchmarks demonstrate the effectiveness of both. We further show that, under the same inference budget, test-time scaling can outperform parameter scaling. Our test-time scaling also accelerates seamlessly as more parallel servers are deployed online, without affecting user-side inference time. Code is available.
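As a minimal illustration of prediction merging, the sketch below uses scikit-learn models on a synthetic binary task as stand-ins for DLRS architectures: a homogeneous ensemble varies only the initialization seed, a heterogeneous one mixes model families, and both merge predictions by averaging probabilities.

```python
# Minimal sketch of prediction merging on a synthetic CTR-style task
# (sklearn models stand in for the paper's DLRS architectures and benchmarks).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=4000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

def auc(p):
    return roc_auc_score(y_te, p)

# Homogeneous test-time scaling: same architecture, different init seeds.
homo = [MLPClassifier(hidden_layer_sizes=(32,), max_iter=300, random_state=s).fit(X_tr, y_tr)
        for s in range(4)]
homo_pred = np.mean([m.predict_proba(X_te)[:, 1] for m in homo], axis=0)

# Heterogeneous test-time scaling: different model families.
hetero = [LogisticRegression(max_iter=1000).fit(X_tr, y_tr),
          GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr),
          MLPClassifier(hidden_layer_sizes=(32,), max_iter=300, random_state=0).fit(X_tr, y_tr)]
hetero_pred = np.mean([m.predict_proba(X_te)[:, 1] for m in hetero], axis=0)

print("single model AUC:       ", auc(homo[0].predict_proba(X_te)[:, 1]))
print("homogeneous merge AUC:  ", auc(homo_pred))
print("heterogeneous merge AUC:", auc(hetero_pred))
```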
View blog
Resources
Metacognitive Sensitivity for Test-Time Dynamic Model Selection
11 Dec 2025
A key aspect of human cognition is metacognition - the ability to assess one's own knowledge and judgment reliability. While deep learning models can express confidence in their predictions, they often suffer from poor calibration, a cognitive bias where expressed confidence does not reflect true competence. Do models truly know what they know? Drawing from human cognitive science, we propose a new framework for evaluating and leveraging AI metacognition. We introduce meta-d', a psychologically-grounded measure of metacognitive sensitivity, to characterise how reliably a model's confidence predicts its own accuracy. We then use this dynamic sensitivity score as context for a bandit-based arbiter that performs test-time model selection, learning which of several expert models to trust for a given task. Our experiments across multiple datasets and deep learning model combinations (including CNNs and VLMs) demonstrate that this metacognitive approach improves joint-inference accuracy over constituent models. This work provides a novel behavioural account of AI models, recasting ensemble selection as a problem of evaluating both short-term signals (confidence prediction scores) and medium-term traits (metacognitive sensitivity).
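A simplified sketch of the two ingredients: AUROC of confidence against correctness stands in for meta-d' (which requires a signal-detection-theoretic fit), and a small UCB bandit arbitrates between two simulated experts. All data and numbers are synthetic.

```python
# Sketch with simplifications: AUROC of confidence vs. correctness stands in
# for meta-d', and a UCB1 bandit arbitrates between two simulated "experts".
import math
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def sensitivity(confidence, correct):
    # How well does the model's confidence predict its own accuracy?
    return roc_auc_score(correct, confidence)

# Expert A: slightly less accurate but well calibrated.
# Expert B: more accurate but its confidence is nearly uninformative.
n = 500
correct_a = rng.random(n) < 0.70
conf_a = np.clip(0.5 + 0.3 * correct_a + rng.normal(0, 0.1, n), 0, 1)
correct_b = rng.random(n) < 0.75
conf_b = rng.random(n)

print("sensitivity A:", round(sensitivity(conf_a, correct_a), 3))
print("sensitivity B:", round(sensitivity(conf_b, correct_b), 3))

# UCB1 bandit over the two experts, rewarded 1 when the chosen expert is right.
counts, values = [0, 0], [0.0, 0.0]
outcomes = [correct_a, correct_b]
for t in range(n):
    ucb = [values[i] / counts[i] + math.sqrt(2 * math.log(t + 1) / counts[i])
           if counts[i] else float("inf") for i in range(2)]
    arm = int(np.argmax(ucb))
    counts[arm] += 1
    values[arm] += float(outcomes[arm][t])

print("bandit pulls (A, B):", counts)
```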
View blog
Resources
SIMPACT: Simulation-Enabled Action Planning using Vision-Language Models

SIMPACT introduces a framework that enables Vision-Language Models to perform zero-shot, physics-aware robotic manipulation by integrating an automatically constructed multi-physics simulator into the planning loop. This approach achieves success rates up to 90% on tasks such as shaping rope and Play-Doh, significantly outperforming geometric and VLM-based baselines.

View blog
Resources
Short-Context Dominance: How Much Local Context Natural Language Actually Needs?
08 Dec 2025

This research quantifies the 'short-context dominance' in natural language, revealing that 75-80% of next-token predictions in large language models depend on local contexts of 32-96 tokens. It introduces a ground-truth-independent method, Long-Short Distribution Shift (LSDS), to detect when longer contexts are truly needed, and a targeted decoding algorithm, TaBoo, that consistently improves performance on long-range reasoning tasks.
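A rough sketch of the diagnostic's flavor, not the paper's exact LSDS definition: compare the next-token distribution computed from a short local window with the one from the full context; a large divergence suggests the prediction genuinely needs long-range information. Assumes the Hugging Face transformers library with GPT-2 as a stand-in model.

```python
# Rough sketch (not the paper's exact LSDS): KL divergence between the
# next-token distributions given the full context and a short local window.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def next_token_logprobs(ids):
    with torch.no_grad():
        logits = model(input_ids=ids.unsqueeze(0)).logits[0, -1]
    return torch.log_softmax(logits, dim=-1)

text = ("The painter locked her studio in Lisbon and flew home. "
        "Years later, critics still argued about the unfinished mural she left in")
ids = tok(text, return_tensors="pt").input_ids[0]

full = next_token_logprobs(ids)
local = next_token_logprobs(ids[-16:])        # 16-token local window
kl = torch.sum(full.exp() * (full - local))   # KL(full || short)
print("KL(full || short-context):", round(kl.item(), 4))
print("full-context top token: ", tok.decode(int(full.argmax())))
print("short-context top token:", tok.decode(int(local.argmax())))
```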

View blog
Resources
SplatPainter: Interactive Authoring of 3D Gaussians from 2D Edits via Test-Time Training

SplatPainter, a framework developed by researchers from Stanford University and Adobe Research, enables interactive, high-fidelity editing of 3D Gaussian Splatting assets directly from 2D inputs. It achieves identity-preserving local and global modifications at sub-second speeds, bridging a critical gap in 3D content creation workflows.

View blog
Resources
Less Is More for Multi-Step Logical Reasoning of LLM Generalisation Under Rule Removal, Paraphrasing, and Compression
06 Dec 2025
Large language models (LLMs) excel across many natural language tasks, yet their generalisation to structural perturbations in logical contexts remains poorly understood. We introduce a controlled evaluation framework that probes reasoning reliability through four targeted stress tests: (1) rule deletion, removing either redundant or essential rules from a multi-step inference chain; (2) contradictory evidence injection; (3) logic-preserving rewrites generated through several families of equivalence laws (contrapositive, double negation, implication, De Morgan, identity, and commutativity); and (4) multi-law equivalence stacking that introduces 2-5 simultaneous logical transformations. Across three representative model families (BERT, Qwen2, and LLaMA-like models), our experiments reveal a strikingly consistent pattern: all models achieve perfect accuracy on the base tasks and generalise fully under redundant rule deletion and all equivalence-based rewrites (single or multi-law), but fail sharply under essential rule deletion (dropping to 25% accuracy) and collapse completely in the presence of explicit contradictions (0% accuracy). These results demonstrate that LLMs possess stable invariance to semantics-preserving logical transformations, yet remain fundamentally brittle to missing or conflicting evidence. Our framework provides a clean diagnostic tool for isolating such reasoning failure modes and highlights persistent gaps in the logical generalisation abilities of current LLMs.
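A self-contained toy version of the rule-deletion probe (the rules below are invented, not drawn from the paper's benchmark): forward chaining decides entailment, and each rule is classified as redundant or essential by checking whether the goal survives its removal.

```python
# Toy version of the rule-deletion probe (rules invented for illustration):
# forward chaining decides entailment; a rule is redundant if the goal is
# still derivable without it, essential otherwise.
def entails(facts, rules, goal):
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and all(p in known for p in premises):
                known.add(conclusion)
                changed = True
    return goal in known

facts = {"wet", "cold"}
rules = [
    ({"wet", "cold"}, "ice"),        # needed to reach the goal
    ({"ice"}, "slippery"),           # needed to reach the goal
    ({"wet"}, "humid"),              # irrelevant to the goal
]
goal = "slippery"

assert entails(facts, rules, goal)
for i, rule in enumerate(rules):
    reduced = rules[:i] + rules[i + 1:]
    status = "redundant" if entails(facts, reduced, goal) else "essential"
    print(f"rule {i} {rule}: {status}")
```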
View blog
Resources
Zoom in, Click out: Unlocking and Evaluating the Potential of Zooming for GUI Grounding
05 Dec 2025

Researchers from Peking University, Princeton University, and other institutions developed ZoomClick, a training-free inference strategy that enhances GUI grounding accuracy by dynamically zooming into relevant regions, and GUIZoom-Bench, a diagnostic benchmark for evaluating zoom behaviors. ZoomClick achieved a new state-of-the-art of 73.1% accuracy on ScreenSpot-Pro and a 66.7% relative improvement on UI-Vision, enabling smaller models to outperform larger unaugmented counterparts.
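A sketch of a generic zoom-in grounding loop; the `ground` function is a hypothetical placeholder for a VLM call and is not part of ZoomClick. The crop-and-remap bookkeeping, mapping the final in-crop prediction back to full-screen coordinates, is the concrete part.

```python
# Sketch of a generic zoom-in grounding loop. `ground(image, instruction)` is
# a hypothetical stand-in for a GUI-grounding VLM call (not ZoomClick itself);
# the crop-and-remap arithmetic is the point of the sketch.
from PIL import Image

def ground(image, instruction):
    # Placeholder: a real system would query a VLM for a click point here.
    w, h = image.size
    return (w // 2, h // 2)

def zoom_click(image, instruction, rounds=2, zoom=0.5):
    off_x, off_y = 0, 0
    region = image
    for _ in range(rounds):
        x, y = ground(region, instruction)
        w, h = region.size
        # Crop a window around the predicted point and zoom in on it.
        cw, ch = int(w * zoom), int(h * zoom)
        left = min(max(x - cw // 2, 0), w - cw)
        top = min(max(y - ch // 2, 0), h - ch)
        region = region.crop((left, top, left + cw, top + ch))
        off_x, off_y = off_x + left, off_y + top
    x, y = ground(region, instruction)
    # Map the final in-crop prediction back to full-screen coordinates.
    return off_x + x, off_y + y

screenshot = Image.new("RGB", (1920, 1080))
print(zoom_click(screenshot, "click the save button"))
```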

View blog
Resources
ProSocialAlign: Preference Conditioned Test Time Alignment in Language Models
06 Dec 2025
Current language model safety paradigms often fall short in emotionally charged or high-stakes settings, where refusal-only approaches may alienate users and naive compliance can amplify risk. We propose ProSocialAlign, a test-time, parameter-efficient framework that steers generation toward safe, empathetic, and value-aligned responses without retraining the base model. We formalize five human-centered objectives and cast safety as lexicographic constrained generation: first, applying hard constraints to eliminate harmful continuations; then optimizing for prosocial quality within the safe set. Our method combines (i) directional regulation, a harm-mitigation mechanism that subtracts a learned "harm vector" in parameter space, and (ii) preference-aware autoregressive reward modeling trained jointly across attributes with gradient conflict resolution, enabling fine-grained, user-controllable decoding. Empirical evaluations across five safety benchmarks demonstrate state-of-the-art performance, reducing unsafe leakage and boosting alignment to human values, with strong gains across multiple evaluation metrics. ProSocialAlign offers a robust and modular foundation for generating context-sensitive, safe, and human-aligned responses at inference time.
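The directional-regulation idea resembles task-vector arithmetic in parameter space. The sketch below shows only that subtraction, on tiny dummy torch modules; the paper's harm vector is learned, and its construction is not reproduced here.

```python
# Sketch of generic task-vector subtraction in parameter space (the paper's
# harm vector is learned; the two Linear layers here are only dummies).
import torch
import torch.nn as nn

def subtract_harm_vector(base_state, harmful_state, alpha=1.0):
    # theta_safe = theta_base - alpha * (theta_harmful - theta_base)
    return {k: base_state[k] - alpha * (harmful_state[k] - base_state[k])
            for k in base_state}

torch.manual_seed(0)
base = nn.Linear(4, 2)
harmful = nn.Linear(4, 2)   # stands in for a model fine-tuned on harmful behaviour

safe = nn.Linear(4, 2)
safe.load_state_dict(subtract_harm_vector(base.state_dict(),
                                          harmful.state_dict(), alpha=0.5))

x = torch.randn(1, 4)
print("base output:   ", base(x).detach().numpy())
print("steered output:", safe(x).detach().numpy())
```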
View blog
Resources
TextGuider: Training-Free Guidance for Text Rendering via Attention Alignment
Despite recent advances, diffusion-based text-to-image models still struggle with accurate text rendering. Several studies have proposed fine-tuning or training-free refinement methods for accurate text rendering. However, the critical issue of text omission, where the desired text is partially or entirely missing, remains largely overlooked. In this work, we propose TextGuider, a novel training-free method that encourages accurate and complete text appearance by aligning textual content tokens and text regions in the image. Specifically, we analyze attention patterns in MM-DiT models, particularly for text-related tokens intended to be rendered in the image. Leveraging this observation, we apply latent guidance during the early stage of denoising steps based on two loss functions that we introduce. Our method achieves state-of-the-art performance in test-time text rendering, with significant gains in recall and strong results in OCR accuracy and CLIP score.
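An abstract sketch with dummy tensors of what attention-alignment guidance looks like mechanically: a loss rewards attention mass from text-content tokens inside the intended text region, and its gradient nudges the latents. The real method hooks MM-DiT attention inside a diffusion sampler and uses two specific losses, neither of which is reproduced here.

```python
# Abstract sketch with dummy tensors (no real diffusion model or MM-DiT
# attention hooks): a loss pulls attention mass from text tokens into a
# target text box, and one gradient step updates the latents.
import torch

torch.manual_seed(0)
latents = torch.randn(1, 16, 32, 32, requires_grad=True)

def fake_attention(latents, n_text_tokens=4):
    # Placeholder for (tokens, H, W) attention maps that a real sampler
    # would expose from its cross/joint attention layers.
    maps = torch.einsum("bchw,tc->bthw", latents, torch.randn(n_text_tokens, 16))
    return torch.softmax(maps.flatten(2), dim=-1).view(1, n_text_tokens, 32, 32)

region = torch.zeros(32, 32)
region[20:28, 4:28] = 1.0                      # intended text box

attn = fake_attention(latents)
inside = (attn * region).sum(dim=(-1, -2))     # attention mass inside the box
loss = -(inside.clamp_min(1e-6).log()).mean()  # reward mass inside the box

loss.backward()
with torch.no_grad():
    latents -= 0.1 * latents.grad              # one guidance step on the latents
print("attention mass in region per token:", inside.detach().squeeze().tolist())
```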
View blog
Resources
ReJump: A Tree-Jump Representation for Analyzing and Improving LLM Reasoning
09 Dec 2025

Researchers introduce ReJump, a dual tree-jump representation that models LLM reasoning by capturing hierarchical problem-solving steps and dynamic action flows, including backtracking. This framework enables diagnosing reasoning inefficiencies and improving model performance on complex tasks through guided test-time selection.
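A hypothetical data-structure sketch (field names invented, not the paper's schema): a tree over problem-solving steps plus jump edges that record the dynamic flow of the trace, such as backtracking from one subgoal to another.

```python
# Hypothetical sketch of a tree-plus-jumps trace representation
# (field names are invented, not the paper's schema).
from dataclasses import dataclass, field

@dataclass
class StepNode:
    text: str
    children: list = field(default_factory=list)

@dataclass
class ReasoningTrace:
    root: StepNode
    jumps: list = field(default_factory=list)   # (from_node, to_node, kind)

    def backtracks(self):
        return [j for j in self.jumps if j[2] == "backtrack"]

root = StepNode("solve equation")
branch_a = StepNode("try factoring")
branch_b = StepNode("use quadratic formula")
root.children += [branch_a, branch_b]

trace = ReasoningTrace(root)
trace.jumps.append((branch_a, branch_b, "backtrack"))   # abandoned factoring
print("number of backtracks:", len(trace.backtracks()))
```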

View blog
Resources
Prompting Science Report 4: Playing Pretend: Expert Personas Don't Improve Factual Accuracy
05 Dec 2025

This study from Generative AI Labs at The Wharton School empirically demonstrates that assigning expert personas to large language models generally does not improve their factual accuracy on challenging objective questions from benchmarks like GPQA Diamond and MMLU-Pro. The research found that most persona conditions yielded performance statistically similar to a baseline without personas, with low-knowledge personas often decreasing accuracy and some mismatched expert roles causing models to refuse to answer.

View blog
Resources
Adapting Like Humans: A Metacognitive Agent with Test-time Reasoning
28 Nov 2025

Metacognitive Test-time Reasoning (MCTR) imbues Vision-Language Models with human-like fluid intelligence through a dual-level metacognitive architecture and test-time reinforcement learning. This framework achieves state-of-the-art zero-shot adaptation, securing 9 out of 12 top-1 results on unseen Atari games and improving average unseen performance by 275% over the SFT baseline.

View blog
Resources
Plantain: Plan-Answer Interleaved Reasoning
02 Dec 2025

Google DeepMind's Plantain framework enables large language models to interleave planning with intermediate answers, addressing user experience issues in reasoning tasks. This approach reduces the time-to-first-response by over 60% and maintains or improves task accuracy across various benchmarks by allowing early user intervention.

View blog
Resources
Toward More Reliable Artificial Intelligence: Reducing Hallucinations in Vision-Language Models
08 Dec 2025
Vision-language models (VLMs) frequently generate hallucinated content: plausible but incorrect claims about image content. We propose a training-free self-correction framework that enables VLMs to iteratively refine responses through uncertainty-guided visual re-attention. Our method combines multidimensional uncertainty quantification (token entropy, attention dispersion, semantic consistency, claim confidence) with attention-guided cropping of under-explored regions. Operating entirely with frozen, pretrained VLMs, our framework requires no gradient updates. We validate our approach on the POPE and MMHal-Bench benchmarks using the Qwen2.5-VL-7B [23] architecture, with plans to extend validation across diverse architectures in future versions. Experimental results demonstrate that our method reduces hallucination rates by 9.8 percentage points compared to the baseline, while improving object existence accuracy by 4.7 points on adversarial splits. Furthermore, qualitative analysis confirms that uncertainty-guided re-attention successfully grounds corrections in visual evidence where standard decoding fails. We release our code and methodology to facilitate future research in trustworthy multimodal systems.
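A sketch of the control loop only: the uncertainty signals below run on dummy data, and `query_vlm` / `attention_heatmap` are hypothetical placeholders for calls to the frozen VLM, not the paper's code.

```python
# Sketch of the uncertainty-guided re-attention loop. The uncertainty signals
# use dummy data; `query_vlm` and `attention_heatmap` are hypothetical
# placeholders for the frozen VLM calls.
import numpy as np

rng = np.random.default_rng(0)

def token_entropy(probs):
    # Mean next-token entropy over the generated answer.
    return float(np.mean([-(p * np.log(p + 1e-9)).sum() for p in probs]))

def semantic_consistency(samples):
    # Crude agreement proxy: fraction of resampled answers equal to the first.
    return sum(s == samples[0] for s in samples) / len(samples)

def query_vlm(image, question, crop=None):
    return "a red cup on the table"             # placeholder answer

def attention_heatmap(image, question):
    return rng.random((12, 12))                 # placeholder attention map

def answer_with_recheck(image, question, entropy_thr=2.0, consistency_thr=0.6):
    answer = query_vlm(image, question)
    probs = rng.dirichlet(np.ones(50), size=20)           # dummy per-token distributions
    samples = [query_vlm(image, question) for _ in range(4)]
    uncertain = (token_entropy(probs) > entropy_thr
                 or semantic_consistency(samples) < consistency_thr)
    if uncertain:
        heat = attention_heatmap(image, question)
        y, x = np.unravel_index(np.argmin(heat), heat.shape)  # least-attended cell
        answer = query_vlm(image, question, crop=(x, y))      # re-ask about that region
    return answer

print(answer_with_recheck(image=None, question="What is on the table?"))
```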
View blog
Resources