test-time-inference
Learning Unmasking Policies for Diffusion Language Models
09 Dec 2025

Lightweight reinforcement learning policies were trained to automate the unmasking process for Diffusion Large Language Models (dLLMs), improving inference efficiency without sacrificing generation quality. These policies consistently outperformed heuristic methods, particularly in full-diffusion generation settings, and demonstrated transferability across different dLLM architectures and sequence lengths.
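To make the idea concrete, here is a minimal numpy sketch (not the paper's implementation) of the general shape: at each denoising step a scorer ranks the still-masked positions and the top-k are unmasked. The baseline scorer is raw model confidence, while the learned policy is a lightweight scorer over per-position features; the features and weights below are illustrative placeholders.

```python
# Illustrative sketch (not the paper's implementation): replacing a
# confidence heuristic with a small learned scorer for choosing which
# masked positions to unmask at each dLLM denoising step.
import numpy as np

rng = np.random.default_rng(0)

def heuristic_scores(confidence):
    # Baseline heuristic: unmask the positions the model is most sure about.
    return confidence

def policy_scores(features, weights):
    # Hypothetical lightweight policy: a linear scorer over per-position
    # features (confidence, entropy, relative position). In practice the
    # weights would be trained with RL against generation quality and speed.
    return features @ weights

def select_positions(scores, masked, k):
    # Unmask the k highest-scoring positions that are still masked.
    candidates = np.where(masked)[0]
    return candidates[np.argsort(scores[candidates])[::-1][:k]]

# Toy state for a length-16 sequence with 10 positions still masked.
seq_len = 16
masked = np.zeros(seq_len, dtype=bool)
masked[rng.choice(seq_len, size=10, replace=False)] = True
confidence = rng.uniform(size=seq_len)
entropy = rng.uniform(size=seq_len)
position = np.arange(seq_len) / seq_len
features = np.stack([confidence, entropy, position], axis=1)
weights = np.array([1.0, -0.5, 0.1])  # placeholder, not trained

print("heuristic:", select_positions(heuristic_scores(confidence), masked, k=3))
print("policy:   ", select_positions(policy_scores(features, weights), masked, k=3))
```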

View blog
Resources
Asynchronous Reasoning: Training-Free Interactive Thinking LLMs
11 Dec 2025

Researchers from Yandex and academic partners introduce AsyncReasoning, a training-free framework that enables existing Large Language Models to concurrently reason, process new inputs, and generate responses. The method reduces user-perceived delays by 6-11x, cutting Time to First Token from minutes to seconds, while preserving most of the reasoning accuracy and allowing real-time safety checks.
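A toy asyncio sketch of the general pattern only, not the AsyncReasoning implementation: a background reasoning task keeps refining a shared scratchpad while the responder starts streaming a provisional answer immediately, so the first token is not gated on the full reasoning trace.

```python
# Toy asyncio sketch of the pattern (not the AsyncReasoning implementation):
# reasoning and responding run concurrently instead of sequentially.
import asyncio

async def reason(scratchpad):
    for step in range(5):
        await asyncio.sleep(0.2)          # stands in for slow chain-of-thought decoding
        scratchpad.append(f"thought {step}")

async def respond(scratchpad):
    for chunk in range(5):
        await asyncio.sleep(0.1)          # stands in for answer-token decoding
        latest = scratchpad[-1] if scratchpad else "none yet"
        print(f"answer chunk {chunk} (conditioned on: {latest})")

async def main():
    scratchpad = []
    # The user sees output right away instead of waiting for the full trace.
    await asyncio.gather(reason(scratchpad), respond(scratchpad))

asyncio.run(main())
```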

View blog
Resources
SegEarth-OV3: Exploring SAM 3 for Open-Vocabulary Semantic Segmentation in Remote Sensing Images
09 Dec 2025

SegEarth-OV3 introduces a training-free adaptation of the Segment Anything Model 3 (SAM 3) for open-vocabulary semantic segmentation in remote sensing images. The method establishes a new state-of-the-art for training-free approaches, achieving a 53.4% average mIoU across eight remote sensing benchmarks, an improvement of 12.7% mIoU over previous methods.

View blog
Resources
SAM-Body4D: Training-Free 4D Human Body Mesh Recovery from Videos
09 Dec 2025

SAM-Body4D introduces a training-free framework for 4D human body mesh recovery from videos, synergistically combining promptable video object segmentation and image-based human mesh recovery models with an occlusion-aware mask refinement module. The system produces temporally consistent and robust mesh trajectories, effectively handling occlusions and maintaining identity across frames.

View blog
Resources
DuetSVG: Unified Multimodal SVG Generation with Internal Visual Guidance
11 Dec 2025
Recent vision-language model (VLM)-based approaches have achieved impressive results on SVG generation. However, because they generate only text and lack visual signals during decoding, they often struggle with complex semantics and fail to produce visually appealing or geometrically coherent SVGs. We introduce DuetSVG, a unified multimodal model that jointly generates image tokens and corresponding SVG tokens in an end-to-end manner. DuetSVG is trained on both image and SVG datasets. At inference, we apply a novel test-time scaling strategy that leverages the model's native visual predictions as guidance to improve SVG decoding quality. Extensive experiments show that our method outperforms existing methods, producing visually faithful, semantically aligned, and syntactically clean SVGs across a wide range of applications.
View blog
Resources
Algorithmic Thinking Theory
04 Dec 2025

Researchers from Google, NYU, ETH Zurich, and Stanford present a theoretical framework to formalize how large language models perform complex, iterative reasoning. The framework characterizes reasoning "oracles" and algorithms, proving that branching and genetic algorithms can achieve optimal success probabilities for models where oracle accuracy can decay with context size, and explains phenomena like "overthinking."
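A toy Monte Carlo sketch of the branching idea under an assumed accuracy-decay model (the numbers and decay law are illustrative, not the paper's formal setup): each reasoning step is checked by an oracle whose accuracy falls with context length, and branching samples several candidate steps, succeeding if any candidate is correct.

```python
# Toy Monte Carlo sketch of branching over a decaying oracle (illustrative
# numbers only, not the paper's formal analysis).
import random

def oracle_accuracy(context_len, base=0.95, decay=0.01):
    # Hypothetical accuracy model: step correctness degrades as context grows.
    return max(0.5, base - decay * context_len)

def run_chain(steps, branch=1):
    context_len = 0
    for _ in range(steps):
        p = oracle_accuracy(context_len)
        # With branching, sample `branch` candidates and keep any correct one
        # (this assumes a reliable verifier to pick the good candidate).
        if not any(random.random() < p for _ in range(branch)):
            return False
        context_len += 1
    return True

def success_rate(steps, branch, trials=20000):
    return sum(run_chain(steps, branch) for _ in range(trials)) / trials

for b in (1, 2, 4):
    print(f"branching factor {b}: success rate {success_rate(steps=30, branch=b):.3f}")
```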

View blog
Resources
Exploring Test-time Scaling via Prediction Merging on Large-Scale Recommendation
Inspired by the success of language models (LMs), scaling up deep learning recommendation systems (DLRS) has become a recent trend in the community. Previous methods scale up model parameters at training time; how to efficiently use and scale computational resources at test time remains underexplored, even though test-time scaling has proven scaling-efficient and delivered orthogonal gains in the LM domain. The key to applying test-time scaling to DLRS is generating diverse yet meaningful outputs for the same instance. We propose two approaches: one exploits the heterogeneity of different model architectures, the other the randomness of model initialization under a homogeneous architecture. Evaluations across eight models, spanning classic and state-of-the-art designs, on three benchmarks demonstrate the effectiveness of both. We further show that, under the same inference budget, test-time scaling can outperform parameter scaling. Our test-time scaling also accelerates seamlessly as more parallel servers are deployed online, without affecting user-side inference time. Code is available.
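As a minimal illustration of prediction merging, the sketch below uses scikit-learn models on a synthetic binary task as stand-ins for DLRS architectures: a homogeneous ensemble varies only the initialization seed, a heterogeneous one mixes model families, and both merge predictions by averaging probabilities.

```python
# Minimal sketch of prediction merging on a synthetic CTR-style task
# (sklearn models stand in for the paper's DLRS architectures and benchmarks).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=4000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

def auc(p):
    return roc_auc_score(y_te, p)

# Homogeneous test-time scaling: same architecture, different init seeds.
homo = [MLPClassifier(hidden_layer_sizes=(32,), max_iter=300, random_state=s).fit(X_tr, y_tr)
        for s in range(4)]
homo_pred = np.mean([m.predict_proba(X_te)[:, 1] for m in homo], axis=0)

# Heterogeneous test-time scaling: different model families.
hetero = [LogisticRegression(max_iter=1000).fit(X_tr, y_tr),
          GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr),
          MLPClassifier(hidden_layer_sizes=(32,), max_iter=300, random_state=0).fit(X_tr, y_tr)]
hetero_pred = np.mean([m.predict_proba(X_te)[:, 1] for m in hetero], axis=0)

print("single model AUC:       ", auc(homo[0].predict_proba(X_te)[:, 1]))
print("homogeneous merge AUC:  ", auc(homo_pred))
print("heterogeneous merge AUC:", auc(hetero_pred))
```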
View blog
Resources
Metacognitive Sensitivity for Test-Time Dynamic Model Selection
11 Dec 2025
A key aspect of human cognition is metacognition - the ability to assess one's own knowledge and judgment reliability. While deep learning models can express confidence in their predictions, they often suffer from poor calibration, a cognitive bias where expressed confidence does not reflect true competence. Do models truly know what they know? Drawing from human cognitive science, we propose a new framework for evaluating and leveraging AI metacognition. We introduce meta-d', a psychologically-grounded measure of metacognitive sensitivity, to characterise how reliably a model's confidence predicts its own accuracy. We then use this dynamic sensitivity score as context for a bandit-based arbiter that performs test-time model selection, learning which of several expert models to trust for a given task. Our experiments across multiple datasets and deep learning model combinations (including CNNs and VLMs) demonstrate that this metacognitive approach improves joint-inference accuracy over constituent models. This work provides a novel behavioural account of AI models, recasting ensemble selection as a problem of evaluating both short-term signals (confidence prediction scores) and medium-term traits (metacognitive sensitivity).
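A simplified sketch of the two ingredients: AUROC of confidence against correctness stands in for meta-d' (which requires a signal-detection-theoretic fit), and a small UCB bandit arbitrates between two simulated experts. All data and numbers are synthetic.

```python
# Sketch with simplifications: AUROC of confidence vs. correctness stands in
# for meta-d', and a UCB1 bandit arbitrates between two simulated "experts".
import math
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def sensitivity(confidence, correct):
    # How well does the model's confidence predict its own accuracy?
    return roc_auc_score(correct, confidence)

# Expert A: slightly less accurate but well calibrated.
# Expert B: more accurate but its confidence is nearly uninformative.
n = 500
correct_a = rng.random(n) < 0.70
conf_a = np.clip(0.5 + 0.3 * correct_a + rng.normal(0, 0.1, n), 0, 1)
correct_b = rng.random(n) < 0.75
conf_b = rng.random(n)

print("sensitivity A:", round(sensitivity(conf_a, correct_a), 3))
print("sensitivity B:", round(sensitivity(conf_b, correct_b), 3))

# UCB1 bandit over the two experts, rewarded 1 when the chosen expert is right.
counts, values = [0, 0], [0.0, 0.0]
outcomes = [correct_a, correct_b]
for t in range(n):
    ucb = [values[i] / counts[i] + math.sqrt(2 * math.log(t + 1) / counts[i])
           if counts[i] else float("inf") for i in range(2)]
    arm = int(np.argmax(ucb))
    counts[arm] += 1
    values[arm] += float(outcomes[arm][t])

print("bandit pulls (A, B):", counts)
```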
View blog
Resources
SIMPACT: Simulation-Enabled Action Planning using Vision-Language Models

SIMPACT introduces a framework that enables Vision-Language Models to perform zero-shot, physics-aware robotic manipulation by integrating an automatically constructed multi-physics simulator into the planning loop. This approach achieves success rates up to 90% on tasks such as shaping rope and Play-Doh, significantly outperforming geometric and VLM-based baselines.

View blog
Resources
Short-Context Dominance: How Much Local Context Natural Language Actually Needs?
08 Dec 2025

This research quantifies the 'short-context dominance' in natural language, revealing that 75-80% of next-token predictions in large language models depend on local contexts of 32-96 tokens. It introduces a ground-truth-independent method, Long-Short Distribution Shift (LSDS), to detect when longer contexts are truly needed, and a targeted decoding algorithm, TaBoo, that consistently improves performance on long-range reasoning tasks.
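A rough sketch of the diagnostic's flavor, not the paper's exact LSDS definition: compare the next-token distribution computed from a short local window with the one from the full context; a large divergence suggests the prediction genuinely needs long-range information. Assumes the Hugging Face transformers library with GPT-2 as a stand-in model.

```python
# Rough sketch (not the paper's exact LSDS): KL divergence between the
# next-token distributions given the full context and a short local window.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def next_token_logprobs(ids):
    with torch.no_grad():
        logits = model(input_ids=ids.unsqueeze(0)).logits[0, -1]
    return torch.log_softmax(logits, dim=-1)

text = ("The painter locked her studio in Lisbon and flew home. "
        "Years later, critics still argued about the unfinished mural she left in")
ids = tok(text, return_tensors="pt").input_ids[0]

full = next_token_logprobs(ids)
local = next_token_logprobs(ids[-16:])        # 16-token local window
kl = torch.sum(full.exp() * (full - local))   # KL(full || short)
print("KL(full || short-context):", round(kl.item(), 4))
print("full-context top token: ", tok.decode(int(full.argmax())))
print("short-context top token:", tok.decode(int(local.argmax())))
```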

View blog
Resources
SplatPainter: Interactive Authoring of 3D Gaussians from 2D Edits via Test-Time Training

SplatPainter, a framework developed by researchers from Stanford University and Adobe Research, enables interactive, high-fidelity editing of 3D Gaussian Splatting assets directly from 2D inputs. It achieves identity-preserving local and global modifications at sub-second speeds, bridging a critical gap in 3D content creation workflows.

View blog
Resources
Less Is More for Multi-Step Logical Reasoning of LLM Generalisation Under Rule Removal, Paraphrasing, and Compression
06 Dec 2025
Large language models (LLMs) excel across many natural language tasks, yet their generalisation to structural perturbations in logical contexts remains poorly understood. We introduce a controlled evaluation framework that probes reasoning reliability through four targeted stress tests: (1) rule deletion, removing either redundant or essential rules from a multi-step inference chain; (2) contradictory evidence injection; (3) logic-preserving rewrites generated through several families of equivalence laws (contrapositive, double negation, implication, De Morgan, identity, and commutativity); and (4) multi-law equivalence stacking that introduces 2-5 simultaneous logical transformations. Across three representative model families (BERT, Qwen2, and LLaMA-like models), our experiments reveal a strikingly consistent pattern: all models achieve perfect accuracy on the base tasks and generalise fully under redundant rule deletion and all equivalence-based rewrites (single or multi-law), but fail sharply under essential rule deletion (dropping to 25% accuracy) and collapse completely in the presence of explicit contradictions (0% accuracy). These results demonstrate that LLMs possess stable invariance to semantics-preserving logical transformations, yet remain fundamentally brittle to missing or conflicting evidence. Our framework provides a clean diagnostic tool for isolating such reasoning failure modes and highlights persistent gaps in the logical generalisation abilities of current LLMs.
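A self-contained toy version of the rule-deletion probe (the rules below are invented, not drawn from the paper's benchmark): forward chaining decides entailment, and each rule is classified as redundant or essential by checking whether the goal survives its removal.

```python
# Toy version of the rule-deletion probe (rules invented for illustration):
# forward chaining decides entailment; a rule is redundant if the goal is
# still derivable without it, essential otherwise.
def entails(facts, rules, goal):
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and all(p in known for p in premises):
                known.add(conclusion)
                changed = True
    return goal in known

facts = {"wet", "cold"}
rules = [
    ({"wet", "cold"}, "ice"),        # needed to reach the goal
    ({"ice"}, "slippery"),           # needed to reach the goal
    ({"wet"}, "humid"),              # irrelevant to the goal
]
goal = "slippery"

assert entails(facts, rules, goal)
for i, rule in enumerate(rules):
    reduced = rules[:i] + rules[i + 1:]
    status = "redundant" if entails(facts, reduced, goal) else "essential"
    print(f"rule {i} {rule}: {status}")
```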
View blog
Resources
Zoom in, Click out: Unlocking and Evaluating the Potential of Zooming for GUI Grounding
05 Dec 2025

Researchers from Peking University, Princeton University, and other institutions developed ZoomClick, a training-free inference strategy that enhances GUI grounding accuracy by dynamically zooming into relevant regions, and GUIZoom-Bench, a diagnostic benchmark for evaluating zoom behaviors. ZoomClick achieved a new state-of-the-art of 73.1% accuracy on ScreenSpot-Pro and a 66.7% relative improvement on UI-Vision, enabling smaller models to outperform larger unaugmented counterparts.
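A sketch of a generic zoom-in grounding loop; the `ground` function is a hypothetical placeholder for a VLM call and is not part of ZoomClick. The crop-and-remap bookkeeping, mapping the final in-crop prediction back to full-screen coordinates, is the concrete part.

```python
# Sketch of a generic zoom-in grounding loop. `ground(image, instruction)` is
# a hypothetical stand-in for a GUI-grounding VLM call (not ZoomClick itself);
# the crop-and-remap arithmetic is the point of the sketch.
from PIL import Image

def ground(image, instruction):
    # Placeholder: a real system would query a VLM for a click point here.
    w, h = image.size
    return (w // 2, h // 2)

def zoom_click(image, instruction, rounds=2, zoom=0.5):
    off_x, off_y = 0, 0
    region = image
    for _ in range(rounds):
        x, y = ground(region, instruction)
        w, h = region.size
        # Crop a window around the predicted point and zoom in on it.
        cw, ch = int(w * zoom), int(h * zoom)
        left = min(max(x - cw // 2, 0), w - cw)
        top = min(max(y - ch // 2, 0), h - ch)
        region = region.crop((left, top, left + cw, top + ch))
        off_x, off_y = off_x + left, off_y + top
    x, y = ground(region, instruction)
    # Map the final in-crop prediction back to full-screen coordinates.
    return off_x + x, off_y + y

screenshot = Image.new("RGB", (1920, 1080))
print(zoom_click(screenshot, "click the save button"))
```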

View blog
Resources
ProSocialAlign: Preference Conditioned Test Time Alignment in Language Models
06 Dec 2025
Current language model safety paradigms often fall short in emotionally charged or high-stakes settings, where refusal-only approaches may alienate users and naive compliance can amplify risk. We propose ProSocialAlign, a test-time, parameter-efficient framework that steers generation toward safe, empathetic, and value-aligned responses without retraining the base model. We formalize five human-centered objectives and cast safety as lexicographic constrained generation: first, applying hard constraints to eliminate harmful continuations; then optimizing for prosocial quality within the safe set. Our method combines (i) directional regulation, a harm-mitigation mechanism that subtracts a learned "harm vector" in parameter space, and (ii) preference-aware autoregressive reward modeling trained jointly across attributes with gradient conflict resolution, enabling fine-grained, user-controllable decoding. Empirical evaluations across five safety benchmarks demonstrate state-of-the-art performance, reducing unsafe leakage and boosting alignment to human values, with strong gains across multiple evaluation metrics. ProSocialAlign offers a robust and modular foundation for generating context-sensitive, safe, and human-aligned responses at inference time.
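The directional-regulation idea resembles task-vector arithmetic in parameter space. The sketch below shows only that subtraction, on tiny dummy torch modules; the paper's harm vector is learned, and its construction is not reproduced here.

```python
# Sketch of generic task-vector subtraction in parameter space (the paper's
# harm vector is learned; the two Linear layers here are only dummies).
import torch
import torch.nn as nn

def subtract_harm_vector(base_state, harmful_state, alpha=1.0):
    # theta_safe = theta_base - alpha * (theta_harmful - theta_base)
    return {k: base_state[k] - alpha * (harmful_state[k] - base_state[k])
            for k in base_state}

torch.manual_seed(0)
base = nn.Linear(4, 2)
harmful = nn.Linear(4, 2)   # stands in for a model fine-tuned on harmful behaviour

safe = nn.Linear(4, 2)
safe.load_state_dict(subtract_harm_vector(base.state_dict(),
                                          harmful.state_dict(), alpha=0.5))

x = torch.randn(1, 4)
print("base output:   ", base(x).detach().numpy())
print("steered output:", safe(x).detach().numpy())
```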
View blog
Resources
TextGuider: Training-Free Guidance for Text Rendering via Attention Alignment
Despite recent advances, diffusion-based text-to-image models still struggle with accurate text rendering. Several studies have proposed fine-tuning or training-free refinement methods for accurate text rendering. However, the critical issue of text omission, where the desired text is partially or entirely missing, remains largely overlooked. In this work, we propose TextGuider, a novel training-free method that encourages accurate and complete text appearance by aligning textual content tokens and text regions in the image. Specifically, we analyze attention patterns in MM-DiT models, particularly for text-related tokens intended to be rendered in the image. Leveraging this observation, we apply latent guidance during the early stage of denoising steps based on two loss functions that we introduce. Our method achieves state-of-the-art performance in test-time text rendering, with significant gains in recall and strong results in OCR accuracy and CLIP score.
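An abstract sketch with dummy tensors of what attention-alignment guidance looks like mechanically: a loss rewards attention mass from text-content tokens inside the intended text region, and its gradient nudges the latents. The real method hooks MM-DiT attention inside a diffusion sampler and uses two specific losses, neither of which is reproduced here.

```python
# Abstract sketch with dummy tensors (no real diffusion model or MM-DiT
# attention hooks): a loss pulls attention mass from text tokens into a
# target text box, and one gradient step updates the latents.
import torch

torch.manual_seed(0)
latents = torch.randn(1, 16, 32, 32, requires_grad=True)

def fake_attention(latents, n_text_tokens=4):
    # Placeholder for (tokens, H, W) attention maps that a real sampler
    # would expose from its cross/joint attention layers.
    maps = torch.einsum("bchw,tc->bthw", latents, torch.randn(n_text_tokens, 16))
    return torch.softmax(maps.flatten(2), dim=-1).view(1, n_text_tokens, 32, 32)

region = torch.zeros(32, 32)
region[20:28, 4:28] = 1.0                      # intended text box

attn = fake_attention(latents)
inside = (attn * region).sum(dim=(-1, -2))     # attention mass inside the box
loss = -(inside.clamp_min(1e-6).log()).mean()  # reward mass inside the box

loss.backward()
with torch.no_grad():
    latents -= 0.1 * latents.grad              # one guidance step on the latents
print("attention mass in region per token:", inside.detach().squeeze().tolist())
```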
View blog
Resources
ReJump: A Tree-Jump Representation for Analyzing and Improving LLM Reasoning
09 Dec 2025

Researchers introduce ReJump, a dual tree-jump representation that models LLM reasoning by capturing hierarchical problem-solving steps and dynamic action flows, including backtracking. This framework enables diagnosing reasoning inefficiencies and improving model performance on complex tasks through guided test-time selection.
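A hypothetical data-structure sketch (field names invented, not the paper's schema): a tree over problem-solving steps plus jump edges that record the dynamic flow of the trace, such as backtracking from one subgoal to another.

```python
# Hypothetical sketch of a tree-plus-jumps trace representation
# (field names are invented, not the paper's schema).
from dataclasses import dataclass, field

@dataclass
class StepNode:
    text: str
    children: list = field(default_factory=list)

@dataclass
class ReasoningTrace:
    root: StepNode
    jumps: list = field(default_factory=list)   # (from_node, to_node, kind)

    def backtracks(self):
        return [j for j in self.jumps if j[2] == "backtrack"]

root = StepNode("solve equation")
branch_a = StepNode("try factoring")
branch_b = StepNode("use quadratic formula")
root.children += [branch_a, branch_b]

trace = ReasoningTrace(root)
trace.jumps.append((branch_a, branch_b, "backtrack"))   # abandoned factoring
print("number of backtracks:", len(trace.backtracks()))
```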

View blog
Resources
Prompting Science Report 4: Playing Pretend: Expert Personas Don't Improve Factual Accuracy
05 Dec 2025

This study from Generative AI Labs at The Wharton School empirically demonstrates that assigning expert personas to large language models generally does not improve their factual accuracy on challenging objective questions from benchmarks like GPQA Diamond and MMLU-Pro. The research found that most persona conditions yielded performance statistically similar to a baseline without personas, with low-knowledge personas often decreasing accuracy and some mismatched expert roles causing models to refuse to answer.

View blog
Resources
Adapting Like Humans: A Metacognitive Agent with Test-time Reasoning
28 Nov 2025

Metacognitive Test-time Reasoning (MCTR) imbues Vision-Language Models with human-like fluid intelligence through a dual-level metacognitive architecture and test-time reinforcement learning. This framework achieves state-of-the-art zero-shot adaptation, securing 9 out of 12 top-1 results on unseen Atari games and improving average unseen performance by 275% over the SFT baseline.

View blog
Resources
Plantain: Plan-Answer Interleaved Reasoning
02 Dec 2025

Google DeepMind's Plantain framework enables large language models to interleave planning with intermediate answers, addressing user experience issues in reasoning tasks. This approach reduces the time-to-first-response by over 60% and maintains or improves task accuracy across various benchmarks by allowing early user intervention.

View blog
Resources
Toward More Reliable Artificial Intelligence: Reducing Hallucinations in Vision-Language Models
08 Dec 2025
Vision-language models (VLMs) frequently generate hallucinated content: plausible but incorrect claims about image content. We propose a training-free self-correction framework that enables VLMs to iteratively refine responses through uncertainty-guided visual re-attention. Our method combines multidimensional uncertainty quantification (token entropy, attention dispersion, semantic consistency, claim confidence) with attention-guided cropping of under-explored regions. Operating entirely with frozen, pretrained VLMs, our framework requires no gradient updates. We validate our approach on the POPE and MMHal-Bench benchmarks using the Qwen2.5-VL-7B [23] architecture, with plans to extend validation across diverse architectures in future versions. Experimental results demonstrate that our method reduces hallucination rates by 9.8 percentage points compared to the baseline, while improving object existence accuracy by 4.7 points on adversarial splits. Furthermore, qualitative analysis confirms that uncertainty-guided re-attention successfully grounds corrections in visual evidence where standard decoding fails. We release our code and methodology to facilitate future research in trustworthy multimodal systems.
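A sketch of the control loop only: the uncertainty signals below run on dummy data, and `query_vlm` / `attention_heatmap` are hypothetical placeholders for calls to the frozen VLM, not the paper's code.

```python
# Sketch of the uncertainty-guided re-attention loop. The uncertainty signals
# use dummy data; `query_vlm` and `attention_heatmap` are hypothetical
# placeholders for the frozen VLM calls.
import numpy as np

rng = np.random.default_rng(0)

def token_entropy(probs):
    # Mean next-token entropy over the generated answer.
    return float(np.mean([-(p * np.log(p + 1e-9)).sum() for p in probs]))

def semantic_consistency(samples):
    # Crude agreement proxy: fraction of resampled answers equal to the first.
    return sum(s == samples[0] for s in samples) / len(samples)

def query_vlm(image, question, crop=None):
    return "a red cup on the table"             # placeholder answer

def attention_heatmap(image, question):
    return rng.random((12, 12))                 # placeholder attention map

def answer_with_recheck(image, question, entropy_thr=2.0, consistency_thr=0.6):
    answer = query_vlm(image, question)
    probs = rng.dirichlet(np.ones(50), size=20)           # dummy per-token distributions
    samples = [query_vlm(image, question) for _ in range(4)]
    uncertain = (token_entropy(probs) > entropy_thr
                 or semantic_consistency(samples) < consistency_thr)
    if uncertain:
        heat = attention_heatmap(image, question)
        y, x = np.unravel_index(np.argmin(heat), heat.shape)  # least-attended cell
        answer = query_vlm(image, question, crop=(x, y))      # re-ask about that region
    return answer

print(answer_with_recheck(image=None, question="What is on the table?"))
```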
View blog
Resources