alphaXiv

computer-science artificial-intelligence computer-vision-and-pattern-recognition

Representation Autoencoders (RAEs) redefine the latent space for Diffusion Transformers (DiT) by pairing frozen, pretrained visual encoders with lightweight decoders. The framework achieves state-of-the-art image generation, with an FID of 1.13 on ImageNet 512x512, and converges up to 47x faster than prior DiT models.

14 Nov 2025

LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics

Brown University Meta, FAIR

LeJEPA (Latent-Euclidean Joint-Embedding Predictive Architecture) introduces a self-supervised learning framework based on a provably optimal isotropic Gaussian embedding distribution. It uses a heuristics-free regularization method, SIGReg, and achieves robust, scalable performance, including in-domain pretraining results that surpass larger, generically pretrained models.

02 Mar 2023

computer-science computer-vision-and-pattern-recognition machine-learning

Scalable Diffusion Models with Transformers

UC Berkeley

computer-science continual-learning computer-vision-and-pattern-recognition

This paper introduces Diffusion Transformers (DiTs), a new class of diffusion models that replace the conventional U-Net backbone with a transformer architecture. By leveraging the scalability of transformers, DiTs achieve new state-of-the-art Fréchet Inception Distance (FID) scores on class-conditional ImageNet at 256x256 (2.27 FID) and 512x512 (3.04 FID) resolutions while being more compute-efficient.

06 Nov 2025

Cambrian-S: Towards Spatial Supersensing in Video

computer-science computation-and-language machine-learning

Ellis Brown

Researchers from NYU and Stanford introduce a "spatial supersensing" hierarchy for video-based Multimodal Large Language Models (MLLMs) and new VSI-SUPER benchmarks to reveal current models' limitations in genuine spatial and temporal reasoning. They develop Cambrian-S, a specialized MLLM trained on a large spatial dataset achieving state-of-the-art on VSI-Bench, and prototype a "predictive sensing" paradigm that leverages prediction error ("surprise") to robustly improve memory management and event segmentation in arbitrarily long videos.
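The surprise-driven event segmentation mentioned above can be illustrated with a toy sketch. It assumes a per-frame prediction error is already available and that a new event starts whenever that error exceeds a fixed threshold; the function name `segment_by_surprise`, the `threshold` parameter, and the thresholding rule itself are illustrative assumptions, not the paper's actual procedure.

```python
def segment_by_surprise(errors, threshold):
    """Split frame indices into events, opening a new event at each surprising frame.

    errors: per-frame prediction errors (hypothetical input, one float per frame).
    threshold: assumed cutoff above which a frame counts as a "surprise".
    """
    segments, current = [], []
    for i, err in enumerate(errors):
        # A surprising frame closes the running event and starts a new one.
        if err > threshold and current:
            segments.append(current)
            current = []
        current.append(i)
    if current:
        segments.append(current)
    return segments


if __name__ == "__main__":
    print(segment_by_surprise([0.1, 0.2, 0.9, 0.1, 0.8, 0.1], threshold=0.5))
```

A fixed threshold is the simplest choice; a running statistic of recent errors would be an equally plausible way to decide what counts as surprising.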

12 Apr 2021

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

University College London Facebook AI Research

Tim Rocktäschel

Scott Yih

This paper from Facebook AI Research and University College London introduces Retrieval-Augmented Generation (RAG), a general-purpose model that combines pre-trained parametric language generation with a non-parametric differentiable retriever. RAG achieved state-of-the-art results on multiple knowledge-intensive NLP tasks, demonstrating improved factual accuracy and the ability to update its knowledge by simply swapping out its document index.
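One way to picture how a retriever and a generator combine is to marginalize the generator's answer probability over the retrieved documents: p(y|x) = sum over z of p(z|x) * p(y|x,z). The sketch below is a toy illustration under stated assumptions: inner-product retrieval scores, a softmax over the top-k documents, and a hypothetical `gen_prob` callback standing in for the generator. None of these names come from the paper's code.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def retrieve(query_vec, doc_vecs, k):
    """Return (doc_index, p(z|x)) pairs for the top-k documents by inner product."""
    scored = sorted(((dot(query_vec, d), i) for i, d in enumerate(doc_vecs)),
                    reverse=True)[:k]
    probs = softmax([s for s, _ in scored])
    return [(i, p) for (_, i), p in zip(scored, probs)]


def marginal_answer_prob(query_vec, doc_vecs, gen_prob, k=2):
    """p(y|x) = sum_z p(z|x) * p(y|x,z), restricted to the top-k documents.

    gen_prob(i) is a hypothetical stand-in for the generator's probability of
    the answer given the query and document i.
    """
    return sum(p * gen_prob(i) for i, p in retrieve(query_vec, doc_vecs, k))


if __name__ == "__main__":
    index = [[2.0, 0.0], [0.0, 1.0], [2.0, 0.0]]   # toy document index
    answer_given_doc = [0.2, 0.9, 0.6]             # toy generator outputs
    print(marginal_answer_prob([1.0, 0.0], index, lambda i: answer_given_doc[i]))
```

In this sketch `doc_vecs` plays the role of the document index: passing a different list changes what the model can draw on without touching the generator.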

30 Jul 2024

active-learning computer-science computation-and-language

Designing Informative Metrics for Few-Shot Example Selection

University of Illinois at Urbana-Champaign

attention-mechanisms computer-science artificial-intelligence

Researchers from the University of Illinois Urbana-Champaign and New York University introduce a training-free method for few-shot example selection in sequence tagging tasks. The method scores candidate examples with a complexity metric based on semantic similarity, length, and label diversity. It achieved an 88.76 F1 score for few-shot NER with GPT-4, surpassing the prior state of the art by 5 points absolute, and produced substantial gains for smaller language models.

28 Nov 2024

DENIAHL: In-Context Features Influence LLM Needle-In-A-Haystack Abilities

New York University Databricks

Raghav Mantri

NYU researchers introduced DENIAHL, a benchmark systematically evaluating how data characteristics like type, length, and patterns affect Large Language Models' ability to recall information from long contexts. The study found smaller models are highly sensitive to these features, while GPT-3.5 demonstrated robust recall, though its performance still degraded with combinations of long item lengths and mixed data types.

05 Dec 2024

computer-science artificial-intelligence few-shot-learning

Enhancing Mathematical Reasoning in LLMs with Background Operators

computer-science computer-vision-and-pattern-recognition machine-learning

Researchers at NYU Shanghai enhance mathematical reasoning in LLMs by training them to generate verifiable Prolog code using a set of 54 standardized background mathematical operators. Their cross-validated self-training algorithm on the MATH-Prolog corpus achieved 84.8% accuracy on competition-level problems, with improved solution computability and diversity.

18 Jun 2025

Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think

KAIST

New York University Korea University Scaled Foundations

Sihyun Yu

This research introduces REPresentation Alignment (REPA), a regularization technique that accelerates the training and enhances the generation quality of diffusion transformers by explicitly aligning their internal representations with features from pretrained self-supervised visual encoders. The method enables state-of-the-art generative performance on ImageNet 256x256 while reducing training iterations by over 17.5 times compared to baseline models.
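As a rough illustration of what aligning internal representations with encoder features can look like, the sketch below adds a regularizer to a denoising loss. It assumes the alignment term is the negative mean patch-wise cosine similarity between projected hidden states and encoder features, weighted by a coefficient `lam`; the function names, the cosine choice, and the default weight are assumptions made for this example, not details taken from the summary.

```python
import math


def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den


def alignment_loss(projected_hidden, encoder_feats):
    """Negative mean cosine similarity over patches (lower means better aligned).

    projected_hidden: per-patch hidden states after a projection head (assumed).
    encoder_feats: per-patch features from a frozen pretrained encoder (assumed).
    """
    sims = [cosine(h, y) for h, y in zip(projected_hidden, encoder_feats)]
    return -sum(sims) / len(sims)


def total_loss(denoising_loss, projected_hidden, encoder_feats, lam=0.5):
    """Denoising objective plus the weighted alignment regularizer (lam is hypothetical)."""
    return denoising_loss + lam * alignment_loss(projected_hidden, encoder_feats)


if __name__ == "__main__":
    hidden = [[1.0, 0.0], [0.0, 2.0]]
    target = [[2.0, 0.0], [0.0, 1.0]]
    print(total_loss(1.0, hidden, target))
```

Because cosine similarity ignores vector norms, this form of the regularizer only constrains the direction of each patch representation, not its scale.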

24 Sep 2025

chain-of-thought computer-science artificial-intelligence

Soft Tokens, Hard Truths

University of Amsterdam

computer-science artificial-intelligence computer-vision-and-pattern-recognition

This work introduces a scalable reinforcement learning method for training Large Language Models to generate continuous Chains-of-Thought (CoTs) by injecting noise into mixture embeddings, overcoming prior computational and data dependency issues. Models trained with this approach achieve comparable Pass@1 accuracy, superior Pass@32 performance due to increased reasoning diversity, and improved robustness on out-of-domain tasks compared to discrete CoT training.
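To make "injecting noise into mixture embeddings" concrete, here is a minimal sketch under stated assumptions: a soft token is taken to be the probability-weighted average of token embeddings, and exploration comes from adding isotropic Gaussian noise with scale `sigma`. The function names, the noise distribution, and the toy embedding table are illustrative choices, not the paper's implementation.

```python
import random


def mixture_embedding(probs, embedding_table):
    """Probability-weighted average of token embeddings (a 'soft' token)."""
    dim = len(embedding_table[0])
    return [sum(p * row[j] for p, row in zip(probs, embedding_table))
            for j in range(dim)]


def noisy_soft_token(probs, embedding_table, sigma, rng):
    """Add Gaussian noise to the mixture so repeated rollouts can differ.

    sigma and the Gaussian form are assumptions for this sketch.
    """
    mix = mixture_embedding(probs, embedding_table)
    return [x + rng.gauss(0.0, sigma) for x in mix]


if __name__ == "__main__":
    table = [[0.0, 4.0], [4.0, 0.0]]      # toy 2-token vocabulary
    rng = random.Random(0)
    print(mixture_embedding([0.25, 0.75], table))
    print(noisy_soft_token([0.25, 0.75], table, sigma=0.1, rng=rng))
```

Without noise the mixture is a deterministic function of the token distribution, so a stochastic term of some kind is what gives a reinforcement learning method distinct rollouts to compare.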

13 Apr 2023

Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture

Mila - Quebec AI Institute

McGill University Meta AI (FAIR)

I-JEPA presents a Joint-Embedding Predictive Architecture (JEPA) for self-supervised image learning that predicts abstract representations of masked image blocks. This approach achieves competitive performance on ImageNet-1K linear evaluation and dense prediction tasks while reducing pretraining compute by over 10x compared to prior methods like MAE.

17 Sep 2025

analysis-of-pdes mathematics physics

Discovery of Unstable Singularities

Google DeepMind

Brown University École Polytechnique Fédérale de Lausanne

This research details the first systematic discovery of new families of unstable singularities in canonical fluid systems, achieving unprecedented numerical accuracy including near double-float machine precision for specific solutions. It also reveals empirical asymptotic formulas relating blow-up rates to instability orders, advancing the understanding of fundamental mathematical challenges in fluid dynamics.

03 Dec 2025

computer-science artificial-intelligence machine-learning

A Definition of AGI

The lack of a concrete definition for Artificial General Intelligence (AGI) obscures the gap between today's specialized AI and human-level cognition. This paper introduces a quantifiable framework to address this, defining AGI as matching the cognitive versatility and proficiency of a well-educated adult. To operationalize this, we ground our methodology in Cattell-Horn-Carroll theory, the most empirically validated model of human cognition. The framework dissects general intelligence into ten core cognitive domains (including reasoning, memory, and perception) and adapts established human psychometric batteries to evaluate AI systems. Application of this framework reveals a highly "jagged" cognitive profile in contemporary models. While proficient in knowledge-intensive domains, current AI systems have critical deficits in foundational cognitive machinery, particularly long-term memory storage. The resulting AGI scores (e.g., GPT-4 at 27%, GPT-5 at 57%) concretely quantify both rapid progress and the substantial gap remaining before AGI.

26 May 2025

agents chain-of-thought computer-science

RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning

University of Washington

Imperial College London

Microsoft Singapore Management University

Northwestern University

Researchers from Northwestern University, Stanford, NYU, and Microsoft developed StarPO, a reinforcement learning framework, and RAGEN, a modular system, to train self-evolving large language model agents in multi-turn environments. The work identifies a new instability called the "Echo Trap" and proposes StarPO-S, which uses uncertainty-based filtering and gradient shaping to achieve more robust training and higher success rates across various interactive tasks.

07 Oct 2025

computer-science artificial-intelligence computation-and-language

LLM-JEPA: Large Language Models Meet Joint Embedding Predictive Architectures

Brown University Atlassian

LLM-JEPA introduces a Joint Embedding Predictive Architecture (JEPA)-based training objective for Large Language Models. The objective improves accuracy and robustness against overfitting in both fine-tuning and pretraining, and fosters more structured abstract representations. The method achieved superior performance across models such as Llama, Gemma, and OpenELM on datasets including NL-RX-SYNTH, GSM8K, and Spider, while mitigating computational costs with a novel 'loss dropout' mechanism.

02 Jul 2025

computer-science computer-vision-and-pattern-recognition multi-modal-learning

Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces

computer-science artificial-intelligence computer-vision-and-pattern-recognition

Yale University

Jihan Yang

Researchers from New York University, Yale University, and Stanford University introduced VSI-Bench, a video-based benchmark, to assess the visual-spatial intelligence of multimodal large language models (MLLMs). The study found a substantial performance gap between MLLMs and human capabilities in spatial reasoning, identifying relational reasoning and egocentric-allocentric transformation as key limitations, while demonstrating that explicit cognitive map generation can enhance spatial distance understanding.

26 May 2025

SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

Google DeepMind

computer-science computer-vision-and-pattern-recognition

UC Berkeley HKU

Simon Zhai

Peter Tong

A comparative study by researchers from the University of Hong Kong, UC Berkeley, NYU, and Google DeepMind empirically demonstrates that Reinforcement Learning (RL) promotes generalization to novel rules and visual inputs, while Supervised Fine-Tuning (SFT) tends to induce memorization, particularly in complex reasoning tasks for foundation models like Llama-3.2-Vision-11B. RL improved out-of-distribution performance by up to 61.1% on visual tasks and also enhanced underlying visual recognition capabilities.

10 Oct 2025

Visual Representation Alignment for Multimodal Large Language Models

KAIST

New York University Korea University

ETH Zürich Chung-Ang University

Researchers from KAIST AI and collaborators introduced VIsual Representation ALignment (VIRAL), a novel regularization strategy that directly supervises the visual pathway in Multimodal Large Language Models (MLLMs) by aligning internal representations with Vision Foundation Models. This approach prevents the loss of fine-grained visual information, leading to consistent performance improvements of up to 17.3% on specific vision-centric tasks and a 9.4% average gain over baselines like LLaVA-1.5.

01 Feb 2024

computer-science conversational-ai computation-and-language

Leveraging Implicit Feedback from Deployment Data in Dialogue