MIT-IBM Watson AI Lab
IBM researchers successfully deployed a large-scale generative AI system to automate real-time text content production for major sports and music events, leveraging various large language models and adaptation techniques. The system achieved a 15x speed improvement in content generation, supporting 90 million fans globally across events like The Masters, Wimbledon, US Open, ESPN Fantasy Football, and the GRAMMY Awards.
The paper introduces TOUCAN, a dataset of 1.5 million tool-agentic trajectories synthesized from real-world Model Context Protocol environments to address the scarcity of high-quality, open-source training data. Models fine-tuned on TOUCAN demonstrated enhanced agentic capabilities, outperforming larger closed-source models on complex benchmarks like BFCL V3 (achieving 70.45% overall) and MCP-Universe.
Researchers from MIT and MIT-IBM Watson AI Lab developed a hardware-efficient, parallelized training algorithm for DeltaNet, a linear transformer model that uses the delta rule for improved associative recall. This advancement enables DeltaNet to achieve competitive language modeling performance, surpass other linear models on specific recall tasks, and demonstrate further improvements in hybrid architectures combining it with traditional attention.
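For intuition, here is a minimal NumPy sketch of the sequential delta-rule recurrence that the paper parallelizes; the function and variable names are illustrative assumptions, and the actual training algorithm uses a chunkwise, hardware-efficient reformulation rather than this token-by-token loop.

```python
# Sequential reference of a delta-rule linear-attention update (illustrative sketch,
# not the paper's parallel algorithm).
import numpy as np

def deltanet_recurrence(Q, K, V, beta):
    """S_t = S_{t-1} + b_t (v_t - S_{t-1} k_t) k_t^T,  o_t = S_t q_t."""
    d_k, d_v = K.shape[1], V.shape[1]
    S = np.zeros((d_v, d_k))                 # associative memory mapping keys to values
    outs = []
    for q, k, v, b in zip(Q, K, V, beta):
        v_old = S @ k                        # value currently stored under key k
        S = S + b * np.outer(v - v_old, k)   # delta rule: overwrite rather than just accumulate
        outs.append(S @ q)                   # read out with the query
    return np.stack(outs)

T, d = 8, 4
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))
beta = np.full(T, 0.5)                       # per-token write strengths in [0, 1]
print(deltanet_recurrence(Q, K, V, beta).shape)   # (8, 4)
```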
A Gated Linear Attention (GLA) Transformer is introduced, leveraging FLASHLINEARATTENTION for hardware-efficient training and inference. This approach achieves competitive language modeling performance with LLaMA-like Transformers and other efficient models, while demonstrating higher training throughput and stronger length generalization to sequences over 20,000 tokens.
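A comparable sketch of the gated linear attention recurrence, again with assumed names and a naive sequential loop; FLASHLINEARATTENTION evaluates the same computation chunkwise to stay I/O-aware on GPUs.

```python
# Sequential reference of gated linear attention (illustrative sketch).
import numpy as np

def gla_recurrence(Q, K, V, alpha):
    """S_t = diag(a_t) S_{t-1} + k_t v_t^T,  o_t = S_t^T q_t."""
    d_k, d_v = K.shape[1], V.shape[1]
    S = np.zeros((d_k, d_v))
    outs = []
    for q, k, v, a in zip(Q, K, V, alpha):
        S = a[:, None] * S + np.outer(k, v)    # data-dependent decay per key dimension
        outs.append(S.T @ q)
    return np.stack(outs)

T, d = 8, 4
rng = np.random.default_rng(1)
Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))
alpha = rng.uniform(0.9, 1.0, size=(T, d))     # gates in (0, 1), computed from the input in GLA
print(gla_recurrence(Q, K, V, alpha).shape)    # (8, 4)
```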
3D-VLA introduces a generative world model that integrates 3D scene understanding with language and action generation, enabling embodied AI systems to predict future states and plan actions in 3D environments. It leverages a new 3D embodied instruction dataset and demonstrates improved performance in 3D reasoning, goal generation, and action planning compared to 2D vision-language models.
DeepEvolve, developed by the University of Notre Dame and MIT-IBM Watson AI Lab, integrates external knowledge retrieval with algorithm evolution to automate scientific algorithm discovery, consistently improving initial algorithms across nine diverse scientific benchmarks with performance gains up to 666.02%. This agent also introduces robust capabilities for cross-file code editing and systematic debugging, enhancing the practical implementation of complex algorithmic ideas.
AdaWorld introduces a novel approach to world model learning that extracts context-invariant latent actions from unlabeled videos, enabling efficient adaptation to new environments. This method achieves superior action transfer (e.g., 70.5% human success rate on LIBERO vs. 20% for baseline) and faster world model adaptation with limited data, improving visual planning success rates in game and robotic tasks.
TANGO introduces a framework that concurrently trains an LLM generator and a generative, process-level verifier using reinforcement learning. This approach enhances the reasoning capabilities of 7B/8B-scale models, achieving an average relative improvement of 25.5% on math benchmarks and a 3.3x increase in training efficiency without requiring process-level annotations.
QServe introduces a W4A8KV4 quantization scheme and an optimized inference system to enhance large language model serving efficiency. It achieves 1.2-3.5x higher throughput than TensorRT-LLM and enables a 3x reduction in GPU dollar cost for LLM serving on L40S GPUs.
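To make the precision layout concrete, the toy round-trip below fake-quantizes weights to 4 bits, activations to 8 bits, and the KV cache to 4 bits with symmetric per-channel scales. It only illustrates what W4A8KV4 means; QServe's contribution lies in fused low-precision GPU kernels and quantization details that are not modeled here.

```python
# Illustrative W4A8KV4 fake-quantization round-trip (not QServe's kernels).
import numpy as np

def quantize(x, bits, axis=-1):
    """Symmetric per-channel fake quantization: returns integer codes and scales."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x), axis=axis, keepdims=True) / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q.astype(np.int8), scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
W = rng.normal(size=(128, 64))     # weight matrix
A = rng.normal(size=(4, 128))      # activations for one decoding step
KV = rng.normal(size=(4, 64))      # cached keys/values

w_q, w_s = quantize(W, bits=4, axis=1)      # W4: 4-bit weights
a_q, a_s = quantize(A, bits=8, axis=1)      # A8: 8-bit activations
kv_q, kv_s = quantize(KV, bits=4, axis=1)   # KV4: 4-bit KV cache

# A serving system would run the matmul directly on low-precision operands; here we
# just check the error introduced by the reduced precision.
err = np.mean(np.abs(A @ W - dequantize(a_q, a_s) @ dequantize(w_q, w_s)))
print(float(err))
```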
The 3D-LLM framework from a collaboration including MIT and UMass Amherst enables large language models to understand and reason about the 3D physical world. It achieves this by generating large-scale 3D-language data and deriving 3D features from multi-view 2D images, demonstrating improved performance across tasks like 3D question answering and object grounding compared to prior methods.
This paper introduces PaTH (Position encoding via Accumulating Householder Transformations), a data-dependent position encoding scheme for Transformers that dynamically adapts to input sequences. PaTH significantly outperforms existing methods like Rotary Position Embedding (RoPE) on sequential reasoning tasks requiring sophisticated state tracking and demonstrates superior length extrapolation capabilities on language modeling and long-context benchmarks.
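As a rough sketch of the mechanism (the symbols, the exact range of the product, and the quadratic loop below are assumptions for clarity; the paper derives a FlashAttention-style algorithm instead of materializing these products): each token emits a Householder-like transform H_t = I - beta_t w_t w_t^T, and the score between a query and an earlier key is computed through the accumulated product of the transforms between the two positions, making relative position information input-dependent rather than a fixed rotation as in RoPE.

```python
# Naive O(T^2 d^2) sketch of accumulated Householder-style position encoding.
import numpy as np

def path_scores(Q, K, W, beta):
    """s_ij ~ q_i^T (H_i H_{i-1} ... H_{j+1}) k_j for j <= i, with H_t = I - beta_t w_t w_t^T."""
    T, d = Q.shape
    H = [np.eye(d) - b * np.outer(w, w) for w, b in zip(W, beta)]
    S = np.full((T, T), -np.inf)     # future positions stay masked (causal attention)
    for i in range(T):
        acc = np.eye(d)              # accumulated transform between position j and i
        for j in range(i, -1, -1):
            S[i, j] = Q[i] @ acc @ K[j]
            acc = acc @ H[j]         # extend the product one position further back
    return S

T, d = 6, 4
rng = np.random.default_rng(2)
Q, K, W = (rng.normal(size=(T, d)) for _ in range(3))
beta = rng.uniform(0.0, 2.0, size=T)      # per-token scalars controlling each transform (assumed)
print(path_scores(Q, K, W, beta).shape)   # (6, 6)
```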
Existing methods for extracting reward signals in Reinforcement Learning typically rely on labeled data and dedicated training splits, a setup that contrasts with how humans learn directly from their environment. In this work, we propose TTRV to enhance vision language understanding by adapting the model on the fly at inference time, without the need for any labeled data. Concretely, we enhance the Group Relative Policy Optimization (GRPO) framework by designing rewards based on the frequency of the base model's output, while inferring on each test sample multiple times. Further, we also propose to control the diversity of the model's output by simultaneously rewarding the model for obtaining low entropy of the output empirical distribution. Our approach delivers consistent gains across both object recognition and visual question answering (VQA), with improvements of up to 52.4% and 29.8%, respectively, and average boosts of 24.6% and 10.0% across 16 datasets. Remarkably, on image recognition, TTRV applied to InternVL 8B surpasses GPT-4o by an average of 2.3% over 8 benchmarks, while remaining highly competitive on VQA, demonstrating that test-time reinforcement learning can match or exceed the strongest proprietary models. Finally, we find many interesting properties of test-time RL for VLMs: for example, even in extremely data-constrained scenarios, where adaptation is performed on a single randomly chosen unlabeled test example, TTRV still yields non-trivial improvements of up to 5.5% in recognition tasks.
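A minimal sketch of this label-free reward design, under stated assumptions: the function name and the exact shaping are illustrative, and in the paper such rewards feed into GRPO, which normalizes them within each group of rollouts to form advantages.

```python
# Frequency-based reward with an entropy penalty over N rollouts on one test sample.
from collections import Counter
import math

def ttrv_rewards(answers, entropy_coef=0.1):
    """answers: decoded answers from several samples of the model on the same input."""
    n = len(answers)
    probs = {a: c / n for a, c in Counter(answers).items()}
    entropy = -sum(p * math.log(p) for p in probs.values())
    # Each rollout is rewarded by how often its answer occurs; a shared entropy
    # penalty discourages an overly diverse empirical output distribution.
    return [probs[a] - entropy_coef * entropy for a in answers]

print(ttrv_rewards(["cat", "cat", "dog", "cat"]))   # majority answer gets the higher reward
```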
Label-free CBM introduces a framework that converts any deep neural network into a Concept Bottleneck Model without requiring labeled concept data, thereby enhancing interpretability. The method leverages large language models for automated concept generation and achieves high accuracy comparable to standard models on complex datasets like ImageNet, while providing transparent decision rules.
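Roughly, the data flow can be pictured as below; the real encoders, the similarity-based alignment objective, and the sparsity-inducing training are replaced with random stand-ins, so this only shows where the concept bottleneck and the sparse head sit.

```python
# Schematic data flow of a label-free concept bottleneck model (stand-in arrays only).
import numpy as np

rng = np.random.default_rng(3)
n_images, feat_dim, n_concepts, n_classes = 256, 512, 64, 10

image_feats = rng.normal(size=(n_images, feat_dim))        # backbone image features
concept_layer = rng.normal(size=(feat_dim, n_concepts))    # learned to align with LLM-proposed,
                                                           # text-embedded concepts
concept_acts = image_feats @ concept_layer                 # the interpretable bottleneck

# The final layer is kept sparse so each class depends on a handful of concepts,
# which is what yields readable decision rules; hard thresholding stands in for
# the actual sparse training objective.
W_final = rng.normal(size=(n_concepts, n_classes))
W_sparse = np.where(np.abs(W_final) > 1.5, W_final, 0.0)
logits = concept_acts @ W_sparse
print(logits.shape, int((W_sparse != 0).sum()))
```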
This paper conducts an empirical study on supervised fine-tuning of small language models (3B-7B parameters), demonstrating that simpler training strategies, such as stacked training and larger batch sizes (e.g., 8K) with constant learning rates, can achieve higher performance on benchmarks like MMLU and MTBench compared to conventional phased training approaches. It also shows that early-stage training dynamics, specifically lower gradient norms, can predict better final model performance, enabling computational savings.
LSceneLLM, developed by researchers from South China University of Technology, Tencent Robotics X, and others, presents an adaptive framework for large 3D scene understanding that mimics human visual processing by focusing on task-relevant regions. The approach achieves state-of-the-art performance across large indoor (XR-Scene), single-room indoor (ScanQA), and large outdoor (NuScenes-QA) benchmarks, significantly improving fine-grained detail recognition and setting a new standard for embodied AI applications.
A video reasoning agent, Video-R4, is presented, enabling large multimodal models to perform "visual rumination" by iteratively selecting and refining visual evidence in text-rich videos. This approach achieves state-of-the-art results on video question answering benchmarks and exhibits robust zero-shot generalization across diverse multimodal tasks.
Researchers from the University of Michigan, Pompeu Fabra University, IBM Research, and MIT developed "tinyBenchmarks," a framework that reduces the number of examples needed for Large Language Model (LLM) evaluation by leveraging Item Response Theory. Their approach achieves an average performance estimation error within 2% using as few as 100 examples per scenario, a reduction factor of up to 160 on common benchmarks like the Open LLM Leaderboard.
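A hedged sketch of the Item Response Theory machinery: the 2-parameter-logistic form and the crude grid-search fit below are assumptions for illustration, whereas tinyBenchmarks fits item parameters on existing evaluation results and selects informative anchor examples more carefully.

```python
# Estimate a model's benchmark accuracy from ~100 anchor items with a 2PL IRT model.
import numpy as np

def p_correct(theta, a, b):
    """2PL item response curve: probability of answering item (a, b) correctly."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

rng = np.random.default_rng(4)
a = rng.uniform(0.5, 2.0, size=1000)          # item discriminations (assumed pre-fit)
b = rng.normal(size=1000)                     # item difficulties (assumed pre-fit)
anchor = rng.choice(1000, size=100, replace=False)

true_theta = 0.8                              # unknown ability of the new model
responses = rng.random(100) < p_correct(true_theta, a[anchor], b[anchor])

# Fit theta by maximizing the Bernoulli likelihood on the anchor items
# (grid search stands in for a proper optimizer).
grid = np.linspace(-3, 3, 601)
ll = [np.sum(responses * np.log(p_correct(t, a[anchor], b[anchor]) + 1e-9)
             + (~responses) * np.log(1 - p_correct(t, a[anchor], b[anchor]) + 1e-9))
      for t in grid]
theta_hat = grid[int(np.argmax(ll))]

# Predicted accuracy on the full benchmark from only 100 evaluated examples.
print(round(float(theta_hat), 2), round(float(np.mean(p_correct(theta_hat, a, b))), 3))
```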
MIRA, a multimodal iterative reasoning agent, significantly enhances the ability of open-source diffusion models to perform complex instruction-guided image editing. It achieves this by employing a closed-loop perception–reasoning–action cycle, leading to improved semantic consistency and perceptual quality often comparable to or surpassing proprietary systems.
Researchers from UC Berkeley, CMU, UW-Madison, and other institutions successfully adapt Reinforcement Learning from Human Feedback (RLHF) to large multimodal models (LMMs), introducing Factually Augmented RLHF (Fact-RLHF) to mitigate hallucinations and improve alignment with human preferences. The resulting LLaVA-RLHF model shows a 60% improvement in hallucination reduction on a new benchmark, MMHAL-BENCH, and achieves 95.6% of GPT-4's performance on LLaVA-Bench for general alignment.
TextArena, from A*STAR and collaborators, introduces an open-source framework featuring 74 competitive text-based games to evaluate and train agentic behavior in Large Language Models. It provides a dynamic, relative skill ranking system for LLMs against each other and humans, and supports reinforcement learning, addressing current benchmark limitations.