Monash University
ByteDance Seed introduced BAGEL, an open-source unified multimodal foundation model trained on trillions of interleaved text, image, and video tokens. This model demonstrates emergent reasoning abilities and achieves state-of-the-art performance among open-source alternatives, narrowing the capability gap with leading proprietary systems.
Researchers from Monash University, VinUniversity, and the University of Cambridge developed PiVe (Prompting with Iterative Verification), a framework that uses a specialized verifier module to iteratively correct semantic graphs generated by Large Language Models (LLMs). This method improved graph generation quality by an average of 26% across multiple datasets and enabled the creation of a high-quality text-graph dataset, GenWiki-HIQ.
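The generate-verify-correct loop at the heart of this approach can be sketched in a few lines. The sketch below is illustrative only, assuming Python; `call_llm` and `call_verifier` are hypothetical stand-ins for the generator LLM and the trained verifier module, not PiVe's actual interfaces.

```python
# Minimal sketch of a PiVe-style generate-verify-correct loop.
# `call_llm` and `call_verifier` are hypothetical placeholders.

def call_llm(prompt: str) -> list[tuple[str, str, str]]:
    """Placeholder: returns a list of (subject, relation, object) triples."""
    raise NotImplementedError

def call_verifier(text: str, graph: list[tuple[str, str, str]]) -> list[str]:
    """Placeholder: returns correction instructions; empty when the graph is complete."""
    raise NotImplementedError

def pive_generate(text: str, max_rounds: int = 3):
    prompt = f"Convert the text into a semantic graph of triples:\n{text}"
    graph = call_llm(prompt)
    for _ in range(max_rounds):
        corrections = call_verifier(text, graph)
        if not corrections:          # verifier found nothing to fix
            break
        # Feed the verifier's feedback back into the generator prompt.
        prompt = (
            f"Text:\n{text}\nCurrent graph:\n{graph}\n"
            "Apply these corrections:\n" + "\n".join(corrections)
        )
        graph = call_llm(prompt)
    return graph
```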
TIME-LLM introduces a reprogramming framework that adapts large language models to general time series forecasting while keeping the LLM backbone frozen. The approach achieves state-of-the-art performance across various benchmarks, excelling particularly in data-scarce few-shot and zero-shot settings.
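The following is a minimal sketch of the reprogramming idea, not the authors' implementation: a small trainable projection maps time-series patches into the frozen LLM's embedding space, and a lightweight head produces the forecast. The patch length, stride, and dimensions are arbitrary assumptions, and the backbone passed in is assumed to already have `requires_grad=False` on its parameters.

```python
import torch
import torch.nn as nn

class TimeSeriesReprogrammer(nn.Module):
    def __init__(self, patch_len=16, stride=8, llm_dim=768, horizon=96):
        super().__init__()
        self.patch_len, self.stride = patch_len, stride
        self.to_llm = nn.Linear(patch_len, llm_dim)   # trainable input projection
        self.head = nn.Linear(llm_dim, horizon)       # trainable forecasting head

    def forward(self, series, frozen_llm):
        # series: (batch, time) -> patches: (batch, n_patches, patch_len)
        patches = series.unfold(-1, self.patch_len, self.stride)
        tokens = self.to_llm(patches)                 # pseudo-token embeddings
        hidden = frozen_llm(tokens)                   # frozen backbone; gradients still flow to the projection
        return self.head(hidden[:, -1])               # forecast from the last hidden state

# Toy usage with an identity stand-in for the frozen backbone.
model = TimeSeriesReprogrammer()
print(model(torch.randn(4, 512), frozen_llm=nn.Identity()).shape)  # torch.Size([4, 96])
```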
Multi-Agent Path Finding (MAPF) is a fundamental problem in robotics that asks us to compute collision-free paths for a team of agents, all moving across a shared map. Despite a large body of work on the topic, all current algorithms struggle as the number of agents grows. The principal reason is that existing approaches typically plan free-flow optimal paths, which creates congestion. To tackle this issue, we propose a new approach for MAPF where agents are guided to their destinations by following congestion-avoiding paths. We evaluate the idea in two large-scale settings: one-shot MAPF, where each agent has a single destination, and lifelong MAPF, where agents are continuously assigned new destinations. Empirically, we report large improvements in solution quality for one-shot MAPF and in overall throughput for lifelong MAPF.
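The congestion-avoiding guidance idea can be illustrated with a short sketch; this is not the paper's algorithm, ignores the time dimension and collision resolution, and assumes the goal is reachable. Agents are planned sequentially, and each edge's cost grows with how many earlier agents already use it.

```python
import heapq
from collections import defaultdict

def plan_path(grid, start, goal, edge_usage, penalty=1.0):
    """Dijkstra on a 4-connected grid; edge_usage[(u, v)] counts prior traffic."""
    dist, prev = {start: 0.0}, {}
    pq = [(0.0, start)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == goal:
            break
        if d > dist[u]:
            continue
        r, c = u
        for v in [(r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)]:
            if v not in grid:                      # grid = set of free cells
                continue
            nd = d + 1.0 + penalty * edge_usage[(u, v)]
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    path, node = [], goal
    while node != start:
        path.append(node)
        node = prev[node]
    return [start] + path[::-1]

def plan_team(grid, tasks):
    """Plan agents one after another, accumulating congestion on used edges."""
    edge_usage, paths = defaultdict(int), []
    for start, goal in tasks:
        path = plan_path(grid, start, goal, edge_usage)
        for u, v in zip(path, path[1:]):
            edge_usage[(u, v)] += 1
        paths.append(path)
    return paths

free = {(r, c) for r in range(4) for c in range(4)}
print(plan_team(free, [((0, 0), (3, 3)), ((0, 3), (3, 0))]))
```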
GFM-RAG introduces the first graph foundation model specifically designed for Retrieval Augmented Generation (RAG), leveraging a query-dependent Graph Neural Network to capture complex, multi-hop knowledge relationships. This model achieves state-of-the-art retrieval and question answering performance on diverse datasets and generalizes to unseen domains without fine-tuning, significantly enhancing LLM reasoning capabilities.
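As a rough illustration of query-dependent propagation over a graph, the sketch below scores nodes by spreading seed relevance along edges whose relation embeddings match the query. It is loosely inspired by query-conditioned message passing and is not GFM-RAG's architecture; the dense tensors and hop count are purely illustrative assumptions.

```python
import numpy as np

def query_dependent_propagation(adj, rel_emb, query_emb, seed_scores, hops=3):
    """
    adj:         (n, n) 0/1 adjacency of the knowledge/document graph
    rel_emb:     (n, n, d) embedding of the relation on each edge (zeros where no edge)
    query_emb:   (d,) query embedding
    seed_scores: (n,) initial relevance from lexical/dense retrieval
    """
    # Edge gates: how relevant each edge's relation is to this query.
    gate = adj * (rel_emb @ query_emb)          # (n, n)
    scores = seed_scores.copy()
    for _ in range(hops):                       # one hop of propagation per layer
        scores = np.maximum(scores, gate.T @ scores / (adj.sum(0) + 1e-9))
    return scores                               # higher = more likely multi-hop relevant
```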
VolSplat introduces a voxel-aligned prediction paradigm for feed-forward 3D Gaussian Splatting, aggregating 2D features into a 3D voxel grid to predict Gaussian parameters. This approach significantly enhances geometric consistency, robustness, and rendering quality, outperforming prior pixel-aligned methods on benchmarks like RealEstate10K and ScanNet.
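A rough sketch of the voxel-aligned aggregation step is shown below, under simplifying assumptions (pinhole cameras, known per-pixel depth, mean pooling); per-voxel Gaussian parameters would then be predicted from the resulting voxel features. The function and variable names are illustrative, not VolSplat's code.

```python
import numpy as np

def aggregate_to_voxels(points_world, feats_2d, grid_min, voxel_size, grid_res):
    """
    points_world: (N, 3) 3D points unprojected from all source-view pixels
    feats_2d:     (N, C) the 2D feature sampled at each source pixel
    Returns a (grid_res**3, C) grid of mean features and a per-voxel count.
    """
    idx = np.floor((points_world - grid_min) / voxel_size).astype(int)
    valid = np.all((idx >= 0) & (idx < grid_res), axis=1)
    idx, feats = idx[valid], feats_2d[valid]
    flat = idx[:, 0] * grid_res**2 + idx[:, 1] * grid_res + idx[:, 2]

    voxel_feat = np.zeros((grid_res**3, feats.shape[1]))
    counts = np.zeros(grid_res**3)
    np.add.at(voxel_feat, flat, feats)          # scatter-add features into voxels
    np.add.at(counts, flat, 1.0)
    voxel_feat /= np.maximum(counts, 1.0)[:, None]
    return voxel_feat, counts
```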
LIGHTFUSION introduces a double fusion framework that integrates pre-trained understanding and generation models to achieve unified multimodal capabilities. This approach delivers competitive performance across multimodal understanding, text-to-image generation, and image editing tasks using significantly fewer training tokens (approximately 35 billion) compared to existing large-scale unified models.
Researchers developed UniMedVL, a unified medical foundation model capable of simultaneously performing both understanding and generation tasks within a single architecture, leveraging the UniMed-5M multimodal dataset and a progressive curriculum learning strategy. The model achieves superior performance across diverse medical visual understanding benchmarks and demonstrates high-fidelity generation and seamless execution of complex interleaved multimodal tasks.
Youtu-GraphRAG introduces a vertically unified agentic paradigm that jointly optimizes graph construction and retrieval for large language models, significantly enhancing complex reasoning accuracy and reducing token consumption by up to 90.71% across various benchmarks, while mitigating knowledge leakage through novel evaluation datasets.
Researchers developed SecureAgentBench, a benchmark with 105 real-world, repository-level tasks, to evaluate LLM-powered code agents' ability to generate secure code. Evaluations show that current agents achieve a mere 9.2% success rate in producing functionally correct and secure solutions, frequently introducing novel vulnerabilities and struggling even with explicit security guidance.
A survey charts the recent trajectory of Compositional Visual Reasoning (CVR) from 2023 to 2025, introducing a five-stage taxonomy to explain its evolution and distinct advantages over monolithic approaches. The work systematically reviews over 260 papers, identifying key benefits such as enhanced interpretability and robustness, while also outlining persistent open challenges and future research directions for the field.
The Graph-constrained Reasoning (GCR) framework integrates Knowledge Graph (KG) structure directly into Large Language Model (LLM) decoding, achieving 100% faithful reasoning without hallucinations on KGQA tasks. This approach consistently outperforms state-of-the-art methods on benchmarks like WebQuestionsSP and Complex WebQuestions by up to 9.1% while being significantly more efficient than agent-based approaches.
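The constraint idea can be illustrated with a short sketch (not GCR's implementation): at each decoding step the candidate continuations are limited to relations that actually leave the current entity, so generated reasoning paths stay grounded in the KG. The scorer below is a dummy stand-in for an LLM log-probability.

```python
def constrained_path_decode(kg, start_entity, score_fn, max_hops=3):
    """
    kg:       dict mapping entity -> list of (relation, next_entity) edges
    score_fn: callable(path, relation) -> float, e.g. an LLM log-probability
    """
    entity, path = start_entity, []
    for _ in range(max_hops):
        candidates = kg.get(entity, [])
        if not candidates:                       # no outgoing edges: stop
            break
        # Only KG-valid relations are scored; hallucinated edges cannot appear.
        relation, next_entity = max(candidates, key=lambda e: score_fn(path, e[0]))
        path.append((entity, relation, next_entity))
        entity = next_entity
    return path

# Toy usage with a hand-written KG and a dummy scorer.
kg = {"Melbourne": [("located_in", "Australia"), ("has_university", "Monash")],
      "Australia": [("capital", "Canberra")]}
print(constrained_path_decode(kg, "Melbourne", lambda p, r: len(r)))
```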
Meituan's LongCat-Flash-Omni is a 560-billion-parameter open-source omni-modal model that processes text, image, video, and audio to enable real-time audio-visual interaction. It achieves state-of-the-art performance on various multimodal benchmarks and shows highly competitive results against leading proprietary models.
MVSplat presents an efficient, generalizable feed-forward model that generates high-quality 3D Gaussian Splatting representations from sparse multi-view images. It achieves state-of-the-art visual quality with over 2x faster inference (22 fps) and a 10x smaller model size (12M parameters) than prior methods by integrating multi-view stereo cost volumes for robust 3D geometry estimation.
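The geometric core of this line of work, a plane-sweep cost volume, can be sketched simply; the snippet below is an illustration under simplifying assumptions, where `warp_to_reference` is a placeholder for the homography warp of source features at a given depth plane, not MVSplat's code.

```python
import numpy as np

def build_cost_volume(ref_feat, src_feat, depths, warp_to_reference):
    """
    ref_feat, src_feat: (H, W, C) per-view feature maps
    depths:             list of candidate depth values
    Returns a (D, H, W) volume of feature correlations.
    """
    H, W, C = ref_feat.shape
    volume = np.empty((len(depths), H, W))
    for d, depth in enumerate(depths):
        warped = warp_to_reference(src_feat, depth)            # (H, W, C)
        volume[d] = (ref_feat * warped).sum(-1) / np.sqrt(C)   # correlation score
    return volume

def depth_from_volume(volume, depths):
    ex = np.exp(volume - volume.max(0))
    probs = ex / ex.sum(0)                                     # per-pixel softmax over depths
    return np.tensordot(np.asarray(depths), probs, axes=1)     # soft-argmax depth map
```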
The Reasoning on Graphs (RoG) framework enhances Large Language Model (LLM) reasoning by integrating Knowledge Graph (KG) structural information as explicit reasoning plans. It achieves state-of-the-art performance on KGQA benchmarks, improving Hits@1 by 22.3% and F1 by 14.4% on CWQ, while providing faithful and interpretable explanations grounded in KG paths.
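The retrieval step in a RoG-style planning-retrieval-reasoning pipeline can be sketched as follows: given a relation path proposed by the LLM as a plan, walk the KG from the question entity and collect all entity paths that realize the plan. This is a hedged sketch with a toy KG, not the authors' code.

```python
def retrieve_paths(kg, start_entity, relation_plan):
    """
    kg:            dict mapping entity -> list of (relation, next_entity) edges
    relation_plan: e.g. ["born_in", "capital_of"], produced by the planning LLM
    """
    frontier = [[start_entity]]
    for relation in relation_plan:
        next_frontier = []
        for path in frontier:
            for rel, nxt in kg.get(path[-1], []):
                if rel == relation:              # follow only the planned relation
                    next_frontier.append(path + [nxt])
        frontier = next_frontier
    return frontier                              # grounded paths fed back to the LLM

kg = {"Alan_Turing": [("born_in", "London")],
      "London": [("capital_of", "United_Kingdom")]}
print(retrieve_paths(kg, "Alan_Turing", ["born_in", "capital_of"]))
# [['Alan_Turing', 'London', 'United_Kingdom']]
```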
The BigCode project releases StarCoder 2 models and The Stack v2 dataset, setting a new standard for open and ethically sourced Code LLM development. By prioritizing data quality and efficient architecture over sheer data quantity, StarCoder 2 models, particularly the 15B variant, demonstrate competitive performance across code generation, completion, and reasoning tasks, often outperforming larger closed-source alternatives.
VIDEO-THINKER, a new framework, empowers Multimodal Large Language Models to reason with videos by intrinsically developing temporal grounding and captioning abilities. The model establishes new state-of-the-art performance on various video reasoning benchmarks, achieving up to an 11.44% improvement on the VRBench out-of-domain dataset, while showcasing enhanced temporal localization (48.22% mIoU) and descriptive captioning.