alphaXiv

The University of Sydney

548

29 Oct 2025

chain-of-thought computer-science computation-and-language

SciReasoner: Laying the Scientific Reasoning Ground Across Disciplines

Shanghai Artificial Intelligence Laboratory

University of Oxford

Fudan University

University of Science and Technology of China

Beihang University

Shanghai Jiao Tong University

The Chinese University of Hong Kong The University of Sydney

SciReasoner, a scientific reasoning large language model, integrates diverse scientific data representations with natural language across multiple disciplines. The model achieved state-of-the-art performance on 54 scientific tasks and ranked among the top-2 on 101 tasks by employing a three-stage training framework that incorporates multi-representation scientific data.

2,204

03 Jun 2025

computer-science computer-vision-and-pattern-recognition efficient-transformers

SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference

UC Berkeley

Fudan University

Peking University The University of Sydney Panasonic Holdings Corporation

SparseVLM introduces a text-guided, training-free framework that enhances the inference efficiency of Vision-Language Models by intelligently pruning redundant visual tokens. The method reduces CUDA inference time by 43.1% and FLOPs by 62.8% on LLaVA-1.5, while maintaining 96.7% of the original model's accuracy.
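The text-guided pruning idea in the summary above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not SparseVLM's actual algorithm: it assumes a text-to-visual attention matrix is already available, scores each visual token by its mean attention from the text tokens, and keeps the top-scoring fraction. The function name `prune_visual_tokens` and the `keep_ratio` parameter are hypothetical.

```python
def prune_visual_tokens(text_to_visual_attention, keep_ratio=0.5):
    """Return indices of visual tokens to keep, in their original order.

    text_to_visual_attention: list of rows, one per text token; each row
    holds that text token's attention weight on every visual token.
    keep_ratio: assumed fraction of visual tokens to retain (hypothetical knob).
    """
    num_text = len(text_to_visual_attention)
    num_visual = len(text_to_visual_attention[0])

    # Relevance of each visual token = mean attention it receives from the text.
    scores = [
        sum(row[j] for row in text_to_visual_attention) / num_text
        for j in range(num_visual)
    ]

    # Keep at least one token, otherwise the top `keep_ratio` fraction.
    num_keep = max(1, int(num_visual * keep_ratio))
    ranked = sorted(range(num_visual), key=lambda j: scores[j], reverse=True)

    # Restore positional order so downstream layers see tokens in sequence.
    return sorted(ranked[:num_keep])


attention = [
    [0.10, 0.60, 0.05, 0.25],  # text token 0
    [0.20, 0.50, 0.10, 0.20],  # text token 1
]
kept = prune_visual_tokens(attention, keep_ratio=0.5)
```

Because the ranking uses only attention weights that the model already computes, a scheme like this needs no additional training, which matches the "training-free" framing of the summary.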

4,309

21 Oct 2024

computer-science computation-and-language knowledge-distillation

A Survey on Knowledge Distillation of Large Language Models

Peking University

Microsoft University of Technology Sydney

University of Maryland

The University of Hong Kong The University of Sydney

Tianyi Zhou

Shawn Xu

This survey systematically reviews Knowledge Distillation (KD) techniques for Large Language Models (LLMs), outlining methods for transferring capabilities from large proprietary models to smaller, more accessible open-source ones. It categorizes KD approaches by algorithms, skill distillation, and verticalization, highlighting the central role of data augmentation and iterative self-improvement for democratizing advanced LLM capabilities.

811

9,646

15 Nov 2025

ai-for-health clustering-algorithms computer-science

MIRROR: Multi-Modal Pathological Self-Supervised Representation Learning via Modality Alignment and Retention

Northwestern Polytechnical University

University of Maryland The University of Sydney

Tianyi Wang

The MIRROR framework introduces a multi-modal self-supervised learning approach for computational pathology, integrating histopathology and transcriptomics by balancing modality alignment with modality-specific information retention and mitigating redundancy through a novel style clustering module. It demonstrates superior performance in cancer subtyping and survival prediction on TCGA cohorts, outperforming existing baselines in various diagnostic tasks.

334

12 Sep 2025

computer-science computer-vision-and-pattern-recognition generative-models

InfGen: A Resolution-Agnostic Paradigm for Scalable Image Synthesis

Shanghai Artificial Intelligence Laboratory The University of Sydney

HKUST

Arbitrary-resolution image generation provides a consistent visual experience across devices and has extensive applications for producers and consumers. The computational demand of current diffusion models grows quadratically with resolution, so generating a 4K image takes over 100 seconds. To solve this, we explore a second generation stage on top of latent diffusion models: the fixed latent generated by the diffusion model is treated as the content representation, and we propose decoding arbitrary-resolution images from this compact latent using a one-step generator. We thus present InfGen, which replaces the VAE decoder with the new generator to produce images at any resolution from a fixed-size latent without retraining the diffusion models. This simplifies the process, reduces computational complexity, and can be applied to any model that uses the same latent space. Experiments show that InfGen can bring many models into the arbitrary high-resolution era while cutting 4K image generation time to under 10 seconds.

1,737

23 Mar 2025

agent-based-systems computer-science computation-and-language

OASIS: Open Agent Social Interaction Simulations with One Million Agents

Shanghai Artificial Intelligence Laboratory

Imperial College London

Fudan University Dalian University of Technology The University of Sydney

KAUST Max Planck Institute Xi'an Jiaotong University Oxford

Shuyue Hu

Prateek Gupta

OASIS presents an open agent social interaction simulator capable of scaling to one million LLM-based agents, designed to mimic real-world social media platforms. The platform successfully replicates and investigates complex social phenomena like information propagation, group polarization, and herd effects, providing a testbed for understanding emergent behaviors at unprecedented scales.

658

1,260

22 May 2025

attention-mechanisms computer-science computer-vision-security

Retrieval-Augmented Perception: High-Resolution Image Perception Meets Visual RAG

Wuhan University

Nanyang Technological University The University of Sydney Shenzhen Campus of Sun Yat-sen University

王文斌

This work introduces Retrieval-Augmented Perception (RAP), a training-free framework that applies Retrieval-Augmented Generation (RAG) principles to enhance Multimodal Large Language Models' (MLLMs) understanding of high-resolution images. RAP significantly improves MLLM performance on fine-grained perception tasks by intelligently retrieving, spatially arranging, and adaptively selecting relevant visual information, leading to an average accuracy increase of 24% on high-resolution image benchmarks while also improving inference efficiency.

230

19 Nov 2025

computer-science contrastive-learning artificial-intelligence

UniME-V2: MLLM-as-a-Judge for Universal Multimodal Embedding Learning

Imperial College London The University of Sydney MiroMind AI

Universal multimodal embedding models are foundational to various tasks. Existing approaches typically employ in-batch negative mining by measuring the similarity of query-candidate pairs. However, these methods often struggle to capture subtle semantic differences among candidates and lack diversity in negative samples. Moreover, the embeddings exhibit limited discriminative ability in distinguishing false and hard negatives. In this paper, we leverage the advanced understanding capabilities of MLLMs to enhance representation learning and present a novel Universal Multimodal Embedding (UniME-V2) model. Our approach first constructs a potential hard negative set through global retrieval. We then introduce the MLLM-as-a-Judge mechanism, which utilizes MLLMs to assess the semantic alignment of query-candidate pairs and generate soft semantic matching scores. These scores serve as a foundation for hard negative mining, mitigating the impact of false negatives and enabling the identification of diverse, high-quality hard negatives. Furthermore, the semantic matching scores are used as soft labels to mitigate the rigid one-to-one mapping constraint. By aligning the similarity matrix with the soft semantic matching score matrix, the model learns semantic distinctions among candidates, significantly enhancing its discriminative capacity. To further improve performance, we propose UniME-V2-Reranker, a reranking model trained on our mined hard negatives through a joint pairwise and listwise optimization approach. We conduct comprehensive experiments on the MMEB benchmark and multiple retrieval tasks, demonstrating that our method achieves state-of-the-art performance on average across all tasks.
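The soft-label step described above (aligning the similarity matrix with the judge's soft score matrix) can be illustrated for a single query. This is a sketch under assumptions: it uses a KL divergence between two softmax distributions, which may differ from the exact loss in the paper, and the names `soft_label_alignment_loss`, `similarities`, and `judge_scores` are hypothetical.

```python
import math


def softmax(values, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [v / temperature for v in values]
    peak = max(scaled)
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_label_alignment_loss(similarities, judge_scores, temperature=1.0):
    """KL(target || predicted) for one query over its candidate set.

    similarities: embedding-model similarity of the query to each candidate.
    judge_scores: soft semantic matching scores from the MLLM judge
                  (assumed to be given; producing them is out of scope here).
    """
    target = softmax(judge_scores, temperature)
    predicted = softmax(similarities, temperature)
    return sum(
        t * math.log(t / p) for t, p in zip(target, predicted) if t > 0
    )


# The loss vanishes when the embedding similarities already rank and weight
# candidates the way the judge does, and grows as they disagree.
aligned = soft_label_alignment_loss([2.0, 1.0, 0.0], [2.0, 1.0, 0.0])
misaligned = soft_label_alignment_loss([0.0, 1.0, 2.0], [2.0, 1.0, 0.0])
```

Unlike a one-hot contrastive target, the soft target lets several candidates carry weight at once, which is how the summary describes relaxing the rigid one-to-one mapping constraint.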

1,004

13 Oct 2025

causal-inference computer-science artificial-intelligence

Discovering and Reasoning of Causality in the Hidden World with Large Language Models

Carnegie Mellon University

The Chinese University of Hong Kong The University of Melbourne MBZUAI The University of Sydney Hong Kong Baptist University

Chenxi Liu

Revealing hidden causal variables alongside the underlying causal mechanisms is essential to the development of science. Despite the progress in the past decades, existing practice in causal discovery (CD) heavily relies on high-quality measured variables, which are usually given by human experts. In fact, the lack of well-defined high-level variables behind unstructured data has been a longstanding roadblock to a broader real-world application of CD. This procedure can naturally benefit from an automated process that can suggest potential hidden variables in the system. Interestingly, large language models (LLMs) are trained on massive observations of the world and have demonstrated great capability in processing unstructured data. To leverage the power of LLMs, we develop a new framework termed Causal representatiOn AssistanT (COAT) that incorporates the rich world knowledge of LLMs to propose useful measured variables for CD with respect to high-value target variables on their paired unstructured data. Instead of directly inferring causality with LLMs, COAT constructs feedback from intermediate CD results to LLMs to refine the proposed variables. Given the target variable and the paired unstructured data, we first develop COAT-MB, which leverages the predictivity of the proposed variables to iteratively uncover the Markov Blanket of the target variable. Built upon COAT-MB, COAT-PAG further extends to uncover a more complete causal graph, i.e., a Partial Ancestral Graph, by iterating over the target variables and actively seeking new high-level variables. Moreover, the reliable CD capabilities of COAT also extend debiased causal inference to unstructured data by discovering an adjustment set. We establish theoretical guarantees for the CD results and verify their efficiency and reliability across realistic benchmarks and real-world case studies.

9,028

05 Mar 2025

agentic-frameworks agents computer-science

MAS-GPT: Training LLMs to Build LLM-based Multi-Agent Systems

Shanghai AI Laboratory

University of Oxford

Shanghai Jiao Tong University The University of Sydney

Rui Ye

Shanghai Jiao Tong University and Shanghai AI Laboratory researchers introduce MAS-GPT, a framework that enables automatic generation of task-specific Multi-Agent Systems through a single LLM inference, demonstrating superior performance across 9 benchmarks while reducing computational costs compared to existing approaches that require multiple LLM calls or manual design.

1,241

24 Mar 2025

agentic-frameworks agents computer-science

AgentDropout: Dynamic Agent Elimination for Token-Efficient and High-Performance LLM-Based Multi-Agent Collaboration

Harbin Institute of Technology The University of Sydney

AgentDropout introduces a dynamic optimization method for LLM-based multi-agent systems, selectively eliminating less critical agents and communication links. This approach achieves substantial reductions in token consumption, such as a 21.4% decrease in prompt tokens with Llama3, while also improving task performance by an average of 2.19 over AgentPrune on reasoning and code generation benchmarks.

2,752

22 May 2025

computer-science artificial-intelligence computation-and-language

R1-ShareVL: Incentivizing Reasoning Capability of Multimodal Large Language Models via Share-GRPO

Tsinghua University

ByteDance

Nanyang Technological University The University of Sydney Beijing University of Posts and Telecommunications

Researchers from Nanyang Technological University, ByteDance, and collaborating institutions develop Share-GRPO, a reinforcement learning framework for multimodal large language models that addresses sparse reward and advantage vanishing problems by expanding question spaces through semantic transformations and sharing reward information across question variants during training, with their R1-ShareVL model achieving substantial improvements on mathematical reasoning benchmarks including 4.2-point gains on MathVista and 3.8-point gains on MathVerse compared to baseline GRPO when applied to Qwen2.5-VL models.

2,730

02 Aug 2025

adversarial-attacks adversarial-robustness computer-science

Safety at Scale: A Comprehensive Survey of Large Model and Agent Safety

This comprehensive survey systematically reviews current safety research across six major large AI model paradigms and autonomous agents, presenting a detailed taxonomy of 10 attack types and corresponding defense strategies. The review identifies a predominant focus on attack methodologies (60% of papers) over defenses and outlines key open challenges for advancing AI safety.

1,054

27 Mar 2025

computer-science computer-vision-and-pattern-recognition efficient-transformers

Lumina-Image 2.0: A Unified and Efficient Image Generative Framework

Shanghai AI Laboratory

Shanghai Jiao Tong University

The Chinese University of Hong Kong The University of Sydney Shanghai Innovation Institute KREA AI

Jiakang Yuan

Shanghai AI Laboratory researchers introduce Lumina-Image 2.0, a unified text-to-image generation framework that combines a novel diffusion transformer architecture with a specialized captioning system, achieving competitive performance on benchmarks while reducing computational costs through efficient training and inference optimizations.

805

1,476

17 Oct 2024

computer-science computation-and-language efficient-transformers

A Comprehensive Overview of Large Language Models

The Chinese University of Hong Kong The University of Melbourne University of Technology Sydney The University of Sydney

Australian National University King Fahd University of Petroleum and Minerals University of Engineering and Technology The University of Western Australia Commonwealth Scientific and Industrial Research Organisation SDAIA-KFUPM Joint Research Center for Artificial Intelligence

This paper synthesizes the extensive and rapidly evolving literature on Large Language Models (LLMs), offering a structured resource on their architectures, training strategies, and applications. It provides a comprehensive overview of existing works, highlighting key design aspects, model capabilities, augmentation strategies, and efficiency techniques, while also discussing challenges and future research directions.

1,124

19 Apr 2025

computer-science computer-vision-security artificial-intelligence

Generative Physical AI in Vision: A Survey

Sungkyunkwan University Central South University The University of Sydney Guangxi Normal University The University of Western Australia

This survey systematically reviews and categorizes generative models in computer vision that produce physically plausible outputs, establishing a taxonomy of explicit and implicit physics-aware generation methods. It details six paradigms for integrating physical simulation, revealing a trend towards functional realism, and identifies current challenges in evaluation metrics.

1,553

13 Oct 2022

computer-science computer-vision-security computer-vision-and-pattern-recognition

ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation

JD Explore Academy The University of Sydney

ViTPose introduces a straightforward approach to human pose estimation by leveraging plain vision transformers as the exclusive backbone, paired with a minimal decoder. The method demonstrates robust performance, achieving 80.9 AP on the MS COCO test-dev set with a single model and showcasing the strong feature representation capabilities of simple ViT architectures.

1,505

1,740

20 May 2021

computer-science machine-learning knowledge-distillation

Knowledge Distillation: A Survey

The University of Sydney Birkbeck College, University of London

This survey from the UBTECH Sydney AI Centre systematically reviews Knowledge Distillation (KD), a method for transferring knowledge from large teacher models to smaller student models to enable efficient deployment on resource-constrained devices. It categorizes KD methods by knowledge type, distillation scheme, and algorithms, demonstrating its consistent effectiveness in model compression and broad applicability across diverse AI domains like computer vision, NLP, and speech recognition.
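As a concrete instance of teacher-to-student transfer, the widely used temperature-softened soft-target loss can be sketched as below. This is one standard response-based formulation, shown for illustration only; it is not presented as the survey's own method, and the function names and the default `temperature` are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened class distributions.

    The T^2 factor keeps gradient magnitudes comparable across temperatures,
    following the common soft-target formulation.
    """
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    kl = sum(
        t * math.log(t / s) for t, s in zip(teacher, student) if t > 0
    )
    return temperature ** 2 * kl


# A student that matches the teacher incurs no loss; a mismatched one does.
matched = distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
mismatched = distillation_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0])
```

In practice this term is usually combined with the ordinary cross-entropy loss on ground-truth labels, with a weighting factor chosen per task.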

817

654

12 Apr 2025

computer-science computation-and-language multi-modal-learning

VisuoThink: Empowering LVLM Reasoning with Multimodal Tree Search

Shanghai AI Laboratory

Fudan University

Nanyang Technological University The University of Sydney Shanghai Innovation Institute

VisuoThink introduces a framework that enhances Large Vision-Language Model reasoning by dynamically interleaving visual and textual information within a multimodal tree search, achieving up to a 21.8% accuracy improvement on geometry problem-solving benchmarks without requiring model fine-tuning.

303

02 Jul 2025

computer-science computer-vision-security computer-vision-and-pattern-recognition

Harnessing Massive Satellite Imagery with Efficient Masked Image Modeling

Wuhan University

Tsinghua University Nankai University National University of Defense Technology The University of Sydney University of Chinese Academy of Sciences

Zonghao Guo

D Wang

Researchers introduce OpticalRS-13M, a 13-million-image visible light remote sensing dataset, and SelectiveMAE, an efficient masked image modeling method. This pipeline achieves over a 2x speedup in pre-training and substantial memory reduction while maintaining or improving performance across various downstream remote sensing tasks compared to existing methods.

124
