Monash University
ByteDance Seed introduced BAGEL, an open-source unified multimodal foundation model trained on trillions of interleaved text, image, and video tokens. This model demonstrates emergent reasoning abilities and achieves state-of-the-art performance among open-source alternatives, narrowing the capability gap with leading proprietary systems.
Researchers from Monash University, VinUniversity, and the University of Cambridge developed PiVe (Prompting with Iterative Verification), a framework that uses a specialized verifier module to iteratively correct semantic graphs generated by Large Language Models (LLMs). This method improved graph generation quality by an average of 26% across multiple datasets and enabled the creation of a high-quality text-graph dataset, GenWiki-HIQ.
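The generate-verify-correct loop at the heart of this approach can be sketched in a few lines. The sketch below is illustrative only, assuming Python; `call_llm` and `call_verifier` are hypothetical stand-ins for the generator LLM and the trained verifier module, not PiVe's actual interfaces.

```python
# Minimal sketch of a PiVe-style generate-verify-correct loop.
# `call_llm` and `call_verifier` are hypothetical placeholders.

def call_llm(prompt: str) -> list[tuple[str, str, str]]:
    """Placeholder: returns a list of (subject, relation, object) triples."""
    raise NotImplementedError

def call_verifier(text: str, graph: list[tuple[str, str, str]]) -> list[str]:
    """Placeholder: returns correction instructions; empty when the graph is complete."""
    raise NotImplementedError

def pive_generate(text: str, max_rounds: int = 3):
    prompt = f"Convert the text into a semantic graph of triples:\n{text}"
    graph = call_llm(prompt)
    for _ in range(max_rounds):
        corrections = call_verifier(text, graph)
        if not corrections:          # verifier found nothing to fix
            break
        # Feed the verifier's feedback back into the generator prompt.
        prompt = (
            f"Text:\n{text}\nCurrent graph:\n{graph}\n"
            "Apply these corrections:\n" + "\n".join(corrections)
        )
        graph = call_llm(prompt)
    return graph
```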
TIME-LLM introduces a reprogramming framework that adapts large language models to general time series forecasting while keeping the LLM backbone frozen. The approach achieves state-of-the-art performance across various benchmarks, excelling particularly in data-scarce few-shot and zero-shot settings.
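The following is a minimal sketch of the reprogramming idea, not the authors' implementation: a small trainable projection maps time-series patches into the frozen LLM's embedding space, and a lightweight head produces the forecast. The patch length, stride, and dimensions are arbitrary assumptions, and the backbone passed in is assumed to already have `requires_grad=False` on its parameters.

```python
import torch
import torch.nn as nn

class TimeSeriesReprogrammer(nn.Module):
    def __init__(self, patch_len=16, stride=8, llm_dim=768, horizon=96):
        super().__init__()
        self.patch_len, self.stride = patch_len, stride
        self.to_llm = nn.Linear(patch_len, llm_dim)   # trainable input projection
        self.head = nn.Linear(llm_dim, horizon)       # trainable forecasting head

    def forward(self, series, frozen_llm):
        # series: (batch, time) -> patches: (batch, n_patches, patch_len)
        patches = series.unfold(-1, self.patch_len, self.stride)
        tokens = self.to_llm(patches)                 # pseudo-token embeddings
        hidden = frozen_llm(tokens)                   # frozen backbone; gradients still flow to the projection
        return self.head(hidden[:, -1])               # forecast from the last hidden state

# Toy usage with an identity stand-in for the frozen backbone.
model = TimeSeriesReprogrammer()
print(model(torch.randn(4, 512), frozen_llm=nn.Identity()).shape)  # torch.Size([4, 96])
```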
Multi-Agent Path Finding (MAPF) is a fundamental problem in robotics that asks us to compute collision-free paths for a team of agents, all moving across a shared map. Despite a large body of work on the topic, all current algorithms struggle as the number of agents grows. The principal reason is that existing approaches typically plan free-flow optimal paths, which creates congestion. To tackle this issue, we propose a new approach for MAPF where agents are guided to their destinations by following congestion-avoiding paths. We evaluate the idea in two large-scale settings: one-shot MAPF, where each agent has a single destination, and lifelong MAPF, where agents are continuously assigned new destinations. Empirically, we report large improvements in solution quality for one-shot MAPF and in overall throughput for lifelong MAPF.
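The congestion-avoiding guidance idea can be illustrated with a short sketch; this is not the paper's algorithm, ignores the time dimension and collision resolution, and assumes the goal is reachable. Agents are planned sequentially, and each edge's cost grows with how many earlier agents already use it.

```python
import heapq
from collections import defaultdict

def plan_path(grid, start, goal, edge_usage, penalty=1.0):
    """Dijkstra on a 4-connected grid; edge_usage[(u, v)] counts prior traffic."""
    dist, prev = {start: 0.0}, {}
    pq = [(0.0, start)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == goal:
            break
        if d > dist[u]:
            continue
        r, c = u
        for v in [(r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)]:
            if v not in grid:                      # grid = set of free cells
                continue
            nd = d + 1.0 + penalty * edge_usage[(u, v)]
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    path, node = [], goal
    while node != start:
        path.append(node)
        node = prev[node]
    return [start] + path[::-1]

def plan_team(grid, tasks):
    """Plan agents one after another, accumulating congestion on used edges."""
    edge_usage, paths = defaultdict(int), []
    for start, goal in tasks:
        path = plan_path(grid, start, goal, edge_usage)
        for u, v in zip(path, path[1:]):
            edge_usage[(u, v)] += 1
        paths.append(path)
    return paths

free = {(r, c) for r in range(4) for c in range(4)}
print(plan_team(free, [((0, 0), (3, 3)), ((0, 3), (3, 0))]))
```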
GFM-RAG introduces the first graph foundation model specifically designed for Retrieval Augmented Generation (RAG), leveraging a query-dependent Graph Neural Network to capture complex, multi-hop knowledge relationships. This model achieves state-of-the-art retrieval and question answering performance on diverse datasets and generalizes to unseen domains without fine-tuning, significantly enhancing LLM reasoning capabilities.
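As a rough illustration of query-dependent propagation over a graph, the sketch below scores nodes by spreading seed relevance along edges whose relation embeddings match the query. It is loosely inspired by query-conditioned message passing and is not GFM-RAG's architecture; the dense tensors and hop count are purely illustrative assumptions.

```python
import numpy as np

def query_dependent_propagation(adj, rel_emb, query_emb, seed_scores, hops=3):
    """
    adj:         (n, n) 0/1 adjacency of the knowledge/document graph
    rel_emb:     (n, n, d) embedding of the relation on each edge (zeros where no edge)
    query_emb:   (d,) query embedding
    seed_scores: (n,) initial relevance from lexical/dense retrieval
    """
    # Edge gates: how relevant each edge's relation is to this query.
    gate = adj * (rel_emb @ query_emb)          # (n, n)
    scores = seed_scores.copy()
    for _ in range(hops):                       # one hop of propagation per layer
        scores = np.maximum(scores, gate.T @ scores / (adj.sum(0) + 1e-9))
    return scores                               # higher = more likely multi-hop relevant
```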
VolSplat introduces a voxel-aligned prediction paradigm for feed-forward 3D Gaussian Splatting, aggregating 2D features into a 3D voxel grid to predict Gaussian parameters. This approach significantly enhances geometric consistency, robustness, and rendering quality, outperforming prior pixel-aligned methods on benchmarks like RealEstate10K and ScanNet.
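A rough sketch of the voxel-aligned aggregation step is shown below, under simplifying assumptions (pinhole cameras, known per-pixel depth, mean pooling); per-voxel Gaussian parameters would then be predicted from the resulting voxel features. The function and variable names are illustrative, not VolSplat's code.

```python
import numpy as np

def aggregate_to_voxels(points_world, feats_2d, grid_min, voxel_size, grid_res):
    """
    points_world: (N, 3) 3D points unprojected from all source-view pixels
    feats_2d:     (N, C) the 2D feature sampled at each source pixel
    Returns a (grid_res**3, C) grid of mean features and a per-voxel count.
    """
    idx = np.floor((points_world - grid_min) / voxel_size).astype(int)
    valid = np.all((idx >= 0) & (idx < grid_res), axis=1)
    idx, feats = idx[valid], feats_2d[valid]
    flat = idx[:, 0] * grid_res**2 + idx[:, 1] * grid_res + idx[:, 2]

    voxel_feat = np.zeros((grid_res**3, feats.shape[1]))
    counts = np.zeros(grid_res**3)
    np.add.at(voxel_feat, flat, feats)          # scatter-add features into voxels
    np.add.at(counts, flat, 1.0)
    voxel_feat /= np.maximum(counts, 1.0)[:, None]
    return voxel_feat, counts
```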
LIGHTFUSION introduces a double fusion framework that integrates pre-trained understanding and generation models to achieve unified multimodal capabilities. This approach delivers competitive performance across multimodal understanding, text-to-image generation, and image editing tasks using significantly fewer training tokens (approximately 35 billion) compared to existing large-scale unified models.
Researchers developed UniMedVL, a unified medical foundation model capable of simultaneously performing both understanding and generation tasks within a single architecture, leveraging the UniMed-5M multimodal dataset and a progressive curriculum learning strategy. The model achieves superior performance across diverse medical visual understanding benchmarks and demonstrates high-fidelity generation and seamless execution of complex interleaved multimodal tasks.
Youtu-GraphRAG introduces a vertically unified agentic paradigm that jointly optimizes graph construction and retrieval for large language models, significantly enhancing complex reasoning accuracy and reducing token consumption by up to 90.71% across various benchmarks, while mitigating knowledge leakage through novel evaluation datasets.
Researchers developed SecureAgentBench, a benchmark with 105 real-world, repository-level tasks, to evaluate LLM-powered code agents' ability to generate secure code. Evaluations show that current agents achieve a mere 9.2% success rate in producing functionally correct and secure solutions, frequently introducing novel vulnerabilities and struggling even with explicit security guidance.
A survey charts the recent trajectory of Compositional Visual Reasoning (CVR) from 2023 to 2025, introducing a five-stage taxonomy to explain its evolution and distinct advantages over monolithic approaches. The work systematically reviews over 260 papers, identifying key benefits such as enhanced interpretability and robustness, while also outlining persistent open challenges and future research directions for the field.
The Graph-constrained Reasoning (GCR) framework integrates Knowledge Graph (KG) structure directly into Large Language Model (LLM) decoding, achieving 100% faithful reasoning without hallucinations on KGQA tasks. This approach consistently outperforms state-of-the-art methods on benchmarks like WebQuestionsSP and Complex WebQuestions by up to 9.1% while being significantly more efficient than agent-based approaches.
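The constraint idea can be illustrated with a short sketch (not GCR's implementation): at each decoding step the candidate continuations are limited to relations that actually leave the current entity, so generated reasoning paths stay grounded in the KG. The scorer below is a dummy stand-in for an LLM log-probability.

```python
def constrained_path_decode(kg, start_entity, score_fn, max_hops=3):
    """
    kg:       dict mapping entity -> list of (relation, next_entity) edges
    score_fn: callable(path, relation) -> float, e.g. an LLM log-probability
    """
    entity, path = start_entity, []
    for _ in range(max_hops):
        candidates = kg.get(entity, [])
        if not candidates:                       # no outgoing edges: stop
            break
        # Only KG-valid relations are scored; hallucinated edges cannot appear.
        relation, next_entity = max(candidates, key=lambda e: score_fn(path, e[0]))
        path.append((entity, relation, next_entity))
        entity = next_entity
    return path

# Toy usage with a hand-written KG and a dummy scorer.
kg = {"Melbourne": [("located_in", "Australia"), ("has_university", "Monash")],
      "Australia": [("capital", "Canberra")]}
print(constrained_path_decode(kg, "Melbourne", lambda p, r: len(r)))
```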
Meituan's LongCat-Flash-Omni is a 560-billion-parameter open-source omni-modal model that processes text, image, video, and audio to enable real-time audio-visual interaction. It achieves state-of-the-art performance on various multimodal benchmarks and shows highly competitive results against leading proprietary models.
MVSplat presents an efficient, generalizable feed-forward model that generates high-quality 3D Gaussian Splatting representations from sparse multi-view images. It achieves state-of-the-art visual quality with over 2x faster inference (22 fps) and a 10x smaller model size (12M parameters) than prior methods by integrating multi-view stereo cost volumes for robust 3D geometry estimation.
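The geometric core of this line of work, a plane-sweep cost volume, can be sketched simply; the snippet below is an illustration under simplifying assumptions, where `warp_to_reference` is a placeholder for the homography warp of source features at a given depth plane, not MVSplat's code.

```python
import numpy as np

def build_cost_volume(ref_feat, src_feat, depths, warp_to_reference):
    """
    ref_feat, src_feat: (H, W, C) per-view feature maps
    depths:             list of candidate depth values
    Returns a (D, H, W) volume of feature correlations.
    """
    H, W, C = ref_feat.shape
    volume = np.empty((len(depths), H, W))
    for d, depth in enumerate(depths):
        warped = warp_to_reference(src_feat, depth)            # (H, W, C)
        volume[d] = (ref_feat * warped).sum(-1) / np.sqrt(C)   # correlation score
    return volume

def depth_from_volume(volume, depths):
    ex = np.exp(volume - volume.max(0))
    probs = ex / ex.sum(0)                                     # per-pixel softmax over depths
    return np.tensordot(np.asarray(depths), probs, axes=1)     # soft-argmax depth map
```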
The Reasoning on Graphs (RoG) framework enhances Large Language Model (LLM) reasoning by integrating Knowledge Graph (KG) structural information as explicit reasoning plans. It achieves state-of-the-art performance on KGQA benchmarks, improving Hits@1 by 22.3% and F1 by 14.4% on CWQ, while providing faithful and interpretable explanations grounded in KG paths.
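The retrieval step in a RoG-style planning-retrieval-reasoning pipeline can be sketched as follows: given a relation path proposed by the LLM as a plan, walk the KG from the question entity and collect all entity paths that realize the plan. This is a hedged sketch with a toy KG, not the authors' code.

```python
def retrieve_paths(kg, start_entity, relation_plan):
    """
    kg:            dict mapping entity -> list of (relation, next_entity) edges
    relation_plan: e.g. ["born_in", "capital_of"], produced by the planning LLM
    """
    frontier = [[start_entity]]
    for relation in relation_plan:
        next_frontier = []
        for path in frontier:
            for rel, nxt in kg.get(path[-1], []):
                if rel == relation:              # follow only the planned relation
                    next_frontier.append(path + [nxt])
        frontier = next_frontier
    return frontier                              # grounded paths fed back to the LLM

kg = {"Alan_Turing": [("born_in", "London")],
      "London": [("capital_of", "United_Kingdom")]}
print(retrieve_paths(kg, "Alan_Turing", ["born_in", "capital_of"]))
# [['Alan_Turing', 'London', 'United_Kingdom']]
```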
The BigCode project releases StarCoder 2 models and The Stack v2 dataset, setting a new standard for open and ethically sourced Code LLM development. By prioritizing data quality and efficient architecture over sheer data quantity, StarCoder 2 models, particularly the 15B variant, demonstrate competitive performance across code generation, completion, and reasoning tasks, often outperforming larger closed-source alternatives.
VIDEO-THINKER, a new framework, empowers Multimodal Large Language Models to reason with videos by intrinsically developing temporal grounding and captioning abilities. The model establishes new state-of-the-art performance on various video reasoning benchmarks, achieving up to an 11.44% improvement on the VRBench out-of-domain dataset, while showcasing enhanced temporal localization (48.22% mIoU) and descriptive captioning.