Meituan LongCat Team's LongCat-Flash, a 560-billion-parameter Mixture-of-Experts (MoE) model, achieves top-tier performance, particularly in agentic tasks, while demonstrating exceptional computational efficiency. It was pre-trained on 20 trillion tokens in 30 days and achieves an inference cost of $0.70 per million output tokens at over 100 tokens per second (TPS) on H800 GPUs.
Researchers from MMLab, CUHK and Meituan developed OneThinker, a unified multimodal large language model capable of diverse visual understanding tasks across images and videos. This model utilizes an EMA-GRPO algorithm to achieve robust performance across 31 benchmarks, setting new state-of-the-art results for many tasks.
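The summary names an EMA-GRPO algorithm without giving details; as a rough, hypothetical illustration, the sketch below shows how an exponential-moving-average baseline could be combined with GRPO-style group-relative advantages. The class and function names, and the placement of the EMA, are assumptions rather than the paper's published method.

```python
# Hypothetical sketch: GRPO-style group-relative advantages with an
# EMA-smoothed per-task baseline. How the EMA enters OneThinker's EMA-GRPO
# is an assumption here, not the paper's published algorithm.
import numpy as np

class EmaBaseline:
    def __init__(self, beta: float = 0.9):
        self.beta = beta
        self.value = {}  # task_id -> running mean reward

    def update(self, task_id: str, group_rewards: np.ndarray) -> float:
        mean_r = float(group_rewards.mean())
        prev = self.value.get(task_id, mean_r)
        self.value[task_id] = self.beta * prev + (1.0 - self.beta) * mean_r
        return self.value[task_id]

def group_relative_advantages(rewards: np.ndarray,
                              baseline: float,
                              eps: float = 1e-6) -> np.ndarray:
    # Standard group normalization, but centered on the EMA baseline
    # instead of the per-group mean (an illustrative variation).
    return (rewards - baseline) / (rewards.std() + eps)

# Usage: one group of sampled rollouts for the same prompt/task.
ema = EmaBaseline(beta=0.9)
rewards = np.array([1.0, 0.0, 1.0, 0.0, 1.0])  # e.g., verifier scores
b = ema.update("video_qa", rewards)
adv = group_relative_advantages(rewards, b)
```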
We present LongCat-Flash-Thinking, an efficient 560-billion-parameter open-source Mixture-of-Experts (MoE) reasoning model. Its advanced capabilities are cultivated through a meticulously crafted training process, beginning with long Chain-of-Thought (CoT) data cold-start and culminating in large-scale Reinforcement Learning (RL). We first employ a well-designed cold-start training strategy, which significantly enhances the reasoning potential and equips the model with specialized skills in both formal and agentic reasoning. Then, a core innovation is our domain-parallel training scheme, which decouples optimization across distinct domains (e.g., STEM, Code, Agentic) and subsequently fuses the resulting expert models into a single, nearly Pareto-optimal model. This entire process is powered by our Dynamic ORchestration for Asynchronous rollout (DORA) system, a large-scale RL framework that delivers a greater than threefold training speedup over synchronous methods on tens of thousands of accelerators. As a result, LongCat-Flash-Thinking achieves state-of-the-art performance among open-source models on a suite of complex reasoning tasks. The model exhibits exceptional efficiency in agentic reasoning, reducing average token consumption by 64.5% (from 19,653 to 6,965) on AIME-25, without degrading task accuracy. We release LongCat-Flash-Thinking to promote further advances in reasoning systems and agentic AI research.
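The domain-parallel scheme trains separate expert models (e.g., STEM, Code, Agentic) and then fuses them into one model. A minimal sketch of weight-space fusion is shown below, assuming a simple weighted average of parameters; the fusion weights and the averaging recipe are illustrative, not the paper's actual procedure.

```python
# Minimal sketch of weight-space fusion of domain-expert checkpoints.
# A plain weighted average of parameters is assumed here for illustration;
# the paper's actual fusion procedure may differ.
from typing import Dict, List
import torch

def fuse_experts(expert_state_dicts: List[Dict[str, torch.Tensor]],
                 weights: List[float]) -> Dict[str, torch.Tensor]:
    assert len(expert_state_dicts) == len(weights)
    total = sum(weights)
    weights = [w / total for w in weights]  # normalize fusion weights
    fused = {}
    for name in expert_state_dicts[0]:
        fused[name] = sum(w * sd[name].float()
                          for w, sd in zip(weights, expert_state_dicts))
    return fused

# Usage (hypothetical checkpoints for the three domains):
# stem, code, agentic = [torch.load(p) for p in ("stem.pt", "code.pt", "agentic.pt")]
# merged = fuse_experts([stem, code, agentic], weights=[0.4, 0.3, 0.3])
```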
The paper introduces Supervised Reinforcement Fine-Tuning (SRFT), a single-stage method that unifies supervised fine-tuning and reinforcement learning for large language models (LLMs). SRFT achieves an average accuracy of 59.5% on mathematical reasoning benchmarks, improving upon zero-RL methods by 9.0% and demonstrating enhanced out-of-distribution generalization.
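SRFT combines supervised fine-tuning and reinforcement learning in one stage. The sketch below illustrates one plausible way to mix a token-level SFT loss on demonstrations with a REINFORCE-style loss on sampled rollouts; the mixing coefficient `alpha` and the reward centering are assumptions, not the paper's exact objective.

```python
# Illustrative single-stage objective mixing an SFT loss on demonstrations
# with a simple policy-gradient loss on self-sampled rollouts.
# The mixing coefficient and reward handling are assumptions for this sketch.
import torch
import torch.nn.functional as F

def srft_style_loss(logits_demo,        # (B, T, V) logits on demonstration tokens
                    demo_labels,        # (B, T) target token ids, -100 = ignore
                    logprobs_rollout,   # (B, T) log-probs of sampled rollout tokens
                    rollout_rewards,    # (B,) scalar rewards per rollout
                    alpha: float = 0.5):
    # Supervised term: cross-entropy on demonstration tokens.
    sft_loss = F.cross_entropy(
        logits_demo.view(-1, logits_demo.size(-1)),
        demo_labels.view(-1),
        ignore_index=-100,
    )
    # RL term: REINFORCE-style loss with centered rewards as advantages.
    adv = rollout_rewards - rollout_rewards.mean()
    rl_loss = -(adv.detach() * logprobs_rollout.sum(dim=-1)).mean()
    return alpha * sft_loss + (1.0 - alpha) * rl_loss
```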
Meituan's LongCat-Video presents a 13.6 billion parameter foundational model for video generation, achieving high-quality, minutes-long video output at 720p 30fps with over a 10x speedup across text-to-video, image-to-video, and video continuation tasks. The model demonstrates leading performance in "Commonsense" on public benchmarks and leverages multi-reward reinforcement learning from human feedback to enhance generation quality.
The EditThinker framework enhances instruction-following in any image editor by introducing an iterative reasoning process. It leverages a Multimodal Large Language Model to critique, reflect, and refine editing instructions, leading to consistent performance gains across diverse benchmarks and excelling in complex reasoning tasks.
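The critique-reflect-refine loop can be pictured as a thin wrapper around any editor. Below is a minimal sketch in which `run_editor`, `mllm_critique`, and `mllm_refine_instruction` are hypothetical placeholders for the editor and the MLLM calls, not the paper's actual APIs.

```python
# Sketch of an iterative critique -> reflect -> refine loop around an
# arbitrary image editor. The three callables are hypothetical placeholders.
def iterative_edit(image, instruction, run_editor, mllm_critique,
                   mllm_refine_instruction, max_rounds: int = 3):
    current_instruction = instruction
    edited = run_editor(image, current_instruction)
    for _ in range(max_rounds):
        critique = mllm_critique(image, edited, instruction)
        if critique["satisfied"]:
            break
        # Reflect on the critique and rewrite the instruction for the editor.
        current_instruction = mllm_refine_instruction(
            instruction, current_instruction, critique["feedback"])
        edited = run_editor(image, current_instruction)
    return edited
```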
3DThinker equips Vision-Language Models with an intrinsic "3D mentaling" capability, allowing them to imagine and reason about 3D scenes from limited 2D views without explicit 3D annotations or external tools. This framework achieves state-of-the-art spatial reasoning performance across various benchmarks, even surpassing advanced closed-source models and specialized baselines, and offers interpretability through reconstructable 3D latent representations.
CodePlot-CoT introduces a code-driven Chain-of-Thought paradigm, enabling Vision Language Models (VLMs) to generate precise visual aids by producing executable plotting code that is then rendered and re-integrated into the reasoning process. This method, along with the new Math-VR dataset, allowed CodePlot-CoT to achieve up to a 21% performance increase on mathematical visual reasoning tasks, surpassing larger models and those using direct image generation.
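The core loop renders model-generated plotting code and feeds the resulting image back into the reasoning context. The sketch below assumes the model wraps plotting code in `<plot>...</plot>` tags and uses matplotlib for rendering; both choices, and the `vlm_generate` call, are illustrative assumptions.

```python
# Sketch of a code-driven visual CoT step: execute model-generated plotting
# code, capture the rendered figure, and hand it back to the VLM as a new
# visual input. The tag convention and sandboxing are hypothetical.
import io
import matplotlib
matplotlib.use("Agg")  # headless rendering
import matplotlib.pyplot as plt

def render_plot_code(plot_code: str) -> bytes:
    # NOTE: exec on model output needs real sandboxing in practice.
    namespace = {"plt": plt}
    plt.close("all")
    exec(plot_code, namespace)
    buf = io.BytesIO()
    plt.gcf().savefig(buf, format="png")
    return buf.getvalue()

def codeplot_cot_step(question, history, vlm_generate):
    response = vlm_generate(question, history)  # may contain plotting code
    if "<plot>" in response:
        code = response.split("<plot>")[1].split("</plot>")[0]
        image_png = render_plot_code(code)
        history = history + [("text", response), ("image", image_png)]
    else:
        history = history + [("text", response)]
    return history
```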
Recent trends in test-time scaling for reasoning models (e.g., OpenAI o1, DeepSeek-R1) have led to remarkable improvements through long Chain-of-Thought (CoT). However, existing benchmarks mainly focus on immediate, single-horizon tasks, failing to adequately evaluate models' ability to understand and respond to complex, long-horizon scenarios. To address this incomplete evaluation of Large Reasoning Models (LRMs), we propose R-HORIZON, a method designed to stimulate long-horizon reasoning behaviors in LRMs through query composition. Based on R-HORIZON, we construct a long-horizon reasoning benchmark, comprising complex multi-step reasoning tasks with interdependent problems that span long reasoning horizons. Through comprehensive evaluation of LRMs using the R-HORIZON benchmark, we find that even the most advanced LRMs suffer significant performance degradation. Our analysis reveals that LRMs exhibit limited effective reasoning length and struggle to allocate thinking budget across multiple problems appropriately. Recognizing these limitations, we use R-HORIZON to construct long-horizon reasoning data for reinforcement learning with verified rewards (RLVR). Compared to training with single-horizon data, RLVR with R-HORIZON not only substantially improves performance on multi-horizon reasoning tasks, but also improves accuracy on standard reasoning tasks, with a 7.5-point increase on AIME2024. These results position R-HORIZON as a scalable, controllable, and low-cost paradigm for enhancing and evaluating the long-horizon reasoning capabilities of LRMs.
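Query composition can be sketched as chaining problems so that each one refers to the answer of the previous one, which forces a single long-horizon reasoning trace. The templating convention below ({prev_answer} placeholders and verification on the final answer only) is an assumption for illustration, not the paper's exact format.

```python
# Minimal sketch of query composition: chain k problems so that problem i+1
# depends on the answer to problem i, producing one long-horizon query.
def compose_long_horizon_query(problems, answers):
    """problems: list of templates; answers: ground-truth answers per step."""
    parts = []
    for i, template in enumerate(problems):
        if i == 0:
            text = template
        else:
            # Keep the dependency symbolic so the model must carry it forward.
            text = template.format(prev_answer=f"the answer to Problem {i}")
        parts.append(f"Problem {i + 1}: {text}")
    prompt = ("Solve the following problems in order. Later problems depend "
              "on earlier answers.\n\n" + "\n\n".join(parts))
    final_answer = answers[-1]  # only the last answer is verified
    return prompt, final_answer

# Usage with two toy templates:
prompt, gt = compose_long_horizon_query(
    ["What is 17 * 3?", "Add 9 to {prev_answer}. What do you get?"],
    ["51", "60"],
)
```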
We introduce LongCat-Flash-Omni, a state-of-the-art open-source omni-modal model with 560 billion parameters, excelling at real-time audio-visual interaction. By adopting a curriculum-inspired progressive training strategy that transitions from simpler to increasingly complex modality sequence modeling tasks, LongCat-Flash-Omni attains comprehensive multimodal capabilities while maintaining strong unimodal capability. Building upon LongCat-Flash, which adopts a high-performance Shortcut-connected Mixture-of-Experts (MoE) architecture with zero-computation experts, LongCat-Flash-Omni integrates efficient multimodal perception and speech reconstruction modules. Despite its immense size of 560B parameters (with 27B activated), LongCat-Flash-Omni achieves low-latency real-time audio-visual interaction. For training infrastructure, we developed a modality-decoupled parallelism scheme specifically designed to manage the data and model heterogeneity inherent in large-scale multimodal training. This innovative approach demonstrates exceptional efficiency by sustaining over 90% of the throughput achieved by text-only training. Extensive evaluations show that LongCat-Flash-Omni achieves state-of-the-art performance on omni-modal benchmarks among open-source models. Furthermore, it delivers highly competitive results across a wide range of modality-specific tasks, including text, image, and video understanding, as well as audio understanding and generation. We provide a comprehensive overview of the model architecture design, training procedures, and data strategies, and open-source the model to foster future research and development in the community.
VitaBench, developed by the Meituan LongCat Team, introduces a comprehensive benchmark for evaluating LLM agents on versatile interactive tasks derived from real-world "life-serving applications." It assesses agents across reasoning, tool use, and interaction complexity, revealing that even state-of-the-art models achieve only a 30.0% average success rate on cross-scenario tasks.
InfiniteTalk introduces sparse-frame video dubbing, a new paradigm for audio-driven video generation that produces holistic, full-body synchronized movements for infinite-length videos while preserving visual identity and ensuring temporal continuity. The model achieves superior naturalness in full-body synchronization and competitive lip synchronization as validated by human evaluation.
This work uncovers that architectural decoupling in unified multimodal models (UMMs) improves performance by inducing task-specific attention patterns, rather than eliminating task conflicts. Researchers from CUHK MMLab and Meituan introduce an Attention Interaction Alignment (AIA) loss, a regularization technique that guides UMMs' attention toward optimal task-specific behaviors without architectural changes, enhancing both understanding and generation performance for models like Emu3 and Janus-Pro.
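One way to picture an attention-alignment regularizer is as a divergence penalty between the model's attention maps and reference task-specific patterns. The sketch below uses a KL term and externally supplied reference maps; both are assumptions about how an AIA-style loss could be realized, not the paper's formulation.

```python
# Sketch of an attention-alignment regularizer: penalize divergence between
# the model's attention maps and reference task-specific attention patterns.
import torch

def attention_alignment_loss(attn: torch.Tensor,
                             ref_attn: torch.Tensor,
                             eps: float = 1e-8) -> torch.Tensor:
    """attn, ref_attn: (batch, heads, query_len, key_len); rows sum to 1."""
    attn = attn.clamp_min(eps)
    ref_attn = ref_attn.clamp_min(eps)
    # KL(ref || model), averaged over batch, heads, and query positions.
    kl = (ref_attn * (ref_attn.log() - attn.log())).sum(dim=-1)
    return kl.mean()

# Added to the usual objective with a small weight (illustrative):
# total_loss = task_loss + lambda_aia * attention_alignment_loss(attn, ref_attn)
```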
A trainable graph memory framework is introduced to empower LLM agents to learn and adapt strategies from their experiences. This method integrates a reinforcement-driven mechanism to distill low-level trajectories into high-level meta-cognitive strategies, achieving improved performance in zero-training inference and accelerating reinforcement learning, particularly for smaller models.
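A graph memory of distilled strategies can be sketched as nodes that hold a strategy summary, an embedding for retrieval, and a value updated from downstream reward. The sketch below is hypothetical; the node schema, retrieval rule, and value update are illustrative rather than the paper's design.

```python
# Hypothetical sketch of a graph memory of distilled strategies: retrieval by
# embedding similarity, reinforcement of a node's value from task reward.
import numpy as np

class GraphMemory:
    def __init__(self, embed_fn):
        self.embed_fn = embed_fn          # text -> L2-normalized vector
        self.nodes = []                   # {"strategy", "emb", "value"}

    def add_strategy(self, strategy_text: str, init_value: float = 0.0):
        self.nodes.append({"strategy": strategy_text,
                           "emb": self.embed_fn(strategy_text),
                           "value": init_value})

    def retrieve(self, task_text: str, k: int = 3):
        q = self.embed_fn(task_text)
        scored = sorted(self.nodes,
                        key=lambda n: -float(np.dot(q, n["emb"])))
        return [n["strategy"] for n in scored[:k]]

    def reinforce(self, strategy_text: str, reward: float, lr: float = 0.1):
        # Nudge the stored value toward the observed reward.
        for n in self.nodes:
            if n["strategy"] == strategy_text:
                n["value"] += lr * (reward - n["value"])
```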
Meituan's LongCat Interaction Team developed WOWService, an intelligent interaction system leveraging a multi-stage LLM training pipeline and multi-agent architecture, to enhance user satisfaction and reduce operational costs for its local lifestyle services. Deployed on the Meituan App, the system demonstrated improvements in user satisfaction metrics (e.g., USM 1 by -27.53%, USM 2 by +25.51%) and operational efficiency.
Meituan's LongCat-Image introduces an open-source, bilingual foundation model for image generation and editing, achieving state-of-the-art performance with a compact 6B parameter architecture. The model establishes new industry standards for Chinese character rendering, reaching 90.7% accuracy on a custom benchmark, and demonstrates robust image editing capabilities, often outperforming larger models.
A framework is presented for direct 4K (4096x4096) image synthesis with latent diffusion models, integrating a new Aesthetic-4K benchmark and a Wavelet-based Fine-tuning (WLF) approach. This method generates ultra-high-resolution images with enhanced fine details and textures, outperforming prior approaches on novel detail-focused metrics.
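A wavelet-based fine-tuning signal can be illustrated by comparing high-frequency subbands of generated and reference images, so that training emphasizes fine detail. The sketch below uses a Haar DWT and an L1 penalty, which are assumptions rather than the paper's exact WLF formulation.

```python
# Sketch of a wavelet-based detail loss: compare high-frequency wavelet
# subbands of generated and reference images (Haar DWT + L1 is assumed).
import numpy as np
import pywt

def high_freq_l1(gen: np.ndarray, ref: np.ndarray) -> float:
    """gen, ref: grayscale images as 2D float arrays in [0, 1]."""
    _, (gh, gv, gd) = pywt.dwt2(gen, "haar")   # horizontal/vertical/diagonal details
    _, (rh, rv, rd) = pywt.dwt2(ref, "haar")
    detail_diff = [np.abs(a - b).mean() for a, b in
                   ((gh, rh), (gv, rv), (gd, rd))]
    return float(np.mean(detail_diff))

# Used as an auxiliary term during fine-tuning (illustrative):
# total_loss = diffusion_loss + lambda_wlf * high_freq_l1(gen_img, ref_img)
```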
Meituan researchers developed MTGR, an industrial-scale generative recommendation framework that integrates traditional cross-features into a scalable transformer-based architecture. Deployed on Meituan, it achieved online gains of +1.90% PV_CTR and +1.02% UV_CTCVR with an unchanged training cost and a 12% reduction in inference cost compared to DLRM baselines.
Instruction tuning is a standard technique employed to align large language models to end tasks and user preferences after the initial pretraining phase. Recent research indicates the critical role of data engineering in instruction tuning -- when appropriately selected, only limited data is necessary to achieve superior performance. However, we still lack a principled understanding of what makes good instruction tuning data for alignment, and how we should select data automatically and effectively. In this work, we delve deeply into automatic data selection strategies for alignment. We start with controlled studies to measure data across three dimensions: complexity, quality, and diversity, along which we examine existing methods and introduce novel techniques for enhanced data measurement. Subsequently, we propose a simple strategy to select data samples based on the measurement. We present deita (short for Data-Efficient Instruction Tuning for Alignment), a series of models fine-tuned from LLaMA and Mistral models using data samples automatically selected with our proposed approach. Empirically, deita performs better than or on par with the state-of-the-art open-source alignment models with only 6K SFT training data samples -- over 10x less than the data used in the baselines. When further trained with direct preference optimization (DPO), deita-Mistral-7B + DPO trained with 6K SFT and 10K DPO samples achieves 7.55 MT-Bench and 90.06% AlpacaEval scores. We anticipate this work will provide tools for automatic data selection, facilitating data-efficient alignment. We release our models as well as the selected datasets to facilitate future research on aligning models effectively and efficiently.
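The selection strategy can be sketched as ranking candidates by a combined complexity-and-quality score and then greedily keeping only samples that are sufficiently far from already-selected ones in embedding space. The combined score, the cosine-distance filter, and the threshold below are illustrative choices, not the paper's exact procedure.

```python
# Sketch of score-first, diversity-aware data selection: rank by
# complexity * quality, then keep a sample only if it is far enough from
# everything already selected (cosine distance on normalized embeddings).
import numpy as np

def select_data(samples, complexity, quality, embeddings,
                budget: int = 6000, min_dist: float = 0.1):
    """samples: list; complexity/quality: arrays; embeddings: (N, d), L2-normalized."""
    order = np.argsort(-(np.asarray(complexity) * np.asarray(quality)))
    selected, selected_embs = [], []
    for i in order:
        if len(selected) >= budget:
            break
        e = embeddings[i]
        if selected_embs:
            # cosine similarity to the nearest already-selected sample
            nearest = max(float(np.dot(e, s)) for s in selected_embs)
            if 1.0 - nearest < min_dist:
                continue  # too similar to an existing sample; skip it
        selected.append(samples[i])
        selected_embs.append(e)
    return selected
```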
Chart-R1 introduces a vision-language model for complex chart reasoning, leveraging a novel programmatic data synthesis strategy and a two-stage training pipeline that combines Chain-of-Thought supervision with numerically sensitive reinforcement learning. The model achieves state-of-the-art performance on various benchmarks, demonstrating advanced capabilities in multi-step visual data analysis that rival or surpass larger proprietary models.