Meituan Inc.
OneCAT, developed by Meituan Inc. and Shanghai Jiao Tong University, introduces a pure decoder-only, encoder-free, and VAE-tokenizer-free multimodal model for unified understanding, generation, and editing. This architecture significantly improves inference efficiency, achieving up to 61% faster time-to-first-token for high-resolution understanding and roughly 10x faster image generation than diffusion-based models, while setting new state-of-the-art results among unified models on various benchmarks.
MUA-RL introduces a reinforcement learning framework that integrates LLM-simulated users directly into the training loop for agentic tool use, enabling agents to learn from dynamic multi-turn interactions. The method yields agents capable of robustly solving complex user queries and achieving performance comparable to or surpassing larger models on various benchmarks, including TAU-Bench and ACEBench.
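The core of the description is the interaction loop in which an LLM-simulated user sits inside the rollout. Below is a minimal Python sketch of that loop only; the action schema, the `lookup_order` tool, and the stand-in agent/user functions are illustrative assumptions, and the reward computation and RL update described in the paper are omitted.

```python
def rollout_with_simulated_user(agent_step, simulate_user, tools, task, max_turns=8):
    """Collect one multi-turn trajectory in which an LLM-simulated user answers the
    agent between tool calls. This sketches only the interaction loop; reward
    computation, batching, and the RL update are omitted."""
    history = [("user", task)]
    for _ in range(max_turns):
        action = agent_step(history)                  # e.g. {"type": "tool", ...}
        history.append(("agent", action))
        if action["type"] == "tool":
            result = tools[action["name"]](**action.get("args", {}))
            history.append(("tool", result))
        elif action["type"] == "message":
            reply = simulate_user(history)            # simulated user turn
            history.append(("user", reply))
            if reply.strip().lower() == "done":
                break
        else:                                         # "finish" or anything else
            break
    return history

# toy usage with a stand-in agent, simulated user, and a single hypothetical tool
tools = {"lookup_order": lambda order_id: {"status": "shipped", "id": order_id}}
agent = lambda h: ({"type": "tool", "name": "lookup_order", "args": {"order_id": 7}}
                   if len(h) == 1 else
                   {"type": "message", "text": "Your order #7 has shipped."}
                   if len(h) == 3 else {"type": "finish"})
user = lambda h: "done"
print(rollout_with_simulated_user(agent, user, tools, "Where is my order #7?"))
```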
Researchers systematically analyzed visual layer selection in Multimodal Large Language Models (MLLMs), demonstrating that integrating features from shallow, middle, and deep Vision Transformer layers via a simple concatenation fusion outperforms conventional deep-layer reliance and more complex fusion strategies.
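The concatenation fusion described here is simple enough to sketch. The following is a minimal PyTorch illustration of concatenating hidden states from a shallow, middle, and deep ViT layer before projecting them into the LLM embedding space; the layer indices, dimensions, and linear projector are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MultiLayerConcatFusion(nn.Module):
    """Concatenate ViT features from shallow, middle, and deep layers and
    project them to the language-model embedding size (a sketch)."""

    def __init__(self, vit_dim=1024, llm_dim=4096, layer_ids=(3, 12, 23)):
        super().__init__()
        self.layer_ids = layer_ids          # assumed shallow / middle / deep indices
        self.proj = nn.Linear(vit_dim * len(layer_ids), llm_dim)

    def forward(self, hidden_states):
        # hidden_states: list of per-layer tensors, each (B, num_patches, vit_dim),
        # as returned by a ViT run with output_hidden_states=True.
        picked = [hidden_states[i] for i in self.layer_ids]
        fused = torch.cat(picked, dim=-1)   # (B, num_patches, vit_dim * 3)
        return self.proj(fused)             # (B, num_patches, llm_dim)

# toy usage with random tensors standing in for a 24-layer ViT's hidden states
dummy = [torch.randn(2, 196, 1024) for _ in range(24)]
tokens = MultiLayerConcatFusion()(dummy)
print(tokens.shape)  # torch.Size([2, 196, 4096])
```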
Researchers from Sun Yat-sen University, Peng Cheng Laboratory, and Meituan Inc. developed X-SAM, a unified Multimodal Large Language Model (MLLM) framework capable of performing seven distinct image segmentation tasks. X-SAM achieves state-of-the-art results across various benchmarks, including new records on RefCOCO, RefCOCO+, RefCOCOg, and GCG datasets, while also introducing the Visual Grounded (VGD) segmentation task.
Mr. DETR++ introduces an instructive multi-route training framework for Detection Transformers, which addresses their slow training convergence by integrating one-to-one and one-to-many supervision signals. The approach, which includes novel Mixture-of-Experts designs and discards auxiliary routes during inference, achieves consistent performance gains of up to +4.0% mAP on COCO object detection while maintaining the baseline model's inference speed.
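To make the one-to-one plus one-to-many supervision concrete, here is a heavily simplified sketch in the spirit of hybrid-matching detectors: ground-truth boxes are repeated for an auxiliary query group, and the auxiliary loss is added only during training. The repetition factor and the plain weighted sum are assumptions for illustration, not Mr. DETR++'s exact multi-route recipe or its Mixture-of-Experts design.

```python
import torch

def repeat_targets(boxes, labels, k=6):
    """Build one-to-many targets by repeating every ground-truth box k times
    (a sketch of auxiliary dense supervision; k is an assumed hyperparameter)."""
    return boxes.repeat_interleave(k, dim=0), labels.repeat_interleave(k, dim=0)

def training_loss(loss_one2one, loss_one2many, aux_weight=1.0):
    """Total training objective: the primary one-to-one (Hungarian-matched) loss
    plus the auxiliary one-to-many loss; at inference the auxiliary route is
    simply not executed, so inference speed is unchanged."""
    return loss_one2one + aux_weight * loss_one2many

# toy usage
boxes = torch.rand(3, 4)            # 3 ground-truth boxes in (cx, cy, w, h)
labels = torch.tensor([1, 2, 2])
rep_boxes, rep_labels = repeat_targets(boxes, labels)
print(rep_boxes.shape, rep_labels.shape)  # torch.Size([18, 4]) torch.Size([18])
```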
The AdaR framework teaches Large Language Models to perform adaptive mathematical reasoning by combining an automated data synthesis pipeline with a specialized reinforcement learning strategy. It achieved substantial performance gains, averaging +8.50 points across diverse benchmarks, by compelling models to rely on underlying problem-solving logic rather than superficial patterns.
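One way to picture the automated data synthesis step is instantiating a problem template with perturbed variable values so the surface numbers change while the solution logic stays fixed. The sketch below is an illustrative toy, not AdaR's actual pipeline or its reinforcement learning stage; the template, ranges, and answer function are assumptions.

```python
import random

def synthesize_variants(template, var_ranges, answer_fn, n=3, seed=0):
    """Instantiate a problem template with perturbed variable values so the
    underlying solution logic is invariant while surface numbers change
    (a toy sketch of the data-synthesis idea)."""
    rng = random.Random(seed)
    variants = []
    for _ in range(n):
        values = {k: rng.randint(lo, hi) for k, (lo, hi) in var_ranges.items()}
        variants.append({"question": template.format(**values),
                         "answer": answer_fn(**values)})
    return variants

# toy usage: same multiplication logic, different numbers each time
template = "A shop sells {a} apples per day. How many apples does it sell in {d} days?"
for v in synthesize_variants(template, {"a": (3, 20), "d": (2, 30)}, lambda a, d: a * d):
    print(v)
```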
The "thinking with images" paradigm represents a pivotal shift in the reasoning of Vision Language Models (VLMs), moving from text-dominant chain-of-thought to image-interactive reasoning. By invoking visual tools or generating intermediate visual representations, VLMs can iteratively attend to fine-grained regions, enabling deeper image understanding and more faithful multimodal reasoning. As an emerging paradigm, however, it still leaves substantial room for exploration in data construction accuracy, structural design, and broader application scenarios, which offer rich opportunities for advancing multimodal reasoning. To further advance this line of work, we present DeepSketcher, a comprehensive suite comprising both an image-text interleaved dataset and a self-contained model. The dataset contains 31k chain-of-thought (CoT) reasoning trajectories with diverse tool calls and resulting edited images, covering a wide range of data types and manipulation instructions with high annotation accuracy. Building on this resource, we design a model that performs interleaved image-text reasoning and natively generates "visual thoughts" by operating directly in the visual embedding space, rather than invoking external tools and repeatedly re-encoding generated images. This design enables tool-free and more flexible "thinking with images". Extensive experiments on multimodal reasoning benchmarks demonstrate strong performance, validating both the utility of the dataset and the effectiveness of the model design.
The MULTICONIR benchmark was developed to systematically evaluate information retrieval and reranking models on multi-condition natural language queries, revealing that current state-of-the-art models suffer significant performance degradation and lack robust relevance monotonicity and format invariance. Advanced general-purpose LLMs, such as GPT-4o, demonstrated superior capabilities in these complex retrieval scenarios.
For years, the YOLO series has been the de facto industry-level standard for efficient object detection. The YOLO community has prospered overwhelmingly to enrich its use in a multitude of hardware platforms and abundant scenarios. In this technical report, we strive to push its limits to the next level, stepping forward with an unwavering mindset for industry application. Considering the diverse requirements for speed and accuracy in the real environment, we extensively examine the up-to-date object detection advancements either from industry or academia. Specifically, we heavily assimilate ideas from recent network design, training strategies, testing techniques, quantization, and optimization methods. On top of this, we integrate our thoughts and practice to build a suite of deployment-ready networks at various scales to accommodate diversified use cases. With the generous permission of YOLO authors, we name it YOLOv6. We also express our warm welcome to users and contributors for further enhancement. For a glimpse of performance, our YOLOv6-N hits 35.9% AP on the COCO dataset at a throughput of 1234 FPS on an NVIDIA Tesla T4 GPU. YOLOv6-S strikes 43.5% AP at 495 FPS, outperforming other mainstream detectors at the same scale (YOLOv5-S, YOLOX-S, and PPYOLOE-S). Our quantized version of YOLOv6-S even brings a new state-of-the-art 43.3% AP at 869 FPS. Furthermore, YOLOv6-M/L also achieves better accuracy performance (i.e., 49.5%/52.3%) than other detectors with a similar inference speed. We carefully conducted experiments to validate the effectiveness of each component. Our code is made available at this https URL.
This research provides a systematic analysis of visual feature integration in Multimodal Large Language Models, focusing on optimal layer selection and fusion strategies. It identifies that selecting specific layers from different stages of a visual encoder and employing an external direct fusion method consistently yield the best performance across various multimodal benchmarks.
Real-world videos often extend over thousands of frames. Existing video super-resolution (VSR) approaches, however, face two persistent challenges when processing long sequences: (1) inefficiency due to the heavy cost of multi-step denoising for full-length sequences; and (2) poor scalability hindered by temporal decomposition that causes artifacts and discontinuities. To break these limits, we propose InfVSR, which reformulates VSR as an autoregressive-one-step-diffusion paradigm. This enables streaming inference while fully leveraging pre-trained video diffusion priors. First, we adapt the pre-trained DiT into a causal structure, maintaining both local and global coherence via a rolling KV-cache and joint visual guidance. Second, we distill the diffusion process into a single step efficiently, with patch-wise pixel supervision and cross-chunk distribution matching. Together, these designs enable efficient and scalable VSR for unbounded-length videos. To fill the gap in long-form video evaluation, we build a new benchmark tailored for extended sequences and further introduce semantic-level metrics to comprehensively assess temporal consistency. Our method pushes the frontier of long-form VSR, achieves state-of-the-art quality with enhanced semantic consistency, and delivers up to 58x speed-up over existing methods such as MGLD-VSR. Code will be available at this https URL.
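The streaming, rolling-cache idea can be sketched independently of the diffusion model itself. The snippet below shows only the chunked bookkeeping: a fixed-size window of cached context conditions each new chunk so the stream can be arbitrarily long. The cache contents, window size, and the `one_step_sr` callable are placeholders for illustration; in the actual method the cache holds per-layer attention keys/values of a causal DiT distilled to one denoising step.

```python
from collections import deque
import torch

class RollingKVCache:
    """Keep context for only the most recent chunks (a sketch of the rolling
    cache; window size and stored shapes are illustrative assumptions)."""

    def __init__(self, max_chunks=4):
        self.keys = deque(maxlen=max_chunks)
        self.values = deque(maxlen=max_chunks)

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def context(self):
        if not self.keys:
            return None, None
        return torch.cat(list(self.keys), dim=1), torch.cat(list(self.values), dim=1)

def stream_super_resolve(frames, chunk_size=8, one_step_sr=None):
    """Process an unbounded frame stream chunk by chunk, conditioning each chunk
    on cached context so the output stays temporally coherent."""
    cache = RollingKVCache()
    outputs = []
    for start in range(0, frames.shape[0], chunk_size):
        chunk = frames[start:start + chunk_size]
        k_ctx, v_ctx = cache.context()
        # stand-in for a single denoising step of a causal video diffusion model
        sr_chunk = one_step_sr(chunk, k_ctx, v_ctx) if one_step_sr else chunk
        # store flattened chunk features as a placeholder for real KV states
        cache.append(sr_chunk.flatten(1).unsqueeze(0), sr_chunk.flatten(1).unsqueeze(0))
        outputs.append(sr_chunk)
    return torch.cat(outputs, dim=0)

# toy usage: 30 low-resolution frames of shape (3, 64, 64)
video = torch.rand(30, 3, 64, 64)
print(stream_super_resolve(video).shape)  # torch.Size([30, 3, 64, 64])
```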
We present MobileVLM, a competent multimodal vision language model (MMVLM) targeted to run on mobile devices. It brings together a set of mobile-oriented architectural designs and techniques: language models at the 1.4B and 2.7B parameter scale trained from scratch, a multimodal vision model pre-trained in the CLIP fashion, and cross-modality interaction via an efficient projector. We evaluate MobileVLM on several typical VLM benchmarks. Our models demonstrate on-par performance compared with a few much larger models. More importantly, we measure the inference speed on both a Qualcomm Snapdragon 888 CPU and an NVIDIA Jetson Orin GPU, and we obtain state-of-the-art speeds of 21.5 tokens per second and 65.3 tokens per second, respectively. Our code will be made available at: this https URL.
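As a rough picture of the "efficient projector" idea, the sketch below reduces the number of CLIP patch tokens with a strided depthwise convolution before mapping them to the language-model width. It is a simplified stand-in, not MobileVLM's actual projector design; all dimensions and the 2x spatial downsampling are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SimpleDownsampleProjector(nn.Module):
    """Simplified stand-in for an efficient vision-to-LLM projector: fewer
    tokens via a strided depthwise conv, then a width change to the LLM size."""

    def __init__(self, vision_dim=1024, llm_dim=2560, grid=24):
        super().__init__()
        self.grid = grid
        self.pointwise = nn.Linear(vision_dim, llm_dim)
        # depthwise conv with stride 2 quarters the token count (24x24 -> 12x12)
        self.downsample = nn.Conv2d(llm_dim, llm_dim, kernel_size=3, stride=2,
                                    padding=1, groups=llm_dim)

    def forward(self, patch_tokens):
        # patch_tokens: (B, grid*grid, vision_dim) from the CLIP vision tower
        b, n, _ = patch_tokens.shape
        x = self.pointwise(patch_tokens)                       # (B, N, llm_dim)
        x = x.transpose(1, 2).reshape(b, -1, self.grid, self.grid)
        x = self.downsample(x)                                 # (B, llm_dim, 12, 12)
        return x.flatten(2).transpose(1, 2)                    # (B, 144, llm_dim)

tokens = SimpleDownsampleProjector()(torch.randn(1, 576, 1024))
print(tokens.shape)  # torch.Size([1, 144, 2560])
```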
TimeMarker from Meituan Inc. is a versatile Video-LLM that enhances video understanding across varying lengths and achieves superior temporal localization by integrating explicit Temporal Separator Tokens and an AnyLength mechanism. It demonstrated state-of-the-art performance, outperforming specialized supervised models on benchmarks like Charades-STA and ActivityNet Captions for temporal grounding.
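The Temporal Separator Token idea can be illustrated with a short, tool-free sketch: a timestamp marker is interleaved before each frame's visual tokens so the LLM sees explicit temporal anchors. The separator string format and placeholder visual tokens below are assumptions for illustration, not TimeMarker's exact token scheme or its AnyLength mechanism.

```python
def interleave_temporal_separators(frame_tokens, fps=1.0, sep_fmt="<t={sec:.1f}s>"):
    """Interleave a timestamp separator before each frame's visual tokens.

    frame_tokens: list of per-frame token lists (placeholder visual tokens here).
    Returns a single flat sequence of separators and frame tokens."""
    sequence = []
    for i, tokens in enumerate(frame_tokens):
        sequence.append(sep_fmt.format(sec=i / fps))  # explicit temporal anchor
        sequence.extend(tokens)
    return sequence

# toy usage: three frames, each contributing two placeholder visual tokens
frames = [["<v0>", "<v1>"], ["<v0>", "<v1>"], ["<v0>", "<v1>"]]
print(interleave_temporal_separators(frames, fps=2.0))
# ['<t=0.0s>', '<v0>', '<v1>', '<t=0.5s>', '<v0>', '<v1>', '<t=1.0s>', '<v0>', '<v1>']
```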
Meituan researchers developed RoboTron-Nav, a unified framework for language-guided visual navigation that integrates perception, planning, and prediction through multitask learning and adaptive memory. The framework achieved an 81.1% success rate on the challenging CHORES-S ObjectNav benchmark, representing a 9% absolute improvement over prior state-of-the-art methods.
Recently, robotics has advanced significantly through the integration of larger models and large-scale datasets. However, challenges remain in applying these models to 3D spatial interactions and managing data collection costs. To address these issues, we propose the multimodal robotic manipulation model RoboTron-Mani and the comprehensive dataset RoboData. RoboTron-Mani enhances 3D perception through camera parameters and occupancy supervision, and further incorporates a Modality-Isolation-Mask and multimodal decoder blocks based on OpenFlamingo, improving modality fusion and fine-grained perception. RoboData integrates several publicly available datasets, achieving the first fusion of multi-view images, camera parameters, depth maps, actions, and space alignment, which facilitates comprehensive learning from diverse robotic datasets and offers a complete evaluation system. Trained on RoboData, RoboTron-Mani is the first generalist policy that surpasses expert models, enabling simultaneous evaluation of all tasks across multiple datasets rather than being limited to specific data or task selections. Specifically, RoboTron-Mani boosts manipulation performance by increasing the average sequence length on CALVIN from 1.7 to 3.5, enables cross-embodiment generalization, and achieves state-of-the-art results on both simulated and real-world datasets.
GRAM introduces a Generative Foundation Reward Model that leverages a two-stage training approach with unlabeled and labeled data, significantly improving reward model generalization and reducing reliance on extensive human preference annotations for LLM alignment. The model demonstrates superior out-of-distribution performance and efficient adaptation to new tasks.
Researchers from Eastern Institute of Technology, Tencent, and collaborators develop SkipGPT, a dynamic layer pruning framework that introduces token-aware global routing and decoupled pruning policies for MLP versus self-attention modules. SkipGPT achieves over 40% parameter reduction while maintaining or exceeding original model performance through a two-stage training paradigm that first tunes lightweight routers (0.01% of parameters) and then applies LoRA fine-tuning. SkipGPT-RT retains over 90% of performance on LLaMA2-7B/13B at 25% pruning and over 95% on LLaMA3.1-8B, outperforming static methods like ShortGPT and dynamic approaches like MoD-D across commonsense reasoning benchmarks. Analysis of routing behavior reveals that attention modules exhibit higher redundancy than MLPs and that computational needs shift contextually, with later tokens requiring more attention but less MLP processing, challenging the fixed 1:1 attention-MLP architecture design. The study also shows that jointly training routers with pre-trained parameters causes instability compared to the stable, disentangled two-stage optimization.
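The decoupled, token-aware routing can be sketched as a tiny gate that makes independent keep/skip decisions for the attention and MLP sub-layers of each token. The sketch below is illustrative only: the router architecture, hard thresholding, and placeholder sub-layers are assumptions, it does not reproduce SkipGPT's two-stage training or LoRA recovery, and a real implementation would gather only the kept tokens instead of zeroing outputs so that compute is actually saved.

```python
import torch
import torch.nn as nn

class TokenRouter(nn.Module):
    """Lightweight per-token router producing separate keep/skip decisions for
    the attention and MLP sub-layers (a sketch of decoupled routing)."""

    def __init__(self, hidden_dim):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, 2)   # logits: [keep_attn, keep_mlp]

    def forward(self, x):
        # x: (B, T, hidden_dim); hard decisions here for clarity, whereas a
        # trained router would use a differentiable relaxation.
        return torch.sigmoid(self.gate(x)) > 0.5     # (B, T, 2) booleans

def routed_block(x, attn, mlp, router):
    """Apply attention and MLP only where the router keeps them; skipped tokens
    pass through the residual connection unchanged."""
    keep = router(x)
    attn_out = attn(x)
    x = x + torch.where(keep[..., 0:1], attn_out, torch.zeros_like(attn_out))
    mlp_out = mlp(x)
    x = x + torch.where(keep[..., 1:2], mlp_out, torch.zeros_like(mlp_out))
    return x

# toy usage with stand-in sub-layers
dim = 64
x = torch.randn(2, 10, dim)
attn = nn.Linear(dim, dim)   # placeholder for a self-attention module
mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
print(routed_block(x, attn, mlp, TokenRouter(dim)).shape)  # torch.Size([2, 10, 64])
```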
This work demonstrates that Reinforcement Learning from Human Feedback (RLHF) for LLM alignment can be framed as an imitation learning problem on preference data distributions. This insight leads to Direct Imitation Learning (DIL), an algorithm that improves performance across various benchmarks, including the UltraFeedback Binarized and Anthropic Helpful-Harmless datasets, by directly optimizing the imitation objective.
Researchers developed VisiPruner, a training-free framework that leverages a new understanding of how multimodal large language models process visual information discontinuously across layers. This approach significantly reduces computational overhead by up to 99% in visual-related attention and 53.9% in total FLOPs while preserving performance across various models and benchmarks.
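The savings described above rest on a simple operation: once the layers that actually consume visual information have run, the visual tokens can be dropped from the sequence so later layers attend only to text. The sketch below shows that token-dropping step in isolation; the cutoff layer and the way visual positions are identified are illustrative assumptions, not VisiPruner's actual per-layer schedule.

```python
import torch

def drop_visual_tokens(hidden_states, visual_mask, layer_idx, cutoff_layer=12):
    """After `cutoff_layer`, remove visual-token positions from the sequence so
    later layers skip visual-related attention (an illustrative sketch).

    hidden_states: (B, T, D) activations entering layer `layer_idx`.
    visual_mask:   (T,) boolean, True where the position holds a visual token.
    Returns possibly shortened hidden states and the updated mask."""
    if layer_idx < cutoff_layer:
        return hidden_states, visual_mask
    keep = ~visual_mask
    return hidden_states[:, keep, :], visual_mask[keep]

# toy usage: 576 visual tokens followed by 32 text tokens
h = torch.randn(1, 608, 4096)
mask = torch.cat([torch.ones(576, dtype=torch.bool), torch.zeros(32, dtype=torch.bool)])
h, mask = drop_visual_tokens(h, mask, layer_idx=20)
print(h.shape)  # torch.Size([1, 32, 4096])
```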
Researchers from the Chinese Academy of Sciences and Meituan introduced NeedleInATable (NIAT), the first long-context tabular benchmark to evaluate large language models' (LLMs) fine-grained understanding of individual cells within structured tables. This work revealed significant performance gaps in locating cells by index and a "lost-in-the-middle-table" phenomenon, while a proposed data synthesis method improved models' average accuracy by 14.55% on downstream tabular benchmarks.