SenseTime
StepSearch introduces a Step-wise Proximal Policy Optimization (StePPO) framework to improve Large Language Models' (LLMs) multi-hop reasoning and search capabilities through fine-grained, token-level supervision, incorporating information gain and redundancy penalties. This approach yields state-of-the-art performance on various multi-hop question answering benchmarks, showing particular effectiveness for smaller models and robust generalization across diverse knowledge bases.
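To make the reward design concrete, here is a minimal sketch of a step-wise search reward of the kind StepSearch describes: each retrieval step is scored by the information gain of newly retrieved evidence minus a penalty for redundancy with what was already gathered. The function names, the token-F1 similarity, and the penalty weight are illustrative assumptions, not the paper's implementation.

```python
# Sketch of a step-wise retrieval reward: information gain minus redundancy.
# All names and constants are illustrative, not StepSearch's actual code.

def f1_overlap(tokens_a: set[str], tokens_b: set[str]) -> float:
    """Token-level F1 between two bags of tokens."""
    if not tokens_a or not tokens_b:
        return 0.0
    common = len(tokens_a & tokens_b)
    if common == 0:
        return 0.0
    precision = common / len(tokens_b)
    recall = common / len(tokens_a)
    return 2 * precision * recall / (precision + recall)

def step_reward(gold_evidence: set[str], retrieved: set[str],
                memory: set[str], redundancy_weight: float = 0.5) -> float:
    """Reward one search step.

    gold_evidence: tokens of the ground-truth supporting passages
    retrieved:     tokens returned by this step's query
    memory:        tokens accumulated from earlier steps
    """
    # Information gain: how much closer the accumulated evidence gets to gold.
    gain = (f1_overlap(gold_evidence, memory | retrieved)
            - f1_overlap(gold_evidence, memory))
    # Redundancy penalty: overlap of the new passages with what we already have.
    redundancy = f1_overlap(memory, retrieved)
    return gain - redundancy_weight * redundancy
```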
Researchers from HKU MMLab, CUHK MMLab, SenseTime, and Beihang University develop GoT-R1, a framework that applies reinforcement learning to Generation Chain-of-Thought reasoning for visual generation, enabling multimodal large language models to autonomously discover effective reasoning strategies beyond predefined templates. It achieves state-of-the-art performance on T2I-CompBench through a dual-stage training process that combines supervised fine-tuning with Group Relative Policy Optimization, guided by a multi-dimensional reward system that evaluates both reasoning chains and generated images.
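Group Relative Policy Optimization, the algorithm behind GoT-R1's second stage, replaces a learned value critic with group statistics: several responses are sampled per prompt, and each is credited relative to its group. A minimal sketch of the group-relative advantage computation (the reward values are dummy data):

```python
import numpy as np

def grpo_advantages(group_rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """GRPO baseline: each sampled response's advantage is its reward
    standardized against the group mean and std, removing the need for
    a learned critic."""
    mean = group_rewards.mean()
    std = group_rewards.std()
    return (group_rewards - mean) / (std + eps)

# Example: rewards for four reasoning chains / images sampled for one prompt.
rewards = np.array([0.2, 0.9, 0.5, 0.4])
print(grpo_advantages(rewards))  # positive for above-average samples
```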
A unified framework integrates reasoning capabilities into visual generation and editing by combining multimodal language models with a semantic-spatial guidance module, demonstrating superior performance on GenEval and Emu-Edit benchmarks while enabling interactive control through explicit reasoning chains.
The MATH-Vision (MATH-V) dataset provides a new benchmark for evaluating multimodal mathematical reasoning in LMMs, revealing a substantial performance gap between leading models (GPT-4V achieving 22.76%) and human performance (75.66%) on diverse, competition-derived problems. This highlights current LMM limitations primarily in reasoning and vision recognition, rather than basic calculation or knowledge.
MOOM, a dual-branch memory system, draws on literary theory to manage long-term memory in ultra-long role-playing dialogues for Large Language Models. It extracts plot development and character portrayal in parallel and employs a competition-inhibition forgetting mechanism, outperforming state-of-the-art methods in memory accuracy while consuming fewer computational resources.
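As a rough intuition for a competition-inhibition forgetting rule, here is a toy sketch: memory items decay over time, similar items compete (the weaker of an overlapping pair is inhibited), and only a fixed capacity survives. The class, similarity test, and constants are invented for illustration and are not MOOM's mechanism.

```python
class MemoryItem:
    def __init__(self, text: str, strength: float = 1.0):
        self.text = text
        self.strength = strength

def forget_step(memory: list[MemoryItem], capacity: int,
                decay: float = 0.95, inhibition: float = 0.05) -> list[MemoryItem]:
    """One toy forgetting pass: every item's strength decays, overlapping
    items compete (the weaker one is inhibited), and only the `capacity`
    strongest items survive."""
    for item in memory:
        item.strength *= decay
    for i, a in enumerate(memory):
        for b in memory[i + 1:]:
            # Crude lexical-overlap similarity; MOOM's notion is semantic.
            overlap = len(set(a.text.split()) & set(b.text.split()))
            if overlap >= 3:
                weaker = a if a.strength < b.strength else b
                weaker.strength -= inhibition * overlap
    survivors = sorted(memory, key=lambda m: m.strength, reverse=True)[:capacity]
    return [m for m in survivors if m.strength > 0]
```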
UniAD presents a single, unified framework for multi-class anomaly detection that effectively prevents models from learning trivial identity mappings. It achieved a mean AUROC of 96.5% for anomaly detection and 96.8% for localization on MVTec-AD, outperforming previous state-of-the-art methods by significant margins while maintaining performance across diverse object categories.
Researchers at SenseTime and NTU investigated the impact of data and model scaling for expressive human pose and shape estimation (EHPS), creating generalist foundation models like SMPLest-X. These models, trained on up to 40 diverse datasets and 10 million instances, significantly reduced whole-body mean primary errors from over 110 mm to below 60 mm and hand primary errors from over 62 mm to 31 mm, establishing new state-of-the-art performance and strong generalization across benchmarks.
MotionDiffuse introduces the first framework to apply Denoising Diffusion Probabilistic Models (DDPMs) for text-driven human motion generation. The model achieves state-of-the-art performance in synthesizing high-fidelity, diverse, and semantically relevant motions, while also offering novel fine-grained control over specific body parts and temporal action sequences.
INCHARACTER introduces an interview-based methodology for evaluating the personality fidelity of Role-Playing Agents (RPAs), adapting psychological assessment techniques for AI. This approach demonstrates that state-of-the-art RPAs can reproduce complex character personalities with up to 80.7% accuracy on scales like 16Personalities, offering a more robust and consistent evaluation than self-report methods.
This research provides a practical evaluation of KV cache compression techniques for Large Language Model serving, revealing that reported memory savings often do not translate to real-world throughput gains due to increased output length and compatibility issues with optimized serving frameworks. The work proposes tools to predict throughput, output length, and identify inputs prone to quality degradation, bridging the gap between academic research and production deployment.
3D human motion generation is crucial for the creative industry. Recent advances rely on generative models with domain knowledge for text-driven motion generation, leading to substantial progress in capturing common motions. However, the performance on more diverse motions remains unsatisfactory. In this work, we propose ReMoDiffuse, a diffusion-model-based motion generation framework that integrates a retrieval mechanism to refine the denoising process. ReMoDiffuse enhances the generalizability and diversity of text-driven motion generation with three key designs: 1) Hybrid Retrieval finds appropriate references from the database in terms of both semantic and kinematic similarities. 2) Semantic-Modulated Transformer selectively absorbs retrieval knowledge, adapting to the difference between retrieved samples and the target motion sequence. 3) Condition Mixture better utilizes the retrieval database during inference, overcoming the scale sensitivity in classifier-free guidance. Extensive experiments demonstrate that ReMoDiffuse outperforms state-of-the-art methods by balancing both text-motion consistency and motion quality, especially for more diverse motion generation.
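The Condition Mixture design targets the scale sensitivity of classifier-free guidance, which extrapolates between unconditional and conditional predictions with one fixed scale. A rough sketch of the contrast, with a normalized multi-branch mixture standing in for the paper's formulation (the three-branch split and softmax weighting are assumptions):

```python
import torch

def classifier_free_guidance(eps_uncond: torch.Tensor,
                             eps_text: torch.Tensor,
                             scale: float = 7.5) -> torch.Tensor:
    """Standard CFG: extrapolate from the unconditional toward the
    text-conditioned prediction with a fixed, sensitive guidance scale."""
    return eps_uncond + scale * (eps_text - eps_uncond)

def condition_mixture(eps_uncond: torch.Tensor,
                      eps_text: torch.Tensor,
                      eps_retrieval: torch.Tensor,
                      w: torch.Tensor) -> torch.Tensor:
    """Sketch of a condition-mixture step: blend unconditional, text-conditioned,
    and retrieval-conditioned predictions with adjustable weights instead of one
    fixed scale; the softmax keeps the mixture well-scaled."""
    weights = torch.softmax(w, dim=0)
    return (weights[0] * eps_uncond
            + weights[1] * eps_text
            + weights[2] * eps_retrieval)

# Example with dummy predictions for a 60-frame, 263-dim motion representation.
eps_u, eps_t, eps_r = (torch.randn(60, 263) for _ in range(3))
out = condition_mixture(eps_u, eps_t, eps_r, torch.tensor([0.1, 1.0, 0.5]))
```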
Researchers from CUHK MMLab, HKU MMLab, SenseTime, Shanghai AI Laboratory, and Tsinghua University developed PUMA, a unified Multimodal Large Language Model that handles multi-granular visual generation. PUMA addresses the diversity-controllability trade-off in visual generation by processing and generating visual features at multiple scales, achieving superior image reconstruction and competitive performance across text-to-image, image editing, and understanding tasks compared to existing models.
The ARISE framework enhances Large Language Model reasoning in complex, knowledge-intensive tasks by integrating dynamic Retrieval-Augmented Generation with Monte Carlo Tree Search guided by Bayesian risk assessment. It consistently outperforms state-of-the-art knowledge-augmented reasoning methods, showing up to a 23.10% exact-match (EM) accuracy improvement over vanilla RAG on challenging multi-hop question answering benchmarks.
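One plausible reading of "MCTS guided by Bayesian risk assessment" is a UCT-style selection score with a risk penalty; the sketch below illustrates that idea only, and the exact ARISE scoring rule may differ. All names and constants are placeholders.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    visits: int = 0
    value_sum: float = 0.0
    risk: float = 0.0  # e.g., posterior probability the branch's evidence is unreliable
    children: list["Node"] = field(default_factory=list)

def select_child(parent: Node, c_explore: float = 1.4, c_risk: float = 1.0) -> Node:
    """UCT selection augmented with a risk penalty: exploit + explore - risk."""
    def score(child: Node) -> float:
        if child.visits == 0:
            return float("inf")  # expand unvisited children first
        exploit = child.value_sum / child.visits
        explore = c_explore * math.sqrt(
            math.log(max(parent.visits, 1)) / child.visits)
        return exploit + explore - c_risk * child.risk
    return max(parent.children, key=score)
```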
The exploration and application of Large Language Models (LLMs) is thriving. To reduce deployment costs, continuous batching has become an essential feature in current service frameworks. The effectiveness of continuous batching relies on an accurate estimate of the memory requirements of requests. However, due to the diversity in request output lengths, existing frameworks tend to adopt aggressive or conservative schedulers, which often result in significant overestimation or underestimation of memory consumption. Consequently, they suffer from harmful request evictions or prolonged queuing times, failing to achieve satisfactory throughput under strict Service Level Agreement (SLA) guarantees (a.k.a. goodput), across various LLM application scenarios with differing input-output length distributions. To address this issue, we propose a novel Past-Future scheduler that precisely estimates the peak memory resources required by the running batch via considering the historical distribution of request output lengths and calculating memory occupancy at each future time point. It adapts to applications with all types of input-output length distributions, balancing the trade-off between request queuing and harmful evictions, thereby consistently achieving better goodput. Furthermore, to validate the effectiveness of the proposed scheduler, we developed a high-performance LLM serving framework, LightLLM, that implements the Past-Future scheduler. Compared to existing aggressive or conservative schedulers, LightLLM demonstrates superior goodput, achieving up to 2-3× higher goodput than other schedulers under heavy loads. LightLLM is open source to boost the research in such direction (this https URL).
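A simplified sketch of the Past-Future idea: predict each running request's remaining output length from the historical distribution, then sweep future time points and take the largest total KV-cache occupancy, letting finished requests free their memory. The quantile predictor, token-count units, and coarse time step are assumptions; the paper's estimator is more precise.

```python
import numpy as np

def peak_memory_estimate(prompt_lens: list[int],
                         generated_lens: list[int],
                         hist_output_lens: np.ndarray,
                         quantile: float = 0.99,
                         step: int = 64) -> int:
    """Estimate the peak KV-cache footprint (in tokens) of the running batch.
    Each request's remaining output is predicted from the historical
    output-length distribution; occupancy is then evaluated at future time
    points, with finished requests freeing their memory."""
    predicted_total = int(np.quantile(hist_output_lens, quantile))
    remaining = [max(predicted_total - g, 0) for g in generated_lens]
    horizon = max(remaining, default=0)
    peak = 0
    for t in range(0, horizon + 1, step):
        occupancy = sum(
            p + g + min(t, r)  # tokens held by this request at future time t
            for p, g, r in zip(prompt_lens, generated_lens, remaining)
            if r >= t          # requests finishing before t have freed their cache
        )
        peak = max(peak, occupancy)
    return peak
```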
Hi-Drive introduces a hierarchical POMDP planning framework that jointly models and infers high-level behavioral intentions and low-level driving styles of other road users in urban environments. This model-based approach demonstrates superior safety with a 0% collision rate on challenging long-term datasets and competitive real-time performance on benchmarks like nuPlan without requiring extensive training.
Researchers at SenseTime developed Adversarial Diffusion Tuning (ADT), a unified fine-tuning framework that addresses the training-inference discrepancy in diffusion models by simulating the inference process during training and employing adversarial supervision. This approach leads to higher-quality image generation, demonstrated by improvements in metrics such as FID, HPSv2, and Aesthetics scores across various model architectures.
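A schematic of the training loop ADT describes, under the assumption that "simulating the inference process" means unrolling the sampler during training and "adversarial supervision" means a discriminator loss on the final sample; all modules below are placeholders, not SenseTime's implementation.

```python
import torch
import torch.nn.functional as F

def adt_generator_loss(denoiser, discriminator, timesteps, x_real, cond):
    """Schematic ADT-style objective: simulate the inference-time sampling
    trajectory during training, then supervise the final sample adversarially
    against real images, so training and inference see the same process."""
    x = torch.randn_like(x_real)   # start from pure noise, as at inference
    for t in timesteps:            # simulated inference loop
        x = denoiser(x, t, cond)   # placeholder: one denoising update per step
    logits_fake = discriminator(x)
    # Non-saturating generator loss: push the final sample toward "real".
    return F.binary_cross_entropy_with_logits(
        logits_fake, torch.ones_like(logits_fake))
```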
Sparsity in Deep Neural Networks (DNNs) has been widely studied to compress and accelerate models in resource-constrained environments. It can be generally categorized into unstructured fine-grained sparsity that zeroes out multiple individual weights distributed across the neural network, and structured coarse-grained sparsity which prunes blocks of sub-networks of a neural network. Fine-grained sparsity can achieve a high compression ratio but is not hardware friendly and hence receives limited speed gains. On the other hand, coarse-grained sparsity cannot concurrently achieve both apparent acceleration on modern GPUs and decent performance. In this paper, we are the first to study training an N:M fine-grained structured sparse network from scratch, which can maintain the advantages of both unstructured fine-grained sparsity and structured coarse-grained sparsity simultaneously on specifically designed GPUs. Specifically, a 2:4 sparse network could achieve 2x speed-up without performance drop on Nvidia A100 GPUs. Furthermore, we propose a novel and effective ingredient, sparse-refined straight-through estimator (SR-STE), to alleviate the negative influence of the approximated gradients computed by vanilla STE during optimization. We also define a metric, Sparse Architecture Divergence (SAD), to measure the sparse network's topology change during the training process. Finally, we justify SR-STE's advantages with SAD and demonstrate the effectiveness of SR-STE by performing comprehensive experiments on various tasks. Source codes and models are available at this https URL.
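Both ingredients are easy to sketch: an N:M mask keeps the N largest-magnitude weights in every group of M (2:4 is the pattern A100 sparse tensor cores accelerate), and SR-STE augments the straight-through gradient with a decay term on the currently pruned weights. A compact sketch following the abstract's description; the decay constant is illustrative, and the mask assumes the weight count is divisible by M.

```python
import torch

def nm_mask(weight: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """N:M mask: in every group of M consecutive weights along the last dim,
    keep the N with the largest magnitude. Assumes weight.numel() % m == 0."""
    w = weight.reshape(-1, m)
    idx = w.abs().topk(n, dim=1).indices
    mask = torch.zeros_like(w)
    mask.scatter_(1, idx, 1.0)
    return mask.reshape_as(weight)

class SRSTEWeight(torch.autograd.Function):
    """Sparse-refined STE, as sketched from the paper's description: the
    forward pass uses the masked weight; the backward pass lets the dense
    gradient through (vanilla STE) and adds a decay term that pushes the
    pruned weights toward zero, stabilizing the sparse topology."""
    @staticmethod
    def forward(ctx, weight, mask, decay=2e-4):
        ctx.save_for_backward(weight, mask)
        ctx.decay = decay
        return weight * mask
    @staticmethod
    def backward(ctx, grad_out):
        weight, mask = ctx.saved_tensors
        # STE gradient + decay on pruned (mask == 0) weights only.
        return grad_out + ctx.decay * (1 - mask) * weight, None, None

# Usage in a forward pass: w_sparse = SRSTEWeight.apply(weight, nm_mask(weight))
```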
Table recognition (TR) aims to transform table images into semi-structured representations such as HTML or Markdown. As a core component of document parsing, TR has long relied on supervised learning, with recent efforts dominated by fine-tuning vision-language models (VLMs) using labeled data. While VLMs have brought TR to the next level, pushing performance further demands large-scale labeled data that is costly to obtain. Consequently, although proprietary models have continuously pushed the performance boundary, open-source models, which are often trained with limited resources and are in practice the only viable option for many due to privacy regulations, still lag far behind. To bridge this gap, we introduce TRivia, a self-supervised fine-tuning method that enables pretrained VLMs to learn TR directly from unlabeled table images in the wild. Built upon Group Relative Policy Optimization, TRivia automatically identifies the unlabeled samples that most effectively facilitate learning and eliminates the need for human annotations through a question-answering-based reward mechanism. An attention-guided module generates diverse questions for each table image, and the ability to interpret the recognition results and answer them correctly provides feedback to optimize the TR model. This closed-loop process allows the TR model to autonomously learn to recognize, structure, and reason over tables without labeled data. Leveraging this pipeline, we present TRivia-3B, an open-source, compact, state-of-the-art TR model that surpasses existing systems (e.g., Gemini 2.5 Pro, MinerU2.5) on three popular benchmarks. Model and code are released at: this https URL
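The question-answering reward loop can be sketched as follows: questions (with answers) are derived from the table image itself, and a candidate recognition earns reward in proportion to how many of those questions can be answered correctly from the recognized HTML alone. All three callables and the QAPair type are stand-ins, not TRivia's actual modules.

```python
from typing import Callable, NamedTuple

class QAPair(NamedTuple):
    question: str
    answer: str

def qa_reward(table_image, tr_model: Callable, question_gen: Callable,
              qa_model: Callable) -> float:
    """Sketch of a QA-based reward for self-supervised table recognition:
    no ground-truth HTML is needed, only agreement between image-derived
    answers and answers recoverable from the recognized structure."""
    html = tr_model(table_image)        # candidate recognition result
    pairs = question_gen(table_image)   # attention-guided (question, answer) pairs
    if not pairs:
        return 0.0
    correct = sum(
        qa_model(question=p.question, context=html).strip() == p.answer.strip()
        for p in pairs
    )
    return correct / len(pairs)         # reward in [0, 1], usable by GRPO
```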
Optimizing a text-to-image diffusion model with a given reward function is an important but underexplored research area. In this study, we propose Deep Reward Tuning (DRTune), an algorithm that directly supervises the final output image of a text-to-image diffusion model and back-propagates through the iterative sampling process to the input noise. We find that training earlier steps in the sampling process is crucial for low-level rewards, and that deep supervision can be achieved efficiently and effectively by stopping the gradient of the denoising network input. DRTune is extensively evaluated on various reward models. It consistently outperforms other algorithms, particularly for low-level control signals, where all shallow supervision methods fail. Additionally, we fine-tune the Stable Diffusion XL 1.0 (SDXL 1.0) model via DRTune to optimize Human Preference Score v2.1, resulting in the Favorable Diffusion XL 1.0 (FDXL 1.0) model. FDXL 1.0 significantly enhances image quality compared to SDXL 1.0 and reaches quality comparable to Midjourney v5.2.
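The stop-gradient trick is compact enough to sketch directly: detach the denoiser's input at every sampling step, so a reward on the final image sends gradients into the network parameters at each step without back-propagating through the whole chain of latents. The update rule below is a placeholder Euler-like step, not a specific scheduler.

```python
import torch

def drtune_sample(denoiser, steps, x_T, cond):
    """DRTune-style sampling for deep reward supervision: the *input* to the
    denoiser is detached at every step, which keeps deep supervision cheap
    while still training the network at each step of the trajectory."""
    x = x_T
    for t in steps:
        eps = denoiser(x.detach(), t, cond)  # stop-gradient on the network input
        x = x - eps                          # placeholder update rule
    return x

# Training then looks like:
#   reward_model(drtune_sample(...)).mean().neg().backward()
```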
Researchers from the CUHK-SenseTime Joint Lab developed FCOS3D, which adapts a 2D anchor-free detector to monocular 3D object detection, inferring 3D properties solely from RGB images. The framework achieved first place among all vision-only methods on the nuScenes benchmark with an mAP of 0.358 and an NDS of 0.428, offering a cost-effective solution for 3D perception.