alphaXiv

Institute of Artificial Intelligence (TeleAI) China Telecom

334

27 Oct 2025

computer-science artificial-intelligence robotics

KungfuBot: Physics-Based Humanoid Whole-Body Control for Learning Highly-Dynamic Skills

Shanghai Jiao Tong University Harbin Institute of Technology East China University of Science and Technology ShanghaiTech University China Telecom

KungfuBot enables humanoid robots to learn and execute highly-dynamic human skills like martial arts and dancing by integrating a physics-based motion processing pipeline with an adaptive motion tracking mechanism. This approach allows zero-shot transfer to real robots, demonstrating superior tracking performance with a global mean per body position error of 53.25mm on easy motions, and robustly executing complex maneuvers on a Unitree G1 robot.

106

09 Oct 2025

adversarial-robustness computer-science computer-vision-and-pattern-recognition

SaFeR-VLM: Toward Safety-aware Fine-grained Reasoning in Multimodal Models

Sichuan University

Tsinghua University The Hong Kong University of Science and Technology (Guangzhou)

Nanyang Technological University China Telecom Beijing University of Posts and Telecommunications TeleAI West China Hospital West China Biomedical Big Data Center

Multimodal Large Reasoning Models (MLRMs) demonstrate impressive cross-modal reasoning but often amplify safety risks under adversarial or unsafe prompts, a phenomenon we call the Reasoning Tax. Existing defenses mainly act at the output level and do not constrain the reasoning process, leaving models exposed to implicit risks. In this paper, we propose SaFeR-VLM, a safety-aligned reinforcement learning framework that embeds safety directly into multimodal reasoning. The framework integrates four components: (I) QI-Safe-10K, a curated dataset emphasizing safety-critical and reasoning-sensitive cases; (II) safety-aware rollout, where unsafe generations undergo reflection and correction instead of being discarded; (III) structured reward modeling with multi-dimensional weighted criteria and explicit penalties for hallucinations and contradictions; and (IV) GRPO optimization, which reinforces both safe and corrected trajectories. This unified design shifts safety from a passive safeguard to an active driver of reasoning, enabling scalable and generalizable safety-aware reasoning. SaFeR-VLM further demonstrates robustness against both explicit and implicit risks, supporting dynamic and interpretable safety decisions beyond surface-level filtering. SaFeR-VLM-3B achieves average performance of 70.13 and 78.97 on safety and helpfulness across six benchmarks, surpassing both same-scale and >10× larger models such as Skywork-R1V3-38B, Qwen2.5VL-72B, and GLM4.5V-106B. Remarkably, SaFeR-VLM-7B benefits from its increased scale to surpass GPT-5-mini and Gemini-2.5-Flash by 6.47 and 16.76 points respectively on safety metrics, achieving this improvement without any degradation in helpfulness performance. Our codes are available at this https URL.
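
The GRPO step named in the abstract can be illustrated with a minimal sketch of group-relative advantage computation: each rollout's reward is normalized against the other rollouts sampled for the same prompt. This is not the SaFeR-VLM implementation; the function name, the reward values, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts for the same prompt.

    Hypothetical helper: advantage_i = (r_i - mean) / (std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical scalar rewards for four rollouts of a single prompt.
rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(rewards)
```

Rollouts scoring above the group mean receive positive advantages and are reinforced; those below the mean are pushed down. A group with identical rewards yields all-zero advantages.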

135

05 Sep 2025

computer-science sound

WenetSpeech-Yue: A Large-scale Cantonese Speech Corpus with Multi-dimensional Annotation

Northwestern Polytechnical University China Telecom

HKUST Beijing AISHELL Technology Co., Ltd.

WenetSpeech-Yue introduces the largest open-source Cantonese speech corpus, containing over 21,800 hours of multi-dimensionally annotated audio, along with comprehensive evaluation benchmarks. This resource enables the development of state-of-the-art Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) models for Cantonese, significantly advancing speech technology for the dialect.

169

29 Sep 2025

computer-science sound audio-and-speech-processing

Word-Level Emotional Expression Control in Zero-Shot Text-to-Speech Synthesis

Tianjin University

Shanghai Jiao Tong University

Nanyang Technological University China Telecom Kuaishou Technology Shenzhen Institute of Advanced Technology Huiyan Technology (Tianjin)

While emotional text-to-speech (TTS) has made significant progress, most existing research remains limited to utterance-level emotional expression and fails to support word-level control. Achieving word-level expressive control poses fundamental challenges, primarily due to the complexity of modeling multi-emotion transitions and the scarcity of annotated datasets that capture intra-sentence emotional and prosodic variation. In this paper, we propose WeSCon, the first self-training framework that enables word-level control of both emotion and speaking rate in a pretrained zero-shot TTS model, without relying on datasets containing intra-sentence emotion or speed transitions. Our method introduces a transition-smoothing strategy and a dynamic speed control mechanism to guide the pretrained TTS model in performing word-level expressive synthesis through a multi-round inference process. To further simplify the inference, we incorporate a dynamic emotional attention bias mechanism and fine-tune the model via self-training, thereby activating its ability for word-level expressive control in an end-to-end manner. Experimental results show that WeSCon effectively overcomes data scarcity, achieving state-of-the-art performance in word-level emotional expression control while preserving the strong zero-shot synthesis capabilities of the original TTS model.

21 Nov 2025

computer-science computer-vision-and-pattern-recognition generative-models

UniModel: A Visual-Only Framework for Unified Multimodal Understanding and Generation

China Telecom Institute of Artificial Intelligence (TeleAI), China Telecom Institute of Artificial Intelligence Institute of Artificial Intelligence (TeleAI)

UniModel introduces a visual-only framework that unifies multimodal understanding and generation by representing both text and images as pixel-level data within a single diffusion transformer. This approach enables coherent text-to-image generation and image captioning, demonstrating strong cycle consistency and emergent controllability by operating entirely in a shared visual latent space.

176

27 Oct 2025

computer-science computer-vision-and-pattern-recognition generative-models

Integrating Reinforcement Learning with Visual Generative Models: Foundations and Advances

Institute of Artificial Intelligence (TeleAI)

Generative models have made significant progress in synthesizing visual content, including images, videos, and 3D/4D structures. However, they are typically trained with surrogate objectives such as likelihood or reconstruction loss, which often misalign with perceptual quality, semantic accuracy, or physical realism. Reinforcement learning (RL) offers a principled framework for optimizing non-differentiable, preference-driven, and temporally structured objectives. Recent advances demonstrate its effectiveness in enhancing controllability, consistency, and human alignment across generative tasks. This survey provides a systematic overview of RL-based methods for visual content generation. We review the evolution of RL from classical control to its role as a general-purpose optimization tool, and examine its integration into image, video, and 3D/4D generation. Across these domains, RL serves not only as a fine-tuning mechanism but also as a structural component for aligning generation with complex, high-level goals. We conclude with open challenges and future research directions at the intersection of RL and generative modeling.

143

14 Oct 2025

ai-for-genomics computer-science artificial-intelligence

Protein Design with Dynamic Protein Vocabulary

Fudan University China Telecom East China Normal University

PRODVA, developed by researchers at East China Normal University and collaborators, generates protein sequences that are both functionally aligned with text descriptions and structurally plausible. This method achieved 77% of designs with pLDDT > 70 and outperformed prior state-of-the-art models in foldability while utilizing less than 0.04% of their training data.

05 Dec 2025

agents computer-science continual-learning

TeleEgo: Benchmarking Egocentric AI Assistants in the Wild

China Telecom Institute of Artificial Intelligence (TeleAI)

Egocentric AI assistants in real-world settings must process multi-modal inputs (video, audio, text), respond in real time, and retain evolving long-term memory. However, existing benchmarks typically evaluate these abilities in isolation, lack realistic streaming scenarios, or support only short-term tasks. We introduce TeleEgo, a long-duration, streaming, omni-modal benchmark for evaluating egocentric AI assistants in realistic daily contexts. The dataset features over 14 hours per participant of synchronized egocentric video, audio, and text across four domains: work & study, lifestyle & routines, social activities, and outings & culture. All data is aligned on a unified global timeline and includes high-quality visual narrations and speech transcripts, curated through human this http URL defines 12 diagnostic subtasks across three core capabilities: Memory (recalling past events), Understanding (interpreting the current moment), and Cross-Memory Reasoning (linking distant events). It contains 3,291 human-verified QA items spanning multiple question formats (single-choice, binary, multi-choice, and open-ended), evaluated strictly in a streaming setting. We propose Real-Time Accuracy (RTA) to jointly capture correctness and responsiveness under tight decision windows, and Memory Persistence Time (MPT) as a forward-looking metric for long-term retention in continuous streams. In this work, we report RTA results for current models and release TeleEgo, together with an MPT evaluation framework, as a realistic and extensible benchmark for future egocentric assistants with stronger streaming memory, enabling systematic study of both real-time behavior and long-horizon memory.

104

12 Aug 2025

adversarial-robustness ai-for-cybersecurity computer-science

Safe Semantics, Unsafe Interpretations: Tackling Implicit Reasoning Safety in Large Vision-Language Models

Northwestern Polytechnical University

Peking University China Telecom Southeast University

Large Vision-Language Models face growing safety challenges with multimodal inputs. This paper introduces the concept of Implicit Reasoning Safety, a vulnerability in LVLMs. Benign combined inputs trigger unsafe LVLM outputs due to flawed or hidden reasoning. To showcase this, we developed Safe Semantics, Unsafe Interpretations, the first dataset for this critical issue. Our demonstrations show that even simple In-Context Learning with SSUI significantly mitigates these implicit multimodal threats, underscoring the urgent need to improve cross-modal implicit reasoning.

22 Sep 2025

computer-science computation-and-language sound

WenetSpeech-Chuan: A Large-Scale Sichuanese Corpus with Rich Annotation for Dialectal Speech Processing

Northwestern Polytechnical University

Nanjing University China Telecom Beijing AISHELL Technology Co., Ltd. WeNet Open Source Community

The paper introduces WenetSpeech-Chuan, the largest open-source corpus for Sichuanese dialects, containing over 10,000 hours of richly annotated audio. This resource, coupled with a systematic data processing pipeline, facilitates state-of-the-art Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) for Sichuanese, achieving performance competitive with commercial systems and significantly surpassing existing open-source models.

126

23 Sep 2025

computer-science computation-and-language data-curation

T2R-bench: A Benchmark for Generating Article-Level Reports from Real World Industrial Tables

Beihang University China Telecom Chongqing University

Extensive research has been conducted to explore the capabilities of large language models (LLMs) in table reasoning. However, the essential task of transforming table information into reports remains a significant challenge for industrial applications. This task is plagued by two critical issues: 1) the complexity and diversity of tables lead to suboptimal reasoning outcomes; and 2) existing table benchmarks lack the capacity to adequately assess the practical application of this task. To fill this gap, we propose the table-to-report task and construct a bilingual benchmark named T2R-bench, where the key information flows from the tables to the reports. The benchmark comprises 457 industrial tables, all derived from real-world scenarios and encompassing 19 industry domains as well as 4 types of industrial tables. Furthermore, we propose evaluation criteria to fairly measure the quality of report generation. Experiments on 25 widely used LLMs reveal that even state-of-the-art models like Deepseek-R1 achieve an overall score of only 62.71, indicating that LLMs still have room for improvement on T2R-bench.

15 Nov 2025

computer-science artificial-intelligence computer-vision-and-pattern-recognition

Exploring Efficient Open-Vocabulary Segmentation in the Remote Sensing

Northwestern Polytechnical University

University of Science and Technology of China Institute of Artificial Intelligence (TeleAI)

Open-Vocabulary Remote Sensing Image Segmentation (OVRSIS), an emerging task that adapts Open-Vocabulary Segmentation (OVS) to the remote sensing (RS) domain, remains underexplored due to the absence of a unified evaluation benchmark and the domain gap between natural and RS images. To bridge these gaps, we first establish a standardized OVRSIS benchmark (OVRSISBench) based on widely-used RS segmentation datasets, enabling consistent evaluation across methods. Using this benchmark, we comprehensively evaluate several representative OVS/OVRSIS models and reveal their limitations when directly applied to remote sensing scenarios. Building on these insights, we propose RSKT-Seg, a novel open-vocabulary segmentation framework tailored for remote sensing. RSKT-Seg integrates three key components: (1) a Multi-Directional Cost Map Aggregation (RS-CMA) module that captures rotation-invariant visual cues by computing vision-language cosine similarities across multiple directions; (2) an Efficient Cost Map Fusion (RS-Fusion) transformer, which jointly models spatial and semantic dependencies with a lightweight dimensionality reduction strategy; and (3) a Remote Sensing Knowledge Transfer (RS-Transfer) module that injects pre-trained knowledge and facilitates domain adaptation via enhanced upsampling. Extensive experiments on the benchmark show that RSKT-Seg consistently outperforms strong OVS baselines by +3.8 mIoU and +5.9 mACC, while achieving 2x faster inference through efficient aggregation. Our code is available at this https URL.
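
The idea of computing vision-language cosine similarities across multiple directions can be sketched as follows. This is an illustration under stated assumptions, not the RS-CMA module itself: the function names, the choice of four 90-degree rotations, and the `encode` callable (a stand-in for an image encoder that preserves spatial layout) are all hypothetical.

```python
import numpy as np


def cosine_cost_map(feats, text_embeds, eps=1e-8):
    """Per-pixel cosine similarity: (H, W, D) features vs (C, D) text embeddings."""
    f = feats / (np.linalg.norm(feats, axis=-1, keepdims=True) + eps)
    t = text_embeds / (np.linalg.norm(text_embeds, axis=-1, keepdims=True) + eps)
    return f @ t.T  # shape (H, W, C)


def multi_directional_cost_maps(image, text_embeds, encode, num_rotations=4):
    """Encode the image at several 90-degree rotations and stack the cost maps."""
    maps = []
    for k in range(num_rotations):
        rotated = np.rot90(image, k=k, axes=(0, 1))
        cost = cosine_cost_map(encode(rotated), text_embeds)
        # Rotate the cost map back so all directions are spatially aligned.
        maps.append(np.rot90(cost, k=-k, axes=(0, 1)))
    return np.stack(maps, axis=0)  # shape (R, H, W, C)


# Toy demo: an identity "encoder" on random features (an assumption for testing).
rng = np.random.default_rng(0)
image = rng.normal(size=(4, 4, 8))
text_embeds = rng.normal(size=(3, 8))
cost_maps = multi_directional_cost_maps(image, text_embeds, encode=lambda x: x)
```

With a real encoder, the per-direction maps would differ and a downstream module would aggregate them; with the identity stand-in used here they coincide, which makes the alignment step easy to check.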

258

29 Sep 2025

computer-science artificial-intelligence computation-and-language

Towards Efficient LLM Grounding for Embodied Multi-Agent Collaboration

Northwestern Polytechnical University Shanghai AI Laboratory

Tsinghua University

Zhejiang University China Telecom

Researchers introduced Reinforced Advantage (ReAd), a closed-loop framework integrating Multi-Agent Reinforcement Learning (MARL) advantage functions to provide principled feedback for Large Language Model (LLM) planning in embodied multi-agent tasks. This approach reduces environmental interactions and LLM queries, achieving superior task success rates and efficiency across various multi-robot and cooperative benchmarks.

123

20 Jun 2025

ai-for-cybersecurity computer-science computer-vision-security

Loupe: A Generalizable and Adaptive Framework for Image Forgery Detection

Northwestern Polytechnical University China Telecom Southeast University Beijing University of Posts and Telecommunications Lanzhou University

The proliferation of generative models has raised serious concerns about visual content forgery. Existing deepfake detection methods primarily target either image-level classification or pixel-wise localization. While some achieve high accuracy, they often suffer from limited generalization across manipulation types or rely on complex architectures. In this paper, we propose Loupe, a lightweight yet effective framework for joint deepfake detection and localization. Loupe integrates a patch-aware classifier and a segmentation module with conditional queries, allowing simultaneous global authenticity classification and fine-grained mask prediction. To enhance robustness against test-set distribution shifts, Loupe introduces a pseudo-label-guided test-time adaptation mechanism that leverages patch-level predictions to supervise the segmentation head. Extensive experiments on the DDL dataset demonstrate that Loupe achieves state-of-the-art performance, securing first place in the IJCAI 2025 Deepfake Detection and Localization Challenge with an overall score of 0.846. Our results validate the effectiveness of the proposed patch-level fusion and conditional query design in improving both classification accuracy and spatial localization under diverse forgery patterns. The code is available at this https URL.

25 Sep 2025

computer-science artificial-intelligence computer-vision-and-pattern-recognition

Conditional Video Generation for High-Efficiency Video Compression

China Telecom Institute of Artificial Intelligence (TeleAI)

Perceptual studies demonstrate that conditional diffusion models excel at reconstructing video content aligned with human visual perception. Building on this insight, we propose a video compression framework that leverages conditional diffusion models for perceptually optimized reconstruction. Specifically, we reframe video compression as a conditional generation task, where a generative model synthesizes video from sparse, yet informative signals. Our approach introduces three key modules: (1) Multi-granular conditioning that captures both static scene structure and dynamic spatio-temporal cues; (2) Compact representations designed for efficient transmission without sacrificing semantic richness; (3) Multi-condition training with modality dropout and role-aware embeddings, which prevent over-reliance on any single modality and enhance robustness. Extensive experiments show that our method significantly outperforms both traditional and neural codecs on perceptual quality metrics such as Fréchet Video Distance (FVD) and LPIPS, especially under high compression ratios.

106

16 Jun 2025

computer-science computation-and-language sound

Leveraging LLM and Self-Supervised Training Models for Speech Recognition in Chinese Dialects: A Comparative Analysis

Northwestern Polytechnical University China Telecom

Large-scale training corpora have significantly improved the performance of ASR models. Unfortunately, due to the relative scarcity of data, Chinese accents and dialects remain a challenge for most ASR models. Recent advancements in self-supervised learning have shown that self-supervised pre-training, combined with large language models (LLMs), can effectively enhance ASR performance in low-resource scenarios. We aim to investigate the effectiveness of this paradigm for Chinese dialects. Specifically, we pre-train a Data2vec2 model on 300,000 hours of unlabeled dialect and accented speech data and perform alignment training on a supervised dataset of 40,000 hours. Then, we systematically examine the impact of various projectors and LLMs on Mandarin, dialect, and accented speech recognition performance under this paradigm. Our method achieved SOTA results on multiple dialect datasets, including Kespeech. We will open-source our work to promote reproducible research.

17 Jul 2025

ai-for-health computer-science conversational-ai

Emotional Support with LLM-based Empathetic Dialogue Generation

China Telecom Corp Ltd Institute of Artificial Intelligence (TeleAI)

Researchers from TeleAI, China Telecom Corp Ltd, developed an effective solution for Emotional Support Conversation (ESC) by fine-tuning Qwen2.5 Large Language Models with advanced prompt engineering. Their approach, which includes both LoRA and full-parameter fine-tuning, achieved a second-place ranking in the NLPCC 2025 Task 8 evaluation, with their best model yielding a total score of 39.62 and a G-score of 87.20.

219

07 Jul 2025

computer-science artificial-intelligence computer-vision-and-pattern-recognition

UniForm: A Unified Multi-Task Diffusion Transformer for Audio-Video Generation

China Telecom

With the rise of diffusion models, audio-video generation has been revolutionized. However, most existing methods rely on separate modules for each modality, with limited exploration of unified generative architectures. In addition, many are confined to a single task and small-scale datasets. To overcome these limitations, we introduce UniForm, a unified multi-task diffusion transformer that generates both audio and visual modalities in a shared latent space. By using a unified denoising network, UniForm captures the inherent correlations between sound and vision. Additionally, we propose task-specific noise schemes and task tokens, enabling the model to support multiple tasks with a single set of parameters, including video-to-audio, audio-to-video and text-to-audio-video generation. Furthermore, by leveraging large language models and a large-scale text-audio-video combined dataset, UniForm achieves greater generative diversity than prior approaches. Experiments show that UniForm achieves performance close to the state-of-the-art single-task models across three generation tasks, with generated content that is not only highly aligned with real-world data distributions but also enables more diverse and fine-grained generation.

386

11 Aug 2025

autonomous-vehicles computer-science artificial-intelligence

Exploring Spatial Representation to Enhance LLM Reasoning in Aerial Vision-Language Navigation

Northwestern Polytechnical University Shanghai AI Laboratory China Telecom

Aerial Vision-and-Language Navigation (VLN) is a novel task enabling Unmanned Aerial Vehicles (UAVs) to navigate in outdoor environments through natural language instructions and visual cues. However, it remains challenging due to the complex spatial relationships in aerial this http URL this paper, we propose a training-free, zero-shot framework for aerial VLN tasks, where the large language model (LLM) is leveraged as the agent for action prediction. Specifically, we develop a novel Semantic-Topo-Metric Representation (STMR) to enhance the spatial reasoning capabilities of LLMs. This is achieved by extracting and projecting instruction-related semantic masks onto a top-down map, which presents spatial and topological information about surrounding landmarks and grows during the navigation process. At each step, a local map centered at the UAV is extracted from the growing top-down map and transformed into a matrix representation with distance metrics, serving as the text prompt to the LLM for action prediction in response to the given instruction. Experiments conducted in real and simulation environments demonstrate the effectiveness and robustness of our method, achieving absolute success rate improvements of 26.8% and 5.8% over current state-of-the-art methods on simple and complex navigation tasks, respectively. The dataset and code will be released soon.
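
The step of extracting a UAV-centered local map and turning it into a text matrix can be sketched roughly as below. This is a hypothetical illustration, not the STMR format from the paper: the grid labels, the `local_map_prompt` helper, the `unknown` padding token, and the prompt wording are all assumptions.

```python
def local_map_prompt(grid, uav_rc, radius, cell_m):
    """Crop a (2*radius+1)-square window around the UAV and serialize it as text.

    grid: list of rows of semantic labels from a top-down map (hypothetical).
    uav_rc: (row, col) of the UAV in the grid; cell_m: metres per cell.
    """
    rows, cols = len(grid), len(grid[0])
    r0, c0 = uav_rc
    lines = []
    for r in range(r0 - radius, r0 + radius + 1):
        cells = []
        for c in range(c0 - radius, c0 + radius + 1):
            if (r, c) == (r0, c0):
                cells.append("UAV")
            elif 0 <= r < rows and 0 <= c < cols:
                cells.append(grid[r][c])
            else:
                cells.append("unknown")  # outside the explored map
        lines.append(" | ".join(cells))
    size = 2 * radius + 1
    header = f"Local map, {size}x{size} cells, each cell {cell_m} m; UAV at center."
    return header + "\n" + "\n".join(lines)


# Hypothetical 3x3 top-down map with semantic labels.
grid = [
    ["road", "road", "tree"],
    ["building", "empty", "tree"],
    ["empty", "empty", "car"],
]
prompt = local_map_prompt(grid, uav_rc=(1, 1), radius=1, cell_m=5)
```

The cell size stated in the header gives the LLM a distance scale, so that "two cells north" can be read as a metric offset; cells beyond the explored map are padded rather than dropped so the matrix stays square.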

196

24 Jul 2025

cloud-computing computer-science artificial-intelligence

AI Flow: Perspectives, Scenarios, and Approaches

China Telecom

Pioneered by the foundational information theory of Claude Shannon and the visionary framework of machine intelligence of Alan Turing, the convergent evolution of information and communication technologies (IT/CT) has created an unbroken wave of connectivity and computation. This synergy has sparked a technological revolution, now reaching its peak with large artificial intelligence (AI) models that are reshaping industries and redefining human-machine collaboration. However, the realization of ubiquitous intelligence faces considerable challenges due to substantial resource consumption in large models and high communication bandwidth demands. To address these challenges, AI Flow has been introduced as a multidisciplinary framework that integrates cutting-edge IT and CT advancements, with a particular emphasis on the following three key points. First, a device-edge-cloud framework serves as the foundation, integrating end devices, edge servers, and cloud clusters to optimize scalability and efficiency for low-latency model inference. Second, we introduce the concept of familial models, which refers to a series of different-sized models with aligned hidden features, enabling effective collaboration and the flexibility to adapt to varying resource constraints and dynamic scenarios. Third, connectivity- and interaction-based intelligence emergence is a novel paradigm of AI Flow. By leveraging communication networks to enhance connectivity, the collaboration among AI models across heterogeneous nodes achieves emergent intelligence that surpasses the capability of any single model. The innovations of AI Flow provide enhanced intelligence, timely responsiveness, and ubiquitous accessibility to AI services, paving the way for the tighter fusion of AI techniques and communication systems.
