Northwestern Polytechnical University
We present an experimental study of drag reduction by polymers in Taylor-Couette turbulence at Reynolds numbers ($Re$) ranging from $4\times 10^3$ to $2.5\times 10^4$. In this $Re$ regime, the Taylor vortex is present and accounts for more than 50% of the total angular velocity flux. Polyacrylamide polymers with two different average molecular weights are used. It is found that the drag reduction rate increases with polymer concentration and approaches the maximum drag reduction (MDR) limit. At MDR, the friction factor follows the $-0.58$ scaling, i.e., $C_f \sim Re^{-0.58}$, similar to channel/pipe flows. However, the drag reduction rate is about 20% at MDR, which is much lower than that in channel/pipe flows at comparable $Re$. We also find that the Reynolds shear stress does not vanish and the slope of the mean azimuthal velocity profile in the logarithmic layer remains unchanged at MDR. These behaviours are reminiscent of the low-drag-reduction regime reported in channel flow (Warholic et al., Exp. Fluids, vol. 27, issue 5, 1999, pp. 461-472). We reveal that the lower drag reduction rate originates from the fact that polymers strongly suppress the turbulent flow while only slightly weakening the mean Taylor vortex. We further show that polymers stabilize the velocity boundary layer and suppress the small-scale Görtler vortices in the near-wall region. The former effect reduces the emission rate of both intense fast and slow plumes detached from the boundary layer, resulting in less flux transport from the inner cylinder to the outer one and less energy input into the bulk turbulent flow. Our results suggest that in turbulent flows where secondary flow structures are statistically persistent and dominate the global transport properties of the system, the drag reduction efficiency of polymer additives is significantly diminished.
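For reference, the reported scaling together with the standard definition of the drag reduction rate (normalizing by the Newtonian friction factor at the same $Re$ is the conventional choice and is assumed here):

```latex
% Friction-factor scaling observed at maximum drag reduction (MDR):
%   C_f \sim Re^{-0.58}
% Drag reduction rate (standard definition; normalization assumed):
\[
  \mathrm{DR} = \frac{C_{f}^{\,\mathrm{Newtonian}} - C_{f}^{\,\mathrm{polymer}}}{C_{f}^{\,\mathrm{Newtonian}}},
  \qquad C_f \sim Re^{-0.58}\ \text{at MDR}.
\]
```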
The EO-Robotics Team developed EO-1, a 3B parameter embodied foundation model, employing a unified architecture and interleaved vision-text-action pretraining for general robot control. The model achieved state-of-the-art performance, surpassing GPT-4o and Gemini 1.5 Flash in overall embodied reasoning, and demonstrated an 86.0% completion rate across 28 diverse real-world manipulation tasks.
IGGT introduces an end-to-end unified transformer that jointly learns 3D geometric reconstruction and instance-level semantics, leveraging a new InsScene-15K dataset with 3D-consistent instance annotations. The framework achieves state-of-the-art instance spatial tracking, superior open-vocabulary semantic segmentation, and maintains high geometric accuracy, while offering flexible integration with various Vision-Language Models.
The MIRROR framework introduces a multi-modal self-supervised learning approach for computational pathology, integrating histopathology and transcriptomics by balancing modality alignment with modality-specific information retention and mitigating redundancy through a novel style clustering module. It demonstrates superior performance in cancer subtyping and survival prediction on TCGA cohorts, outperforming existing baselines in various diagnostic tasks.
Llasa introduces a unified, Llama-based architecture for speech synthesis that simplifies the text-to-speech pipeline to a single Transformer and a novel speech tokenizer. The work systematically investigates train-time and inference-time scaling, showing consistent quality improvements and strong performance on both speech generation and understanding tasks.
A collaborative team from HKUST and partners introduces Spark-TTS, a groundbreaking single-stream text-to-speech framework that achieves state-of-the-art voice synthesis while enabling precise attribute control through an innovative BiCodec tokenization system, demonstrating superior performance in zero-shot voice cloning and establishing new benchmarks with the comprehensive VoxBox dataset.
A framework called "Align-Then-stEer (ATE)" enables pre-trained Vision-Language-Action (VLA) models to adapt to new robotic embodiments and diverse tasks using limited data. This is achieved through a two-stage process that establishes a unified latent space for action representation and then steers the generative VLA policies, yielding notable success rate improvements and enhanced robustness in both simulated and real-world robotic manipulation.
Proprietary giants are increasingly dominating the race for ever-larger language models. Can open-source, smaller models remain competitive across a broad range of tasks? In this paper, we present the Avengers -- a simple recipe that leverages the collective intelligence of these smaller models. The Avengers builds upon four lightweight operations: (i) embedding: encode queries using a text embedding model; (ii) clustering: group queries based on their semantic similarity; (iii) scoring: score each model's performance within each cluster; and (iv) voting: improve outputs via repeated sampling and voting. At inference time, each query is embedded and assigned to its nearest cluster. The top-performing model(s) within that cluster are selected to generate the response with repeated sampling. Remarkably, with 10 open-source models (~7B parameters each), the Avengers surpasses GPT-4o, 4.1, and 4.5 in average performance across 15 diverse datasets spanning mathematics, coding, logical reasoning, general knowledge, and affective tasks. In particular, it surpasses GPT-4.1 on mathematics tasks by 18.21% and on code tasks by 7.46%. Furthermore, the Avengers delivers superior out-of-distribution generalization, and remains robust across various embedding models, clustering algorithms, ensemble strategies, and values of its sole parameter -- the number of clusters.
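A minimal sketch of the embed-cluster-score-vote recipe; the embedding model, k-means clustering, and majority voting shown here are illustrative assumptions rather than the authors' exact components.

```python
# Minimal sketch of the Avengers recipe: embed -> cluster -> score -> vote.
# The embedding model, clustering choice (k-means), and majority voting are
# illustrative assumptions; the paper's exact components may differ.
from collections import Counter
import numpy as np
from sklearn.cluster import KMeans

def embed(texts):
    # Placeholder: replace with a real text-embedding model.
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 384))

def build_router(train_queries, per_model_correctness, n_clusters=16):
    """Cluster training queries and score each model within each cluster."""
    X = embed(train_queries)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    scores = {}  # (cluster_id, model_name) -> accuracy on that cluster
    for c in range(n_clusters):
        idx = np.where(km.labels_ == c)[0]
        for model, correct in per_model_correctness.items():
            scores[(c, model)] = np.mean([correct[i] for i in idx]) if len(idx) else 0.0
    return km, scores

def answer(query, km, scores, models, n_samples=5):
    """Route a query to the best model in its cluster, then sample and vote."""
    c = int(km.predict(embed([query]))[0])
    best = max(models, key=lambda m: scores[(c, m)])
    samples = [models[best](query) for _ in range(n_samples)]  # repeated sampling
    return Counter(samples).most_common(1)[0][0]                # majority vote
```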
Vision-language-action (VLA) models have shown strong generalization across tasks and embodiments; however, their reliance on large-scale human demonstrations limits their scalability owing to the cost and effort of manual data collection. Reinforcement learning (RL) offers a potential alternative to generate demonstrations autonomously, yet conventional RL algorithms often struggle on long-horizon manipulation tasks with sparse rewards. In this paper, we propose a modified diffusion policy optimization algorithm to generate high-quality and low-variance trajectories, which contributes to a diffusion RL-powered VLA training pipeline. Our algorithm benefits from not only the high expressiveness of diffusion models to explore complex and diverse behaviors but also the implicit regularization of the iterative denoising process to yield smooth and consistent demonstrations. We evaluate our approach on the LIBERO benchmark, which includes 130 long-horizon manipulation tasks, and show that the generated trajectories are smoother and more consistent than both human demonstrations and those from standard Gaussian RL policies. Further, training a VLA model exclusively on the diffusion RL-generated data achieves an average success rate of 81.9%, which outperforms the model trained on human data by +5.3% and that on Gaussian RL-generated data by +12.6%. The results highlight our diffusion RL as an effective alternative for generating abundant, high-quality, and low-variance demonstrations for VLA models.
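The paper's specific policy-optimization objective is not detailed here; the sketch below, a minimal DDPM-style sampler with a placeholder denoiser network and an assumed noise schedule, only illustrates how a diffusion policy produces an action trajectory through iterative denoising, the mechanism credited above for smooth, consistent demonstrations.

```python
# Illustrative sketch: sampling an action trajectory from a diffusion policy
# by iterative denoising (DDPM-style). The denoiser network, noise schedule,
# and conditioning are assumptions; the paper's RL objective is not shown.
import torch

def sample_trajectory(denoiser, obs, horizon=16, action_dim=7, n_steps=50):
    betas = torch.linspace(1e-4, 0.02, n_steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(1, horizon, action_dim)          # start from pure noise
    for t in reversed(range(n_steps)):
        eps_hat = denoiser(x, obs, t)                # predict the injected noise
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps_hat) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise      # one denoising step
    return x                                         # smoothed action sequence
```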
A new Horizontal/Vertical-Intensity (HVI) color space is introduced to address color distortion and noise in low-light image enhancement by decoupling brightness from color while mitigating red discontinuity and black plane noise. Paired with a Color and Intensity Decoupling Network (CIDNet), the approach achieves superior quantitative performance on multiple benchmarks with high efficiency, and the HVI space demonstrates generalizability when integrated into other state-of-the-art methods.
CityGS-X, developed by Northwestern Polytechnical University and Shanghai AI Lab, presents a scalable architecture for efficient and geometrically accurate large-scale 3D scene reconstruction using 3D Gaussian Splatting. The method achieves state-of-the-art rendering quality and significantly faster training times on urban datasets by utilizing a parallelized hybrid hierarchical 3D representation, batch-level multi-task rendering, and a consistent progressive training scheme with enhanced depth priors.
SoulX-Podcast introduces an LLM-driven generative framework for creating realistic, long-form, multi-speaker podcasts, incorporating diverse Chinese dialects and controllable paralinguistic cues. The system achieves state-of-the-art performance in multi-turn dialogue synthesis, exhibiting the lowest Character Error Rate (2.20) and highest cross-speaker consistency (0.599) on the Chinese ZipVoice-Dia benchmark, alongside strong zero-shot monologue capabilities.
The OSUM-EChat system developed by the Audio, Speech and Language Processing Group at Northwestern Polytechnical University enhances end-to-end empathetic spoken chatbots by integrating an understanding-driven training strategy and a linguistic-paralinguistic dual think mechanism. It achieved a GPT-4 score of 72.0 on a new EChat-eval benchmark for multi-label empathy, demonstrating improved empathetic responsiveness and efficient speech understanding without relying on massive, proprietary datasets.
FastUMI, developed by Shanghai AI Lab, redesigns the Universal Manipulation Interface to enable scalable and hardware-independent collection of high-quality, real-world robotic manipulation data. It achieves this through decoupled hardware, a streamlined software framework, and algorithmic enhancements tailored for first-person data, providing over 10,000 demonstration trajectories across 22 tasks.
General world models represent a crucial pathway toward achieving Artificial General Intelligence (AGI), serving as the cornerstone for various applications ranging from virtual environments to decision-making systems. Recently, the emergence of the Sora model has attracted significant attention due to its remarkable simulation capabilities, which exhibit an incipient comprehension of physical laws. In this survey, we embark on a comprehensive exploration of the latest advancements in world models. Our analysis navigates the forefront of generative methodologies in video generation, where world models stand as pivotal constructs facilitating the synthesis of highly realistic visual content. Additionally, we scrutinize the burgeoning field of autonomous-driving world models, meticulously delineating their indispensable role in reshaping transportation and urban mobility. Furthermore, we delve into the intricacies inherent in world models deployed within autonomous agents, shedding light on their profound significance in enabling intelligent interactions within dynamic environmental contexts. Finally, we examine the challenges and limitations of world models and discuss their potential future directions. We hope this survey can serve as a foundational reference for the research community and inspire continued innovation. This survey will be regularly updated at: this https URL.
A method called Self-Representation Alignment (SRA) allows Diffusion Transformers (DiTs) to enhance their training and image generation quality by leveraging their intrinsic discriminative properties for self-guidance. This approach avoids reliance on external representation models or complex auxiliary training frameworks.
COHERENT, developed by researchers including those at Shanghai AI Laboratory, introduces a centralized hierarchical framework that leverages Large Language Models for complex task planning in heterogeneous multi-robot systems. The framework achieves high success rates (0.975 in simulation) and demonstrates robust collaboration among diverse robot types, including quadrotors, robotic dogs, and robotic arms, in both simulated and real-world environments.
Ultrasound imaging has become the preferred imaging modality for early cancer screening due to its advantages of non-ionizing radiation, low cost, and real-time imaging capabilities. However, conventional ultrasound diagnosis heavily relies on physician expertise, presenting challenges of high subjectivity and low diagnostic efficiency. Vision-language models (VLMs) offer a promising solution to this problem, but existing general-purpose models demonstrate limited knowledge of ultrasound medical tasks, with poor generalization in multi-organ lesion recognition and low efficiency across multi-task diagnostics. To address these limitations, we propose EchoVLM, a vision-language model specifically designed for ultrasound medical imaging. The model employs a Mixture of Experts (MoE) architecture trained on data spanning seven anatomical regions. This design enables the model to perform multiple tasks, including ultrasound report generation, diagnosis, and visual question answering (VQA). Experimental results demonstrate that EchoVLM achieves significant improvements of 10.15 and 4.77 points in BLEU-1 and ROUGE-1 scores, respectively, compared to Qwen2-VL on the ultrasound report generation task. These findings suggest that EchoVLM has substantial potential to enhance diagnostic accuracy in ultrasound imaging, thereby providing a viable technical solution for future clinical applications. Source code and model weights are available at this https URL.
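To make the Mixture-of-Experts design concrete, here is a minimal sketch of token-level expert routing with one expert notionally assigned per anatomical region; the dimensions, soft top-k gating, and expert MLPs are illustrative assumptions rather than EchoVLM's actual architecture.

```python
# Illustrative Mixture-of-Experts layer: a gating network softly routes each
# token to a small set of experts (e.g., one per anatomical region).
# Dimensions, top-k routing, and expert MLPs are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoE(nn.Module):
    def __init__(self, dim=768, n_experts=7, top_k=2):
        super().__init__()
        self.gate = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                      # x: (batch, tokens, dim)
        logits = self.gate(x)                  # (batch, tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e)      # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out
```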
GenSE, developed by researchers from Northwestern Polytechnical University and Nanyang Technological University, introduces a generative speech enhancement framework that utilizes language models and a hierarchical processing structure. The system, which includes a novel single-quantizer neural codec called SimCodec, treats speech enhancement as a conditional language modeling task. It achieved superior DNSMOS, SECS, and Word Error Rate scores compared to existing state-of-the-art methods on standard datasets, demonstrating improved speech quality, speaker similarity, and generalization to unseen noise conditions.
Large Language Models (LLMs) may suffer from hallucinations in real-world applications due to the lack of relevant knowledge. In contrast, knowledge graphs encompass extensive, multi-relational structures that store a vast array of symbolic facts. Consequently, integrating LLMs with knowledge graphs has been extensively explored, with Knowledge Graph Question Answering (KGQA) serving as a critical touchstone for the integration. This task requires LLMs to answer natural language questions by retrieving relevant triples from knowledge graphs. However, existing methods face two significant challenges: excessively long reasoning paths that distract from answer generation, and false-positive relations that hinder path refinement. In this paper, we propose an iterative interactive KGQA framework that leverages the interactive learning capabilities of LLMs to perform reasoning and Debating over Graphs (DoG). Specifically, DoG employs a subgraph-focusing mechanism, allowing LLMs to attempt an answer after each reasoning step, thereby mitigating the impact of lengthy reasoning paths. On the other hand, DoG utilizes a multi-role debate team to gradually simplify complex questions, reducing the influence of false-positive relations. This debate mechanism ensures the reliability of the reasoning process. Experimental results on five public datasets demonstrate the effectiveness and superiority of our architecture. Notably, DoG outperforms the state-of-the-art method ToG by 23.7% and 9.1% in accuracy on WebQuestions and GrailQA, respectively. Furthermore, the integration experiments with various LLMs on the mentioned datasets highlight the flexibility of DoG. Code is available at this https URL.
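A schematic of the iterative loop described above: expand a focused subgraph, attempt an answer after each reasoning step, and fall back to a debate-style simplification of the question. All helper functions and prompts below are placeholders, not the DoG implementation.

```python
# Schematic of an iterative KGQA loop in the spirit of DoG: expand a focused
# subgraph step by step, try to answer after every step, and let a small
# "debate" of LLM roles simplify the question when the triples do not suffice.
# All helper functions (llm, retrieve_triples, ...) are placeholders.

def debate_simplify(llm, question, subgraph):
    """Multi-role debate: a proposer suggests a simpler sub-question, a critic vets it."""
    proposal = llm(f"Propose a simpler sub-question for: {question}\nGraph: {subgraph}")
    verdict = llm(f"As a critic, is this sub-question faithful? {proposal}\nAnswer yes/no.")
    return proposal if "yes" in verdict.lower() else question

def answer_over_graph(llm, retrieve_triples, question, max_steps=5):
    subgraph = []                                          # focused set of triples
    for step in range(max_steps):
        subgraph += retrieve_triples(question, subgraph)   # one-hop expansion
        attempt = llm(f"Question: {question}\nTriples: {subgraph}\n"
                      "Answer if the triples suffice, else reply UNKNOWN.")
        if "UNKNOWN" not in attempt:                       # answer attempt after each step
            return attempt
        question = debate_simplify(llm, question, subgraph)
    return attempt
```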