alphaXiv


Harbin Institute of Technology (Shenzhen)

2,038

06 Jul 2024

computer-science distributed-parallel-and-cluster-computing machine-learning

A Joint Approach to Local Updating and Gradient Compression for Efficient Asynchronous Federated Learning

Harbin Institute of Technology (Shenzhen)

Southern University of Science and Technology Shenzhen International Graduate School, Tsinghua University Tsinghua-Berkeley Shenzhen Institute, Tsinghua University

Researchers from Tsinghua University, Southern University of Science and Technology, and Harbin Institute of Technology introduce FedLuck, an Asynchronous Federated Learning (AFL) framework that jointly and adaptively optimizes local updating frequency and gradient compression rates. This approach, grounded in a derived convergence factor, achieved up to 55% faster training times and 56% less communication consumption on average across various tasks compared to existing baselines.

912

09 Sep 2025

computer-science computer-vision-and-pattern-recognition robotics

F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions

Shanghai AI Laboratory Harbin Institute of Technology (Shenzhen)

F₁, a Vision-Language-Action (VLA) model, integrates explicit visual foresight into its decision-making process, moving beyond purely reactive control. This approach yields enhanced robustness in dynamic environments and improved generalization across a range of real-world and simulated robotic manipulation tasks.

103

882

01 Sep 2025

computer-science robotics

Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey

Harbin Institute of Technology (Shenzhen)

The survey presents the first systematic, taxonomy-oriented review of large Vision-Language-Action (VLA) models for robotic manipulation, consolidating diverse research and proposing a coherent framework. It identifies key architectural paradigms, integration strategies, and distinctive characteristics while outlining critical future research directions.

206

1,404

21 Aug 2025

adversarial-robustness agentic-frameworks agents

A Survey on Large Language Model Benchmarks

South China University of Technology

Chinese Academy of Sciences

University of Science and Technology of China Shanghai AI Lab Shenzhen University Harbin Institute of Technology (Shenzhen) Shanghai University of Electric Power

Southern University of Science and Technology Shenzhen MSU-BIT University Shenzhen University of Advanced Technology Shenzhen Key Laboratory for High Performance Data Mining, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences

Researchers from a consortium of Chinese institutions systematically surveyed 283 Large Language Model benchmarks, categorizing them into a three-tiered taxonomy. The work identifies critical issues like data contamination and cultural bias, while proposing a design paradigm for more robust and fair future evaluations.

131

22,706

06 Jul 2025

agents chain-of-thought computer-science

Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models

Harbin Institute of Technology (Shenzhen)

Researchers from Harbin Institute of Technology, Shenzhen, surveyed Large Multimodal Reasoning Models (LMRMs), tracing their evolution through four stages from perception-driven modularity to language-centric reasoning. The survey highlights current limitations of state-of-the-art models in omni-modal and agentic benchmarks, and proposes Native Large Multimodal Reasoning Models (N-LMRMs) as a future paradigm for inherently multimodal and integrated AI systems.

420

377

27 Oct 2025

computer-science computer-vision-and-pattern-recognition geometric-deep-learning

Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations

The Chinese University of Hong Kong

The University of Hong Kong Harbin Institute of Technology (Shenzhen)

Concerto, a joint 2D-3D self-supervised learning framework developed by researchers from The University of Hong Kong, The Chinese University of Hong Kong, and Harbin Institute of Technology, synergistically combines intra-modal 3D self-distillation and cross-modal 2D-3D joint embedding prediction. This approach learns unified spatial representations, achieving 80.7% mIoU for 3D semantic segmentation on ScanNet and demonstrating robust data efficiency.

944

26 Jun 2024

computer-science artificial-intelligence computation-and-language

Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs

The Chinese University of Hong Kong Harbin Institute of Technology (Shenzhen) SmartMore

Researchers at The Chinese University of Hong Kong and Harbin Institute of Technology developed Step-DPO, a method that adapts Direct Preference Optimization for fine-grained, step-wise supervision in long-chain mathematical reasoning. This approach enabled open-source models like Qwen2-72B-Instruct to achieve 70.8% accuracy on MATH and 94.0% on GSM8K, outperforming several state-of-the-art closed-source models including GPT-4-1106 and Claude-3-Opus.

347

957

31 Oct 2025

computer-science artificial-intelligence computation-and-language

R$^2$ec: Towards Large Recommender Models with Reasoning

National University of Singapore

University of Science and Technology of China

The Hong Kong Polytechnic University Harbin Institute of Technology (Shenzhen)

润洋游

R$^2$ec introduces a unified large recommender model that intrinsically integrates reasoning and recommendation capabilities within a single architecture, optimizing performance and interpretability without relying on human-annotated reasoning data. The model consistently surpasses existing baselines in recommendation quality across multiple datasets while maintaining competitive inference efficiency.

469

02 Jun 2025

computer-science artificial-intelligence computation-and-language

KDRL: Post-Training Reasoning LLMs via Unified Knowledge Distillation and Reinforcement Learning

Huawei Noah’s Ark Lab Harbin Institute of Technology (Shenzhen)

KDRL (Knowledge Distillation and Reinforcement Learning) introduces a unified post-training framework that integrates knowledge distillation and reinforcement learning to enhance reasoning capabilities in LLMs, achieving notable performance improvements and increased efficiency on mathematical reasoning benchmarks. This framework addresses the limitations of applying these methods separately by jointly optimizing both objectives.

195

09 Oct 2025

computer-science contrastive-learning artificial-intelligence

Parallel Test-Time Scaling for Latent Reasoning Models

University of Science and Technology of China

The Hong Kong Polytechnic University Harbin Institute of Technology (Shenzhen) Shandong Jianzhu University

Researchers from The Hong Kong Polytechnic University and collaborators developed a framework that enables parallel test-time scaling for latent reasoning models, addressing the challenges of generating and evaluating diverse reasoning paths in continuous latent spaces. This approach enhances the performance of models like COCONUT, CODI, and CoLaR on arithmetic reasoning tasks by effectively leveraging additional inference compute.

319

01 Oct 2025

agents attention-mechanisms computer-science

CogVLA: Cognition-Aligned Vision-Language-Action Model via Instruction-Driven Routing & Sparsification

Harbin Institute of Technology (Shenzhen)

CogVLA, developed by researchers at Harbin Institute of Technology, introduces a cognition-aligned, instruction-driven framework for Vision-Language-Action (VLA) models. It achieves a 97.4% average success rate on the LIBERO benchmark and 70.0% on real-world ALOHA tasks, while also reducing inference time by 2.79x and FLOPs by 3.12x compared to leading baselines.

508

27 Jul 2025

attention-mechanisms computer-science computer-vision-and-pattern-recognition

HLFormer: Enhancing Partially Relevant Video Retrieval with Hyperbolic Learning

Peng Cheng Laboratory Harbin Institute of Technology (Shenzhen)

HKUST Tsinghua Shenzhen International Graduate School, Tsinghua University

Researchers from Harbin Institute of Technology, Tsinghua University, Peng Cheng Laboratory, and HKUST introduce HLFormer, a hyperbolic learning framework for Partially Relevant Video Retrieval (PRVR). This model achieves state-of-the-art results on ActivityNet Captions, Charades-STA, and TVR benchmarks by effectively modeling video's semantic hierarchy and enforcing partial relevance through a novel hyperbolic loss.

478

19 Oct 2025

agentic-frameworks agents cloud-computing

Repo2Run: Automated Building Executable Environment for Code Repository at Scale

ByteDance Harbin Institute of Technology (Shenzhen)

Repo2Run introduces an LLM-based agent capable of automating the creation of executable test environments for code repositories, achieving an 86.0% success rate in building environments and running tests on a new 420-repository benchmark. This system synthesizes reproducible Dockerfiles, providing foundational infrastructure for training more capable software engineering LLMs.

1,127

07 May 2024

computer-science artificial-intelligence computer-vision-and-pattern-recognition

Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition

National University of Singapore

Nanyang Technological University Harbin Institute of Technology (Shenzhen)

A new framework, Video-of-Thought (VoT), mimics human cognition by breaking down complex video reasoning into five structured steps, from perception to cognitive interpretation. The MotionEpic model, which implements VoT with Spatio-Temporal Scene Graph integration, demonstrates superior performance over existing video MLLMs on complex video question-answering tasks and competitive grounding capabilities.

196

14 Oct 2025

computer-science computation-and-language data-curation

KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model

Harbin Institute of Technology (Shenzhen)

The KaLM-Embedding-V2 series introduces compact (0.5B parameter) and versatile text embedding models that achieve state-of-the-art performance on MTEB benchmarks for their size. Developed by SLAI and Tencent, these models utilize superior training techniques and meticulously curated data, offering a fully open-sourced solution for improved RAG and diverse NLP applications.

206

14 Jul 2025

computer-science computer-vision-and-pattern-recognition robotics

Video Individual Counting for Moving Drones

Sun Yat-Sen University

City University of Hong Kong Harbin Institute of Technology (Shenzhen)

HKUST

Video Individual Counting (VIC) has received increasing attention for its importance in intelligent video surveillance. Existing works are limited in two aspects: dataset and method. Previous datasets are captured with fixed or rarely moving cameras and contain relatively sparse individuals, which restricts evaluation under highly varying views and times in crowded scenes. Existing methods rely on localization followed by association or classification, and they struggle under dense and dynamic conditions because small targets are localized inaccurately. To address these issues, we introduce the MovingDroneCrowd Dataset, featuring videos captured by fast-moving drones in crowded scenes under diverse illuminations, shooting heights, and angles. We further propose a Shared Density map-guided Network (SDNet) that uses a Depth-wise Cross-Frame Attention (DCFA) module to directly estimate shared density maps between consecutive frames, from which the inflow and outflow density maps are derived by subtracting the shared density maps from the global density maps. The inflow density maps across frames are summed to obtain the number of unique pedestrians in a video. Experiments on our datasets and publicly available ones show the superiority of our method over the state of the art in highly dynamic and complex crowded scenes. Our dataset and code have been released publicly.
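
The counting arithmetic described in this abstract can be sketched in a few lines. This is an illustrative toy, not the authors' SDNet code: the function names (`map_sum`, `inflow_map`, `count_unique`), the nested-list representation of density maps, and the choice to count everyone visible in the first frame as new are all assumptions made here. The shared density maps are taken as given inputs, whereas in the paper they are predicted by the network.

```python
def map_sum(density):
    """Integrate a density map (list of rows) to get a person count."""
    return sum(sum(row) for row in density)


def inflow_map(global_map, shared_map):
    """Inflow density = global density minus density shared with the previous frame."""
    return [
        [g - s for g, s in zip(g_row, s_row)]
        for g_row, s_row in zip(global_map, shared_map)
    ]


def count_unique(global_maps, shared_maps):
    """Estimate the number of unique individuals in a video.

    global_maps[t]   : density map of every individual visible in frame t.
    shared_maps[t-1] : density map, in frame t, of individuals also present
                       in frame t-1 (hypothetical input; assumed given here).
    Assumption: everyone in the first frame counts as new.
    """
    if len(shared_maps) != len(global_maps) - 1:
        raise ValueError("need one shared map per consecutive frame pair")
    total = map_sum(global_maps[0])
    for t in range(1, len(global_maps)):
        total += map_sum(inflow_map(global_maps[t], shared_maps[t - 1]))
    return total


# Toy video: 2 people in frame 0, one newcomer in frame 1, nobody new in frame 2.
global_maps = [
    [[1.0, 1.0], [0.0, 0.0]],
    [[0.0, 1.0], [1.0, 1.0]],
    [[0.0, 0.0], [1.0, 1.0]],
]
shared_maps = [
    [[0.0, 1.0], [1.0, 0.0]],
    [[0.0, 0.0], [1.0, 1.0]],
]
print(count_unique(global_maps, shared_maps))
```

Because only the inflow is accumulated, an individual who stays in view across many frames contributes to the total once, which is why no explicit localization or identity association is needed in this formulation.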

1,434

04 Dec 2024

computer-science computation-and-language computers-and-society

From Individual to Society: A Survey on Social Simulation Driven by Large Language Model-based Agents

Fudan University Harbin Institute of Technology (Shenzhen) East China Normal University Shanghai Innovation Institute

杰周

This survey provides a comprehensive overview of LLM-based social simulation, categorizing existing work into Individual, Scenario, and Society simulations based on their scale and precision requirements. It details the architectures, construction methods, objectives, and evaluation approaches for each category, highlighting research trends and future directions in the field.

188

137

13 Nov 2025

computer-science computer-vision-and-pattern-recognition robotics

SemanticVLA: Semantic-Aligned Sparsification and Enhancement for Efficient Robotic Manipulation

Huawei Noah’s Ark Lab Harbin Institute of Technology (Shenzhen)

Researchers from Harbin Institute of Technology, Shenzhen, and Huawei Noah’s Ark Lab developed SemanticVLA, a framework for robotic manipulation that addresses perceptual redundancy and superficial instruction-vision alignment. It achieved a 97.7% success rate on the LIBERO benchmark, reducing training cost by 3.0x and inference latency by 2.7x compared to OpenVLA, while demonstrating robust performance in real-world tasks.

1,032

18 Jun 2024

computer-science artificial-intelligence computation-and-language

AutoSurvey: Large Language Models Can Automatically Write Surveys

Nanjing University

Westlake University

Peking University Harbin Institute of Technology (Shenzhen) Squirrel AI

AutoSurvey presents a multi-phase, agent-based framework that automates the creation of comprehensive literature surveys, achieving a substantial speed increase (up to 73.59 surveys per hour for 64k tokens) while maintaining citation and content quality metrics comparable to human-authored benchmarks.

427

190

04 Jun 2025

computer-science artificial-intelligence computation-and-language

CoRe-MMRAG: Cross-Source Knowledge Reconciliation for Multimodal RAG

National University of Singapore Harbin Institute of Technology (Shenzhen)

Shandong University

A framework called CoRe-MMRAG improves multimodal retrieval-augmented generation by reconciling conflicts between a model's internal knowledge and external data, as well as inconsistencies between visual and textual information. The system achieves 3.5 percentage points higher accuracy on InfoSeek and 2.9 percentage points on Enc-VQA compared to leading baselines in knowledge-based Visual Question Answering.
