Researchers at Beijing Language and Culture University, Tsinghua University, and collaborators introduce OMGEval, the first open-source multilingual generative evaluation benchmark that incorporates explicit cultural localization. This benchmark highlights a substantial performance gap between proprietary models like GPT-4 and open-source LLMs in understanding and generating culturally nuanced text across Chinese, Russian, French, Spanish, and Arabic.
Researchers from Beijing University of Posts and Telecommunications, Westlake University, and Zhejiang University, along with the OpenHelix Team, introduce VLA-Adapter, an efficient method to bridge vision-language representations to robotic actions. The approach achieves state-of-the-art performance with a compact 0.5B-parameter backbone and no robotic-data pre-training, reaching a 97.3% average success rate on the LIBERO benchmark with roughly 3x faster inference (219.2 Hz) than comparable methods.
LightRAG integrates graph structures into text indexing and uses a dual-level retrieval paradigm to enhance Retrieval-Augmented Generation (RAG) systems. This approach improves contextual understanding, response diversity, retrieval efficiency, and adaptability to new data compared to existing RAG methods.
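To make the dual-level idea concrete, the sketch below retrieves local (entity-level) and global (relation/theme-level) context from a toy graph index and merges the two. It is only illustrative: LightRAG actually extracts entities and relations with an LLM and matches keywords via vector search, while this version uses plain dictionaries and string matching.

    # Minimal sketch of LightRAG-style dual-level retrieval over a toy graph index.
    # The real system extracts entities/relations with an LLM and uses vector search;
    # keyword/string matching stands in for both here (illustrative assumption).

    graph = {
        "entities": {
            "Tesla": "Electric-vehicle maker founded in 2003.",
            "lithium": "Metal used in EV battery cathodes.",
        },
        "relations": {
            ("Tesla", "lithium"): "Tesla's battery supply chain depends on lithium mining.",
        },
    }

    def dual_level_retrieve(low_level_keywords, high_level_keywords):
        """Low level: match specific entities. High level: match broader relations/themes."""
        local = [desc for name, desc in graph["entities"].items()
                 if name.lower() in (k.lower() for k in low_level_keywords)]
        global_ = [desc for (h, t), desc in graph["relations"].items()
                   if any(k.lower() in desc.lower() for k in high_level_keywords)]
        return local + global_   # merged context handed to the generator LLM

    print(dual_level_retrieve(["Tesla"], ["supply chain"]))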
Researchers from Tsinghua University and NUS developed RLPR, a verifier-free reinforcement learning framework that enhances Large Language Model reasoning across general domains by using an intrinsic probability-based reward. This method achieved an average 24.9% improvement on general-domain benchmarks and consistently outperformed existing RLVR and concurrent verifier-free approaches by removing the need for external verification.
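The core of a probability-based reward like RLPR's can be sketched as the mean probability the policy assigns to the reference-answer tokens; the paper adds debiasing and filtering on top of this, so the snippet below is a simplified illustration rather than the released implementation.

    import torch
    import torch.nn.functional as F

    def probability_reward(logits, reference_ids):
        """Mean probability the model assigns to the reference-answer tokens.

        logits: (seq_len, vocab) scores at the positions predicting the reference answer.
        reference_ids: (seq_len,) token ids of the ground-truth answer.
        A sketch of the core idea; RLPR adds debiasing and filtering on top.
        """
        probs = F.softmax(logits, dim=-1)                           # (seq_len, vocab)
        token_probs = probs.gather(-1, reference_ids.unsqueeze(-1)).squeeze(-1)
        return token_probs.mean().item()                            # scalar in (0, 1) used as the reward

    # Toy example: 3 answer tokens over a vocabulary of 5.
    logits = torch.randn(3, 5)
    reference_ids = torch.tensor([1, 4, 2])
    print(probability_reward(logits, reference_ids))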
V-Thinker introduces a framework that empowers Large Multimodal Models with interactive, vision-centric reasoning capabilities by enabling them to autonomously modify and reflect on an image's visual state through code-driven tools. The system achieves an average accuracy improvement of 14.6% over baseline models on the new VTBench benchmark for interactive reasoning.
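A minimal sketch of that interaction pattern is given below: the model either answers or issues a code-driven tool call (here, crop-and-zoom) that changes the image it reasons over before reflecting again. The model callable and the action protocol are placeholder assumptions, not V-Thinker's actual interface.

    from PIL import Image

    def crop_and_zoom(image, box, scale=2):
        """A typical code-driven visual tool: crop a region and enlarge it for re-inspection."""
        region = image.crop(box)
        return region.resize((region.width * scale, region.height * scale))

    def interactive_reasoning(image, question, model, max_steps=4):
        """Reason-act loop: the model either answers or requests a tool call on the image."""
        history = []
        for _ in range(max_steps):
            step = model(image=image, question=question, history=history)  # placeholder LMM call
            if step["action"] == "answer":
                return step["text"]
            if step["action"] == "crop_and_zoom":           # model-issued tool call
                image = crop_and_zoom(image, step["box"])   # updated visual state fed back in
                history.append(step)
        return model(image=image, question=question, history=history)["text"]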
Researchers from Tsinghua University, Shanghai Jiao Tong University, Siemens, and Tencent Robotics X developed "Puppeteer," a multi-agent framework that uses a centralized, learnable orchestrator to dynamically coordinate LLM-based agents. This approach, building on previous ChatDev work, achieves superior performance and reduced computational costs by adaptively evolving agent interaction topologies across diverse reasoning and generative tasks.
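As a rough sketch of a learnable orchestrator, the snippet below keeps a softmax policy over candidate agents (including a terminate action) and nudges it with a REINFORCE-style update after each task. The agent roles, state features, and update rule are illustrative assumptions rather than the paper's architecture.

    import numpy as np

    rng = np.random.default_rng(0)
    agents = ["planner", "coder", "tester", "<terminate>"]   # illustrative roles
    W = rng.normal(size=(len(agents), 4)) * 0.01             # orchestrator parameters

    def choose_agent(state, temperature=1.0):
        """Softmax policy over agents given a task-state feature vector."""
        logits = W @ state / temperature
        p = np.exp(logits - logits.max()); p /= p.sum()
        idx = rng.choice(len(agents), p=p)
        return idx, np.log(p[idx])

    def reinforce_update(trajectory, reward, lr=0.1):
        """One REINFORCE step: push up log-probs of chosen agents, scaled by task reward."""
        global W
        for state, idx, _ in trajectory:
            logits = W @ state
            p = np.exp(logits - logits.max()); p /= p.sum()
            grad = -np.outer(p, state); grad[idx] += state   # d log p(idx) / dW
            W += lr * reward * grad

    # One toy episode: three orchestration decisions followed by a scalar task reward.
    traj = []
    for _ in range(3):
        s = rng.normal(size=4)
        i, lp = choose_agent(s)
        traj.append((s, i, lp))
    reinforce_update(traj, reward=1.0)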
Researchers at BIGAI and UniTree Robotics developed a unified policy for legged robots to control both position and contact force without relying on external force sensors. This policy enhanced imitation learning by generating force-aware data, leading to a 39.5% improvement in success rates for contact-rich manipulation tasks on quadrupedal and humanoid robots.
Large language models (LLMs) have recently demonstrated strong capabilities as autonomous agents, showing promise in reasoning, tool use, and sequential decision-making. While prior benchmarks have evaluated LLM agents in domains such as software engineering and scientific discovery, the finance domain remains underexplored, despite its direct relevance to economic value and high-stakes decision-making. Existing financial benchmarks primarily test static knowledge through question answering, but they fall short of capturing the dynamic and iterative nature of trading. To address this gap, we introduce StockBench, a contamination-free benchmark designed to evaluate LLM agents in realistic, multi-month stock trading environments. Agents receive daily market signals -- including prices, fundamentals, and news -- and must make sequential buy, sell, or hold decisions. Performance is assessed using financial metrics such as cumulative return, maximum drawdown, and the Sortino ratio. Our evaluation of state-of-the-art proprietary (e.g., GPT-5, Claude-4) and open-weight (e.g., Qwen3, Kimi-K2, GLM-4.5) models shows that while most LLM agents struggle to outperform the simple buy-and-hold baseline, several models demonstrate the potential to deliver higher returns and manage risk more effectively. These findings highlight both the challenges and opportunities in developing LLM-powered financial agents, showing that excelling at static financial knowledge tasks does not necessarily translate into successful trading strategies. We release StockBench as an open-source resource to support reproducibility and advance future research in this domain.
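The evaluation metrics the abstract lists are standard and easy to reproduce; the sketch below computes cumulative return, maximum drawdown, and an annualized Sortino ratio from a daily portfolio-value series. The zero risk-free rate, 252 trading days, and the downside-deviation convention are common defaults assumed here, not values taken from the paper.

    import numpy as np

    def evaluate(portfolio_values, risk_free_daily=0.0, trading_days=252):
        """Cumulative return, max drawdown, and Sortino ratio from daily portfolio values."""
        v = np.asarray(portfolio_values, dtype=float)
        returns = v[1:] / v[:-1] - 1.0

        cumulative_return = v[-1] / v[0] - 1.0

        running_peak = np.maximum.accumulate(v)
        max_drawdown = ((v - running_peak) / running_peak).min()   # most negative dip from a peak

        excess = returns - risk_free_daily
        downside = excess[excess < 0]
        downside_dev = np.sqrt((downside ** 2).mean()) if downside.size else np.nan
        sortino = np.sqrt(trading_days) * excess.mean() / downside_dev

        return {"cumulative_return": cumulative_return,
                "max_drawdown": max_drawdown,
                "sortino_ratio": sortino}

    # Toy example: a short run of daily portfolio values.
    print(evaluate([100, 102, 99, 104, 103, 108]))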
MemoryOS, developed by researchers at Beijing University of Posts and Telecommunications and Tencent AI Lab, proposes a comprehensive operating system-inspired framework to manage long-term memory for AI agents, overcoming LLMs' context window limitations. This system significantly improves conversational coherence and personalization, achieving up to 49.11% F1 score improvement on ultra-long dialogues while reducing LLM calls by over 60% compared to baseline methods.
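A toy version of the operating-system analogy is sketched below: recent dialogue pages live in a bounded short-term store, evicted pages accumulate per topic in mid-term memory, and topics that keep recurring are distilled into long-term memory. Capacities, the heat threshold, and the promotion rule are illustrative assumptions, not MemoryOS's actual policies.

    from collections import deque

    class TieredMemory:
        """Toy OS-style memory: short-term pages overflow into mid-term segments,
        and frequently revisited segments are promoted to long-term user memory."""

        def __init__(self, short_capacity=4, heat_threshold=3):
            self.short_term = deque(maxlen=short_capacity)   # recent dialogue "pages"
            self.mid_term = {}                                # topic -> (turns, heat)
            self.long_term = set()                            # distilled persona/knowledge facts
            self.heat_threshold = heat_threshold

        def add_turn(self, topic, turn):
            if len(self.short_term) == self.short_term.maxlen:
                old_topic, old_turn = self.short_term[0]      # page about to be evicted
                turns, heat = self.mid_term.get(old_topic, ([], 0))
                self.mid_term[old_topic] = (turns + [old_turn], heat + 1)
                if heat + 1 >= self.heat_threshold:           # hot segment -> long-term memory
                    self.long_term.add(f"user often discusses {old_topic}")
            self.short_term.append((topic, turn))

        def context(self, topic):
            recent = [t for tp, t in self.short_term if tp == topic]
            archived, _ = self.mid_term.get(topic, ([], 0))
            return list(self.long_term) + archived + recent

    mem = TieredMemory()
    for i in range(6):
        mem.add_turn("travel", f"turn {i}: plans a trip")
    print(mem.context("travel"))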
Time-R1 presents a reinforcement learning (RL) post-training framework for Large Vision-Language Models (LVLMs) to enhance temporal video grounding (TVG). The approach achieves state-of-the-art zero-shot performance on TVG benchmarks like Charades-STA and ActivityNet, demonstrating improved generalization and data efficiency by leveraging only 2.5K RL training samples.
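Temporal-grounding RL rewards are usually built from the overlap between a predicted time span and the annotated moment; the helper below computes that temporal IoU. Time-R1's exact reward shaping may combine this with format or other terms, so treat it as a building block rather than the paper's reward.

    def temporal_iou(pred, gt):
        """IoU between two time spans (start_s, end_s); a natural grounding reward signal."""
        (ps, pe), (gs, ge) = pred, gt
        inter = max(0.0, min(pe, ge) - max(ps, gs))
        union = (pe - ps) + (ge - gs) - inter
        return inter / union if union > 0 else 0.0

    # Prediction overlaps the annotated moment 12s-19s by 5 of 11 seconds of union.
    print(temporal_iou((8.0, 17.0), (12.0, 19.0)))  # ~0.45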
Kimi-Dev, an open-source 72B code LLM developed by Moonshot AI and academic collaborators, demonstrates that Agentless training can effectively serve as a structured skill prior for enhancing multi-turn SWE-Agents. This approach achieves a state-of-the-art 60.4% resolve rate on SWE-bench Verified in Agentless mode and a competitive 48.6% pass@1 in agentic mode after minimal fine-tuning, while showing strong generalization to diverse benchmarks.
Long-video understanding (LVU) is a challenging problem in computer vision. Existing methods either downsample frames for single-pass reasoning, sacrificing fine-grained details, or depend on textual reasoning over task-agnostic representations, hindering task-specific perception and exploration. In this paper, we propose VideoExplorer, a framework grounded in the principle of "thinking with video", which naturally intertwines planning, temporal grounding, and scalable perception into a coherent reasoning process. Rather than reasoning over a static context, VideoExplorer iteratively formulates sub-questions, locates relevant moments, and performs task-oriented, temporally scalable video understanding until reaching the final answer, enabling faithful, efficient, and interpretable reasoning. To address the lack of LVU training resources, we construct a long-video reasoning dataset using difficulty-adaptive sampling to ensure high-quality trajectories on complex tasks. Building on this dataset, we design a two-stage training pipeline: supervised trajectory initialization followed by trajectory-level preference optimization, encouraging adaptive temporal grounding and iterative information integration guided by downstream rewards. Extensive evaluations on popular long-video understanding and reasoning benchmarks demonstrate VideoExplorer's significant advantage over existing baselines, highlighting its robustness, adaptability, and efficiency. Our code is made publicly available in this repository (this https URL).
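The iterative procedure described above can be outlined as a simple control loop; in the sketch below the planner, grounder, and perceiver callables are placeholders standing in for model components, so their signatures are assumptions rather than the released code.

    def explore_video(video, question, planner, grounder, perceiver, max_rounds=5):
        """'Thinking with video': plan a sub-question, ground it in time, look, repeat."""
        evidence = []
        for _ in range(max_rounds):
            step = planner(question, evidence)                        # placeholder LLM planning call
            if step["done"]:
                return step["answer"]
            start_s, end_s = grounder(video, step["sub_question"])    # temporal grounding
            # Task-oriented, scalable perception: sample the located clip more densely.
            observation = perceiver(video, start_s, end_s, fps=step.get("fps", 2))
            evidence.append({"sub_question": step["sub_question"],
                             "span": (start_s, end_s),
                             "observation": observation})
        return planner(question, evidence, force_answer=True)["answer"]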
GRAPH-R1 introduces an agentic GraphRAG framework that employs end-to-end reinforcement learning for multi-turn interaction with lightweight knowledge hypergraphs. This approach achieves state-of-the-art performance across various RAG benchmarks, improving reasoning accuracy, retrieval efficiency, and generation quality while demonstrating strong out-of-distribution generalizability.
Researchers from Alibaba Group and USTC developed Live Avatar, an algorithm-system co-designed framework for real-time, high-fidelity, and infinite-length audio-driven avatar generation using a 14-billion-parameter diffusion model. The system achieves 20.88 FPS and demonstrates visual consistency for over 10,000 seconds, significantly advancing practical applications.
Researchers from Tsinghua University developed ChatDev, a chat-powered framework that enables specialized Large Language Model agents to collaboratively create software from design to testing using multi-turn dialogues. This framework produced software with an executability score of 0.8800 and a quality score of 0.3953 on the SRDD dataset, outperforming single-agent and statically-instructed multi-agent baselines.
The MM-HELIX framework evaluates and enhances multimodal large language models' long-chain reflective reasoning through a novel benchmark and an adaptive hybrid policy optimization strategy. This approach achieved an 18.6% accuracy improvement on its own benchmark and a 5.7% gain in generalization across other mathematical and logic tasks.
Researchers from Tsinghua University and Zhipu AI introduce ImageReward, a general-purpose human preference reward model, and Reward Feedback Learning (ReFL), a method for directly optimizing text-to-image diffusion models. ImageReward accurately predicts human aesthetic and alignment preferences, while ReFL leverages this feedback to improve generated image quality and human alignment.
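One way to picture ReFL is as a training step that denoises without gradients until a randomly chosen late step, takes a single differentiable step, decodes the latent, and backpropagates the negative reward. The sketch below abstracts the diffusion pipeline and reward model behind placeholder callables, so the names, step semantics, and shapes are assumptions rather than the released code.

    import random
    import torch

    def refl_step(latent, prompt, denoise_to, decode, reward_model, optimizer,
                  t_range=(1, 10)):
        """One ReFL update: stop denoising at a random late step, score, backprop reward.

        denoise_to(latent, prompt, t) -> latent at step t        (placeholder diffusion call)
        decode(latent)                -> image tensor             (placeholder VAE decode)
        reward_model(image, prompt)   -> scalar preference score  (e.g. ImageReward)
        """
        t = random.randint(*t_range)             # a late step near the end of sampling
        with torch.no_grad():                    # earlier steps are not differentiated through
            latent = denoise_to(latent, prompt, t + 1)
        latent = denoise_to(latent, prompt, t)   # only the final step keeps gradients
        image = decode(latent)
        loss = -reward_model(image, prompt).mean()   # maximize human-preference reward
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()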
RepoMaster enables LLM-based agents to autonomously explore and understand complex GitHub repositories for task solving by intelligently reusing and adapting existing codebases. It significantly improves task success rates by up to 110% and reduces token consumption by approximately 95% compared to state-of-the-art baselines.
HyperGraphRAG introduces a Retrieval-Augmented Generation (RAG) system that represents knowledge using hypergraphs to capture complex n-ary relations. The method consistently outperforms existing binary graph-based RAGs and standard RAG across diverse knowledge-intensive domains, demonstrating improvements in answer accuracy, retrieval relevance, and generation quality.
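The distinguishing data structure is the hyperedge, which ties any number of entities to a single n-ary fact instead of forcing it into pairwise edges; the toy index below stores such hyperedges and retrieves them by entity overlap. The domain examples and the overlap scorer are illustrative assumptions, not the paper's extraction or ranking pipeline.

    # Toy hypergraph index: each hyperedge ties n entities to a single n-ary fact.
    hyperedges = [
        {"entities": {"metformin", "type 2 diabetes", "500 mg", "twice daily"},
         "fact": "Metformin is prescribed for type 2 diabetes at 500 mg twice daily."},
        {"entities": {"metformin", "renal impairment"},
         "fact": "Metformin requires caution in patients with renal impairment."},
    ]

    def retrieve(query_entities, k=2):
        """Rank hyperedges by overlap with the query's entities (a simplified scorer)."""
        query = {e.lower() for e in query_entities}
        scored = sorted(hyperedges,
                        key=lambda h: len({e.lower() for e in h["entities"]} & query),
                        reverse=True)
        return [h["fact"] for h in scored[:k]]

    print(retrieve(["Metformin", "Type 2 Diabetes"]))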