Zhongguancun Laboratory, Beijing, China

ReDAN: An Empirical Study on Remote DoS Attacks against NAT Networks

25 Nov 2024

Tsinghua University Southeast University

This empirical study from Tsinghua University and collaborators demonstrates that off-path attackers can remotely identify NAT devices via a PMTUD side channel and launch Denial-of-Service (DoS) attacks by manipulating TCP connections. Over 92% of 180 tested real-world NAT networks, including 4G/5G and public Wi-Fi, were found vulnerable, leading to the assignment of 5 CVE/CNVD identifiers.

#computer-science #computer-vision-security #cryptography-and-security

FedCache: A Knowledge Cache-driven Federated Learning Architecture for Personalized Edge Intelligence

01 Feb 2024

Xuefeng Jiang

Chinese Academy of Sciences

University of Science and Technology of China

Researchers from ICT, Chinese Academy of Sciences, developed FedCache, a knowledge cache-driven federated learning architecture that facilitates personalized edge intelligence. It achieves performance comparable to state-of-the-art methods while reducing communication overhead by more than two orders of magnitude, notably being the first sample-grained logits interaction method without feature transmission or public datasets.

#computer-science #distributed-parallel-and-cluster-computing

Harnessing Multiple Large Language Models: A Survey on LLM Ensemble

18 Sep 2025

Beihang University University of Macau

LLM Ensemble -- which involves the comprehensive use of multiple large language models (LLMs), each aimed at handling user queries during downstream inference, to benefit from their individual strengths -- has gained substantial attention recently. The widespread availability of LLMs, coupled with their varying strengths and out-of-the-box usability, has profoundly advanced the field of LLM Ensemble. This paper presents the first systematic review of recent developments in LLM Ensemble. First, we introduce our taxonomy of LLM Ensemble and discuss several related research problems. Then, we provide a more in-depth classification of the methods under the broad categories of "ensemble-before-inference", "ensemble-during-inference", and "ensemble-after-inference", and review all relevant methods. Finally, we introduce related benchmarks and applications, summarize existing studies, and suggest several future research directions. A curated list of papers on LLM Ensemble is available at this https URL.

#computer-science #computation-and-language

Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models

24 Dec 2024

Qingxiu Dong

Zefan Cai

University of Waterloo Chinese Academy of Sciences

Peking University and Alibaba Group researchers introduce Omni-MATH, a comprehensive, text-only, Olympiad-level mathematical reasoning benchmark. It features over 4,400 problems meticulously categorized by 33+ sub-domains and 10+ difficulty levels, revealing that state-of-the-art LLMs achieve only up to 60.54% accuracy, indicating significant remaining challenges in complex mathematical reasoning.

#computer-science #computation-and-language #model-interpretation

Privacy-Preserving Federated Embedding Learning for Localized Retrieval-Augmented Generation

27 Apr 2025

qi-li-zhang

启立张

Beihang University

Nanyang Technological University

A federated learning framework enables privacy-preserving collaborative training of RAG retrieval models across multiple organizations, combining homomorphic encryption and knowledge distillation to achieve 90.22 MAP score on financial domain tasks while keeping sensitive data localized within each client's environment.

#computer-science #computation-and-language #embedding-methods

CPPO: Accelerating the Training of Group Relative Policy Optimization-Based Reasoning Models

08 Nov 2025

Xiamen University East China Normal University

This paper introduces Completion Pruning Policy Optimization (CPPO) to accelerate the training of reasoning models based on Group Relative Policy Optimization (GRPO). GRPO, while effective, incurs high training costs due to the need to sample multiple completions for each question. Our experiment and theoretical analysis reveal that the number of completions impacts model accuracy yet increases training time multiplicatively, and not all completions contribute equally to policy training -- their contribution depends on their relative advantage. To address these issues, we propose CPPO, which prunes completions with low absolute advantages, significantly reducing the number needed for gradient calculation and updates. Additionally, we introduce a dynamic completion allocation strategy to maximize GPU utilization by incorporating additional questions, further enhancing training efficiency. Experiments show that CPPO achieves up to 7.98× speedup on GSM8K and 3.48× on Math while preserving or even enhancing the accuracy compared to the original GRPO. We release our code at this https URL.
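
The pruning idea in this abstract can be illustrated with a minimal sketch. It is not the authors' released code: the function names, the `keep` parameter, and the toy rewards are hypothetical, and the advantage is assumed to be the usual group-normalized reward (reward minus group mean, divided by group standard deviation). The dynamic completion allocation step is not shown.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group (assumed GRPO-style advantage)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def prune_completions(rewards, keep):
    """Keep the `keep` completions with the largest absolute advantage.

    Returns the kept indices (in original order) and their advantages.
    `keep` is a hypothetical knob, not a parameter named in the abstract.
    """
    adv = group_relative_advantages(rewards)
    # Rank by |advantage|; completions near the group mean contribute least.
    ranked = sorted(range(len(rewards)), key=lambda i: abs(adv[i]), reverse=True)
    kept = sorted(ranked[:keep])
    return kept, [adv[i] for i in kept]


# Toy group of 8 completions for one question (made-up rewards).
rewards = [1.0, 0.0, 0.5, 0.5, 1.0, 0.0, 0.5, 0.5]
kept, kept_adv = prune_completions(rewards, keep=4)
print(kept)
```

In this toy group, the four completions whose reward equals the group mean have zero advantage and are dropped, so only half of the completions would enter the gradient computation.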

#agents #chain-of-thought #computer-science

TrafficLLM: Enhancing Large Language Models for Network Traffic Analysis with Generic Traffic Representation

15 Apr 2025

Tsinghua University Zhongguancun Laboratory

Machine learning (ML) powered network traffic analysis has been widely used for threat detection. Unfortunately, its generalization across different tasks and unseen data is very limited. Large language models (LLMs), known for their strong generalization capabilities, have shown promising performance in various domains. However, their application to the traffic analysis domain is limited due to the significantly different characteristics of network traffic. To address the issue, in this paper, we propose TrafficLLM, which introduces a dual-stage fine-tuning framework to learn generic traffic representation from heterogeneous raw traffic data. The framework uses traffic-domain tokenization, a dual-stage tuning pipeline, and extensible adaptation to help LLMs unlock their generalization ability on dynamic traffic analysis tasks, such that it enables traffic detection and traffic generation across a wide range of downstream tasks. We evaluate TrafficLLM across 10 distinct scenarios and 229 types of traffic. TrafficLLM achieves F1-scores of 0.9875 and 0.9483, with up to 80.12% and 33.92% better performance than existing detection and generation methods. It also shows strong generalization on unseen traffic with an 18.6% performance improvement. We further evaluate TrafficLLM in real-world scenarios. The results confirm that TrafficLLM is easy to scale and achieves accurate detection performance on enterprise traffic.

#ai-for-cybersecurity #computer-science #artificial-intelligence

Keeping Yourself is Important in Downstream Tuning Multimodal Large Language Model

06 Mar 2025

Wuhan University Zhejiang University

The paper "Keeping Yourself is Important in Downstream Tuning Multimodal Large Language Model" provides a systematic review and unified benchmark for tuning MLLMs, classifying methods into Selective, Additive, and Reparameterization paradigms. It empirically analyzes the trade-offs between task-expert specialization and open-world stabilization, offering practical guidelines for MLLM deployment.

#computer-science #artificial-intelligence #computation-and-language

Exploring Large Language Models for Communication Games: An Empirical Study on Werewolf

11 May 2024

Peng Li

Tsinghua University Zhongguancun Laboratory

Communication games, which we refer to as incomplete information games that heavily depend on natural language communication, hold significant research value in fields such as economics, social science, and artificial intelligence. In this work, we explore the problem of how to engage large language models (LLMs) in communication games, and in response, propose a tuning-free framework. Our approach keeps LLMs frozen, and relies on the retrieval of and reflection on past communications and experiences for improvement. An empirical study on the representative and widely studied communication game "Werewolf" demonstrates that our framework can effectively play the Werewolf game without tuning the parameters of the LLMs. More importantly, strategic behaviors begin to emerge in our experiments, suggesting that it will be a fruitful journey to engage LLMs in communication games and associated domains.

#computer-science #conversational-ai #computation-and-language

Large Language Models Meet Symbolic Provers for Logical Reasoning Evaluation

02 Mar 2025

Shanghai Artificial Intelligence Laboratory Fudan University

A framework named ProverGen, developed by researchers including those at Beihang University and Shanghai AI Laboratory, combines large language models with symbolic provers to automatically generate ProverQA, a challenging and logically sound benchmark for first-order logic reasoning. This generated data effectively enhances LLM reasoning abilities, leading to consistent performance gains across both in-distribution and out-of-distribution logical tasks.

#chain-of-thought #computer-science #computation-and-language

SafeBench: A Safety Evaluation Framework for Multimodal Large Language Models

24 Oct 2024

National University of Singapore

University of Science and Technology of China

SafeBench introduces a comprehensive framework for evaluating the safety of Multimodal Large Language Models (MLLMs) through an LLM-driven generation of high-quality multimodal harmful queries and an automated 'Jury Deliberation Protocol'. It revealed commercial MLLMs generally exhibit higher safety than open-source counterparts, identified specific high-risk categories like Cybersecurity, and demonstrated that multimodal fine-tuning can degrade the safety of underlying language models.

#computer-science #cryptography-and-security

Traj-LLM: A New Exploration for Empowering Trajectory Prediction with Pre-trained Large Language Models

08 May 2024

Chinese Academy of Sciences Beihang University

Traj-LLM introduces a framework that integrates pre-trained Large Language Models into trajectory prediction for autonomous driving, processing spatial-temporal features as tokens rather than using explicit prompt engineering. It achieved state-of-the-art performance on the nuScenes dataset and demonstrated robust few-shot learning capabilities while maintaining practical inference times.

#autonomous-vehicles #computer-science #artificial-intelligence

Empirical Study on Robustness and Resilience in Cooperative Multi-Agent Reinforcement Learning

23 Oct 2025

Beihang University Peking University

In cooperative Multi-Agent Reinforcement Learning (MARL), it is a common practice to tune hyperparameters in ideal simulated environments to maximize cooperative performance. However, policies tuned for cooperation often fail to maintain robustness and resilience under real-world uncertainties. Building trustworthy MARL systems requires a deep understanding of robustness, which ensures stability under uncertainties, and resilience, the ability to recover from disruptions -- a concept extensively studied in control systems but largely overlooked in MARL. In this paper, we present a large-scale empirical study comprising over 82,620 experiments to evaluate cooperation, robustness, and resilience in MARL across 4 real-world environments, 13 uncertainty types, and 15 hyperparameters. Our key findings are: (1) Under mild uncertainty, optimizing cooperation improves robustness and resilience, but this link weakens as perturbations intensify. Robustness and resilience also vary by algorithm and uncertainty type. (2) Robustness and resilience do not generalize across uncertainty modalities or agent scopes: policies robust to action noise for all agents may fail under observation noise on a single agent. (3) Hyperparameter tuning is critical for trustworthy MARL: surprisingly, standard practices like parameter sharing, GAE, and PopArt can hurt robustness, while early stopping, high critic learning rates, and Leaky ReLU consistently help. By optimizing hyperparameters only, we observe substantial improvement in cooperation, robustness and resilience across all MARL backbones, with the phenomenon also generalizing to robust MARL methods across these backbones. Code and results are available at this https URL.

#adversarial-robustness #agents #computer-science

AGENTSAFE: Benchmarking the Safety of Embodied Agents on Hazardous Instructions

19 Oct 2025

Beihang University

Nanyang Technological University

A new benchmark, AGENTSAFE, systematically evaluates the safety of embodied vision-language model (VLM) agents against hazardous instructions, revealing vulnerabilities primarily in the planning stage. The research introduces SAFE-AUDIT, a thought-level safety module that improves task success rate by 2.22% on normal instructions and achieves the lowest planning (3.52%) and task success rates (0.48%) for hazardous tasks.

#computer-science #cryptography-and-security #robotics

LATTE: A Decoding Architecture for Quantum Computing with Temporal and Spatial Scalability

04 Sep 2025

Tsinghua University Zhongguancun Laboratory

Quantum error correction allows inherently noisy quantum devices to emulate an ideal quantum computer with reasonable resource overhead. As a crucial component, decoding architectures have received significant attention recently. In this paper, we introduce LATTE, an FPGA-CPU hybrid decoding architecture aiming to address the key requirements of scaling up in lattice surgery quantum computation -- Latency, Accuracy, Throughput and Transmission Bandwidth, in an Eclectic manner. LATTE follows a hierarchical design: (1) A fully streaming and asynchronous block decoding system on CPU to enable parallelization both temporally and spatially. (2) A super-light yet accurate neural local decoding unit integrated with quantum control hardware on FPGA, which remains transparent to the block decoding system, effectively reducing transmission bandwidth and accelerating the decoding process. LATTE delivers accuracy on par with the base decoder while achieving real-time decoding throughput and significantly reducing both bandwidth requirements and computational resources, enabling a level of scalability far beyond previous approaches. Under circuit-level noise p = 0.001, LATTE achieves over 90% reduction in transmission bandwidth and a 6.4× speedup on average in single-block decoding. In the streaming decoding scenario: (1) LATTE achieves constant and low latency (16× to 20× speedup over existing streaming decoding implementations) in arbitrarily long quantum memory experiments, with near-optimal resources -- merely 2 threads are sufficient for decoding the surface code with distance up to 17. (2) LATTE minimizes latency in multi-patch measurement experiments through highly parallelized decoding operations. These combined efforts ensure sufficient scalability for large-scale fault-tolerant quantum computing.

#physics #quantum-physics

DyGKT: Dynamic Graph Learning for Knowledge Tracing

30 Jul 2024

Beihang University University of Macau

Knowledge Tracing (KT) aims to assess student learning states by predicting their performance in answering questions. Unlike existing research, which uses fixed-length learning sequences to obtain student states and regards KT as a static problem, this work is motivated by three dynamic characteristics: 1) The scale of students' answering records is constantly growing; 2) The semantics of time intervals between the records vary; 3) The relationships between students, questions and concepts are evolving. These three characteristics have great potential to revolutionize existing knowledge tracing methods. Along this line, we propose a Dynamic Graph-based Knowledge Tracing model, namely DyGKT. In particular, a continuous-time dynamic question-answering graph for knowledge tracing is constructed to deal with the infinitely growing answering behaviors, and it is worth mentioning that this is the first time dynamic graph learning technology has been used in this field. Then, a dual time encoder is proposed to capture long-term and short-term semantics among the different time intervals. Finally, a multiset indicator is utilized to model the evolving relationships between students, questions, and concepts via the graph structural feature. Numerous experiments are conducted on five real-world datasets, and the results demonstrate the superiority of our model. All the used resources are publicly available at this https URL.

#computer-science #continual-learning #machine-learning

First-Order Error Matters: Accurate Compensation for Quantized Large Language Models

14 Nov 2025

Beihang University Xidian University

Post-training quantization (PTQ) offers an efficient approach to compressing large language models (LLMs), significantly reducing memory access and computational costs. Existing compensation-based weight calibration methods often rely on a second-order Taylor expansion to model quantization error, under the assumption that the first-order term is negligible in well-trained full-precision models. However, we reveal that the progressive compensation process introduces accumulated first-order deviations between latent weights and their full-precision counterparts, making this assumption fundamentally flawed. To address this, we propose FOEM, a novel PTQ method that explicitly incorporates first-order gradient terms to improve quantization error compensation. FOEM approximates gradients by performing a first-order Taylor expansion around the pre-quantization weights. This yields an approximation based on the difference between latent and full-precision weights as well as the Hessian matrix. When substituted into the theoretical solution, the formulation eliminates the need to explicitly compute the Hessian, thereby avoiding the high computational cost and limited generalization of backpropagation-based gradient methods. This design introduces only minimal additional computational overhead. Extensive experiments across a wide range of models and benchmarks demonstrate that FOEM consistently outperforms the classical GPTQ method. In 3-bit weight-only quantization, FOEM reduces the perplexity of Llama3-8B by 17.3% and increases the 5-shot MMLU accuracy from 53.8% achieved by GPTAQ to 56.1%. Moreover, FOEM can be seamlessly combined with advanced techniques such as SpinQuant, delivering additional gains under the challenging W4A4KV4 setting and further narrowing the performance gap with full-precision baselines, surpassing existing state-of-the-art methods.
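
The first-order argument in this abstract can be written out as a rough sketch. The notation is assumed for illustration and is not taken from the paper: $w_{\mathrm{fp}}$ denotes the full-precision weights, $\tilde{w}$ the latent weights after earlier compensation steps, $\delta$ the next weight perturbation, $g$ the gradient, and $H$ the Hessian. The second relation also assumes that the gradient is approximately zero at the well-trained full-precision weights, as the abstract describes.

```latex
\[
\Delta\mathcal{L}(\delta) \;\approx\; g(\tilde{w})^{\top}\delta
  \;+\; \tfrac{1}{2}\,\delta^{\top} H\,\delta ,
\qquad
g(\tilde{w}) \;\approx\; g(w_{\mathrm{fp}}) + H\,(\tilde{w} - w_{\mathrm{fp}})
  \;\approx\; H\,(\tilde{w} - w_{\mathrm{fp}}) .
\]
```

Under these assumptions, the first-order term vanishes only when $\tilde{w} = w_{\mathrm{fp}}$, which stops holding once progressive compensation has moved the latent weights away from their full-precision values.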

#computer-science #artificial-intelligence #computation-and-language

Pretraining Data Detection for Large Language Models: A Divergence-based Calibration Method

21 May 2025

University of Amsterdam Chinese Academy of Sciences

A divergence-based calibration method, DC-PDD, was developed to enhance the detection of pretraining data in black-box large language models. This method consistently outperforms existing state-of-the-art approaches across English and Chinese benchmarks, including a new PatentMIA dataset, by effectively calibrating token probabilities.

#computer-science #computation-and-language #cryptography-and-security

SecureWebArena: A Holistic Security Evaluation Benchmark for LVLM-based Web Agents

11 Oct 2025

Chinese Academy of Sciences Beihang University

SecureWebArena is introduced as the first holistic security evaluation benchmark for LVLM-based web agents, integrating diverse web environments, a broad attack taxonomy, and a multi-layered evaluation protocol. The benchmark reveals consistent vulnerabilities across state-of-the-art models, with pop-up attacks being particularly effective and achieving Payload Delivery Rates (PDR) from 76.67% to 100%.

#adversarial-attacks #adversarial-robustness #agents

A Comprehensive Survey of Action Quality Assessment: Method and Benchmark

15 Dec 2024

Beihang University Tsinghua University

Action Quality Assessment (AQA) quantitatively evaluates the quality of human actions, providing automated assessments that reduce biases in human judgment. Its applications span domains such as sports analysis, skill assessment, and medical care. Recent advances in AQA have introduced innovative methodologies, but similar methods often intertwine across different domains, highlighting the fragmented nature that hinders systematic reviews. In addition, the lack of a unified benchmark and limited computational comparisons hinder consistent evaluation and fair assessment of AQA approaches. In this work, we address these gaps by systematically analyzing over 150 AQA-related papers to develop a hierarchical taxonomy, construct a unified benchmark, and provide an in-depth analysis of current trends, challenges, and future directions. Our hierarchical taxonomy categorizes AQA methods based on input modalities (video, skeleton, multi-modal) and their specific characteristics, highlighting the evolution and interrelations across various approaches. To promote standardization, we present a unified benchmark, integrating diverse datasets to evaluate the assessment precision and computational efficiency. Finally, we review emerging task-specific applications and identify under-explored challenges in AQA, providing actionable insights into future research directions. This survey aims to deepen understanding of AQA progress, facilitate method comparison, and guide future innovations. The project web page can be found at this https URL.

#computer-science #computer-vision-security #computer-vision-and-pattern-recognition
