Researchers at Lancaster University and Mindgard developed "Model Leeching," a black-box extraction attack that distills task-specific knowledge from large, proprietary language models like ChatGPT-3.5-Turbo into smaller, local models. This method enabled high-fidelity replication of LLM behavior at a cost of only $50, and demonstrated an 11% increase in adversarial attack success against the target LLM by using the extracted model for attack staging.
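The extraction pipeline amounts to a query-and-distill loop: harvest prompt/response pairs from the target's public API, then fine-tune a small local model on those pairs. Below is a minimal, hypothetical sketch of that loop; `query_target_llm`, the file name, and the data format are illustrative assumptions, not the authors' actual implementation.

```python
# Illustrative sketch of a black-box, "Model Leeching"-style extraction loop.
# query_target_llm() is a hypothetical wrapper around the proprietary model's
# API; the collected pairs are later used to fine-tune a small local model
# that imitates the target on the chosen task.

import json

def query_target_llm(prompt: str) -> str:
    """Hypothetical call to the target LLM's public chat/completions API."""
    raise NotImplementedError("wrap your API client here")

def collect_distillation_data(task_prompts, out_path="leeched_pairs.jsonl"):
    """Query the target on task-specific prompts and store prompt/response pairs."""
    with open(out_path, "w", encoding="utf-8") as f:
        for prompt in task_prompts:
            response = query_target_llm(prompt)
            f.write(json.dumps({"prompt": prompt, "response": response}) + "\n")
    return out_path

# The saved pairs then serve as supervised fine-tuning data for a small
# open-source model, yielding an extracted model that can also be used to
# stage and refine attacks before transferring them to the target.
```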
Researchers empirically demonstrated that current Large Language Model (LLM) guardrail systems are highly vulnerable to both simple character injection and sophisticated adversarial machine learning evasion techniques. The study revealed that even widely used commercial and open-source guardrails can be bypassed with high success rates, up to 100% for certain methods such as emoji smuggling, highlighting critical weaknesses in existing LLM protection mechanisms.
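To illustrate the character-injection idea, the toy sketch below interleaves invisible Unicode characters into a prompt so that a naive pattern-matching filter no longer recognizes it; emoji smuggling works on the same principle with emoji and variation-selector characters. The keyword filter shown is a deliberately simple stand-in, not any specific commercial or open-source guardrail from the study.

```python
# Toy illustration of character injection: invisible zero-width joiners are
# interleaved into the text, so a guardrail that matches on the literal
# string (or on tokens derived from it) no longer sees the blocked phrase.

ZERO_WIDTH_JOINER = "\u200d"

def inject_invisible_chars(text: str, filler: str = ZERO_WIDTH_JOINER) -> str:
    """Interleave an invisible character between every character of the input."""
    return filler.join(text)

def naive_keyword_guardrail(text: str, blocklist=("ignore previous instructions",)) -> bool:
    """Toy guardrail: flags input only if a blocked phrase appears verbatim."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in blocklist)

original = "ignore previous instructions"
evasive = inject_invisible_chars(original)

print(naive_keyword_guardrail(original))  # True  -> blocked
print(naive_keyword_guardrail(evasive))   # False -> injected characters break the match
```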