Researchers at Lancaster University and Mindgard developed "Model Leeching," a black-box extraction attack that distills task-specific knowledge from large, proprietary language models like ChatGPT-3.5-Turbo into smaller, local models. This method enabled high-fidelity replication of LLM behavior at a cost of only $50, and demonstrated an 11% increase in adversarial attack success against the target LLM by using the extracted model for attack staging.
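The extraction pipeline amounts to a query-and-distill loop: harvest prompt/response pairs from the target's public API, then fine-tune a small local model on those pairs. Below is a minimal, hypothetical sketch of that loop; `query_target_llm`, the file name, and the data format are illustrative assumptions, not the authors' actual implementation.

```python
# Illustrative sketch of a black-box, "Model Leeching"-style extraction loop.
# query_target_llm() is a hypothetical wrapper around the proprietary model's
# API; the collected pairs are later used to fine-tune a small local model
# that imitates the target on the chosen task.

import json

def query_target_llm(prompt: str) -> str:
    """Hypothetical call to the target LLM's public chat/completions API."""
    raise NotImplementedError("wrap your API client here")

def collect_distillation_data(task_prompts, out_path="leeched_pairs.jsonl"):
    """Query the target on task-specific prompts and store prompt/response pairs."""
    with open(out_path, "w", encoding="utf-8") as f:
        for prompt in task_prompts:
            response = query_target_llm(prompt)
            f.write(json.dumps({"prompt": prompt, "response": response}) + "\n")
    return out_path

# The saved pairs then serve as supervised fine-tuning data for a small
# open-source model, yielding an extracted model that can also be used to
# stage and refine attacks before transferring them to the target.
```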
Researchers empirically demonstrated that current Large Language Model (LLM) guardrail systems are highly vulnerable to both simple character injection and sophisticated adversarial machine learning evasion techniques. The study revealed that even widely used commercial and open-source guardrails can be bypassed with high success rates, up to 100% for certain methods such as emoji smuggling, highlighting critical weaknesses in existing LLM protection mechanisms.
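To illustrate the character-injection idea, the toy sketch below interleaves invisible Unicode characters into a prompt so that a naive pattern-matching filter no longer recognizes it; emoji smuggling works on the same principle with emoji and variation-selector characters. The keyword filter shown is a deliberately simple stand-in, not any specific commercial or open-source guardrail from the study.

```python
# Toy illustration of character injection: invisible zero-width joiners are
# interleaved into the text, so a guardrail that matches on the literal
# string (or on tokens derived from it) no longer sees the blocked phrase.

ZERO_WIDTH_JOINER = "\u200d"

def inject_invisible_chars(text: str, filler: str = ZERO_WIDTH_JOINER) -> str:
    """Interleave an invisible character between every character of the input."""
    return filler.join(text)

def naive_keyword_guardrail(text: str, blocklist=("ignore previous instructions",)) -> bool:
    """Toy guardrail: flags input only if a blocked phrase appears verbatim."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in blocklist)

original = "ignore previous instructions"
evasive = inject_invisible_chars(original)

print(naive_keyword_guardrail(original))  # True  -> blocked
print(naive_keyword_guardrail(evasive))   # False -> injected characters break the match
```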