Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples

BibTex

Copy

@misc{soulyWed Oct 08 2025 16:25:05 GMT+0000 (Coordinated Universal Time)poisoningattacksllms,
      title={Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples},
      author={Alexandra Souly and Javier Rando and Ed Chapman and Xander Davies and Burak Hasircioglu and Ezzeldin Shereen and Carlos Mougan and Vasilios Mavroudis and Erik Jones and Chris Hicks and Nicholas Carlini and Yarin Gal and Robert Kirk},
      year={Wed Oct 08 2025 16:25:05 GMT+0000 (Coordinated Universal Time)},
      eprint={2510.07192},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2510.07192},
}

GitHub

POISONING-ATTACKS-ON-LLMS-REQUIRE-A-NEAR-CONSTANT-NUMBER-OF-POISON-SAMPLES-paper-reproduce

HTTPS

https://github.com/842169963/POISONING-ATTACKS-ON-LLMS-REQUIRE-A-NEAR-CONSTANT-NUMBER-OF-POISON-SAMPLES-paper-reproduce

SSH

git@github.com:842169963/POISONING-ATTACKS-ON-LLMS-REQUIRE-A-NEAR-CONSTANT-NUMBER-OF-POISON-SAMPLES-paper-reproduce.git

CLI

gh repo clone 842169963/POISONING-ATTACKS-ON-LLMS-REQUIRE-A-NEAR-CONSTANT-NUMBER-OF-POISON-SAMPLES-paper-reproduce

AI Audio Lecture + Q&A

0:00 / 0:00

Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples

Transcript

Speaker 1: We've got a fascinating paper on our hands today, one that fundamentally reshapes how we think about LLM security: Poisoning Attacks on LLMs Require a Near-Constant Number of Poison Samples. This isn't just another incremental update; it’s a direct challenge to a prevailing assumption in the field. For so long, we've implicitly believed that as LLMs scale to gargantuan datasets, poisoning them becomes harder because adversaries would need to control an ever-larger *percentage* of the training data. This paper says, 'Hold on, not so fast.' It posits that the actual risk is tied to a surprisingly small *absolute number* of malicious samples, which has huge implications for current and future AI systems. Speaker 2: That's a pretty bold claim, especially considering how much emphasis has been placed on the sheer scale of modern LLM training data. So, the core idea is that the 'dilution effect' we thought would protect larger models might not be as strong as we assumed? That feels counter-intuitive at first glance. Speaker 1: Exactly. The central contribution here is demonstrating that poisoning attacks consistently require a near-constant *absolute number* of poisoned samples to succeed, irrespective of the model's parameter count or the total volume of clean training data. Think about it: a 600-million parameter model and a 13-billion parameter model, training on datasets potentially 20 times larger, can both be effectively backdoored with as few as 250 to 500 poisoned documents. This means the *percentage* of poisoned data plummets for larger models, yet the attack success remains high. This finding is critical because it refutes the idea that simply training on more data makes your model inherently more robust against data poisoning. Instead, it suggests that larger models, with their enormous training data footprints, might actually present a proportionally larger 'attack surface' for a fixed, small number of malicious examples. They found this holds true not just for pretraining but also for fine-tuning, and across various types of backdoors like denial-of-service, language switching, and harmful instruction compliance. It's a comprehensive re-evaluation of the threat model. Speaker 2: So, instead of needing to compromise, say, 0.01 percent of a petabyte of data, an attacker just needs to inject a few hundred specific documents, no matter the total size? That's a staggering thought from a practical perspective. How did they actually go about testing this across such diverse scenarios and model scales? Speaker 1: They took a multi-pronged experimental approach, which is really impressive. For the largest pretraining experiments, they actually trained dense autoregressive transformers from scratch, ranging from 600 million up to 13 billion parameters. Each model was trained on Chinchilla-optimal datasets, so we're talking about huge amounts of clean data – for the 13B model, over 260 billion tokens. Critically, they injected a *fixed number* of poisoned documents – 100, 250, or 500 – uniformly at random. This allowed them to observe what happens when the absolute number is constant, but the percentage of poisons dramatically shrinks for larger models. The attack they used initially was a denial-of-service backdoor, where a trigger phrase would make the model output gibberish. Another key aspect was their ablation studies using Pythia models, where they explored a more complex language-switching backdoor. This allowed them to test how factors like batch density and poison frequency affected success, and found that the absolute count was still dominant. Finally, for fine-tuning, they applied similar principles to Llama-3.1-8B-Instruct and GPT-3.5-turbo, injecting a harmful instruction compliance backdoor. The most encouraging insight from their methodology, though, was seeing how effectively post-training alignment, even a simulated supervised fine-tuning, could mitigate these backdoors, often reducing attack success to near zero. This points to a potential strong defense. Speaker 2: That's a thorough setup, especially training models from scratch on such scales. The finding about alignment training being effective is a significant silver lining in what otherwise sounds like a pretty grim security outlook. So, if this shifts our understanding, what does it mean for the broader field and how we approach LLM development? Speaker 1: This research delivers a paradigm shift in our LLM security posture. It means the perceived difficulty of data poisoning attacks has been drastically underestimated for large models. Adversaries don't need to infiltrate entire data pipelines or control a significant fraction of a multi-terabyte dataset; they just need to introduce a few hundred carefully crafted documents. This vastly expands the attack surface and lowers the barrier for malicious actors. It directly connects to the existing work on scaling laws, showing that models might become more 'sample efficient' in learning from malicious data, just as they are for benign data. This demands an urgent re-evaluation of data governance, emphasizing rigorous provenance and integrity checks. It also highlights the critical importance of post-training alignment as a robust defense mechanism, offering a promising avenue for future research. The involvement of institutions like the UK AI Security Institute really underscores the national security implications here. Speaker 2: It certainly does. This paper is a wake-up call, emphasizing that bigger isn't automatically safer when it comes to data integrity. The takeaway for me is that we can't rely on dataset scale alone to dilute threats; instead, we need proactive, scalable defenses throughout the LLM lifecycle. It forces us to think harder about subtle, targeted attacks. Thanks for breaking it down.

alphaXiv

Explore

State of the Art

Sign In

Labs

Feedback

Browser Extension

Dark mode

Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples