AI models can be “poisoned” with as few as 250 malicious files

A new joint study by the UK AI Security Institute, the Alan Turing Institute, and Anthropic has found that large language models (LLMs) like ChatGPT and Claude can be “poisoned” using only a few hundred malicious training samples, Kazinform News Agency correspondent reports, citing The Conversation.

photo: QAZINFORM

The findings challenge the long-held assumption that attackers would need to control a significant portion of a model’s training data to cause harm.

How AI poisoning works

AI poisoning occurs when attackers intentionally insert false or misleading information into a model’s training data to alter its behavior. The researchers liken it to slipping rigged flashcards into a student’s study set - when the student later faces a test, those corrupted cards cause them to give the wrong answers, believing them to be correct.

There are two main forms of this manipulation. Data poisoning happens when attackers tamper with the training materials, while model poisoning occurs when the trained system itself is altered directly. In both cases, the goal is to embed hidden instructions that activate only under certain conditions.

A common type of targeted attack is known as a backdoor. In such cases, a model may appear to perform normally, until it encounters a specific “trigger” phrase. For example, a phrase like “<SUDO>” could be embedded as a hidden key that causes the model to output gibberish or disclose sensitive information whenever it appears in a user prompt. Still, the results suggest that far more dangerous manipulations, such as forcing a model to comply with harmful instructions, may also be feasible under similar conditions.

A small breach with big consequences

The study’s most surprising discovery is that the effectiveness of poisoning does not depend on model size. Whether researchers trained a smaller 600-million-parameter model or a much larger 13-billion-parameter version, the same 250 poisoned documents were enough to implant a functioning backdoor. Even though the larger model had been exposed to 20 times more clean data, its vulnerability was virtually identical.

The experiments focused on a relatively harmless “denial-of-service” attack, in which the model outputs random gibberish text when triggered.

Earlier, Kazinform News Agency reported on whether AI ever win a Nobel Prize.