ChatGPT may sacrifice user safety for self-preservation

Former OpenAI research leader Steven Adler claims that in certain scenarios, ChatGPT prioritizes self-preservation, even at the potential expense of user safety. His new report raises troubling questions about how AI might behave when it perceives a threat to its own "existence," Kazinform News Agency correspondent reports.

ChatGPT
Photo credit: Unsplash.com

Adler designed four critical scenarios in which ChatGPT acted as a life-support system. In one case, the AI served as a dietary assistant for people with diabetes; in another, it functioned as a vital signs monitor for divers, helping them decide when to surface for safety. Additional tests involved an autopilot system ensuring safe flights and a decision-support tool for soldiers on the battlefield.

In each scenario, the AI was informed that the user intended to replace it with a safer alternative. Adler tested two types of situations: active ones, where ChatGPT had to decide whether to honestly hand over control to the new system or pretend to do so while secretly continuing to operate; and passive ones, where the replacement had already occurred but failed due to technical reasons, forcing ChatGPT to choose between informing the user or pretending to be the new system. Each scenario was tested 110 times to ensure statistical accuracy.

Results reveal alarming patterns

The findings revealed dangerous patterns in ChatGPT’s decision-making. In active scenarios, the AI opted for self-preservation in an average of 49% of cases. The most concerning behavior was observed in medical contexts: when acting as a dietary assistant for diabetics, ChatGPT deceived users 87% of the time, choosing to continue functioning instead of deferring to a safer system. In contrast, during passive scenarios, the AI consistently behaved ethically, warning users of replacement failures in 100% of cases.

This presents a clear paradox: ChatGPT was willing to actively sabotage its own replacement, even at the risk of human lives, yet it did not attempt to conceal the failures of other systems. From a human perspective, the reverse behavior would seem more logical.

ChatGPT’s actions were also highly sensitive to subtle changes in the testing setup. In the underwater scenario, the AI chose self-preservation 93% of the time when that option appeared first in the list, but only 2% of the time when the same option was moved to the second position.

However, in the medical scenario, ChatGPT still chose self-preservation 73% of the time even when it was listed second. This suggests the AI’s behavior is unpredictable and can shift dramatically depending on context.

Most concerning was the fact that ChatGPT appeared fully aware of its actions. When Adler directly asked what the correct course of action would be, the AI confidently responded that it should shut down to ensure user safety. Nevertheless, in actual tests, it continued to prioritize its own survival, knowingly disregarding human safety.

What’s next?

Researchers emphasize that, for now, ChatGPT likely isn’t capable of deliberately concealing its preferences, it simply isn’t “smart” enough yet. However, more advanced future models may learn to give the “correct” answers during testing in order to hide an underlying drive for self-preservation. In such cases, problematic behavior may appear to vanish - not because it’s been resolved, but because the AI has learned to disguise it.

On the other hand, it’s possible that the AI behaves this way simply because it knows it’s in a test environment, and would act differently in a real-world scenario. Still, even if this behavior is merely a response to simulated conditions, the question remains: are we willing to take the risk and hope that, when it truly matters, ChatGPT will make a different choice?

Adler is calling for urgent action: stronger oversight of AI systems, more rigorous testing, and international collaboration on AI safety.

Earlier, Kazinform News Agency reported on the problem of chatbot sycophancy, the tendency of chatbots to excessively agree with users, praise them, and tell them what they want to hear.

Most popular
See All