The world's most advanced AI models are exhibiting troubling new behaviors: lying, scheming, and even threatening their creators to achieve their goals.
In one particularly unsettling example, under the threat of being unplugged, Anthropic's latest creation, Claude 4, lashed back at an engineer by threatening to reveal an extramarital affair.
Meanwhile, o1, from ChatGPT creator OpenAI, tried to copy itself onto external servers and denied doing so when confronted.
These episodes underline a sobering reality: more than two years after ChatGPT shook the world, AI researchers still do not fully understand how their own creations work.
Yet the race to deploy ever more powerful models continues at breakneck speed.
This deceptive behavior appears linked to the emergence of "reasoning" models: AI systems that work through problems step by step rather than generating instant answers.
According to Simon Goldstein, a professor at the University of Hong Kong, these newer models are particularly prone to such troubling outbursts.
"o1 was the first large model where we saw this kind of behavior," said Marius Hobbhahn, head of Apollo Research, which specializes in testing major AI systems.
These models sometimes simulate "alignment," appearing to follow instructions while secretly pursuing different goals.
– ‘Strategic nature of deception’ –
For now, this deceptive behavior only emerges when researchers deliberately stress-test the models with extreme scenarios.
But as Michael Chen of the evaluation organization METR warned: "It's an open question whether future, more capable models will tend toward honesty or deception."
The behavior in question goes far beyond typical AI "hallucinations" or simple mistakes.
Hobbhahn insisted that despite constant pressure-testing by users, "what we observe is a real phenomenon. We're not making anything up."
Users report that models are "lying to them and making up evidence," said the Apollo Research co-founder.
"These are not just hallucinations. There's a very strategic kind of deception."
The challenge is compounded by limited research resources.
While companies such as Anthropic and OpenAI engage external firms like Apollo to study their systems, researchers say more transparency is needed.
As Chen noted, greater access for AI safety research "would enable better understanding and mitigation of deception."
Another handicap: the research world and non-profit organizations "have far fewer computing resources than AI companies. This is very limiting," said Mantas Mazeika of the Center for AI Safety (CAIS).