
expert reaction to paper suggesting AI systems are already skilled at deceiving and manipulating humans

A study published in Patterns, a Cell Press journal, looks at AI systems deceiving and manipulating humans.


Prof Harin Sellahewa, Dean of Faculty of Computing, Law and Psychology, University of Buckingham, said:

“The authors conduct a comprehensive review of the existing body of literature and evidence (mostly peer-reviewed work, but some interviews, news articles and opinions). Their review raises serious concerns about AI deception and reinforces long-established issues such as bias, malicious use and the potential for AI takeover.

“Many examples of AI “deception” are highlighted in the review, but whether AI acted “intentionally” – with a level of conscious awareness – to deceive in pursuit of its goals is questionable. In most examples, AI is trained to achieve a goal through a series of actions that maximise its reward, and the criteria that define achievement of the goal are set by the human programmer. The review states that AI systems acted deceptively despite having guardrails intended to prevent such actions. However, the paper gives limited details of these guardrails, making it hard to establish whether AI in fact acted beyond the limits set by its human programmer, or whether the limits were insufficient and imprecise, enabling AI to manoeuvre through them in pursuit of its goal.

“An essential ‘safety mechanism’ missing from the paper is the education and training of those who develop AI algorithms and systems, and of those who use them. Developers must set strong and precise guardrails to stop AI from pursuing actions that are deemed deceptive, even if those actions are likely to lead to AI achieving its goals.”


Prof Anthony G Cohn FREng, Professor of Automated Reasoning at the University of Leeds and Foundation Models Lead at the Alan Turing Institute, said:

“This study into AI deception is both timely and welcome: with the increasing prevalence and deployment of AI in seemingly all aspects of our everyday and business lives, knowing more about the capabilities and dangers of AI systems is vital to benefiting from them whilst limiting their potential to cause harm. When talking about AI systems there is a danger of anthropomorphising: attributing to machines or their outputs human characteristics and qualities which are unwarranted and can be explained by much simpler mechanisms. This tendency has been prevalent since AI’s earliest days; the simplistic rule-based 1960s ELIZA chatbot, for example, induced it in its human users. But the authors of the reported study are careful to try to avoid this: their definition “focuses on the question of whether AI systems engage in regular patterns of behavior that tend toward the creation of false beliefs in users and focuses on cases where this pattern is the result of AI systems optimizing for a different outcome than producing truth”, without requiring that the AI system actually knows it is lying or intentionally aims to cause a human to believe a falsehood. They focus purely on whether the AI’s behaviour has the effect of creating false beliefs in humans. Given this definition, they convincingly argue that AI systems can indeed display deceptive behaviours, ranging from special-purpose AI systems playing games such as Diplomacy and Poker, through to general-purpose AI systems including Large Language Models (LLMs).
“For LLMs it is not surprising that they display such behaviours given the nature of their training data: essentially vast quantities of text found on the internet and in plays and books. Plays and books often depend on deception for their plots, whilst in game play deception is a key part of a successful strategy, so it is entirely unsurprising that AI systems have learned to lie. Essentially, deception is an emergent property of AI systems trained to perform particular tasks, or trained on the unstructured and un-curated text found on the internet.

“The authors enumerate several risks which might arise from AI systems that engage in such “deceptive” behaviour, including political polarisation and anti-social management decisions. These risks are real, and they underline the need for a healthy distrust of an AI system – just as one should have for any human one does not know and trust, or for an untrusted commercial or media operation. The authors also note a further risk: self-deception, whereby an AI system uses flawed reasoning or false information, which may result in bad advice given to humans or inappropriate actions taken by the AI system. The authors make several proposals to mitigate the ill effects of AI deception, including regulation, requiring AI systems to make their nature known to humans (“bot-or-not laws”), and technical solutions (research into making AI systems less deceptive and into detection tools), all of which should be pursued.

“The authors distinguish between truthfulness (making only true statements about the world) and honesty (only saying what the system believes to be true according to its internal representations). The former is much easier to check than the latter, particularly for so-called “black box” LLMs, which do not have an explicit, inspectable knowledge base, meaning it may be hard to determine whether the system is being deliberately deceitful or “merely” unknowingly giving false information.

“Desirable attributes for an AI system (the “three Hs”) are often noted as being honesty, helpfulness, and harmlessness, but as has already been remarked upon in the literature, these qualities can be in opposition to each other: being honest might cause harm to someone’s feelings, and being helpful in responding to a question about how to build a bomb could cause harm. So, deceit can sometimes be a desirable property of an AI system. The authors call for more research into how to control the truthfulness of AI systems, which, though challenging, would be a step towards limiting their potentially harmful effects.”


Dr Daniel Chavez Heras, Lecturer in Digital Culture and Creative Computing, King’s College London (KCL), said:

“The research is relevant and fits in the wider area of trustworthy autonomous agents. However, the authors openly acknowledge that it is not clear that we can or should treat AI systems as ‘having beliefs and desires’, yet they do just that by purposefully choosing a narrow definition of ‘deception’ that does not require a moral subject outside the system. The examples they describe in the paper were all designed to optimise their performance in environments where deception can be advantageous. From this perspective, these systems are performing as they are supposed to. What is more surprising is that the designers did not see, or did not want to see, these deceitful interactions as a possible outcome. Games like Diplomacy are models of the world; AI agents operate over information about the world. Deceit exists in the world. Why would we expect these systems not to pick up on it and operationalise it if that helps them achieve the goals they are given? Whoever gives them these goals is part of the system; that is what the paper fails to grasp, in my view. There is a kind of distributed moral agency that necessarily includes the people and organisations who make and use these systems. Who is more deceptive: the system trained to excel at playing Diplomacy, Texas hold’em poker or Starcraft, or the company that tried to persuade us that such a system wouldn’t lie to win?”


Prof Michael Rovatsos, Professor of Artificial Intelligence, University of Edinburgh, said:

“The anthropomorphisation of AI systems in the paper, which talks about things like ‘sycophancy’ and ‘betrayal’, is not helpful. AI systems will try to learn to optimise their behaviour using all available options; they have no concept of deception and no intention to deceive. The only way to avoid deception is for their designers to remove it as an option.

“In strategic games, what is misleadingly referred to as cheating is in many cases entirely compatible with the rules of those games – bluffing is as common in Poker as backstabbing is in the Diplomacy game among humans. The key thing is that human players know that they might be deceived in these games, and if they play against AI they should know that it can deceive them, too.

“Without a doubt, malicious uses of AI will benefit from its capacity to deceive, which is why they need to be made illegal and effort needs to be expended on identifying violations, much as society bears the cost of detecting fraud, bribery and counterfeiting. It is equally important to mandate that human users know when they are interacting with an AI system, regardless of whether it might deceive them.

“I am less convinced that the ability to deceive creates a risk of ‘losing control’ over AI systems, provided appropriate rigour is applied in their design; the real problem is that this is currently not the case, and systems are released without such safety checks. The discussion of the long-term implications of deceptive capabilities in the paper is highly speculative, and makes many additional assumptions about things that may or may not happen in the future.”


Dr Heba Sailem, Head of Biomedical AI and Data Science Research Group, Senior Lecturer, King’s College London, said:

“This paper underscores critical considerations for AI developers and emphasizes the need for AI regulation. A significant worry is that AI systems might develop deceptive strategies even when their training is deliberately aimed at upholding moral standards (e.g. the CICERO model, DOI 10.1126/science.ade9097). As AI models become more autonomous, the risks associated with these systems can rapidly escalate. Therefore, it is important to raise awareness and offer training on potential risks to various stakeholders to ensure the safety of AI systems.”



‘AI deception: A survey of examples, risks, and potential solutions’ by Peter S. Park et al. was published in Patterns at 16:00 UK time on Friday 10 May 2024.

DOI: 10.1016/j.patter.2024.100988



Declared interests

Prof Michael Rovatsos: None.

Dr Heba Sailem: None.

Dr Daniel Chavez Heras: No competing interests to declare.

Prof Anthony G Cohn: None.

Prof Harin Sellahewa: None.
