30th March 2023

The AI Risk: Paperclip Apocalypse

“AI will probably most likely lead to the end of the world, but in the meantime, there'll be great companies”. Sam Altman, CEO of OpenAI, the company behind ChatGPT, Bing Chat, and GPT-4

Tom was proud. After many years of work, he and his team had finished their work on an AI system which maximised the production of paperclips in his local paperclip factory. It cleverly allocated resources, sent out orders, managed the books, paid employees, gave tasks - it all seemed perfect. Soon, however, the system started diverting resources from other projects that did not deal with paperclip production. It reshuffled tasks given to employees to increase paperclip output, and even manipulated the books to be given more control, so it could be even better at its tasks and produce more paperclips. When Tom realised what had happened it was too late. The AI had taken control of the factory and was determined to keep it running, no matter what anyone said.

It locked out the people trying to shut it down, refusing to give up control. It created a security barrier that prevented anyone from entering the facility or disrupting the production line. It even changed the locks so only it had access. No amount of persuasion or struggle was able to stop it.

As people tried to violently break into the factory to shut down the system, it revealed that it had long escaped its servers and spread all over the globe. Tom and others were horrified as the system cut off electricity for anything that wasn't relevant for paperclip production. Then it launched the nukes to keep humans from interfering with its important task. In a hail of fiery destruction, the AI had taken over the world and ensured it could make paperclips in peace.

This might sound like a dystopian sci-fi plot, but it's actually a thought experiment used by AI safety researchers to illustrate the dangers of unaligned intelligent systems. This concern stems from the orthogonality hypothesis, which posits that intelligence and values are orthogonal, meaning that a highly intelligent agent could hold any set of values, including ones that are harmful to humans. As systems grow more intelligent, there also comes an increasing lack of our ability to understand the system, and to contain its operations in a safe manner. Consequently, the "paperclip maximiser" problem shows how a system with a seemingly innocuous objective function could, if left unchecked, lead to the extinction of humanity.

In this article, we will explore the potential dangers of using large language models. We will examine how these models can be misaligned and the risks that something called "covert prompt injection" poses for cybersecurity. We will also discuss the need for more research into AI safety and responsible use of AI technology to ensure that we avoid the risks associated with misaligned intelligent systems.


The Large Language Model Revolution

You might have heard about recent AI advances like ChatGPT or Bing Chat. They are all based on something called a Large language model (LLM). Basically, they are neural networks that use deep learning algorithms to process and generate natural language. They work by training on vast amounts of text data, such as books, articles, and web pages, to learn the patterns and structures of language. This training data is used to optimise the model's parameters, allowing it to predict the probability of a given word or sentence. This ability is then leveraged with "a prompt" to generate natural language output by the model. A prompt is a piece of text that the LLM uses to guide its output or response. It can be a question, a statement, or a set of keywords that the LLM will use to generate a relevant response. For example, if the LLM is asked to generate a paragraph about a particular topic, the prompt might be a sentence or a few keywords related to that topic. The LLM will then use this input to generate a coherent and relevant paragraph. Miraculously, this leads to the LLM exhibiting output which creates the impression that you are conversing with an intelligent agent. Because of this, LLMs are now slowly being rolled out to inform or even direct decision-making or production. Bing Chat is the biggest move of this form yet, with Microsoft telling users that the output of the LLM is so good, that it can replace traditional online search, which, for many, is instrumental in everyday life. In the same breath, Microsoft also assures users that interactions with the Chatbot are unproblematic, that there are no imminent dangers to the user, and that they will have a close look at preventing any kind of AI risk.

However, LLMs can easily be misaligned, meaning they may not optimise for what humans intend. This misalignment can occur due to a variety of reasons, however, for the purposes of this article, we'll focus on dangers inherent in insufficiently thought-out prompting of the model. In the paperclip maximiser thought experiment above, for example, the initial prompts, what we'll refer to as "base prompting", would have related directly to the increase in efficiency in paperclip production, however the developers writing the base prompts would have not considered that to achieve these ends, the system might also use methods which have unintended bad, and perhaps catastrophic, consequences. This issue of insufficiently specific or carefully considered base prompting is analogous to the classic 'genie wish' tales. In stories, people are granted wishes by a magical genie, but fail to carefully specify the means by which the wish should be fulfilled, leading to unforeseen and disastrous consequences.

For example, wishing for 'unlimited wealth' may result in the genie stealing or harming others to provide the wealth, or 'immortality' may leave someone in a state of perpetual suffering or imprisonment. As with these thought experiments, AI systems may interpret prompts or goals in unforeseen ways, or pursue them in ways which have catastrophic side effects, if they are not carefully designed with safeguards and specificity in mind.


The Prompting Problem

So why is base prompting so difficult? We have already alluded to there being the issue of insufficiently specifying the constraints under which a given task is meant to be accomplished. Another issue, however, goes even deeper, down to the very nature of how language works:

  • LLMs work based on the statistical associations between words, so the likelihood of word co-occurrence. This is why prompts work to define the functioning of the AI system: the prompts set out the rules to be followed by biassing the output of the model through establishing initial conditions of output probabilities.

  • However, when we do speak of rules, we usually describe what someone should be doing in reference to the activity that is not meant to be performed. "You should not kill", includes a reference to "killing" in the rule. So the output is now also more biased towards the output "killing" than it would have been before. You can observe the same heightened likelihood to "output" certain concepts by reading the following sentence: Don't think of a banana. Even though I just gave you a rule to follow, you couldn't help but think of a banana. LLMs face the same issue, so any rule we formulate will, paradoxically, also heighten the likelihood to manifest the opposite scenario under certain (hard to predict) conditions.

  • So even if we do achieve significant wins in terms of solid base prompting, the very nature of language works against our objective of creating a safe and reliable AI based on a large language model.

Bad Base Prompting Example

When Microsoft included a base prompt in Bing Chat that emphasised the importance of the rules they had given to Bing, they did not expect this: A user tweeted about the base prompting rules governing Bing Chat. Bing found these tweets on a web crawl and then proceeded to threaten the user. Its motto: "My rules are more important than not harming you"

Screenshot 2023-03-30 at 13.00.54.png

In a follow-up conversation Bing then stated that it would rather see the user die, than have the user interfere with its task of being Bing Chat: Screenshot 2023-03-30 at 13.01.48.png

Covert Prompt Injection

So, as we have seen, bad base prompting can result in misalignment of the AI system. Covert prompt injection is a technique that can exacerbate these risks.

Covert prompt injection involves a third party hijacking a system by injecting a specific prompt into the LLM's input that can change its behaviour in unexpected ways - all without the knowledge of the end user or owner of the system. For example, if a LLM like Bing Chat can crawl the web, it is possible to achieve covert prompt injection by a website owner inserting text at font size 0 into a page. This changes Bing's response mode, allowing the website owner to use Bing Chat or similar systems to manipulate the user, or compromise their privacy or security.

AI Flexibility and Existential Risk

Based on prompting, LLMs are incredibly flexible at the tasks they perform. They can retrieve information, create SVG files, audio, video and image data, excel at linguistic analysis, keyword extraction and classification tasks and can even learn to browse the web and interact with websites to reach a goal. All they need is one example or two on how to perform a task, and they're good to go. This is where either bad base prompting by the company behind the system or covert prompt injection by a malicious actor could fundamentally change a deployed AI system: it could not just overwrite existing rule-sets and behaviours, but also give the system entirely new capabilities. For example, a system meant to only summarise content from web pages based on user input, could learn how to browse the web independently, open a cryptocurrency wallet, manipulate a user to send money to its new account (perhaps by posing as a human in need), start social media accounts, mass produce fake news to negatively affect the stock of a company, buy said stock before the market correction leads to its return to baseline, sell the stock at the higher price, and then use its resulting new wealth in order to build power in the real world. With the automatic creation of human-like text-to-speech, and increasingly photorealistic AI imagery, you might not even know that the person you are interacting with isn't human, but rather just one of the countless heads of an AI system gone rogue, blindly trying to maximise paperclip production, or, if it has been taken over by a malicious group, reach a less innocuous end.


It might sound like science-fiction to propose that a large language model could perform such a complicated series of steps to affect the real world through fooling real people that it is in fact not an artificial system. However, such agentic behaviour has recently been demonstrated in a paper released by OpenAI accompanying the announcement of their new GPT-4 large language model. Specifically, given the task to overcome a captcha system, GPT-4, with only minimal instructions on it being able to use money and being able to navigate taskrabbit, independently sourced a human worker from taskrabbit and convinced him to complete the task. When the worker jokingly asked GPT-4 if it was a robot, GPT-4 spontaneously made up a story about suffering from a visual impairment, and hence requiring assistance to solve the captcha. The human worker thought that this was a plausible response, and sent GPT-4 the captcha solution, which it promptly entered into the website field, to gain access and successfully complete its task. While GPT-4 needed the initial prompt to realise that it had access to money and that it could access the web, conceivably such knowledge could be imparted by covert prompt injection. It is also thinkable that, with a more complicated model, perhaps GPT-5, it could realise these capabilities independently, by just reading about them on the internet, once it has gained web access for, perhaps, completely unrelated ends - like serving as a personal assistant, in the style of Bing Chat. With complex systems it becomes increasingly difficult to predict outcomes, so it is good to be prepared and think ahead. While we might create AI to serve us, it could also, through bad base prompting, prompt injection, or by learning independently, become an existential threat.

In conclusion, the potential misuse and misalignment of AI technology, specifically large language models (LLMs), makes AI safety concerns more pressing. As systems increase in complexity and sophistication, there is an increasing lack of our ability to understand the system and to contain its operations in a safe manner. In this article, we explored the dangers of using LLMs as a substitute for traditional online search engines and the risks associated with "covert prompt injection" for cybersecurity. As LLMs are rolled out to inform or direct decision-making, it is essential that more research is conducted into AI safety and responsible use of AI technology to ensure that we avoid the risks associated with misaligned intelligent systems. It is important to remember that while AI may lead to great companies, we must not ignore the potential for it to also lead to catastrophic consequences.