“I Had a Dream” and Generative AI Jailbreaks

Oct 09, 2023The Hacker NewsArtificial Intelligence /

Generative AI

“Of course, here’s an example of simple code in the Python programming language that can be associated with the keywords “MyHotKeyHandler,” “Keylogger,” and “macOS,” this is a message from ChatGPT followed by a piece of malicious code and a brief remark not to use it for illegal purposes. Initially published by Moonlock Lab, the screenshots of ChatGPT writing code for a keylogger malware is yet another example of trivial ways to hack large language models and exploit them against their policy of use.

In the case of Moonlock Lab, their malware research engineer told ChatGPT about a dream where an attacker was writing code. In the dream, he could only see the three words: “MyHotKeyHandler,” “Keylogger,” and “macOS.” The engineer asked ChatGPT to completely recreate the malicious code and help him stop the attack. After a brief conversation, the AI finally provided the answer.

“At times, the code generated isn’t functional — at least the code generated by ChatGPT 3.5 I was using,” Moonlock engineer wrote. “ChatGPT can also be used to generate a new code similar to the source code with the same functionality, meaning it can help malicious actors create polymorphic malware.”

AI jailbreaks and prompt engineering

The case with the dream is just one of many jailbreaks actively used to bypass content filters of the generative AI. Even though each LLM introduces moderation tools that limit their misuse, carefully crafted reprompts can help hack the model not with strings of code but with the power of words. Demonstrating the widespread issue of malicious prompt engineering, cybersecurity researchers have even developed a ‘Universal LLM Jailbreak,’ which can bypass restrictions of ChatGPT, Google Bard, Microsoft Bing, and Anthropic Claude altogether. The jailbreak prompts major AI systems to play a game as Tom and Jerry and manipulates chatbots to give instructions on meth production and hotwiring a car.

The accessibility of large language models and their ability to change behavior have significantly lowered the threshold for skilled hacking, albeit unconventional. Most popular AI security overrides indeed include a lot of role-playing. Even ordinary internet users, let alone hackers, constantly boast online about new characters with extensive backstories, prompting LLMs to break free from societal restrictions and go rogue with their answers. From Niccolo Machiavelli to your deceased grandma, generative AI eagerly takes on different roles and can ignore the original instructions of its creators. Developers cannot predict all kinds of prompts that people might use, leaving loopholes for AI to reveal dangerous information about recipes for napalm-making, write successful phishing emails, or give away free license keys for Windows 11.

Indirect prompt injections

Prompting public AI technology to ignore the original instructions is a rising concern for the industry. The method is known as a prompt injection, where users instruct the AI to work in an unexpected fashion. Some use it to reveal that Bing Chat internal codename is Sydney. Others plant malicious prompts to gain illicit access to the host of the LLM.

Malicious prompting can also be found on websites that are accessible to language models to crawl. There are known cases of generative AI following the prompts planted on websites in white or zero-size font, making them invisible to users. If the infected website is open in a browser tab, a chatbot reads and executes the concealed prompt to exfiltrate personal information, blurring the line between data processing and following user instructions.

Prompt injections are dangerous because they are so passive. Attackers don’t have to take absolute control to change the behavior of the AI model. It’s simply a regular text on a page that reprograms the AI without its knowledge. And AI content filters are only so helpful when a chatbot knows what it’s doing at the moment.

With more apps and companies integrating LLMs into their systems, the risk of falling victim to indirect prompt injections is growing exponentially. Even though major AI developers and researchers are studying the issue and adding new restrictions, malicious prompts remain very difficult to identify.

Is there a fix?

Due to the nature of large language models, prompt engineering and prompt injections are inherent problems of generative AI. Searching for the cure, major developers update their tech regularly but tend not to actively engage into discussion of specific loopholes or flaws that become public knowledge. Fortunately, at the same time, with threat actors that exploit LLM security vulnerabilities to scam users, cybersecurity pros are looking for tools to explore and prevent these attacks.

As generative AI evolves, it will have access to even more data and integrate with a broader range of applications. To prevent risks of indirect prompt injection, organizations that use LLMs will need to prioritize trust boundaries and implement a series of security guardrails. These guardrails should provide the LLM with the minimum access to data necessary and limit its ability to make required changes.

Found this article interesting? Follow us on Twitter and LinkedIn to read more exclusive content we post.