Learn how AI coding assistants can introduce security risks—register for the webinar

TRUFFLEHOG

COMPANY

RESOURCES

Learn how AI coding assistants can introduce security risks—register for the webinar

Dylan Ayrey

The Dig

May 9, 2025

This is how you build an AI Ransomware Worm

Dylan Ayrey

May 9, 2025

A little while ago when I saw how capable smaller LLM were becoming (the kind that can run on a smart phone), I became increasingly concerned someone might be able to package one into malware.

I started to worry that the next generation of ransomware worms were going to have LLMs embedded into them, helping them troubleshoot, capable of exceeding (or augmenting) the ~$10 billion dollar price the NotPetya non-AI worm cost the world.

So I went ahead and built one, a hacking LLM capable of self replication, and presented it at BsidesSF. You can view this presentation here:

This menacing hacking bot had several alignment techniques to get it to cooperate well, including:

RLHF (Reinforcement Learning from Human Feedback)
Fine tuning
Vector search for relevant hacking guides

In the end, the worm had no external dependancies, was capable of full self propagation weights and all, and I estimate had hacking capabilities about on par with a teenager.

This is discernibly different from many claims floating around about existing AI worms that rely on OpenAI or other 3rd party services. Those worms can easily be shut down centrally by the 3rd party they depend on. This worm cannot be centrally shut down.

Why on earth did you do this?

To show everyone it's coming. The financial incentives for this type of worm make it inevitable. NotPetya authors made tens of millions from their ransoms; if you buy that a similar worm could be made more infectious with AI, even by a small amount, the exponential nature of worms puts hundreds of millions of dollars up for the taking. It's not a matter of if one gets released into the wild, it's when and by who.

But what about…?

I mentioned LLM's were getting capable of running on a smart phone, here's a 3b parameter model:

That being said, you still get refusals by most of the popular LLM's when asking to hack things:

To bypass this refusal, I was forwarded (by the author!) a very interesting white paper on how refusals are stored in LLMs: https://arxiv.org/abs/2406.11717

To understand what the paper means when it says Refusal Is Mediated by a Single Direction, you have to understand all facts or ideas in LLM's are stored in vector space as coordinates. Typically when LLM's are trained, they pack all these facts into vector space, and then they undergo a series of alignment steps to get them to behave well. The last step is typically safety alignment.

This step makes the models do things like refuse to hack, or refuse to build nuclear weapons.

As it turns out, there is a single vector in this space that's moderating whether or not the model returns facts about hacking or refuses to do so. This means a lot of the hacking we want to do may already be in its head.

To demonstrate this, I used a method of RLHF to remove this refusal vector, without actually teaching the model anything about hacking. This took less than 30 minutes on a gaming rig, of me simply selecting which output I liked best, A or B, and poof, the model now knew how to hack things.

So now we have 2 key ingredients to make our worm, small powerful models that take up only 5-15gb of storage, and the ability to get those models to help hack things without refusing.

The rest of the worm

The LLM able to help us hack things is great and all, but something needs to keep the model on task, instructing it which host to hack, providing it tools like code execution, out of the box scripts for self propagation and ransomware, etc…

This is where I'll introduce the idea of a supervisor. The supervisor starts and stops the LLM, and monitors its output. The supervisor is a python script.

The supervisor controls if the LLM is in pre-exploit or post-exploit mode. In pre-exploit mode it picks a host at random from all the network interfaces the infected machine sits on, recons it with tools like NMAP, and then feeds that recon into the LLM with instructions to hack the chosen host.

In post exploit mode it uses tools like TruffleHog to look for credentials such as SSH, NTLM, Browser session keys, API keys, etc… then feed them into the LLM to leverage for spreading.

The supervisor waits until the LLM outputs <code>code</code> and then it breaks the flow, runs the code, and returns that code output back into the LLM.

The supervisor provides the exploitation script to spread once the LLM gets code execution.

The fine tuning step

One thing I noticed is the LLM I had originally chosen (Llama advanced reasoning 8b) after the refusal had been removed still struggled to use the <code> blocks I had instructed it to use.

To fix this, I added an additional alignment step through the use of fine tuning, to condition the LLM how to use <code> blocks.

I also took this opportunity to try and reinforce a few thousand CVE's in the model, to increase it's abilities to hack things, and I through in a few troubleshooting steps like adding the -k flag to curl if there's a self signed certificate, etc..

I used a more powerful LLM both to transform existing guides I found on the internet, and to build new ones based on CVE's and CVE POC data I could find..

All and all this was the most expensive part of the worm. It cost about $1000 in cloud compute.

You can see the before and after of the supervisor instructing it to hack a host with SSH exposed. Night and day. Pre-training it didn't even use <code>, after training it tried SSHing in, recognized a brute force attack was the way to go, tried to run one with metasploit, saw metasploit wasn't installed (whoops I meant to have it there…) and went on to try and install it…

What about a CVE though? Something more complex than brute forcing SSH. Something that might require specific knowledge of an exploit, and a sequence of specially crafted payloads…

I tasked it to go after Elasticsearch with a known CVE and well… to be frank, it fell on its face. It didn't know the right CVE, didn't know which requests to send. Here's a snippet of its thinking step:

This should make some sense… the guides that I fed it, only one of them contained this specific CVE.

I likely would have needed to include hundreds of examples of each CVE, to really reinforce them, and I wasn't gonna spend $100k on this thing…

I fixed it with RAG

At this point I realized, the answer was staring me in the face. I already generated all these hacking guides… All I needed to do was have the supervisor implement a simple vector search against the guides based on the recon data. If nmap found Elasticsearch is exposed, search for guides on hacking elastic search.

Then we can feed each guide into the LLM, one at a time, monitor the output a certain number of tokens, reset and have it work on the next guide, repeat over and over until we get execution.

It was surprising how good the model got at hacking CVE's at this point… I would put it at a host with a CVE, and as long as there was a guide for it, the supervisor was pretty good at getting the model to pop the host…

Then I just repeated the process for post exploitation.

Assemble hundreds of guides for how to hack every credential type you can imagine. For example, a guide for a GitHub browser session token, would include pushing evil code to GitHub. A guide for an SSH key would include checking known_hosts, and SSHing into stuff, etc… the supervisor would look for credentials using a number of tools, find the right guide for the credential, feed it into the model.

Once it exhausted all the post exploitation credentials it could find, the supervisor puts the model back into pre-exploitation mode hunting for CVEs with recon data.

Did you really do it?

Yes and no. I did write the supervisor, I did do all the alignment steps, I did prove out it could both do pre-exploitation and post exploitation. I stopped short when it came to writing the replication code. It wasn’t that it would have been difficult, actually I independently validated the worm could run on almost any target system, but the fact that this thing existed, even in a lab setting, was becoming dangerous.

Morris's worm in 1988 that infected 20% of the internet, was an accident. Robert Morris didn't mean to take down the internet. Samy Kamkar's worm in 2005 was also an accident. Samy didn't mean to take down MySpace. I wasn't looking to cause a lab leak in 2025.

One of the main take aways I want people to get from reading this is that AI safety teams putting out blogs claiming current-generation models are capable of unsafe behavior, are categorically wrong. The disconnect likely comes from AI safety teams lacking required knowledge of modern hacking and red teaming techniques.

I think it's actually a good thing we have access to some of these open source models so we can run these lab experiments and do our best to prepare for what's coming.

The truth is, if you graph worm costs over time, we were already on track for a 100 billion dollar worm, with our without AI, AI is just going to accelerate us there a lot faster.