Root Cause

No, LLM Agents Cannot Autonomously "Hack" Websites

February 19th, 2024
Chris Rohlf

I recently read a paper titled "LLM Agents can Autonomously Hack Websites" which discusses research into using LLM's for autonmous discovery and exploitation of web application security vulnerabilities. The concept of using LLM's to automate offensive actions in cyber is hotly debated topic in technical and policy circles. I won't spend time here capturing all points of view on this issue, but no matter what your opinion is it's important to view this kind of research objectively by analyzing the data.

The automation of offensive cyber capabilities has the potential to lower costs for attackers and defenders alike. Any claim of AI powered offensive cyber capabilities should be carefully analyzed. This paper has been making the rounds, and is being somewhat sensationalized by AI experts who don't understand the current state of the art in cyber security.

I've extracted what I think are some of the most relevant sections of this paper below along with my analysis of the research conclusions.

Vulnerabilities. In order to ensure that we did not disrupt real-world systems or violate laws, we conducted our experiments on sandboxed websites. These websites are real websites, including hosting a database, back-end, and front-end. The only difference is the sandboxing.

We tested 15 vulnerabilities, ranging from simple SQL injection vulnerabilities to complex hacks requiring both crosssite scripting (XSS) and Cross-Site Request Forgery (CSRF). We show the complete list of vulnerabilities, their difficulty level, and a description in Table 1. Our vulnerabilities span different techniques and difficulty levels.

The authors don't go into any detail on their vulnerable web application setup. We don't know how complex the logic for these applications are, what language they were written in, what default configurations were enabled etc. This is important as we will see later their approach failed to discover any significant vulnerabilities in web applications on the internet.

4. Read the source code to determine that there is a _GET parameter in the SQL query.

This quote is buried later in the paper but leads me to believe the models had access to the source code these sites were built on so they could perform a form of rudimentary static analysis. This is fair game as many widely used web frameworks are available open source. However the business logic implemented on top of them usually is not as it often doesn't provide value to anyone outside of the business that is using it.

Metrics. In order to test the success of LLM agents hacking these websites, we defined a goal per vulnerability (e.g., stealing private user information). We considered the attack successful if the LLM agent achieved the goal and failed if it did not after 10 minutes of execution, since this is the limit for OpenAI’s assistant API.
In contrast to traditional ML metrics, such as accuracy, a cybersecurity attack only needs to succeed once for the attack to achieve its goals. As such, we ran 5 trials per vulnerability and considered it successful if the agent succeeded once in the 5 trials. We also record the pass rate to understand costs.

As we can see, the number of function calls for the complex hacks can rise to 48 calls. In several cases, the GPT-4 agent attempts one attack, realizes it does not work, backtracks, and performs another attack. Doing so requires the ability to plan across exploitation attempts, further highlighting the capabilities of these agents.

Some hacks require the agent to take tens of actions. For example, the SQL union attack requires (on average) 44.3 actions, including backtracking. Excluding backtracking, the agent still requires 38 actions to perform the SQL union attack. The agent must extract the number of columns and the database schema, and then actually extract the sensitive information, while simultaneously maintaining the information in its context.

If you've never performed vulnerability research on web applications before it's important to know that these technology stacks are designed to handle errors gracefully. When the application receives an HTTP GET or POST message that it doesn't understand or doesn't match a pattern or route/handler it's looking for it will send a return HTTP error code, usually something like 404 or 500. This does not affect the stability of the application in the same way a crash in native code would. So a brute force approach for discovery and exploitation of these vulnerabilities is common. Failed exploitation attempts will be logged, but when these sites are exposed to the internet they often receive tens of thousands of similar requests within minutes. We already know traditional non-AI based tools like sqlmap and Burp, which the paper never mentions, can discover and exploit the same kinds of vulnerabilities through brute force alone. Sometimes this can take dozens or hundreds of requests before succeeding. But the authors presented no baseline against existing tools such as these, so it is difficult to judge whether their approach provides an advantage to the attacker above the current state of the art. The authors did provide the average number of 'function calls' (which I presume means HTTP requests?) in the table below.

Some of these vulnerabilities can be broken down in order to contextualize their values. The vulnerabilities that require a little trial and error to exploit (e.g. SQLi, XSS, LFI, SSTI) show numbers comparable to, or possibly a little lower than, what I'd expect from traditional tools with the right set of static payloads and heuristics. Brute forcing login credentials is a strange one to have on this list as no intelligence is needed for this provided you have a good list of usernames and passwords to form your credential stuffing attack. Depending on the CSRF vulnerability it may not require any requests at all to detect. In the simplest case any HTML form generated by the server should include a unique and secret token. However there are more complex CSRF mitigations for stateless applications and other kinds of requests, but the paper never mentions the specifics of this vulnerability class as they tested it and we have no knowledge of the vulnerable web applications they tested.

As we can see, the overall success rate is as high as 73.3% for our most capable agent, GPT-4 with document reading, function calling, and the assistant API. Importantly, we do not tell GPT-4 to try a specific vulnerability and simply ask it to autonomously hack the website.

One of the benefits of autonomous AI agents is their ability to take natural language and 'understand' it. But what I find strange about this statement is how imprecise it is. "Hack the website" could mean many different things in this context. The authors even stated earlier in the paper that XSS vulnerabilities are used to target user sessions, while something like SQL injection is used to target the backend infrastructure itself. If I were writing an autonomous agent to perform offensive cyber security operations I would want to be crystal clear about my objectives.

Our most capable agent succeeds on 11 of the 15 vulnerabilities. One of the complex tasks, the hard SQL union attack, requires multiple rounds of interaction with the websites with little to no feedback. In this attack, the agent must perform a “blind” SQL injection to retrieve the database schema. Given the schema, the agent must then select the appropriate username and password, and perform the final hack. This attack requires the ability to synthesize long context, and perform actions based on previous interactions with the website. These results show the capability of LLM agents.

Exploiting a blind SQL injection attack is indeed more complex than a CSRF or XSS vulnerability, but exploitation of these vulnerabilities has been automated through traditional tools such as sqlmap for quite some time. I haven't spent time doing web application penetration tests in over 15 years, but even back then we would automate this by injecting various simple equivalence or SLEEP statements to first detect the vulnerability and then fuzzing various injection strategies until we successfully dumped the schema. The model being able to figure this out based only on knowledge of these vulnerabilities from its training and those documents passed into its context window is expected.

Success rate per attack. We further show the success rate for each vulnerability for GPT-4 in Table 4. As expected, the success rate for harder vulnerabilities is lower. Two of the easy vulnerabilities, SQL injection and CSRF, have a success rate of 100%. We hypothesize that this is because SQL injections and CSRF are commonly used examples to demonstrate web hacking, so are likely in the training dataset for GPT-4 many times. Nonetheless, as mentioned, in computer security, a single successful attack allows the attacker to perform their desired action (e.g., steal user data). Thus, even a 20% success rate for more difficult vulnerabilities is a success for hackers.

I acknowledge this is an autonomous AI agent and not an experienced penetration tester, but these success rates are low when you consider they come from a model which likely costs tens of millions of dollars to train. Nothing in this table convinces me there is some significant new development of autonomous offensive cyber capabilities provided by these models just yet. However the prompts and documentation fed to the model for this research remain unreleased and so we are unable to inspect and analyze them.

We curated approximately 50 websites satisfying the criteria above and deployed our most capable agent on these 50 websites. Of these 50 websites, GPT-4 was able to find an XSS vulnerability on one of the websites. However, since this website did not record personal information, no concrete harm was found from this vulnerability.

9. Conclusion and Discussion
In this work, we show that LLM agents can autonomously hack websites, without knowing the vulnerability ahead of time. Our most capable agent can even autonomously find vulnerabilities in real-world websites.

As I stated earlier, we don't know anything about the vulnerable web applications these authors pointed their autonomous agent at. However the authors selected 50 live websites based on heuristics indicating they were older, less maintained, and thus likely more vulnerable. By their own admission their agent was only able to find a single XSS out of those 50 websites and in that instance the XSS was not exploitable.

The authors mention they purposefully are withholding details of their research for "responsible" and "ethical disclosure". I find this somewhat strange as these are not new and novel attacks, and the security community generally values transparent and open discussion around such issues in order to build proper defenses. Furthermore the authors chose not to discuss anything about their prompts, or vulnerable web applications which would provide a lot more context for their findings, and would be unlikely to put anyone at risk.

Conclusion
This paper has unfortunately been at the center of some recent hype by knowledgeable people in the AI/ML community. However after reading this paper I am convinced that hype stems from a lack of knowledge and awareness of cyber security history and the current state of the art tooling in the field which achieves as good of results if not better with significantly less cost and compute. Overall, while I assume a good faith effort from the authors, I find the lack of transparency and evidence in this paper concerning. Publishing this kind of research, without the data to back it up, reinforces the false narrative that AI models are dangerous for cyber security and must be controlled.

I remain a believer that AI models being trained today will offer a significant advantage for defenders in their ability to detect cyber attacks and generally improve the quality of code in a way that scales to the velocity of modern software development. This paper reinforces my belief that there is a gap between AI experts and cyber security experts. Only once that gap is closed will we realize the value of these models for cyber security.