Root Cause


LLM Emergent Abilities and Weird Machines

October 4th, 2024
Chris Rohlf

AI scaling laws describe how the performance of Large Language Models (LLMs) improves somewhat predictably with increases in several key factors: 1) pre-training computational resources, 2) model parameter count, 3) training token count, and, as recently demonstrated by OpenAI's o1, 4) inference-time computational resources. An interesting consequence of these scaling laws is what are known as "Emergent Abilities": capabilities that were not explicitly programmed into the model but arise naturally from scaling and generalizing over vast amounts of pre-training data. In short, an LLM's abilities are not so much designed as they are discovered. This is a fundamental and important distinction from traditional software, which is designed and programmed with an explicit and intended set of functionalities, much of which can be tested for and sometimes even formally verified against a specification.
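
To make "somewhat predictably" concrete, one commonly cited scaling-law fit is the compute-optimal form from Hoffmann et al. (2022). It is shown here only as an illustration of the functional shape; the constants are fitted empirically per model family and none of the specific values are taken from this post:

    % One commonly cited scaling-law fit (Hoffmann et al., 2022):
    %   N = model parameter count, D = training token count,
    %   E, A, B, \alpha, \beta = empirically fitted constants.
    L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}

Lower pre-training loss with more parameters and data is what curves like this capture; emergent abilities are the qualitative capabilities that show up, sometimes abruptly on a given benchmark, as models move down that curve.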

When it comes to LLM abilities, particularly those relevant to cybersecurity, many are dual-use, meaning they can be employed for both beneficial and malicious purposes depending on the actor's intent. As we've previously discussed, we don't always immediately know the full extent of what these models are capable of. This uncertainty necessitates running benchmarks and evaluations against each new model release to systematically assess its capabilities and the potential uplift it provides to both attackers and defenders.

However, there is always the possibility that a threat actor could discover an emergent ability that remains unknown to others. Whether that ability is dual-use or not becomes largely irrelevant if good actors have no knowledge of it. At first glance, this scenario seems novel and unique to LLMs. However, traditional software can also contain flaws, referred to as 0-day vulnerabilities when they are unknown to the software's authors and the public, that allow the software to be exploited and effectively reprogrammed in unintended ways. The exploitation of these vulnerabilities can enable the creation of weird machines, which are constructed by combining unintended computational artifacts that become active given the right set of inputs. They can be thought of as emergent in the sense that these artifacts accumulate with complexity in ways that are often hard to predict.
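
As a deliberately simplified sketch of the idea, consider the toy C program below. The struct layout, field names, and the single overflowed byte are illustrative assumptions rather than anything referenced above, and this is not a real exploit: a missing bounds check lets crafted input overwrite an adjacent field, putting the program into a state its author never intended to be reachable from untrusted input. Real weird machines chain many such artifacts together (return-oriented programming gadgets, for example), but the principle is the same.

    /* Toy sketch of a weird-machine primitive: an out-of-bounds write
     * driven by crafted input activates a code path the author never
     * intended to be reachable. The struct layout and overflow byte are
     * illustrative assumptions; behavior depends on compiler/platform. */
    #include <stdio.h>
    #include <string.h>

    struct session {
        char name[16];  /* intended: short user name taken from input   */
        int  is_admin;  /* intended: set only by an authentication step */
    };

    static void admin_only_feature(void) {
        /* Functionality never meant to be reachable from raw input. */
        printf("unintended state reached: admin path executed\n");
    }

    int main(void) {
        struct session s = { .name = {0}, .is_admin = 0 };

        /* The bug: strcpy() performs no bounds check. Input longer than
         * the name buffer spills into the adjacent is_admin field, so
         * the in-memory layout itself becomes an unintended "instruction
         * set" programmable through input alone. */
        const char *crafted_input = "AAAAAAAAAAAAAAAA\x01"; /* 16 filler bytes + 0x01 */
        strcpy(s.name, crafted_input);

        if (s.is_admin) {
            admin_only_feature();  /* the "weird machine" fires */
        } else {
            printf("normal session\n");
        }
        return 0;
    }

Nothing in this program was designed to let input set is_admin, yet the combination of memory layout and a missing check makes that state reachable, which is precisely the sense in which unintended functionality emerges from complexity.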

It is often the case that threat actors discover these vulnerabilities long before they are known to the authors of the code or the general public, leveraging them to craft weird machines and ultimately compromise their targets. The comparison between emergent abilities in LLMs and weird machines in traditional software is an interesting one, as it largely boils down to unintended properties of complex systems. However, the comparison is also somewhat flawed. For instance, emergent abilities tend to be generalized and reusable and often improve the model's usefulness, whereas vulnerabilities in software are typically seen as regrettable errors to be fixed. Emergent abilities in LLMs are often regarded as breakthroughs, even if they present some risks in a dual-use context.

However flawed the comparison between emergent abilities in LLMs and weird machines may be, it underscores the unpredictable and often unknowable properties inherent in any complex system that is too large to formally assess with any precision. Both involve functionality that was not intentionally designed by the system's developers but that can be discovered and potentially utilized by good and bad actors alike. Recognizing the parallels and distinctions between the two is important when evaluating and debating the risks associated with making new models available. While LLMs are ushering in a new computing paradigm that grows more capable with scale, there are still valuable lessons to be drawn from traditional threat modeling, and from the many flawed assumptions we've made about human-authored software and how well we can reason about its total set of functionality, intended or otherwise.