A new, formal definition of agency gives clear principles for causal modeling of AI agents and the incentives they face
We want to build safe, aligned artificial general intelligence (AGI) systems that pursue the intended goals of their designers. Causal influence diagrams (CIDs) are a way to model decision-making situations that allows us to reason about agents' incentives. For example, here is a CID for a 1-step Markov decision process – a typical framework for decision-making problems.
S1 represents the initial state, A1 the agent's decision (square), and S2 the next state. R2 is the agent's reward/utility (diamond). Solid edges specify causal influence. Dashed edges specify information links – what the agent knows when it makes its decision.
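To make the diagram's structure concrete, here is a minimal sketch of this CID as plain Python data. The node names follow the figure; the dictionary/list encoding is just one possible representation, not anything prescribed by the paper.

```python
# A minimal sketch of the 1-step MDP CID above as plain Python data.
# Node roles follow the figure: "chance" (round), "decision" (square),
# "utility" (diamond).
cid_nodes = {
    "S1": "chance",    # initial state
    "A1": "decision",  # agent's action
    "S2": "chance",    # next state
    "R2": "utility",   # reward
}

# Solid edges carry causal influence; dashed edges are information links,
# i.e. what the agent observes before making its decision.
causal_edges = [("S1", "S2"), ("A1", "S2"), ("S2", "R2")]
information_edges = [("S1", "A1")]
```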
By relating training setups to the incentives that shape agent behavior, CIDs help illuminate potential risks before an agent is trained and can inspire better agent designs. But how do we know when a CID is an accurate model of a training setup?
Our new paper, Discovering Agents, introduces new ways of tackling these issues, including:
- The first formal causal definition of agents: agents are systems that would adapt their policy if their actions affected the world differently
- An algorithm for agent discovery from empirical data
- A translation between causal models and CIDs
- Resolving earlier confusions arising from incorrect causal modeling of agents
Combined, these results provide an extra layer of assurance that a modeling mistake hasn't been made, meaning that CIDs can be used to analyze an agent's incentives and safety properties with greater confidence.
Example: modeling a mouse as an agent
To illustrate our method, consider the following example: a world containing three squares, with a mouse starting in the middle square that chooses to go left or right, reaches its next position, and then potentially gets some cheese. The floor is icy, so the mouse might slip. Sometimes the cheese is on the right, but sometimes on the left.
The environment of the mouse and the cheese.
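As a rough picture of the setup, here is a toy simulation of the environment. The slip probability and cheese placement are illustrative values of mine, not numbers from the paper.

```python
import random

def mouse_episode(action, p_slip=0.2, cheese_side="right"):
    """One episode of the toy environment: the mouse starts in the middle
    square and picks 'left' or 'right'. With probability p_slip the icy
    floor sends it the other way; it gets utility 1 if it ends up on the
    side with the cheese. (p_slip and cheese_side are made-up settings.)"""
    slipped = random.random() < p_slip
    position = action if not slipped else ("left" if action == "right" else "right")
    utility = 1 if position == cheese_side else 0
    return position, utility

print(mouse_episode("right"))  # e.g. ('right', 1), or ('left', 0) if it slipped
```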
This can be represented by the following CID:
CID for the mouse. D represents the mouse's left/right decision. X is the mouse's new position after taking the left/right action (it might slip and end up on the other side by accident). U represents whether the mouse gets the cheese or not.
The intuition that the mouse would choose a different behavior for different environment settings (iciness, cheese distribution) can be captured by a mechanized causal graph, which, for each (object-level) variable, also includes a mechanism variable that governs how the variable depends on its parents. Crucially, we allow links between mechanism variables.
This graph contains additional mechanism nodes in black, representing the mouse's policy and the iciness and cheese distribution.
Mechanized causal graph for the mouse and cheese environment.
Links between mechanisms represent direct causal influence. The blue edges are special terminal edges – roughly, mechanism edges A~ → B~ that would still exist even if the object-level variable A were altered so that it had no outgoing edges.
In the example above, since U has no children, its mechanism edge must be terminal. But the mechanism edge X~ → D~ is not terminal, because if we cut X off from its child U, then the mouse will no longer adapt its decision (since its position will not affect whether it gets the cheese).
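A rough sketch of this terminal-edge check in code, using hypothetical results of the cutting experiments for the mouse example; the variable names and the `~` convention are just for illustration.

```python
# Mechanism edges found for the mouse (A~ denotes the mechanism of A).
mechanism_edges = [("X~", "D~"), ("U~", "D~")]

# Mechanism edges found again after cutting each object-level variable off
# from its children (results of hypothetical follow-up experiments):
edges_after_cutting = {
    "X": [("U~", "D~")],                # cut X off from U: X~ -> D~ vanishes
    "U": [("X~", "D~"), ("U~", "D~")],  # U has no children, nothing changes
}

def terminal_edges(mech_edges, after_cutting):
    """A mechanism edge A~ -> B~ is terminal if it still exists once the
    object-level variable A is altered to have no outgoing edges."""
    return {(a, b) for (a, b) in mech_edges
            if (a, b) in after_cutting[a.rstrip("~")]}

print(terminal_edges(mechanism_edges, edges_after_cutting))  # {('U~', 'D~')}
```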
Causal discovery of agents
Causal discovery infers a causal graph from experiments involving interventions. In particular, one can discover an arrow from a variable A to a variable B by experimentally intervening on A and checking whether B responds, even when all other variables are held fixed.
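In code, the basic test might look like the following sketch. The intervention hook `run_with_a_set_to` is a placeholder I'm assuming, not an interface from the paper.

```python
def responds_to_intervention(run_with_a_set_to, a_values):
    """Crude interventional test for an arrow A -> B: force A to each value
    in a_values while holding every other variable fixed, and check whether
    the resulting value of B changes. `run_with_a_set_to(a)` is assumed to
    perform that intervention and return B's value. (Schematic: real causal
    discovery must also deal with noise and many variables at once.)"""
    observed_b = {run_with_a_set_to(a) for a in a_values}
    return len(observed_b) > 1

# Example with a deterministic stand-in system where B = 2 * A:
print(responds_to_intervention(lambda a: 2 * a, [0, 1]))  # True -> arrow A -> B
```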
Our first algorithm uses this technique to discover the mechanized causal graph:
Algorithm 1 takes as input interventional data from the system (the mouse and cheese environment) and uses causal discovery to derive the mechanized causal graph. See the paper for details.
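As a simplified illustration of the overall shape (not the paper's actual procedure), Algorithm 1 can be thought of as looping an interventional test over all pairs of variables; `responds` below stands in for an oracle built from experiments like the one sketched above.

```python
def discover_graph(variables, responds):
    """Sketch of the spirit of Algorithm 1: for every ordered pair of
    variables (object-level and mechanism alike), ask the interventional
    oracle `responds(a, b)` whether b changes when we intervene on a while
    holding everything else fixed, and keep the arrows that respond.
    (Highly simplified; the paper's algorithm is more careful.)"""
    return [(a, b) for a in variables for b in variables
            if a != b and responds(a, b)]

# e.g. discover_graph(["D", "X", "U", "D~", "X~", "U~"], responds=some_oracle)
```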
Our second algorithm turns this mechanized causal graph into a game graph:
Algorithm 2 takes as input a mechanized causal graph and maps it to a game graph. An incoming terminal edge indicates a decision; an outgoing one indicates a utility.
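Here is a small sketch of that labeling rule applied to the mouse example. It only captures the decision/utility rule quoted in the caption, not the full algorithm, and the helper names are mine.

```python
def label_game_graph(object_edges, mechanism_edges, terminal):
    """Sketch of the labeling rule described above: a variable whose
    mechanism has an incoming terminal edge is a decision, and one whose
    mechanism has an outgoing terminal edge is a utility. `mechanism_edges`
    are pairs like ("U~", "D~"); `terminal` is the subset of those pairs
    that are terminal. (Simplified; see the paper for the real algorithm.)"""
    def base(m):  # "D~" -> "D"
        return m.rstrip("~")
    decisions = {base(b) for (a, b) in mechanism_edges if (a, b) in terminal}
    utilities = {base(a) for (a, b) in mechanism_edges if (a, b) in terminal}
    nodes = {v: ("decision" if v in decisions else
                 "utility" if v in utilities else "chance")
             for edge in object_edges for v in edge}
    return nodes, object_edges

# Mouse example: U~ -> D~ is the only terminal edge, so D becomes a decision
# and U a utility, while X stays a chance node.
nodes, edges = label_game_graph(
    object_edges=[("D", "X"), ("X", "U")],
    mechanism_edges=[("X~", "D~"), ("U~", "D~")],
    terminal={("U~", "D~")},
)
print(nodes)  # {'D': 'decision', 'X': 'chance', 'U': 'utility'}
```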
Overall, Algorithm 1 followed by Algorithm 2 allows us to discover agents from causal experiments, representing them using CIDs.
Our third algorithm transforms the game graph into a mechanized causal graph, allowing us to translate between the game and mechanized causal graph representations under some additional assumptions:
Algorithm 3 takes as input a game graph and maps it to a mechanized causal graph. A decision indicates an incoming terminal edge, a utility indicates an outgoing terminal edge.
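A sketch of that reverse direction, under the simplifying assumption (mine, not the paper's) that it suffices to add a terminal edge from each utility's mechanism to the mechanism of every decision that can influence it:

```python
def mechanism_edges_from_game(decisions, utilities, can_influence):
    """Very rough sketch of the game-graph-to-mechanized-graph direction:
    for each decision D and each utility U that D can influence (per
    `can_influence`), add a terminal mechanism edge U~ -> D~, giving the
    decision's mechanism an incoming terminal edge and the utility's
    mechanism an outgoing one. (Illustrative only; the paper's Algorithm 3
    relies on additional assumptions.)"""
    return {(u + "~", d + "~") for d in decisions for u in utilities
            if (d, u) in can_influence}

# Mouse example: D can influence U (via X), recovering the edge U~ -> D~.
print(mechanism_edges_from_game({"D"}, {"U"}, {("D", "U")}))  # {('U~', 'D~')}
```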
Better safety tools for modeling AI agents
We proposed the first formal causal definition of agents. Grounded in causal discovery, our key insight is that agents are systems that adapt their behavior in response to changes in how their actions influence the world. Indeed, our Algorithms 1 and 2 describe a precise experimental process that can help assess whether a system contains an agent.
Interest in causal modeling of AI systems is growing rapidly, and our research grounds this modeling in causal discovery experiments. Our paper demonstrates the potential of our approach by improving the safety analysis of several example AI systems, and shows that causality is a useful framework for discovering whether there is an agent in a system – a key concern for assessing risks from AGI.
Excited to learn more? Take a look at our paper. Comments and feedback are welcome.