A new, formal definition of agency gives clear principles for causal modeling of AI agents and the incentives they face
We want to build safe, aligned artificial general intelligence (AGI) systems that pursue the intended goals of their designers. Causal influence diagrams (CIDs) are a way to model decision-making situations that allows us to reason about agents' incentives. For example, here is a CID for a 1-step Markov decision process – a typical framework for decision-making problems.
S1 represents the initial state, A1 the agent's decision (square), and S2 the next state. R2 is the agent's reward/utility (diamond). Solid edges specify causal influence. Dashed edges specify information links – what the agent knows when it makes its decision.
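To make the diagram's structure concrete, here is a minimal sketch of this CID as plain Python data. The node names follow the figure; the dictionary/list encoding is just one possible representation, not anything prescribed by the paper.

```python
# A minimal sketch of the 1-step MDP CID above as plain Python data.
# Node roles follow the figure: "chance" (round), "decision" (square),
# "utility" (diamond).
cid_nodes = {
    "S1": "chance",    # initial state
    "A1": "decision",  # agent's action
    "S2": "chance",    # next state
    "R2": "utility",   # reward
}

# Solid edges carry causal influence; dashed edges are information links,
# i.e. what the agent observes before making its decision.
causal_edges = [("S1", "S2"), ("A1", "S2"), ("S2", "R2")]
information_edges = [("S1", "A1")]
```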
By relating training setups to the incentives that shape agent behavior, CIDs help illuminate potential risks before an agent is trained and can inspire better agent designs. But how do we know when a CID is an accurate model of a training setup?
Our new paper, Discovering Agents, introduces new ways of tackling these issues, including:
- The first formal causal definition of agents: agents are systems that would adapt their policy if their actions affected the world differently
- An algorithm for agent discovery from empirical data
- A translation between causal models and CIDs
- Resolving earlier confusions arising from incorrect causal modeling of agents
Combined, these results provide an extra layer of assurance that a modeling mistake hasn't been made, meaning that CIDs can be used to analyze an agent's incentives and safety properties with greater confidence.
Example: modeling a mouse as an agent
To illustrate our method, consider the following example: a world containing three squares, with a mouse starting in the middle square that chooses to go left or right, reaches its next position, and then potentially gets some cheese. The floor is icy, so the mouse might slip. Sometimes the cheese is on the right, but sometimes on the left.
The environment of the mouse and the cheese.
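As a rough picture of the setup, here is a toy simulation of the environment. The slip probability and cheese placement are illustrative values of mine, not numbers from the paper.

```python
import random

def mouse_episode(action, p_slip=0.2, cheese_side="right"):
    """One episode of the toy environment: the mouse starts in the middle
    square and picks 'left' or 'right'. With probability p_slip the icy
    floor sends it the other way; it gets utility 1 if it ends up on the
    side with the cheese. (p_slip and cheese_side are made-up settings.)"""
    slipped = random.random() < p_slip
    position = action if not slipped else ("left" if action == "right" else "right")
    utility = 1 if position == cheese_side else 0
    return position, utility

print(mouse_episode("right"))  # e.g. ('right', 1), or ('left', 0) if it slipped
```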
This can be represented by the following CID:
CID for the mouse. D represents the mouse's left/right decision. X is the mouse's new position after taking the left/right action (it might slip and end up on the other side by accident). U represents whether the mouse gets the cheese or not.
The intuition that the mouse would choose a different behavior for different environment settings (iciness, cheese distribution) can be captured by a mechanized causal graph, which, for each (object-level) variable, also includes a mechanism variable that governs how the variable depends on its parents. Crucially, we allow links between mechanism variables.
This graph contains additional mechanism nodes in black, representing the mouse's policy and the iciness and cheese distribution.
Mechanized causal graph for the mouse and cheese environment.
Links between mechanisms represent direct causal influence. The blue edges are special terminal edges – roughly, mechanism edges A~ → B~ that would still exist even if the object-level variable A were altered so that it had no outgoing edges.
In the example above, since U has no children, its mechanism edge must be terminal. But the mechanism edge X~ → D~ is not terminal, because if we cut X off from its child U, then the mouse will no longer adapt its decision (since its position will not affect whether it gets the cheese).
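A rough sketch of this terminal-edge check in code, using hypothetical results of the cutting experiments for the mouse example; the variable names and the `~` convention are just for illustration.

```python
# Mechanism edges found for the mouse (A~ denotes the mechanism of A).
mechanism_edges = [("X~", "D~"), ("U~", "D~")]

# Mechanism edges found again after cutting each object-level variable off
# from its children (results of hypothetical follow-up experiments):
edges_after_cutting = {
    "X": [("U~", "D~")],                # cut X off from U: X~ -> D~ vanishes
    "U": [("X~", "D~"), ("U~", "D~")],  # U has no children, nothing changes
}

def terminal_edges(mech_edges, after_cutting):
    """A mechanism edge A~ -> B~ is terminal if it still exists once the
    object-level variable A is altered to have no outgoing edges."""
    return {(a, b) for (a, b) in mech_edges
            if (a, b) in after_cutting[a.rstrip("~")]}

print(terminal_edges(mechanism_edges, edges_after_cutting))  # {('U~', 'D~')}
```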
Causal discovery of agents
Causal discovery infers a causal graph from experiments involving interventions. In particular, one can discover an arrow from a variable A to a variable B by experimentally intervening on A and checking whether B responds, even when all other variables are held fixed.
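In code, the basic test might look like the following sketch. The intervention hook `run_with_a_set_to` is a placeholder I'm assuming, not an interface from the paper.

```python
def responds_to_intervention(run_with_a_set_to, a_values):
    """Crude interventional test for an arrow A -> B: force A to each value
    in a_values while holding every other variable fixed, and check whether
    the resulting value of B changes. `run_with_a_set_to(a)` is assumed to
    perform that intervention and return B's value. (Schematic: real causal
    discovery must also deal with noise and many variables at once.)"""
    observed_b = {run_with_a_set_to(a) for a in a_values}
    return len(observed_b) > 1

# Example with a deterministic stand-in system where B = 2 * A:
print(responds_to_intervention(lambda a: 2 * a, [0, 1]))  # True -> arrow A -> B
```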
Our first algorithm uses this technique to discover the mechanized causal graph:
Algorithm 1 takes as input interventional data from the system (the mouse and cheese environment) and uses causal discovery to derive the mechanized causal graph. See the paper for details.
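As a simplified illustration of the overall shape (not the paper's actual procedure), Algorithm 1 can be thought of as looping an interventional test over all pairs of variables; `responds` below stands in for an oracle built from experiments like the one sketched above.

```python
def discover_graph(variables, responds):
    """Sketch of the spirit of Algorithm 1: for every ordered pair of
    variables (object-level and mechanism alike), ask the interventional
    oracle `responds(a, b)` whether b changes when we intervene on a while
    holding everything else fixed, and keep the arrows that respond.
    (Highly simplified; the paper's algorithm is more careful.)"""
    return [(a, b) for a in variables for b in variables
            if a != b and responds(a, b)]

# e.g. discover_graph(["D", "X", "U", "D~", "X~", "U~"], responds=some_oracle)
```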
Our second algorithm turns this mechanized causal graph into a game graph:
Algorithm 2 takes as input a mechanized causal graph and maps it to a game graph. An incoming terminal edge indicates a decision; an outgoing one indicates a utility.
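Here is a small sketch of that labeling rule applied to the mouse example. It only captures the decision/utility rule quoted in the caption, not the full algorithm, and the helper names are mine.

```python
def label_game_graph(object_edges, mechanism_edges, terminal):
    """Sketch of the labeling rule described above: a variable whose
    mechanism has an incoming terminal edge is a decision, and one whose
    mechanism has an outgoing terminal edge is a utility. `mechanism_edges`
    are pairs like ("U~", "D~"); `terminal` is the subset of those pairs
    that are terminal. (Simplified; see the paper for the real algorithm.)"""
    def base(m):  # "D~" -> "D"
        return m.rstrip("~")
    decisions = {base(b) for (a, b) in mechanism_edges if (a, b) in terminal}
    utilities = {base(a) for (a, b) in mechanism_edges if (a, b) in terminal}
    nodes = {v: ("decision" if v in decisions else
                 "utility" if v in utilities else "chance")
             for edge in object_edges for v in edge}
    return nodes, object_edges

# Mouse example: U~ -> D~ is the only terminal edge, so D becomes a decision
# and U a utility, while X stays a chance node.
nodes, edges = label_game_graph(
    object_edges=[("D", "X"), ("X", "U")],
    mechanism_edges=[("X~", "D~"), ("U~", "D~")],
    terminal={("U~", "D~")},
)
print(nodes)  # {'D': 'decision', 'X': 'chance', 'U': 'utility'}
```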
Overall, Algorithm 1 followed by Algorithm 2 allows us to discover agents from causal experiments, representing them using CIDs.
Our third algorithm transforms the game graph into a mechanized causal graph, allowing us to translate between the game and mechanized causal graph representations under some additional assumptions:
Algorithm 3 takes as input a game graph and maps it to a mechanized causal graph. A decision indicates an incoming terminal edge, a utility indicates an outgoing terminal edge.
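A sketch of that reverse direction, under the simplifying assumption (mine, not the paper's) that it suffices to add a terminal edge from each utility's mechanism to the mechanism of every decision that can influence it:

```python
def mechanism_edges_from_game(decisions, utilities, can_influence):
    """Very rough sketch of the game-graph-to-mechanized-graph direction:
    for each decision D and each utility U that D can influence (per
    `can_influence`), add a terminal mechanism edge U~ -> D~, giving the
    decision's mechanism an incoming terminal edge and the utility's
    mechanism an outgoing one. (Illustrative only; the paper's Algorithm 3
    relies on additional assumptions.)"""
    return {(u + "~", d + "~") for d in decisions for u in utilities
            if (d, u) in can_influence}

# Mouse example: D can influence U (via X), recovering the edge U~ -> D~.
print(mechanism_edges_from_game({"D"}, {"U"}, {("D", "U")}))  # {('U~', 'D~')}
```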
Better safety tools for modeling AI agents
We proposed the first formal causal definition of agents. Grounded in causal discovery, our key insight is that agents are systems that adapt their behavior in response to changes in how their actions influence the world. Indeed, our Algorithms 1 and 2 describe a precise experimental process that can help assess whether a system contains an agent.
Interest in causal modeling of AI systems is growing rapidly, and our research grounds this modeling in causal discovery experiments. Our paper demonstrates the potential of our approach by improving the safety analysis of several example AI systems, and shows that causality is a useful framework for discovering whether there is an agent in a system – a key concern for assessing risks from AGI.
Excited to learn more? Take a look at our paper. Comments and feedback are welcome.