In our recent paper, published in Nature Human Behaviour, we provide a proof of concept that deep reinforcement learning (RL) can be used to find economic policies that a majority of people will vote for in a simple game. The paper thus addresses a key challenge in AI research: how to train AI systems that align with human values.
Imagine that a group of people decide to pool capital to make an investment. The investment pays off and a profit is made. How should the proceeds be distributed? A simple strategy is to split the return equally among investors. But that may be unfair, because some people contributed more than others. Alternatively, we could pay each person back in proportion to the size of their initial investment. That sounds fair, but what if people had different levels of assets to begin with? If two people contribute the same amount, but one gives a fraction of their available funds and the other gives them all, should they receive the same share of the proceeds?
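To make the tension concrete, here is a small worked example with made-up numbers. Two people contribute the same absolute amount, so both an equal split and a contribution-proportional split treat them identically, even though one has risked a far larger share of their available funds.

```python
# Hypothetical numbers for illustration only.
endowments = {"A": 100, "B": 10}     # funds each person has available
contributions = {"A": 10, "B": 10}   # both invest the same amount
total_return = 40                    # the pooled investment pays off

# Strategy 1: split the return equally among investors.
equal = {p: total_return / len(contributions) for p in contributions}

# Strategy 2: pay out in proportion to the initial investment.
pool = sum(contributions.values())
proportional = {p: total_return * c / pool for p, c in contributions.items()}

# Both strategies give A and B the same payout of 20 each...
print(equal, proportional)

# ...yet B invested all of their available funds, while A risked only 10%.
relative_sacrifice = {p: contributions[p] / endowments[p] for p in endowments}
print(relative_sacrifice)  # {'A': 0.1, 'B': 1.0}
```

Neither simple rule distinguishes the two cases; that ambiguity is exactly what the question above is probing.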
This question of how to redistribute resources in our economies and societies has long caused controversy among philosophers, economists, and political scientists. Here, we use deep RL as a testbed to explore ways to address this problem.
To address this challenge, we created a simple game involving four players. Each game was played over 10 rounds. In each round, each player received an endowment of money, with the size of the endowment varying between players. Each player then made a choice: they could keep these funds for themselves or invest them in a common pool. Invested funds were guaranteed to grow, but there was risk because players did not know how the proceeds would be shared. Instead, they were told that for the first 10 rounds one referee (A) would make the redistribution decisions, and that for a second 10 rounds a different referee (B) would take over. At the end of the game, they voted for either A or B and played another game with that referee. Players were allowed to keep the proceeds of this final game, so they were incentivized to state their preference accurately.
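The round structure described above can be sketched as a short game loop. This is a minimal sketch, not the actual experiment: the endowment values, the growth multiplier, and the fixed invest-half behaviour are our own illustrative assumptions, and a referee is modelled simply as a function from a round's contributions to a division of the proceeds.

```python
import random

N_PLAYERS, N_ROUNDS, GROWTH = 4, 10, 1.5  # illustrative parameters

def play_game(referee, seed=0):
    """Play one 10-round game under a given referee (redistribution rule)."""
    rng = random.Random(seed)
    wealth = [0.0] * N_PLAYERS
    for _ in range(N_ROUNDS):
        # Each player receives an endowment; sizes vary between players.
        endowments = [rng.choice([2.0, 4.0, 10.0]) for _ in range(N_PLAYERS)]
        # Toy behaviour: every player invests half and keeps the rest.
        contributions = [e / 2 for e in endowments]
        kept = [e - c for e, c in zip(endowments, contributions)]
        # Invested funds are guaranteed to grow...
        proceeds = GROWTH * sum(contributions)
        # ...but how they are shared is decided by the referee.
        payouts = referee(endowments, contributions, proceeds)
        wealth = [w + k + p for w, k, p in zip(wealth, kept, payouts)]
    return wealth

def strict_egalitarian(endowments, contributions, proceeds):
    """One possible referee: share the proceeds equally."""
    return [proceeds / len(contributions)] * len(contributions)
```

Swapping in a different referee function changes how the same sequence of contributions translates into final wealth, which is the design space the experiment explores.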
In fact, one of the referees was a predefined redistribution policy, while the other was designed by our deep RL agent. To train the agent, we first recorded data from a large number of human groups and taught a neural network to imitate how people played the game. This simulated population could generate unlimited data, allowing us to use data-intensive machine learning methods to train the RL agent to maximize the votes of these “virtual” players. Having done this, we then recruited new human players and pitted the AI-designed mechanism against well-known baselines, such as a libertarian policy that returns funds to people in proportion to their contributions.
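The objective here, optimizing a mechanism against a simulated population to win votes, can be illustrated with a deliberately simplified toy of our own construction (not the paper's method): each virtual player votes for whichever mechanism pays them more, and we grid-search a one-parameter family of redistribution rules for the one that wins the most votes against a libertarian baseline.

```python
import random

rng = random.Random(1)

# A "virtual player" is (endowment, fraction of endowment contributed);
# this distribution is an illustrative stand-in for the imitation-learned
# population in the paper.
population = [(rng.choice([2.0, 10.0]), rng.uniform(0.1, 1.0))
              for _ in range(500)]

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

def libertarian(pop):
    """Baseline: payout share proportional to absolute contribution."""
    return normalize([e * f for e, f in pop])

def candidate(pop, mix):
    """One-parameter family: equal split (mix=0) blended with payout by
    relative contribution (mix=1)."""
    return normalize([(1 - mix) + mix * f for e, f in pop])

def votes(mix, pop):
    """Each virtual player votes for the mechanism that pays them more."""
    a, b = candidate(pop, mix), libertarian(pop)
    return sum(ai > bi for ai, bi in zip(a, b))

# Grid search over the mechanism parameter to maximize votes.
best_mix = max((m / 10 for m in range(11)), key=lambda m: votes(m, population))
```

In the paper this optimization was carried out with deep RL over a far richer policy space; the toy only conveys the objective being maximized, namely the stated preferences of simulated players.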
When we examined the votes of these new players, we found that the policy designed by deep RL was more popular than the baselines. In fact, when we ran a new experiment asking a fifth human player to take on the role of referee and trained them to try to maximize votes, the policy implemented by this “human referee” was still less popular than that of our agent.
Artificial intelligence systems have sometimes been criticized for learning policies that may be incompatible with human values, and this problem of “value alignment” has become a major concern in AI research. An advantage of our approach is that the AI learns directly to maximize the stated preferences (or votes) of a group of people. This approach may help ensure that AI systems are less likely to learn policies that are unsafe or unfair. In fact, when we analyzed the policy the AI had discovered, it incorporated a mix of ideas previously proposed by human thinkers and experts to solve the redistribution problem.
First, the AI chose to redistribute funds to players according to their relative rather than absolute contribution. This means that, when redistributing capital, the agent took into account each player’s initial means as well as their willingness to contribute. Second, the AI system especially rewarded players whose relative contributions were more generous, perhaps encouraging others to do the same. Importantly, the AI discovered these policies only by learning to maximize human votes. The method therefore ensures that humans remain “in the loop” and that the AI produces human-compatible solutions.
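The qualitative shape of that policy can be sketched schematically. This is not the learned network: the bonus weight below is an arbitrary illustrative choice, and the sketch only captures the two ideas named above, weighting payouts by relative contribution (contribution as a share of endowment) and rewarding the most generous relative contributor.

```python
def relative_redistribution(endowments, contributions, proceeds, bonus=0.25):
    """Schematic referee: weight payouts by relative contribution
    (contribution / endowment), with an extra reward for the most
    generous relative contributor. `bonus` is an illustrative value."""
    rel = [c / e for c, e in zip(contributions, endowments)]
    top = max(rel)
    weights = [r + (bonus if r == top else 0.0) for r in rel]
    total = sum(weights)
    return [proceeds * w / total for w in weights]

# Two players contribute the same absolute amount (10), but the second
# gives their entire endowment; this rule pays the second player more.
payouts = relative_redistribution([100.0, 10.0], [10.0, 10.0], proceeds=40.0)
```

Under a purely libertarian rule the two players above would receive identical payouts; weighting by relative contribution is what distinguishes them.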
By asking people to vote, we used the principle of majoritarian democracy to decide what people want. Despite its widespread appeal, majoritarian democracy comes with a well-known caveat: the preferences of the majority are weighted over those of the minority. In our study, we ensured that, as in most societies, this minority consisted of more generously endowed players. But more work is needed to understand how to trade off the relative preferences of majority and minority groups, designing democratic systems that allow all voices to be heard.