GoalCycle3D task space
We introduce GoalCycle3D, a 3D physical simulated task space built in Unity (refs. 38,39), which expands on the GoalCycle gridworld environment of ref. 33. By anchoring our task dynamics to this previous literature and translating it into a 3D space, our results naturally extend prior work to a more naturalistic and realistic environment. The resulting richness is an important step towards the eventual deployment of AI, highlighting which algorithmic innovations are required to exceed the prior state of the art in a more realistic setting.
Similar to ref. 27, we decompose an agent's task as the direct product of a world, a game and a set of co-players. The world comprises the size and topography of the terrain and the locations of objects. The game defines the reward dynamics for each player, which in GoalCycle3D amount to a correct ordering of goals. A co-player is another interactive policy in the world, consuming observations and producing actions. Each task can be viewed as a different Markov decision process, thus presenting a distribution of environments for reinforcement learning.
While the 3D task space yields significant richness, it also presents opportunities for handcrafting, which would reduce the generality of our findings. To avoid this, we make use of procedural generation over a wide task space. More specifically, we generate worlds and games uniformly at random for training, and test generalisation to held-out probe tasks at evaluation time, including with a held-out human co-player, as described in Probe tasks. This train-test split allows overfitting to be ruled out, just as in supervised learning.
Worlds are parameterised by world size, terrain bumpiness and obstacle density. The obstacles and terrain create navigational and perception challenges for players. Players are positively rewarded for visiting goal spheres in particular cyclic orders. To construct a game, given a number of goals n, an order \(\sigma \in S_n\) is sampled uniformly at random. The positively rewarding orders for the game are then fixed to be \(\{\sigma, \sigma^{-1}\}\), where \(\sigma^{-1}\) is the opposite direction of the order \(\sigma\). An agent has a chance \(\frac{2}{(n-1)!}\) of selecting a correct order at random at the start of each episode. In all our training and evaluation we use \(n \ge 4\), so one is always more likely to guess incorrectly. The positions and orders of the goal spheres are randomly sampled at the start of each episode.
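As a concrete illustration of this construction, the sketch below samples the pair of rewarding orders and computes the guessing probability. The function names and the use of Python's `random.sample` are our own illustrative choices, not the paper's implementation.

```python
import math
import random

def sample_game(n, seed=None):
    """Sample the rewarding cyclic orders for an n-goal game (illustrative sketch)."""
    rng = random.Random(seed)
    # Fixing goal 0 as the start, a cyclic order on n goals is a permutation
    # of the remaining n - 1 goals, giving (n - 1)! distinct cycles.
    sigma = [0] + rng.sample(range(1, n), n - 1)
    sigma_inv = [sigma[0]] + sigma[:0:-1]  # the same cycle traversed backwards
    return sigma, sigma_inv

def chance_of_correct_guess(n):
    """Probability of guessing a rewarding order uniformly at random: 2/(n-1)!."""
    return 2 / math.factorial(n - 1)

print(sample_game(5, seed=0))      # the two rewarding orders for a 5-goal game
print(chance_of_correct_guess(4))  # 0.333...: with n >= 4 a random guess is more likely wrong
```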
Players receive a reward of +1 for entering a goal in the correct order, given the previous goals entered. The first goal entered in an episode always confers a reward of +1. If a player enters an incorrect goal, they receive a reward of −1 and must then continue as if this were the first goal they had entered. If a player re-enters the last goal they left, they receive a reward of 0. The optimal policy is to divine a correct order, by experimentation or by observation of an expert, and then visit the spheres in this cyclic order for the rest of the episode. Figure 1 summarises the GoalCycle3D task space.
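A minimal sketch of a per-step reward rule consistent with this description is shown below. It tracks only the most recently entered goal, and the state representation and function name are ours rather than the environment's actual code.

```python
def goal_reward(entered_goal, last_goal, correct_orders):
    """Reward for entering `entered_goal` given the goal entered last (None at episode start).

    `correct_orders` holds the two rewarding cyclic orders, sigma and its reverse.
    """
    if last_goal is None:
        return 1.0   # the first goal entered always rewards +1
    if entered_goal == last_goal:
        return 0.0   # re-entering the goal just left is neutral
    for order in correct_orders:
        i = order.index(last_goal)
        if order[(i + 1) % len(order)] == entered_goal:
            return 1.0   # correct successor in one of the rewarding cycles
    # Incorrect goal: penalised, and the player continues as if this were
    # the first goal entered (the caller resets its notion of `last_goal`).
    return -1.0
```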
A 3D physical simulated task space. Each task contains procedurally generated terrain, obstacles, and goal spheres, with parameters randomly sampled on task creation. Each agent is independently rewarded for visiting goals in a particular cyclic order, also randomly sampled on task creation. The correct order is not provided to the agent, so an agent must deduce the rewarding order either by experimentation or via cultural transmission from an expert. Our task space presents navigational challenges of open-ended complexity, parameterised by world size, obstacle density, terrain bumpiness and number of goals. Our agent observes the world using LIDAR (see Supplementary Movie 30).
The term cultural transmission has a variety of definitions, reflecting the diverse literature on the subject. For clarity, we adopt a specific definition in this paper, one that captures the key features of few-shot imitation. Intuitively, the agent must improve its performance upon witnessing an expert demonstration and maintain that improvement within the same episode once the demonstrator has departed. However, what looks like test-time cultural transmission might actually be cultural transmission during training, leading to memorisation of fixed navigation routes. To address this, we measure cultural transmission on held-out test tasks and with human expert demonstrators (refs. 40,41), similar to the familiar train-test dataset split in supervised learning (ref. 42).
Capturing this intuition, we define cultural transmission from expert to agent as the average of the improvement in agent score when an expert is present and the improvement in agent score once that expert has subsequently departed, normalised by the expert score and evaluated on held-out tasks that have never before been experienced by the agent. Mathematically, let \(E\) be the total score achieved by the expert in an episode of a held-out task. Let \(A_{\mathrm{full}}\) be the score of an agent with the expert present for the full episode. Let \(A_{\mathrm{solo}}\) be the score of the same agent without the expert. Finally, let \(A_{\mathrm{half}}\) be the score of the agent with the expert present from the start until halfway through the episode. Our metric of cultural transmission is
$$\mathrm{CT} := \frac{1}{2}\,\frac{A_{\mathrm{full}}-A_{\mathrm{solo}}}{E}+\frac{1}{2}\,\frac{A_{\mathrm{half}}-A_{\mathrm{solo}}}{E}\,.$$
(1)
A completely independent agent doesn't use any information from the expert, so \(A_{\mathrm{full}} \approx A_{\mathrm{half}} \approx A_{\mathrm{solo}}\) and it has a value of CT near 0. A fully expert-dependent agent scores only while following (\(A_{\mathrm{full}} \approx E\), \(A_{\mathrm{half}} \approx E/2\), \(A_{\mathrm{solo}} \approx 0\)), so it has a value of CT near 0.75. An agent that follows perfectly when the expert is present, but continues to achieve high scores once the expert is absent, has a value of CT near 1. This is the desired behaviour of an agent from a cultural transmission perspective, since the knowledge about how to solve the task was transmitted to, retained by and reproduced by the agent.
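The reference values above follow directly from equation (1), as the small sketch below illustrates (variable names are ours).

```python
def cultural_transmission(A_full, A_half, A_solo, E):
    """CT metric of equation (1): averaged, expert-normalised improvement over solo play."""
    return 0.5 * (A_full - A_solo) / E + 0.5 * (A_half - A_solo) / E

E = 10.0
# Independent agent: the expert's presence makes no difference.
print(cultural_transmission(A_full=6, A_half=6, A_solo=6, E=E))    # 0.0
# Pure follower: scores E with the expert, E/2 when it departs halfway, 0 alone.
print(cultural_transmission(A_full=10, A_half=5, A_solo=0, E=E))   # 0.75
# Follows, then remembers: keeps scoring after the expert departs.
print(cultural_transmission(A_full=10, A_half=10, A_solo=0, E=E))  # 1.0
```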
We first examine how reinforcement learning can generate cultural transmission in a relatively simple setting: a 4-goal game in a 20 × 20 m² empty world. This is far from the most challenging task space for our algorithm, but its simplicity is useful for developing intuition. We find that an agent trained with memory (M), expert dropout (ED) and an attention loss (AL) on tasks sampled in this subspace experiences 4 distinct phases of training. The learning pathway of the agent passes through a cultural transmission phase to reach a policy that is capable of online adaptation, experimenting to discover and exploit the correct cycle within a single episode. By comparison, a vanilla RL baseline (M) is incapable of learning this few-shot adaptation behaviour; in fact, it completely fails to score on the task (see The role of memory, expert demonstrations and attention loss). Cultural transmission, then, functions as a bridge to few-shot adaptation.
The training cultural transmission metric shows four distinct phases over the training run, each corresponding to a distinct social learning behaviour of the agent (see Fig. 2). In phase 1 (red), the agent familiarises itself with the task, learning representations and locomotion and exploring, without much improvement in score. In phase 2 (blue), with sufficient experience and representations shaped by the attention loss, the agent learns its first social learning skill: following the expert bot to solve the task. The training cultural transmission metric increases to 0.75, which suggests pure following.
Training cultural transmission (left) and agent score (right) for training without ADR on the 4-goal game in a small empty world. Colours indicate four distinct phases of agent behaviour, from left to right: (1) (red) start-up and exploration, (2) (blue) learning to follow, (3) (yellow) learning to remember, (4) (purple) becoming independent of the expert.
In phase 3 (yellow), the agent learns the more advanced social learning skill that we call cultural transmission. It remembers the rewarding cycle while the expert bot is present and retrieves that information to continue to solve the task when the bot is absent. This is evident in a training cultural transmission metric approaching 1 and a continued increase in agent score.
Lastly, in phase 4 (purple), the agent is able to solve the task independently of the expert bot. This is indicated by the training cultural transmission metric falling back towards 0 while the score continues to increase. The agent has learned a memory-based policy that can achieve a high score with or without the bot present. More precisely, MEDAL displays an experimentation behaviour in this phase, using hypothesis-testing to infer the correct cycle without reference to the bot and then exploiting that correct cycle more efficiently than the bot does (see Supplementary Movies 1–4). The bot is not quite optimal because, for ease of programming, it is hard-coded to pass through the centre of each correct goal sphere, whereas reward can be accrued by simply touching the sphere. Note, by comparison with Fig. 3a, that this experimentation behaviour does not emerge in the absence of prior social learning abilities.
Score (left), training cultural transmission (CT, centre) and evaluation CT on empty world 5-goal probe tasks (right) over the course of training. a Comparing MEDAL with three ablated agents, each trained without one crucial ingredient: without an expert (M), without memory (EDAL) or without attention loss (MED). b Ablating the effect of expert dropout, comparing no dropout (MEAL) with expert dropout (MEDAL). We report the mean performance of each across 10 initialisation seeds for agent parameters and task procedural generation. We also include the expert's score and MEDAL's best seed for scale and upper-bound comparisons. The shaded area on the graphs is one standard deviation.
In other words, few-shot imitation creates the right prior for few-shot adaptation to emerge, which remarkably leads to improvement over the original demonstrator's policy. Note that social learning by itself is not enough to generate experimentation automatically; further innovation by reinforcement learning, on top of the culturally transmitted prior, is necessary for the agent to exceed the capabilities of its expert partner. Our agent stands on the shoulders of giants, and then riffs to climb yet higher.
We have shown that our MEDAL agent is capable of learning a test-time cultural transmission ability. Now, we show that this set of ingredients is minimal, by demonstrating the absence of cultural transmission when any one of them is removed. In every experiment, MEDAL and its ablated cousins were trained on procedurally generated 5-goal, 20 × 20 worlds with no vertical obstacles and horizontal obstacles of density 0.0001 per m², and evaluated on the empty world 5-goal probes in Probe tasks. We use a variety of dropout schemes, depending on the ablation: M- is trained with full dropout (the expert is never present), MEAL is trained with no dropout (the expert is always present) and all other agents are trained with probabilistic dropout.
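To make the three regimes concrete, the sketch below builds a per-step presence schedule for the expert. The block length and presence probability are hypothetical placeholders, not the paper's published values.

```python
import random

def expert_presence_schedule(scheme, episode_len, p_present=0.5, block=100, rng=None):
    """Per-step boolean schedule for the expert co-player's presence (illustrative).

    'full'          -> expert never present (full dropout, the M- ablation)
    'none'          -> expert always present (no dropout, MEAL)
    'probabilistic' -> presence resampled in fixed-length blocks, so the expert
                       intermittently drops in and out during the episode
    """
    rng = rng or random.Random()
    if scheme == "full":
        return [False] * episode_len
    if scheme == "none":
        return [True] * episode_len
    schedule = []
    while len(schedule) < episode_len:
        schedule.extend([rng.random() < p_present] * block)
    return schedule[:episode_len]
```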
Figure 3a shows that memory (M), the presence of an expert (E) and our attention loss (AL) are important ingredients for the learning of cultural transmission. In the absence of any of these, the agent achieves 0 score and therefore picks up no reward-influencing social cues from the expert (if present), accounting for a mean CT of 0.
First, we consider the M- ablation. By removing expert demonstrations and, consequently, all dependent components, namely the dropout (D) and attention loss (AL), the agent must learn to determine the correct goal ordering by itself in every episode. The MPO agent's exploration strategy is not sufficiently structured to deduce the underlying conceptual structure of the task space, so the agent simply learns a risk-averse behaviour of avoiding goal spheres altogether (see Supplementary Movie 5).
Next, we analyse the EDAL ablation. Without memory, our agent cannot form connections to previously seen cues, be they social, behavioural or environmental. When we replace the LSTM with an equally sized MLP (keeping the same activation functions and biases, but removing any recurrent connections), our agent's ability to register and remember a solution is reduced to zero.
Lastly, we turn to the MED ablation. Having an expert at hand is futile if the agent cannot recognise and pay attention to it. When we turn off the attention loss, the resulting agent treats other agents as noisy background information, attempting to learn as if it were alone. Vanilla reinforcement learning benefits from social cues to bootstrap knowledge about the task structure; the attention loss encourages it to recognise social cues. Note that the attention loss, like all auxiliary losses to shape neural representations, is only required at training time. This means that our agent can be deployed with no privileged sensory information at test time, relying solely on its LIDAR.
To isolate the importance of expert dropout, we compare our MEDAL agent (in which the expert intermittently drops in and out) with the previous state-of-the-art method ME-AL (in which the expert is always present). We use the same procedural generation and evaluation setting as in the previous section. Studying Fig. 3b, we see that adding expert dropout to the previous state of the art leads to better CT. MEDAL achieves higher CT both during training and when evaluated on empty world 5-goal probe tasks. This is because dropout encourages the learning of within-episode memorisation, a capability that was absent from previous agents (ref. 33) and which confers a higher cultural transmission score (see also Agents recall expert demonstrations with high fidelity).
As we have seen, learning cultural transmission in a fixed task distribution acts as a gateway to learning few-shot adaptation. While this is undeniably useful in its own right, it raises the question: how can an agent learn to transmit cultural information in more complex tasks? Automatic domain randomisation (ADR) is a method of expanding the task distribution over training time to keep it in the Goldilocks zone for cultural transmission. It gradually increases the complexity of the training worlds in an open-ended procedurally generated space (parameterised by 7 hyperparameters).
Figure 4a shows an example expansion of the randomisation ranges for all parameters over the duration of an experiment. Training CT is maintained between the boundary update thresholds of 0.75 and 0.85. We see an initial start-up phase of ~100 hours when social learning first emerges in a small, simple set of tasks. Once training CT exceeds 0.75, all randomisation ranges begin to expand. Different parameters expand at different times, indicating when the agent has mastered different skills, such as jumping over horizontal obstacles or navigating bumpy terrain. For intuition about the meaning of the parameter values, see Supplementary Movies 6–9.
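As a rough sketch of one boundary-update rule consistent with keeping training CT between the two thresholds, the code below widens the randomisation ranges when CT is high and narrows them when it is low. The step size, per-parameter bounds and update details are our own assumptions, not the published schedule.

```python
def update_adr_ranges(ranges, train_ct, lo=0.75, hi=0.85, step=0.05):
    """Adjust ADR randomisation ranges to keep training CT within [lo, hi] (illustrative).

    `ranges` maps each world parameter to a (min, max) pair expressed as a
    fraction of its allowed interval [0, 1]. The step size is hypothetical.
    """
    new_ranges = {}
    for name, (r_min, r_max) in ranges.items():
        if train_ct > hi:      # agent is comfortable: widen the task distribution
            r_max = min(1.0, r_max + step)
        elif train_ct < lo:    # agent is struggling: narrow it again
            r_max = max(r_min, r_max - step)
        new_ranges[name] = (r_min, r_max)
    return new_ranges

ranges = {"world_size": (0.0, 0.1), "terrain_bumpiness": (0.0, 0.1)}
print(update_adr_ranges(ranges, train_ct=0.90))  # ranges expand
print(update_adr_ranges(ranges, train_ct=0.70))  # ranges contract
```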
a The expansion of parameter ranges over training for one representative seed of MEDAL-ADR training. b Score (left), training cultural transmission (CT, centre) and evaluation CT on complex world probe tasks (right) over the course of training for the automatic (A) and domain randomisation (DR) ablations of MEDAL-ADR. We report the mean performance of each across 10 initialisation seeds for agent parameters and task procedural generation. We also include the expert's score and the best MEDAL-ADR seed for scale and upper-bound comparisons. The shaded area on the graphs is one standard deviation.
To understand the importance of ADR for generating cultural transmission in complex worlds, we ablate the automatic (A) and domain randomisation (DR) components of MEDAL-ADR (for parameter values, see Supplementary Table D.1). The MEDAL agent is trained on worlds as complicated as the end point of the ADR curriculum. The MEDAL-DR agent is trained on a uniformly sampled distribution between the minimal and maximal complexities of the ADR curriculum (i.e., with no automatic adaptation of the curriculum). In Fig. 4b we observe that ADR is crucial for the generation of cultural transmission in complex worlds, with MEDAL-ADR achieving significantly higher scores and cultural transmission than both MEDAL-DR and MEDAL.
To demonstrate the recall capabilities of our best-performing agent, we quantify its performance across a set of tasks in which the expert drops out. The intuition is that if our agent can recall information well, its score will remain high for many timesteps even after the expert has dropped out; if the agent is simply following the expert or has poor recall, its score will instead drop close to zero immediately. To our knowledge, within-episode recall of a third-person demonstration has not previously been shown to arise from reinforcement learning. This is an important discovery, since the recent history of AI research has demonstrated the increased flexibility and generality of learned behaviours over pre-programmed ones. What's more, third-person recall within an episode amortises imitation onto a timescale of seconds and does not require perspective matching between co-players. As such, we achieve the fast adaptation benefits of previous first-person few-shot imitation works (e.g., refs. 22,43,44), but as a general-purpose emergent property of third-person RL rather than via a special-purpose first-person imitation algorithm.
For each task, we evaluate the score of the agent across ten contiguous 900-step trials, comprising a single episode of experience for the agent. In the first trial, the expert is present alongside the agent, so the agent can infer the optimal path from the expert. From the second trial onwards, the expert is dropped out and the agent must continue to solve the task alone. The world, agent and game are not reset at trial boundaries; we use the term trial to refer to the bucketing of score accumulated by each player within each time window. We consider recall from two different experts: a scripted bot and a human player. For both, we use the worlds from the 4-goal probe tasks (see Automatic domain randomisation).
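The bucketing itself is straightforward; a sketch is given below, with the per-step reward log format being our own assumption.

```python
def bucket_scores(step_rewards, trial_len=900, n_trials=10):
    """Sum the per-step rewards of one continuous episode into contiguous trials.

    The environment is never reset at trial boundaries; a 'trial' is simply
    the score accumulated within each 900-step window of the episode.
    """
    assert len(step_rewards) >= trial_len * n_trials
    return [sum(step_rewards[i * trial_len:(i + 1) * trial_len])
            for i in range(n_trials)]
```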
Figure 5 compares the recall abilities of our agent trained with expert dropout (MEDAL-ADR) and without (ME-AL, similar to the prior state of the art (ref. 33)). Notably, after the expert has dropped out, our MEDAL-ADR agent is able to continue solving the task for the first trial while the ablated ME-AL agent cannot. MEDAL-ADR maintains good performance for several trials after the expert has dropped out, despite the fact that the agent only experienced 1800-step episodes during training. From this, we conclude that our agent exhibits strong within-episode recall.
Score of MEDAL-ADR and ME-AL agents across trials since the expert dropped out. a Experts are scripted bots. b Experts are human trajectories. Supplementary Movie 10 shows MEDAL-ADR's recall from a bot demonstration in a 3600-step (4-trial) episode. Supplementary Movie 31 shows MEDAL-ADR's recall from a human demonstration in an 1800-step (2-trial) episode.
To show causal information transfer from the expert to the agent in real time, we adopt a standard method from the social learning literature. In the two-action task (refs. 28,29,30), subjects are required to solve a task with two alternative solutions. Half of the subjects observe a demonstration of one solution, while the others observe a demonstration of the alternative solution. If subjects disproportionately use the observed solution, this is evidence in support of imitation. This experimental approach is widely used in the field of social learning; we use it here as a behavioural analysis tool for artificial agents for the first time. Using the tasks from our game space analysis, we record the preference of the agent in pairs of episodes where the expert demonstrates the optimal cycles \(\sigma\) and \(\sigma^{-1}\). The preference is computed as the percentage of complete correct cycles executed by the agent that match the direction of the expert's cycle. Evaluating this over 1000 trials, we find that the agent's preference matched the demonstrated option 100% of the time, i.e., in every completed cycle of every one of the 1000 trials.
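The preference statistic can be computed as in the hedged sketch below, where the logged cycle directions are represented as +1 for \(\sigma\) and −1 for \(\sigma^{-1}\) (a representation of our own choosing).

```python
def preference(completed_cycle_directions, demonstrated_direction):
    """Percentage of the agent's complete correct cycles that match the expert's direction.

    `completed_cycle_directions`: one entry (+1 or -1) per correct cycle the
    agent completed. `demonstrated_direction`: +1 for sigma, -1 for sigma^{-1}.
    """
    if not completed_cycle_directions:
        return float("nan")  # no completed cycles: preference undefined
    matches = sum(d == demonstrated_direction for d in completed_cycle_directions)
    return 100.0 * matches / len(completed_cycle_directions)

print(preference([+1, +1, +1], demonstrated_direction=+1))  # 100.0
```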
Trajectory plots further reveal the correlation between expert and agent behaviour (see Fig. 6). By comparing trajectories under different conditions, we can again argue that cultural transmission of information from expert to agent is causal. The agent cannot solve the task when the bot is not placed in the environment (Fig. 6a). When the bot is placed in the environment, the agent is able to successfully reach each goal and then continue executing the demonstrated trajectory after the bot drops out (Fig. 6b). However, if an incorrect trajectory is shown by the expert, the agent still continues to execute the wrong trajectory (Fig. 6c).
Trajectory plots for the MEDAL-ADR agent for a single episode. a The bot is absent for the whole episode. b The bot shows a correct trajectory in the first half of the episode and then drops out. c The bot shows an incorrect trajectory in the first half of the episode and then drops out. The coloured parts of the lines correspond to the colour of the goal sphere the agent and expert have entered, and the × marks correspond to when the agent entered an incorrect goal. Here, position refers to the agent's position along the z-axis. Supplementary Movies 11–13 correspond to each plot respectively.
To demonstrate the generalisation capabilities of our agents, we quantify their performance over a distribution of procedurally generated tasks, varying the underlying physical world and the overlying goal-cycle game. We analyse both in-distribution and out-of-distribution generalisation with respect to the distribution of parameters seen in training (see Supplementary Table C.2). Out-of-distribution values are calculated as ±20% of the min/max in-distribution ADR values where possible, and are indicated by cross-hatched bars in all figures.
In every task, an expert bot is present for the first 900 steps and is dropped out for the remaining 900 steps. We define the normalised score as the agent's score over 1800 steps divided by the expert's score over its 900 steps. An agent that can perfectly follow but cannot remember will score 1; an agent that can perfectly follow and perfectly remember will score 2. Values in between correspond to increasing levels of cultural transmission.
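A small worked sketch of this normalisation, splitting the agent's 1800-step score into its with-expert and post-dropout halves purely for illustration:

```python
def normalised_score(score_with_expert, score_after_dropout, expert_score):
    """Agent's 1800-step score divided by the expert's 900-step score (illustrative)."""
    return (score_with_expert + score_after_dropout) / expert_score

print(normalised_score(8, 0, 8))  # 1.0: perfect following, no recall
print(normalised_score(8, 8, 8))  # 2.0: perfect following and perfect recall
```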
The space of worlds is parameterised by the size and bumpiness of the terrain (terrain complexity) and the density of obstacles (obstacle complexity). To quantify generalisation over each parameter in this space, we generate tasks with worlds sampled uniformly over the chosen parameter while fixing the other parameters at their lowest in-distribution values. Games are then uniformly sampled across the possible numbers of crossings for 5 goals.
From Fig. 7a, we conclude that MEDAL-ADR generalises well across the space of worlds, demonstrating both following and remembering across the majority of the parameter variations considered, including when the world is out-of-distribution.
a A slice through the world space allows us to disentangle MEDAL-ADR's generalisation capability across different world space parameters. b MEDAL-ADR generalises across the game space, demonstrating remembering capability both inside and outside the training distribution. We report the mean performance across 50 initialisation seeds for a and 20 initialisation seeds for b. The error bars on the graphs represent 95% confidence intervals. Supplementary Movies 14–20 demonstrate generalisation over the world space and game space.
The space of games is defined by the number of goals in the world as well as the number of crossings contained in the correct navigation path between them. To quantify generalisation over this space, we generate tasks across the range of feasible N-goal M-crossing games in a flat empty world.
Figure 7b shows our agent's ability to generalise across games, including those outside its training distribution. Notably, MEDAL-ADR can perfectly remember all numbers of crossings for the in-distribution 5-goal game. We also see impressive out-of-distribution generalisation, with our agent exhibiting strong remembering both in 4-goal games and in 0-crossing 6-goal games. Even in complex 6-goal games with many crossings, our agent can still perfectly follow.
Deep learning models are not necessarily readily interpretable. On the other hand, interpretability is often desirable, or even a prerequisite, for deploying AI systems in the real world. Here, we demonstrate that our model is interpretable at the neural level. Training agents to imitate via meta-reinforcement learning embeds the logic for a state-machine capable of approximately Bayes-optimal cultural transmission into the neural network's weights (ref. 45). By inspecting a trained agent's memory, we find clearly interpretable individual neurons. These neurons have specialised roles required for solving a new task online via cultural transmission, a subset of the sufficient statistics which drive the state-machine (ref. 46). One, dubbed the social neuron, encapsulates the notion of agency; the other, called the goal neuron, captures the periodicity of the task.
To identify the social neuron, we use linear probing (refs. 47,48), a well-known and powerful method for understanding the intermediate layers of a deep neural network. We train an attention-based classifier to predict the presence or absence of an expert co-player from the memory state of the agent. The neuron with the maximum attention weight is defined to be a social neuron, and its activation crisply encodes the presence or absence of the expert in the world (Fig. 8a). Figure 8b shows a stark difference in prediction accuracy for expert presence between differently ablated agents. This suggests that the attention loss (AL) is at least partly responsible for incentivising the construction of socially aware representations.
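A simplified NumPy sketch of such an attention-based probe is shown below. It is not the paper's implementation: the optimisation loop, parameterisation and synthetic data are our own illustrative assumptions; only the overall idea (a softmax attention vector over memory neurons, whose argmax nominates the social neuron) follows the description above.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_probe(H, y, lr=0.1, steps=2000, seed=0):
    """Train a toy attention-based linear probe for expert presence (illustrative).

    H: (T, D) agent memory states; y: (T,) binary expert-presence labels.
    Returns the index of the neuron receiving maximum attention weight and
    the full attention distribution over neurons.
    """
    rng = np.random.default_rng(seed)
    T, D = H.shape
    a = 0.01 * rng.normal(size=D)   # attention logits over memory neurons
    w = 0.01 * rng.normal(size=D)   # per-neuron readout weights
    b = 0.0
    for _ in range(steps):
        att = softmax(a)
        p = 1.0 / (1.0 + np.exp(-(H @ (att * w) + b)))  # predicted expert presence
        err = (p - y) / T                               # dLoss/dlogits (binary cross-entropy)
        g_s = H.T @ err                                 # gradient w.r.t. per-neuron scores att * w
        a -= lr * att * (g_s * w - att @ (g_s * w))     # backprop through the softmax
        w -= lr * g_s * att
        b -= lr * err.sum()
    att = softmax(a)
    return int(att.argmax()), att

# Hypothetical usage with synthetic data in which neuron 7 encodes expert presence.
H = np.random.default_rng(1).normal(size=(500, 16))
y = (H[:, 7] > 0).astype(float)
print(attention_probe(H, y)[0])   # expected to identify neuron 7 on this synthetic data
```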
a Activations of MEDAL-ADR's social neuron. b We report the accuracy of three linear probing models trained to predict the expert's presence from the belief states of three agents (MED, MEDAL and MEDAL-ADR). We make two causal interventions (in green and purple) and a control check (in red) on the original test set (yellow). We report the mean performance across 10 different initialisation seeds. The small standard deviation error bars suggest a broad consensus across the 10 runs on which neurons encode social information. c Spikes in the goal neuron's activations correlate with the time the agent remains inside a goal (illustrated by coloured shading). The goal neuron was identified using a variance analysis rather than the linear probing method in b.
To identify the goal neuron, we inspect the variance of memory neuron activations across an episode, finding a neuron whose activation is highly correlated with the agent's entry into a goal sphere. Figure 8c shows that this neuron fires when the agent enters and remains within a goal sphere. Interestingly, it is not the presence or the following of an expert that determines the spikes, nor the observation of a positive reward. Appendix D.3 contains full details of our methods and results.
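A hedged sketch of such a variance-based search is below; the candidate count and the use of correlation with an in-goal mask are our own simplifications of the analysis described above.

```python
import numpy as np

def goal_neuron_by_variance(H, in_goal, n_candidates=10):
    """Find a memory neuron whose activation tracks goal entry (illustrative).

    H: (T, D) memory activations across an episode; in_goal: (T,) boolean mask
    marking the steps at which the agent is inside a goal sphere. Among the
    highest-variance neurons, return the one whose activation correlates most
    strongly with the in-goal mask.
    """
    variances = H.var(axis=0)
    candidates = np.argsort(variances)[-n_candidates:]
    mask = in_goal.astype(float)
    corrs = [abs(np.corrcoef(H[:, d], mask)[0, 1]) for d in candidates]
    return int(candidates[int(np.argmax(corrs))])
```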