During purely curious exploration, the JACO arm discovers how to pick up cubes, moves them around the workspace and even explores whether they can be balanced on their edges.
Curious exploration enables OP3 to walk upright, balance on one foot, sit down and even catch itself safely when leaping backwards, all without a specific target task to optimise for.
Intrinsic motivation [1, 2] can be a powerful concept to endow an agent with a mechanism to continuously explore its environment in the absence of task information. One common way to implement intrinsic motivation is via curiosity learning [3, 4]. With this method, a predictive model of the environment's response to an agent's actions is trained alongside the agent's policy. This model is also called a world model. When an action is taken, the world model makes a prediction about the agent's next observation. This prediction is then compared to the true observation made by the agent. Crucially, the reward given to the agent for taking this action is scaled by the error the world model made when predicting the next observation. This way, the agent is rewarded for taking actions whose outcomes are not yet well predictable. Simultaneously, the world model is updated to better predict the outcome of said action.
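The loop described above can be illustrated with a minimal sketch. It assumes a linear world model trained by plain gradient descent and a reward equal to the scaled squared prediction error. The class `LinearWorldModel`, the function `curiosity_reward`, the learning rate and the toy transition are all hypothetical choices made for illustration and are not taken from SelMo itself.

```python
import numpy as np


class LinearWorldModel:
    """Hypothetical linear world model: predicts the next observation from (obs, action)."""

    def __init__(self, obs_dim, act_dim, lr=0.05, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.01, size=(obs_dim, obs_dim + act_dim))
        self.lr = lr

    def predict(self, obs, action):
        return self.W @ np.concatenate([obs, action])

    def update(self, obs, action, next_obs):
        # One gradient step on 0.5 * ||W x - next_obs||^2, whose gradient is outer(error, x).
        x = np.concatenate([obs, action])
        error = self.W @ x - next_obs
        self.W -= self.lr * np.outer(error, x)


def curiosity_reward(model, obs, action, next_obs, scale=1.0):
    """Reward proportional to the squared error of the world model's prediction (an assumption)."""
    error = model.predict(obs, action) - next_obs
    return scale * float(np.sum(error ** 2))


# Toy transition, repeated many times to mimic an agent repeating one behaviour.
obs = np.array([1.0, 0.5])
action = np.array([0.2])
next_obs = np.array([0.8, 0.7])

model = LinearWorldModel(obs_dim=2, act_dim=1)
rewards = []
for _ in range(50):
    rewards.append(curiosity_reward(model, obs, action, next_obs))
    model.update(obs, action, next_obs)  # the model improves, so the next reward shrinks
```

In this toy setting the reward for the repeated transition shrinks with every update, which is the pressure that pushes a curious agent towards outcomes it cannot yet predict.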
This mechanism has been applied successfully in on-policy settings, e.g. to beat 2D computer games in an unsupervised way or to train a general policy which is easily adaptable to concrete downstream tasks. However, we believe that the true strength of curiosity learning lies in the diverse behaviour which emerges during the curious exploration process: as the curiosity objective changes, so does the resulting behaviour of the agent, thereby discovering many complex policies which could be utilised later on, if they were retained and not overwritten.
In this paper, we make two contributions to study curiosity learning and harness its emergent behaviour. First, we introduce SelMo, an off-policy realisation of a self-motivated, curiosity-based method for exploration. We show that, using SelMo, meaningful and diverse behaviour emerges solely from the optimisation of the curiosity objective in simulated manipulation and locomotion domains. Second, we propose to extend the focus in the application of curiosity learning towards the identification and retention of emerging intermediate behaviours. We support this conjecture with an experiment which reloads self-discovered behaviours as pretrained, auxiliary skills in a hierarchical reinforcement learning setup.
We run SelMo in two simulated continuous control robotic domains: on a 6-DoF JACO arm with a three-fingered gripper and on a 20-DoF humanoid robot, the OP3. These platforms present challenging learning environments for object manipulation and locomotion, respectively. While only optimising for curiosity, we observe that complex human-interpretable behaviour emerges over the course of the training runs. For instance, JACO learns to pick up and move cubes without any supervision, and the OP3 learns to balance on a single foot or sit down safely without falling over.
However, the impressive behaviours observed during curious exploration have one crucial drawback: they are not persistent, as they keep changing with the curiosity reward function. As the agent keeps repeating a certain behaviour, e.g. JACO lifting the red cube, the curiosity rewards accumulated by this policy diminish. Consequently, the agent learns a modified policy which acquires higher curiosity rewards again, e.g. moving the cube outside the workspace or even attending to the other cube. But this new behaviour overwrites the old one. Nevertheless, we believe that retaining the emergent behaviours from curious exploration equips the agent with a valuable skill set to learn new tasks more quickly. To investigate this conjecture, we set up an experiment to probe the utility of the self-discovered skills.
We treat randomly sampled snapshots from different phases of the curious exploration as auxiliary skills in a modular learning framework and measure how quickly a new target skill can be learned by using these auxiliaries. In the case of the JACO arm, we set the target task to be “lift the red cube” and use five randomly sampled self-discovered behaviours as auxiliaries. We compare the learning of this downstream task to an SAC-X baseline which uses a curriculum of reward functions to reward reaching and moving the red cube, which ultimately facilitates learning to lift as well. We find that even this simple setup for skill reuse already speeds up learning of the downstream task, commensurate with a hand-designed reward curriculum. The results suggest that the automatic identification and retention of useful emerging behaviour from curious exploration is a fruitful avenue of future investigation in unsupervised reinforcement learning.
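As a rough illustration of the snapshot selection step, the sketch below draws five policy snapshots uniformly at random from a list of saved checkpoints and places them next to the target skill. The checkpoint names, the saving interval, the number of checkpoints and the helper `sample_auxiliary_skills` are hypothetical. How the hierarchical learner actually schedules between skills is not specified here and is left out.

```python
import random


def sample_auxiliary_skills(snapshots, k=5, seed=None):
    """Draw k distinct policy snapshots uniformly at random (a hypothetical helper)."""
    rng = random.Random(seed)
    return rng.sample(list(snapshots), k)


# Hypothetical checkpoint identifiers, assumed to be saved every 1000 learner steps
# during curious exploration.
snapshots = [f"policy_step_{i * 1000}" for i in range(1, 51)]

auxiliaries = sample_auxiliary_skills(snapshots, k=5, seed=0)

# The target skill is learned from scratch; the auxiliaries are reloaded, pretrained policies.
skills = ["lift_red_cube"] + auxiliaries
```

Uniform sampling is the simplest possible choice and uses no knowledge of which snapshots are useful, which is why the closing remark about automatic identification of useful behaviour points to future work.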