Deep Demonstration Tracing: Learning Generalizable Imitator Policy for Runtime Imitation from a Single Demonstration

1National Key Laboratory for Novel Software Technology, Nanjing University, China & School of Artificial Intelligence, Nanjing University, China 2Polixr Technologies, 3National University of Defense Technology, 4Nanjing University of Science and Technology
*Indicates Equal Contribution

Illustration of runtime one-shot imitation learning (OSIL) policies under unforeseen changes in Meta-World tasks. The policy in (b) is trained with traditional OSIL. The grasped block may drop by chance due to disturbances that did not exist during demonstration collection.

Abstract

One-shot imitation learning (OSIL) aims to learn an imitator agent that can execute multiple tasks given only a single demonstration of each. In real-world scenarios, the environment is dynamic, e.g., unexpected changes can occur after the demonstration is collected. Generalization of the imitator agent is therefore crucial, as the agent will inevitably face situations unseen in the provided demonstration. While traditional OSIL methods excel in relatively stationary settings, their adaptability to such unforeseen changes, which demands a higher level of generalization from the imitator agent, is limited and rarely discussed (see the illustration above). In this work, we present a new algorithm called Deep Demonstration Tracing (DDT). In DDT, we propose a demonstration transformer architecture that encourages agents to adaptively trace suitable states in the demonstration. In addition, DDT integrates OSIL into a meta-reinforcement-learning training paradigm, which regularizes the policy in unexpected situations. We evaluate DDT on a new navigation task suite and on robotics tasks, demonstrating superior performance over existing OSIL methods across all evaluated tasks in dynamic environments with unforeseen changes.

The first column displays the trajectories generated by DDT for three tasks: shelf place, peg insert side, and sweep. The second column indicates the closest matched states identified by the attention mechanism, while the third column shows the expert trajectories. Despite disturbances such as object drops or arm jitter, DDT robustly follows the expert trajectories and completes the tasks successfully.

Methodology: Deep Demonstration Tracing


A. The Demonstration Transformer Architecture for Demonstration Tracing


Illustration of the motivating example and the demonstration transformer architecture for the imitator policy \( \Pi \). (a) Illustration of how humans achieve OSIL under unforeseen changes; (b) the demonstration transformer architecture for the actor. \( [s^e_0, \ldots, s^e_i, \ldots, s^e_t] \) denotes the list of expert states and \( [a^e_0, \ldots, a^e_i, \ldots, a^e_t] \) the list of expert actions. \( s_j \) is the state visited by the actor at timestep \( j \). We use \( \mathbf{q} \), \( \mathbf{k} \), and \( \mathbf{v} \) to denote the query, key, and value vectors of an attention module. \( N\times \) denotes an \( N \)-layer demo-attention module, which takes the output \( v'' \) of the previous layer as the input \( q_j \) of the next layer. Note that the expert-state encoder and the visited-state encoder share the same weights.
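To make the demo-attention computation concrete, below is a minimal PyTorch sketch of how such an actor could be implemented. The dimensions, number of layers, activation choices, and the way expert actions are fused into the demonstration features are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DemoAttentionLayer(nn.Module):
    """One demo-attention layer: the query built from the visited state
    attends over the encoded expert (demonstration) steps."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.q_proj = nn.Linear(hidden_dim, hidden_dim)
        self.k_proj = nn.Linear(hidden_dim, hidden_dim)
        self.v_proj = nn.Linear(hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, q_j, demo_feats):
        # q_j: (B, H) feature of the visited state s_j
        # demo_feats: (B, T, H) features of the expert (state, action) pairs
        q = self.q_proj(q_j).unsqueeze(1)                      # (B, 1, H)
        k = self.k_proj(demo_feats)                            # (B, T, H)
        v = self.v_proj(demo_feats)                            # (B, T, H)
        scores = (q @ k.transpose(1, 2)) / k.shape[-1] ** 0.5  # (B, 1, T)
        attn = scores.softmax(dim=-1)
        v_out = (attn @ v).squeeze(1)                          # (B, H), i.e. v''
        return self.out(v_out), attn.squeeze(1)

class DemoTransformerActor(nn.Module):
    """N stacked demo-attention layers; the output v'' of one layer is
    fed back as the query q_j of the next, as described above."""
    def __init__(self, state_dim, action_dim, hidden_dim=128, n_layers=2):
        super().__init__()
        # Expert-state and visited-state encoders share weights,
        # so a single state encoder serves both.
        self.state_enc = nn.Linear(state_dim, hidden_dim)
        self.action_enc = nn.Linear(action_dim, hidden_dim)
        self.layers = nn.ModuleList(
            [DemoAttentionLayer(hidden_dim) for _ in range(n_layers)])
        self.policy_head = nn.Linear(hidden_dim, action_dim)

    def forward(self, s_j, demo_states, demo_actions):
        # s_j: (B, state_dim); demo_states: (B, T, state_dim); demo_actions: (B, T, action_dim)
        q = F.relu(self.state_enc(s_j))
        demo = F.relu(self.state_enc(demo_states) + self.action_enc(demo_actions))
        attn = None
        for layer in self.layers:
            q, attn = layer(q, demo)
        # Return the action and the last layer's attention scores (useful for
        # the tracing visualizations shown in the experiments).
        return torch.tanh(self.policy_head(q)), attn
```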

B. Achieving OSIL via Context-Based Meta-RL

Conventional OSIL approaches are predominantly trained with behavior cloning losses, which fail to guarantee robust decision making in unseen states. Drawing inspiration from methodologies that integrate IL with RL through a stationary imitation reward, we incorporate OSIL into a context-based meta-RL framework. Within this framework, the trial-and-error learning mechanism of RL allows the imitation policy to systematically explore the state space and acquire decision-making proficiency in unseen states.
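One way to write the resulting objective, as a hedged sketch in generic notation (the paper's exact reward and conditioning may differ): the policy is conditioned on a single demonstration \( \tau^e \) drawn from a training task \( \mathcal{T} \) and maximizes a discounted stationary imitation reward,
\[
\max_{\pi} \;\; \mathbb{E}_{\mathcal{T} \sim p(\mathcal{T}),\, \tau^e \sim \mathcal{T}} \;
\mathbb{E}_{a_t \sim \pi(\cdot \mid s_t, \tau^e)}
\Big[ \sum_{t} \gamma^t \, r^{\mathrm{imit}}(s_t, a_t; \tau^e) \Big],
\]
where \( r^{\mathrm{imit}} \) measures how well the visited state-action pair follows the demonstration.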


Illustration of the training and deployment workflow for a runtime one-shot imitator policy via context-based meta-RL.


Pseudocode of the training process.
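As a rough illustration of the workflow and pseudocode above, the following sketch shows one plausible shape of the training loop. The `task_sampler`, the imitation-reward definition, `actor.act`, `task.step`, and `sac_update` are hypothetical placeholders under stated assumptions, not the paper's actual implementation.

```python
import numpy as np

def imitation_reward(state, demo_states):
    """A simple stationary imitation reward: negative distance to the closest
    demonstration state (an illustrative choice, not necessarily the paper's)."""
    return -float(np.min(np.linalg.norm(demo_states - state, axis=-1)))

def train(actor, critic, task_sampler, sac_update, n_iters=10_000, horizon=200):
    """Sketch of a context-based meta-RL training loop for OSIL."""
    buffer = []
    for _ in range(n_iters):
        # Sample a training task together with one expert demonstration (the context).
        task, (demo_states, demo_actions) = task_sampler()
        state = task.reset()
        for _ in range(horizon):
            # The actor conditions on the demonstration via demo-attention.
            action = actor.act(state, demo_states, demo_actions)
            next_state, done = task.step(action)
            reward = imitation_reward(next_state, demo_states)
            buffer.append((state, action, reward, next_state, done,
                           demo_states, demo_actions))
            state = next_state
            if done:
                break
        # Off-policy RL update (e.g., SAC-style) on demonstration-conditioned
        # transitions; the trial-and-error exploration regularizes behavior in
        # states the demonstration never visits.
        sac_update(actor, critic, buffer)
```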

Experiments


In experiments, we focus on answering the following research questions.

  • RQ1: How well does DDT perform one-shot imitation in unseen situations, including unseen demonstrations, unseen environments, and unforeseen changes after demonstration collection?
  • RQ2: Does the demonstration transformer really imitate by tracing the demonstration?
  • RQ3: Does DDT improve when scaling up the number of parameters and the amount of demonstration data, as suggested by the "scaling law" in large language models?
  • RQ4: How does DDT perform when applied to other challenging tasks?

Environments

We created a challenging benchmark, named Valet Parking Assist in Maze (VPAM), to assess OSIL performance under unforeseen changes. This navigation benchmark is inspired by a popular and practical real-world application in autonomous driving called Valet Parking Assist (VPA). VPAM focuses on navigating diverse, complex mazes without global map information. We also apply DDT to various standard and complex tasks, including Meta-World, complex robot manipulation in clutter, and Reacher and Pusher in MuJoCo, to show the robustness of our method against other challenges of OSIL.


Illustration of the major experiments in this paper. (A) The VPAM benchmark, a new benchmark for OSIL with unforeseen changes; the imitation points are provided by our DDT method. (B) Tasks in Meta-World. (C) Various complex robot-manipulation tasks in clutter environments. (a): Grasp the blocked target object (cyan). (b): Stack the objects. (c): Collect the objects scattered over the desk into the specified area (yellow).

RQ1: One-Shot Imitation Ability in Unseen Situations


Performance of the imitation policies deployed across different groups of settings in VPAM. The black bars denote the standard error over each task group with three seeds. The results show that DDT achieves significantly better performance than the baselines, from training performance ("Train" group) to deployment performance, both with unforeseen obstacles ("Unforeseen Obstacle" group) and without obstacles ("Non-Obstacle" group).


Training performance of the imitation policies across different settings. The colored areas denote the standard error over three seeds. DDT displays stable and better performance even on the training tasks. We attribute this to the demonstration transformer architecture, which confers an additional boost in training efficiency by implicitly introducing prior knowledge of how OSIL is achieved, facilitating easier adaptation across tasks and settings of varying complexity.


RQ2: Demonstration-Attention Mechanism for Demonstration Tracing in DDT

We visualize the agent's trajectory in a randomly generated map with unforeseen obstacles and depict the attention scores during the decision-making process. The attention scores are the product of the query of the current state and the keys associated with the demonstration states. Higher attention values are predominantly concentrated on the diagonal, demonstrating our algorithm's ability to identify which demonstration state to follow.
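As a sketch of how such an attention-score map could be assembled for visualization, assuming access to the per-step scores returned by a demo-attention layer (the tensor shapes and plotting choices are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_attention_map(per_step_scores):
    """per_step_scores: list of length T_agent, each a (T_demo,) array of
    softmax-normalized attention scores (current-state query dotted with the
    demonstration keys). Rows = agent timestep, columns = demonstration index."""
    score_map = np.stack(per_step_scores, axis=0)   # (T_agent, T_demo)
    plt.imshow(score_map, aspect="auto", cmap="Blues")
    plt.xlabel("demonstration state index")
    plt.ylabel("agent trajectory index")
    plt.colorbar(label="attention score")
    plt.show()
```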


Visualizations of DDT in VPAM. (a) A trajectory generated by DDT; (b) the attention score map corresponding to (a). The horizontal and vertical axes represent the trajectory indices. A deeper color within a row indicates a higher attention score.

Additionally, a corresponding video shows rollouts generated by our DDT method.

RQ3: A Similar Scaling Law when Scaling Up DDT in the OSIL Setting


Asymptotic performance of DDT under varying demonstration quantities and model parameters, with each unit on the x-axis representing 60 demonstrations or 0.6 million parameters. The x-axis is on a logarithmic scale. Square markers depict the performance of the default DDT parameters.

RQ4: Applying DDT to Other Challenging Tasks

Meta-World: Performance under Disturbance

Results on Meta-World under disturbance. The video at the start of this project page is rendered from the results of this experiment.


Meta-World: Demonstrations with Unseen Heterogeneous Behaviors at Deployment


We test and record the generalization performance on three types of unseen heterogeneous demonstrations, covering all goal positions, without fine-tuning.

Complex Manipulation Tasks


Results on Complex Manipulation.

BibTeX

@inproceedings{chen2024deep,
  title={Deep Demonstration Tracing: Learning Generalizable Imitator Policy for Runtime Imitation from a Single Demonstration},
  author={Xiong-Hui Chen and Junyin Ye and Hang Zhao and Yi-Chen Li and Xu-Hui Liu and Haoran Shi and Yu-Yan Xu and Zhihao Ye and Si-Hang Yang and Yang Yu and Kai Xu and Zongzhang Zhang and Anqi Huang},
  booktitle={Forty-first International Conference on Machine Learning},
  year={2024},
  url={https://openreview.net/forum?id=DJdVzxemdA}
}