Document Type

Honors Project On-Campus Access Only


Large transformers have successfully consolidated diverse tasks by means of prompting, which provides a flexible interface to communicate natural language tasks to a general-purpose model. However, textual instruction alone may not be sufficient or convenient to specify robot manipulation tasks. In many applications, alternative formats like visual goals, few-shot image concepts, and video skill demonstrations have much more expressive power, yet in prior work they often require different policy architectures, objective functions, data preprocessing, and training procedures.

We propose a multimodal prompting framework that casts any robot manipulation command to the same sequence format of interleaved text tokens and visual frames. Thanks to this uniform IO interface, we design a novel transformer-based robot learning architecture, VIMA, that ingests the multimodal task prompts and outputs control commands for a robot arm autoregressively. To systematically evaluate VIMA, we introduce a novel benchmark simulation suite with 17 procedurally generated tabletop tasks with multimodal prompt templates, 4 levels of generalization protocol, and 600K+ training trajectories. VIMA achieves strong scalability in both model capacity and data size. It outperforms prior SOTA methods in the hardest zero-shot generalization setting by up to 2.9× task success rate given the same training data. With 10× less training data, VIMA still performs 2.7× better than the top competing approach.
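To make the idea of a single interleaved sequence concrete, the following is a minimal illustrative sketch and not VIMA's actual implementation. It assumes a hypothetical template syntax in which image slots are written as {name}, and the names TextToken, ImageFrame, and build_prompt are invented here for illustration. Images are represented by opaque stand-in objects rather than real pixel arrays.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass(frozen=True)
class TextToken:
    """One whitespace-delimited word of the instruction (hypothetical type)."""
    text: str


@dataclass(frozen=True)
class ImageFrame:
    """One visual frame filling a named slot in the template (hypothetical type)."""
    name: str
    pixels: object  # stand-in for an image array


PromptElement = Union[TextToken, ImageFrame]


def build_prompt(template: str, images: Dict[str, object]) -> List[PromptElement]:
    """Turn a template with {slot} placeholders into one interleaved sequence."""
    sequence: List[PromptElement] = []
    # With a capturing group, re.split keeps the slot names at odd indices
    # and the surrounding text at even indices.
    parts = re.split(r"\{(\w+)\}", template)
    for index, part in enumerate(parts):
        if index % 2 == 1:
            sequence.append(ImageFrame(part, images[part]))
        else:
            sequence.extend(TextToken(word) for word in part.split())
    return sequence


prompt = build_prompt(
    "Put the {obj} into the {container} .",
    {"obj": "IMG_A", "container": "IMG_B"},
)
kinds = [type(element).__name__ for element in prompt]
```

In this sketch a visual goal, a few-shot image concept, or a video demonstration would differ only in how many ImageFrame elements appear and where, so a single sequence model could in principle consume all of them through the same interface.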



© Copyright is owned by author of this document