🤖 Embodied Artificial Intelligence Seminar - Readings and Resources

Welcome to the CS6604 Embodied AI Seminar!

Below is a list of the topics we’ll cover during the semester, along with recommended readings and links to project pages or source code repositories where applicable.

For each paper, click on 📚 for the PDF version and on 🌍 for additional resources.

Topic 1: Benchmarks: Simulators, Environments, Datasets
  • ARNOLD: A Benchmark for Language-Grounded Task Learning With Continuous States in Realistic 3D Scenes 📚 🌍
  • iGibson 1.0: A Simulation Environment for Interactive Tasks in Large Realistic Scenes 📚 🌍
  • Matterport3D: Interpreting Visually-Grounded Navigation Instructions in Real Environments 📚 🌍
  • CVDN: Vision-and-Dialog Navigation 📚
  • SoundSpaces: Audio-Visual Navigation in 3D Environments 📚 🌍
  • AI2-THOR: An Interactive 3D Environment for Visual AI 📚 🌍
  • Rearrangement: A Challenge for Embodied AI 📚
  • Visual Room Rearrangement 📚 🌍
  • ProcTHOR: Large-Scale Embodied AI Using Procedural Generation 📚 🌍
  • ManiSkill2: A Unified Benchmark for Generalizable Manipulation Skills 📚 🌍
  • Object Goal Navigation using Goal-Oriented Semantic Exploration 📚 🌍
  • Embodied Question Answering in Photorealistic Environments with Point Cloud Perception 📚 🌍
  • ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks 📚 🌍
  • DialFRED: Dialogue-Enabled Agents for Embodied Instruction Following 📚 🌍
  • Alexa Arena: A User-Centric Interactive Platform for Embodied AI 📚 🌍
  • VirtualHome: Simulating Household Activities via Programs 📚 🌍
  • BEHAVIOR-1K: A Benchmark for Embodied AI with 1,000 Everyday Activities and Realistic Simulation 📚 🌍
  • MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge 📚 🌍
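
As a quick orientation to how the simulators in Topic 1 are typically used, below is a minimal interaction sketch for AI2-THOR (listed above). It assumes the ai2thor Python package is installed and is meant only as a rough illustration of the load-scene / step / read-metadata loop, not a definitive reference; consult the project page (🌍) for the current API.

    # Minimal AI2-THOR sketch (assumption: pip install ai2thor; first run downloads the simulator build).
    from ai2thor.controller import Controller

    controller = Controller(scene="FloorPlan1")   # load a kitchen scene
    event = controller.step(action="MoveAhead")   # advance the agent one step
    print(event.metadata["agent"]["position"])    # agent pose after the action
    print(len(event.metadata["objects"]), "objects tracked in scene metadata")
    controller.stop()                             # shut down the simulator process

Most of the other platforms above (e.g., iGibson, ManiSkill2, SoundSpaces) expose a broadly similar reset/step loop, differing mainly in their observation and action spaces.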
Topic 2: Conceptual Framing, World Models, Behavioral and Performance Metrics
  • World Models 📚 🌍
  • Machine Theory of Mind 📚
  • Collaborative World Models: An Online-Offline Transfer RL Approach 📚
  • Transformers are Sample-Efficient World Models 📚
  • Learning Temporally Abstract World Models without Online Experimentation 📚
  • Reward-Free Curricula for Training Robust World Models 📚
  • Recurrent World Models Facilitate Policy Evolution 📚 🌍
  • Discovering and Achieving Goals via World Models 📚 🌍
  • Planning to Explore via Self-Supervised World Models 📚 🌍
  • Learning to Model the World with Language 📚 🌍
  • Do Embodied Agents Dream of Pixelated Sheep: Embodied Decision Making using Language Guided World Modelling 📚
  • Dream to Control: Learning Behaviors by Latent Imagination 📚
  • DayDreamer: World Models for Physical Robot Learning 📚 🌍
  • Mastering Diverse Domains through World Models 📚 🌍
  • Mastering Atari with Discrete World Models 📚 🌍
  • Masked World Models for Visual Control 📚 🌍
  • Structured World Models from Human Videos 📚 🌍
  • Building Machines That Learn and Think Like People 📚
  • Action and Perception as Divergence Minimization 📚
  • Intrinsically Motivated Reinforcement Learning 📚
  • Decision Transformer: Reinforcement Learning via Sequence Modeling 📚
  • Curiosity-Driven Exploration of Learned Disentangled Goal Spaces 📚
  • Encouraging and Evaluating Embodied Exploration 📚
  • Language as a Cognitive Tool to Imagine Goals in Curiosity-Driven Exploration 📚
  • Learning to Play with Intrinsically-Motivated, Self-Aware Agents 📚
  • On Evaluation of Embodied Navigation Agents 📚
  • ObjectNav Revisited: On Evaluation of Embodied Agents Navigating to Objects 📚
  • On the Evaluation of Vision-and-Language Navigation Instructions 📚
  • A New Path: Scaling Vision-and-Language Navigation With Synthetic Instructions and Imitation Learning 📚
  • Stay on the Path: Instruction Fidelity in Vision-and-Language Navigation 📚
  • Iterative Vision-and-Language Navigation 📚
  • GridToPix: Training Embodied Agents with Minimal Supervision 📚 🌍
  • On the Limits of Evaluating Embodied Agent Model Generalization Using Validation Sets 📚
Topic 3: Learning about visual sensory information through interaction
  • Scene Graph Contrastive Learning for Embodied Navigation 📚
  • Learning Navigational Visual Representations with Semantic Map Supervision 📚
  • Topological Semantic Graph Memory for Image-Goal Navigation 📚
  • Object-Goal Visual Navigation via Effective Exploration of Relations among Historical Navigation States 📚
  • One-4-All: Neural Potential Fields for Embodied Navigation 📚
  • 🏅 Emergence of Maps in the Memories of Blind Navigation Agents (ICLR'23 Outstanding Paper) 📚
  • Scene Memory Transformer for Embodied Agents in Long-Horizon Tasks 📚
  • Graph Attention Memory for Visual Navigation 📚
  • Instance-Specific Image Goal Navigation: Training Embodied Agents to Find Object Instances 📚
  • Navigating to Objects Specified by Images 📚 🌍
  • TIDEE: Tidying Up Novel Rooms using Visuo-Semantic Commonsense Priors 📚 🌍
  • Egocentric Planning for Scalable Embodied Task Achievement 📚
  • ALP: Action-Aware Embodied Learning for Perception 📚
  • Simple but Effective: CLIP Embeddings for Embodied AI 📚
  • Continuous Scene Representations for Embodied AI 📚
  • Graph-based Environment Representation for Vision-and-Language Navigation in Continuous Environments 📚
  • Learning Affordance Landscapes for Interaction Exploration in 3D Environments 📚
  • PASTA: Pretrained Action-State Transformer Agents 📚
Topic 4: Learning about language and language-guided interaction
  • MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception 📚
  • Plan4MC: Skill Reinforcement Learning and Planning for Open-World Minecraft Tasks 📚
  • VOYAGER: An Open-Ended Embodied Agent with Large Language Models 📚
  • Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents 📚
  • Embodied Task Planning with Large Language Models 📚
  • Pre-training Contextualized World Models with In-the-wild Videos for Reinforcement Learning 📚
  • Chasing Ghosts: Instruction Following as Bayesian State Tracking 📚
  • Context-Aware Planning and Environment-Aware Memory for Instruction Following Embodied Agents 📚
  • SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation 📚
  • Building Cooperative Embodied Agents Modularly with Large Language Models 📚
  • Asking Before Action: Gather Information in Embodied Decision Making with Language Models 📚
  • Language Models Meet World Models: Embodied Experiences Enhance Language Models 📚
  • DANLI: Deliberative Agent for Following Natural Language Instructions 📚
  • 3D-LLM: Injecting the 3D World into Large Language Models 📚
  • EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought 📚
  • PIGLeT: Language Grounding Through Neuro-Symbolic Interaction in a 3D World 📚
  • Embodied Executable Policy Learning with Language-based Scene Summarization 📚
  • JARVIS-1: Open-world Multi-task Agents with Memory-Augmented Multimodal Language Models 📚
  • See and Think: Embodied Agent in Virtual Environment 📚
  • Open-Ended Instructable Embodied Agents with Memory-Augmented Large Language Models 📚
Topic 5: Dive into Robotics, Sim2Sim/Sim2Real Transfer, Embodied Communication, Multi-Agent Embodied Collaboration
  • Eureka: Human-Level Reward Design via Coding Large Language Models 📚
  • PaLM-E: An Embodied Multimodal Language Model 📚
  • Learning Interactive Real-World Simulators 📚 🌍
  • Open X-Embodiment: Robotic Learning Datasets and RT-X Models 📚 🌍
  • RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control 📚 🌍
  • Scaling Robot Learning with Semantically Imagined Experience 📚 🌍
  • AR2-D2: Training a Robot Without a Robot 📚 🌍
  • IndoorSim-to-OutdoorReal: Learning to Navigate Outdoors Without Any Outdoor Experience 📚 🌍
  • VIMA: General Robot Manipulation with Multimodal Prompts 📚 🌍
  • AdaptSim: Task-Driven Simulation Adaptation for Sim-to-Real Transfer 📚 🌍
  • RoboCat: A self-improving robotic agent 📚 🌍
  • Policy Stitching: Learning Transferable Robot Policies 📚 🌍
  • EC2: Emergent Communication for Embodied Control 📚
  • Interpretation of Emergent Communication in Heterogeneous Collaborative Embodied Agents 📚
  • Heterogeneous Embodied Multi-Agent Collaboration 📚 🌍
  • Sim-2-Sim Transfer for Vision-and-Language Navigation in Continuous Environments 📚 🌍
Topic 6: Diffusion Policies
  • Diffusion Policy: Visuomotor Policy Learning via Action Diffusion 📚
  • NoMaD: Goal Masked Diffusion Policies for Navigation and Exploration 📚
  • PlayFusion: Skill Acquisition via Diffusion from Language-Annotated Play 📚
  • Learning Universal Policies via Text-Guided Video Generation 📚
  • Compositional Foundation Models for Hierarchical Planning 📚
  • XSkill: Cross Embodiment Skill Discovery 📚

Additional Resources