CMU Vision-Language-Autonomy Challenge


Simulation Challenge - Happening Now!


For the first year (2024), we are hosting the competition in simulation. We will move to a real robot in 2025.

The CMU Vision-Language-Autonomy Challenge leverages computer vision and natural language understanding for navigation autonomy. The challenge aims to push the limits of embodied AI in real environments and on real robots, providing a robot platform and a working autonomy system to bring higher-level reasoning and learning models a step closer to real-world deployment.

The core task of the challenge is to develop a model that takes in natural language queries or commands about a scene and generates the appropriate navigation-based response by reasoning about semantic and spatial relationships. The environment is initially unknown, so the system must navigate to appropriate viewpoints to discover and validate spatial relations and attributes. A real-robot system equipped with a 3D lidar and a 360° camera is provided, with base autonomy onboard that estimates the sensor pose, analyzes the terrain, avoids collisions, and navigates to waypoints. Teams will set up software on the robot's onboard computer to interface with the system and navigate the robot. For 2024, the challenge takes place in a custom simulation environment and will move to the real-robot system the following year.

The top-performing teams will be invited to present their results at our IROS 2024 workshop!

Challenge Overview


For detailed challenge information, technical documentation, and instructions, please see our GitHub repo.


Challenge Scenes

The challenge uses a set of 18 Unity scenes spanning a variety of indoor environments, including residential and corporate settings. 15 scenes are provided for model development, while 3 are held out for testing. The majority of the scenes are single rooms; a few are multi-room buildings. Some example scenes are shown below.

Challenge Questions

Three types of natural language questions/statements requiring semantic spatial understanding are used for each scene, and each type requires a different response from the AI module. Numerical questions require an answer printed to the terminal, while Object Reference and Instruction-Following questions require navigating the robot. Examples of each type of question are shown below:

Numerical

Object Reference

Instruction-Following


For robot navigation, only waypoints are needed; the base autonomy system takes care of low-level path planning and obstacle avoidance.
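As a rough illustration, here is a minimal Python/rospy sketch of sending one waypoint. The topic name /way_point, the use of geometry_msgs/PointStamped, and the map frame are assumptions modeled on the CMU autonomy stack conventions; the exact interface is defined in the challenge GitHub repo.

```python
#!/usr/bin/env python3
# Minimal sketch: publish a single navigation waypoint to the base autonomy system.
# ASSUMPTIONS: topic "/way_point", message type PointStamped, world frame "map".
import rospy
from geometry_msgs.msg import PointStamped

def send_waypoint(x, y, z=0.0):
    pub = rospy.Publisher('/way_point', PointStamped, queue_size=1, latch=True)
    rospy.sleep(0.5)                       # give the connection time to establish
    msg = PointStamped()
    msg.header.stamp = rospy.Time.now()
    msg.header.frame_id = 'map'            # assumed world frame of the autonomy system
    msg.point.x, msg.point.y, msg.point.z = x, y, z
    pub.publish(msg)

if __name__ == '__main__':
    rospy.init_node('waypoint_sender')
    send_waypoint(2.0, 1.5)                # placeholder coordinates
    rospy.sleep(1.0)                       # keep the node alive until the message is out
```

If a waypoint falls outside the traversable area, the base system adjusts it as described in the Robot System section below.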

Robot System

Our system (shown above) runs on Ubuntu 20.04 with ROS Noetic, both in simulation and on the real robot, and provides onboard data to the AI module. As input, the system takes waypoints from the AI module to navigate the robot: waypoints located in the traversable area of the scene are accepted directly, while waypoints outside the traversable area are adjusted and moved into it. The system also takes visualization markers from the team's software to highlight selected objects.
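For highlighting a selected object (for example, when answering an Object Reference question), a visualization marker can be published in a similar way. The sketch below is only illustrative: the topic name /selected_object_marker and the box dimensions are placeholders, not the official interface.

```python
#!/usr/bin/env python3
# Sketch: highlight a selected object with a visualization marker.
# ASSUMPTIONS: topic "/selected_object_marker", world frame "map", placeholder geometry.
import rospy
from visualization_msgs.msg import Marker

rospy.init_node('object_highlighter')
pub = rospy.Publisher('/selected_object_marker', Marker, queue_size=1, latch=True)
rospy.sleep(0.5)

marker = Marker()
marker.header.frame_id = 'map'
marker.header.stamp = rospy.Time.now()
marker.ns, marker.id = 'selected_object', 0
marker.type = Marker.CUBE                  # axis-aligned box around the referred object
marker.action = Marker.ADD
marker.pose.position.x, marker.pose.position.y, marker.pose.position.z = 1.0, 2.0, 0.5
marker.pose.orientation.w = 1.0
marker.scale.x, marker.scale.y, marker.scale.z = 0.6, 0.4, 0.8   # placeholder extents
marker.color.r, marker.color.a = 1.0, 0.8                        # semi-transparent red

pub.publish(marker)
rospy.sleep(1.0)
```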

The coordinate frames used by the system are also shown above. The camera position (camera frame) with respect to the lidar (sensor frame) is measured from a CAD model. The orientation is calibrated, and the images are remapped so that the camera frame and the lidar frame stay aligned.
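Because the extrinsic between the two frames is available to the software, points can be moved between frames with tf. The snippet below transforms a lidar point into the camera frame; the frame ids 'sensor' and 'camera' are assumptions, so substitute whatever frame ids the system actually broadcasts.

```python
#!/usr/bin/env python3
# Sketch: transform a point from the lidar (sensor) frame into the camera frame.
# ASSUMPTIONS: frame ids "sensor" and "camera"; the system broadcasts the extrinsic on tf.
import rospy
import tf2_ros
import tf2_geometry_msgs                   # registers geometry_msgs types with tf2
from geometry_msgs.msg import PointStamped

rospy.init_node('frame_example')
buf = tf2_ros.Buffer()
listener = tf2_ros.TransformListener(buf)
rospy.sleep(1.0)                           # let the tf buffer fill

pt = PointStamped()
pt.header.frame_id = 'sensor'              # point expressed in the lidar frame
pt.point.x, pt.point.y, pt.point.z = 1.0, 0.0, 0.2

pt_cam = buf.transform(pt, 'camera', rospy.Duration(1.0))
rospy.loginfo('Point in camera frame: %.2f %.2f %.2f',
              pt_cam.point.x, pt_cam.point.y, pt_cam.point.z)
```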

Simulation System

Two simulation setups are provided: one based on Unity and the other based on AI Habitat with Matterport3D environment models. For each scene, we provide object and region information together with a point cloud of the scene (included in the Object-Referential Language Dataset).

Teams can launch the system, receive synthesized data, and send waypoints to navigate the robot in simulation. 
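The sketch below shows one way to receive the synthesized data from Python. The topic names and message types (/camera/image, /registered_scan, /state_estimation) are assumptions modeled on the CMU autonomy stack; the definitive list is in the challenge GitHub repo.

```python
#!/usr/bin/env python3
# Sketch: subscribe to the synthesized sensor data streamed by the simulation system.
# ASSUMPTIONS: topic names "/camera/image", "/registered_scan", "/state_estimation".
import rospy
from sensor_msgs.msg import Image, PointCloud2
from nav_msgs.msg import Odometry

def on_image(msg):
    rospy.loginfo_throttle(5, '360 image: %dx%d', msg.width, msg.height)

def on_cloud(msg):
    rospy.loginfo_throttle(5, 'registered scan with %d points', msg.width * msg.height)

def on_odom(msg):
    p = msg.pose.pose.position
    rospy.loginfo_throttle(5, 'sensor pose: (%.2f, %.2f, %.2f)', p.x, p.y, p.z)

rospy.init_node('vla_interface')
rospy.Subscriber('/camera/image', Image, on_image, queue_size=1)
rospy.Subscriber('/registered_scan', PointCloud2, on_cloud, queue_size=1)
rospy.Subscriber('/state_estimation', Odometry, on_odom, queue_size=1)
rospy.spin()
```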

Object-Referential Language Dataset

To help with the subtask of referential object grounding, we provide the VLA-3D dataset, which contains 7.6K indoor 3D scenes with over 11K regions and 9M+ statements drawn from a diverse set of data sources, including the 15 Unity training scenes. For each 3D scene, the dataset includes a processed point cloud, object and region labels, a scene graph of semantic relations, and generated language statements. For access to the data and more details on the format, please see our VLA-3D repository.
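As a starting point, a scene could be loaded along the lines of the sketch below. The file names and layout here are placeholders, not the dataset's actual structure; the real file formats are documented in the VLA-3D repository.

```python
#!/usr/bin/env python3
# Sketch: load one VLA-3D scene for training.
# ASSUMPTIONS: the file names "scene_pc.ply" and "referential_statements.json" are
# placeholders; consult the VLA-3D repository for the actual layout and formats.
import json
import numpy as np
import open3d as o3d                       # pip install open3d

scene_dir = 'VLA-3D/example_scene'         # placeholder path

# Colored point cloud of the scene.
pcd = o3d.io.read_point_cloud(f'{scene_dir}/scene_pc.ply')
points = np.asarray(pcd.points)            # (N, 3) xyz
colors = np.asarray(pcd.colors)            # (N, 3) rgb in [0, 1]

# Generated referential statements paired with their target objects.
with open(f'{scene_dir}/referential_statements.json') as f:
    statements = json.load(f)

print(f'{points.shape[0]} points, {len(statements)} language statements')
```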

Real-Robot Challenge - Starting in 2025

Starting in 2025, challenge evaluation will be done on the real-robot system instead of in simulation. The system provides onboard data and takes waypoints in the same way as the simulator. Only the software developed in the AI module may send waypoints to explore the scene; manually sending waypoints or teleoperating the robot is not allowed. Teams will have the opportunity to remotely log in to the robot's onboard computer and set up their software in a Docker container that interfaces with the autonomy modules. More information about this part of the challenge will be shared in 2025.

Important Dates

Below is the timeline for the 2024 simulation challenge (all deadlines are midnight, Anywhere on Earth):

Organizers

Haochen Zhang
CMU Robotics Institute

Zhixuan Liu
CMU Robotics Institute

Pujith Kachana
CMU Robotics Institute

Nader Zantout
CMU Robotics Institute

Zongyuan Wu
CMU MechE Department

Shibo Zhao
CMU Robotics Institute

Jean Oh
CMU NREC & Robotics Institute

Sebastian Scherer
CMU Robotics Institute

Yorie Nakahira
CMU ECE Department

Ji Zhang
CMU NREC & Robotics Institute

Wenshan Wang
CMU Robotics Institute

Acknowledgments

The CMU Vision-Language-Autonomy Challenge is sponsored by the National Robotics Engineering Center (NREC) at Carnegie Mellon University.

Links to Related Resources