CMU Vision-Language-Autonomy Challenge


For the first year (2024), we are hosting the competition in simulation. We will deploy a real robot starting in the second year (2025). Please check out the detailed information about the 2024 competition. The competition is happening now!

The CMU Vision-Language-Autonomy Challenge leverages computer vision and natural language understanding for navigation autonomy. The challenge aims to push the limits of embodied AI in real environments and on real robots, providing a robot platform and a working autonomy system that bring everybody's work a step closer to real-world deployment. The challenge provides a real-robot system equipped with a 3D lidar and a 360° camera. The system has base autonomy onboard that can estimate the sensor pose, analyze the terrain, avoid collisions, and navigate to waypoints. Teams set up software on the robot's onboard computer to interface with the system and navigate the robot. In the challenge, we give each team a set of questions, and the team's software processes the questions together with the onboard data provided by the system. The defining characteristic of the challenge is that the team's software must explicitly understand the spatial relations and attributes of the scene as described in the questions. Because the environment is unknown, the robot has to navigate to appropriate viewpoints to discover and validate these spatial relations and attributes. The team's software sends waypoints to the system to navigate the robot around.

The questions are separated into three categories: 'numerical', 'object reference', and 'instruction following'. For numerical questions, we expect the team's software to respond with a number. For object reference questions, we expect the team's software to select a specific object in the environment. For instruction following questions, we expect the team's software to guide the vehicle navigation. Some example questions are below.

System Setup

Our system runs on Ubuntu 20.04 and uses ROS Noetic, both in simulation and on the real robot. The system provides onboard data to the team's software as follows.

The system takes waypoints from the team's software to navigate the robot. Waypoints located in the traversable area (listed above) are accepted directly; waypoints outside the traversable area are adjusted and moved into it. The system also takes visualization markers from the team's software to highlight selected objects, as shown in the sketch below.
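As an illustration, the minimal rospy sketch below publishes a cube marker to highlight a selected object. The marker topic name /selected_object_marker and the 'map' frame used here are placeholders for illustration, not the system's actual names; refer to the system documentation for the real interface.

#!/usr/bin/env python3
# Publish a cube marker to highlight a selected object.
# '/selected_object_marker' and the 'map' frame are placeholder names for illustration.
import rospy
from visualization_msgs.msg import Marker

rospy.init_node('object_marker_example')
pub = rospy.Publisher('/selected_object_marker', Marker, queue_size=5)
rospy.sleep(1.0)  # allow the publisher to connect

marker = Marker()
marker.header.frame_id = 'map'
marker.header.stamp = rospy.Time.now()
marker.ns, marker.id = 'selected_object', 0
marker.type, marker.action = Marker.CUBE, Marker.ADD
marker.pose.position.x, marker.pose.position.y, marker.pose.position.z = 2.0, 0.5, 0.5
marker.pose.orientation.w = 1.0
marker.scale.x, marker.scale.y, marker.scale.z = 0.6, 0.4, 0.8  # bounding-box extents in meters
marker.color.r, marker.color.a = 1.0, 0.5  # semi-transparent red
pub.publish(marker)
rospy.sleep(0.5)  # give the message time to go out before the node exits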

The coordinate frames used by the system are shown below. The camera position (camera frame) w.r.t. the lidar (sensor frame) is measured from a CAD model. The orientation is calibrated, and the images are remapped so that the camera frame and the lidar frame stay aligned.
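Since the camera frame and the lidar frame are kept aligned, associating lidar points with image pixels reduces to applying the fixed camera-lidar extrinsic followed by the panoramic projection. The sketch below shows one way this could look for an equirectangular 360° image; the extrinsic values, image resolution, and angle sign conventions are assumptions for illustration only, not the system's actual calibration.

import numpy as np

# Assumed extrinsic from the sensor (lidar) frame to the camera frame; the real
# values come from the dataset readme / CAD measurement, not from here.
R_CAM_SENSOR = np.eye(3)
T_CAM_SENSOR = np.array([0.0, 0.0, 0.1])

def project_to_panorama(p_sensor, width=1920, height=640):
    # Transform the lidar point into the camera frame.
    p_cam = R_CAM_SENSOR @ np.asarray(p_sensor, dtype=float) + T_CAM_SENSOR
    x, y, z = p_cam
    # Convert to spherical angles; the conventions here are an assumption.
    yaw = np.arctan2(y, x)                 # azimuth about the vertical axis
    pitch = np.arctan2(z, np.hypot(x, y))  # elevation above the horizontal plane
    # Map the angles to equirectangular pixel coordinates.
    u = (0.5 - yaw / (2.0 * np.pi)) * width
    v = (0.5 - pitch / np.pi) * height
    return u % width, v

print(project_to_panorama([2.0, 0.5, 0.3]))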

Real-robot Dataset

A dataset from the same challenge area is provided, with some differences in the object layout. The dataset contains the ROS messages provided by the system, in the same format as during the challenge. An RVIZ configuration file is provided for viewing the data, along with a ground truth map with object segmentation and IDs and an object list with bounding boxes and labels. The ground truth map and the object list are only available in the datasets, not at the challenge. The camera pose (camera frame) w.r.t. the lidar (sensor frame) is given in the readme.
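Assuming the messages are distributed as a ROS bag (the filename below is a placeholder), a quick way to inspect the dataset's contents from Python is:

import rosbag

# Print a short summary of every message in the dataset bag.
# 'vla_challenge_dataset.bag' is a placeholder filename; use the actual file from the download.
with rosbag.Bag('vla_challenge_dataset.bag') as bag:
    for topic, msg, t in bag.read_messages():
        print('%.3f  %-30s %s' % (t.to_sec(), topic, type(msg).__name__))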

Simulation System

Two simulation setups are provided. The first is based on Unity, with 15 environment models provided: 12 single rooms and 3 multi-room buildings. For all of the environment models, we provide object and region information together with a point cloud of the scene (included in the Language-scene Dataset). The second simulation setup is based on AI Habitat and uses Matterport3D environment models. Teams can launch the system, receive synthesized data, and send waypoints to navigate the robot in simulation. For a quick test, the command line below sends the system a waypoint 1 m away from the start point. Note that the simulation systems provide a broader scope of data than the actual challenge; teams can use this data to prepare their software in simulation. During the challenge, only the data listed in System Setup is provided, matching the Real-robot Dataset.

rostopic pub -1 /way_point_with_heading geometry_msgs/Pose2D '{x: 1.0, y: 0.0, theta: 0.0}'
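The same topic can also be driven programmatically from a node. The rough sketch below sends a short, hard-coded sequence of waypoints with a fixed wait between them; a real solution would generate waypoints from its own planner and monitor the robot's pose instead.

#!/usr/bin/env python3
# Send a short, hard-coded sequence of waypoints in simulation over the same
# /way_point_with_heading topic used by the quick-test command above.
import rospy
from geometry_msgs.msg import Pose2D

rospy.init_node('waypoint_sequence_example')
pub = rospy.Publisher('/way_point_with_heading', Pose2D, queue_size=5)
rospy.sleep(1.0)  # allow the publisher to connect

for x, y, theta in [(1.0, 0.0, 0.0), (2.0, 1.0, 1.57), (2.0, 2.5, 3.14)]:
    pub.publish(Pose2D(x=x, y=y, theta=theta))
    rospy.sleep(5.0)  # naive fixed wait between waypoints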

Language-scene Dataset

A dataset containing 6.2K real-world scenes from 10K regions (together with the 15 Unity scenes for simulation) and 8M statements describing the object spatial relations and attributes in the regional scenes is provided (to be released soon).

Real-robot Challenge - Starting in 2025

A real-robot setup is used for the challenge. The system provides onboard data in the same way as the Datasets and takes waypoints in the same way as the Simulations. Only the team's software is allowed to send waypoints; manually sending waypoints or teleoperation is not allowed. Each team will remotely log in to the robot's onboard computer (16 i9 CPU cores, 32 GB RAM, RTX 4090 GPU) and set up its software in a Docker container that interfaces with the autonomy modules. Each Docker container is used by one team alone and not shared with other teams. We will schedule multiple time slots for each team to set up the software and test the robot. Teams can also record data on the robot's onboard computer; we will make the data available to the teams afterward.

Simulation Challenge - Happening Now!

Instructions for the 2024 simulation challenge can be found here. The simulation challenge uses Unity environment models: 15 are provided to the teams for preparing for the challenge, and 3 are held out for evaluation. A set of 5 questions and the expected responses is prepared for each scene. Please follow the instructions here to prepare a Docker image for submission. Teams can use all of the simulation setups and datasets provided on the website to prepare for the challenge.

Dates

Below is the timeline for the 2024 simulation challenge.

Organizers

Haochen Zhang
CMU Robotics Institute

Zhixuan Liu
CMU Robotics Institute

Shibo Zhao
CMU Robotics Institute

Pujith Kachana
CMU Robotics Institute

Nader Zantout
CMU Robotics Institute

Zongyuan Wu
CMU MechE Department

Jean Oh
CMU NREC & Robotics Institute

Sebastian Scherer
CMU Robotics Institute

Yorie Nakahira
CMU ECE Department

Ji Zhang
CMU NREC & Robotics Institute

Wenshan Wang
CMU Robotics Institute

Acknowledgments

The CMU Vision-Language-Autonomy Challenge is sponsored by the National Robotics Engineering Center (NREC) at Carnegie Mellon University.

Other Links