CMU Vision-Language-Autonomy Challenge


Simulation Challenge - Happening Now!


For the first year (2024), we are hosting the competition in simulation. We will move to a real robot in 2025.

The CMU Vision-Language-Autonomy Challenge leverages computer vision and natural language understanding for navigation autonomy. The challenge aims to push the limits of embodied AI in real environments and on real robots, providing a robot platform and a working autonomy system to bring higher-level reasoning and learning models a step closer to real-world deployment.

The core task of the challenge is to develop a model that takes in natural language queries or commands about a scene and generates the appropriate navigation-based response by reasoning about semantic and spatial relationships. The environment is initially unknown, so the system must navigate to appropriate viewpoints to discover and validate spatial relations and attributes. A real-robot system equipped with a 3D lidar and a 360° camera is provided, with base autonomy onboard that estimates the sensor pose, analyzes the terrain, avoids collisions, and navigates to waypoints. Teams will set up software on the robot's onboard computer to interface with the system and navigate the robot. For 2024, the challenge takes place in a custom simulation environment and will move to the real-robot system the following year.

The top-performing teams will be invited to present their results at our IROS 2024 workshop!

Challenge Overview


For detailed challenge information, technical documentation, and instructions, please see our GitHub repo.


Challenge Scenes

The challenge uses a set of 18 Unity scenes spanning a variety of indoor environments, including residential and corporate settings. 15 scenes are provided for model development, while 3 are held out for testing. The majority of the scenes are single rooms; a few are multi-room buildings. Some example scenes are shown below.

Challenge Questions

Three types of natural language questions/statements requiring semantic spatial understanding are used for each scene, and each type requires a different response from the AI module. Numerical questions require an answer printed to the terminal, while Object Reference and Instruction-Following questions require navigating the robot. Examples of each type of question are shown below:

Numerical

Object Reference

Instruction-Following


For robot navigation, only waypoints are needed; the base autonomy system takes care of low-level path planning and obstacle avoidance.
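As a rough illustration, here is a minimal Python/rospy sketch of sending one waypoint. The topic name /way_point, the use of geometry_msgs/PointStamped, and the map frame are assumptions modeled on the CMU autonomy stack conventions; the exact interface is defined in the challenge GitHub repo.

```python
#!/usr/bin/env python3
# Minimal sketch: publish a single navigation waypoint to the base autonomy system.
# ASSUMPTIONS: topic "/way_point", message type PointStamped, world frame "map".
import rospy
from geometry_msgs.msg import PointStamped

def send_waypoint(x, y, z=0.0):
    pub = rospy.Publisher('/way_point', PointStamped, queue_size=1, latch=True)
    rospy.sleep(0.5)                       # give the connection time to establish
    msg = PointStamped()
    msg.header.stamp = rospy.Time.now()
    msg.header.frame_id = 'map'            # assumed world frame of the autonomy system
    msg.point.x, msg.point.y, msg.point.z = x, y, z
    pub.publish(msg)

if __name__ == '__main__':
    rospy.init_node('waypoint_sender')
    send_waypoint(2.0, 1.5)                # placeholder coordinates
    rospy.sleep(1.0)                       # keep the node alive until the message is out
```

If a waypoint falls outside the traversable area, the base system adjusts it as described in the Robot System section below.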

Robot System

Our system (shown above) runs on Ubuntu 20.04 with ROS Noetic, both in simulation and on the real robot, and provides onboard data to the AI module. As input, the system takes waypoints from the AI module to navigate the robot: waypoints located in the traversable area of the scene are accepted directly, while waypoints outside the traversable area are adjusted and moved into it. The system also takes visualization markers from the team's software to highlight selected objects.
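For highlighting a selected object (for example, when answering an Object Reference question), a visualization marker can be published in a similar way. The sketch below is only illustrative: the topic name /selected_object_marker and the box dimensions are placeholders, not the official interface.

```python
#!/usr/bin/env python3
# Sketch: highlight a selected object with a visualization marker.
# ASSUMPTIONS: topic "/selected_object_marker", world frame "map", placeholder geometry.
import rospy
from visualization_msgs.msg import Marker

rospy.init_node('object_highlighter')
pub = rospy.Publisher('/selected_object_marker', Marker, queue_size=1, latch=True)
rospy.sleep(0.5)

marker = Marker()
marker.header.frame_id = 'map'
marker.header.stamp = rospy.Time.now()
marker.ns, marker.id = 'selected_object', 0
marker.type = Marker.CUBE                  # axis-aligned box around the referred object
marker.action = Marker.ADD
marker.pose.position.x, marker.pose.position.y, marker.pose.position.z = 1.0, 2.0, 0.5
marker.pose.orientation.w = 1.0
marker.scale.x, marker.scale.y, marker.scale.z = 0.6, 0.4, 0.8   # placeholder extents
marker.color.r, marker.color.a = 1.0, 0.8                        # semi-transparent red

pub.publish(marker)
rospy.sleep(1.0)
```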

The coordinate frames used by the system are also shown above. The camera position (camera frame) with respect to the lidar (sensor frame) is measured from a CAD model. The orientation is calibrated, and the images are remapped so that the camera frame and the lidar frame stay aligned.
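Because the extrinsic between the two frames is available to the software, points can be moved between frames with tf. The snippet below transforms a lidar point into the camera frame; the frame ids 'sensor' and 'camera' are assumptions, so substitute whatever frame ids the system actually broadcasts.

```python
#!/usr/bin/env python3
# Sketch: transform a point from the lidar (sensor) frame into the camera frame.
# ASSUMPTIONS: frame ids "sensor" and "camera"; the system broadcasts the extrinsic on tf.
import rospy
import tf2_ros
import tf2_geometry_msgs                   # registers geometry_msgs types with tf2
from geometry_msgs.msg import PointStamped

rospy.init_node('frame_example')
buf = tf2_ros.Buffer()
listener = tf2_ros.TransformListener(buf)
rospy.sleep(1.0)                           # let the tf buffer fill

pt = PointStamped()
pt.header.frame_id = 'sensor'              # point expressed in the lidar frame
pt.point.x, pt.point.y, pt.point.z = 1.0, 0.0, 0.2

pt_cam = buf.transform(pt, 'camera', rospy.Duration(1.0))
rospy.loginfo('Point in camera frame: %.2f %.2f %.2f',
              pt_cam.point.x, pt_cam.point.y, pt_cam.point.z)
```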

Simulation System

Two simulation setups are provided: one based on Unity and the other based on AI Habitat with Matterport3D environment models. For each scene, we provide object and region information together with a point cloud of the scene (included in the Object-Referential Language Dataset).

Teams can launch the system, receive synthesized data, and send waypoints to navigate the robot in simulation. 
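The sketch below shows one way to receive the synthesized data from Python. The topic names and message types (/camera/image, /registered_scan, /state_estimation) are assumptions modeled on the CMU autonomy stack; the definitive list is in the challenge GitHub repo.

```python
#!/usr/bin/env python3
# Sketch: subscribe to the synthesized sensor data streamed by the simulation system.
# ASSUMPTIONS: topic names "/camera/image", "/registered_scan", "/state_estimation".
import rospy
from sensor_msgs.msg import Image, PointCloud2
from nav_msgs.msg import Odometry

def on_image(msg):
    rospy.loginfo_throttle(5, '360 image: %dx%d', msg.width, msg.height)

def on_cloud(msg):
    rospy.loginfo_throttle(5, 'registered scan with %d points', msg.width * msg.height)

def on_odom(msg):
    p = msg.pose.pose.position
    rospy.loginfo_throttle(5, 'sensor pose: (%.2f, %.2f, %.2f)', p.x, p.y, p.z)

rospy.init_node('vla_interface')
rospy.Subscriber('/camera/image', Image, on_image, queue_size=1)
rospy.Subscriber('/registered_scan', PointCloud2, on_cloud, queue_size=1)
rospy.Subscriber('/state_estimation', Odometry, on_odom, queue_size=1)
rospy.spin()
```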

Object-Referential Language Dataset

To help with the subtask of referential object grounding, we provide the VLA-3D dataset, which contains 7.6K indoor 3D scenes with over 11K regions and 9M+ statements drawn from a diverse set of data sources, including the 15 Unity training scenes. For each 3D scene, the dataset includes a processed point cloud, object and region labels, a scene graph of semantic relations, and generated language statements. For access to the data and more details on the format, please see our VLA-3D repository.
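As a starting point, a scene could be loaded along the lines of the sketch below. The file names and layout here are placeholders, not the dataset's actual structure; the real file formats are documented in the VLA-3D repository.

```python
#!/usr/bin/env python3
# Sketch: load one VLA-3D scene for training.
# ASSUMPTIONS: the file names "scene_pc.ply" and "referential_statements.json" are
# placeholders; consult the VLA-3D repository for the actual layout and formats.
import json
import numpy as np
import open3d as o3d                       # pip install open3d

scene_dir = 'VLA-3D/example_scene'         # placeholder path

# Colored point cloud of the scene.
pcd = o3d.io.read_point_cloud(f'{scene_dir}/scene_pc.ply')
points = np.asarray(pcd.points)            # (N, 3) xyz
colors = np.asarray(pcd.colors)            # (N, 3) rgb in [0, 1]

# Generated referential statements paired with their target objects.
with open(f'{scene_dir}/referential_statements.json') as f:
    statements = json.load(f)

print(f'{points.shape[0]} points, {len(statements)} language statements')
```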

Real-Robot Challenge - Starting in 2025

Starting in 2025, challenge evaluation will be done on the real-robot system instead of in simulation. The system provides onboard data and takes waypoints in the same way as the simulator. Only the software developed in the AI module may send waypoints to explore the scene; manually sending waypoints or teleoperating the robot is not allowed. Teams will have the opportunity to remotely log in to the robot's onboard computer and set up their software in a Docker container that interfaces with the autonomy modules. More information about this part of the challenge will be shared in 2025.

Important Dates

Below is the timeline for the 2024 simulation challenge (all deadlines are midnight, Anywhere on Earth):

Organizers

Haochen Zhang
CMU Robotics Institute

Zhixuan Liu
CMU Robotics Institute

Pujith Kachana
CMU Robotics Institute

Nader Zantout
CMU Robotics Institute

Zongyuan Wu
CMU MechE Department

Shibo Zhao
CMU Robotics Institute

Jean Oh
CMU NREC & Robotics Institute

Sebastian Scherer
CMU Robotics Institute

Yorie Nakahira
CMU ECE Department

Ji Zhang
CMU NREC & Robotics Institute

Wenshan Wang
CMU Robotics Institute

Acknowledgments

The CMU Vision-Language-Autonomy Challenge is sponsored by the National Robotics Engineering Center (NREC) at Carnegie Mellon University.

Links to Related Resources