CMU Vision-Language-Autonomy Challenge
For the first year (2024), we will host the competition in simulation. We will deploy a real robot starting in the second year (2025). Please check out detailed information about the 2024 competition. The competition is happening now!
The CMU Vision-Language-Autonomy Challenge leverages computer vision and natural language understanding for navigation autonomy. The challenge aims to push the limits of embodied AI in real environments and on real robots - providing a robot platform and a working autonomy system to bring everybody's work a step closer to real-world deployment. The challenge provides a real-robot system equipped with a 3D lidar and a 360 camera. The system has base autonomy onboard that can estimate the sensor pose, analyze the terrain, avoid collisions, and navigate to waypoints. Teams set up software on the robot's onboard computer to interface with the system and navigate the robot. In the challenge, each team is given a set of questions, which the team's software processes together with the onboard data provided by the system. The defining characteristic of the challenge is that the team's software must explicitly understand the spatial relations and attributes of the scene as described in the questions. Because the environment is unknown, the robot must navigate to appropriate viewpoints to discover and validate those spatial relations and attributes. The team's software sends waypoints to the system to navigate the robot around.
The questions fall into three categories: 'numerical', 'object reference', and 'instruction following'. For numerical questions, we expect the team's software to respond with a number. For object reference questions, we expect the team's software to select a specific object in the environment. For instruction following questions, we expect the team's software to guide the vehicle's navigation. Some example questions are below.
How many blue chairs are between the table and the wall? Category: numerical
How many black trash cans are near the table and the window? Category: numerical
Find the potted plant closest to the fridge on the kitchen island. Category: object reference
Find the orange chair near the window between the fridge and the table. Category: object reference
Take the path near the table and to the fridge beside the wall. Category: instruction following
Avoid the path between the table and the window and go near the blue trash can farthest from the window. Category: instruction following
System Setup
Our system runs on Ubuntu 20.04 and uses ROS Noetic, both in simulation and on real robots. The system provides the following onboard data to the team's software.
Image: ROS Image message from the 360 camera. The image is at 1920x640 resolution with 360 deg HFOV and 120 deg VFOV. Frequency: 10Hz, Frame: camera, ROS topic name: /camera/image
Registered scan: ROS PointCloud2 message from the 3D lidar, registered by the state estimation module. Frequency: 5Hz, Frame: map, ROS topic name: /registered_scan
Sensor scan: ROS PointCloud2 message from the 3D lidar. Frequency: 5Hz, Frame: sensor_at_scan, ROS topic name: /sensor_scan
Local terrain map: ROS PointCloud2 message from the terrain analysis module around the vehicle. Frequency: 5Hz, Frame: map, ROS topic names: /terrain_map (5m around the vehicle) and /terrain_map_ext (20m around the vehicle)
Sensor pose: ROS Odometry message from the state estimation module. Frequency: 100-200Hz, Frame: from map to sensor, ROS topic name: /state_estimation
Traversable area: ROS PointCloud2 message containing the traversable area of the entire environment. Frequency: 5Hz, Frame: map, ROS topic name: /traversable_area
Ground truth semantics: ROS MarkerArray message containing object labels and bounding boxes within 2m around the vehicle. Frequency: 5Hz, Frame: map, ROS topic name: /object_markers
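The 360 image geometry above can be used to relate image pixels to viewing directions in the camera frame. Below is a minimal sketch, assuming an equirectangular projection (which the listed resolution and FOV suggest) and assuming azimuth zero at the horizontal image center and elevation zero at the vertical midline; the actual zero conventions on the robot may differ.

```python
# Sketch only: pixel-to-bearing conversion for the 360 camera image,
# assuming an equirectangular projection. The zero conventions below
# are assumptions, not taken from the challenge documentation.
IMG_W, IMG_H = 1920, 640       # image resolution from the spec
HFOV, VFOV = 360.0, 120.0      # field of view in degrees

def pixel_to_bearing(u, v):
    """Map pixel coordinates (u, v) to (azimuth, elevation) in degrees.

    Azimuth is assumed to be 0 at the horizontal image center and
    elevation 0 at the vertical midline, increasing upward.
    """
    azimuth = (u / IMG_W) * HFOV - HFOV / 2.0
    elevation = VFOV / 2.0 - (v / IMG_H) * VFOV
    return azimuth, elevation
```

A mapping like this lets a detection in the panorama be turned into a direction for associating image content with lidar returns or for choosing a viewing waypoint.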
The system takes waypoints from the team's software to navigate the robot. Waypoints located in the traversable area (listed above) are accepted directly; waypoints outside the traversable area are adjusted and moved into it. The system also takes visualization markers from the team's software to highlight selected objects.
Waypoint with heading: ROS Pose2D message with position and orientation.
Selected object marker: ROS Marker message containing object label and bounding box of the selected object.
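The waypoint-adjustment behavior described above can be illustrated with a simple nearest-point projection. This is a sketch of the idea only; the onboard implementation is not specified, and the tolerance used here is an assumption.

```python
import math

def adjust_waypoint(waypoint, traversable_points):
    """Illustrate the system's waypoint handling: a waypoint on a
    traversable point (within an assumed tolerance) is accepted as-is;
    otherwise it is moved to the nearest traversable point.

    waypoint: (x, y) tuple; traversable_points: list of (x, y) tuples.
    """
    wx, wy = waypoint
    nearest = min(traversable_points,
                  key=lambda p: math.hypot(p[0] - wx, p[1] - wy))
    if math.hypot(nearest[0] - wx, nearest[1] - wy) < 0.05:  # assumed tolerance
        return waypoint
    return nearest
```

In practice the team's software can simply send waypoints and rely on the system to perform this adjustment, but anticipating it (e.g., by checking waypoints against the /traversable_area cloud first) avoids surprises in the executed path.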
The coordinate frames used by the system are shown below. The camera position (camera frame) w.r.t. the lidar (sensor frame) is measured based on a CAD model. The orientation is calibrated and the images are remapped to keep the camera frame and lidar frame aligned.
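Because the orientations are calibrated so that the camera and lidar frames are axis-aligned, transforming a point between the two frames reduces to a translation. The sketch below uses a hypothetical offset; the real values come from the CAD measurement and are given in the dataset readme.

```python
# Hypothetical camera-to-sensor translation in meters. These are NOT
# the real calibration values; those are in the dataset readme.
CAM_TO_SENSOR_T = (0.0, 0.0, 0.1)

def camera_to_sensor(p):
    """Transform a point from the camera frame to the sensor frame.

    With the frames calibrated to be axis-aligned, only the
    translation applies (no rotation term).
    """
    return tuple(pi + ti for pi, ti in zip(p, CAM_TO_SENSOR_T))
```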
Real-robot Dataset
A dataset from the same challenge area is provided, with some differences in the object layout. The dataset contains ROS messages provided by the system in the same format as during the challenge. An RVIZ configuration file is also provided for viewing the data. A ground truth map with object segmentation and IDs, and an object list with bounding boxes and labels, are also provided. The ground truth map and the object list are available only in the datasets, not during the challenge. The camera pose (camera frame) w.r.t. the lidar (sensor frame) is given in the readme.
Simulation System
Two simulation setups are provided. The first is based on Unity, with 15 environment models provided; 12 are single rooms and 3 are multi-room buildings. For all the environment models, we provide the object and region information together with a point cloud of the scene (included in the Language-scene Dataset). The second simulation setup is based on AI Habitat and uses Matterport3D environment models. Teams can launch the system, receive synthesized data, and send waypoints to navigate the robot in simulation. For a quick test, the command line below sends a waypoint to the system 1m away from the start point. Note that the simulation systems provide a broader scope of data than the actual challenge; teams can use the data to prepare their software in simulation. During the challenge, only the data listed in System Setup is provided, matching the Real-robot Dataset.
rostopic pub -1 /way_point_with_heading geometry_msgs/Pose2D '{x: 1.0, y: 0.0, theta: 0.0}'
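Beyond a fixed test waypoint, a waypoint a given distance ahead of the current sensor pose can be computed with plain geometry before being published as a Pose2D. This is a sketch of the computation only, independent of ROS; it is not part of the provided system.

```python
import math

def waypoint_ahead(x, y, yaw, dist=1.0):
    """Compute a Pose2D-style waypoint (x, y, theta) a distance `dist`
    meters ahead of the pose (x, y, yaw); yaw is in radians.

    The heading of the waypoint is kept equal to the current yaw.
    """
    return (x + dist * math.cos(yaw), y + dist * math.sin(yaw), yaw)
```

The resulting tuple maps directly onto the x, y, and theta fields of the geometry_msgs/Pose2D message used by /way_point_with_heading.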
Language-scene Dataset
A dataset containing 6.2K real-world scenes from 10K regions (together with the 15 Unity scenes for simulation) and 8M statements describing the object spatial relations and attributes in the regional scenes is provided (releasing soon).
Real-robot Challenge - Starting in 2025
A real-robot setup is used for the challenge. In the challenge, the system provides onboard data in the same way as the Datasets and takes waypoints in the same way as the Simulations. Only the team's software may send waypoints; manually sending waypoints or teleoperation is not allowed. Each team will log in remotely to the robot's onboard computer (16x i9 CPU cores, 32GB RAM, RTX 4090 GPU) and set up software in a Docker container that interfaces with the autonomy modules. Each Docker container is used by one team alone and is not shared with other teams. We will schedule multiple time slots for each team to set up the software and test the robot. Teams can also record data on the robot's onboard computer; we will make the data available to the teams afterward.
Simulation Challenge - Happening Now!
Instructions for the 2024 simulation challenge can be found here. The simulation challenge uses Unity environment models: 15 are provided to the teams for preparing for the challenge, and 3 are withheld for evaluation. A set of 5 questions and the expected responses is prepared for each scene. Please follow the instructions here to prepare a Docker image for submission. Teams can use all simulation setups and datasets provided on the website to prepare for the challenge.
Dates
Below is the timeline for the 2024 simulation challenge.
Deadline for team registration: 6/30/2024
Deadline to submit simulation challenge results: 8/15/2024
Results and team presentations at the IROS 2024 Workshop: 10/14/2024 or 10/18/2024
Organizers
Haochen Zhang
CMU Robotics Institute
Zhixuan Liu
CMU Robotics Institute
Shibo Zhao
CMU Robotics Institute
Pujith Kachana
CMU Robotics Institute
Nader Zantout
CMU Robotics Institute
Zongyuan Wu
CMU MechE Department
Herman Herman
CMU NREC
Jean Oh
CMU NREC & Robotics Institute
Sebastian Scherer
CMU Robotics Institute
Yorie Nakahira
CMU ECE Department
Ji Zhang
CMU NREC & Robotics Institute
Wenshan Wang
CMU Robotics Institute
Acknowledgments
The CMU Vision-Language-Autonomy Challenge is sponsored by the National Robotics Engineering Center (NREC) at Carnegie Mellon University.