2nd AI Meets Autonomy: Vision, Language, and Autonomous Systems Workshop
IROS 2025, Hangzhou, China
Event Details
With the rising popularity of Large Language Models (LLMs), Vision-Language Models (VLMs), and other general foundation models, the 2nd iteration of this workshop aims to explore the synergy between these models and robotics in the context of recent developments. In the workshop, we will discuss how recent advances from the AI and CV communities could benefit robotics research: incorporating LLMs and VLMs into robotic systems could potentially make these systems more explainable, instructable, and generalizable. However, seamlessly integrating the two remains an open problem, as existing LLMs and VLMs often lack the real-world knowledge that roboticists need to consider, such as the physical properties and the spatial and temporal relations in an environment. These issues may be alleviated by integration with robotic systems, where real-world experiences containing rich sensory input and physical feedback can help bridge the gap toward physically grounded systems.
This workshop aims to create an inclusive and collaborative platform for professionals, researchers, and enthusiasts alike to exchange ideas, share experiences, and foster meaningful connections within the AI and robotics community, with a focus on connecting early-career researchers. It will feature a mix of presentations, open panel discussions, networking, and exclusive results and demonstrations from our CMU Vision-Language-Autonomy challenge competition. Five invited speakers will discuss their related research, ideas, and future plans on topics at the intersection of AI and autonomous systems, with broad coverage of areas such as datasets and benchmarks, software stacks, vision-language navigation, situated reasoning, robotics foundation models, and more.
See AI Meets Autonomy 2024 for the previous iteration of this workshop at IROS 2024, including recordings of past talks.
Content
There are a variety of ongoing efforts to improve the state of the art in robot autonomy using recent advances in AI, particularly large foundation models, including new developments in areas such as human-robot interaction, vision-language navigation, multi-task planning, and knowledge acquisition and reasoning. Although integrating AI and autonomous systems offers multiple advantages, it also raises issues and challenges. First, large language and visual foundation models have been shown to fall short in understanding actual physics, spatial relationships, and causal relationships, lacking the physical grounding that is critical to most robotic tasks. Second, deploying these large models on robotic systems requires substantial GPU resources, especially when processing needs to run in real time. Third, existing work combining LLMs/VLMs and robotics largely uses them as tools to translate human instructions into executable plans; in reality, humans can provide much more than instructions, such as explanations of the scene and corrections of execution errors, often in richer contexts such as multi-turn dialogues.

The workshop will focus on current research tackling these challenges in robotics with a diverse line-up of speakers. The speakers will each share their recent work and experiences integrating vision and language methods into various autonomous systems, as well as their active research projects in this area, from the perspectives of both academia and industry. They will cover a wide range of topics in their areas of expertise, such as generalist embodied agents, vision-language navigation, world modeling, robotics foundation models, and situated dialogue agents. Their talks and subsequent discussions will not only give a concrete picture of the current, rapidly developing research landscape, but also promote new insights and directions for future work. Q&A sessions after each talk, as well as a panel discussion at the end of the workshop, will allow the audience to engage with the speakers and discuss ideas.
Another key barrier to integrating vision and language learning methods into robotic systems is the resources required to test new methods, including the data needed for training and evaluation and robust real-world robot platforms for deployment. To lower this barrier, we have been hosting the CMU Vision-Language-Autonomy Challenge. In the challenge, we provide a set of language questions, sample training environments, and a full pipeline for integrating any vision-language method for object-centric indoor navigation on a real-world ground vehicle running an autonomy stack with an onboard 3D LiDAR and a 360-degree camera. Challenge participants can develop and integrate their methods in a simulator and eventually deploy them on our real robot system, which will be demoed at the workshop. All code, autonomy stacks, and resources used for the challenge are open-sourced, and the challenge is open to anyone. The challenge ran in simulation last year, and initial results were presented at IROS 2024 during the previous iteration of our workshop. In the next few months, we are hosting round two of the challenge on our real-robot system and plan to present the results during this workshop alongside a live, interactive demonstration of the system in which audience members are invited to participate. This not only gives top participants, who are often students, a chance to spotlight their work to our diverse and global workshop audience, but will also help garner interest in the workshop in the months leading up to it. It also provides a unique opportunity for IROS 2025 to be the venue where the final challenge results are shared for the first time. The inclusion of an interactive system demonstration and spotlight talks on the competition results will differentiate our workshop from others at IROS 2025 and create a more engaging workshop experience.
In summary, the workshop will not only feature five invited leading researchers presenting their relevant work, but will also invite the top challenge teams to present their results in poster sessions and short spotlight talks, along with an interactive demo of the challenge system. Speakers will also take part in a panel session where attendees can ask questions and join the discussion. The workshop will conclude with an open networking session and an opportunity for attendees to share their profiles and resumes in a compiled contact book, encouraging connections that last beyond the workshop itself.
The topics we expect to cover through talks, challenge results, panel discussions, and networking span a wide range of robotics research while also intersecting with many other disciplines. Together, they will demonstrate how integrating vision and language can enhance robotics research in a variety of real-world application areas.
Topics of discussion and open questions include but are not limited to the following:
Vision-Language-Action datasets and models
Foundation models for robotics
Human-robot interaction for navigation
Human-robot collaboration and dialogue agents
Social navigation
Embodiment through world models
Bridging the sim-to-real gap
Mixed-initiative autonomy
Spatial and causal reasoning
Robot autonomy stacks
Program
Organizers and Speakers
Wenshan Wang
CMU Robotics Institute
Ji Zhang
CMU NREC & Robotics Institute
Haochen Zhang
CMU Robotics Institute
Deva Ramanan
CMU Robotics Institute
Ting Cao
Microsoft Research
Jiangmiao Pang
Shanghai AI Laboratory
Joyce Chai
University of Michigan
Huazhe Xu
Tsinghua University
Roozbeh Mottaghi
Meta FAIR
Zsolt Kira
Georgia Institute of Technology
Siyuan Huang
Beijing Institute for General Artificial Intelligence
Angel Chang
Simon Fraser University