2nd AI Meets Autonomy: Vision, Language, and Autonomous Systems Workshop
IROS 2025, Hangzhou, China
Friday, Oct. 24th, 1:30PM - 5:00PM, Room 210C.
This workshop is held on the afternoon of Friday, Oct. 24, 2025, from 1:30PM to 5:00PM in Room 210C.
With the rising popularity of Large Language Models (LLMs), Vision-Language Models (VLMs), and other general foundation models, the 2nd iteration of this workshop aims to explore the synergy between these models and robotics in the context of recent developments. In the workshop, we will discuss how recent advances from the AI and CV communities could benefit robotics research: incorporating LLMs and VLMs into robotic systems could make those systems more explainable, instructable, and generalizable. However, seamlessly integrating the two remains an open problem, as existing LLMs and VLMs often lack the real-world knowledge that roboticists need to consider, such as the physical properties and the spatial and temporal relations in an environment. These issues may be alleviated by integration with robotic systems, where real-world experiences containing rich sensory input and physical feedback can help bridge the gap toward physically grounded systems.
This workshop aims to create an inclusive and collaborative platform for professionals, researchers, and enthusiasts alike to exchange ideas, share experiences, and foster meaningful connections within the AI and robotics community, with a focus on connecting early-career researchers. It will feature a mix of presentations, open panel discussions, networking, and exclusive results and demonstrations from our CMU Vision-Language-Autonomy Challenge. Four invited speakers will discuss their related research, ideas, and future plans on various topics at the intersection of AI and autonomous systems, with broad coverage of areas such as datasets and benchmarks, software stacks, vision-language navigation, situated reasoning, robotics foundation models, and more.
See AI Meets Autonomy 2024 for the previous iteration of our workshop at IROS 2024 and past talk recordings.
There are a variety of ongoing efforts to improve the state of the art in robot autonomy using recent advances in AI, particularly large foundation models, with new developments in areas such as human-robot interaction, vision-language navigation, multi-task planning, and knowledge acquisition and reasoning. Although integrating AI and autonomous systems has multiple advantages, it also raises issues and challenges. First, large language and visual foundation models have been shown to fall short in understanding actual physics, spatial relationships, and causal relationships, lacking the physical grounding that is critical to most robotic tasks. Second, deploying these large models on robotic systems requires substantial GPU resources, especially when the processing needs to be real-time. Further, existing work combining LLMs/VLMs and robotics focuses largely on using LLMs/VLMs as a tool to translate human instructions into executable plans. In reality, humans can provide more than instructions, such as explanations of the scene and corrections of execution errors, and in richer contexts such as multi-turn dialogues.

The workshop will focus on current research tackling these challenges in robotics, with a diverse line-up of speakers. The speakers will each share their recent work and experience integrating vision and language methods into various autonomous systems, along with their active research projects in this area, from the perspectives of both academia and industry. They will cover a wide range of topics in which they have expertise, such as generalist embodied agents, vision-language navigation, world modeling, robotics foundation models, and situated dialogue agents. Their talks and subsequent discussions will not only give a concrete picture of the current and rapidly developing research landscape, but also promote new insights and directions for future work. Q&A sessions after each talk, as well as a panel discussion at the end of the workshop, will allow the audience to engage with the speakers and discuss ideas.
Another key barrier to integrating vision and language methods into robotic systems is the resources required for testing new methods, including the data needed for training and evaluation and robust real-world robot platforms for deployment testing. To lower this barrier, we have been hosting the CMU Vision-Language-Autonomy Challenge. In the challenge, we provide a set of language questions, sample training environments, and a full pipeline for integrating any vision-language method for object-centric indoor navigation with a real-world ground vehicle running an autonomy stack with an onboard 3D LiDAR sensor and a 360-degree camera. Challenge participants can develop and integrate their methods in a simulator and eventually deploy them on our real robot system, which will be demoed at the workshop. All code, autonomy stacks, and resources used for the challenge are open-sourced, and the challenge is open to anyone. The winning teams will present their methods at the workshop.
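To give a rough sense of how such an integration looks, the sketch below shows, in broad strokes, how a vision-language method might plug into a pipeline of this kind: a language question and the onboard sensing (360-degree imagery, LiDAR, robot pose) go in, and a navigation waypoint for the autonomy stack comes out. All class and function names here are hypothetical illustrations, not the actual challenge API; please refer to the open-sourced challenge resources for the real interfaces.

```python
# Minimal sketch of a vision-language module feeding an autonomy stack.
# Hypothetical names only -- NOT the actual challenge API.
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Observation:
    """Onboard sensing available to a participant's method."""
    panorama: object                          # e.g. an HxWx3 360-degree image array
    point_cloud: object                       # e.g. an Nx3 array of LiDAR returns
    robot_pose: Tuple[float, float, float]    # (x, y, yaw) in the map frame


def answer_and_navigate(question: str, obs: Observation) -> Tuple[float, float]:
    """Toy stand-in for a vision-language method: ground the question in the
    current observation and return an (x, y) waypoint for the autonomy stack."""
    # A real method would run a VLM to localize the referenced object in the
    # panorama, lift the detection into 3D using the LiDAR point cloud, and
    # pick a reachable goal near it. Here we simply return the robot's
    # current position as a placeholder.
    x, y, _ = obs.robot_pose
    return (x, y)


if __name__ == "__main__":
    obs = Observation(panorama=None, point_cloud=None, robot_pose=(0.0, 0.0, 0.0))
    waypoint = answer_and_navigate("Find the nearest red chair.", obs)
    print("Send waypoint to autonomy stack:", waypoint)
```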
In summary, the workshop will feature not only invited talks from leading researchers on their relevant work, but also presentations from the top challenge teams, who will share their results in poster sessions and short spotlight talks, along with an interactive demo of the challenge system. Speakers will also take part in a panel session where attendees can pose questions and join the discussion. The workshop will include networking opportunities and the chance for attendees to share their profiles and resumes in a compiled contact book, encouraging connections that extend beyond the span of the workshop.
The topics we expect to cover through talks, challenge results, panel discussions, and networking span a diverse range across the field of robotics and intersect with many other disciplines. Together, they will demonstrate how the integration of robotics, language, and vision can enhance robotics research in a variety of real-world application areas.
Topics of discussion and open questions include but are not limited to the following:
Vision-Language-Action datasets and models
Foundation models for robotics
Human-robot interaction for navigation
Human-robot collaboration and dialogue agents
Social navigation
Embodiment through world models
Bridging the sim-to-real gap
Mixed-initiative autonomy
Spatial and causal reasoning
Robot autonomy stacks
Wenshan Wang
CMU Robotics Institute
Ji Zhang
CMU NREC & Robotics Institute
Haochen Zhang
CMU Robotics Institute
Deva Ramanan
CMU Robotics Institute
Ting Cao
Microsoft Research
Jiangmiao Pang
Shanghai AI Laboratory
Angel Chang
Simon Fraser University
Roozbeh Mottaghi
Meta FAIR
Siyuan Huang
Beijing Institute for General Artificial Intelligence
He Wang
Peking University
Joseph Lim
KAIST
Feishi Wang
Peking University