Embodied Web Agents:

Bridging Physical-Digital Realms for Integrated Agent Intelligence

University of California, Los Angeles
*Indicates Equal Contribution


Illustrative examples of our Embodied Web Agents conceptual paradigm, tasks and environments.


Blue boxes and arrows indicate web interaction and switching to the web, respectively; orange boxes and arrows indicate acting in and switching to the embodied environment. We omit most intermediate actions due to the large number of interaction steps.

Abstract

AI agents today are mostly siloed — they either retrieve and reason over vast amounts of digital information and knowledge obtained online, or interact with the physical world through embodied perception, planning, and action — but rarely both. This separation limits their ability to solve tasks that require integrated physical and digital intelligence, such as cooking from online recipes, navigating with dynamic map data, or interpreting real-world landmarks using web knowledge. We introduce Embodied Web Agents, a novel paradigm for AI agents that fluidly bridge embodiment and web-scale reasoning. To operationalize this concept, we first develop the EMBODIED WEB AGENTS task environments, a unified simulation platform that tightly integrates realistic 3D indoor and outdoor environments with functional web interfaces. Building upon this platform, we construct and release the EMBODIED WEB AGENTS Benchmark, which encompasses a diverse suite of tasks including cooking, navigation, shopping, tourism, and geolocation — all requiring coordinated reasoning across physical and digital realms for systematic assessment of cross-domain intelligence. Experimental results reveal significant performance gaps between state-of-the-art AI systems and human capabilities, establishing both challenges and opportunities at the intersection of embodied cognition and web-scale knowledge access.

Embodied Web Agents

We introduce Embodied Web Agents as a new conceptual paradigm of AI systems that unify physical embodiment with web-scale knowledge access — capable of perceiving and acting in the real world while reasoning over dynamic, unstructured information from the web.


Task Environments

To operationalize this concept, we first develop the Embodied Web Agents task environments, a unified simulation platform that integrates realistic 3D environments with interactive web interfaces. This platform combines (1) indoor settings from AI2-THOR, (2) outdoor navigation in Google Earth, and (3) web interfaces including Wikipedia, online stores, recipe websites, map services, etc., enabling agents to interact seamlessly with both physical and digital spaces.
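As a rough illustration of how such a unified platform can be exposed to an agent, the sketch below shows a single action interface that dispatches each step to either the embodied scene or the web and supports explicit switching between the two. All class, method, and action names here are hypothetical placeholders for exposition, not the platform's actual API.

from dataclasses import dataclass, field

# Hypothetical sketch: one episode state exposing both an embodied scene and a
# web session behind a single step() interface. Names are illustrative only.

@dataclass
class UnifiedEnvironment:
    active_domain: str = "embodied"          # "embodied" or "web"
    embodied_observation: str = "kitchen counter with tomatoes and pasta"
    web_page: str = "recipe search portal"
    history: list = field(default_factory=list)

    def step(self, action: str) -> str:
        """Dispatch a single agent action to the currently active domain."""
        self.history.append(action)
        if action == "switch_to_web":
            self.active_domain = "web"
            return f"[web] now viewing {self.web_page}"
        if action == "switch_to_embodied":
            self.active_domain = "embodied"
            return f"[embodied] {self.embodied_observation}"
        if self.active_domain == "web":
            # Web actions: search, click, open a page, read content, etc.
            return f"[web] executed '{action}' on {self.web_page}"
        # Embodied actions: navigate, pick up, slice, cook, etc.
        return f"[embodied] executed '{action}' in the 3D scene"


if __name__ == "__main__":
    env = UnifiedEnvironment()
    # A toy cooking episode: inspect ingredients, look up a recipe, act on it.
    for act in ["look_at_counter", "switch_to_web", "search tomato pasta recipe",
                "switch_to_embodied", "slice tomato"]:
        print(env.step(act))

The point of the sketch is the single action space: the agent decides at every step whether to act in the physical scene, act on the web, or switch domains, which is exactly the coordination our tasks are designed to stress.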


Benchmark

We construct the Embodied Web Agents Benchmark, which encompasses approximately 1.5k tasks across multiple domains, including: (1) cooking tasks where agents match physical ingredients with online recipes; (2) navigation combining online maps with physical wayfinding; (3) shopping requiring coordination between in-store actions and online options; (4) tourism connecting physical landmarks with web information; and (5) geolocation determining position through embodied exploration and online research. Together, these tasks systematically test an agent's ability to bridge embodied perception, action, and web-based reasoning across varied contexts.
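To make the task format concrete, below is a hypothetical example of what a single benchmark entry could look like; the field names and values are illustrative assumptions rather than the released schema.

# Hypothetical benchmark entry; fields and values are illustrative assumptions.
example_task = {
    "task_id": "cooking_0001",
    "domain": "cooking",
    "instruction": "Find an online recipe that matches the ingredients in the "
                   "kitchen and prepare the dish.",
    "embodied_scene": "an AI2-THOR kitchen floor plan",
    "web_start_page": "recipe search portal",
    "success_criteria": [
        "correct recipe selected on the web",
        "required ingredients processed in the kitchen",
        "dish assembled in the final scene state",
    ],
}

if __name__ == "__main__":
    import json
    print(json.dumps(example_task, indent=2))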


Example


An example pipeline for completing a task in our Embodied Web Agents dataset. Blue boxes indicate web interaction, orange boxes indicate embodied interaction, and boxes with gradient colors indicate switching from one environment to the other.

Experiments and Results

Error Analysis


This figure presents a detailed breakdown of error types and their percentages for task failures on cooking tasks with GPT-4o. Our analysis reveals that the primary challenges for embodied web agents lie not in isolated capabilities, but in their integration. The most prevalent failure pattern involves agents becoming trapped in single-domain cycles. While embodied errors (14.6%) and web errors (8.0%) occur, cross-domain errors (66.6%) overwhelmingly dominate the failure landscape — confirming that the critical bottleneck emerges at the intersection where physical and digital domains meet.

BibTeX

@misc{hong2025embodiedwebagentsbridging,
  title={Embodied Web Agents: Bridging Physical-Digital Realms for Integrated Agent Intelligence},
  author={Yining Hong and Rui Sun and Bingxuan Li and Xingcheng Yao and Maxine Wu and Alexander Chien and Da Yin and Ying Nian Wu and Zhecan James Wang and Kai-Wei Chang},
  year={2025},
  eprint={2506.15677},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2506.15677}
}