Embodied Web Agents:

Bridging Physical-Digital Realms for Integrated Agent Intelligence

NeurIPS 2025 Datasets and Benchmarks Track (Spotlight ✨)
University of California, Los Angeles


Teaser figure: Illustrative examples of our Embodied Web Agents conceptual paradigm, tasks, and environments. Blue boxes and arrows indicate web interaction and switching to the web, respectively; orange boxes and arrows indicate acting in and switching to the embodied environment. Most intermediate actions are omitted due to the large number of interaction steps.

1. Abstract

AI agents today are mostly siloed: they either retrieve and reason over vast amounts of digital information and knowledge obtained online, or interact with the physical world through embodied perception, planning, and action, but rarely both. This separation limits their ability to solve tasks that require integrated physical and digital intelligence, such as cooking from online recipes, navigating with dynamic map data, or interpreting real-world landmarks using web knowledge. We introduce Embodied Web Agents, a novel paradigm for AI agents that fluidly bridge embodiment and web-scale reasoning. To operationalize this concept, we first develop the EMBODIED WEB AGENTS task environments, a unified simulation platform that tightly integrates realistic 3D indoor and outdoor environments with functional web interfaces. Building upon this platform, we construct and release the EMBODIED WEB AGENTS Benchmark, which encompasses a diverse suite of tasks, including cooking, navigation, shopping, tourism, and geolocation, all requiring coordinated reasoning across physical and digital realms for systematic assessment of cross-domain intelligence. Experimental results reveal significant performance gaps between state-of-the-art AI systems and human capabilities, establishing both challenges and opportunities at the intersection of embodied cognition and web-scale knowledge access.

2. Embodied Web Agents

We introduce Embodied Web Agents as a new conceptual paradigm of AI systems that unify physical embodiment with web-scale knowledge access: agents capable of perceiving and acting in the real world while reasoning over dynamic, unstructured information from the web.
πŸ—οΈ

Task Environments

To operationalize this concept, we first develop the Embodied Web Agents task environments, a unified simulation platform that integrates realistic 3D environments with interactive web interfaces; a minimal interface sketch follows the component list below.

🏠 Indoor Settings: Realistic 3D indoor environments from AI2-THOR for embodied interaction and navigation.
🌍 Outdoor Navigation: Google Earth integration for large-scale outdoor exploration and wayfinding tasks.
💻 Web Interfaces: Interactive web platforms including Wikipedia, online stores, recipe websites, and map services.
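
To make the platform concrete, below is a minimal sketch of how a single wrapper could expose both action spaces to one agent. All names here (UnifiedEnv, Observation, the "switch" action) are hypothetical simplifications for illustration, not the released API.

from dataclasses import dataclass
from typing import Any

@dataclass
class Observation:
    domain: str   # "embodied" or "web"
    payload: Any  # e.g., an egocentric RGB frame or a web page's accessibility tree

class UnifiedEnv:
    """Hypothetical wrapper joining a 3D simulator with a web stack.

    A real implementation would hold, e.g., an AI2-THOR controller (indoor),
    a Google Earth client (outdoor), and a browser backend (web).
    """

    def __init__(self, embodied_backend, web_backend):
        self.embodied = embodied_backend
        self.web = web_backend
        self.active = "embodied"  # which realm the agent currently occupies

    def step(self, action: dict) -> Observation:
        # A dedicated "switch" action crosses between the two realms; every
        # other action is routed to whichever backend is currently active.
        if action["type"] == "switch":
            self.active = "web" if self.active == "embodied" else "embodied"
        elif self.active == "embodied":
            self.embodied.step(action["name"])   # e.g., "MoveAhead", "PickupObject"
        else:
            self.web.execute(action["name"])     # e.g., "click [id]", "type [text]"
        backend = self.embodied if self.active == "embodied" else self.web
        return Observation(domain=self.active, payload=backend.observe())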

📊 Benchmark

We construct the Embodied Web Agents Benchmark, which encompasses approximately 1.5k tasks across multiple domains, systematically testing an agent's ability to bridge embodied perception, action, and web-based reasoning. The task domains are summarized below, followed by an illustrative task record.

👨‍🍳 Cooking Tasks: Agents match physical ingredients with online recipes and cooking instructions.
🗺️ Navigation: Combining online maps with physical wayfinding and route planning.
🛒 Shopping: Coordination between in-store actions and online price comparison.
🏛️ Tourism: Connecting physical landmarks with web-based historical information.
📍 Geolocation: Determining position through embodied exploration and online research.
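
For illustration, a single benchmark task might be specified along these lines; the field names below are our own guesses at a plausible schema, not the dataset's actual format.

# Hypothetical task record; field names are illustrative, not the dataset schema.
cooking_task = {
    "task_id": "cooking_0042",
    "domain": "cooking",
    "instruction": (
        "Find an online recipe that uses only ingredients available in the "
        "kitchen, then prepare the dish."
    ),
    "start_realm": "embodied",        # the agent begins in the 3D kitchen
    "websites": ["recipe_site"],      # functional web interfaces made available
    "success_criteria": {             # checked against the final simulator state
        "dish_prepared": True,
        "ingredients_match_recipe": True,
    },
}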

💡 Example Pipeline


Figure 1: An exemplar pipeline of completing a task in our Embodied Web Agents dataset. Blue boxes indicate web interaction. Orange boxes indicate embodied interaction. Boxes with gradient colors indicate switching from one environment to the other.
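
In pseudocode, the control loop behind the pipeline in Figure 1 might look as follows. UnifiedEnv refers to the hypothetical wrapper sketched earlier, and query_vlm stands in for whichever model is under evaluation; both are assumptions, not our released harness.

def run_episode(env, query_vlm, instruction: str, max_steps: int = 100):
    """Hypothetical control loop: the model alternates freely between realms."""
    history = []
    obs = env.reset(instruction)   # assumed reset hook on the UnifiedEnv sketch
    for _ in range(max_steps):
        # The model sees the instruction, its action history, and the current
        # observation (an image in the embodied realm, page text on the web),
        # then emits the next action, which may be a cross-domain switch.
        action = query_vlm(instruction, history, obs)
        if action["type"] == "stop":
            break
        obs = env.step(action)
        history.append((action["type"], obs.domain))
    return env.evaluate()  # assumed task-specific success check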

3. Experiments and Results

We evaluate our framework across diverse domains, revealing significant performance gaps between current AI systems and human capabilities.

Outdoor Environment Evaluation (Navigation • Shopping • Tourism)

Navigation accuracy: 34.72% • Shopping accuracy: 25.46% • Tourism accuracy: 30.91%

We evaluate performance across navigation, shopping, and tourism tasks using GPT-4o-mini, Gemini 2.0 Flash, Qwen-VL-Plus, and InternVL2.5-latest, following VisualWebArena settings for web observation.

1. Model Performance: GPT-4o-mini consistently leads across all metrics, though it still falls well below human performance. Gemini 2.0 Flash follows closely, while Qwen-VL-Plus and InternVL2.5 lag behind.
2. Domain Comparison: Web-only accuracy exceeds embodied-only accuracy for all outdoor tasks, suggesting that models handle digital information more effectively than physical navigation.
3. Task Complexity: Shopping and tourism involve richer cross-domain interactions and longer action sequences, resulting in noticeably lower accuracy than navigation tasks.

Indoor Cooking Evaluation (Vision-based • Text-based)

Best AI accuracy: 6.4% • Human accuracy: 77.08% • Modality performance: text-based > vision-based

We implement two distinct approaches for indoor cooking tasks: vision-based, using first-person views, and text-based, using structured scene graphs; a sketch of the text observation format follows the findings below.

1. Performance Gap: A substantial gap exists between AI models and humans, with the best model (text-based GPT-4o) achieving only 6.4% accuracy versus humans' 77.08%.
2. Modality Comparison: Text-based models consistently outperform their vision-based counterparts, suggesting that current models struggle with visual grounding in cooking contexts.
3. Model Ranking: GPT-4o and Gemini-2.0-Flash demonstrate substantially stronger performance than the Qwen and InternVL variants.
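
To make the text-based observation concrete, here is a hedged sketch of flattening a simple object-centric scene graph into text; the record format loosely follows AI2-THOR object metadata, and the exact schema in our environments may differ.

def scene_graph_to_text(scene_graph: dict) -> str:
    """Flatten an object-centric scene graph into a textual observation.

    Assumes records like {"objectType": "Tomato", "receptacle": "CounterTop",
    "isSliced": True}, loosely modeled on AI2-THOR object metadata.
    """
    lines = []
    for obj in scene_graph["objects"]:
        desc = f"{obj['objectType']} on {obj.get('receptacle', 'the floor')}"
        states = [k[2:].lower() for k in ("isSliced", "isCooked", "isOpen") if obj.get(k)]
        if states:
            desc += " (" + ", ".join(states) + ")"
        lines.append("- " + desc)
    return "You see:\n" + "\n".join(lines)

# scene_graph_to_text({"objects": [
#     {"objectType": "Tomato", "receptacle": "CounterTop", "isSliced": True}]})
# -> "You see:\n- Tomato on CounterTop (sliced)"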

Geolocation Task Evaluation (Active Exploration • Web Integration)

We benchmark against FairLocator, analyzing VLM performance on GeoGuessr with Google Street View images and comparing our active exploration approach against passive baselines; a sketch of the exploration loop follows the findings below.

1. Active vs. Passive: Our agent, equipped with active exploration and web access, significantly outperforms passive baselines, particularly for fine-grained location identification.
2. Granularity Performance: Gains are most substantial when identifying cities and streets, compared to country-level localization.
3. Integration Benefits: The results underscore the potential of integrating the embodied and web domains for enhanced real-world task performance.
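
A rough sketch of such an active-exploration loop is given below, assuming a hypothetical street-view client and a generic web search function; all names are illustrative, not our released implementation.

def geolocate(street_view, query_vlm, search_web, max_moves: int = 10):
    """Hypothetical active geolocation: look, move, and verify until confident."""
    evidence = []
    for _ in range(max_moves):
        frame = street_view.current_view()
        # Ask the model to read visible cues (signage, language, architecture)
        # and choose to move, verify a hypothesis on the web, or answer.
        decision = query_vlm(frame, evidence)
        if decision["type"] == "move":
            street_view.move(decision["direction"])  # e.g., pan left or step forward
        elif decision["type"] == "search":
            evidence.append(search_web(decision["query"]))
        else:  # "answer"
            return decision["location"]  # country / city / street guess
    return None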

4. Error Analysis

We analyze failure patterns in GPT-4o cooking tasks to understand the primary challenges in embodied web agent integration.

Error Type Distribution in Cooking Tasks

Cross-Domain Errors (66.6%): Failures at the intersection where the physical and digital domains meet; agents become trapped in single-domain cycles.
Embodied Errors (14.6%): Issues with physical-world perception, planning, and action execution in the embodied environment.
Web Errors (8.0%): Problems with web interface interaction, information retrieval, and digital reasoning.
Other Errors (10.8%): Miscellaneous failures, including system errors, timeouts, and unexpected behaviors.

Critical Bottleneck Identified: The most prevalent failure pattern involves agents becoming trapped in single-domain cycles, with cross-domain errors overwhelmingly dominating the failure landscape.

πŸ” Key Insights

Our analysis reveals that the primary challenges in embodied web agents lie not in isolated capabilities, but in their integration across domains.

1. Domain Integration Challenge: Cross-domain errors (66.6%) far exceed individual-domain errors, indicating that seamless integration between the physical and digital realms remains the primary technical challenge.
2. Single-Domain Traps: Agents frequently become stuck in repetitive cycles within one domain, failing to transition between embodied and web interactions when required; a detection sketch follows this list.
3. Relative Domain Performance: While both embodied (14.6%) and web (8.0%) errors occur, their individual rates are far lower than the cross-domain failure rate, suggesting competency within isolated domains.
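
As a rough illustration of how such single-domain traps could be flagged automatically, the sketch below counts realm switches in a logged trajectory; the heuristic and the (action_type, domain) log layout are our own simplification, not the paper's annotation procedure.

def is_single_domain_trap(trajectory, required_switches: int = 1) -> bool:
    """Flag a failed episode that never crossed realms as the task required.

    `trajectory` is a list of (action_type, domain) pairs, as logged by the
    hypothetical run_episode loop sketched earlier.
    """
    switches = sum(
        prev_domain != cur_domain
        for (_, prev_domain), (_, cur_domain) in zip(trajectory, trajectory[1:])
    )
    return switches < required_switches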

📊 Performance Implications

This error distribution confirms that the critical bottleneck emerges at the intersection where physical and digital domains meet, rather than within individual domain capabilities. Future research should prioritize developing more sophisticated cross-domain coordination mechanisms and transition strategies for embodied web agents.

5. Citation

@misc{hong2025embodiedwebagentsbridging,
  title={Embodied Web Agents: Bridging Physical-Digital Realms for Integrated Agent Intelligence},
  author={Yining Hong and Rui Sun and Bingxuan Li and Xingcheng Yao and Maxine Wu and Alexander Chien and Da Yin and Ying Nian Wu and Zhecan James Wang and Kai-Wei Chang},
  year={2025},
  eprint={2506.15677},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2506.15677}
}