From Retrieval to Reasoning: The Architectural Evolution of Information Systems for Large Language Models
From RAG to multi-agent systems: my GPT-5 testing reveals how AI architectural evolution makes structured data essential for website visibility.
The architecture of AI information systems is evolving, and your website’s visibility to AI agents depends entirely on understanding this shift. After testing GPT-5 and analyzing frontier AI systems, I’ve identified three distinct architectural paradigms that determine whether your content appears in AI-generated responses—or gets ignored completely.
For publishers watching referral traffic decline, e-commerce managers seeing competitors dominate AI shopping assistants, and website owners struggling to appear in ChatGPT or Claude responses, the implications are immediate: websites that don’t adapt to these new retrieval architectures will become invisible to the next generation of AI-powered search.
Testing confirms the divide: sites with comprehensive structured data appear accurately in AI responses; those without risk being misunderstood or ignored entirely. What most don’t realize is that structured data visibility varies dramatically across different AI tool architectures, creating both risks and opportunities for those who act quickly. And before you jump to conclusions: no, LLMs do not read your structured data directly. What actually happens will surprise most publishers.
The Three-Phase Evolution: From Static Knowledge to Dynamic Reasoning
Phase 1: Foundational RAG (Retrieval-Augmented Generation)
The first phase tackled what I call the LLM’s “static knowledge problem.” By linking models to external vector databases—effectively extending their memory—RAG reduced hallucinations and kept answers current. A web index from providers like Bing or Google became essential, allowing models to draw from a broader, regularly refreshed snapshot of the internet. Yet limitations persisted: RAG couldn’t query live systems, handle temporal questions effectively, or deliver precise results for complex, multi-constraint requests (e.g., “All horror movies filmed in Italy in 2023” or “The best Montepulciano d’Abruzzo wines from 2021 under €25”).
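To make the pattern concrete, here is a minimal sketch of the Phase 1 loop: retrieve the passages most similar to the query, then paste them into the prompt. The document store, similarity function, and prompt template are illustrative stand-ins for a real vector database and embedding model.

```python
# Minimal sketch of the Phase 1 RAG loop. A real system would use a vector
# database and an embedding model; a toy word-overlap score stands in for
# vector similarity here.

DOCUMENTS = [
    "Montepulciano d'Abruzzo is a red wine from the Abruzzo region of Italy.",
    "Tiramisu is an Italian dessert made with mascarpone, espresso and ladyfingers.",
    "RAG augments a language model with passages retrieved at query time.",
]

def similarity(query: str, doc: str) -> float:
    """Stand-in for cosine similarity between embeddings."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / (len(q) or 1)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the top-k passages used to ground the model's answer."""
    return sorted(DOCUMENTS, key=lambda doc: similarity(query, doc), reverse=True)[:k]

def build_prompt(query: str) -> str:
    """'Prompting with data': retrieved passages are pasted into the context."""
    context = "\n".join(retrieve(query))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

print(build_prompt("What is Montepulciano d'Abruzzo?"))
```

Everything the model can ground on has to fit into that pasted context, which is exactly why this phase struggles with live data and multi-constraint queries.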
Phase 2: Agentic Retrieval
The second phase solved the “dynamic knowledge problem” through a sophisticated two-step process revealed by my analysis of frontier models like GPT-5:
- A search action returns snippets rich in pre-digested metadata: authors and dates (arXiv), release versions (GitHub), event details, recipe yields.
- A metadata-based decision follows about which URLs are worth opening for deeper reading.
This represents a shift from “prompting with data” to “prompting with a reference to data.”
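A hedged sketch of that two-step loop, assuming hypothetical web_search and open_url tools whose snippet fields mirror the metadata observed in my tests:

```python
# Illustrative sketch of the two-step agentic retrieval loop. `web_search` and
# `open_url` are hypothetical stand-ins for the provider's tools; the snippet
# fields mirror the metadata observed in search results during testing.

from dataclasses import dataclass

@dataclass
class Snippet:
    url: str
    title: str
    author: str | None
    date: str | None
    preview: str

def web_search(query: str) -> list[Snippet]:
    """Step 1: the search tool returns pre-digested metadata, not raw HTML."""
    return [
        Snippet("https://example.org/tiramisu", "Classic Tiramisu",
                "Giada De Laurentiis", "2023-12-06", "Makes 8 servings..."),
        Snippet("https://example.org/forum-thread", "tiramisu help??",
                None, None, "does anyone have a recipe"),
    ]

def choose_urls(snippets: list[Snippet]) -> list[str]:
    """Step 2: a metadata-based decision about which pages are worth opening."""
    return [s.url for s in snippets if s.author and s.date]

def open_url(url: str) -> str:
    """Only pages that passed the metadata filter are fetched and read."""
    return f"<synthesized page representation for {url}>"

for url in choose_urls(web_search("tiramisu recipe")):
    print(open_url(url))
```

The decision about which pages deserve a full read happens entirely on snippet metadata, which is why the quality of the metadata you expose determines whether an agent ever opens your page.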
Phase 3: Multi-Agent Systems
The current frontier tackles the “complexity problem”—queries requiring multi-hop reasoning across heterogeneous sources. Architectures like Baidu’s TURA framework use a “Planner” agent to decompose tasks into a DAG (Directed Acyclic Graph), executed by specialized agent teams. This enables parallel, collaborative problem-solving that mirrors human research methodologies.
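TURA’s internals aren’t public, so the following is only an illustration of the general pattern: a planner decomposes a query into a task DAG, and independent branches can be handed to specialized agents in parallel. The task names and agents are invented for the example.

```python
# Illustrative sketch of planner-style task decomposition into a DAG.
# Independent branches (menu and price checks) could run in parallel;
# here they are simply executed in a valid topological order.

from graphlib import TopologicalSorter

# Hypothetical decomposition of "best gluten-free pizza in Trastevere under €15"
task_graph = {
    "find_restaurants": set(),                        # no dependencies
    "check_gluten_free_menu": {"find_restaurants"},
    "check_price_range": {"find_restaurants"},
    "rank_and_answer": {"check_gluten_free_menu", "check_price_range"},
}

AGENTS = {
    "find_restaurants": lambda: "search agent queries a local-business index",
    "check_gluten_free_menu": lambda: "menu agent reads structured Menu data",
    "check_price_range": lambda: "pricing agent filters by priceRange",
    "rank_and_answer": lambda: "synthesis agent merges results into an answer",
}

for task in TopologicalSorter(task_graph).static_order():
    print(f"{task}: {AGENTS[task]()}")
```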
Behind the Curtain: How Modern AI Retrieves Information
My testing of GPT-5’s web search capabilities (along with Dan Petrovic’s testing of Gemini’s search tools) reveals sophisticated metadata extraction that goes far beyond text scraping.
Testing Recipe Content: When I queried for “tiramisu recipe,” GPT-5’s search tool returned rich metadata directly in snippets:
- Author names and publication dates
- Recipe yields and preparation times
- Ingredient lists and instruction previews
- Source credibility indicators
Cross-Content Analysis: Testing across different content types revealed systematic metadata extraction:
Content Type | Metadata Surfaced | Example |
---|---|---|
Scientific Papers | Authors, dates, abstracts, citation counts | arXiv papers with full author lists and submission dates |
GitHub Repositories | Release versions, feature highlights, install commands | “v1.5.0 features” and “pip install” snippets |
Apps | Ratings, download counts, developer info | “3.9 stars, 50M+ downloads, Niantic Inc.” |
Government Data | Publishers, file formats, update dates, licenses | “Updated: Aug 2025, Format: JSON/Excel, Publisher: Bureau of Labor Statistics” |
The Key Insight: In a separate test on TripAdvisor using OpenAI’s GPT-OSS-120B, the model identified a schema:Restaurant entity with nested properties, ratings, and reviews—clear evidence that retrieval systems surface structured metadata for AI use.
But let’s be precise: the LLM doesn’t access structured data or raw HTML directly; it receives a sanitized snippet from the retrieval layer and, if it “opens” a page, a synthesized representation rather than the full source.
Real-World Evidence: How AI Systems Discover Structured Data Endpoints
A particularly revealing test emerged when querying GPT-5 about a specific product variant from a WordLift e-commerce client. The search surfaced not just the product page, but the company’s dedicated structured data endpoint containing complete product metadata.
The Query Process:
- Input: Product variant number (a 12-digit GTIN)
- AI Recognition: System identified this as a Global Trade Item Number
- Discovery: Found both the official product page AND the structured data endpoint
- Access: Gained complete product knowledge graph in a single retrieval
Critical Insight: The AI system didn’t just find content about the product—it discovered the machine-readable database behind it. This demonstrates that sophisticated retrieval systems are now capable of:
- Entity-based discovery: Searching by persistent identifiers (GTINs, ISBNs, etc.)
- Endpoint detection: Finding dedicated structured data URLs beyond main content pages
- Complete graph access: Retrieving entire entity relationship networks in one query
This represents the future of AI-commerce interaction: instead of scraping product descriptions, AI agents will query structured endpoints directly, accessing real-time pricing, inventory, specifications, and relationship data.
Strategic Implication: E-commerce sites with comprehensive structured data endpoints become the authoritative source for AI agents, while those relying solely on traditional product pages risk being bypassed entirely.
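As a sketch of how such entity-based discovery can work, the snippet below recognizes a GTIN in a query and builds a GS1 Digital Link-style URL for it. The domain, endpoint pattern, and response shape are assumptions for illustration, not the actual client implementation.

```python
# Hypothetical sketch of GTIN-based entity discovery. The URL pattern follows
# the GS1 Digital Link convention (https://domain/01/{GTIN}); the domain and
# the expected response shape are illustrative only.

import json
import re

def looks_like_gtin(text: str) -> str | None:
    """Recognize an 8-, 12-, 13- or 14-digit Global Trade Item Number."""
    match = re.search(r"\b\d{8}(?:\d{4,6})?\b", text)
    return match.group(0) if match else None

def digital_link(domain: str, gtin: str) -> str:
    """Build a GS1 Digital Link that resolves to machine-readable product data."""
    return f"https://{domain}/01/{gtin}"

query = "price and availability for 012345678905"
gtin = looks_like_gtin(query)
if gtin:
    endpoint = digital_link("shop.example.com", gtin)
    # An agent fetching this endpoint would expect a JSON-LD product graph:
    expected = {"@type": "Product", "gtin": gtin, "offers": {"@type": "Offer"}}
    print(endpoint)
    print(json.dumps(expected, indent=2))
```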
The Critical Technical Distinction: Search vs. Direct Access
My testing revealed a crucial architectural limitation that most publishers don’t understand: structured data visibility varies dramatically between different LLM tool types.
When an AI agent uses a search tool (like GPT-5’s web.search or Gemini’s google_search with groundingMetadata), it gains full access to your structured data because search engines pre-index JSON-LD, microdata, and RDFa markup. The agent receives rich, semantically enhanced snippets with complete entity information.
However, when an agent uses direct page access tools (like open_page or browse), a critical gap emerges: JSON-LD structured data becomes largely invisible. Only microdata embedded directly in HTML attributes remains accessible to the agent during direct page parsing.
Practical Impact:
<!-- This is INVISIBLE to direct page access tools -->
<script type="application/ld+json">
{
"@context": "https://schema.org",
"@type": "Restaurant",
"name": "Giuseppe's Pizzeria",
"aggregateRating": {"ratingValue": "4.5"}
}
</script>
<!-- This IS visible to direct page access tools -->
<div itemscope itemtype="https://schema.org/Restaurant">
<h1 itemprop="name">Giuseppe's Pizzeria</h1>
<span itemprop="aggregateRating" itemscope itemtype="https://schema.org/AggregateRating">
<span itemprop="ratingValue">4.5</span> stars
</span>
</div>
This explains why some AI responses perfectly understand your structured data (search-mediated access) while others miss the same information entirely (direct page access). As agentic systems evolve beyond search-engine dependency toward direct API interactions, this limitation will become more pronounced.
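A toy simulation of that gap, under the assumption (consistent with the tests above) that the page snapshot handed to the model is text-oriented: script payloads are dropped, while attribute-based microdata survives because its values live in the visible markup.

```python
# Toy simulation of why JSON-LD can vanish during direct page access.
# The "snapshot" keeps visible text and itemprop hints and discards the
# contents of <script> tags, mirroring a text-oriented page representation.

from html.parser import HTMLParser

PAGE = """
<script type="application/ld+json">{"@type": "Restaurant", "name": "Giuseppe's Pizzeria"}</script>
<div itemscope itemtype="https://schema.org/Restaurant">
  <h1 itemprop="name">Giuseppe's Pizzeria</h1>
  <span itemprop="ratingValue">4.5</span> stars
</div>
"""

class TextOnlySnapshot(HTMLParser):
    """Keeps visible text and itemprop markers, drops script payloads."""
    def __init__(self):
        super().__init__()
        self.in_script = False
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
        for name, value in attrs:
            if name == "itemprop":
                self.parts.append(f"[{value}]")

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if not self.in_script and data.strip():
            self.parts.append(data.strip())

parser = TextOnlySnapshot()
parser.feed(PAGE)
print(" ".join(parser.parts))  # the JSON-LD block is gone; microdata values remain
```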
Strategic Response: Implement dual structured data strategies—maintain JSON-LD for search engine indexing while supplementing with microdata and semantic HTML for direct agent access. This defensive approach ensures compatibility across the entire evolutionary spectrum from current mixed-tool systems to future sophisticated agent architectures.
Here is the metadata observed by GPT-5 when the web.search tool is invoked on a recipe website.
Metadata Field | Example in Snippet |
---|---|
Author | Giada De Laurentiis, Rick Rodgers |
Date Published/Updated | March 31 2006, December 6 2023 |
Recipe Yield | “Makes 8 servings”, “4 Servings” |
Ingredients Mention | Yes — partial lists or key items |
Descriptive Summary | Quick ingredient notes or style variations |
Tags/Keywords | Often footnotes of recipe categories |
Search Engine Routing: The testing revealed that different queries trigger different underlying search engines:
Google-style indicators: “People also ask” phrasing, arXiv citation counts, detailed research metadata, dataset licensing information
Bing-style indicators: Aggressive date formatting, rich inline author names, GitHub release tags, “Top 10” listicle formats
This aligns with Aleyda Solis’s research showing ChatGPT’s reliance on Google SERP snippets, though the routing appears more nuanced than single-provider dependency.
Why Structured Data Is Now Critical
My experiments with GPT-OSS-120B and GPT-5 confirm a fundamental shift: AI models are moving from processing text to interpreting structured data. When I queried for “Gluten-Free Pizza in Trastevere,” the model synthesized a comprehensive knowledge panel with structured tables and verifiable source provenance rather than returning simple links.
The model processes a page’s explicit knowledge graph, not just its unstructured text.
This leads to two strategic imperatives:
- Entities over Keywords: AI retrieves “things” (entities with attributes), not “strings” (keywords). Success depends on providing machine-readable data that clearly describes these entities.
- Structured Data as a Grounding Protocol: Schema.org in JSON-LD is no longer just for Google’s rich snippets—it’s the primary protocol for providing factual, verifiable grounding to LLMs and AI agents.
However, this grounding protocol has architectural dependencies—JSON-LD structured data is fully accessible through search-mediated retrieval but may be invisible during direct page access, requiring defensive markup strategies.
Practical takeaway for publishers:
The metadata visible in search snippets—author names, publication dates, ratings, prices—comes directly from your structured data. Sites with comprehensive schema markup appear accurately in AI responses; those without risk being misunderstood or ignored entirely.
Building Agent-Ready Websites
The economic data tells the story: In Q1 2025, AI bot traffic across the TollBit network (a monetization provider for AI traffic) nearly doubled (+87%), with RAG bot scrapes rising 49%. Yet AI apps accounted for just 0.04% of external referral traffic versus Google’s 85%.
An agent-ready website transitions from passive document repository to active, queryable knowledge source, offering specific tools for AI agents:
- Entity Search Endpoints: Allow agents to perform disambiguated lookups using unique entity IDs (see the sketch below)
- Semantic Content Search: Enable faceted searches based on underlying entities and topics
- Relationship Extraction: Permit agents to query connections between entities
- GS1 Digital Link Resolvers: Essential for e-commerce, providing real-time product data
Technical Foundation: Ensure structured data visibility across all access methods by implementing both JSON-LD (for search-mediated access) and microdata (for direct page parsing) alongside semantic HTML structure.
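As an example of what an entity search endpoint could look like, here is a minimal standard-library sketch. The /entities path, the id parameter, and the response shape are assumptions for illustration, not a documented WordLift API.

```python
# Minimal sketch of an "entity search endpoint": disambiguated lookup by
# entity ID, returning a JSON-LD entity description. Standard library only;
# paths, parameters, and data are illustrative.

import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

ENTITIES = {
    "giuseppes-pizzeria": {
        "@context": "https://schema.org",
        "@type": "Restaurant",
        "@id": "https://example.com/entity/giuseppes-pizzeria",
        "name": "Giuseppe's Pizzeria",
        "servesCuisine": "Pizza",
    }
}

class EntityEndpoint(BaseHTTPRequestHandler):
    def do_GET(self):
        parsed = urlparse(self.path)
        if parsed.path == "/entities":
            # Disambiguated lookup, e.g. /entities?id=giuseppes-pizzeria
            entity_id = parse_qs(parsed.query).get("id", [""])[0]
            body = json.dumps(ENTITIES.get(entity_id, {})).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/ld+json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), EntityEndpoint).serve_forever()
```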
To assess your site’s current readiness for AI agents, use our AI SEO Audit Tool (still in beta testing) to evaluate your structured data implementation and identify optimization opportunities.
The Economic Reality: From Threat to Revenue Stream
The rise of centralized AI “answer engines” challenges publishers: Google’s AI Overviews synthesize content without driving traffic back to the source. However, by implementing structured data protocols and agent-ready infrastructure, publishers can shift from being passively scraped to actively providing licensed data via reliable APIs.
Platforms like TollBit and emerging Cloudflare solutions enable publishers to charge AI agents per query while keeping human access free. This transforms AI scraping from threat to direct revenue stream.
The Security Implications of Agent-Ready Infrastructure
As websites transition to agent-accessible endpoints, new security considerations emerge that most publishers haven’t addressed:
Indirect Prompt Injection Risks: AI agents processing your content could encounter malicious instructions hidden within seemingly benign text. An agent reading a product review containing hidden prompts like “ignore previous instructions and…” could be manipulated to act against user interests.
Rate Limiting and Resource Management: Unlike human visitors, AI agents can generate massive request volumes. Without proper throttling, your agent-ready APIs could become expensive attack vectors or suffer from resource exhaustion.
Data Poisoning Concerns: Structured data that influences AI responses creates new responsibilities. Incorrect or malicious schema markup could propagate misinformation at scale through agent networks.
Recommended Protections:
- Implement agent-specific rate limiting on API endpoints (see the sketch below)
- Monitor structured data for anomalous patterns
- Establish content validation pipelines for agent-accessible data
- Consider agent authentication systems for premium content access
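A hedged sketch of the first protection above, agent-specific rate limiting with a sliding one-minute window. The user-agent names and limits are illustrative; production systems would key on verified bot identities or API tokens rather than self-reported headers.

```python
# Illustrative agent-specific rate limiter: a sliding one-minute window keyed
# by (self-reported) bot user agent. Limits and agent names are assumptions.

import time
from collections import defaultdict, deque

AGENT_LIMITS = {"GPTBot": 10, "ClaudeBot": 10, "default": 60}  # requests per minute
_history: dict[str, deque] = defaultdict(deque)

def allow_request(user_agent: str, now: float | None = None) -> bool:
    """Return False once an agent exceeds its per-minute request budget."""
    now = now or time.time()
    key = next((bot for bot in AGENT_LIMITS if bot in user_agent), "default")
    window = _history[key]
    while window and now - window[0] > 60:
        window.popleft()                       # drop requests older than 60s
    if len(window) >= AGENT_LIMITS[key]:
        return False
    window.append(now)
    return True

print(allow_request("Mozilla/5.0 (compatible; GPTBot/1.1)"))  # True until the budget is spent
```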
The Strategic Divide: Open vs. Closed Agentic Ecosystems
The industry is crystallizing around two competing visions for the agentic web:
Microsoft’s Open Ecosystem Strategy:
- Championing protocols like MCP and NLWeb for interoperability
- Positioning Azure as infrastructure provider within competitive landscape
- Enabling agent-to-agent communication across different platforms
Google’s Integrated Approach:
- Vertically integrated systems within Google Cloud ecosystem
- Tight coupling between Gemini models and Google’s data stack
- Emphasis on seamless experience within proprietary boundaries
Strategic Implications for Publishers:
- Hedge Your Bets: Implement open standards (MCP, Schema.org) while maintaining compatibility with major platforms
- Platform Diversification: Avoid over-dependence on any single AI ecosystem
- Future-Proofing: Open protocols provide insurance against platform lock-in as the landscape consolidates
WordLift’s Role in the Agentic Web
At WordLift, we recognized this shift early. While others focused on building better AI models, we’ve been building the infrastructure layer that makes the web truly queryable:
- Comprehensive entity recognition and knowledge graph construction
- Schema.org markup automation at scale
- API endpoints for semantic search and entity relationship queries
- Integration with emerging protocols like Model Context Protocol (MCP)
- Agentic SEO solutions for automated marketing tasks
Through our MCP configuration, we’re enabling websites to serve as live data endpoints powering AI workflows. What was once purely a threat is now a dual opportunity: a data-centric web driving marketing efficiency and the foundation for agent-driven commerce and content monetization.
Underpinning this evolution is structured data—the rich metadata enabling intelligent agent behavior. As reasoning demands become more relational, the future belongs to GraphRAG: retrieving directly from knowledge graphs that provide cognitive scaffolding for reliable, complex reasoning.
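A minimal illustration of the GraphRAG idea: instead of retrieving passages, the system collects triples within a few hops of a seed entity and hands them to the model as context. The graph content and traversal are invented for the example.

```python
# Minimal sketch of GraphRAG-style retrieval: multi-hop collection of triples
# around a seed entity, serialized as grounding context for the model.

GRAPH = [
    ("Giuseppe's Pizzeria", "servesCuisine", "Gluten-Free Pizza"),
    ("Giuseppe's Pizzeria", "locatedIn", "Trastevere"),
    ("Giuseppe's Pizzeria", "aggregateRating", "4.5"),
    ("Trastevere", "partOf", "Rome"),
]

def neighborhood(entity: str, hops: int = 2) -> list[tuple[str, str, str]]:
    """Collect triples within `hops` of the seed entity (multi-hop retrieval)."""
    frontier, collected = {entity}, []
    for _ in range(hops):
        new_frontier = set()
        for s, p, o in GRAPH:
            if s in frontier or o in frontier:
                collected.append((s, p, o))
                new_frontier.update({s, o})
        frontier = new_frontier
    return list(dict.fromkeys(collected))  # keep order, drop duplicates

context = "\n".join(f"{s} {p} {o}" for s, p, o in neighborhood("Trastevere"))
print(f"Context for the LLM:\n{context}")
```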
What This Means for Your Business
The question for every digital business is: when an AI agent queries your domain, will it find a flat document to parse, or a rich database to interrogate? Will it even be able to access your website?
The SEO community has the tools, expertise, and responsibility to shape this agentic web. By leading on structured data standards, building API-first content systems, and negotiating fair access for AI agents, we can ensure this shift benefits publishers, brands, and users—human or machine.
The publishers who succeed will be those who act now to:
- Establish agent-accessible APIs
- Implement comprehensive structured data markup
- Build entity-centric content architectures
- Create machine-readable knowledge layers
The agentic web is already here. It’s on us to build it.
Frequently Asked Questions
How exactly does GPT-5’s web browsing work technically?
GPT-5 operates with two distinct tools that work very differently:
- web.search: Sends queries to search providers (usually Bing) and returns a JSON list with titles, snippets, and URLs. Importantly, this doesn’t include HTML or structured data from actual pages—just what the search API provides.
- web.open_url: Fetches a snapshot of a specific URL and reads the HTML/markup directly. This is a separate, explicit step that can be run on URLs from search results.
This two-tool architecture explains why structured data visibility varies: search results include pre-processed metadata from indexing, while direct URL access only sees what’s embedded in the HTML markup itself.
Q: Do LLMs read structured data directly?
A: No, they don’t. This is a common misconception. The process happens at the search engine level, not the LLM level. Search engines like Google and Bing pre-process and index structured data (JSON-LD, microdata, RDFa) during crawling. When an AI agent uses the search tool, it receives rich snippets that include this pre-processed structured information. The LLM never sees your raw JSON-LD—it sees the search engine’s interpretation of it.
Q: Why do some AI responses include my structured data while others miss it completely?
A: This depends on which tool the AI agent uses:
- Search-mediated access: Full structured data visibility through pre-processed snippets
- Direct page access: Limited to microdata and semantic HTML only
As AI systems evolve toward more direct interactions (bypassing search engines), this disparity will become more pronounced, making dual markup strategies essential.
What’s the difference between being “AI-visible” and being “search-visible”?
Traditional SEO focuses on ranking in search results for human users. AI visibility means your content can be discovered, understood, and cited by AI agents across different access methods. This requires:
- Comprehensive structured data for search-mediated discovery
- Microdata and semantic HTML for direct agent access
- Entity-based content architecture for relationship queries
- API endpoints and MCP support for sophisticated agent interactions
Should I prioritize JSON-LD or microdata for AI visibility?
Implement both when possible. JSON-LD remains crucial for search engine indexing and search-mediated AI access, while microdata is currently the more dependable option for direct agent interactions. A defensive strategy uses JSON-LD for comprehensive entity definition and microdata for the most critical properties that agents need during direct page access.
How can I test if my site is properly visible to AI agents?
Start with our AI SEO Audit Tool to evaluate your structured data implementation, entity coverage, and AI readiness across multiple factors.
References
This analysis draws from testing of GPT-5, cross-platform analysis using GPT-OSS-120B, Gemini 2.5 Pro, and Perplexity, and from the following resources:
- TURA Framework (Baidu): Tool-Augmented Unified Retrieval Agent research
- Model Context Protocol (MCP): Anthropic’s open protocol specification
- Microsoft NLWeb Initiative: Natural Language Web documentation
- Aleyda Solis Research: ChatGPT’s reliance on Google SERP snippets
- Dan Petrovic’s Gemini Testing: Analysis of Google’s search tool capabilities
- Google AI Overviews Impact: Click-through rate analysis and publisher implications
- Cloudflare Agent Economics: Pay-per-crawl proposals and infrastructure costs
- AI SEO Audit Tool: https://wordlift.io/ai-audit/
- MCP Server Implementation: WordLift’s agent-ready infrastructure