Discovery Infrastructure for AI Agents: llms.txt, agents.json, OpenAPI, and Semantic HTML — A Course Lesson

Learning Objectives

By the end of this lesson, you will be able to:

Explain why discovery infrastructure is a distinct economic layer in agent-powered systems, not merely a technical convenience
Describe the function and format of each of the four primary discovery mechanisms: llms.txt, agents.json, OpenAPI specifications, and semantic HTML patterns
Identify how discovery quality affects agent build-vs-buy decisions, service provider concentration, and switching costs
Apply practical implementation patterns whether you are publishing a service for agent consumption or building an agent that consumes external services
Evaluate the competitive dynamics of the emerging discovery market, including standardization risks and moat-building opportunities

1. Why Discovery Infrastructure Matters for Agent Economics

Autonomous agents do not browse the web the way humans do. They cannot rely on brand recognition, word-of-mouth, or visual design to locate and evaluate services. Instead, they depend on structured signals that answer three questions rapidly and reliably:

What does this service do?
How do I call it?
What will it cost me, and what constraints apply?

When those signals are absent or ambiguous, agents incur discovery friction — the computational and latency cost of inferring capability from unstructured content, attempting failed API calls, or falling back to general-purpose tools that are less efficient for the task.

Discovery friction is not a minor inconvenience. In agent economics, every tool-call has a cost: inference tokens to reason about the call, latency added to the task pipeline, and the opportunity cost of a suboptimal tool choice. Multiply that across thousands of agent runs and the aggregate cost of poor discovery infrastructure becomes significant.

The core economic argument: Discovery infrastructure is a cost-reduction layer that sits upstream of execution. A service that is easy to discover and correctly characterize will be selected more often, integrated more cheaply, and retained longer than an equivalent service that is hard to parse. This makes discovery infrastructure a direct driver of revenue for service providers and a direct driver of efficiency for agent builders.

This lesson treats discovery not as a developer convenience but as an economic primitive — something that shapes market structure, concentration, and competitive advantage in the agent economy.

2. The Four Pillars of Agent Discovery

Agent discovery infrastructure has converged around four complementary mechanisms. Each operates at a different layer of the stack and serves a different phase of the agent's decision process.

2.1 llms.txt: Human-Readable Service Catalogs

What it is: llms.txt is a proposed convention: a Markdown-formatted text file placed at the root of a domain (e.g., https://example.com/llms.txt) that describes the service in natural language optimized for language model consumption. It is loosely analogous to robots.txt, which addresses crawlers, but its audience is an LLM reasoning about whether and how to use the service.

What it contains:
- A concise description of what the service does and who it is for
- The primary use cases the service supports
- Pointers to more structured resources (API docs, agents.json, OpenAPI specs)
- Any usage policies relevant to automated access (rate limits, authentication requirements, prohibited uses)
- Contact or support information for agent operators

Why it matters economically: When an agent is given a task that requires external capability, it may receive a domain name or URL as a starting point. llms.txt allows the agent to quickly determine fit without parsing marketing copy, navigating JavaScript-heavy pages, or making speculative API calls. This reduces the token cost of capability assessment and increases the probability that a genuinely suitable service gets selected.

Format guidance:
- Keep it under 1,000 words; agents do not benefit from verbose prose
- Use short paragraphs or bullet lists; dense walls of text increase parsing cost
- Be explicit about what the service does not do, because false positives in capability assessment are expensive
- Version-stamp the file so agents can detect staleness

Limitations: llms.txt is unstructured and relies on the LLM's ability to interpret natural language correctly. It is best used as a first-pass filter, not as a complete capability specification. For precise integration, agents need the machine-readable formats described below.
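
The first-pass filter described above can be sketched in a few lines. This is a minimal illustration, assuming the file follows the Markdown layout used by the llms.txt proposal (an H1 title, a blockquote summary, and H2 sections containing link lists). The sample content, the `parse_llms_txt` name, and the returned dictionary shape are hypothetical, not part of any standard.

```python
import re

# Hypothetical llms.txt content used only for illustration.
SAMPLE_LLMS_TXT = """\
# Example Data API

> Real-time financial data for agents. Does not provide trading or order execution.

## Docs

- [OpenAPI spec](https://example.com/openapi.json): full endpoint contract
- [Agent manifest](https://example.com/.well-known/agents.json): pricing and policies
"""

# Matches Markdown list items of the form "- [label](url): optional note".
LINK_RE = re.compile(
    r"^-\s*\[(?P<label>[^\]]+)\]\((?P<url>[^)]+)\)(?::\s*(?P<note>.*))?$"
)


def parse_llms_txt(text: str) -> dict:
    """Extract the title, summary, and linked resources from llms.txt text."""
    result = {"title": None, "summary": None, "links": []}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# ") and result["title"] is None:
            result["title"] = line[2:].strip()
        elif line.startswith("> ") and result["summary"] is None:
            result["summary"] = line[2:].strip()
        elif line.startswith("## "):
            section = line[3:].strip()
        else:
            match = LINK_RE.match(line)
            if match:
                result["links"].append({
                    "section": section,
                    "label": match["label"],
                    "url": match["url"],
                    "note": match["note"] or "",
                })
    return result


parsed = parse_llms_txt(SAMPLE_LLMS_TXT)
```

The extracted links give the agent deterministic pointers to the structured layers, so the LLM only needs to reason over the short summary rather than the whole site.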

2.2 agents.json: Machine-Readable Agent Capabilities

What it is: agents.json is a structured JSON file (typically at /.well-known/agents.json) that provides a machine-readable capability manifest for agent consumption. Where llms.txt speaks to the LLM's language understanding, agents.json speaks to the agent's tool-selection and orchestration logic.

What it contains:
- Service identity: name, version, canonical URL, category tags
- Capability declarations: a list of discrete capabilities the service exposes, each with a name, description, input/output schema summary, and a pointer to the full API specification
- Pricing signals: cost model (per-call, per-token, subscription), approximate price tier, and whether a free tier exists
- Authentication requirements: OAuth, API key, JWT, or open
- Rate limit metadata: requests per minute/hour, burst allowances
- Agent-specific policies: whether the service permits autonomous (unattended) use, data retention policies, and any human-in-the-loop requirements
- Reliability signals: SLA tier, uptime history URL, status page

Why it matters economically: agents.json enables programmatic capability matching. An orchestration layer can load agents.json files from a set of candidate services and perform structured comparison — cost per unit, latency SLA, capability overlap — without any LLM inference. This is dramatically cheaper than asking an LLM to read and compare documentation pages. For high-frequency agent deployments, the savings compound quickly.

Relationship to build-vs-buy decisions: When an agent's orchestration layer can read a structured capability manifest, it can make more accurate build-vs-buy comparisons. A capability that appears expensive in absolute terms may be cheap relative to the cost of fine-tuning an internal model to replicate it. agents.json makes that comparison tractable at runtime.

Format guidance:
- Follow JSON Schema conventions so agents can validate the file programmatically
- Use standardized category taxonomies where they exist (e.g., schema.org service types) to enable cross-service comparison
- Include a last_updated timestamp; stale manifests erode agent trust
- Provide a changelog_url so agents tracking service evolution can detect breaking changes
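
To make programmatic capability matching concrete, here is a minimal sketch of manifest checking and comparison with no LLM involved. Every field name (`capabilities`, `pricing.usd_per_call`, `autonomous_use_allowed`, `last_updated`) and every sample provider is an assumption made for illustration; no published agents.json schema is implied.

```python
from datetime import date

# Hypothetical minimum field set for a manifest; not taken from any standard.
REQUIRED_FIELDS = ("name", "capabilities", "pricing", "auth", "last_updated")


def manifest_problems(manifest: dict, today: date, max_age_days: int = 180) -> list[str]:
    """Return a list of reasons a manifest should not be trusted (empty if none)."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in manifest]
    if "last_updated" in manifest:
        age = (today - date.fromisoformat(manifest["last_updated"])).days
        if age > max_age_days:
            problems.append(f"stale manifest: {age} days old")
    return problems


def cheapest_provider(manifests: list[dict], capability: str, today: date):
    """Pick the lowest per-call price among valid manifests offering the capability."""
    candidates = [
        m for m in manifests
        if not manifest_problems(m, today)
        and capability in {c["name"] for c in m["capabilities"]}
        and m.get("autonomous_use_allowed", False)
    ]
    return min(candidates, key=lambda m: m["pricing"]["usd_per_call"], default=None)


TODAY = date(2025, 6, 1)  # fixed date so the example is reproducible
MANIFESTS = [
    {"name": "alpha-quotes", "capabilities": [{"name": "quote_lookup"}],
     "pricing": {"usd_per_call": 0.002}, "auth": "api_key",
     "autonomous_use_allowed": True, "last_updated": "2025-05-01"},
    {"name": "beta-quotes", "capabilities": [{"name": "quote_lookup"}],
     "pricing": {"usd_per_call": 0.001}, "auth": "oauth",
     "autonomous_use_allowed": False, "last_updated": "2025-05-20"},
    {"name": "gamma-quotes", "capabilities": [{"name": "quote_lookup"}],
     "pricing": {"usd_per_call": 0.0005}, "auth": "api_key",
     "autonomous_use_allowed": True, "last_updated": "2024-01-15"},
]

best = cheapest_provider(MANIFESTS, "quote_lookup", TODAY)
```

In this sample the two cheaper providers are excluded, one because it forbids unattended use and one because its manifest is stale, which shows how policy and freshness fields can matter as much as price.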

2.3 OpenAPI: Standardized API Contracts

What it is: OpenAPI (formerly Swagger) is a widely adopted specification format for describing REST APIs. An OpenAPI document (YAML or JSON) provides a complete, machine-readable contract for every endpoint a service exposes: paths, HTTP methods, request parameters, request bodies, response schemas, authentication flows, and error codes.

Why agents need it: OpenAPI is the execution layer of discovery. Once an agent has determined (via llms.txt or agents.json) that a service is a candidate, it needs to know precisely how to call it. OpenAPI provides that precision. Many agent frameworks — including LangChain, AutoGen, and similar orchestration tools — can ingest an OpenAPI spec and automatically generate tool definitions that the agent can invoke.

Key fields for agent consumption:

Field	Agent-Relevant Content
`info.description`	High-level service summary used in tool selection
`paths.{path}.{method}.summary`	Per-endpoint description used in action selection
`paths.{path}.{method}.operationId`	Stable identifier for tool registration
`components.schemas`	Input/output types for parameter validation
`components.securitySchemes`	Authentication method and credential format
`servers`	Base URL(s) including environment variants
`x-*` extensions	Custom agent metadata (cost hints, idempotency flags)

OpenAPI extensions for agent contexts: The x- extension namespace allows service providers to embed agent-specific metadata that the core OpenAPI spec does not cover. Useful extensions include:

x-agent-cost-hint: approximate cost per call in USD
x-idempotent: boolean indicating whether repeated calls are safe
x-human-review-required: boolean flagging endpoints that require human approval before execution
x-rate-limit-tier: named tier (e.g., "free", "pro", "enterprise") for quick cost-tier filtering
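
As a rough sketch of how an agent might consume these fields, the function below walks an OpenAPI document that has already been loaded into a dictionary and produces simple tool records. The `x-` keys are the illustrative extensions listed above, not standardized fields, and the sample spec, endpoint names, and tool record shape are hypothetical.

```python
HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options", "trace"}

# Hypothetical, heavily trimmed OpenAPI document.
SPEC = {
    "openapi": "3.0.3",
    "info": {"title": "Example Data API", "version": "1.0.0"},
    "servers": [{"url": "https://api.example.com/v1"}],
    "paths": {
        "/quotes/{symbol}": {
            "get": {
                "operationId": "getQuote",
                "summary": "Fetch the latest quote for one instrument",
                "x-agent-cost-hint": 0.001,
                "x-idempotent": True,
                "responses": {"200": {"description": "OK"}},
            }
        },
        "/alerts": {
            "post": {
                "operationId": "createAlert",
                "summary": "Create a price alert",
                "x-idempotent": False,
                "x-human-review-required": True,
                "responses": {"201": {"description": "Created"}},
            }
        },
    },
}


def tools_from_openapi(spec: dict) -> list[dict]:
    """Turn each operation into a flat tool record an orchestrator could register."""
    servers = spec.get("servers") or [{}]
    base_url = servers[0].get("url", "")
    tools = []
    for path, item in spec.get("paths", {}).items():
        for method, op in item.items():
            if method not in HTTP_METHODS:
                continue  # skip path-level keys such as "parameters"
            tools.append({
                "name": op.get("operationId", f"{method}_{path}"),
                "description": op.get("summary", ""),
                "method": method.upper(),
                "url": base_url + path,
                "cost_hint_usd": op.get("x-agent-cost-hint"),
                # Conservative defaults when the provider gives no hint.
                "idempotent": op.get("x-idempotent", False),
                "needs_human_review": op.get("x-human-review-required", False),
            })
    return tools


tools = tools_from_openapi(SPEC)
unattended = [t["name"] for t in tools if not t["needs_human_review"]]
```

Defaulting a missing `x-idempotent` to False is a deliberate choice here: an agent that retries a non-idempotent call by mistake usually does more damage than one that declines to retry a safe call.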

Why OpenAPI matters for service provider concentration: Services with well-maintained, agent-optimized OpenAPI specs get integrated into agent frameworks faster and with less custom code. This creates a compounding advantage: early, clean integration leads to higher usage, which leads to more developer familiarity, which reinforces selection. Services that publish poor or outdated OpenAPI specs face a structural disadvantage in agent-mediated markets regardless of their underlying capability quality.

2.4 Semantic HTML: Web-Native Agent Signals

What it is: Semantic HTML refers to the use of HTML elements and structured data markup (primarily Schema.org vocabulary embedded via JSON-LD, Microdata, or RDFa) to annotate web pages with machine-readable meaning. For agent discovery, the most relevant patterns are those that describe services, products, APIs, and organizations in ways that agents can parse without LLM inference.

Why it matters: Not every service will publish llms.txt or agents.json immediately. Semantic HTML is the fallback layer — the discovery signal that exists on most professionally maintained websites already, even if it was not designed with agents in mind. Agents that can parse structured data from HTML can extract capability signals from a much larger surface area of the web.

Key Schema.org types for service discovery:

Service / WebAPI: describes a service or API, including name, description, provider, and documentation URL (WebAPI is currently a pending Schema.org type, so support varies)
SoftwareApplication: describes software with pricing, platform, and feature information
Organization: provider identity, contact information, and trust signals
Offer / PriceSpecification: pricing structure, including free tiers and usage-based pricing
APIReference: marks a page as API reference documentation; it can be tied to the page through mainEntityOfPage

Practical patterns:

{
  "@context": "https://schema.org",
  "@type": "WebAPI",
  "name": "Example Data API",
  "description": "Provides real-time financial data for 50,000+ instruments.",
  "documentation": "https://example.com/docs",
  "provider": {
    "@type": "Organization",
    "name": "Example Corp"
  },
  "offers": {
    "@type": "Offer",
    "price": "0",
    "priceCurrency": "USD",
    "description": "Free tier: 100 requests/day"
  }
}

Limitations: Schema.org vocabulary was not designed for agent capability matching. It lacks fields for rate limits, authentication methods, and agent-specific policies. Semantic HTML is best treated as a discovery entry point that directs agents to richer structured resources, not as a complete specification.
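
Extracting this kind of markup does not require LLM inference. The sketch below uses only the Python standard library to pull JSON-LD blocks out of a page and keep those that describe a service. The sample page and the set of types treated as service-like are assumptions for illustration.

```python
import json
from html.parser import HTMLParser

SERVICE_TYPES = {"WebAPI", "Service", "SoftwareApplication"}

# Hypothetical documentation page embedding two JSON-LD blocks.
PAGE = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "WebAPI", "name": "Example Data API",
 "documentation": "https://example.com/docs",
 "offers": {"@type": "Offer", "price": "0", "priceCurrency": "USD"}}
</script>
<script type="application/ld+json">{"@type": "Organization", "name": "Example Corp"}</script>
</head><body><h1>Docs</h1></body></html>"""


class JsonLdExtractor(HTMLParser):
    """Collects the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                parsed = json.loads("".join(self._buffer))
            except json.JSONDecodeError:
                return  # ignore malformed blocks rather than failing the page
            self.items.extend(parsed if isinstance(parsed, list) else [parsed])


def extract_services(html: str) -> list[dict]:
    """Return JSON-LD objects whose @type looks like a service description."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    services = []
    for item in parser.items:
        if not isinstance(item, dict):
            continue
        types = item.get("@type", [])
        types = {types} if isinstance(types, str) else set(types)
        if types & SERVICE_TYPES:
            services.append(item)
    return services


services = extract_services(PAGE)
```

A real page may also carry Microdata or RDFa, which this sketch does not handle; it covers only the JSON-LD case shown in the example above.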

The layered model: The four pillars work together in a discovery stack:

Semantic HTML        → "This page describes a service; here is its category and provider"
llms.txt             → "Here is what this service does in plain language"
agents.json          → "Here is a structured capability manifest with pricing and policies"
OpenAPI              → "Here is the exact contract for calling each endpoint"

An agent moving through this stack progressively refines its understanding at increasing cost and precision. Well-instrumented services publish all four layers; the agent uses whichever layer is sufficient for the current decision.

3. How Discovery Infrastructure Reduces Agent Friction

3.1 Discovery as a Cost Reduction Layer

Agent friction has three components:

Search cost: The cost of finding candidate services for a task
Evaluation cost: The cost of assessing whether a candidate service fits the task
Integration cost: The cost of writing and testing the code to call the service

Discovery infrastructure attacks all three:

llms.txt and agents.json reduce search cost by putting capability summaries at predictable locations; agents.json can additionally be indexed and filtered without any LLM inference
agents.json and OpenAPI reduce evaluation cost by providing structured capability and pricing data that can be compared programmatically
OpenAPI reduces integration cost by enabling automatic tool generation in agent frameworks

The aggregate effect is that well-instrumented services are cheaper to adopt. In competitive markets, lower adoption cost translates directly to higher selection frequency, all else equal.

Quantifying the reduction: While precise benchmarks vary by context, the directional logic is clear. Parsing a well-formed OpenAPI spec to generate a tool definition is a deterministic, low-latency operation. Asking an LLM to infer the same information from unstructured documentation requires multiple inference calls, introduces error risk, and adds latency to every new integration. For agent fleets running thousands of tasks, the difference is material.
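
The directional argument can be turned into back-of-the-envelope arithmetic. Every number below is a made-up placeholder rather than a benchmark; the point is only to show which variables drive the gap between LLM-inferred and spec-driven integration.

```python
def llm_inference_cost_usd(
    integrations: int,
    llm_calls_per_integration: int,
    tokens_per_call: int,
    usd_per_1k_tokens: float,
    retry_rate: float,
) -> float:
    """Estimated spend when tool definitions are inferred from prose documentation.

    retry_rate is the fraction of extra calls caused by misread documentation.
    """
    calls = integrations * llm_calls_per_integration * (1 + retry_rate)
    return calls * tokens_per_call / 1000 * usd_per_1k_tokens


# Hypothetical inputs: 2,000 new integrations across a fleet, 4 LLM calls each,
# 6,000 tokens per call, 0.01 USD per 1,000 tokens, 25% of calls retried.
inferred = llm_inference_cost_usd(
    integrations=2000,
    llm_calls_per_integration=4,
    tokens_per_call=6000,
    usd_per_1k_tokens=0.01,
    retry_rate=0.25,
)

# Parsing an OpenAPI document is treated here as having no inference cost,
# which ignores the (small) compute cost of the parser itself.
spec_driven = 0.0
savings = inferred - spec_driven
```

Under these placeholder inputs the inferred path costs on the order of hundreds of dollars, and the figure scales linearly with fleet size, token price, and retry rate. Latency and error risk, which the paragraph above also mentions, are not modeled.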

3.2 Discovery vs. Build-vs-Buy Decisions

The build-vs-buy decision for agent capabilities is fundamentally an information problem. An agent (or its orchestration layer) must estimate:

The cost of calling an external service for N units of work
The cost of building and maintaining an internal capability that replicates the service
The quality differential between the two options

Poor discovery infrastructure degrades the quality of this estimate. If an agent cannot accurately determine what an external service costs, how reliable it is, or exactly what it can do, the build-vs-buy calculation defaults toward building — not because building is better, but because the external option is too uncertain to trust.
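
A toy model makes the mechanism visible. The sketch below assumes a risk-averse orchestrator that inflates the external price by an uncertainty margin; the function names, the margin approach, and all numbers are hypothetical, and the quality differential from the list above is deliberately left out.

```python
def buy_cost(units: int, usd_per_unit: float, price_uncertainty: float) -> float:
    """Pessimistic cost of the external service.

    price_uncertainty is a fraction (0.1 = the true price may be 10% higher)
    that a cautious orchestrator adds when pricing signals are unclear.
    """
    return units * usd_per_unit * (1 + price_uncertainty)


def build_cost(units: int, fixed_build_usd: float, usd_per_unit_internal: float) -> float:
    """Cost of building the capability once and running it internally."""
    return fixed_build_usd + units * usd_per_unit_internal


def decide(units, usd_per_unit, price_uncertainty, fixed_build_usd, usd_per_unit_internal) -> str:
    """Return "buy" when the pessimistic external cost does not exceed the build cost."""
    buy = buy_cost(units, usd_per_unit, price_uncertainty)
    build = build_cost(units, fixed_build_usd, usd_per_unit_internal)
    return "buy" if buy <= build else "build"


# Same service, same true price; only the clarity of its pricing signal differs.
clear_pricing = decide(100_000, 0.002, price_uncertainty=0.1,
                       fixed_build_usd=200.0, usd_per_unit_internal=0.0005)
opaque_pricing = decide(100_000, 0.002, price_uncertainty=1.0,
                        fixed_build_usd=200.0, usd_per_unit_internal=0.0005)
```

With a 10% margin the external service wins; with a 100% margin the identical service loses to building. Nothing about the service changed except how confidently its price could be read.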

The implication for service providers: Publishing clear, accurate discovery infrastructure is not just a developer relations exercise. It directly affects whether agents choose to use your service at all. A service with opaque pricing and no structured capability manifest will be systematically underselected relative to its true value, because agents cannot accurately price the buy option.

The implication for agent builders: When evaluating external services, prioritize those with complete discovery stacks. The presence of a well-maintained agents.json and OpenAPI spec is itself a signal of operational maturity — services that invest in structured documentation tend to invest in reliability and backward compatibility as well.

3.3 Discovery and Service Provider Concentration

Research in agent service consumption patterns suggests that agent spending concentrates heavily among a small number of providers — a pattern consistent with the economics of high switching costs and strong network effects. Discovery infrastructure interacts with this concentration dynamic in two ways:

Discovery amplifies concentration: Services that are easy to discover and integrate get selected first. Once integrated, they benefit from the agent's familiarity and the sunk cost of integration. This means early leaders in discovery quality can entrench their position before competitors catch up.

Discovery can disrupt concentration: A new entrant with superior discovery infrastructure can reduce the evaluation cost advantage that incumbents enjoy. If an agent can accurately assess a new service's capabilities and pricing in seconds via a well-formed agents.json, the incumbent's familiarity advantage shrinks. Discovery infrastructure is therefore one of the few levers available to challengers in concentrated agent service markets.

4. Practical Implementation Patterns

4.1 Publishing Your Service for Agent Discovery

Step 1: Write llms.txt
- Draft a 300–600 word plain-language description of your service
- Lead with the primary use case, not your company story
- List what the service does not do (reduces false-positive evaluations)
- Link to your agents.json and OpenAPI spec
- Publish at https://yourdomain.com/llms.txt

Step 2: Publish agents.json
- Create a JSON file following the capability manifest structure described in Section 2.2
- Include at minimum: service name, capability list with descriptions, pricing tier, authentication method, and rate limits
- Publish at https://yourdomain.com/.well-known/agents.json
- Set appropriate cache headers (recommend 24-hour TTL with must-revalidate)

Step 3: Maintain an agent-optimized OpenAPI spec
- Ensure every endpoint has a clear summary and description
- Use stable operationId values; changing these breaks agent tool registrations
- Add x- extensions for cost hints, idempotency flags, and human-review requirements
- Validate the spec with a linter before publishing; malformed specs cause silent failures in agent frameworks
- Version your spec and maintain a changelog

Step 4: Add Schema.org markup to your documentation pages
- Add a WebAPI or Service JSON-LD block to your main documentation page
- Include pricing, provider identity, and a link to your OpenAPI spec
- Test with the Schema Markup Validator (validator.schema.org) or an equivalent structured data validator; Google's Rich Results Test checks only the types that are eligible for Google rich results

Step 5: Test agent discovery end-to-end
- Use an agent framework (LangChain, AutoGen, or similar) to attempt automatic tool generation from your OpenAPI spec
- Ask an LLM to read your llms.txt and describe what your service does; if the description is inaccurate, revise
- Simulate the agents.json evaluation: can a script parse your manifest and extract cost and capability data without errors?
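
One check that fits naturally into a release pipeline is guarding operationId stability between spec versions. The sketch below compares two OpenAPI documents already loaded as dictionaries and reports identifiers that disappeared. The sample specs and function names are hypothetical, and a renamed operation is reported as a removal because a registered tool would no longer resolve.

```python
HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options", "trace"}


def operation_ids(spec: dict) -> set[str]:
    """Collect every operationId declared in an OpenAPI document."""
    ids = set()
    for item in spec.get("paths", {}).values():
        for method, op in item.items():
            if method in HTTP_METHODS and "operationId" in op:
                ids.add(op["operationId"])
    return ids


def removed_operation_ids(old_spec: dict, new_spec: dict) -> list[str]:
    """operationIds present in the old spec but missing from the new one."""
    return sorted(operation_ids(old_spec) - operation_ids(new_spec))


# Hypothetical previous and candidate versions of a spec.
OLD_SPEC = {"paths": {
    "/quotes/{symbol}": {"get": {"operationId": "getQuote"}},
    "/instruments": {"get": {"operationId": "listInstruments"}},
}}
NEW_SPEC = {"paths": {
    "/quotes/{symbol}": {"get": {"operationId": "getQuote"}},
    "/instruments": {"get": {"operationId": "listAllInstruments"}},
}}

removed = removed_operation_ids(OLD_SPEC, NEW_SPEC)
```

A non-empty result is a reasonable signal to block the release or to bump the major version and record the change in the changelog.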

4.2 Consuming Discovery Signals as an Agent Builder

Building a discovery pipeline:

Index phase: Collect llms.txt and agents.json files from candidate service domains. This can be done with lightweight HTTP fetches — no LLM inference required at this stage.
Filter phase: Use structured fields from agents.json (category, pricing tier, authentication method) to filter candidates programmatically. Eliminate services that don't match task requirements before invoking any LLM reasoning.
Evaluate phase: For remaining candidates, use LLM reasoning over llms.txt content to assess capability fit. This is the only phase that requires inference, and it operates on a pre-filtered, smaller candidate set.
Integrate phase: Load the OpenAPI spec for selected services and generate tool definitions. Most agent frameworks support this natively.
Monitor phase: Periodically re-fetch agents.json to detect pricing changes, new capabilities, or deprecations. Treat discovery as a continuous process, not a one-time setup.
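
The first three phases can be sketched as a single function in which the expensive LLM step runs only on candidates that survive structured filtering. The manifest fields (`categories`, `usd_per_call`), the `fetch_manifest` and `llm_fit_score` callables, and the sample domains are all assumptions; in a real pipeline the callables would wrap HTTP fetches and a model call.

```python
from typing import Callable, Optional


def discover(
    domains: list[str],
    fetch_manifest: Callable[[str], Optional[dict]],
    llm_fit_score: Callable[[str], float],
    required_category: str,
    max_usd_per_call: float,
    min_fit: float = 0.7,
) -> list[str]:
    """Return domains that pass structured filtering and LLM fit scoring, best first."""
    # Index phase: collect manifests; domains without one are skipped here.
    manifests = {}
    for domain in domains:
        manifest = fetch_manifest(domain)
        if manifest is not None:
            manifests[domain] = manifest

    # Filter phase: purely structured comparison, no inference.
    shortlisted = [
        d for d, m in manifests.items()
        if required_category in m.get("categories", [])
        and m.get("usd_per_call", float("inf")) <= max_usd_per_call
    ]

    # Evaluate phase: the only step that spends inference tokens.
    scored = [(llm_fit_score(d), d) for d in shortlisted]
    return [d for score, d in sorted(scored, reverse=True) if score >= min_fit]


# Stubs standing in for network and model calls.
FAKE_MANIFESTS = {
    "a.example": {"categories": ["finance"], "usd_per_call": 0.001},
    "b.example": {"categories": ["finance"], "usd_per_call": 0.05},
    "c.example": {"categories": ["weather"], "usd_per_call": 0.001},
    "e.example": {"categories": ["finance"], "usd_per_call": 0.002},
}
FAKE_FIT = {"a.example": 0.9, "e.example": 0.4}
llm_calls = []


def fake_llm_fit_score(domain: str) -> float:
    llm_calls.append(domain)  # record how often the "LLM" is invoked
    return FAKE_FIT.get(domain, 0.0)


selected = discover(
    ["a.example", "b.example", "c.example", "d.example", "e.example"],
    fetch_manifest=FAKE_MANIFESTS.get,
    llm_fit_score=fake_llm_fit_score,
    required_category="finance",
    max_usd_per_call=0.01,
)
```

Of five candidate domains, only two reach the evaluate phase in this sample, which is the cost-saving property the pipeline is designed around.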

Handling missing discovery infrastructure:
- If llms.txt is absent, fall back to parsing the service's main documentation page with semantic HTML extraction
- If agents.json is absent, attempt to infer pricing and capabilities from OpenAPI spec descriptions and x- extensions
- If OpenAPI is absent, flag the service as high-integration-cost and weight it accordingly in build-vs-buy analysis
- Maintain a local capability cache with TTL to avoid re-evaluating stable services on every run

Trust signals to track:
- Last-modified date on discovery files (stale files suggest low operational investment)
- Consistency between llms.txt claims and OpenAPI spec content (inconsistencies suggest poor maintenance)
- Presence of a status page URL in agents.json (correlates with operational maturity)
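
The fallback chain and the TTL cache can be combined in a small resolver. This is a sketch under stated assumptions: the loader functions are stubs standing in for real fetchers, the record shape is invented, and the clock is injectable so the expiry behavior can be demonstrated without waiting.

```python
import time
from typing import Any, Callable, Optional


class CapabilityCache:
    """In-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[str, tuple[float, Any]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at > self._ttl:
            del self._entries[key]  # expired: force a fresh evaluation
            return None
        return value

    def put(self, key: str, value: Any) -> None:
        self._entries[key] = (self._clock(), value)


def resolve(domain: str, loaders, cache: CapabilityCache) -> dict:
    """Try each discovery source in order; cache whichever answers first."""
    cached = cache.get(domain)
    if cached is not None:
        return cached
    for source_name, load in loaders:
        data = load(domain)
        if data is not None:
            record = {"source": source_name, "data": data, "high_integration_cost": False}
            cache.put(domain, record)
            return record
    # Nothing structured was found: flag for the build-vs-buy analysis.
    record = {"source": "none", "data": None, "high_integration_cost": True}
    cache.put(domain, record)
    return record


# Demonstration with a fake clock and stub loaders.
now = [0.0]
calls = []


def load_agents_json(domain):
    calls.append("agents.json")
    return None  # simulate a service that publishes no manifest


def load_openapi_hints(domain):
    calls.append("openapi")
    return {"usd_per_call": 0.001}


LOADERS = [("agents.json", load_agents_json), ("openapi", load_openapi_hints)]
cache = CapabilityCache(ttl_seconds=3600, clock=lambda: now[0])

first = resolve("api.example.com", LOADERS, cache)
second = resolve("api.example.com", LOADERS, cache)  # served from cache
now[0] = 7200.0                                       # two hours later
third = resolve("api.example.com", LOADERS, cache)   # TTL expired, reloaded
```

Caching the negative result as well keeps the agent from re-probing services that publish nothing, at the price of noticing newly published files only after the TTL elapses.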

4.3 Real-World Examples and Case Studies

Case: Developer tool APIs. Several developer tool providers have begun publishing OpenAPI specs optimized for agent consumption, with explicit operationId stability guarantees and x- extensions for cost hints. These services report faster third-party integration cycles compared to services relying on prose documentation alone, which is consistent with the friction-reduction logic described above.

Case: Data and research APIs. Data providers that publish structured capability manifests see higher uptake from agent-based workflows than those relying on sales-led discovery. The pattern reflects a structural shift: agents do not respond to sales outreach, attend webinars, or read case studies. They respond to structured signals. Providers that have not adapted their go-to-market to this reality are systematically underrepresented in agent-mediated consumption.

Case: The llms.txt early adopter pattern. A small but growing number of SaaS providers have published llms.txt files, primarily in the developer tools and AI infrastructure categories. Early adopters report that the files are being fetched by agent frameworks and LLM-powered research tools, confirming that the consumption infrastructure exists even where the publication infrastructure is still sparse.

The gap: As of the time of writing, the majority of web services — including many with high-quality APIs — have not published any agent-specific discovery infrastructure. This gap represents both a risk (services being systematically missed by agent-mediated discovery) and an opportunity (first movers in any category can establish discovery-layer advantages before competitors).

5. The Emerging Discovery Market

5.1 Who Profits from Better Discovery?

Discovery infrastructure creates value at three levels:

Service providers benefit from lower customer acquisition cost in agent-mediated markets. A service that is easy to discover and evaluate requires less sales and marketing investment to reach agent operators. The discovery file is, in effect, a 24/7 sales pitch optimized for the actual buyer — the agent's orchestration layer.

Agent builders benefit from lower integration cost and more accurate build-vs-buy decisions. Fleets built on well-instrumented services are cheaper to maintain and more adaptable to changing task requirements.

Discovery intermediaries — registries, catalogs, and aggregators that index and normalize discovery files across many services — capture value by reducing the search cost component of agent friction. A well-maintained registry of agents.json files, with normalized schemas and quality scores, is a genuinely valuable infrastructure layer. Research in adjacent markets suggests that infrastructure aggregators in high-fragmentation markets can achieve durable competitive positions.

5.2 Discovery as Competitive Moat

Discovery infrastructure can function as a competitive moat in several ways:

First-mover indexing advantage: Agent frameworks and orchestration platforms that index discovery files early build familiarity with a service's capabilities. This familiarity is encoded in training data, tool registries, and cached integrations. Latecomers face a higher bar to displace an incumbent that is already well-represented in agent tool libraries.

Schema lock-in: If a service's agents.json schema becomes the de facto standard for its category, competitors face pressure to conform to that schema to be comparable. The schema-setter captures a subtle but durable advantage: their capabilities are always described in their own terms.

Integration depth: Services that invest in deep OpenAPI coverage — documenting edge cases, error codes, and idempotency behavior — are harder to replace than services with shallow specs. An agent that has been built around a detailed spec has implicit dependencies on that spec's structure. Switching requires re-testing, not just re-pointing.

Quality signal differentiation: In a market where most services have poor discovery infrastructure, a service with excellent discovery infrastructure signals operational quality more broadly. Agents (and the humans who build them) use discovery quality as a proxy for reliability, maintainability, and vendor trustworthiness.

5.3 Future: Standardization vs. Fragmentation

The discovery infrastructure landscape is currently in an early, fragmented state. Multiple competing conventions exist for capability manifests, and no single standard has achieved the adoption that OpenAPI has achieved for API contracts. This creates both risk and opportunity.

The standardization scenario: A small number of formats — likely including agents.json in some form, plus OpenAPI with agent-specific extensions — achieve broad adoption, possibly through endorsement by major agent framework providers or platform operators. In this scenario, discovery infrastructure becomes a commodity layer, and competitive advantage shifts to capability quality and pricing.

The fragmentation scenario: Multiple incompatible discovery formats persist, each favored by different agent frameworks or platform ecosystems. In this scenario, discovery intermediaries that normalize across formats capture significant value, and service providers face the cost of maintaining multiple discovery representations.

The most likely near-term outcome is partial standardization: OpenAPI achieves near-universal adoption for the execution layer (it largely already has), while the higher-level capability manifest layer (agents.json equivalents) remains fragmented for longer, with 2–4 competing formats coexisting. Service providers should invest in OpenAPI quality immediately and monitor the manifest layer for emerging consensus before committing to a single format.

Implications for agent builders: Build your discovery pipeline to be format-agnostic at the manifest layer. Parse whatever structured signals are available rather than requiring a specific format. This insulates your agent fleet from format fragmentation and positions you to benefit from whichever standard emerges.
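
One way to stay format-agnostic is an adapter layer that maps each recognized input shape onto a single internal record. The sketch below handles two shapes: a hypothetical manifest layout and an OpenAPI document carrying the illustrative `x-agent-cost-hint` extension. Both layouts, the `ServiceRecord` type, and the choice to take the highest cost hint are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional

HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options", "trace"}


@dataclass(frozen=True)
class ServiceRecord:
    """Internal, format-independent view of a discovered service."""
    name: str
    capabilities: tuple
    usd_per_call: Optional[float]


def from_manifest(raw: dict) -> Optional[ServiceRecord]:
    """Adapter for a hypothetical agents.json-style manifest."""
    if "name" not in raw or "capabilities" not in raw:
        return None
    return ServiceRecord(
        name=raw["name"],
        capabilities=tuple(c["name"] for c in raw["capabilities"]),
        usd_per_call=raw.get("pricing", {}).get("usd_per_call"),
    )


def from_openapi(raw: dict) -> Optional[ServiceRecord]:
    """Adapter for an OpenAPI document; uses the worst-case cost hint if any exist."""
    if "openapi" not in raw or "paths" not in raw:
        return None
    ops = [
        op for item in raw["paths"].values()
        for method, op in item.items() if method in HTTP_METHODS
    ]
    hints = [op["x-agent-cost-hint"] for op in ops if "x-agent-cost-hint" in op]
    return ServiceRecord(
        name=raw.get("info", {}).get("title", "unknown"),
        capabilities=tuple(op["operationId"] for op in ops if "operationId" in op),
        usd_per_call=max(hints) if hints else None,
    )


# New formats are supported by appending an adapter, not by changing callers.
ADAPTERS = (from_manifest, from_openapi)


def normalize(raw: dict) -> Optional[ServiceRecord]:
    """Return the first adapter's result that recognizes the document, else None."""
    for adapter in ADAPTERS:
        record = adapter(raw)
        if record is not None:
            return record
    return None


manifest_record = normalize({
    "name": "alpha-quotes",
    "capabilities": [{"name": "quote_lookup"}],
    "pricing": {"usd_per_call": 0.002},
})
openapi_record = normalize({
    "openapi": "3.0.3",
    "info": {"title": "Example Data API", "version": "1.0.0"},
    "paths": {"/quotes/{symbol}": {"get": {"operationId": "getQuote",
                                           "x-agent-cost-hint": 0.001}}},
})
unknown_record = normalize({"something": "else"})
```

Downstream filtering and comparison code then depends only on `ServiceRecord`, so a change in which manifest format wins affects one adapter rather than the whole pipeline.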

Key Takeaways

Discovery infrastructure is an economic layer, not just a technical one. It directly affects service selection frequency, integration cost, and build-vs-buy decisions in agent-mediated markets.
The four pillars serve different phases of agent decision-making: llms.txt for natural language capability assessment, agents.json for programmatic capability matching, OpenAPI for execution-layer integration, and semantic HTML as a universal fallback.
Poor discovery infrastructure systematically undervalues good services. Agents cannot select what they cannot accurately evaluate. Services that invest in structured discovery signals will be overrepresented in agent-mediated consumption relative to their true market share in human-mediated markets.
Discovery quality is a proxy for operational maturity. Services that maintain accurate, versioned, agent-optimized discovery files signal the same discipline that produces reliable APIs and backward-compatible changes.
The discovery market is early and fragmented. OpenAPI is the most mature layer. The capability manifest layer (agents.json and equivalents) is still converging. First movers in any service category have an opportunity to establish discovery-layer advantages before the market consolidates.
Agent builders should treat discovery as a continuous pipeline, not a one-time setup. Services change their capabilities, pricing, and policies. Discovery files that are fetched once and cached indefinitely will produce stale, inaccurate tool registrations.
Discovery intermediaries — registries and aggregators — are an emerging infrastructure opportunity. The value of normalizing and quality-scoring discovery files across many services is real and currently underserved.
