The 114K Token Problem: Why Playwright MCP Burns Your AI Coding Agent's Context on Salesforce

Token economics is the invisible tax on every “AI-powered” Salesforce QA stack. Most teams don’t realize they’re paying it until their Healer agent stops working halfway through a release.

There’s an uncomfortable truth in the AI-driven testing space that most vendor marketing dances around: context windows are finite, and Salesforce pages are enormous. Together, these two facts are the biggest reason that Playwright MCP, whether paired with Claude Code, Copilot, Cursor, or any other coding agent, falls down on enterprise Salesforce work.

It’s not a UX problem. It’s an arithmetic problem.

Where the 114,000 Tokens Go

A Salesforce Lightning page is not a simple form. A single Opportunity edit page renders dozens of LWC and Aura components, each with its own nested shadow DOM, dynamic field metadata, layout-driven section ordering, related lists, action toolbars, and inline detail components. When Playwright MCP captures an accessibility snapshot of this page, the model receives an enormous dump of anonymous DOM references, and the token cost accumulates fast.

Here’s the mechanism: each browser snapshot on a Lightning page runs between 10,000 and 50,000 tokens on its own. A realistic end-to-end test requires 5 to 15 of those snapshots. By the time you’ve run a single Salesforce test session, you’ve accumulated roughly 114,000 tokens of accessibility-tree data — none of which contains a single field API name, component type, or anything a stable test can actually be built from.

For comparison, a full Provar MCP authoring session, from inspecting the org through to generating, validating, and planning a test, runs approximately 12,000 to 15,000 tokens end-to-end. That’s at least a 7.5× efficiency advantage on Salesforce workloads (114,000 ÷ 15,000 ≈ 7.6), before any quality improvement is counted.
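The arithmetic behind these figures can be sketched in a few lines. The per-snapshot range, the snapshot count, and the two session totals are the ones quoted above; the function and constant names are illustrative only and do not come from either tool.

```python
# Figures quoted in this article; all names are illustrative.
SNAPSHOT_TOKENS = (10_000, 50_000)       # tokens per Lightning snapshot (low, high)
SNAPSHOTS_PER_TEST = (5, 15)             # snapshots per end-to-end test (low, high)
PLAYWRIGHT_SESSION_TOKENS = 114_000      # session total quoted above
PROVAR_SESSION_TOKENS = (12_000, 15_000) # authoring session range quoted above


def session_bounds(per_snapshot, snapshot_count):
    """Return the (min, max) tokens a session can accumulate from snapshots alone."""
    return (
        per_snapshot[0] * snapshot_count[0],
        per_snapshot[1] * snapshot_count[1],
    )


def efficiency_ratio(baseline_tokens, candidate_tokens):
    """How many times more tokens the baseline uses than the candidate."""
    return baseline_tokens / candidate_tokens


low, high = session_bounds(SNAPSHOT_TOKENS, SNAPSHOTS_PER_TEST)
print(f"Snapshot-only session range: {low:,} to {high:,} tokens")

for provar_tokens in PROVAR_SESSION_TOKENS:
    ratio = efficiency_ratio(PLAYWRIGHT_SESSION_TOKENS, provar_tokens)
    print(f"114,000 vs {provar_tokens:,} tokens: about {ratio:.1f}x")
```

The quoted 114,000-token session sits toward the low end of the range that the per-snapshot figures allow, so the ratio would widen for heavier pages or longer tests.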

Microsoft, which built Playwright MCP, has acknowledged this structural limit directly. Their own repository now recommends CLI-based workflows over MCP for coding agent use cases, noting that CLI invocations avoid loading large tool schemas and verbose accessibility trees into the model context. When the tool’s own authors recommend against using it for your workflow, that’s not an edge case. It’s a constraint built into the design.

Benchmark note: Token figures are reproducible using a measurement script in the Provar MCP repository (github.com/ProvarTesting/provardx-cli). Figures are based on Provar MCP v1.5.1 and @playwright/mcp@0.0.75.

Why This Math Wrecks Coding Agents

AI coding agents operate inside a fixed context window — typically between 100,000 and 200,000 tokens depending on the model. Claude Sonnet, for example, operates at 200,000 tokens; GPT-4o at 128,000. That context window has to hold everything the model uses to reason: the user’s prompt, the codebase, conversation history, prior tool outputs, and any data it’s currently working with.

Burn 114K of that on the accessibility snapshots from a single Salesforce test and the agent has roughly two choices: drop everything else, or stop reasoning effectively. Neither is acceptable for production QA.

Let’s run the math on a realistic session.

A coding agent assigned a regression sweep across 10 Salesforce flows (Lead creation, Opportunity edit, Quote generation, three Cases, two Account merges, and two Reports) needs context for each. At roughly 114K tokens of snapshots per test, the first test alone consumes most of the window. By test two, the agent is either truncating earlier context (and forgetting what your test plan actually says) or losing the ability to plan, heal, and reason about coverage.

This reality is why teams report the same pattern again and again: the demo runs beautifully on a single test, then the agent appears to “go stupid” by the fifth or sixth scenario. But the agent hasn’t gone stupid; it’s run out of room.

What Actually Happens Downstream

Three failure modes follow from token starvation, and they show up in different parts of the QA process:

Healing degrades silently. A Healer agent re-derives broken selectors by reasoning about the current page versus the prior one. When the context window is mostly consumed by raw accessibility trees, there’s no room for the Healer to hold both. Heals become guesses; guesses pass review by other agents because no human is in the loop; the test suite drifts toward looking healthy while actually being wrong.

Planning collapses to single-step thinking. A Planner agent is supposed to decompose a complex business process into ordered scenarios. With no room to hold the plan, it generates locally optimal next steps that don’t compose into a coverage strategy. You get a hundred tests of the same flow with minor variations, and no tests of the flow you shipped last sprint.

Quality analysis stops happening at all. Asking an agent, “Which business processes have no coverage?” requires it to hold both the coverage model and the project state in context simultaneously. At roughly 114K tokens of snapshots per test, there isn’t room. The agent answers based on whatever’s left in the window, which is usually whatever it most recently saw.

None of these failure modes throw errors. They all look like the system is working.

The Per-Interaction Reality

The token gap isn’t just visible at the session level. It’s present at every single interaction:

Workflow	Playwright MCP
Non-SF page (e.g. example.com)	~120 tokens
SF Lightning snapshot	10,000–50,000 tokens
Generate a 4-step test case	4+ snapshots required (no structured path)

Provar MCP’s structured CLI approach returns between 500 and 2,000 tokens for a Lightning interaction, and around 1,000 tokens to generate a 4-step test case end-to-end.
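For the 4-step test case, the gap can be worked out directly from the figures above. This is a minimal sketch that assumes exactly four snapshots (the stated minimum) and uses illustrative names throughout.

```python
# Figures quoted above; names are illustrative.
SNAPSHOT_TOKENS_LOW = 10_000
SNAPSHOT_TOKENS_HIGH = 50_000
MIN_SNAPSHOTS_FOR_FOUR_STEPS = 4   # "4+ snapshots required"
STRUCTURED_TEST_TOKENS = 1_000     # approximate structured figure quoted above


def snapshot_cost_range(snapshots, low, high):
    """Return (min, max) tokens for a given number of snapshots."""
    return snapshots * low, snapshots * high


def overhead_factor(snapshot_tokens, structured_tokens):
    """How many times more tokens the snapshot path uses."""
    return snapshot_tokens / structured_tokens


cheapest, costliest = snapshot_cost_range(
    MIN_SNAPSHOTS_FOR_FOUR_STEPS, SNAPSHOT_TOKENS_LOW, SNAPSHOT_TOKENS_HIGH
)
print(f"Snapshot path: {cheapest:,} to {costliest:,} tokens")
print(f"Overhead vs structured path: "
      f"{overhead_factor(cheapest, STRUCTURED_TEST_TOKENS):.0f}x to "
      f"{overhead_factor(costliest, STRUCTURED_TEST_TOKENS):.0f}x")
```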

Why CLI Beats MCP for This Workload

Structured CLI commands win on token economics for a straightforward reason: a CLI call returns the answer to a specific question, while an MCP accessibility snapshot returns everything the model might possibly need to know about the page. The CLI is a query. The snapshot is a dump.

Provar MCP leans hard into the CLI model. Instead of asking an agent to interpret tens of thousands of tokens of DOM tree to figure out which field is the Amount field, Provar MCP exposes a structured Salesforce-metadata tool. The agent asks for the metadata it needs (field API name, type, required-ness, component) and gets a few hundred tokens of structured response back. The accessibility tree never enters the context window. The agent has room to reason.
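The query-versus-dump distinction can be illustrated with a toy comparison. Everything in this sketch is hypothetical: the fake snapshot only imitates the shape of an accessibility dump, the metadata values are invented, describe_field is not a Provar MCP tool, and the four-characters-per-token estimate is a crude rule of thumb rather than a real tokenizer.

```python
import json


def estimate_tokens(text: str) -> int:
    """Crude estimate assuming about four characters per token."""
    return max(1, len(text) // 4)


def fake_snapshot(node_count: int) -> str:
    """Hypothetical stand-in for a page dump: many nodes with opaque refs."""
    return "\n".join(f"- generic [ref=e{i}]: node {i}" for i in range(node_count))


# Hypothetical metadata store keyed by (object, field API name).
FIELD_METADATA = {
    ("Opportunity", "Amount__c"): {
        "apiName": "Amount__c",
        "type": "Currency",            # invented value for illustration
        "required": False,             # invented value for illustration
        "component": "lightning-input-field",
    },
}


def describe_field(object_name: str, field_name: str) -> str:
    """Answer one specific question instead of returning the whole page."""
    return json.dumps(FIELD_METADATA[(object_name, field_name)])


dump_tokens = estimate_tokens(fake_snapshot(2_000))
query_tokens = estimate_tokens(describe_field("Opportunity", "Amount__c"))
print(f"dump: ~{dump_tokens} tokens, query: ~{query_tokens} tokens")
```

The point of the design is that the response size depends on the question asked, not on how large the page happens to be.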

Provar’s Page Objects are stable in a way Playwright selectors aren’t. The agent isn’t writing locator('[ref="e234"]') after staring at a DOM dump. It’s writing a metadata reference like Opportunity.Amount__c that resolves correctly across dev, SIT, UAT, and production, and that survives the next release.

Practical Implications for QA Leaders

A few questions are worth asking your team if you’re running Playwright MCP today:

How many tests can your coding agent run in a single session before its outputs start to drift?
- If you haven’t measured this yet, it’s time to measure. The number is almost always lower than people expect.
Does your Healer agent succeed silently or fail loudly?
- If healing produces no observable signal when it fails, you don’t have healing. You have a thin layer of plausible-looking edits sitting on top of broken tests.
What does your token bill look like?
- If you’re running coding agents against Salesforce through Playwright MCP at any meaningful scale, you are paying for that 114K per session every time. Many teams discover that switching to a structured-metadata MCP server cuts inference costs by 80% or more, before any quality improvement is counted.
What happens after a Salesforce release?
- The most difficult version of this question is also the most important. DOM-derived tests break. Metadata-grounded tests don’t. The lower token cost is what makes metadata-grounded testing economically viable at scale.
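To put a number on the token-bill question, the session figures from earlier in the article can be turned into a reduction percentage and a rough monthly cost. This is a sketch only: the price per million tokens and the session volume are placeholders to replace with your own values, and it counts only the context tokens discussed here, not output tokens or other overhead.

```python
def reduction_pct(before_tokens, after_tokens):
    """Percentage reduction in tokens when moving from before to after."""
    return (1 - after_tokens / before_tokens) * 100


def monthly_input_cost(tokens_per_session, sessions_per_month, usd_per_million_tokens):
    """Rough monthly spend on session context tokens."""
    return tokens_per_session * sessions_per_month / 1_000_000 * usd_per_million_tokens


# Session figures quoted in this article.
SNAPSHOT_SESSION = 114_000
STRUCTURED_SESSION_HIGH = 15_000

# Placeholder assumptions; substitute your provider's price and your volume.
ASSUMED_USD_PER_MILLION = 3.0
ASSUMED_SESSIONS_PER_MONTH = 1_000

print(f"Reduction: {reduction_pct(SNAPSHOT_SESSION, STRUCTURED_SESSION_HIGH):.1f}%")
for label, tokens in (("snapshot", SNAPSHOT_SESSION), ("structured", STRUCTURED_SESSION_HIGH)):
    cost = monthly_input_cost(tokens, ASSUMED_SESSIONS_PER_MONTH, ASSUMED_USD_PER_MILLION)
    print(f"{label}: ${cost:,.2f} per month")
```

Tracking the same two numbers per session before and after a Salesforce release gives a baseline for the other questions in this list.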

The Underlying Point

Token economics is not a “performance optimization” concern. It’s a correctness concern. An AI coding agent operating in a starved context window doesn’t slow down — it gets quietly wrong. For consumer apps, that’s annoying. For Salesforce orgs running regulated data, it’s unacceptable.

The Salesforce-aware answer is not “use a bigger model.” It is “stop putting accessibility trees in the context window when structured metadata will do.” That’s the architectural choice Provar MCP made, and it’s why the math actually works out at enterprise scale.

Run the numbers yourself. Deploy Provar MCP alongside your current setup for thirty days, measure token cost per test session and stability across a Salesforce release, and decide on data — not demos.

Ready to start? Book some time with a Provar expert today!
