Claude Code can write a Playwright test in thirty seconds. So can Copilot. So can Cursor. The demo is impressive — until you point it at a real Salesforce org and watch what happens after the next release.

The pitch from every AI coding agent on the market sounds the same right now:

Describe a flow in natural language, get a working Playwright test, and a Healer agent will fix the locators when they drift. 

For a generic SaaS application, the pitch works astonishingly well. For Salesforce, with its Lightning Web Components, Aura, dynamic metadata, triannual releases, regulated data, and real test governance, it falls apart in ways that are not obvious until it’s too late: you’re three months into a rollout and your “AI-generated test suite” has become an “AI-generated rework backlog.”

This post is a head-to-head: where Playwright plus a coding agent is good enough, where it isn’t, and what Salesforce QA teams actually need from an MCP server in 2026.

The Salesforce Reality vs. The Coding Agent Promise

AI coding agents make four reasonable promises:

  • generate tests from natural language
  • self-heal when locators break
  • plan complex scenarios
  • run anywhere a browser runs

All four are true on a normal web app.

On Salesforce, four things break that promise:

Salesforce isn’t a generic web app. Accessibility snapshots — the thing the AI actually sees — miss everything that matters. LWC shadow DOM hides component internals. Aura wraps interactions in framework conventions. Dynamic metadata means the page the agent saw yesterday is structurally different today, even before a release.

Healer agents can’t repair metadata-level breaks. When Salesforce changes a field’s API name, a locator strategy, or a component pattern after a release, DOM healing can’t recover. There is no selector to re-derive. The schema itself changed. Only a metadata-aware system can fix that, and accessibility snapshots aren’t metadata.

Planner agents have no test governance context. They don’t know what a test plan is. They don’t know which business processes are covered and which aren’t. They don’t know which sign-offs your release process requires. They generate .spec files, not a coverage map.

Token costs explode on Salesforce pages. Each browser snapshot on a Lightning page runs 10,000 to 50,000 tokens. A full test session accumulates roughly 114,000 tokens of accessibility-tree data. That’s not a footnote — it’s a structural limit that leaves almost no room for planning, healing, or quality analysis once the first test is done.

The Fundamental Gap: HTML vs. Metadata

The clearest way to see the difference is to look at what the AI actually receives.

Playwright MCP feeds the model an accessibility tree of anonymous references: generic elements, input fields, buttons, divs — with no field API names, no component types, no environment context. Useful for clicking around a webpage. Useless for stable, reproducible enterprise tests.

Provar MCP feeds the model Salesforce metadata: the field API name, its type, whether it’s required, the component it maps to, and the Page Object reference. For standard Salesforce objects, no Page Object is generated at all — Provar binds fields directly by API name at runtime, making them stable across every environment and every release automatically.

This isn’t a Provar vs. Microsoft argument. It’s an architectural one. If your test framework’s source of truth is the rendered DOM, every Salesforce release is a regression event.
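To make the architectural point concrete, here is a minimal Python sketch of the two kinds of input. The class names, field names, and reference strings such as "e2" are hypothetical and chosen only for illustration; they are not the actual payload format of Playwright MCP or Provar MCP. The sketch shows one failure mode: a recorded snapshot reference points at a different control once the layout reorders two fields, while a lookup keyed on the field API name is unaffected.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical shapes for illustration only; neither class mirrors the
# real payload of Playwright MCP or Provar MCP.

@dataclass(frozen=True)
class SnapshotNode:
    ref: str    # anonymous reference assigned within one snapshot
    role: str   # accessibility role, e.g. "textbox"
    name: str   # visible label text

@dataclass(frozen=True)
class FieldMetadata:
    api_name: str    # e.g. "Account.Industry"
    field_type: str  # e.g. "picklist"
    required: bool

def find_by_ref(snapshot: list[SnapshotNode], ref: str) -> Optional[SnapshotNode]:
    """DOM-style lookup: only meaningful for the snapshot that issued the ref."""
    return next((node for node in snapshot if node.ref == ref), None)

def find_by_api_name(fields: list[FieldMetadata], api_name: str) -> Optional[FieldMetadata]:
    """Metadata-style lookup: keyed on the field API name, not on position."""
    return next((field for field in fields if field.api_name == api_name), None)

# The same page before and after a layout change that swaps two fields.
before = [
    SnapshotNode("e1", "textbox", "Account Name"),
    SnapshotNode("e2", "combobox", "Industry"),
]
after = [
    SnapshotNode("e1", "combobox", "Industry"),
    SnapshotNode("e2", "textbox", "Account Name"),
]

# The metadata view does not depend on layout order.
metadata = [
    FieldMetadata("Account.Name", "string", True),
    FieldMetadata("Account.Industry", "picklist", False),
]

ref_before = find_by_ref(before, "e2")   # the Industry combobox
ref_after = find_by_ref(after, "e2")     # now the Account Name textbox
industry = find_by_api_name(metadata, "Account.Industry")
```

In this toy setup the test that recorded "e2" silently drives the wrong field after the layout change, whereas the API-name lookup returns the same field definition both times.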

The Token Economics in Numbers

Here’s what the context window comparison actually looks like:

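| Approach                    | Tokens per Salesforce test session                    |
|-----------------------------|-------------------------------------------------------|
| Playwright MCP              | ~114,000 (accumulated across 5–15 snapshots)          |
| Playwright CLI              | ~27,000 (still Salesforce context, still DOM-derived) |
| Provar MCP (Authoring mode) | ~7,900–15,000 end-to-end                              |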

That’s roughly a 7.5x efficiency advantage for Provar MCP on Salesforce workloads, measured against the top of its range (114,000 ÷ 15,000 ≈ 7.6). At 114,000 tokens per session, a coding agent running 10 tests has accumulated more than a million tokens of DOM data, which exhausts the context window of most models and leaves nothing for planning, healing, or quality analysis.
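The arithmetic behind these figures can be checked in a few lines of Python. The token counts come from the table above; the 200,000-token context window is an assumption picked purely for illustration, not a figure from this post or a property of any particular model.

```python
# Token figures taken from the comparison table above.
PLAYWRIGHT_MCP_TOKENS = 114_000
PROVAR_MCP_LOW, PROVAR_MCP_HIGH = 7_900, 15_000

# ASSUMPTION: an illustrative 200,000-token context window.
ASSUMED_CONTEXT_WINDOW = 200_000

def efficiency_ratio(baseline_tokens: int, candidate_tokens: int) -> float:
    """How many times more tokens the baseline uses than the candidate."""
    return baseline_tokens / candidate_tokens

def sessions_that_fit(context_window: int, tokens_per_session: int) -> int:
    """Whole test sessions that fit in one context window."""
    return context_window // tokens_per_session

# Ratio against the top and the bottom of the Provar MCP range.
conservative_ratio = efficiency_ratio(PLAYWRIGHT_MCP_TOKENS, PROVAR_MCP_HIGH)
best_case_ratio = efficiency_ratio(PLAYWRIGHT_MCP_TOKENS, PROVAR_MCP_LOW)

# Accumulated DOM data for ten Playwright MCP sessions.
ten_sessions_total = 10 * PLAYWRIGHT_MCP_TOKENS

print(f"conservative ratio: {conservative_ratio:.1f}x")
print(f"best-case ratio:    {best_case_ratio:.1f}x")
print(f"ten sessions:       {ten_sessions_total:,} tokens")
print("sessions per window (Playwright MCP):",
      sessions_that_fit(ASSUMED_CONTEXT_WINDOW, PLAYWRIGHT_MCP_TOKENS))
print("sessions per window (Provar MCP, high end):",
      sessions_that_fit(ASSUMED_CONTEXT_WINDOW, PROVAR_MCP_HIGH))
```

Under that assumed window, one Playwright MCP session fits where thirteen Provar MCP sessions would at the high end of its range. The exact counts shift with the window size, but the ratio between the two approaches does not.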

Provar MCP also gives teams direct control over token consumption. Two configuration options let QA teams tune the tool surface to their workflow: a compact schema mode that reduces catalog size by 36% with no feature loss, and a tool-group scoping option that cuts catalog size by 57% when an agent only needs a subset of capabilities for a specific workflow — such as authoring, bug triage, or CI/CD pipeline validation.
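As a rough illustration of what those percentages mean for a context budget, the sketch below applies each reduction to a hypothetical tool catalog. The 10,000-token baseline is an assumption, not a published figure, and because the post does not say whether the two options stack, each is applied on its own.

```python
# ASSUMPTION: a hypothetical baseline tool-catalog size in tokens.
ASSUMED_CATALOG_TOKENS = 10_000

# Reduction percentages quoted in the paragraph above.
COMPACT_SCHEMA_REDUCTION_PCT = 36
TOOL_GROUP_SCOPING_REDUCTION_PCT = 57

def reduced_size(baseline_tokens: int, reduction_pct: int) -> int:
    """Catalog size after a percentage reduction, using integer arithmetic."""
    return baseline_tokens * (100 - reduction_pct) // 100

compact_catalog = reduced_size(ASSUMED_CATALOG_TOKENS, COMPACT_SCHEMA_REDUCTION_PCT)
scoped_catalog = reduced_size(ASSUMED_CATALOG_TOKENS, TOOL_GROUP_SCOPING_REDUCTION_PCT)

print(f"compact schema mode: {compact_catalog:,} tokens")
print(f"tool-group scoping:  {scoped_catalog:,} tokens")
```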

Microsoft’s own Playwright MCP repository now recommends against MCP mode for coding agent workflows, citing exactly this token overhead. If the tool’s authors are telling you not to use it for your workflow, the responsible question is: what would actually work?

Page Objects: AI-Generated vs. Metadata-Grounded

A coding agent driving Playwright generates a selector from the accessibility snapshot: an anonymous DOM reference that survives until a Salesforce release updates the LWC, the test moves from sandbox to production, the field order changes in the layout, or the shadow DOM structure shifts. In an enterprise Salesforce org, at least one of those happens every quarter.

A coding agent driving Provar MCP generates a metadata-bound result. For custom apps and non-Salesforce pages, it produces a real Page Object in standard Selenium-style format — stable across dev, SIT, UAT, and production. For standard Salesforce objects, no Page Object is required at all; the field binding happens via API name at runtime and survives releases by construction.

Provar runs 240 quality validation rules across five layers (project, plan, suite, test case, and page object) against the generated structure before anything executes. That means structural defects are caught at authoring time, not at 3 a.m. in your CI pipeline. The rules cover everything from XML structure and identifier validity to coverage gap detection and configuration validation before a CI/CD run triggers.

240 quality rules: 9 at project level · 11 at plan level · 6 at suite level · 175 at test case level · the remainder at page object level. Validated before a single line executes.
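The idea of validating structure by layer before execution can be sketched in a few lines of Python. Everything here is hypothetical: the rule names, the dictionary shape of a test case, and the two checks are illustrative stand-ins, not Provar's actual rules or file format.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Hypothetical rule engine for illustration; not Provar's rule set.

@dataclass(frozen=True)
class Rule:
    layer: str                      # e.g. "project", "plan", "suite", "test_case"
    name: str
    check: Callable[[dict], bool]   # returns True when the artifact passes

IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

RULES = [
    Rule("test_case", "has_at_least_one_step",
         lambda tc: len(tc.get("steps", [])) > 0),
    Rule("test_case", "valid_identifier",
         lambda tc: bool(IDENTIFIER.match(tc.get("id", "")))),
    Rule("suite", "has_at_least_one_test_case",
         lambda suite: len(suite.get("test_cases", [])) > 0),
]

def validate(artifact: dict, layer: str, rules: list[Rule] = RULES) -> list[str]:
    """Return the names of failed rules for one layer; an empty list means pass."""
    return [r.name for r in rules if r.layer == layer and not r.check(artifact)]

good_case = {"id": "create_account", "steps": ["open", "fill", "save"]}
bad_case = {"id": "2nd test", "steps": []}

good_failures = validate(good_case, "test_case")
bad_failures = validate(bad_case, "test_case")

print("good:", good_failures)
print("bad: ", bad_failures)
```

Because the checks run against the structure rather than a live browser, a failure like the invalid identifier above surfaces at authoring time instead of mid-pipeline.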

Credential Governance: The Audit Question

Enterprise Salesforce orgs hold regulated data. The credential model under Playwright MCP and most coding agents was not designed with that in mind.

Session cookies sit in a filesystem cache readable by any process with filesystem access. There’s no audit log of which AI invocation touched which org or what actions were taken. There’s no centralised policy for which developers can authenticate against which environments. Playwright MCP’s architecture structurally cannot meet the auditability requirements of SOC 2, ISO 27001, or FedRAMP compliance — standards that most enterprise Salesforce customers operate under.

Provar MCP delegates credentials to the Salesforce CLI’s encrypted credential store — the MCP server never reads or logs them. Every tool call is written with a unique request ID in structured JSON, producing a per-invocation audit trail for compliance teams. The server communicates locally only, with no network listener, and all file operations are path-scoped with explicit traversal blocks.
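Two of the mechanisms described above, a per-invocation audit record and path-scoped file access, are general patterns that can be sketched with the Python standard library. This is not Provar's implementation: the record fields, the tool name, and the helper names are hypothetical, and the sketch only shows what a request-ID audit line and a traversal block might look like.

```python
import json
import uuid
from datetime import datetime, timezone
from pathlib import Path

def audit_record(tool: str, org_alias: str) -> str:
    """Build one structured JSON audit line for a tool call (hypothetical fields)."""
    record = {
        "request_id": str(uuid.uuid4()),                      # unique per invocation
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "tool": tool,
        "org_alias": org_alias,   # an alias only; no credential enters the record
    }
    return json.dumps(record)

def resolve_scoped(root: Path, requested: str) -> Path:
    """Resolve a requested path and refuse anything outside the project root."""
    root_resolved = root.resolve()
    candidate = (root_resolved / requested).resolve()
    # Path.is_relative_to requires Python 3.9 or later.
    if not candidate.is_relative_to(root_resolved):
        raise PermissionError(f"path escapes project root: {requested}")
    return candidate

project_root = Path("/tmp/example-project")   # hypothetical project directory

print(audit_record("validate_test_case", "uat"))
print(resolve_scoped(project_root, "tests/create_account.testcase"))

try:
    resolve_scoped(project_root, "../../etc/passwd")
except PermissionError as exc:
    print("blocked:", exc)
```

Resolving the path before comparing it to the root is what defeats `..` segments and absolute paths; comparing raw strings would not.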

This is the difference between “it works on my laptop” and “it passes a compliance audit.”

Where Provar Is Still Essential — Even With Copilot

The most honest version of this comparison admits the obvious: AI coding agents are genuinely useful. They draft scaffolding faster than any human. They propose test scenarios from user stories. They refactor on demand. None of that goes away.

But there are six things they still can’t do for Salesforce QA, and there’s no plausible roadmap on which they will:

  • Agentforce testing: A coding agent can navigate to an Agentforce UI and type. It cannot test the underlying AI agent’s behaviour — prompt variation, structured response evaluation, hallucination detection, or behavioural drift monitoring over time. Provar was first to market with API-level Agentforce testing.
  • Test plan and coverage governance: Coding agents emit .spec files. They do not maintain a model of which business processes those files cover, which ones don’t, or which regressions are open. Provar Quality Hub does.
  • Non-engineer participation: Every Playwright + coding agent workflow requires engineers. QA analysts, BAs, and manual testers are locked out. Provar’s low-code Test Builder keeps the entire team contributing.
  • Triannual release alignment: When a Winter, Spring, or Summer release ships, Playwright tests break at the metadata layer. Provar participates in Salesforce beta programs and ships pre-tested compatibility.
  • Enterprise support and SLA: GitHub issues are not a support contract. Provar provides a dedicated CSM, the Success Portal, and University of Provar with real accountability for testing outcomes.
  • Execution infrastructure: Provar Grid handles scalable parallel execution. Playwright agents require your team to design, build, and maintain that infrastructure in-house.

The Honest Answer: Use Both

This is not a “Provar replaces your AI coding agent” piece. The right architecture is the boring architecture: the coding agent drafts; Provar grounds, validates, governs, and executes.

| AI Coding Agent: let it do what it’s good at | Provar MCP: where it raises the floor | Only Provar can do this |
|---|---|---|
| Generating test structure from user stories | Metadata-grounded authoring with no DOM dependency | Salesforce triannual release alignment |
| Proposing scenarios from natural language | 240 quality rules validated before execution | Agentforce AI agent testing (API-level) |
| Reviewing and refactoring test code | Coverage gap analysis across the full project | Non-engineer no-code test authoring |
| Generating test data variations | 7.5× token efficiency on Salesforce workloads | Test plan / suite / coverage governance |
| | Secure org access with full audit trail | Provar Grid managed execution |

The coding agent writes the test. Provar ensures it’s correct, stable, governed, and maintainable across the full Salesforce release cycle. Anything else is a science project you’ll be rewriting after the next release.

Want to see it run against your org? Book a demo with a Provar expert, or run a 30-day PoC alongside your current Playwright + agent stack and measure coverage, token cost, and team adoption side by side.