What AI Coding Agents Still Can't Test: Six Salesforce QA Gaps Copilot Won't Close

“We already have Copilot writing our tests.” Great. We have the work it’s leaving on the floor.

There’s a comfortable narrative making the rounds in QA leadership circles right now: AI coding agents have absorbed test automation. Engineers describe a flow in natural language; Copilot, Cursor, or Claude Code emits a Playwright spec; a Healer agent keeps it green; and the team moves on. For an isolated, generic web app, that is more or less true. For an enterprise Salesforce org, it is dangerously incomplete.

This post is an honest accounting: not the marketing version where AI agents are useless (they’re not!), but the operational version where six specific responsibilities still don’t have plausible AI-only answers.

If your QA strategy assumes coding agents have closed these gaps, this is what’s actually still open.

1. Agentforce Testing

Agentforce is no longer a curiosity. Salesforce customers are shipping AI agents into production for sales follow-up, case deflection, internal HR queries, and a growing list of regulated use cases. Those agents need to be tested — and testing an AI agent is a fundamentally different problem from testing a web form.

A Playwright + coding agent stack can do exactly one useful thing here: navigate to the Agentforce UI and type a prompt. It cannot evaluate whether the response was correct. It cannot vary the prompt across the dimensions that matter — phrasing, persona, language, edge cases. It cannot detect hallucination. And it cannot monitor behavioural drift over time, which is the hard production problem.
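
As a rough illustration of what varying the prompt across dimensions means in practice, the sketch below expands a few phrasings, personas, and languages into a test matrix. The dimension values and the build_prompt_matrix helper are hypothetical, chosen only for this example; they do not represent any particular product's API.

```python
from itertools import product

# Hypothetical dimension values; a real suite would draw these from test design.
PHRASINGS = ["How do I reset my password?", "I can't log in, please help"]
PERSONAS = ["new customer", "org admin"]
LANGUAGES = ["en", "de"]


def build_prompt_matrix(phrasings, personas, languages):
    """Return one test case per combination of phrasing, persona, and language."""
    return [
        {"prompt": phrasing, "persona": persona, "language": language}
        for phrasing, persona, language in product(phrasings, personas, languages)
    ]


cases = build_prompt_matrix(PHRASINGS, PERSONAS, LANGUAGES)
print(f"{len(cases)} prompt variations to send to the agent")
```

Even three small dimensions multiply quickly, which is why typing a single prompt into the UI covers so little of the behaviour that matters.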

To understand why drift matters: an Agentforce agent trained on one set of case data may respond materially differently to the same prompt after a retraining cycle — passing all UI-level checks while giving customers subtly different advice. A screen-scrape approach has no way to detect this. Only API-level access to prompt variation, structured response evaluation, and golden-set comparison can.
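
A minimal sketch of golden-set comparison follows, assuming agent responses have already been collected through some API into a dictionary keyed by prompt. The find_drift function, the 0.8 threshold, and the sample data are all hypothetical. The lexical similarity from Python's difflib is only a stand-in for the structured or semantic evaluation a real tool would use.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Crude lexical similarity between 0.0 and 1.0 (stand-in for a real evaluator)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_drift(golden: dict, current: dict, threshold: float = 0.8) -> list:
    """Return prompts whose current response falls below the similarity threshold."""
    drifted = []
    for prompt, expected in golden.items():
        actual = current.get(prompt, "")  # a missing response counts as drift
        if similarity(expected, actual) < threshold:
            drifted.append(prompt)
    return drifted


# Hypothetical golden set captured before a retraining cycle.
golden = {
    "Can I get a refund?": "Refunds are available within 30 days of purchase.",
    "What are your hours?": "Support is open 9am to 5pm on weekdays.",
}
# Hypothetical responses captured after retraining.
current = {
    "Can I get a refund?": "We do not offer refunds.",
    "What are your hours?": "Support is open 9am to 5pm on weekdays.",
}

print(find_drift(golden, current))
```

The point is where the comparison happens: on the response content, against a stored baseline, rather than on whether the page rendered.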

Provar was first to market with this capability. A general-purpose Healer agent does not approach the problem. If Agentforce is on your roadmap, and at this point that’s most Salesforce orgs, your test framework needs to handle AI-on-AI evaluation as a first-class concern, not as an afterthought.

2. Test Plan and Coverage Governance

Coding agents generate .spec files, and that is the entirety of their output model. They do not maintain a structured representation of which Salesforce business processes those specs cover, which ones are uncovered, which sign-offs your release process requires, or which regression gaps are currently open.

This is the difference between test files and test governance. The first is an artifact. The second is a system.

A real Salesforce QA function needs answers to questions a .spec file cannot answer:

Which of our top-50 business processes have automated coverage today?
Which haven’t been touched since the last release?
Which orphan tests exist with no parent plan or suite assignment?
What is the regression risk we are knowingly accepting for the next deployment?

Provar Quality Hub maintains the coverage map as data, not as documentation. The coding agent can keep generating tests; somebody still has to answer “are we testing the right things?”, and that somebody needs a governance layer that does not exist in any open-source Playwright workflow.
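
To make "coverage map as data" concrete, here is a small sketch, under the assumption that each test records the business process it covers and the plan it belongs to. The AutomatedTest record, the helper functions, and the sample names are hypothetical and are not Provar Quality Hub's data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AutomatedTest:
    name: str
    process: Optional[str] = None  # business process covered, if recorded
    plan: Optional[str] = None     # parent test plan, if assigned


def uncovered_processes(processes, tests):
    """Business processes with no automated test mapped to them."""
    covered = {t.process for t in tests if t.process}
    return [p for p in processes if p not in covered]


def orphan_tests(tests):
    """Tests that are not assigned to any plan."""
    return [t.name for t in tests if t.plan is None]


# Hypothetical inventory for illustration.
processes = ["Lead to Opportunity", "Case Escalation", "Quote Approval"]
tests = [
    AutomatedTest("lead_convert.spec", "Lead to Opportunity", "Release Regression"),
    AutomatedTest("case_escalate.spec", "Case Escalation"),
    AutomatedTest("scratch.spec"),
]

print("Uncovered:", uncovered_processes(processes, tests))
print("Orphans:", orphan_tests(tests))
```

None of this is hard to compute once the mapping exists. The gap is that a folder of generated specs never records the mapping in the first place.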

3. Non-Engineer Participation

This gap often gets discussed least and matters most. Every Playwright + coding agent workflow requires engineers. The deliverable is code, the review is code review, and the maintenance happens in pull requests.

That excludes the people who often know the Salesforce business processes best: QA analysts, business analysts, manual testers, and admins who have lived inside the org for years. Salesforce estimates that admins and business analysts own the majority of Salesforce configurations in enterprise orgs — not engineers. In an engineering-led testing model, those people become consumers of dashboards rather than contributors to coverage.

Provar’s low-code Test Builder solves this differently. The same metadata layer that grounds AI-generated Page Objects also powers a visual authoring environment that a non-engineer can use to extend a test, validate a new flow, or add coverage for a corner case they noticed in production. The engineering team isn’t a bottleneck. The team that knows the business contributes directly.

“Engineers only” looks fine on a slide. It looks worse twelve months into an Agentforce rollout, when the people closest to the customer impact are locked out of the test framework.

4. Triannual Release Alignment

Salesforce ships three major platform releases a year: Spring, Summer, and Winter. Each one can change LWC component patterns, field API behaviour, layout rendering, and dozens of smaller surfaces that automation depends on. This is not an unforeseeable event. It happens on a known schedule, and it breaks DOM-derived tests as a matter of physics.

The key distinction is what breaks. Healer agents operating on accessibility snapshots can sometimes patch selectors after a release — but they patch selectors, not schema changes. When Salesforce renames a field, restructures a related list, or changes a component’s render contract, there’s no selector to re-derive. The underlying test reference is gone. A DOM-aware Healer has nothing to work with.
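
The selector-versus-schema distinction can be sketched as a check against metadata rather than the DOM. The example assumes you can obtain the set of field API names before and after a release; how those snapshots are retrieved is out of scope here, and the field names and the find_broken_references helper are hypothetical.

```python
def find_broken_references(old_fields, new_fields, referenced):
    """Return referenced field API names that existed before the release but not after."""
    removed = set(old_fields) - set(new_fields)
    return sorted(f for f in referenced if f in removed)


# Hypothetical metadata snapshots taken before and after a release.
before = {"Account.Region__c", "Account.Tier__c", "Case.Priority"}
after = {"Account.Sales_Region__c", "Account.Tier__c", "Case.Priority"}

# Fields the test suite refers to by API name.
used_by_tests = ["Account.Region__c", "Case.Priority"]

print(find_broken_references(before, after, used_by_tests))
```

A renamed field shows up here as a removed name, which is exactly the case where re-deriving a selector from the rendered page cannot help.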

Provar participates in Salesforce beta programs, so compatibility is pre-tested before the release ships. Customer test suites get release-aligned updates rather than mid-release fire drills. This is the kind of work that does not show up in a feature comparison until you’ve lived through a release that breaks your AI-generated test suite the morning of a major customer demo.

5. Enterprise Support and SLA

GitHub issues are not a support contract. Community Discord servers are not a support contract. Stack Overflow is not a support contract.

If your AI-driven QA stack is open-source Playwright plus a coding agent, your escalation path when production testing breaks on the day of a release is essentially “post in a public channel and hope.” For most enterprise Salesforce customers — regulated industries, large deal cycles, contractual go-live commitments — this is operationally unacceptable.

Provar provides a dedicated Customer Success Manager, the Success Portal for structured tickets, and University of Provar for ongoing training. There’s a name on the account. There’s an SLA. There’s accountability for testing outcomes, which is what actually matters when the release is shipping and something is wrong.

This isn’t a glamorous capability. But it is the one most often missing from “we replaced our QA tool with AI” success stories — usually because the team telling that story is twelve months in and hasn’t hit a real incident yet.

6. Execution Infrastructure

Generating a test is the cheap half. Executing thousands of tests in parallel — across environments, with proper isolation, deterministic data setup, and CI/CD integration — is the expensive half.

A Playwright + coding agent setup gives you the first half. The second half — Grid orchestration, environment management, parallelisation, run scheduling, retry policy, result aggregation — is something your team builds and maintains in-house. It’s a real engineering investment that most QA functions significantly underestimate at adoption time, often equivalent to months of dedicated engineering effort annually.
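
For a sense of what that second half involves, the sketch below shows only two of the pieces, sharding tests across workers and applying a retry policy, in their simplest form. The shard and run_with_retry functions are hypothetical, and the runner is passed in as a plain callable. A production setup would add isolation, scheduling, data setup, and result aggregation on top.

```python
def shard(tests, workers):
    """Split tests round-robin into one list per worker."""
    return [tests[i::workers] for i in range(workers)]


def run_with_retry(run, test, max_attempts=3):
    """Call run(test) until it returns True or the attempts are exhausted."""
    for attempt in range(1, max_attempts + 1):
        if run(test):
            return {"test": test, "passed": True, "attempts": attempt}
    return {"test": test, "passed": False, "attempts": max_attempts}


# Hypothetical spec names and a stand-in runner that always passes.
specs = ["login.spec", "lead.spec", "case.spec", "quote.spec", "order.spec"]
for worker_id, batch in enumerate(shard(specs, 2)):
    results = [run_with_retry(lambda t: True, t) for t in batch]
    print(worker_id, [r["test"] for r in results])
```

Each simplification in this sketch (no isolation between batches, no shared reporting, no environment provisioning) is a piece someone has to build and then keep running.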

Provar Grid is the productised version of that work. Test execution scales horizontally without anyone writing a CI orchestration layer. Engineers spend their time on test logic, not on infrastructure plumbing.

What This Adds Up To

The conclusion is not that AI coding agents have no place in Salesforce QA. They do. They are excellent at drafting test scaffolding from user stories, proposing scenarios from natural language, reviewing and refactoring code, and producing test data variations. That work used to take hours. It now takes seconds.

But the parts of the QA function that matter most at enterprise scale — testing AI agents, governing coverage, including non-engineers, surviving releases, having a real support relationship, and running the actual test execution — none of those are in scope for a coding agent. They were never going to be. They aren’t even the same kind of problem.

The mature Salesforce QA architecture in 2026 looks like this: a coding agent drafts; Provar grounds the draft in metadata, validates it against 240 quality rules, slots it into a governed test plan, executes it on managed infrastructure, and keeps it aligned across the Salesforce release cycle. The two layers do different work. Pretending you can collapse them into one is how you end up with an impressive demo and a brittle production stack.

If your QA leadership conversation right now is “do we still need a Salesforce-native test platform now that we have Copilot?”, the honest answer is yes. And the more interesting question is whether your current vendor has actually closed those six gaps or just put them on a roadmap.

Curious where your current stack sits against this list? Provar’s CSMs run gap assessments against real orgs. Book some time with a Provar expert today.
