Acceptance criteria as the contract, for AI coding agents

← RCF

The acceptance criterion is the most important thing in the document chain. Not the most prestigious. Not the most discussed at planning. The most important. Everything above it exists to produce it, and everything below it exists to satisfy it. If RCF works for you, it’s because the ACs are doing the work. If it doesn’t, the ACs are almost always the reason.

What are acceptance criteria in AI-assisted development?Copy link

Acceptance criteria are short, testable statements of what a piece of software must do for a specific scenario. In AI-assisted development they do extra work, because the agent doing the coding doesn’t carry the team’s institutional memory and won’t infer intent from a hallway conversation. The AC is the contract the agent has to satisfy and the test can verify, in that order. Everything that follows is mechanics.

What an AC actually isCopy link

A Given/When/Then sentence. Given a precondition, when an event happens, then an observable thing must be true. Three clauses. One specific behaviour.

Two things matter about that shape. First, it’s testable as written. If you can’t turn a Given/When/Then into a piece of test code, the criterion isn’t one. It’s a wish. Second, it’s observable. The then clause has to describe something a test can check from the outside: a response code, a stored value, a screen state, a side effect on another system. If the only way to confirm the criterion is to read the source code, the criterion isn’t one. It’s a code review note in fancy dress.

The one-to-one ruleCopy link

Every acceptance criterion gets exactly one test suite. Every test suite implements exactly one acceptance criterion. The mapping is one to one, by construction. The AC ID and the test suite ID point at each other.

That rule is the single piece of machinery that makes the rest of RCF work. It sounds bureaucratic. It isn’t. It’s the difference between “we have tests” and “we have tests that prove the product does the thing.” Those are very different sentences.

The case for the rule is straightforward. If a criterion has zero test suites, it’s decorative. Nobody’s proving the criterion is met. If a criterion has two test suites, you’ve duplicated work and now have two places to update when the criterion changes. If a test suite covers two criteria, you can’t answer “is AC-x passing” without picking apart which assertions belong to which AC, which means you can’t answer it at all in practice. One to one removes the ambiguity. Always.

Why this matters more with AI agents in the loopCopy link

Without AI, a developer might write three tests for one criterion because that’s how the testing in their language wants to be organised. The team knows the three tests cover one AC, and life goes on. The cost of the missing formal mapping is low because the team carries it in their heads.

With AI, the team can’t carry it in their heads, because the team isn’t doing the work. An agent is. The agent doesn’t know which three tests cover which AC unless something tells it. If you ask an agent to “make sure the tests cover the criteria,” you get tests that look plausible against the criteria, and the gap between “look plausible” and “actually cover” is exactly where the hard bugs live. The one-to-one rule closes the gap. The agent writes the suite for the criterion. The suite’s name points at the criterion. The criterion’s test id points at the suite. There’s nowhere for the gap to hide.

What a good AC looks likeCopy link

A good AC is specific, observable, and self-contained. Specific means it describes one behaviour, not a family of related behaviours. Observable means the then clause is checkable without reading the source. Self-contained means the test can be written from the AC text alone, without having to reconstruct context from the wider story.

Here’s a workable one:

AC-042-02. Given a signed-in user, when they upload a file over 5 MB or in an unsupported format, then the upload is rejected with an HTTP 400 response, an error code of upload-too-large or upload-format, and no partial state is persisted in the photo store.

Three things to notice. The precondition is named (signed-in user). The triggers are named (over 5 MB; unsupported format). The expectations are concrete and checkable (HTTP 400, specific error codes, no partial state). Writing a test against that AC is mechanical work.

What a bad AC looks likeCopy link

The usual failure mode is “a thing the developer noticed they wanted to check, but wrote up like an AC.” Here’s a common shape:

AC-042-bad. The upload feature handles errors gracefully and provides good user feedback.

Not a criterion. A wish. “Gracefully” and “good user feedback” aren’t observable. Two engineers will write different tests for it. Both tests will pass. The product will still ship with a broken error flow. The fix is to split the wish into actual criteria, each of which names a specific error condition and a specific expected behaviour. That work is annoying. It’s also the work the methodology is for. Skip it and the methodology stops paying back.

Where ACs come fromCopy link

ACs sit on user stories. They’re written when the story is written, ideally by the same person who wrote the story, which is the product owner or whoever is doing that role on the project. They aren’t added by the developer at the last minute. The point of the AC is to commit, ahead of code, to what the code must achieve. Adding ACs after the code is built is reverse-engineering the test suite, and you may as well not bother.

In practice, ACs evolve. You write a story, you sketch the ACs, you start to build, you discover that one of the ACs was missing a precondition, you update the AC, you commit the update, and now you have the corrected criterion and a record that it changed. The change is a deliberate act, with a deliberate commit. That’s different from quietly editing the AC to match what the developer happened to build. One is the methodology working. The other is the methodology being papered over.

One aside on AI in this part of the chain. Writing a rounded set of ACs is painstaking work, and it’s a genuine sweet spot for AI assistance. Given well-written requirements and stories, an agent will produce a surprisingly thorough first cut of criteria, including the edge cases the author would have missed on a first pass. It needs review and rework, like any AI output, but the productivity lift is real. There’s a longer piece coming on AI-assisted upstream work; for now, take it as read that this isn’t a part of the chain where AI is unwelcome.

ID schemes and referencesCopy link

AC IDs are scoped to the story. AC-042-01 is the first criterion on US-042. The string is opaque (never renumbered, never reused) and load-bearing (test suite IDs point at it, code annotations reference it, work items mention it). On a project of any size, ACs are the thing you look up by ID most often.

The ID is intentionally human-typeable. You should be able to read an AC ID out loud in a meeting, grep for it in the repo, find it in the test runner’s output, and have all three places agree. If your IDs are UUIDs, you’ve made the chain unreadable, which means it won’t get used, which means the methodology won’t pay back.

What you get when ACs are doing their jobCopy link

Three things, and they’re the things every methodology promises and most don’t deliver.

You can answer “is this requirement satisfied” in a single query. Pull the requirement’s stories, pull their ACs, pull the suites, check they’re green. Done.

You can answer “what does this code do that matters” by tracing the function’s tests back through the chain. The chain doesn’t tell you how the function works, but it tells you why the function exists. The how is the easy part. The why is what teams keep forgetting.

You can hand a slice of work to an AI agent and have the agent’s output checked against the criteria a human committed to, instead of the criteria the agent happened to imagine while it was working. This is how RCF closes the AI trust gap on agent-generated code. The trust sits with the contract, not the agent. ACs are where it lives.