Reviewing agent work

Read the diff, run the tests, and operate as the quality gate that keeps agentic coding reliable.

Using AI · Advanced · 14 min read

Recommended first: Directing agents effectively

By the end of this lesson you will be able to:

Perform a systematic diff review that catches the most common categories of agent mistakes
Identify the red flags that indicate an agent made a consequential unintended change
Apply the trust-but-verify loop: read the diff, run the tests, try the feature manually, and read the changed files in context
Explain why human review is structurally necessary in agentic coding, not optional

The previous lessons focused on the input side of agentic coding: writing memory files, scoping tasks, giving clear instructions. This lesson focuses on the output side — what you do after the agent finishes.

The most important skill in agentic coding is reading the diff.

Not skimming it. Not assuming it is fine because the tests passed. Reading it — every changed line — with the same care you would apply to a code review from a developer you do not fully trust yet.

Why review is structurally necessary

Agents make mistakes. This is not a temporary limitation that will be fixed in the next model version — it is a structural property of systems that generate plausible text. An agent can produce code that:

Does the right thing in the case you described and the wrong thing in the cases you did not.
Removes existing error handling because it was in the way of the new feature.
Passes all the tests that existed before, while breaking an invariant that no test was covering.
Makes the change you asked for, plus several unrequested changes that interact subtly with the rest of the codebase.

None of these problems are visible from the agent's output messages. You find them only by reading the diff and the code around it.

Your review is the quality gate. An agent without a human review loop is automation without a check. You are not slowing down the process by reviewing carefully — you are the mechanism that makes the process trustworthy.

The review loop

After every agent task, run this loop in order:

1. Look at the diff.

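git status       # also lists new, untracked files, which git diff does not show
git diff HEAD    # staged and unstaged changes to tracked files

Or, if the agent created a commit:

git show         # the most recent commit only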

Read every changed line. For each change, ask two questions: did I ask for this, and is it correct?

2. Run the tests.

pytest -v        # Python
npm test         # JavaScript

Passing tests are necessary but not sufficient. Tests cover what the test author anticipated. A good review supplements the test suite with human judgment about what the tests might not cover.

3. Try the feature manually.

Invoke the thing the agent just built. Does it do what you expected? Does the output look right? Are the edge cases handled? Does it fail gracefully on bad input?

4. Read the changed files in context.

Open the files that changed and read them as whole units, not just the changed lines. The changed lines may make sense in isolation while the surrounding code reveals a problem that the diff view did not surface.

The red flag catalogue

These are the patterns to look for first when reviewing agent output:

Deleted error handling. The agent needed to restructure a function and removed a try/except block or a null check in the process. This is one of the most common and most dangerous agent mistakes. Error-handling paths rarely have tests, so their removal passes the test suite silently.

import json

# Before (what you had)
def load_config(path):
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}

# After (what the agent returned — spot the problem)
def load_config(path):
    with open(path) as f:
        return json.load(f)
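The After version raises FileNotFoundError where the Before version returned an empty dict, and an existing test suite will not notice unless something exercises the missing-file path. As a sketch of the kind of test that would have caught it (the test names and the temporary-directory setup are illustrative assumptions, and load_config is repeated here only so the block runs on its own):

```python
import json
import os
import tempfile


def load_config(path):
    """The original behaviour: a missing file yields an empty config."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}


def test_missing_file_returns_empty_dict():
    # Pins down the fallback that the agent's rewrite silently removed.
    with tempfile.TemporaryDirectory() as tmp:
        missing = os.path.join(tmp, "does_not_exist.json")
        assert load_config(missing) == {}


def test_existing_file_is_parsed():
    # The happy path, so the two behaviours are tested side by side.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "config.json")
        with open(path, "w") as f:
            json.dump({"debug": True}, f)
        assert load_config(path) == {"debug": True}
```

Run against the agent's version, the first test fails with FileNotFoundError, which turns a silent regression into a visible one.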

Hardcoded values. The agent needed a value from somewhere and decided to inline it rather than look it up properly. Common examples: hardcoded port numbers, file paths, timeout values, or user IDs that should come from configuration.

Renamed identifiers without cause. The agent renamed a variable, function, or class as part of the task — sometimes to "improve" readability. This breaks any code that referred to the old name and wasn't in the files the agent touched. Always check: was any name changed that is used outside the files in this diff?
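One way to answer that question is to search the whole project for the old name. This is a minimal sketch under stated assumptions: find_references is a hypothetical helper, it looks only at files with the given suffixes, and it matches whole words in plain text, so it cannot see names built dynamically (for example through getattr or strings in config files).

```python
import re
from pathlib import Path


def find_references(root, old_name, suffixes=(".py",)):
    """Return (relative path, line number, line) for whole-word uses of old_name."""
    pattern = re.compile(rf"\b{re.escape(old_name)}\b")
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                relative = path.relative_to(root).as_posix()
                hits.append((relative, number, line.strip()))
    return hits
```

Any hit in a file that is not part of the diff is a caller the rename has broken.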

Missing edge cases. The agent implemented the happy path correctly and gave no thought to what happens when the input is empty, negative, null, or malformed. This is not a bug you can see in the diff — it is an absence. Look at what the changed code handles and ask what it does not.
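A quick way to probe for these absences is to throw a fixed list of awkward inputs at the new code and see which ones it rejects. In the sketch below, parse_quantity is a hypothetical function standing in for whatever the agent wrote, and EDGE_CASES and rejected_inputs are illustrative names, not part of any library.

```python
def parse_quantity(raw):
    """Hypothetical function under review: parse a positive whole-number quantity."""
    if raw is None:
        raise ValueError("quantity is required")
    text = str(raw).strip()
    if not text:
        raise ValueError("quantity is empty")
    try:
        value = int(text)
    except ValueError:
        raise ValueError(f"quantity is not a whole number: {raw!r}") from None
    if value <= 0:
        raise ValueError(f"quantity must be positive: {value}")
    return value


# Inputs a happy-path implementation tends to forget:
# missing, empty, whitespace-only, negative, zero, non-numeric, fractional.
EDGE_CASES = [None, "", "   ", "-3", "0", "abc", "1.5"]


def rejected_inputs(func, cases):
    """Return the cases for which func raises ValueError."""
    rejected = []
    for case in cases:
        try:
            func(case)
        except ValueError:
            rejected.append(case)
    return rejected
```

If rejected_inputs returns fewer cases than you passed in, the remainder were either accepted silently or never considered; an exception of any other type will propagate, which is itself a signal that the input was not handled deliberately.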

Silent scope changes. Files you did not expect to see in the diff. The agent decided that something "obviously" needed updating too. Sometimes this is correct. Always verify intentionally — do not assume.
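If you wrote down which files the task should touch before starting, the comparison can be mechanical. The sketch below is one possible approach, with unexpected_files as a hypothetical helper; it assumes paths use forward slashes and relies on the fact that in Python's fnmatch a "*" also matches "/" characters.

```python
from fnmatch import fnmatchcase


def unexpected_files(changed, expected_patterns):
    """Return changed paths that match none of the expected glob patterns."""
    return [
        path
        for path in changed
        if not any(fnmatchcase(path, pattern) for pattern in expected_patterns)
    ]
```

The list of changed paths can come from git diff --name-only HEAD. Anything the function returns is a file to open and read on purpose, not necessarily a mistake.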

Test passage is not the same as correctness. A test suite that was written before the feature existed cannot test the feature. After any agent task, ask: what new behaviour was introduced, and is there now a test that covers it? If not, write one before you move on.

What to do when you find a problem

Small, localized problem: fix it yourself. You know the codebase, you can see the issue, it is faster to write the fix than to describe it.

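Problem caused by misunderstanding the intent: revert the agent's changes and give a more precise instruction. git checkout . discards unstaged edits to tracked files only; files the agent created remain until you remove them (git clean -nd previews what git clean -fd would delete), and a commit has to be undone with git revert or git reset. Describe specifically what the agent got wrong and how to do it correctly.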

Problem caused by a missing constraint: add the constraint to CLAUDE.md or to the next task instruction. The agent did not know what you wanted because you did not tell it. Update your memory file so it knows next time.

Widespread problems across many files: revert everything and reconsider whether the task was scoped too broadly. A task that produces many problems in many files usually means the goal was not well-defined before the agent started.

The compounding risk

Here is the scenario that justifies everything in this lesson:

You run an agent task, skim the diff, see that the tests pass, and ship. A week later you run an agent task that builds on that code. The agent builds on the first agent's mistakes (the missing error handling, the hardcoded value) and adds more code that makes the same assumptions. Two weeks later you have a production bug that traces back to the try/except deleted in the first task, the one whose diff you skimmed instead of reviewing.

This is not hypothetical. Agentic coding without review is technical debt at machine speed. The review loop breaks the chain.