Claude Code for Developers: Build an AI Agent

Developers 3-5 days Advanced

The short version

Most 'AI agent' projects are a chat wrapper with a system prompt wearing a trench coat. A real agent is a loop: a model that decides, calls a tool, observes the result, and decides again until the task is done. As a developer, your job is not to make it clever - it is to make it bounded, observable, and recoverable. This guide builds a working agent with Claude Code, starting from the dumbest version that runs, giving it tools rather than open-ended freedom, wrapping it in an eval harness so you can change it without fear, and hardening it against the ways agents actually fail once real inputs hit them.

What an agent actually is - and when you don't need one

An agent is a model in a loop with tools. It receives a goal, decides on an action, executes that action through a tool you gave it, observes the result, and loops until it judges the task complete or hits a stop condition. That is the whole idea. Everything that makes agents hard is a consequence of that loop being non-deterministic: the same input can take a different path, tool calls can fail, and the model can talk itself into a dead end. Before you build one, be honest about whether you need the loop at all. If the task is a fixed sequence of steps you can write out in advance, you want a pipeline with a single model call per step, not an agent - it will be cheaper, faster, and trivially debuggable. Reach for an agent only when the path genuinely cannot be known ahead of time and the system has to decide its own next move based on what it finds.

This distinction is where most developer time gets wasted. Teams build a full agentic loop for a workflow that was really just three deterministic steps, then spend weeks taming the non-determinism they introduced for no reason. The most senior move in this whole space is choosing the least agentic architecture that solves the problem, and only adding autonomy where the problem actually demands it.

The agent loop in plain terms

Strip away the frameworks and the loop is small enough to hold in your head: send the model the goal and the list of tools it may call; the model responds either with a final answer or a request to call a tool with specific arguments; you execute that tool, capture the result, append it to the conversation, and send it back; repeat. The art is entirely in the boundaries you put around that loop - how many iterations before you stop, what happens when a tool throws, how you keep the context from ballooning, and how you know, from the outside, what the agent was thinking when it did something dumb. Claude Code is excellent for building this because you can have it scaffold the loop and the tool definitions, then iterate on the boundaries with you as you watch real runs.

What you'll build

A single-purpose agent that does one genuinely open-ended task end to end - say, a research agent that takes a question, searches, reads sources, and returns a cited answer, or a triage agent that reads an incoming issue, gathers context from your codebase, and proposes a labeled disposition. It has a bounded loop, two or three real tools, structured logging of every step, and an eval suite that runs it against fixed cases so you can change the prompt or a tool without silently breaking it.

  • Claude Code installed, plus an API key for the model your agent will call
  • A clear, single task definition with an unambiguous 'done' condition
  • Two or three tools the agent needs - each a plain function with typed inputs and outputs
  • A handful of representative test cases with known-good outcomes
  • Structured logging from the start - every decision, tool call, and result
  • A hard iteration cap and a budget ceiling so a runaway loop cannot cost you

Start with the dumbest version that runs

Have Claude Code build the loop with one tool and a low iteration cap, and run it against a single test case. Resist adding tools, memory, or cleverness until that minimal loop completes a real task end to end. The reason is debugging surface area: every tool and every feature multiplies the number of paths the agent can take, and you want a working baseline you fully understand before you expand it. A dumb agent that reliably does one thing is infinitely more useful than a sophisticated one that fails in ways you cannot reproduce. Once the minimal loop works, add the second tool, run the cases again, and only then the third.

Give it tools, not freedom

The biggest reliability lever is the design of your tools, not the prompt. Each tool should do one well-defined thing, validate its own inputs, and return a clear, structured result - including a clear, structured error when something goes wrong, because the model can recover from an error it can read but not from an exception that crashes the loop. Make tools narrow on purpose. A tool called run_any_sql is a liability; a tool called get_orders_for_customer that takes a customer id and returns a typed list is one the agent cannot misuse. Every constraint you bake into a tool is a class of failure the model can no longer cause. Think of tool design as defining the agent's entire universe of possible actions, and make that universe small and safe.

Memory, state, and the context window

Two kinds of memory matter and they are different problems. Within a single run, the conversation grows with every tool result, and a long task will blow past the context window if you append blindly. Decide deliberately what stays in context - usually the goal, recent steps, and a running summary - and what gets dropped or compressed. Across runs, if your agent needs to remember things between sessions, that is real persistence: a store you write to and retrieve from, with the agent deciding what is worth saving. Do not conflate the two, and do not reach for a vector database the moment someone says memory. Most agents need a clear within-run context strategy and either no cross-run memory or a simple keyed store; the heavy retrieval machinery is for when you have proven you need it.

Where agents go wrong in production

The failures are predictable, which means you can defend against them up front. Agents loop forever when no progress is being made - so cap iterations and detect repeated identical actions. They burn money on long runs - so set a token budget and stop when it is hit. They confidently fabricate when a tool returns nothing useful - so make tools return explicit empty results the model is instructed to surface rather than paper over. They do the wrong irreversible thing - so any tool with real-world side effects should require confirmation or run against a dry-run first. And they fail invisibly without logs - so structured logging of every decision is not optional; it is the only way to understand a run that went wrong, because you cannot set a breakpoint inside the model's reasoning. Build these guards in from the first version and your agent graduates from a demo to something you can actually put in front of real inputs.

Common questions

  • When should I build an agent instead of a simple pipeline?

    Only when the path to done genuinely cannot be known in advance and the system must decide its own next step from what it finds. If the task is a fixed sequence of steps, build a pipeline with one model call per step - it is cheaper, faster, and far easier to debug. Choosing the least agentic architecture that solves the problem is the senior move.

  • What does the agent loop actually consist of?

    Send the model the goal and available tools; it responds with either a final answer or a tool call; you execute the tool, append the result, and send it back; repeat until done or a stop condition fires. The intelligence is in the boundaries you put around that loop - iteration caps, error handling, context management, and logging - not in the loop itself.

  • How do I make an agent reliable?

    Design narrow tools that validate inputs and return structured results and structured errors, start with the smallest loop that works, and wrap everything in an eval harness so changes are measurable. Reliability comes from constraining what the agent can do and being able to test it, not from clever prompting.

  • Do I need a vector database for memory?

    Usually not at first. Separate within-run context management (what stays in the conversation as it grows) from cross-run persistence (remembering between sessions). Most agents need a deliberate context strategy and either no cross-run memory or a simple keyed store. Reach for retrieval machinery only once you have proven you need it.

  • Why is an eval harness so important for agents?

    Because agents are non-deterministic, so 'it worked when I tried it' proves nothing. A harness that runs the agent against fixed cases and scores outcomes lets you change a prompt or tool and immediately see what improved and what regressed. It turns tuning from guesswork into engineering, and it is the single highest-leverage thing you build.

  • What are the most common ways agents fail in production?

    Looping forever with no progress, burning budget on long runs, fabricating when a tool returns nothing, taking wrong irreversible actions, and failing invisibly without logs. Defend with iteration caps and repeat-action detection, a token budget, explicit empty results, confirmation or dry-run on side-effecting tools, and structured logging of every step.

Keep going

Build it. Ship it. Get paid.

Step-by-step lessons for builds like this inside the club. Join Claude Code Club for $9/month.