Build an AI Agent in a Weekend with Claude Code

Developer Weekend Intermediate

What you'll build

A weekend is enough time to build a real agent if you pick one job and stay disciplined. Start with a clear task, give the agent two or three tools, add a memory layer, and write a tiny eval harness so you can tell if changes make it better or worse. Deploy as a small Hono service behind an API key.

What you're building

You're building an AI agent that does one job well. Not a chatbot, not a general assistant. An agent in the strict sense: a loop that calls a model, lets the model use tools, feeds results back, and stops when the job is done. Examples that fit a weekend: a research agent that summarizes a topic from five sources, a code review agent that comments on a pull request, a triage agent that reads support email and routes it to the right queue, or a job-board agent that finds three relevant postings every morning and writes a one-paragraph summary of each.

By Sunday night you should have a working agent you can call from the command line and from an HTTP endpoint, with a small evaluation harness that proves it does the job better than a single prompt would. That last part is what separates a real agent from a clever demo. Without an eval, every change feels like an improvement and most are sideways. With one, you can ship and iterate with the confidence of numbers.

Pick the job before the architecture. Agents that try to be 'helpful in general' produce mediocre output everywhere. Agents with a sharp definition of done outperform much larger systems on their narrow task. Write one sentence that describes the input, one sentence that describes the output, and one sentence that describes success. If you can't write those three sentences, the project isn't ready for code yet.

What you need before you start

You need a Claude API key or an OpenAI key, Node 20 or later, and a comfort level with TypeScript. You need to have read Anthropic's tool-use docs once, even skimmed. You need a way to test the agent against real inputs, which means having ten to twenty example tasks written down before you write any code. Without examples, you can't tell if your agent is improving or just changing. The examples should include three or four cases the model will probably get wrong on the first try. Those are the ones worth optimizing against, because the easy ones tell you nothing about quality.

Claude Code installed locally, plus an Anthropic API key
Node 20 or later and pnpm
@anthropic-ai/sdk in your package.json
Hono for the HTTP server, or fastify if you prefer
A SQLite file or a Turso database for memory
Ten to twenty example tasks with expected behavior

Saturday morning: the agent loop

Start with a single file, agent.ts, that exports a runAgent function. The function takes a task string, calls the model with a system prompt that describes the job, and exposes two or three tools. Use Anthropic's tool_use API, not a wrapper framework. You'll understand the loop better and debug faster. The loop is simple: send messages, check for tool_use blocks in the response, run the tools, append the results as tool_result blocks in a new user message, send again. Stop when the response's stop_reason is end_turn and it contains no tool_use blocks.
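The loop described above can be sketched as follows. This is a minimal sketch, not a definitive implementation: the block shapes mirror the Messages API but are declared locally, and callModel is a hypothetical seam that would wrap the SDK's client.messages.create call (with your system prompt and tool definitions) in a real agent. Injecting it keeps the loop testable without a network call.

```typescript
// Local copies of the content-block shapes used by the Messages API.
type TextBlock = { type: "text"; text: string };
type ToolUseBlock = { type: "tool_use"; id: string; name: string; input: unknown };
type ToolResultBlock = { type: "tool_result"; tool_use_id: string; content: string; is_error?: boolean };
type ContentBlock = TextBlock | ToolUseBlock | ToolResultBlock;
type AgentMessage = { role: "user" | "assistant"; content: string | ContentBlock[] };
type ModelReply = { stop_reason: string; content: Array<TextBlock | ToolUseBlock> };

// Hypothetical seam: wraps the real model call in production, a fake in tests.
type CallModel = (messages: AgentMessage[]) => Promise<ModelReply>;
type ToolImpl = (input: unknown) => Promise<string>;

async function runAgent(
  task: string,
  callModel: CallModel,
  tools: Record<string, ToolImpl>,
  maxIterations = 10,
): Promise<string> {
  const messages: AgentMessage[] = [{ role: "user", content: task }];

  for (let i = 0; i < maxIterations; i++) {
    const reply = await callModel(messages);
    messages.push({ role: "assistant", content: reply.content });

    const toolUses = reply.content.filter((b): b is ToolUseBlock => b.type === "tool_use");
    if (toolUses.length === 0) {
      // No tool calls left: the text blocks are the final answer.
      return reply.content
        .filter((b): b is TextBlock => b.type === "text")
        .map((b) => b.text)
        .join("\n");
    }

    // Run every requested tool and send all results back in one user message.
    const results: ToolResultBlock[] = [];
    for (const use of toolUses) {
      const impl = tools[use.name];
      const content = impl ? await impl(use.input) : `Unknown tool: ${use.name}`;
      results.push({ type: "tool_result", tool_use_id: use.id, content, is_error: !impl });
    }
    messages.push({ role: "user", content: results });
  }

  throw new Error(`Agent stopped after ${maxIterations} iterations without finishing`);
}
```

The iteration cap lives in the loop itself, so a stuck agent fails loudly instead of running forever.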

Resist using LangChain, LlamaIndex, or any framework this weekend. They will save you an hour on Saturday and cost you four hours on Sunday when something goes wrong inside their abstraction. Raw SDK calls are around two hundred lines of TypeScript total. You can read all of it, which means when the agent behaves oddly you can find the bug in fifteen minutes rather than reading three layers of someone else's documentation.

Spend time on the system prompt. It's the agent's job description, and a tight one is worth more than another tool. Tell the model what it is, what success looks like, the format of the final answer, and what to do when stuck. A good system prompt is around two hundred words, opinionated, and rewritten three or four times during the weekend as you watch the agent fail on real tasks.

Saturday afternoon: tools

Tools are functions the model can call. Each one needs a JSON schema, a description, and a TypeScript implementation. Start with two: a search tool that hits Brave Search or Exa, and a fetch tool that downloads a URL and returns clean text via something like @mozilla/readability. If your agent writes anything to a database or file system, add a third tool for that, but keep the surface narrow. Every tool you add is a new failure mode, a new schema to maintain, and a new branch in the model's decision tree.

Tool descriptions matter more than tool names. The model decides which tool to call by reading the description, not the function signature. Write each one as if you were explaining to a smart intern when to use it and when not to. Include one example of a good input and one of a bad input. The descriptions should be three to five sentences. Anything shorter is too vague, anything longer crowds the context window.
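As an illustration, a fetch tool definition might look like the sketch below. The name/description/input_schema shape follows Anthropic's tool format; the tool name fetch_page, the description wording, and the missingFields helper are all hypothetical choices made for this example.

```typescript
// A simplified tool-definition type: name, description, and a JSON Schema for the input.
interface ToolDefinition {
  name: string;
  description: string;
  input_schema: {
    type: "object";
    properties: Record<string, { type: string; description: string }>;
    required: string[];
  };
}

// Hypothetical fetch tool. The description says when to use it, when not to,
// and gives one good and one bad input, in roughly four sentences.
const fetchPageTool: ToolDefinition = {
  name: "fetch_page",
  description: [
    "Downloads one web page and returns its main article text with navigation and ads removed.",
    "Use it after search, when a result looks relevant and you need the full content rather than the snippet.",
    "Do not use it to discover pages, and do not call it twice on the same URL.",
    "Good input: https://example.com/blog/post-title. Bad input: a bare search query such as 'best sqlite tips'.",
  ].join(" "),
  input_schema: {
    type: "object",
    properties: {
      url: { type: "string", description: "Absolute http(s) URL of the page to download" },
    },
    required: ["url"],
  },
};

// Cheap check before running a tool: list any required fields the model left out.
function missingFields(tool: ToolDefinition, input: Record<string, unknown>): string[] {
  return tool.input_schema.required.filter((key) => input[key] === undefined);
}
```

Returning the list of missing fields, rather than throwing, lets you hand the model a readable error it can correct on the next turn.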

Saturday evening: memory

Memory is where weekend agents usually break. You need two kinds. Session memory is just the message list, which you keep in an array and resend on every call within a single run, because the API itself is stateless. Long-term memory is anything the agent should remember between runs, and that needs a store. Use SQLite via better-sqlite3 for local dev and Turso for production. Define a tiny schema: a memories table with id, text, embedding, and created_at. Add a remember tool that writes, and a recall tool that does a similarity search.
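One possible shape for that schema is sketched below, along with helpers for storing an embedding in a single column. Storing the vector as a BLOB of float32 values is an assumption made here, not something the schema requires; a JSON text column would also work. The database wiring is left out so the sketch stays self-contained.

```typescript
// Assumed schema: the four columns named above, with the embedding as raw bytes.
const CREATE_MEMORIES_TABLE = `
  CREATE TABLE IF NOT EXISTS memories (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    text TEXT NOT NULL,
    embedding BLOB NOT NULL,
    created_at TEXT NOT NULL DEFAULT (datetime('now'))
  )
`;

// Pack a vector into float32 bytes for the BLOB column (4 bytes per dimension).
function packEmbedding(vector: number[]): Uint8Array {
  return new Uint8Array(new Float32Array(vector).buffer);
}

// Unpack bytes read from the BLOB column back into plain numbers.
function unpackEmbedding(bytes: Uint8Array): number[] {
  // Copy into a fresh buffer so the float view starts at offset 0,
  // even if the input is a view into a larger pooled buffer.
  const copy = new Uint8Array(bytes);
  return Array.from(new Float32Array(copy.buffer, 0, copy.byteLength / 4));
}
```

With better-sqlite3 you would likely bind the packed bytes as a Node Buffer, for example via Buffer.from(packEmbedding(vector)), and pass the Buffer you read back straight into unpackEmbedding.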

Generate embeddings with text-embedding-3-small from OpenAI, or with voyage-3 from Voyage AI, the embeddings provider Anthropic's docs point to. Either works. Don't introduce a vector database this weekend. SQLite with cosine similarity in TypeScript is plenty fast for under ten thousand memories. A brute-force scan at that size costs milliseconds per query, not counting the embedding call, and the operational footprint is one file on disk. You can move to pgvector or Qdrant later when traffic justifies the extra surface area.
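The similarity search itself is only a few lines. The sketch below assumes the rows have already been loaded from SQLite and their embeddings decoded into number arrays; MemoryRow and recall are illustrative names.

```typescript
interface MemoryRow {
  id: number;
  text: string;
  embedding: number[];
}

// Cosine similarity: dot product divided by the product of the vector lengths.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Vectors must have the same length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  // A zero vector has no direction, so treat it as unrelated to everything.
  return denom === 0 ? 0 : dot / denom;
}

// Brute-force recall: score every row, sort by score descending, keep the best topK.
function recall(query: number[], rows: MemoryRow[], topK = 5): Array<MemoryRow & { score: number }> {
  return rows
    .map((row) => ({ ...row, score: cosineSimilarity(query, row.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}
```

A full scan like this is linear in the number of memories, which is why it stays comfortable at the scale discussed above and stops being comfortable well beyond it.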

Sunday morning: evaluation

This is the step most people skip and most people regret. Write an evals.ts file that loads your ten to twenty example tasks, runs the agent against each, and scores the output. Scoring can be exact match, regex, or a second call to Claude that judges the result. Run the evals before and after every change. If a tweak that felt smarter scores lower, undo it. Without this loop you're just vibing, which feels productive while you're typing and depressing when you can't tell if last week's version was actually better.

Keep the eval cheap and fast. The whole suite should run in under two minutes and cost less than a dollar. If it doesn't, you'll stop running it, which defeats the point. Start with five examples, get the harness clean, then grow to twenty. Examples should be the actual tasks you want the agent to handle in production, not synthetic toy problems, because synthetic improvements don't transfer.

1. Define ten to twenty tasks with expected behavior
2. Write a runner that loops over them
3. Score each result with a deterministic check or a judge prompt
4. Save the run as JSON with timestamp and git commit
5. Compare runs before and after each change
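The steps above can be sketched as a small runner. Everything named here is illustrative: AgentFn stands in for your runAgent function, JudgeFn is a hypothetical second model call that returns pass or fail for a rubric, and the commit hash is passed in rather than read from git so the sketch has no dependencies.

```typescript
type Scorer =
  | { kind: "exact"; expected: string }
  | { kind: "regex"; pattern: string }
  | { kind: "judge"; rubric: string };

interface EvalTask { id: string; input: string; scorer: Scorer }
interface EvalResult { id: string; passed: boolean }
interface EvalRun { timestamp: string; commit: string; passRate: number; results: EvalResult[] }

type AgentFn = (input: string) => Promise<string>;
type JudgeFn = (rubric: string, output: string) => Promise<boolean>;

// Score one output with a deterministic check, a regex, or the judge.
async function scoreOutput(output: string, scorer: Scorer, judge: JudgeFn): Promise<boolean> {
  switch (scorer.kind) {
    case "exact":
      return output.trim() === scorer.expected.trim();
    case "regex":
      return new RegExp(scorer.pattern).test(output);
    case "judge":
      return judge(scorer.rubric, output);
  }
}

// Run every task in order; an agent crash counts as a failed task.
async function runEvals(tasks: EvalTask[], agent: AgentFn, judge: JudgeFn, commit: string): Promise<EvalRun> {
  const results: EvalResult[] = [];
  for (const task of tasks) {
    let passed = false;
    try {
      passed = await scoreOutput(await agent(task.input), task.scorer, judge);
    } catch {
      passed = false;
    }
    results.push({ id: task.id, passed });
  }
  const passRate = tasks.length === 0 ? 0 : results.filter((r) => r.passed).length / tasks.length;
  return { timestamp: new Date().toISOString(), commit, passRate, results };
}

// Compare two runs: which task ids passed before and fail now?
function regressions(before: EvalRun, after: EvalRun): string[] {
  const passedBefore = new Set(before.results.filter((r) => r.passed).map((r) => r.id));
  return after.results.filter((r) => !r.passed && passedBefore.has(r.id)).map((r) => r.id);
}
```

An EvalRun serializes cleanly with JSON.stringify, so saving each run to a file named after its timestamp is enough to compare any two versions later. Looking at regressions, not only the pass rate, catches changes that fix one task while breaking another.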

Choices to make along the way

Claude versus GPT versus a local model: Claude is the best at following multi-step tool plans as of mid-2026, and the Anthropic SDK has the cleanest tool-use API. GPT-4-class models are competitive and cheaper at scale. Local models via Ollama are tempting but tool-use reliability is still uneven below thirty billion parameters. Start with Claude, swap later if you need to.

Hono versus Fastify versus a Cloudflare Worker: Hono is the right default because the same application code runs on Node (through the @hono/node-server adapter), Bun, and Cloudflare Workers. If you want to deploy to a Worker on Sunday night, Hono is the only one of these options that doesn't require rewriting. Fastify is fine if you're staying on Node forever.

Sunday afternoon: shipping

Wrap the agent in a Hono server with one POST endpoint, /run, behind a simple bearer-token check. Deploy to Fly, Render, or a Cloudflare Worker. Add logging that captures every tool call with timing, the full message history, and the final result. You'll want this when an agent does something weird in production. Save logs to a file in dev and to Logtail or Axiom in production. Set a per-request budget cap in dollars too, so a runaway agent can't cost you more than the cap before the loop aborts and returns a clear error.
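A per-request budget cap can be as small as the sketch below. It assumes you read input_tokens and output_tokens from the usage field of each model response; the RequestBudget class and the per-million-token prices passed to it are placeholders, so substitute the current prices for whichever model you use.

```typescript
// Token counts as reported in the usage field of a model response.
interface TokenUsage {
  input_tokens: number;
  output_tokens: number;
}

// Placeholder pricing in USD per million tokens; not real prices.
interface TokenPricing {
  inputPerMTok: number;
  outputPerMTok: number;
}

class RequestBudget {
  private spentUsd = 0;
  private readonly capUsd: number;
  private readonly pricing: TokenPricing;

  constructor(capUsd: number, pricing: TokenPricing) {
    this.capUsd = capUsd;
    this.pricing = pricing;
  }

  // Call after every model response; throws once the running total passes the cap.
  record(usage: TokenUsage): void {
    this.spentUsd +=
      (usage.input_tokens / 1e6) * this.pricing.inputPerMTok +
      (usage.output_tokens / 1e6) * this.pricing.outputPerMTok;
    if (this.spentUsd > this.capUsd) {
      throw new Error(
        `Budget cap of $${this.capUsd} exceeded (spent $${this.spentUsd.toFixed(4)})`,
      );
    }
  }

  get spent(): number {
    return this.spentUsd;
  }
}
```

Create one RequestBudget per incoming request, call record inside the agent loop, and catch the error in the HTTP handler so the caller gets a clear message instead of a dropped connection. Note that the check runs after a call has been paid for, so the real spend can overshoot the cap by one call.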

If you want others to run their own copies, push the repo to GitHub with a clean README. The club at claudecodeclub.ai shares agent repos every week, and a well-documented weekend project is a good first contribution to show off what nine dollars a month and a couple of evenings can produce. The README should include a sample task, a sample output, and a one-line install command. Anything more than three steps and people bounce.

Stream the agent's intermediate steps when calling from a UI. The model thinks for ten or twenty seconds per loop iteration, and a blank screen during that time feels broken. Server-sent events from the Hono endpoint work cleanly and Claude can write the streaming handler in one go. Even a simple 'thinking' line that updates every second turns a slow agent into a satisfying one.
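On the wire, a server-sent event is just a few lines of text ending in a blank line. The sketch below formats agent steps by hand to show that format; the AgentStepEvent variants are hypothetical, and in practice you would probably hand the same data to your framework's streaming helper rather than build the string yourself.

```typescript
// Hypothetical step events an agent loop might emit while it works.
type AgentStepEvent =
  | { type: "thinking"; iteration: number }
  | { type: "tool_call"; tool: string }
  | { type: "final"; answer: string };

// One SSE frame: an event name, a single data line, then a blank line.
// JSON.stringify escapes newlines, so the payload always fits on one data line.
function formatSseEvent(step: AgentStepEvent): string {
  return `event: ${step.type}\ndata: ${JSON.stringify(step)}\n\n`;
}
```

On the client, an EventSource listener for each event name can update the 'thinking' line or append a tool call to the visible log as frames arrive.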

How to extend it

After v1, the natural next steps are a planning layer that breaks a task into subtasks before the loop starts, a reflection step where the agent critiques its own output before returning, and a longer-term memory that summarizes past runs into a profile. Each of those is a weekend project on its own. Don't try to add them in the first weekend. Ship the simple loop, watch it run for a week, then add the layer the runs actually need.

Common gotchas

Four mistakes show up in almost every weekend agent. The first is not handling the case where the model calls a tool and then calls it again with the same arguments because it didn't like the first result; add a small loop counter and cap the number of iterations at ten. The second is forgetting to truncate the conversation history when it gets long, which makes calls slow and expensive. The third is forgetting to set a timeout on tool calls, which lets a stuck fetch hang the whole agent. Finally, don't trust your own judgment about whether the agent got smarter after a change. Run the evals.
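For the timeout, a generic wrapper around any tool promise is usually enough. This is a sketch using Promise.race; withTimeout is an illustrative helper, and it only stops waiting. It does not cancel the underlying work, so for network calls you may also want to pass an abort signal to the request itself.

```typescript
// Reject if `work` has not settled within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms} ms`)), ms);
  });
  // Whichever settles first wins; always clear the timer so it cannot keep the process alive.
  return Promise.race([work, timeout]).finally(() => {
    if (timer !== undefined) clearTimeout(timer);
  });
}
```

Wrapping each tool implementation once, where the tools are registered, means no individual tool has to remember to do it.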

One more: not handling structured tool errors. When a tool fails, return a tool_result with an error message the model can read, not a thrown exception that crashes the loop. The model is good at recovering from a clear error message and bad at recovering from a silent abort. A weekend agent that handles errors gracefully feels twice as smart as one that doesn't, even when the underlying logic is identical.
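A small wrapper makes that pattern the default. The sketch below assumes the tool_result shape from Anthropic's tool-use format, including its optional is_error flag; safeToolResult and the wording of the error message are illustrative.

```typescript
// The result block sent back to the model for one tool call.
interface ToolResult {
  type: "tool_result";
  tool_use_id: string;
  content: string;
  is_error?: boolean;
}

// Run a tool and always return a tool_result, turning exceptions into readable errors.
async function safeToolResult(toolUseId: string, run: () => Promise<string>): Promise<ToolResult> {
  try {
    return { type: "tool_result", tool_use_id: toolUseId, content: await run() };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return {
      type: "tool_result",
      tool_use_id: toolUseId,
      // Say what failed and suggest a next step, so the model can change course.
      content: `Tool failed: ${message}. Try different arguments or another tool.`,
      is_error: true,
    };
  }
}
```

Because the wrapper never throws, the agent loop keeps running and the model sees the failure on its next turn.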

Common questions

Why avoid LangChain and similar frameworks?

They save you an hour on Saturday and cost you four on Sunday when something breaks inside their abstraction. Raw Anthropic SDK calls are around two hundred lines of TypeScript total, and you can debug every part of the loop.

How many tools should the agent have?

Two to start, three at most for v1. More tools confuse the model about which to call. Start with a search tool and a fetch tool, and only add a third when your evaluation runs show you need it.

Do I need a vector database for memory?

No. SQLite via better-sqlite3 with cosine similarity computed in TypeScript handles under ten thousand memories comfortably. Move to a vector database only when you cross that scale or need server-side filtering.

What does the evaluation harness actually do?

It loads ten to twenty example tasks, runs the agent against each, and scores the output with either a deterministic check, a regex, or a judge prompt sent back to Claude. You compare scores before and after each change to know if you're improving.

Claude or GPT for the model?

Claude is currently the best at multi-step tool plans, and the Anthropic SDK has the cleanest tool-use API. Start there. GPT-4-class models are competitive and cheaper at high volume, so swap if cost becomes the constraint.

Why cap the iteration loop?

Models sometimes call the same tool repeatedly when they don't like a result. A loop counter capped at ten iterations stops a stuck agent from burning your API budget while you sleep.
