Scrape and Summarize Data with Claude Code in a Weekend

Anyone Weekend Beginner

What you'll build

You can scrape a public site and turn the raw data into useful summaries in a weekend. Use Playwright to grab the pages, store the raw HTML in a folder, parse the structured fields with a small extractor, and let Claude write a one-paragraph summary per record. The whole pipeline is about two hundred lines of code.

What you're building

You're building a small pipeline that turns a public website into a clean dataset and a set of human-readable summaries. Examples that fit a weekend: every Hacker News Show HN post from the last month with a one-paragraph summary, every event listing in your city for the next two weeks, every job posting on a niche board with a tag for senior versus junior, every product on a small e-commerce site you're researching for a competitor analysis. The pattern is the same in every case: list page, detail pages, structured fields, one-paragraph summary, output file.

By Sunday the output is a JSON file with structured records, a Markdown file with the summaries, and a tiny CLI you can re-run weekly. No web app, no database, no users. Just a script you own that turns the internet into something you can paste into a doc. Once you have one of these working, you'll find five more uses for the same shape. It's the kind of small, sturdy automation that builders end up running for years.

What you need before you start

You need Node 20, a Claude API key, and the willingness to read the terms of service of the site you're scraping. Public, non-personal data with no login wall is the safe zone. Anything behind a login or anything with personal information is not what this guide is for. If in doubt, don't, and ask in the club at claudecodeclub.ai where other builders can flag legal landmines. The line between research and infringement is fuzzy in some jurisdictions and sharp in others, and an hour of caution is cheaper than a takedown letter.

Node 20 and pnpm
Claude Code installed locally
Playwright for browser automation
@anthropic-ai/sdk for the summaries
A folder you can write JSON and HTML to
A clear definition of what fields you want to extract

Saturday morning: the fetch layer

Set up a Node project with TypeScript. Install Playwright and run its install command so the browser binaries download. Write a small fetch.ts that takes a URL, opens a Chromium page, waits for the right selector, saves the full HTML to a data/raw/{slug}.html file, and closes. Always save the raw HTML to disk. Always. When the parsing breaks two weeks from now, you'll be glad you don't have to refetch, and you'll have a permanent archive of the source pages at the moment your records were created.
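The save step can be sketched as follows. This is a minimal sketch, not a definitive implementation: it assumes the HTML string comes from Playwright's `page.content()`, and the helper names `slugFor` and `saveRawHtml`, along with the slug scheme itself, are hypothetical choices made for illustration.

```typescript
import { createHash } from "node:crypto";
import { mkdirSync, writeFileSync } from "node:fs";
import { join } from "node:path";

// Hypothetical helper: turn a URL into a filesystem-safe slug.
// A short hash suffix keeps URLs that differ only in punctuation from colliding.
function slugFor(url: string): string {
  const { hostname, pathname, search } = new URL(url);
  const readable = `${hostname}${pathname}${search}`
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .slice(0, 80)
    .replace(/^-+|-+$/g, "");
  const hash = createHash("sha256").update(url).digest("hex").slice(0, 8);
  return `${readable}-${hash}`;
}

// Write the page HTML to data/raw/{slug}.html and return the file path.
function saveRawHtml(url: string, html: string, dir = "data/raw"): string {
  mkdirSync(dir, { recursive: true });
  const file = join(dir, `${slugFor(url)}.html`);
  writeFileSync(file, html, "utf8");
  return file;
}
```

Because the slug is derived only from the URL, re-fetching the same page overwrites the same file, and the parser can find it later without a lookup table.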

Add a small delay between requests, two to five seconds, and a polite user agent string with a contact email. Most sites tolerate slow scrapers and block fast ones. Playwright launches browsers headless by default, so pass headless: false during development and watch what the browser is doing while you iterate. Switch back to headless once the script works. The visual feedback during dev catches the kinds of bugs logs hide, like a cookie banner that covered the content you wanted to scrape.
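A sketch of the pacing logic, under a few assumptions: the bot name and contact address are placeholders, and `fetchOne` stands for whatever function loads and saves a single page. With Playwright, the user agent string would typically be passed as the `userAgent` option of `browser.newContext`, and the visible browser is enabled with `chromium.launch({ headless: false })`.

```typescript
// Placeholder values: replace the bot name and contact address with your own.
const USER_AGENT = "weekend-digest-bot/0.1 (contact: you@example.com)";

// Pick a random delay between minMs and maxMs, both inclusive.
function politeDelayMs(minMs = 2000, maxMs = 5000): number {
  return minMs + Math.floor(Math.random() * (maxMs - minMs + 1));
}

function sleep(ms: number): Promise<void> {
  return new Promise<void>((resolve) => setTimeout(resolve, ms));
}

// Visit URLs one at a time, pausing before every request except the first.
async function crawlPolitely(
  urls: string[],
  fetchOne: (url: string) => Promise<void>,
  delayMs: () => number = politeDelayMs,
): Promise<void> {
  for (let i = 0; i < urls.length; i++) {
    if (i > 0) await sleep(delayMs());
    await fetchOne(urls[i]);
  }
}
```

Randomizing the delay within the two-to-five-second window is a design choice, not a requirement; a fixed delay works as well.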

Handle the cookie banner explicitly. Many sites hide their content behind a consent dialog until it is dismissed, and Playwright will happily save a blank page. Write a small helper that dismisses the common banners by clicking buttons matching 'Accept all', 'Reject all', or 'Continue', and run it on every page before the main wait. One helper like this covers most sites, and you can add labels as you meet new banners.
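One way to write that helper is sketched below. To keep the example self-contained it is typed against two small interfaces, `PageLike` and `ClickableLike`, which are hypothetical names; the assumption is that Playwright's `Page` and `Locator` satisfy them structurally through `getByRole`, `first`, `isVisible`, and `click`, so a real page can be passed in unchanged.

```typescript
// Minimal shapes for the parts of a page this helper touches.
// Assumption: Playwright's Page and Locator satisfy these structurally.
interface ClickableLike {
  isVisible(): Promise<boolean>;
  click(): Promise<void>;
}
interface PageLike {
  getByRole(role: "button", options: { name: string }): { first(): ClickableLike };
}

// Labels are tried in order; reorder them if you prefer to reject by default.
const BANNER_LABELS = ["Accept all", "Reject all", "Continue"];

// Click the first visible consent button, if any.
// Returns the label that was clicked, or null when no banner was found.
async function dismissCookieBanner(
  page: PageLike,
  labels: string[] = BANNER_LABELS,
): Promise<string | null> {
  for (const name of labels) {
    const button = page.getByRole("button", { name }).first();
    if (await button.isVisible()) {
      await button.click();
      return name;
    }
  }
  return null;
}
```

Logging the return value on each page makes it easy to spot sites where no known label matched and the saved HTML may be blank.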

Saturday afternoon: extraction

Now turn the raw HTML into a list of records. Use cheerio or node-html-parser for static pages and Playwright's own page.evaluate for pages that need JavaScript to render. Define a TypeScript type for the record you want. Five to ten fields is the sweet spot. Ask Claude Code to write the extractor function with the type as the goal and a sample HTML file as the input. Run it on five pages, inspect the JSON, fix what's wrong, run it again. The iteration loop is the whole job, and it's fast because everything runs locally against cached HTML.

Save the extracted records to data/records.json as one JSON array. Keep the file in git so you can see what changed week over week. If the file is big, switch to JSONL, one record per line, which diffs better in git than one giant array. Even small datasets feel different when you can see the weekly diff. New records show up as added lines. Removed ones show up as deletions. Git becomes the audit trail for free.
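The JSONL variant can be sketched with the standard library alone. The function names are hypothetical, and the record type is left generic so the helpers work with whatever fields you defined.

```typescript
import { readFileSync, writeFileSync } from "node:fs";

// Serialize records as JSONL: one JSON object per line, with a trailing newline.
function toJsonl<T>(records: T[]): string {
  return records.map((r) => JSON.stringify(r) + "\n").join("");
}

// Parse JSONL text back into records, skipping blank lines.
function parseJsonl<T>(text: string): T[] {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as T);
}

function writeJsonl<T>(file: string, records: T[]): void {
  writeFileSync(file, toJsonl(records), "utf8");
}

function readJsonl<T>(file: string): T[] {
  return parseJsonl<T>(readFileSync(file, "utf8"));
}
```

Sorting the records by a stable key such as the id before writing keeps the weekly git diff limited to real additions and removals.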

Choices to make along the way

Playwright versus a fetch-and-parse approach: Playwright is heavier but handles JavaScript-rendered pages, infinite scroll, and login flows. Plain fetch plus cheerio is faster and uses less memory but breaks on any site built with a modern frontend. Start with Playwright unless you know the site renders on the server.

Claude versus GPT for the summaries: either works at this scale. Claude is better at following a strict format like 'two sentences, no marketing words, neutral tone.' GPT-4o-mini is cheaper if you're processing thousands of records. Set up the call so swapping the model is a one-line change.

Sunday morning: the summary layer

Loop over the records and call Claude with a strict prompt for each. The prompt should specify the audience, the length, the tone, and the format. Include the structured fields directly in the prompt so the model isn't re-parsing HTML. Cache the result keyed by record id, together with a hash of the structured fields, so re-runs skip unchanged records and re-summarize only the ones whose fields changed. When most records are unchanged from week to week, caching removes most of the API bill from the second run onward, and it saves you from accidentally regenerating summaries at three in the morning when a cron job retries.
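A sketch of the cache check, keyed by record id with a hash of the structured fields to detect changes. Several things here are assumptions: the `ListingRecord` shape and the prompt wording are invented for illustration, and `callModel` is a placeholder for a function that wraps the actual API call (for example via @anthropic-ai/sdk). Persisting the cache object to disk between runs is left out.

```typescript
import { createHash } from "node:crypto";

// Hypothetical record shape; swap in your own fields.
interface ListingRecord {
  id: string;
  title: string;
  url: string;
  description: string;
}

type SummaryCache = Record<string, { hash: string; summary: string }>;

// Hash the fields that feed the prompt, so any change invalidates the cached summary.
function fieldsHash(record: ListingRecord): string {
  const payload = JSON.stringify([record.title, record.url, record.description]);
  return createHash("sha256").update(payload).digest("hex");
}

// Build the prompt from structured fields only; no raw HTML.
function buildPrompt(record: ListingRecord): string {
  return [
    "Summarize this listing for a busy reader in two sentences.",
    "Neutral tone, no marketing words, plain text only.",
    `Title: ${record.title}`,
    `URL: ${record.url}`,
    `Description: ${record.description}`,
  ].join("\n");
}

// Return the cached summary when the fields are unchanged; otherwise call the model.
async function summarizeCached(
  record: ListingRecord,
  cache: SummaryCache,
  callModel: (prompt: string) => Promise<string>,
): Promise<string> {
  const hash = fieldsHash(record);
  const hit = cache[record.id];
  if (hit && hit.hash === hash) return hit.summary;
  const summary = await callModel(buildPrompt(record));
  cache[record.id] = { hash, summary };
  return summary;
}
```

Passing the model call in as a function also keeps the model swap a one-line change, since only the wrapper knows which provider it talks to.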

Run the calls in parallel with a small concurrency limit, six to eight, using p-limit. The Anthropic API typically handles that load fine, and the wall-clock time drops from minutes to seconds. Go much higher and you may start hitting rate limits, which depend on your account tier and mean retries and unpredictable bills. Six is a safe starting point for a weekend project.
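To show what the concurrency limit does, here is a small dependency-free stand-in. It is not p-limit and does not reproduce its API; `mapWithLimit` is a hypothetical helper that illustrates the same idea of capping the number of calls in flight.

```typescript
// Run an async task for every item with at most `limit` tasks in flight.
// Results come back in input order.
async function mapWithLimit<T, R>(
  items: T[],
  limit: number,
  task: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  // Each worker claims the next unprocessed index until the list is exhausted.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await task(items[index], index);
    }
  }

  const workerCount = Math.max(1, Math.min(limit, items.length));
  const workers: Promise<void>[] = [];
  for (let i = 0; i < workerCount; i++) workers.push(worker());
  await Promise.all(workers);
  return results;
}
```

A call such as `mapWithLimit(records, 6, summarizeOne)` would keep six summaries in flight at a time, where `summarizeOne` is whatever function summarizes a single record.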

Sunday afternoon: the output

Generate two output files. A Markdown digest at output/digest.md with a heading per record, the summary, and a link back to the source. A CSV at output/records.csv with the structured fields, so you can open it in a spreadsheet. Both files regenerate every run, so don't edit them by hand. If you need annotations, add a notes field to the source records and let the renderer pick it up.
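The CSV half can be sketched in a few lines. The quoting follows the usual convention of wrapping a cell in double quotes when it contains a comma, quote, or line break; the function names and the idea of passing an explicit column list are assumptions made for illustration.

```typescript
// Quote a cell when it contains a comma, double quote, or line break.
function csvCell(value: unknown): string {
  const text = value === null || value === undefined ? "" : String(value);
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Render rows as CSV, with a header row taken from `columns`.
function toCsv(rows: Array<Record<string, unknown>>, columns: string[]): string {
  const lines = [columns.map(csvCell).join(",")];
  for (const row of rows) {
    lines.push(columns.map((column) => csvCell(row[column])).join(","));
  }
  return lines.join("\n") + "\n";
}
```

Listing the columns explicitly fixes the column order between runs, so a spreadsheet built on last week's file still lines up with this week's.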

Sort the digest by something meaningful. Newest first works for events and job boards. Score-based ordering works for product research. A digest in the order the scraper happened to crawl pages is the same as no order, which makes the file harder to skim than it needs to be.
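A newest-first digest could be rendered like this. The `DigestEntry` shape and the `publishedAt` field are hypothetical, and the sketch assumes dates are stored as ISO 8601 strings, which sort correctly as plain text.

```typescript
interface DigestEntry {
  title: string;
  url: string;
  summary: string;
  publishedAt: string; // ISO 8601 date, e.g. "2024-05-01"
}

// Newest first; ties are broken by title so the order is stable between runs.
function sortNewestFirst(entries: DigestEntry[]): DigestEntry[] {
  return entries.slice().sort((a, b) => {
    if (a.publishedAt !== b.publishedAt) return a.publishedAt < b.publishedAt ? 1 : -1;
    return a.title.localeCompare(b.title);
  });
}

// One heading per record, then the summary and a link back to the source.
function renderDigest(entries: DigestEntry[]): string {
  return sortNewestFirst(entries)
    .map((e) => `## ${e.title}\n\n${e.summary}\n\n[Source](${e.url})\n`)
    .join("\n");
}
```

For score-based ordering, only the comparison inside `sortNewestFirst` needs to change.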

How to test it

Run the full pipeline on five records first. Spot check every field against the source page. Fix the extractor until the JSON matches the page. Then run on twenty. Then run on the full set. Don't run on the full set until five is correct. The Anthropic API will happily charge you for two thousand records of wrong summaries if your extractor returns garbage.

Write a small validator that checks each record has the required fields, the title isn't empty, the URL is a valid URL, and any numbers are within plausible ranges. Run the validator on every output and fail the run if more than five percent of records are bad. A noisy validator catches the silent regressions when a site changes its markup and your extractor starts pulling the wrong field.
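The validator described above might look like the following sketch. The `price` field and its plausible range are invented for illustration; substitute whichever numeric fields your records carry.

```typescript
interface CandidateRecord {
  title?: unknown;
  url?: unknown;
  price?: unknown; // hypothetical numeric field
}

function isHttpUrl(value: string): boolean {
  try {
    const { protocol } = new URL(value);
    return protocol === "http:" || protocol === "https:";
  } catch {
    return false;
  }
}

// Return a list of problems for one record; an empty list means it passed.
function validateRecord(record: CandidateRecord): string[] {
  const problems: string[] = [];
  if (typeof record.title !== "string" || record.title.trim() === "") {
    problems.push("title is missing or empty");
  }
  if (typeof record.url !== "string" || !isHttpUrl(record.url)) {
    problems.push("url is not a valid http(s) URL");
  }
  const price = record.price;
  if (typeof price !== "number" || !isFinite(price) || price < 0 || price > 1000000) {
    problems.push("price is outside the plausible range");
  }
  return problems;
}

// Throw when more than maxBadRatio of the records fail validation.
// Returns the number of bad records otherwise.
function assertMostlyValid(records: CandidateRecord[], maxBadRatio = 0.05): number {
  const bad = records.filter((r) => validateRecord(r).length > 0).length;
  if (records.length > 0 && bad / records.length > maxBadRatio) {
    throw new Error(`${bad} of ${records.length} records failed validation`);
  }
  return bad;
}
```

Calling `assertMostlyValid` right after extraction, before any summaries are requested, stops a broken extractor from turning into an API bill.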

Test the summary prompt on three edge cases: a record with very little detail, a record with way too much, and a record in a language you didn't expect. Each one tells you something the well-formed cases don't. Adjust the prompt until all three produce something defensible.

How to ship it

Put the repo on GitHub and add a workflow that runs the script on a schedule, weekly is fine, and uploads the output files as an artifact. Or run it on your laptop with a calendar reminder. Either is fine for personal use. If you want the digest in your inbox, add a short call to Resend's Node SDK at the end that mails output/digest.md to yourself. The club at claudecodeclub.ai has a starter template for the email step that handles formatting nicely.

If the digest grows beyond a personal tool, publish it. A weekly newsletter built on a niche scrape is one of the cheapest, most defensible content products you can run. Substack or Beehiiv handle the distribution, your script handles the work, and you become the person who knows that niche first. The compound effect over a year of weekly issues is real.

Store the run history in a runs.jsonl file so you can answer the question 'when did this record first appear?' Some downstream uses, like new-listing alerts or trend analysis, depend on knowing what was new versus what was always there. The cost of writing one line per record is trivial and unlocks features you'll want later.
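A sketch of the run history, assuming each line of runs.jsonl holds a run timestamp and a record id. The `RunLine` shape and the function names are hypothetical.

```typescript
interface RunLine {
  runAt: string; // ISO timestamp of the run
  id: string; // record id seen in that run
}

// One line per record for this run, ready to append to runs.jsonl.
function runLines(recordIds: string[], runAt: string): string {
  return recordIds.map((id) => JSON.stringify({ runAt, id }) + "\n").join("");
}

// Map each record id to the earliest run in which it appeared.
function firstSeen(jsonl: string): Map<string, string> {
  const seen = new Map<string, string>();
  for (const line of jsonl.split("\n")) {
    if (line.trim() === "") continue;
    const { runAt, id } = JSON.parse(line) as RunLine;
    const earliest = seen.get(id);
    if (earliest === undefined || runAt < earliest) seen.set(id, runAt);
  }
  return seen;
}
```

At the end of each run, the output of `runLines(ids, new Date().toISOString())` can be appended to the file with `appendFileSync` from `node:fs`; a new-listing alert is then any id whose first-seen timestamp equals the current run.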

How to extend it

Add a second source and merge into one digest. Add a tag step where Claude classifies each record into one of a fixed set of categories. Add a 'what changed since last week' diff that highlights new records. Add a small web UI with Vite if you want to share the digest with someone else. Each of those is half a day.

Common gotchas

Not saving raw HTML is the biggest gotcha. Hammering a site with no delay is the second, which gets you blocked and possibly a stern email. Not caching summary results is the third, which burns money on every re-run. Trusting the extractor on the full dataset before testing on five is the fourth, which produces a beautiful digest of nonsense. The fifth is scraping anything that requires a login or contains personal data. Don't. Those legal and ethical lines are not negotiable.

Common questions

Is it legal to scrape a public site?
It depends on the site's terms of service and your jurisdiction. Public, non-personal data with no login wall is the safe zone for personal projects. Anything behind a login or containing personal information is not what this guide covers, and you should ask before building.
Why save the raw HTML to disk?
When the extractor breaks weeks from now because the site changed, you can re-run parsing on the cached HTML without re-fetching. It's the single biggest time-saver in a scraping pipeline.
Playwright or plain fetch?
Playwright handles JavaScript-rendered pages, infinite scroll, and login flows but uses more memory. Plain fetch plus cheerio is faster and lighter but breaks on modern frontends. Start with Playwright unless you know the site renders on the server.
How do I keep the API bill low?
Cache summaries by a hash of the structured fields. Skip the API call when the hash hasn't changed. Without caching, every weekly re-run costs as much as the first run. With caching, the second run is nearly free.
How big should the extracted record be?
Five to ten fields is the sweet spot. Wider records turn into noisy summaries and brittle extractors. Pick the fields you'd put in a spreadsheet column header and stop.
How do I avoid getting blocked?
Add a two-to-five second delay between requests, set a polite user agent with a contact email, and respect robots.txt where applicable. Slow scrapers get tolerated, fast scrapers get blocked.
