
Building a scraping stack for LLM training data (proxy architecture)

Updated 2026 — written for teams collecting public web data to train and fine-tune models.

Everyone wants more training data and nobody talks about the plumbing. A clean corpus doesn’t appear; you crawl it, one stubborn domain at a time, against rate limits and anti-bot walls built specifically to stop traffic like yours. Here’s the architecture we keep seeing work in production — the layers, where they break, and why the proxy tier is the part most teams underbuild.

1. The five layers of a corpus crawler

A training-data scraping stack is five layers: a URL frontier that decides what to fetch next, a fetch layer that issues requests, a proxy tier that carries those requests to the open web, a storage sink that captures raw responses, and a processing stage that dedupes, filters, and tokenizes. Most failures that look like "data quality" problems are actually fetch and proxy failures leaking error pages into the sink.

2. The URL frontier

Start with seed domains and expand via sitemaps and in-page links, but cap crawl depth and set a per-host page budget so one large site does not swallow the whole crawl. Keep the frontier in something durable (Redis, a message queue, or a database) so a crash does not lose your place during a multi-day run. Politeness lives here too: per-host delay, robots.txt handling, and a global concurrency ceiling.
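The depth cap and per-host budget can be sketched in a few lines. This is a minimal in-memory illustration, not a production frontier: the class name Frontier and its parameters are hypothetical, and a real crawl would back the queue and counters with the durable store mentioned above.

```python
from collections import deque
from urllib.parse import urlsplit


class Frontier:
    """In-memory URL frontier with a depth cap and a per-host page budget.

    Hypothetical sketch: a long-running crawl would persist this state
    in Redis, a queue, or a database instead of process memory.
    """

    def __init__(self, max_depth=3, per_host_budget=1000):
        self.max_depth = max_depth
        self.per_host_budget = per_host_budget
        self._queue = deque()    # (url, depth) pairs awaiting fetch
        self._seen = set()       # URLs already accepted
        self._host_counts = {}   # host -> number of URLs accepted

    def add(self, url, depth=0):
        """Enqueue url if it passes the depth, dedupe, and budget checks."""
        host = urlsplit(url).netloc.lower()
        if not host or depth > self.max_depth or url in self._seen:
            return False
        if self._host_counts.get(host, 0) >= self.per_host_budget:
            return False
        self._seen.add(url)
        self._host_counts[host] = self._host_counts.get(host, 0) + 1
        self._queue.append((url, depth))
        return True

    def next(self):
        """Return the next (url, depth) pair, or None when the queue is empty."""
        return self._queue.popleft() if self._queue else None
```

Links discovered on a page fetched at depth d would be added with depth d + 1, so the cap bounds how far the crawl wanders from its seeds.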

3. The fetch layer

For static HTML, a fast async client (httpx or aiohttp in Python, or Go's net/http) handles the bulk of a corpus cheaply. Reserve a headless browser (Playwright) for the minority of pages that need JavaScript rendering, because a browser costs roughly an order of magnitude more per page. Set sane timeouts, retry transient failures with backoff, and treat a 429 as a signal to slow down on that host, honoring any Retry-After header, rather than to retry harder on the same IP.
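One way to encode that retry policy is as a pair of small pure functions, independent of whichever HTTP client you use. The sketch below is an assumption-laden illustration: the function names, the set of statuses treated as transient, and the default limits are all hypothetical choices, and it assumes the client already follows redirects.

```python
import random

# Assumption: these 5xx statuses are treated as transient and worth retrying.
TRANSIENT_STATUSES = {500, 502, 503, 504}


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random.random):
    """Exponential backoff with full jitter.

    Returns a delay in seconds drawn uniformly from
    [0, min(cap, base * 2**attempt)].
    """
    return rng() * min(cap, base * (2 ** attempt))


def next_action(status, attempt, max_attempts=4):
    """Map an HTTP status to 'ok', 'slow_host', 'retry', or 'give_up'.

    attempt is zero-based. A 429 never triggers an immediate retry; it
    tells the caller to lower the request rate for that host.
    """
    if 200 <= status < 300:
        return "ok"
    if status == 429:
        return "slow_host"
    if status in TRANSIENT_STATUSES and attempt + 1 < max_attempts:
        return "retry"
    return "give_up"
```

Keeping the decision separate from the I/O makes the policy easy to unit test and to reuse across the async client and the browser path.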

4. The proxy tier — where crawls actually die

This is the layer teams underbuild, and then they wonder why their crawl flatlines at 30%. At web scale you are issuing more requests per minute to a host than any human ever would, from a narrow IP range. Datacenter ranges tend to get throttled, then blocked, and sometimes dropped at the network level. Routing through real 4G/5G mobile exits changes the math: carriers use carrier-grade NAT, so many real subscribers share each public IP address, and sites are reluctant to block an address that also carries legitimate users.

For corpus collection, rotate per request to maximize IP diversity: you want breadth, not session continuity. The exception is paginated targets, where a per-IP cookie jar keeps a sequence of pages coherent for as long as you stay on one exit. Lean on automatic backoff on 429/503 responses. If your provider exposes a rotation API, the fetch layer can ask for a fresh exit whenever a host starts pushing back, instead of stubbornly retrying into a wall.
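As a rough illustration of per-request rotation with one cookie jar per exit, the sketch below cycles through a fixed list of proxy URLs. It does not model any particular provider's rotation API: the ExitPool class and the example proxy addresses are hypothetical, and a real pool would also track exit health.

```python
import itertools
from http.cookiejar import CookieJar


class ExitPool:
    """Round-robin over proxy exits, keeping one cookie jar per exit.

    Hypothetical sketch: exits is a list of proxy URLs supplied by the
    caller; nothing here talks to a provider API.
    """

    def __init__(self, exits):
        if not exits:
            raise ValueError("need at least one exit")
        self._cycle = itertools.cycle(list(exits))
        self._jars = {exit_url: CookieJar() for exit_url in exits}

    def next_exit(self):
        """Return (exit_url, cookie_jar) to use for the next request."""
        exit_url = next(self._cycle)
        return exit_url, self._jars[exit_url]


# Example with placeholder addresses (assumptions, not real endpoints).
pool = ExitPool(["http://exit-a.example:8000", "http://exit-b.example:8000"])
proxy_url, jar = pool.next_exit()
```

Because each jar is tied to its exit, cookies set through one address are never replayed from another.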

5. The storage sink

Write raw responses straight to object storage — S3, R2, or GCS — keyed by content hash. Store the raw bytes first and parse later; never throw away the original, because your extraction rules will change and you will want to re-run them without re-crawling. Capture response headers and the final URL alongside the body so you can audit provenance and filter by source later.
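A content-hash key and its provenance record might look like the following. This is a sketch under assumptions: the key layout, the field names, and the choice of SHA-256 are illustrative, and the actual upload call to S3, R2, or GCS is left out.

```python
import hashlib
import json


def object_key(body, prefix="raw"):
    """Build a storage key from the SHA-256 of the raw response bytes.

    The two-character shard directory is an illustrative layout choice.
    """
    digest = hashlib.sha256(body).hexdigest()
    return f"{prefix}/{digest[:2]}/{digest}"


def provenance_record(body, final_url, headers, fetched_at):
    """Serialize the metadata stored alongside the raw body.

    Field names are hypothetical; fetched_at is any timestamp string
    the caller supplies.
    """
    return json.dumps(
        {
            "key": object_key(body),
            "final_url": final_url,
            "headers": dict(headers),
            "fetched_at": fetched_at,
            "length": len(body),
        },
        sort_keys=True,
    )
```

Identical bodies map to the same key, so writing the provenance record separately preserves every URL that served the same content.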

6. Dedupe and quality filtering

Raw web crawls contain a large share of exact and near-duplicate pages. Exact-dedupe by content hash first, then near-dedupe with MinHash or SimHash to collapse boilerplate-heavy near-copies. Strip navigation and chrome, drop pages below a length or language threshold, and filter out the challenge and error pages that slipped through. This is where a reliable proxy tier pays off twice: fewer error pages reach the sink, so your filters spend their budget on real text.
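To make the near-dedupe step concrete, here is a small MinHash sketch using only the standard library. It is illustrative rather than tuned: the word-shingle size, the 64 hash functions, and the use of salted BLAKE2b as the hash family are all assumptions, and a large corpus would add locality-sensitive hashing so that not every pair is compared.

```python
import hashlib


def shingles(text, k=5):
    """Return the set of k-word shingles in text (lowercased)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(items, num_perm=64):
    """Compute a MinHash signature: the minimum hash per salted hash function."""
    if not items:
        raise ValueError("cannot build a signature for an empty set")
    encoded = [item.encode("utf-8") for item in items]
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "big")  # distinct salt = distinct hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(e, digest_size=8, salt=salt).digest(), "big")
            for e in encoded
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching positions."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)
```

Pages whose estimated similarity exceeds a chosen threshold would be collapsed to one representative; the threshold itself is a tuning decision for your corpus.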

7. Observability and politeness

Track block rate, bytes collected, dedupe ratio, and per-host success as first-class metrics. A rising block rate on one host means back off or rotate harder there, not everywhere. Respect robots.txt and reasonable per-host limits: a crawler that behaves gets blocked less, which is both an ethics win and a throughput win.
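Per-host block rate is simple to track in process before it is exported to a metrics system. The sketch below is a minimal illustration: the HostStats class is hypothetical, and treating 403, 429, and 503 as block signals is an assumption that will not fit every target.

```python
from collections import defaultdict

# Assumption: these statuses count as "blocked" for metric purposes.
BLOCK_STATUSES = {403, 429, 503}


class HostStats:
    """Count requests and block responses per host."""

    def __init__(self):
        self._total = defaultdict(int)
        self._blocked = defaultdict(int)

    def record(self, host, status):
        """Record one response status for host."""
        self._total[host] += 1
        if status in BLOCK_STATUSES:
            self._blocked[host] += 1

    def block_rate(self, host):
        """Return blocked / total for host, or 0.0 if nothing was recorded."""
        total = self._total.get(host, 0)
        return self._blocked.get(host, 0) / total if total else 0.0

    def hosts_over(self, threshold, min_requests=20):
        """List hosts whose block rate exceeds threshold, given enough samples."""
        return sorted(
            host for host, count in self._total.items()
            if count >= min_requests and self.block_rate(host) > threshold
        )
```

The min_requests floor keeps a single early 429 from flagging a host, so back-off decisions stay scoped to hosts with a real trend.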

8. Why the proxy choice compounds

Every other layer assumes the fetch succeeded. If the proxy tier returns CAPTCHAs and soft-bans, your frontier wastes budget, your sink fills with garbage, and your dedupe stage cannot tell a real article from a challenge page. Getting the exit layer right is what makes the rest of the stack honest. Mobile exits with per-request rotation, backoff, and metadata-only logging are the foundation a clean corpus is built on.
