LLM training data proxy for bulk web corpora
Assembling a pretrain or fine-tune corpus means pulling millions of pages from thousands of domains. The moment your crawler runs at that volume from one ASN, the blocks start: 403s, 429s, CAPTCHAs, and the silent soft-bans that quietly poison a dataset with error pages. An LLM training data proxy puts every fetch behind a real mobile IP, so the crawl reaches the end instead of flatlining.
Corpus crawls flatline on IP reputation
Request velocity from a datacenter range gets throttled no matter how benign the crawl. Mobile carrier exits make each fetch read as a phone user, the traffic these defenses are tuned to allow.
Why training crawls get blocked
Anti-bot systems do not care that your intent is benign. They see request velocity from a datacenter range and throttle it. A web-scale corpus crawl is among the most aggressive traffic patterns a site sees, so it trips every rate limit the site has. Routing through 4G/5G carrier IPs makes each request look like an ordinary mobile visitor, which is exactly the traffic these defenses are tuned to allow.
Rotate per request, sink raw to object storage
For corpus collection you want maximum IP diversity, not session stickiness. Set rotation to per-request, spray the crawl across the full pool, and stream raw HTML straight to S3, R2, or GCS. Dedupe by content hash downstream. Our side logs metadata only — no payloads, no inspection of what you collect.
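The downstream dedupe step can be sketched in a few lines. This is a minimal illustration, assuming raw pages arrive as bytes; `content_key`, `dedupe`, and the `raw/<prefix>/<hash>.html` key layout are hypothetical names chosen for the example, and the actual upload to S3, R2, or GCS is left out.

```python
import hashlib
from typing import Iterable, Iterator, Tuple


def content_key(raw_html: bytes) -> str:
    """Derive an object-storage key from the SHA-256 of the raw bytes.

    The key layout is an assumption for this sketch: the first two hex
    characters act as a prefix so keys spread across the bucket.
    """
    digest = hashlib.sha256(raw_html).hexdigest()
    return f"raw/{digest[:2]}/{digest}.html"


def dedupe(pages: Iterable[bytes]) -> Iterator[Tuple[str, bytes]]:
    """Yield (key, body) once per distinct byte-identical page."""
    seen = set()
    for body in pages:
        key = content_key(body)
        if key in seen:
            continue  # exact duplicate of a page already emitted
        seen.add(key)
        yield key, body


if __name__ == "__main__":
    sample = [b"<html>a</html>", b"<html>b</html>", b"<html>a</html>"]
    for key, body in dedupe(sample):
        print(key, len(body))
```

Because the key is the hash, writing the same page twice targets the same object, so the sink stays idempotent. Exact hashing only catches byte-identical pages; near-duplicates (same article, different ad markup) need a fuzzier method at a later stage.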
Throughput that matches your worker count
Pretrain crawls run hundreds of concurrent workers, and the proxy layer has to absorb that concurrency without becoming the bottleneck. Automatic backoff on 429/503, per-IP cookie jars that keep pagination coherent, and a rotation API let you scale workers up until the target site's rate limits, not the gateway, are the constraint.
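Backoff on 429/503 is described here as handled by the gateway. For crawlers that also want a client-side retry, the general technique (capped exponential backoff with full jitter) looks roughly like the sketch below. It is not the service's implementation: `backoff_delay`, `fetch_with_backoff`, and the `fetch` callable returning `(status, body)` are hypothetical names, and the base and cap values are arbitrary.

```python
import random
import time
from typing import Callable, Tuple

RETRYABLE = {429, 503}  # statuses that signal "slow down", per the text above


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Upper bound for the wait before retry number `attempt` (0-based)."""
    return min(cap, base * (2 ** attempt))


def fetch_with_backoff(
    fetch: Callable[[str], Tuple[int, bytes]],
    url: str,
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, bytes]:
    """Call fetch(url) until it returns a non-retryable status.

    Gives up after max_attempts calls and returns the last response.
    """
    status, body = fetch(url)
    for attempt in range(max_attempts - 1):
        if status not in RETRYABLE:
            break
        # Full jitter: wait a random time up to the exponential bound so
        # many workers do not retry in lockstep.
        sleep(random.uniform(0, backoff_delay(attempt)))
        status, body = fetch(url)
    return status, body
```

Passing `sleep` as a parameter keeps the function testable without real waiting; in a crawl worker the default `time.sleep` applies.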
Clean data in, clean model out
A blocked request that returns a challenge page is worse than no request — it injects junk into your corpus. Reliable exits mean you collect the actual article, not a "verify you are human" interstitial, so your dedupe and quality filters have real text to work with.
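Even with reliable exits, a cheap guard before the sink keeps stray interstitials out of the corpus. The sketch below is a heuristic under stated assumptions: `looks_like_challenge` is a hypothetical helper, and apart from "verify you are human" the marker phrases are illustrative guesses that would need tuning against real responses.

```python
# Phrases that commonly appear on challenge pages. Only the first comes
# from the text above; the others are assumptions for this example.
CHALLENGE_MARKERS = (
    "verify you are human",
    "are you a robot",
    "unusual traffic",
)

BLOCK_STATUSES = {403, 429, 503}


def looks_like_challenge(status: int, html: str, scan_chars: int = 4096) -> bool:
    """Return True if a response should be kept out of the corpus.

    A blocking status is rejected outright. Otherwise only the first
    `scan_chars` characters are searched, case-insensitively, so that an
    article merely quoting one of the phrases deep in its body is less
    likely to be dropped.
    """
    if status in BLOCK_STATUSES:
        return True
    head = html[:scan_chars].lower()
    return any(marker in head for marker in CHALLENGE_MARKERS)


if __name__ == "__main__":
    print(looks_like_challenge(200, "<title>Verify you are human</title>"))
    print(looks_like_challenge(200, "<article>Actual article text</article>"))
```

Flagged responses are better re-queued than silently discarded, so the URL can be fetched again on a fresh exit.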
Built for pretrain-scale crawling
Per-request diversity
Maximum IP spread across the pool so a million-domain frontier never funnels through one blockable exit.
Worker-matched throughput
Absorbs hundreds of concurrent crawl workers; you scale until the target throttles, not the gateway.
Raw HTML sink
Stream pages straight to S3 / R2 / GCS and dedupe by content hash downstream — no payload retention here.
Challenge-free fetches
Reliable exits return the real article, not an interstitial, so quality filters get clean text.
Corpus crawl at a glance
LLM training data proxy — questions
What is an LLM training data proxy?
Why do corpus crawls get blocked?
How does it scale with workers?
Does it keep junk out of the corpus?
Start a corpus crawl
Sizing the crawl
Planning a pretrain-scale pull? Check the pricing for per-IP rates and bulk discounts, then create an account and point your crawler at the rotation endpoint in under 90 seconds.
Pull a sample dataset, free
Run one real mobile IP for an hour with no card. Point your crawler or agent at the source, watch the data come back clean, then move to a plan.