How does it scale with my worker count?

Automatic backoff on 429/503, per-IP cookie jars for coherent pagination, and a rotation API let you scale concurrent workers up until the target rate-limits the IP, not the gateway.

Does it keep junk out of my corpus?

Yes. A blocked request that returns a challenge page poisons a dataset with error text. Reliable exits collect the actual article, not a verify-you-are-human interstitial, so dedupe and quality filters have real content to work with.

LLM TRAINING DATA PROXY

LLM training data proxy for bulk web corpora

Q: What is an LLM training data proxy?

An LLM training data proxy routes every fetch in a pretrain or fine-tune corpus crawl through a real mobile IP, so the crawl reaches the end instead of flatlining on 403s, 429s, and soft-bans from a single ASN.

Assembling a pretrain or fine-tune corpus means pulling millions of pages from thousands of domains. The moment your crawler runs at that volume from one ASN, the blocks start: 403s, 429s, CAPTCHAs, and the soft-bans that quietly poison a dataset with error pages. An LLM training data proxy puts every fetch behind a real mobile IP, so the crawl reaches the end instead of flatlining.


Scale

Millions

pages per corpus crawl

Exit type

PL mobile

real 4G/5G · 4 carriers

Backoff

Auto 2×

on 429 / 503 walls

WHY IT MATTERS

Corpus crawls flatline on IP reputation

Anti-bot systems do not care that your intent is benign — they see request velocity from a datacenter range and throttle it. Mobile carrier exits make each fetch read as a phone user, the traffic these defenses are tuned to allow.

Why training crawls get blocked

Anti-bot systems do not care that your intent is benign. They see request velocity from a datacenter range and throttle it. A web-scale corpus crawl is the most aggressive traffic pattern a site sees, so it trips every rate limit the site has. Routing through 4G/5G carrier IPs makes each request look like an ordinary mobile visitor, which is exactly the traffic these defenses are tuned to allow.

Rotate per request, sink raw to object storage

For corpus collection you want maximum IP diversity, not session stickiness. Set rotation to per-request, spray the crawl across the full pool, and stream raw HTML straight to S3, R2, or GCS. Dedupe by content hash downstream. Our side logs metadata only — no payloads, no inspection of what you collect.
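The downstream dedupe step can be sketched in a few lines. This is a minimal illustration, assuming exact-match dedupe on a SHA-256 digest of the raw bytes and an in-memory set; the names `content_key`, `Deduper`, and `unique_pages` are hypothetical, and a real pipeline would likely keep the seen-set in a persistent store next to the object bucket.

```python
import hashlib


def content_key(raw_html: bytes) -> str:
    """Return the SHA-256 hex digest used as the dedupe key for a page body."""
    return hashlib.sha256(raw_html).hexdigest()


class Deduper:
    """Tracks seen content hashes in memory (assumption: fits in RAM)."""

    def __init__(self):
        self._seen = set()

    def is_new(self, raw_html: bytes) -> bool:
        """Return True the first time a body is seen, False on repeats."""
        key = content_key(raw_html)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


def unique_pages(pages):
    """Yield (url, body) pairs whose body bytes have not been seen before."""
    deduper = Deduper()
    for url, body in pages:
        if deduper.is_new(body):
            yield url, body
```

Hashing raw bytes only catches byte-identical pages. Near-duplicates (same article, different ad markup) need a fuzzier method, which is out of scope for this sketch.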

Throughput that matches your worker count

Pretrain crawls run hundreds of concurrent workers. The proxy layer has to absorb that concurrency without becoming the bottleneck. Automatic backoff on 429/503, per-IP cookie jars to keep pagination coherent, and a rotation API mean you scale workers up until the target rate-limits the IP, not the gateway.
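If you want the same doubling behavior inside your own workers, a rough client-side sketch looks like this. It is not the gateway's implementation: the function name `fetch_with_backoff`, the 1-second base delay, and the 5-attempt cap are all assumptions, and `fetch` is any callable you supply that returns a `(status, body)` pair.

```python
import time
from typing import Callable, Tuple

# Status codes treated as "slow down and retry" in this sketch.
RETRY_STATUSES = {429, 503}


def fetch_with_backoff(
    fetch: Callable[[str], Tuple[int, bytes]],
    url: str,
    base_delay: float = 1.0,
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, bytes]:
    """Call fetch(url), doubling the wait after each 429/503 response."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        status, body = fetch(url)
        # Return on any non-retryable status, or when attempts are exhausted.
        if status not in RETRY_STATUSES or attempt == max_attempts:
            return status, body
        sleep(delay)
        delay *= 2  # 2x backoff: 1s, 2s, 4s, ...
    raise ValueError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter keeps the function testable without real waiting. In production you would typically also add jitter so hundreds of workers do not retry in lockstep.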

Clean data in, clean model out

A blocked request that returns a challenge page is worse than no request — it injects junk into your corpus. Reliable exits mean you collect the actual article, not a "verify you are human" interstitial, so your dedupe and quality filters have real text to work with.
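Even with reliable exits, it is cheap to guard the corpus with a pre-ingest check. The sketch below is a heuristic under stated assumptions: the marker phrases, the 5,000-character cutoff, and the name `looks_like_challenge` are illustrative choices, not a list of what any particular anti-bot vendor serves.

```python
# Hypothetical marker phrases; tune these against pages you actually see.
CHALLENGE_MARKERS = (
    "verify you are human",
    "checking your browser",
    "captcha",
)

# Statuses that indicate a block rather than content.
BLOCK_STATUSES = {403, 429, 503}


def looks_like_challenge(status: int, html: str, max_chars: int = 5000) -> bool:
    """Return True if a response is probably a block page, not an article."""
    if status in BLOCK_STATUSES:
        return True
    lowered = html.lower()
    has_marker = any(marker in lowered for marker in CHALLENGE_MARKERS)
    # Require a short body too, so a long article that merely mentions
    # CAPTCHAs is not discarded.
    return has_marker and len(lowered) < max_chars
```

Run it before the page is written to storage, and drop or re-queue anything it flags so error text never reaches the dedupe and quality stages.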

WHAT YOU GET

Built for pretrain-scale crawling

TD/01

Per-request diversity

Maximum IP spread across the pool so a million-domain frontier never funnels through one blockable exit.

TD/02

Worker-matched throughput

Absorbs hundreds of concurrent crawl workers; you scale until the target throttles, not the gateway.

TD/03

Raw HTML sink

Stream pages straight to S3 / R2 / GCS and dedupe by content hash downstream — no payload retention here.

TD/04

Challenge-free fetches

Reliable exits return the real article, not an interstitial, so quality filters get clean text.

SPEC SHEET

Corpus crawl at a glance

config · llm training data

Rotation	Per-request for maximum IP diversity
Concurrency	Hundreds of workers, gateway never the bottleneck
Backoff	Automatic 2× on 429 / 503
Storage	Raw HTML stream to S3 / R2 / GCS
Exit type	Real PL 4G/5G mobile, 4 carriers
Logging	Metadata only — no corpus inspection

FAQ

LLM training data proxy — questions

What is an LLM training data proxy?

It routes every fetch in a pretrain or fine-tune corpus crawl through a real mobile IP, so the crawl reaches the end instead of flatlining on 403s, 429s, and soft-bans from one ASN.

Why do corpus crawls get blocked?

A web-scale crawl is the most aggressive traffic a site sees, so from a datacenter range it trips every rate limit the site has. Carrier IPs make each request read as an ordinary mobile visitor, the traffic these defenses are tuned to allow.

How does it scale with workers?

Automatic backoff on 429/503, per-IP cookie jars, and a rotation API let you scale concurrent workers until the target rate-limits the IP, not the gateway.

Does it keep junk out of the corpus?

Yes. Reliable exits return the real article rather than a challenge page, so your dedupe and quality filters work on genuine content instead of error interstitials.

NEXT STEP

Start a corpus crawl

Sizing the crawl

Planning a pretrain-scale pull? Check the pricing for per-IP rates and bulk discounts, then create an account and point your crawler at the rotation endpoint in under 90 seconds.



FREE TRIAL

Pull a sample dataset, free

Run one real mobile IP for an hour with no card. Point your crawler or agent at the source, watch the data come back clean, then move to a plan.
