NET · OPERATIONAL | POOL · 14,200 MOBILE IPs | CARRIERS · 4 / 4 | REQ/MIN · 184,219 | p99 82ms
orbit · llmproxy
REQ/MIN 184,219 | p99 342 ms | POOL 14,200 MOB | ACTIVE SES 8,412 | RETRY Q 17 | CARRIERS 4/4 | EGRESS 4G/5G · MOBILE | EXTRACT/s 927 | EMBED/s 1,204 | CRAWL JOBS 36 | UPTIME 30d 99.94% | DROPPED 0.04%
LLM TRAINING DATA PROXY

LLM training data proxy for bulk web corpora

Assembling a pretrain or fine-tune corpus means pulling millions of pages from thousands of domains. The moment your crawler runs at that volume from one ASN, the blocks start: 403s, 429s, CAPTCHAs, and the silent soft-bans that quietly poison a dataset with error pages. An LLM training data proxy puts every fetch behind a real mobile IP, so the crawl reaches the end instead of flatlining.

Scale
Millions
pages per corpus crawl
Exit type
PL mobile
real 4G/5G · 4 carriers
Backoff
Auto 2×
on 429 / 503 walls
WHY IT MATTERS

Corpus crawls flatline on IP reputation

Anti-bot systems do not care that your intent is benign — they see request velocity from a datacenter range and throttle it. Mobile carrier exits make each fetch read as a phone user, the traffic these defenses are tuned to allow.

Why training crawls get blocked

Anti-bot systems do not care that your intent is benign. They see request velocity from a datacenter range and throttle it. A web-scale corpus crawl is the most aggressive traffic pattern a site sees, so it trips every rate limit the site has. Routing through 4G/5G carrier IPs makes each request look like an ordinary mobile visitor, which is exactly the traffic these defenses are tuned to allow.

Rotate per request, sink raw to object storage

For corpus collection you want maximum IP diversity, not session stickiness. Set rotation to per-request, spray the crawl across the full pool, and stream raw HTML straight to S3, R2, or GCS. Dedupe by content hash downstream. Our side logs metadata only — no payloads, no inspection of what you collect.
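The downstream dedupe step can be sketched in a few lines. This is a minimal illustration, not part of the proxy: the function names `content_hash` and `dedupe` are hypothetical, and collapsing whitespace before hashing is an assumption about what counts as "the same page".

```python
import hashlib
from typing import Iterable, Iterator, Tuple


def content_hash(raw_html: bytes) -> str:
    """Return the SHA-256 hex digest of a page after collapsing whitespace runs."""
    # Assumption: pages that differ only in whitespace are treated as duplicates.
    normalized = b" ".join(raw_html.split())
    return hashlib.sha256(normalized).hexdigest()


def dedupe(pages: Iterable[Tuple[str, bytes]]) -> Iterator[Tuple[str, str, bytes]]:
    """Yield (url, digest, body) for the first page seen with each content hash."""
    seen = set()
    for url, body in pages:
        digest = content_hash(body)
        if digest in seen:
            continue  # same content already collected from another URL
        seen.add(digest)
        yield url, digest, body
```

Using the digest as the object key in S3, R2, or GCS is one way to make writes idempotent, so a re-crawled duplicate overwrites itself instead of growing the bucket.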

Throughput that matches your worker count

Pretrain crawls run hundreds of concurrent workers. The proxy layer has to absorb that concurrency without becoming the bottleneck. Automatic backoff on 429/503, per-IP cookie jars to keep pagination coherent, and a rotation API mean you scale workers up until the target rate-limits the IP, not the gateway.
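For readers who want to see what 2× backoff on 429/503 looks like in practice, here is a client-side sketch. It illustrates the pattern only and is not the gateway's implementation: `fetch_with_backoff`, the `fetch` callable, and the default delays are all hypothetical.

```python
import time
from typing import Callable, Tuple

RETRYABLE = {429, 503}  # the status codes the backoff applies to


def fetch_with_backoff(
    fetch: Callable[[str], Tuple[int, bytes]],
    url: str,
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, bytes]:
    """Call fetch(url), doubling the wait after each 429/503 response."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        status, body = fetch(url)
        if status not in RETRYABLE:
            return status, body
        if attempt < max_attempts:
            sleep(delay)
            delay *= 2  # 2x backoff: 1s, 2s, 4s, ...
    # Out of attempts: hand back the last throttled response to the caller.
    return status, body
```

Passing `sleep` as a parameter keeps the function testable without real waits; in a worker you would leave the default and let `fetch` wrap whatever HTTP client the crawler already uses.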

Clean data in, clean model out

A blocked request that returns a challenge page is worse than no request — it injects junk into your corpus. Reliable exits mean you collect the actual article, not a "verify you are human" interstitial, so your dedupe and quality filters have real text to work with.
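Even with reliable exits, a cheap guard in front of the storage sink catches any interstitial that slips through. The sketch below is a rough heuristic under stated assumptions: the marker phrases, the `max_chars` cutoff, and the function name `looks_like_challenge` are illustrative, not a list this service publishes.

```python
# Assumed marker phrases; tune them against challenge pages you actually observe.
CHALLENGE_MARKERS = ("verify you are human", "captcha", "access denied")
BLOCK_STATUSES = {403, 429, 503}


def looks_like_challenge(status: int, html: str, max_chars: int = 2000) -> bool:
    """Return True when a response is probably a block or challenge page."""
    if status in BLOCK_STATUSES:
        return True
    lowered = html.lower()
    has_marker = any(marker in lowered for marker in CHALLENGE_MARKERS)
    # Only flag short pages, so a long article that merely mentions
    # CAPTCHAs is not thrown away.
    return has_marker and len(lowered) < max_chars
```

Responses that trip the check are better sent to a retry queue than written to the corpus, which keeps error pages out of the dedupe and quality filters.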

WHAT YOU GET

Built for pretrain-scale crawling

TD/01

Per-request diversity

Maximum IP spread across the pool so a million-domain frontier never funnels through one blockable exit.

TD/02

Worker-matched throughput

Absorbs hundreds of concurrent crawl workers; you scale until the target throttles, not the gateway.

TD/03

Raw HTML sink

Stream pages straight to S3 / R2 / GCS and dedupe by content hash downstream — no payload retention here.

TD/04

Challenge-free fetches

Reliable exits return the real article, not an interstitial, so quality filters get clean text.

SPEC SHEET

Corpus crawl at a glance

config · llm training data
Rotation: Per-request for maximum IP diversity
Concurrency: Hundreds of workers, gateway never the bottleneck
Backoff: Automatic 2× on 429 / 503
Storage: Raw HTML stream to S3 / R2 / GCS
Exit type: Real PL 4G/5G mobile, 4 carriers
Logging: Metadata only, no corpus inspection
FAQ

LLM training data proxy — questions

What is an LLM training data proxy?
It routes every fetch in a pretrain or fine-tune corpus crawl through a real mobile IP, so the crawl reaches the end instead of flatlining on 403s, 429s, and soft-bans from one ASN.
Why do corpus crawls get blocked?
A web-scale crawl is the most aggressive traffic a site sees, so when it comes from a datacenter range it trips every rate limit the site has. Carrier IPs make each request read as an ordinary mobile visitor, the traffic these defenses are tuned to allow.
How does it scale with workers?
Automatic backoff on 429/503, per-IP cookie jars, and a rotation API let you scale concurrent workers until the target rate-limits the IP, not the gateway.
Does it keep junk out of the corpus?
Yes. Reliable exits return the real article rather than a challenge page, so your dedupe and quality filters work on genuine content instead of error interstitials.
NEXT STEP

Start a corpus crawl

Sizing the crawl

Planning a pretrain-scale pull? Check the pricing for per-IP rates and bulk discounts, then create an account and point your crawler at the rotation endpoint in under 90 seconds.

FREE TRIAL

Pull a sample dataset, free

Run one real mobile IP for an hour with no card. Point your crawler or agent at the source, watch the data come back clean, then move to a plan.