NET · OPERATIONAL | POOL · 14,200 MOBILE IPs | CARRIERS · 4 / 4 | REQ/MIN · 184,219 | p99 82ms
orbit · llmproxy
REQ/MIN 184,219 | p99 342 ms | POOL 14,200 MOB | ACTIVE SES 8,412 | RETRY Q 17 | CARRIERS 4/4 | EGRESS 4G/5G · MOBILE | EXTRACT/s 927 | EMBED/s 1,204 | CRAWL JOBS 36 | UPTIME 30d 99.94% | DROPPED 0.04%
DATASET COLLECTION PROXY

Dataset collection proxy for structured ML records

Not every dataset is a raw corpus. Fine-tuning sets, evals, and benchmarks need targeted, structured records: products, prices, reviews, entity pages — clean fields you can label and split into rows. A dataset collection proxy routes that field-level scrape through real mobile exits, so a target serves you consistent, comparable records across the whole run instead of throttling you into a patchy table.

Pool: 14,200 mobile IPs · 4 carriers
Sessions: Sticky, for multi-step record paths
Output: Structured (labeled fields, not raw HTML)
WHY IT MATTERS

Structured collection breaks on inconsistent targets, not on code

A labeled dataset needs every record measured the same way. When a site rotates locale, currency, or layout mid-run, your fields drift and the dataset is silently corrupted. Stable mobile sessions keep a target serving one coherent experience so each record stays comparable.
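One way to catch this kind of drift before it reaches a training set is to check that context fields hold a single value across the run. The sketch below is a minimal illustration, assuming each record is a dict that carries `locale` and `currency` keys; those field names and the sample rows are hypothetical, not part of the service.

```python
from collections import Counter


def find_context_drift(records, context_fields=("locale", "currency")):
    """Return {field: {value: count}} for every context field that
    takes more than one value across the run."""
    drift = {}
    for field in context_fields:
        counts = Counter(rec.get(field) for rec in records)
        if len(counts) > 1:
            drift[field] = dict(counts)
    return drift


# Hypothetical rows: the second one came back in a different currency.
records = [
    {"sku": "A1", "price": 19.99, "currency": "PLN", "locale": "pl-PL"},
    {"sku": "A2", "price": 4.75, "currency": "EUR", "locale": "pl-PL"},
    {"sku": "A3", "price": 21.50, "currency": "PLN", "locale": "pl-PL"},
]

print(find_context_drift(records))
# prints {'currency': {'PLN': 2, 'EUR': 1}}
```

An empty result means the context fields were stable; anything else points at rows whose prices or ratings may not be comparable with the rest.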

Target specific record types, not the whole web

Dataset collection is narrow by design. You already know the sites and the record types — product pages, review threads, pricing tables, entity profiles — and you want the fields off each one. Route that targeted scrape through the carrier pool so you can hit the same hosts repeatedly without one exit accumulating enough volume to get flagged and skew your sampling.

Sticky sessions hold a record together

A structured record often spans a search, a listing, and a detail page that have to stay on one IP and cookie jar to read consistently. Set sticky sessions so each multi-step extraction path completes on a single exit. That is what keeps paginated, multi-page records coherent instead of fragmenting into rows with half their fields missing.
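On the client side, holding a record path together can look roughly like the sketch below: one proxy URL and one cookie jar per record, reused for every step. The endpoint `proxy.example.com:8080` and the `-session-<id>` username convention are assumptions made for illustration; the actual endpoint and session syntax come from your account settings.

```python
import urllib.request
from http.cookiejar import CookieJar

# Hypothetical endpoint; replace with the one from your account.
PROXY_HOST = "proxy.example.com:8080"


def sticky_proxy_url(user, password, session_id):
    # Assumption: the sticky session ID is carried in the proxy username.
    return f"http://{user}-session-{session_id}:{password}@{PROXY_HOST}"


def make_record_opener(user, password, session_id):
    """Build an opener that sends every request through the same
    proxy URL and stores cookies in a single jar."""
    jar = CookieJar()
    proxy_url = sticky_proxy_url(user, password, session_id)
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url}),
        urllib.request.HTTPCookieProcessor(jar),
    )
    return opener, jar


def fetch_record_path(opener, urls, timeout=30):
    """Fetch search, listing, and detail pages in order with one opener."""
    pages = []
    for url in urls:
        with opener.open(url, timeout=timeout) as resp:
            pages.append(resp.read())
    return pages
```

Creating a fresh opener per record, rather than sharing one across the run, keeps cookies from one record path out of the next.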

Keep fields comparable across the run

The value of a labeled dataset is that every row is measured the same way. Per-IP cookie jars pin the locale, currency, and layout a target serves from the first record to the last, and automatic 2× backoff on 429/503 eases off a throttling host so records land complete. Together they keep a field like price or rating comparable rather than drifting as the site swaps experiences under load.
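The proxy applies the backoff on its side, so nothing below is required. As an illustration of what a 2× schedule means, here is a minimal client-side sketch: the wait doubles after each 429 or 503. The `fetch` callable, `base_delay`, and `max_retries` are hypothetical names chosen for the example, not the service's implementation.

```python
import time

RETRY_STATUSES = {429, 503}


def fetch_with_backoff(fetch, url, base_delay=1.0, max_retries=5, sleep=time.sleep):
    """Call fetch(url) -> (status, body). After each 429/503, wait and
    double the delay; return the last response once retries run out."""
    delay = base_delay
    for attempt in range(max_retries + 1):
        status, body = fetch(url)
        if status not in RETRY_STATUSES:
            return status, body
        if attempt == max_retries:
            break
        sleep(delay)
        delay *= 2
    return status, body


# Usage with a stand-in fetcher that is throttled twice, then succeeds.
responses = iter([(429, b""), (503, b""), (200, b"ok")])
waits = []
result = fetch_with_backoff(lambda url: next(responses), "https://example.com/p/1", sleep=waits.append)
print(result, waits)
# prints (200, b'ok') [1.0, 2.0]
```

Passing `sleep` as a parameter keeps the schedule easy to inspect without actually waiting.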

Clean structured output, ready to label

You extract fields and write rows to your own store — Parquet, JSONL, a database, wherever your ML pipeline reads from. The exit layer logs metadata only: byte counts, timestamps, exit IP, status code. No payload inspection, no retention of the records you collect. The dataset is yours; the proxy just keeps the target answering cleanly.
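Writing rows to your own store can be as small as the sketch below, which emits one JSON object per line with a fixed set of keys so every row has the same shape. The field names are hypothetical; substitute whatever your record type defines.

```python
import io
import json

# Hypothetical schema for a product record.
FIELDS = ("url", "title", "price", "currency", "rating")


def to_row(extracted):
    # Missing fields become None so every row has the same keys.
    return {name: extracted.get(name) for name in FIELDS}


def write_jsonl(records, fh):
    """Write one JSON object per line; return the number of rows written."""
    count = 0
    for rec in records:
        fh.write(json.dumps(to_row(rec), ensure_ascii=False) + "\n")
        count += 1
    return count


# Usage: an in-memory buffer stands in for a file opened in text mode.
buf = io.StringIO()
written = write_jsonl(
    [{"url": "https://example.com/p/1", "title": "Kettle", "price": 129.0, "currency": "PLN"}],
    buf,
)
print(written, buf.getvalue(), end="")
```

Filling absent fields with `None` rather than dropping the key makes half-empty records easy to count and filter before labeling.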

WHAT YOU GET

Built for targeted, structured runs

DS/01

Targeted host coverage

Hit the same known sites repeatedly through the carrier pool without one exit skewing your sampling on per-host limits.

DS/02

Sticky record sessions

Hold a search → listing → detail path on one IP and cookie jar so multi-step records extract coherent fields.

DS/03

Consistent field context

Per-IP cookie jars keep locale, currency, and layout stable so a field stays comparable across the whole run.

DS/04

Backoff for clean rows

Auto 2× backoff on 429/503 keeps a target answering so records land complete instead of half-empty.

SPEC SHEET

Structured collection at a glance

config · dataset collection
Rotation: Sticky sessions (default) or per-request on demand
Pool size: 14,200 mobile IPs across 4 PL carriers
Sessions: Per-IP cookie jars for multi-step record paths
Backoff: Automatic 2× on 429 / 503
Protocols: SOCKS5 + HTTP(S), OpenVPN, VLESS (Xray)
Logging: Metadata only; no records retained
FAQ

Dataset collection proxy — questions

What is a dataset collection proxy?
It routes a targeted, field-level scrape through mobile exit IPs so you pull clean structured records (products, prices, reviews, entity pages) into a labeled dataset for fine-tuning, evals, or RAG, rather than a raw web dump.
How is this different from a training-data crawl?
A training-data crawl pulls raw HTML wide for a pretrain corpus. Dataset collection is narrow and structured: you target record types on known sites and extract labeled fields, so the output is rows ready to use, not raw bytes to filter later.
Should I use sticky sessions?
Usually. Structured records span a search, a listing, and a detail page that must stay on one IP and cookie jar to read consistently, so sticky sessions keep multi-step extraction returning coherent fields.
How do I keep records consistent?
Per-IP cookie jars hold a target to the same locale, currency, and layout across a run, and automatic 2× backoff on 429/503 keeps it answering with complete records, so a field like price or rating stays comparable from the first record to the last.
NEXT STEP

Start a collection run

Sizing the pool

Collecting structured records across a set of targets? Check the pricing for per-IP rates and bulk discounts, then create an account and point your extractor at the endpoint in under 90 seconds.

FREE TRIAL

Pull a sample dataset, free

Run one real mobile IP for an hour with no card. Point your crawler or agent at the source, watch the data come back clean, then move to a plan.