ScrapAI includes SmartProxyMiddleware that intelligently manages proxy usage to avoid blocks while minimizing costs.

How It Works

Auto Mode Strategy (default):

1. Start with direct connections - fast and free, no proxy overhead
2. Detect blocking - automatic detection of 403/429 errors
3. Retry with datacenter proxy - cheap and fast fallback option
4. Learn domain patterns - remembers which domains need proxies
5. Proactive proxy usage - uses proxy immediately for known-blocked domains
6. Expert-in-the-loop - asks the user before using expensive residential proxies
Smart cost control:
  • Direct connections are faster and free
  • Proxies only used when necessary
  • Datacenter proxies preferred (cheaper)
  • Residential proxies require explicit user approval
  • Learns per-domain blocking patterns
  • Reduces proxy bandwidth costs by 80-90%
  • No surprise costs - expensive proxies need human approval
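The escalation strategy above can be sketched as a small decision helper. This is a hypothetical illustration of the logic, not the actual SmartProxyMiddleware source; all names here are invented:

```python
class ProxyEscalation:
    """Sketch of auto-mode escalation: direct first, datacenter proxy
    only after a block, learned per-domain for the rest of the crawl."""

    BLOCK_CODES = {403, 429}

    def __init__(self):
        self.blocked_domains = set()  # domains learned to need a proxy

    def proxy_for(self, domain):
        # Known-blocked domains go through the datacenter proxy immediately;
        # everything else starts with a free direct connection.
        return "datacenter" if domain in self.blocked_domains else None

    def record_response(self, domain, status):
        # A 403/429 marks the domain as blocked and triggers a proxied retry.
        if status in self.BLOCK_CODES:
            self.blocked_domains.add(domain)
            return "retry_with_proxy"
        return "ok"
```

The residential tier is deliberately absent here: per the expert-in-the-loop design, escalation beyond datacenter requires a human decision rather than automatic code.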

Setup

Add proxy credentials to .env:
# Datacenter Proxy (default - used with --proxy-type datacenter or no flag)
DATACENTER_PROXY_USERNAME=your_username
DATACENTER_PROXY_PASSWORD=your_password
DATACENTER_PROXY_HOST=dc.yourproxy.com
DATACENTER_PROXY_PORT=10000
Proxy Configuration: Check your proxy provider’s documentation for:
  • Correct hostname and port for datacenter vs residential proxies
  • Rotating vs sticky IP options
  • Authentication requirements
Since SmartProxyMiddleware uses a single proxy connection, use rotating IP ports when available for better IP distribution.
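Assuming the middleware combines these four variables into a standard `http://user:pass@host:port` proxy URL, a hypothetical loader might look like the following (the function name is invented for illustration):

```python
import os

def datacenter_proxy_url():
    """Build a proxy URL from the .env variables above.
    Returns None if any variable is missing (proxy disabled)."""
    user = os.environ.get("DATACENTER_PROXY_USERNAME")
    password = os.environ.get("DATACENTER_PROXY_PASSWORD")
    host = os.environ.get("DATACENTER_PROXY_HOST")
    port = os.environ.get("DATACENTER_PROXY_PORT")
    if not all([user, password, host, port]):
        return None
    return f"http://{user}:{password}@{host}:{port}"
```

Note that credentials containing special characters (`@`, `:`) would need URL-encoding before being embedded like this.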

Usage

Auto Mode (Default) - Expert-in-the-Loop

# Auto mode (default) - smart escalation
./scrapai crawl spider_name --project proj --limit 10
Auto mode flow:

1. Start with direct connections - ✅ fast, free connections, no proxy overhead
2. Detect block (403/429) - ✅ automatically retry with datacenter proxy (cheap, fast)
3. Datacenter fails - ⚠️ expert-in-the-loop prompt appears:

   ⚠️  EXPERT-IN-THE-LOOP: Datacenter proxy failed for some domains
   🏠 Residential proxy is available but may incur HIGHER COSTS

   Blocked domains: example.com, site.org

   To proceed with residential proxy, run:
     ./scrapai crawl spider_name --project proj --proxy-type residential

4. User decides - 👤 you choose whether to use expensive residential proxies

Cost protection: Residential proxies require explicit user approval - no surprise costs!

Explicit Proxy Modes

# Force datacenter proxy only (even if residential configured)
./scrapai crawl spider_name --project proj --limit 10 --proxy-type datacenter
All modes follow smart strategy:
  • ✅ Start with direct connections (fast, free)
  • ✅ Only use proxy when blocked (403/429 errors)
  • ✅ Learn which domains need proxies
  • ✅ Use proxy proactively for blocked domains
The --proxy-type flag controls escalation behavior and cost limits.

Configuration

No spider configuration needed! SmartProxyMiddleware works automatically for all spiders once configured in .env.
The middleware is enabled by default in settings.py with priority 350.

Statistics Tracking

SmartProxyMiddleware tracks detailed usage statistics:
  • Direct requests - Connections without proxy
  • Proxy requests - Connections using proxy
  • Blocked retries - Requests that hit 403/429 and retried with proxy
  • Blocked domains - Domains that consistently need proxies
Statistics logged when spider closes:
📊 Proxy Statistics for 'spider_name':
   Direct requests: 1847
   Proxy requests: 153
   Blocked & retried: 153
   Blocked domains: 2
   Domains that needed proxy: example.com, protected-site.com
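The counters behind this report could be modeled roughly as follows. This is a hypothetical `ProxyStats` helper sketching how the four metrics relate, not the real implementation:

```python
from collections import Counter

class ProxyStats:
    """Sketch of per-spider proxy usage counters."""

    def __init__(self):
        self.counts = Counter()
        self.blocked_domains = set()

    def record(self, domain, used_proxy, was_retry=False):
        # Every request is either direct or proxied; retries after a
        # 403/429 are additionally counted and attributed to the domain.
        self.counts["proxy" if used_proxy else "direct"] += 1
        if was_retry:
            self.counts["blocked_retries"] += 1
            self.blocked_domains.add(domain)

    def summary(self):
        return (
            f"Direct requests: {self.counts['direct']}\n"
            f"Proxy requests: {self.counts['proxy']}\n"
            f"Blocked & retried: {self.counts['blocked_retries']}\n"
            f"Blocked domains: {len(self.blocked_domains)}"
        )
```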

Implementation Details

Middleware Logic (from source: middlewares.py):

1. First request to domain - try direct connection (no proxy)
2. If response is 403/429 - mark domain as blocked, retry with proxy
3. Subsequent requests to blocked domain - use proxy immediately (learned behavior)
4. Domain patterns remembered - blocked domains tracked for spider lifetime
Proxy URL Format:
http://username:password@proxy.example.com:8080
Implementation location:
  • File: middlewares.py:16-277
  • Class: SmartProxyMiddleware
  • Priority: 350 (configured in settings.py)
  • Type: Scrapy downloader middleware
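The four steps above can be sketched as a Scrapy-style downloader middleware. This is an illustrative reconstruction under assumed names, not the code in middlewares.py; it uses only the standard `request.meta["proxy"]` convention Scrapy's proxy support relies on:

```python
from urllib.parse import urlparse

class SmartProxySketch:
    """Illustrative downloader-middleware logic: direct first,
    mark domain on 403/429, proxy immediately thereafter."""

    BLOCK_CODES = (403, 429)

    def __init__(self, proxy_url):
        self.proxy_url = proxy_url        # http://user:pass@host:port
        self.blocked_domains = set()

    def process_request(self, request, spider=None):
        domain = urlparse(request.url).netloc
        if domain in self.blocked_domains:
            # Learned behavior: known-blocked domains get the proxy up front.
            request.meta["proxy"] = self.proxy_url
        return None  # let the request continue through the chain

    def process_response(self, request, response, spider=None):
        if response.status in self.BLOCK_CODES and "proxy" not in request.meta:
            # First block on a direct request: remember the domain and
            # re-issue the same request through the proxy.
            self.blocked_domains.add(urlparse(request.url).netloc)
            retry = request.copy()
            retry.meta["proxy"] = self.proxy_url
            # (In real Scrapy you would also create the retry with
            # dont_filter=True so the duplicate filter does not drop it.)
            return retry
        return response
```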

Proxy Providers

SmartProxyMiddleware works with any HTTP proxy provider. Use whatever proxy service you prefer; popular options include Bright Data, Oxylabs, IPRoyal, Smartproxy, and others.

Residential Proxies

For sites that block datacenter IPs:
Use the --proxy-type residential flag on the crawl command. The same smart strategy applies (direct first, proxy only when blocked). Configure RESIDENTIAL_PROXY_* vars in .env with your provider’s residential proxy credentials.
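Mirroring the datacenter block above, a residential entry in .env would look like this (the exact variable names are assumed to parallel the DATACENTER_PROXY_* set; check your provider for the residential hostname and port):

# Residential Proxy (used with --proxy-type residential)
RESIDENTIAL_PROXY_USERNAME=your_username
RESIDENTIAL_PROXY_PASSWORD=your_password
RESIDENTIAL_PROXY_HOST=res.yourproxy.com
RESIDENTIAL_PROXY_PORT=20000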

Troubleshooting

Proxy Not Being Used

1. Check .env configuration - verify all 4 variables are set (USERNAME, PASSWORD, HOST, PORT)
2. Verify credentials - ensure the proxy credentials are correct
3. Test the proxy manually:
   curl -x http://user:pass@host:port https://httpbin.org/ip
4. Check logs - look for the “Datacenter proxy available” message on spider start

Still Getting Blocked with Proxy

1. Check if the proxy IP is blocked - the proxy IP may already be blocked by the target site
2. Try a different provider - switch to an alternative proxy provider
3. Add delays - increase DOWNLOAD_DELAY in the spider config:
   {"settings": {"DOWNLOAD_DELAY": 2}}
4. Reduce concurrency - lower CONCURRENT_REQUESTS in the spider config:
   {"settings": {"CONCURRENT_REQUESTS": 8}}

Proxy Costs Too High

SmartProxyMiddleware should already minimize costs by using direct connections first.
If costs are still high:
  1. Check which domains are marked as blocked (in stats at spider close)
  2. Verify those domains actually need proxies
  3. Consider if site has changed and unblocking is possible
  4. Some sites may require proxies for all requests - this is expected

When to Mention to Users

Recommend proxy setup when:
  • User asks about proxies or rate limiting
  • Spider is getting blocked (403/429 errors in logs)
  • User needs to scrape at scale (1000s of pages)
  • User mentions proxy provider (Bright Data, Oxylabs, Smartproxy, etc.)
  • Crawls are failing with “Access Denied” or “Too Many Requests”