ScrapAI includes SmartProxyMiddleware that intelligently manages proxy usage to avoid blocks while minimizing costs.

How It Works

SmartProxyMiddleware escalates through proxy tiers in order:
  1. Direct connections first - Fast and free
  2. Detect blocking - Automatic detection of 403/429 errors
  3. Datacenter proxy fallback - Cheap, fast option
  4. Learn domain patterns - Remembers which domains need proxies
  5. Expert-in-the-loop - Asks before using expensive residential proxies
Smart cost control: Direct connections when possible, datacenter proxies for blocks, residential proxies only with explicit approval. Reduces proxy costs by 80-90%.
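The escalation above can be sketched as a small decision function. This is a minimal, hypothetical sketch of the described behavior; names like `ProxyTier`, `next_tier`, and `BLOCK_CODES` are illustrative and not ScrapAI's actual API:

```python
from enum import Enum

BLOCK_CODES = {403, 429}  # responses treated as "blocked"

class ProxyTier(Enum):
    DIRECT = 0        # free, always tried first
    DATACENTER = 1    # cheap fallback after a block
    RESIDENTIAL = 2   # expensive, requires explicit approval

def next_tier(domain, status, known_blocked, residential_approved):
    """Pick the tier for the next request to `domain` after seeing `status`."""
    if status not in BLOCK_CODES:
        return ProxyTier.DIRECT
    if domain not in known_blocked:
        known_blocked.add(domain)  # learn the domain pattern for next time
        return ProxyTier.DATACENTER
    # Datacenter already failed for this domain: escalate only with approval.
    return ProxyTier.RESIDENTIAL if residential_approved else None
```

Returning `None` models the expert-in-the-loop pause: the crawl stops escalating until the user re-runs with residential approval.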

Setup

Add proxy credentials to .env:
# Datacenter Proxy (default - used with --proxy-type datacenter or no flag)
DATACENTER_PROXY_USERNAME=your_username
DATACENTER_PROXY_PASSWORD=your_password
DATACENTER_PROXY_HOST=dc.yourproxy.com
DATACENTER_PROXY_PORT=10000
Proxy Configuration: Check your proxy provider’s documentation for:
  • Correct hostname and port for datacenter vs residential proxies
  • Rotating vs sticky IP options
  • Authentication requirements
Since SmartProxyMiddleware uses a single proxy connection, use rotating IP ports when available for better IP distribution.
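As an illustration of how the four .env variables combine into a proxy URL, here is a hedged sketch; the variable names match the example above, but the function itself is hypothetical, not ScrapAI's internals:

```python
import os

def datacenter_proxy_url():
    """Return an http proxy URL built from the .env variables, or None if
    any credential is missing (the middleware then runs direct-only)."""
    user = os.getenv("DATACENTER_PROXY_USERNAME")
    password = os.getenv("DATACENTER_PROXY_PASSWORD")
    host = os.getenv("DATACENTER_PROXY_HOST")
    port = os.getenv("DATACENTER_PROXY_PORT")
    if not all([user, password, host, port]):
        return None
    return f"http://{user}:{password}@{host}:{port}"
```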

Usage

Auto Mode (Default) - Expert-in-the-Loop

# Auto mode (default) - smart escalation
./scrapai crawl spider_name --project proj --limit 10

# Explicit auto mode
./scrapai crawl spider_name --project proj --limit 10 --proxy-type auto
When datacenter proxy fails, you’ll see:
⚠️  EXPERT-IN-THE-LOOP: Datacenter proxy failed for some domains
🏠 Residential proxy is available but may incur HIGHER COSTS

Blocked domains: example.com, site.org

To proceed with residential proxy, run:
  ./scrapai crawl spider_name --project proj --proxy-type residential
Cost protection: Residential proxies require explicit user approval - no surprise costs!

Explicit Proxy Modes

# Force datacenter proxy only (even if residential configured)
./scrapai crawl spider_name --project proj --limit 10 --proxy-type datacenter

Configuration

No spider configuration needed! SmartProxyMiddleware works automatically for all spiders once configured in .env.
The middleware is enabled by default in settings.py with priority 350.

Statistics Tracking

SmartProxyMiddleware tracks detailed usage statistics:
  • Direct requests - Connections without proxy
  • Proxy requests - Connections using proxy
  • Blocked retries - Requests that hit 403/429 and retried with proxy
  • Blocked domains - Domains that consistently need proxies
Statistics are logged when the spider closes:
📊 Proxy Statistics for 'spider_name':
   Direct requests: 1847
   Proxy requests: 153
   Blocked & retried: 153
   Blocked domains: 2
   Domains that needed proxy: example.com, protected-site.com
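The counters behind these statistics can be sketched as follows. The class and attribute names are illustrative only, not the middleware's real internals:

```python
class ProxyStats:
    """Per-spider counters mirroring the statistics described above."""

    def __init__(self):
        self.direct_requests = 0
        self.proxy_requests = 0
        self.blocked_retries = 0
        self.blocked_domains = set()

    def record(self, domain, used_proxy, was_blocked_retry):
        """Record one request outcome."""
        if used_proxy:
            self.proxy_requests += 1
        else:
            self.direct_requests += 1
        if was_blocked_retry:
            self.blocked_retries += 1
            self.blocked_domains.add(domain)

    def summary(self):
        return (f"Direct: {self.direct_requests}, "
                f"Proxy: {self.proxy_requests}, "
                f"Blocked & retried: {self.blocked_retries}, "
                f"Blocked domains: {len(self.blocked_domains)}")
```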

Proxy Providers

SmartProxyMiddleware works with any HTTP proxy provider. Popular options include Bright Data, Oxylabs, IPRoyal, Smartproxy, and others.

Troubleshooting

Proxy Not Being Used

  1. Check .env configuration - Verify all four variables are set (USERNAME, PASSWORD, HOST, PORT)
  2. Verify credentials - Ensure the proxy credentials are correct
  3. Test the proxy manually:
     curl -x http://user:pass@host:port https://httpbin.org/ip
  4. Check logs - Look for the “Datacenter proxy available” message on spider start
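The manual curl check can also be run from Python using only the standard library. This is a hedged sketch, not part of ScrapAI; substitute your real credentials for `proxy_url`:

```python
import json
import urllib.request

def check_proxy(proxy_url, test_url="https://httpbin.org/ip", timeout=10):
    """Fetch test_url through the proxy and return the IP it reports."""
    handler = urllib.request.ProxyHandler({"http": proxy_url,
                                           "https": proxy_url})
    opener = urllib.request.build_opener(handler)
    with opener.open(test_url, timeout=timeout) as resp:
        return json.load(resp)["origin"]

# Example (requires network and valid credentials):
# check_proxy("http://user:pass@dc.yourproxy.com:10000")
```

If the returned IP matches your own rather than the proxy's, the proxy is being bypassed or misconfigured.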

Still Getting Blocked with Proxy

  1. Check if the proxy IP is blocked - The proxy IP may already be blocked by the target site
  2. Try a different provider - Switch to an alternative proxy provider
  3. Add delays - Increase DOWNLOAD_DELAY in the spider config:
     {"settings": {"DOWNLOAD_DELAY": 2}}
  4. Reduce concurrency - Lower CONCURRENT_REQUESTS in the spider config:
     {"settings": {"CONCURRENT_REQUESTS": 8}}

Proxy Costs Too High

SmartProxyMiddleware should already minimize costs by using direct connections first.
If costs are still high:
  1. Check which domains are marked as blocked (in stats at spider close)
  2. Verify that those domains actually need proxies
  3. Consider whether the site’s blocking has changed and the domain can be unblocked
  4. Some sites require proxies for all requests - this is expected

Related

  • Cloudflare Bypass - Handle Cloudflare-protected sites
  • Checkpoint Resume - Pause and resume long crawls