Overview

ScrapAI’s SmartProxyMiddleware manages proxy usage to avoid blocks while minimizing costs. It automatically escalates from direct connections → datacenter proxies → residential proxies (with user approval).
Key features:
  • Starts with direct connections (fast, free)
  • Detects blocking (403/429 errors) and retries with proxies
  • Learns per-domain blocking patterns
  • Requires user approval for expensive residential proxies
  • Reduces proxy costs by 80-90%

Proxy Types

Datacenter Proxies

Fast (low latency), cheap (~$1/GB), high bandwidth. Work for most sites, but some sites block datacenter IPs.

Residential Proxies

Real residential IPs that are harder to block, but expensive ($3-15/GB) and slower. Use only when datacenter proxies fail. Requires the --proxy-type residential flag.

Running on VPS/Cloud Servers

Cloud servers (AWS, DigitalOcean, Azure, etc.) often have poor IP reputation. Anti-bot services may block datacenter IPs immediately, causing 403/429 errors or unresolvable Cloudflare challenges.
Solution: Use the --proxy-type residential flag. Residential IPs are much less likely to be caught by IP-based blocking.
When residential proxies are required:
  • Running on cloud/VPS servers
  • Strong anti-bot protection (Cloudflare, DataDome, PerimeterX)
  • Datacenter proxies getting blocked
Local development (home ISP) usually works with datacenter proxies. VPS/cloud servers often need residential.

Configuration

Setup Datacenter Proxy

Add credentials to .env:
# Datacenter Proxy (default - used automatically)
DATACENTER_PROXY_USERNAME=your_username
DATACENTER_PROXY_PASSWORD=your_password
DATACENTER_PROXY_HOST=proxy.provider.com
# Check your provider's documentation for the correct port
DATACENTER_PROXY_PORT=8080

Setup Residential Proxy

Add credentials to .env:
# Residential Proxy (used with --proxy-type residential flag)
RESIDENTIAL_PROXY_USERNAME=your_username
RESIDENTIAL_PROXY_PASSWORD=your_password
RESIDENTIAL_PROXY_HOST=residential.provider.com
# Check your provider's documentation for the correct port
RESIDENTIAL_PROXY_PORT=8080

Environment Variables

DATACENTER_PROXY_USERNAME
string
Username for datacenter proxy authentication.
DATACENTER_PROXY_PASSWORD
string
Password for datacenter proxy authentication.
DATACENTER_PROXY_HOST
string
Datacenter proxy server hostname. Example: proxy.provider.com
DATACENTER_PROXY_PORT
number
Datacenter proxy server port. Example: 8080 (check your provider’s documentation for the correct port)
RESIDENTIAL_PROXY_USERNAME
string
Username for residential proxy authentication.
RESIDENTIAL_PROXY_PASSWORD
string
Password for residential proxy authentication.
RESIDENTIAL_PROXY_HOST
string
Residential proxy server hostname. Example: residential.provider.com
RESIDENTIAL_PROXY_PORT
number
Residential proxy server port. Example: 8080 (check your provider’s documentation for the correct port)
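
The four variables per proxy type combine into a single proxy URL of the form http://user:pass@host:port, the same shape used in the curl test under Troubleshooting. The following is a minimal sketch of that assembly, not ScrapAI's actual code: the function name build_proxy_url is hypothetical, and the percent-encoding of credentials is an added precaution rather than documented behavior.

```python
import os
from urllib.parse import quote


def build_proxy_url(prefix, env=os.environ):
    """Return an http:// proxy URL for 'DATACENTER' or 'RESIDENTIAL'.

    Hypothetical helper: returns None when any of the four variables
    is unset or blank, so the caller can fall back to direct connections.
    """
    names = ("USERNAME", "PASSWORD", "HOST", "PORT")
    values = {name: env.get(f"{prefix}_PROXY_{name}", "").strip() for name in names}
    if not all(values.values()):
        return None

    # Percent-encode credentials so characters such as '@' or ':'
    # cannot be mistaken for URL delimiters.
    user = quote(values["USERNAME"], safe="")
    password = quote(values["PASSWORD"], safe="")
    port = int(values["PORT"])  # raises ValueError if the port is not numeric
    return f"http://{user}:{password}@{values['HOST']}:{port}"
```

Passing a plain dict as env makes the helper easy to check without touching the real environment; by default it reads os.environ, so the .env file must already have been loaded.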

Usage

Auto Mode (Default)

./scrapai crawl spider_name --project proj --limit 10
Starts with direct connections, escalates to the datacenter proxy when a request is blocked, and prompts for approval before using the residential proxy if that is still not enough.

Force Datacenter

./scrapai crawl spider_name --project proj --proxy-type datacenter

Force Residential

./scrapai crawl spider_name --project proj --proxy-type residential
All modes try direct connections first and switch to a proxy only when needed (after a 403 or 429 response). The --proxy-type flag selects which proxy is used at that point.

Statistics

Proxy usage is logged when the spider closes:
  • Direct requests vs proxy requests
  • Blocked retries
  • Domains requiring proxies

Proxy Providers

Works with any HTTP proxy provider: Decodo, Bright Data, Oxylabs, IPRoyal, etc. Most providers offer both datacenter and residential proxies.

Technical Details

Middleware logic:
  1. Try direct connection first
  2. On 403/429 → mark domain blocked, retry with proxy
  3. Use proxy immediately for known-blocked domains
Implementation: SmartProxyMiddleware in middlewares.py (priority 350). Works automatically for all spiders once configured in .env.
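
The three steps above can be sketched as a small, framework-free model. This is an illustration under stated assumptions, not the real SmartProxyMiddleware: the class name ProxyEscalator, its method names, and the counter keys are all hypothetical, and the sketch handles a single proxy tier only (no datacenter → residential escalation and no approval prompt).

```python
from urllib.parse import urlparse

# Status codes treated as "blocked", per the middleware logic above.
BLOCK_STATUSES = {403, 429}


class ProxyEscalator:
    """Toy model of direct-first proxy use with per-domain memory."""

    def __init__(self, proxy_url):
        self.proxy_url = proxy_url      # None means no proxy is configured
        self.blocked_domains = set()    # domains that have returned 403/429
        self.stats = {"direct": 0, "proxy": 0, "blocked_retries": 0}

    def proxy_for(self, url):
        """Return the proxy URL to use, or None for a direct connection."""
        domain = urlparse(url).netloc
        # Step 3: known-blocked domains go through the proxy immediately.
        if self.proxy_url and domain in self.blocked_domains:
            self.stats["proxy"] += 1
            return self.proxy_url
        # Step 1: everything else starts direct.
        self.stats["direct"] += 1
        return None

    def should_retry(self, url, status, used_proxy):
        """Step 2: on a blocked direct request, remember the domain and retry."""
        if status not in BLOCK_STATUSES or used_proxy or not self.proxy_url:
            return False
        self.blocked_domains.add(urlparse(url).netloc)
        self.stats["blocked_retries"] += 1
        return True
```

In this model a caller asks proxy_for before each request and should_retry after each response; once a domain has been blocked, every later request to it skips the direct attempt. The stats dictionary mirrors the kind of counters listed under Statistics.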

Troubleshooting

Proxy not being used:
  • Verify that all four variables are set in .env (USERNAME, PASSWORD, HOST, PORT)
  • Test: curl -x http://user:pass@host:port https://httpbin.org/ip
  • Check logs for “proxy available” message
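
The first check in this list can be scripted. The sketch below reports which of the four required variables are unset or blank for each proxy type. The helper name missing_proxy_vars is hypothetical, and the script reads the process environment, so it assumes the .env values have already been loaded or exported.

```python
import os

REQUIRED_SUFFIXES = ("USERNAME", "PASSWORD", "HOST", "PORT")


def missing_proxy_vars(prefix, env=os.environ):
    """Return the required proxy variable names that are unset or blank."""
    names = [f"{prefix}_PROXY_{suffix}" for suffix in REQUIRED_SUFFIXES]
    return [name for name in names if not env.get(name, "").strip()]


if __name__ == "__main__":
    for prefix in ("DATACENTER", "RESIDENTIAL"):
        missing = missing_proxy_vars(prefix)
        status = "OK" if not missing else "missing: " + ", ".join(missing)
        print(f"{prefix}: {status}")
```

An empty list means that proxy type is fully configured; anything else names exactly which variables to add.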
Still getting blocked:
  • Switch to residential proxies: --proxy-type residential
  • Add delays: set DOWNLOAD_DELAY in spider config
  • Reduce concurrency: set CONCURRENT_REQUESTS
High costs:
  • Check blocked domains in spider stats
  • Verify those domains actually need proxies
  • Consider a cheaper provider
Connection issues:
  • Test connection: curl -x http://user:pass@host:port https://httpbin.org/ip
  • Check firewall and network connectivity
  • Verify proxy credentials

Best Practices

  • Start with datacenter proxies (cheaper, faster)
  • Use auto mode to minimize costs
  • Monitor blocked domains in statistics
  • Test without proxies first
  • Respect rate limits with delays and reduced concurrency