Overview
ScrapAI’s SmartProxyMiddleware intelligently manages proxy usage to avoid blocks while minimizing costs. It automatically escalates from direct connections → datacenter proxies → residential proxies (with user approval).
Key features:
- Starts with direct connections (fast, free)
- Detects blocking (403/429 errors) and retries with proxies
- Learns per-domain blocking patterns
- Requires user approval for expensive residential proxies
- Reduces proxy costs by 80-90%
Proxy Types
Datacenter Proxies (Recommended)
Fast (low latency), cheap (~$1/GB), high bandwidth. Works for most sites, but some sites block datacenter IPs.
Residential Proxies
Real residential IPs, harder to block, but expensive ($3-15/GB) and slower. Use only when datacenter proxies fail. Requires the --proxy-type residential flag.
Running on VPS/Cloud Servers
When residential proxies are required:
- Running on cloud/VPS servers
- Strong anti-bot protection (Cloudflare, DataDome, PerimeterX)
- Datacenter proxies getting blocked
Configuration
Setup Datacenter Proxy
Add credentials to .env:
Setup Residential Proxy
Add credentials to .env:
Environment Variables
Username for datacenter proxy authentication.
Password for datacenter proxy authentication.
Datacenter proxy server hostname. Example: proxy.provider.com
Datacenter proxy server port. Example: 8080 (check your provider’s documentation for the correct port)
Username for residential proxy authentication.
Password for residential proxy authentication.
Residential proxy server hostname. Example: residential.provider.com
Residential proxy server port. Example: 8080 (check your provider’s documentation for the correct port)
Usage
Auto Mode (Default)
Force Datacenter
Force Residential
Statistics
Proxy usage is logged when the spider closes:
- Direct requests vs proxy requests
- Blocked retries
- Domains requiring proxies
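As a rough illustration of how the figures listed above could be tallied, here is a minimal sketch. The ProxyStats class, its method names, and the summary keys are hypothetical and are not taken from ScrapAI's code; the actual log format may differ.

```python
from collections import Counter
from urllib.parse import urlparse


class ProxyStats:
    """Hypothetical tally of the logged figures; not ScrapAI's actual code."""

    def __init__(self):
        self.counts = Counter()      # keys: "direct", "proxy", "blocked_retries"
        self.proxy_domains = set()   # domains that ended up needing a proxy

    def record(self, url, route, blocked_retry=False):
        """Count one request sent via `route` ("direct" or "proxy")."""
        self.counts[route] += 1
        if blocked_retry:
            self.counts["blocked_retries"] += 1
        if route == "proxy":
            self.proxy_domains.add(urlparse(url).hostname)

    def summary(self):
        """Return the totals a spider might log when it closes."""
        return {
            "direct_requests": self.counts["direct"],
            "proxy_requests": self.counts["proxy"],
            "blocked_retries": self.counts["blocked_retries"],
            "proxy_domains": sorted(self.proxy_domains),
        }


stats = ProxyStats()
stats.record("https://a.example/1", "direct")
stats.record("https://b.example/1", "proxy", blocked_retry=True)
stats.record("https://b.example/2", "proxy")
report = stats.summary()
```

Comparing proxy_requests with direct_requests over a run gives a quick sense of how much traffic actually needed a paid proxy.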
Proxy Providers
Works with any HTTP proxy provider: Decodo, Bright Data, Oxylabs, IPRoyal, etc. Most providers offer both datacenter and residential proxies.
Technical Details
Middleware logic:
- Try direct connection first
- On 403/429 → mark domain blocked, retry with proxy
- Use proxy immediately for known-blocked domains
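The three steps above can be sketched in plain Python. This is a simplified illustration, not the SmartProxyMiddleware source: the ProxyEscalator class and its method names are hypothetical, and the sketch leaves out the datacenter-to-residential escalation and the approval step. In a Scrapy-style downloader middleware, the equivalent decisions would presumably be made in process_request and process_response.

```python
from urllib.parse import urlparse

BLOCK_STATUSES = {403, 429}  # status codes the middleware logic treats as blocking


class ProxyEscalator:
    """Hypothetical helper mirroring the direct-first, proxy-on-block logic."""

    def __init__(self):
        self.blocked_domains = set()  # domains that blocked a direct request

    def route_for(self, url):
        """Return "proxy" for known-blocked domains, otherwise "direct"."""
        domain = urlparse(url).hostname
        return "proxy" if domain in self.blocked_domains else "direct"

    def handle_response(self, url, status, route):
        """Return True when the request should be retried through a proxy."""
        if status in BLOCK_STATUSES and route == "direct":
            self.blocked_domains.add(urlparse(url).hostname)
            return True
        return False


escalator = ProxyEscalator()
url = "https://example.com/page"
first_route = escalator.route_for(url)                     # first contact goes direct
retry = escalator.handle_response(url, 403, first_route)   # blocked: remember the domain
next_route = escalator.route_for("https://example.com/other")  # same domain now uses a proxy
```

Keeping the blocked set keyed by domain is what lets later requests to the same site skip the failed direct attempt.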
SmartProxyMiddleware is defined in middlewares.py (priority 350). It works automatically for all spiders once the proxy credentials are configured in .env.
Troubleshooting
Proxy not being used:
- Verify all 4 variables in .env (USERNAME, PASSWORD, HOST, PORT)
- Test: curl -x http://user:pass@host:port https://httpbin.org/ip
- Check logs for “proxy available” message
Still getting blocked:
- Switch to residential proxies: --proxy-type residential
- Add delays: set DOWNLOAD_DELAY in spider config
- Reduce concurrency: set CONCURRENT_REQUESTS
High proxy costs:
- Check blocked domains in spider stats
- Verify those domains actually need proxies
- Consider a cheaper provider
Connection errors:
- Test connection: curl -x http://user:pass@host:port https://httpbin.org/ip
- Check firewall and network connectivity
- Verify proxy credentials
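The curl checks above use a proxy URL of the form http://user:pass@host:port. When a credential contains reserved characters such as @ or :, it has to be percent-encoded or the URL will be parsed incorrectly. The sketch below shows one way to assemble the URL from the four values; build_proxy_url is a hypothetical helper, not part of ScrapAI, and the sample credentials are made up.

```python
from urllib.parse import quote


def build_proxy_url(username, password, host, port, scheme="http"):
    """Assemble scheme://user:pass@host:port, percent-encoding the credentials."""
    user = quote(username, safe="")    # encode every reserved character
    secret = quote(password, safe="")
    return f"{scheme}://{user}:{secret}@{host}:{int(port)}"


# Made-up credentials; the password deliberately contains "@" and ":".
proxy_url = build_proxy_url("user", "p@ss:word", "proxy.provider.com", "8080")
```

The resulting string can be passed to curl -x to confirm that the credentials work outside the spider.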
Best Practices
- Start with datacenter proxies (cheaper, faster)
- Use auto mode to minimize costs
- Monitor blocked domains in statistics
- Test without proxies first
- Respect rate limits with delays and reduced concurrency
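For the last point, a conservative starting configuration might look like the sketch below. DOWNLOAD_DELAY and CONCURRENT_REQUESTS are standard Scrapy settings; the POLITE_SETTINGS name and the specific values are illustrative assumptions, and how the dictionary is passed into a ScrapAI spider config is not covered here.

```python
# Hypothetical values; tune them per target site.
POLITE_SETTINGS = {
    "DOWNLOAD_DELAY": 2.0,      # seconds to wait between requests to the same site
    "CONCURRENT_REQUESTS": 4,   # well below Scrapy's default of 16
}
```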