Overview
ScrapAI features SmartProxyMiddleware that intelligently manages proxy usage to avoid blocks while minimizing costs. The middleware automatically detects when proxies are needed and escalates from direct connections to datacenter proxies to residential proxies.
How It Works
Auto Mode Strategy (Default)
- Start with direct connections (fast, free)
- Detect blocking (403/429 errors)
- Automatically retry with datacenter proxy (cheap, fast)
- Learn which domains need proxies
- Use proxy proactively for known-blocked domains
- If the datacenter proxy also fails, expert-in-the-loop: ask the user before using expensive residential proxies
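The escalation order above can be sketched as a small decision function. This is an illustrative sketch only, not ScrapAI's actual implementation: the names `Tier`, `next_tier`, and the `approve_residential` callback are hypothetical, and the only facts taken from this page are the 403/429 block signals and the direct → datacenter → residential order.

```python
from enum import Enum
from typing import Callable, Optional


class Tier(Enum):
    DIRECT = "direct"
    DATACENTER = "datacenter"
    RESIDENTIAL = "residential"


BLOCK_STATUSES = {403, 429}  # status codes this page treats as "blocked"


def next_tier(
    current: Tier,
    status: int,
    approve_residential: Callable[[], bool],
) -> Optional[Tier]:
    """Return the tier to retry with, or None if no retry should happen."""
    if status not in BLOCK_STATUSES:
        return None  # not blocked, nothing to escalate
    if current is Tier.DIRECT:
        return Tier.DATACENTER  # cheap first escalation, no approval needed
    if current is Tier.DATACENTER and approve_residential():
        return Tier.RESIDENTIAL  # expensive tier only after explicit approval
    return None  # approval denied, or already on the last tier
```

Passing the approval step in as a callback keeps the cost decision with a human: the function itself never moves to the residential tier unless the callback returns `True`.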
Smart Cost Control
- Direct connections are faster and free
- Proxies only used when necessary
- Datacenter proxies preferred (cheaper)
- Residential proxies require explicit user approval
- Learns per-domain blocking patterns
- Reduces proxy bandwidth costs by 80-90%
- No surprise costs - expensive proxies need human approval
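To see where a saving in the 80-90% range could come from, here is a back-of-the-envelope sketch. Every figure in it (traffic volume, price per GB, share of traffic that needs a proxy) is a made-up input for illustration, not a measurement, and `estimate_proxy_cost` is a hypothetical helper.

```python
def estimate_proxy_cost(total_gb: float, proxied_fraction: float, price_per_gb: float) -> float:
    """Cost of the traffic that actually goes through a paid proxy."""
    if not 0.0 <= proxied_fraction <= 1.0:
        raise ValueError("proxied_fraction must be between 0 and 1")
    return total_gb * proxied_fraction * price_per_gb


# Illustrative inputs only: 100 GB crawled, $1/GB, 15% of traffic on blocked domains.
always_proxy = estimate_proxy_cost(100, 1.0, 1.0)
proxy_when_blocked = estimate_proxy_cost(100, 0.15, 1.0)
saving = 1 - proxy_when_blocked / always_proxy
print(f"saving: {saving:.0%}")
```

The saving is simply one minus the share of traffic that still needs a proxy, so the real figure depends on how many of your target domains block direct connections.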
Proxy Types
Datacenter Proxies (Recommended)
Best for:
- Most websites
- High-speed scraping
- Cost-effective scaling
- General use cases
Advantages:
- Fast (low latency)
- Cheap (< $1/GB typical)
- High bandwidth
- Reliable
Disadvantages:
- Some sites block datacenter IPs
- Easier to detect
Residential Proxies
Advantages:
- Real residential IPs
- Harder to block
- Better for strict sites
Disadvantages:
- Expensive
- Slower (higher latency)
- Lower bandwidth
⚠️ Running on VPS/Cloud Servers
Why this happens:
- Datacenter IPs are flagged in threat intelligence databases
- Many bots, scrapers, and malicious traffic originate from cloud providers
- Anti-bot services preemptively block entire datacenter IP ranges
- Your server IP may be on blocklists even if YOU never abused it
When residential proxies may be needed:
- Running ScrapAI on any cloud/VPS server
- Scraping sites with strong anti-bot protection
- Getting blocked despite using datacenter proxies
- Target site explicitly blocks datacenter IPs
Configuration
Setup Datacenter Proxy
Add credentials to .env:
Setup Residential Proxy
Add credentials to .env:
Environment Variables
Username for datacenter proxy authentication.
Password for datacenter proxy authentication.
Datacenter proxy server hostname. Example: proxy.provider.com
Datacenter proxy server port. Example: 8080 (check your provider’s documentation for the correct port)
Username for residential proxy authentication.
Password for residential proxy authentication.
Residential proxy server hostname. Example: residential.provider.com
Residential proxy server port. Example: 8080 (check your provider’s documentation for the correct port)
Example Configuration
Datacenter Proxies
Residential Proxies
Usage
Auto Mode (Default)
Smart escalation with cost control:
- ✅ Start with direct connections (fast, free)
- ✅ On block (403/429) → Try datacenter proxy (cheap, fast)
- ⚠️ Datacenter fails → Expert-in-the-loop prompt:
- 👤 User decides whether to use expensive residential proxies
Cost protection: Residential proxies require explicit user approval - no surprise costs!
Datacenter Only
Force datacenter proxy only (even if residential configured):
Residential Only
Force residential proxy (explicit approval given):
- ✅ Start with direct connections (fast, free)
- ✅ Only use proxy when blocked (403/429 errors)
- ✅ Learn which domains need proxies
- ✅ Use proxy proactively for blocked domains
The --proxy-type flag controls escalation behavior and cost limits.
Statistics Tracking
SmartProxyMiddleware tracks proxy usage:
- Direct requests - Connections without proxy
- Proxy requests - Connections using proxy
- Blocked retries - Requests that hit 403/429 and retried with proxy
- Blocked domains - Domains that consistently need proxies
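The four counters above could be kept in a structure like the following. This is a hypothetical sketch: `ProxyStats` and its key names are invented here for illustration and are not ScrapAI's actual stat names.

```python
from collections import Counter
from urllib.parse import urlparse


class ProxyStats:
    """Hypothetical holder for the four counters listed above."""

    def __init__(self):
        self.counts = Counter()
        self.blocked_domains = set()

    def record(self, url, used_proxy, retried_after_block=False):
        # Every request is either a direct request or a proxy request.
        key = "proxy_requests" if used_proxy else "direct_requests"
        self.counts[key] += 1
        if retried_after_block:
            # A retry triggered by a 403/429 also marks the domain as blocked.
            self.counts["blocked_retries"] += 1
            self.blocked_domains.add(urlparse(url).netloc)

    def summary(self):
        return {
            "direct_requests": self.counts["direct_requests"],
            "proxy_requests": self.counts["proxy_requests"],
            "blocked_retries": self.counts["blocked_retries"],
            "blocked_domains": sorted(self.blocked_domains),
        }
```

Comparing `direct_requests` with `proxy_requests` at the end of a crawl gives a quick view of how much traffic actually needed a proxy.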
Proxy Providers
SmartProxyMiddleware works with any HTTP proxy provider. Use whatever proxy service you prefer. Popular options include:
Datacenter Proxies
Residential Proxies
- Use with the --proxy-type residential flag on the crawl command
- Same smart strategy (direct first, proxy only when blocked)
- Most providers offer both datacenter and residential proxies - configure RESIDENTIAL_PROXY_* vars in .env
Technical Details
Middleware Logic
- On first request to domain → try direct connection
- If response is 403/429 → mark domain as blocked, retry with proxy
- On subsequent requests to blocked domain → use proxy immediately
- Blocked domains remembered for spider lifetime
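The four rules above amount to a small per-domain state machine. The sketch below shows that logic in plain Python without importing Scrapy; `BlockedDomainTracker` and its method names are hypothetical. In a Scrapy downloader middleware, logic of this kind would typically sit in the `process_request` and `process_response` hooks, but the real SmartProxyMiddleware code may be organized differently.

```python
from urllib.parse import urlparse

BLOCK_STATUSES = {403, 429}


class BlockedDomainTracker:
    """Remembers which domains returned a blocking status during one run."""

    def __init__(self):
        # In-memory only, so the knowledge lasts for the spider's lifetime.
        self._blocked = set()

    def should_use_proxy(self, url: str) -> bool:
        """True if an earlier request to this domain was blocked."""
        return urlparse(url).netloc in self._blocked

    def handle_response(self, url: str, status: int, used_proxy: bool) -> bool:
        """Return True when the request should be retried through a proxy."""
        if status in BLOCK_STATUSES and not used_proxy:
            self._blocked.add(urlparse(url).netloc)
            return True
        return False
```

A short usage example:

```python
tracker = BlockedDomainTracker()
url = "https://example.com/page"
if tracker.handle_response(url, 403, used_proxy=False):
    # The next request to example.com goes through the proxy straight away.
    assert tracker.should_use_proxy("https://example.com/other")
```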
Proxy URL Format
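The exact format string is not reproduced in this section. As a sketch, assuming the conventional `scheme://username:password@host:port` form used for authenticated HTTP proxies, a URL can be assembled from the four configured values as follows. `build_proxy_url` is a hypothetical helper, not part of ScrapAI.

```python
from urllib.parse import quote


def build_proxy_url(username: str, password: str, host: str, port: int, scheme: str = "http") -> str:
    """Assemble a proxy URL, percent-encoding the credentials."""
    # Characters such as '@' or ':' in credentials would otherwise break URL parsing.
    user = quote(username, safe="")
    pwd = quote(password, safe="")
    return f"{scheme}://{user}:{pwd}@{host}:{port}"
```

Percent-encoding matters mostly for generated passwords, which often contain `@`, `:` or `/`.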
Implementation
- Location: middlewares.py
- Class: SmartProxyMiddleware
- Priority: 350 (in settings.py)
- Type: Scrapy downloader middleware
No Spider Configuration Needed
SmartProxyMiddleware works automatically for all spiders once configured in .env. The middleware is enabled by default in settings.py with priority 350.
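For orientation, a downloader middleware is registered in a Scrapy project through the `DOWNLOADER_MIDDLEWARES` setting, which maps a class path to its priority. The fragment below is a sketch of what such an entry looks like; the module path `myproject.middlewares` is a placeholder, not ScrapAI's real import path.

```python
# settings.py (sketch; "myproject.middlewares" is a placeholder module path)
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.SmartProxyMiddleware": 350,
}
```

In Scrapy, middlewares with lower numbers see outgoing requests first, so a priority of 350 places this one ahead of the built-in HttpProxyMiddleware, which defaults to 750.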
Troubleshooting
Proxy not being used
- Check .env has all 4 variables set (USERNAME, PASSWORD, HOST, PORT)
- Verify proxy credentials are correct
- Test proxy manually:
- Check logs for “Datacenter proxy available” message on spider start
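The first check in this list can be automated with a few lines of Python. This is a sketch under two assumptions: the variables have already been loaded into the process environment, since the script does not parse .env itself, and the datacenter variables use a `DATACENTER_PROXY_` prefix. Only the `RESIDENTIAL_PROXY_*` prefix and the four suffixes are stated on this page. `missing_proxy_vars` is a hypothetical helper.

```python
import os

REQUIRED_SUFFIXES = ("USERNAME", "PASSWORD", "HOST", "PORT")


def missing_proxy_vars(prefix: str, env=None) -> list:
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [
        f"{prefix}_{suffix}"
        for suffix in REQUIRED_SUFFIXES
        if not env.get(f"{prefix}_{suffix}", "").strip()
    ]


if __name__ == "__main__":
    # "DATACENTER_PROXY" is an assumed prefix; adjust it to match your .env.
    for prefix in ("DATACENTER_PROXY", "RESIDENTIAL_PROXY"):
        missing = missing_proxy_vars(prefix)
        print(prefix, "OK" if not missing else f"missing: {', '.join(missing)}")
```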
Still getting blocked with proxy
- Check if proxy IP is already blocked by target site
- Try different proxy provider
- Add delays between requests (set DOWNLOAD_DELAY in spider config)
- Reduce concurrency (set CONCURRENT_REQUESTS in spider config)
- Switch to residential proxies:
Proxy costs too high
SmartProxyMiddleware should already minimize costs by using direct connections first. If costs are still high:
- Check which domains are marked as blocked (in stats at spider close)
- Verify those domains actually need proxies
- Consider if site has changed and unblocking is possible
- Some sites may require proxies for all requests - this is expected
- Consider switching to cheaper proxy provider
Authentication failed
Connection timeout
- Check proxy server is reachable:
- Verify firewall allows outbound connections
- Try different proxy port
- Contact proxy provider support
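As one way to run the reachability check from Python, the sketch below opens a plain TCP connection to the proxy's host and port. It only confirms that the port accepts connections; it does not test authentication or whether the proxy forwards traffic. `proxy_reachable` is a hypothetical helper.

```python
import socket


def proxy_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

If this returns `False` from the scraping host but `True` from another machine, an outbound firewall rule is the likely cause.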
Best Practices
- Start with datacenter proxies
  - Cheaper and faster
  - Works for most sites
- Use auto mode
  - Minimizes costs automatically
  - Expert-in-the-loop prevents surprise charges
- Monitor statistics
  - Review blocked domains in logs
  - Adjust strategy based on patterns
- Test without proxies first
  - Many sites don’t require proxies
  - Save costs where possible
- Respect rate limits
  - Add delays between requests
  - Reduce concurrency if needed
  - Proxies don’t make aggressive scraping acceptable
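For the rate-limit advice, the relevant Scrapy settings are `DOWNLOAD_DELAY`, `CONCURRENT_REQUESTS`, and optionally `AUTOTHROTTLE_ENABLED`. The fragment below is a sketch with illustrative values. How ScrapAI's spider config accepts these settings is not shown on this page; in plain Scrapy they would go in `settings.py` or in a spider's `custom_settings`. The name `POLITE_SETTINGS` is hypothetical.

```python
# Illustrative values only; tune them per target site.
POLITE_SETTINGS = {
    "DOWNLOAD_DELAY": 2.0,         # seconds to wait between requests to the same site
    "CONCURRENT_REQUESTS": 4,      # global cap on in-flight requests
    "AUTOTHROTTLE_ENABLED": True,  # let Scrapy adapt the delay to observed latency
}
```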
When to Use Proxies
Recommend proxy setup when:
- User asks about proxies or rate limiting
- Spider is getting blocked (403/429 errors in logs)
- User needs to scrape at scale (1000s of pages)
- User mentions proxy provider (Bright Data, Oxylabs, Smartproxy, etc.)
- Crawls are failing with “Access Denied” or “Too Many Requests”