
Overview

ScrapAI includes SmartProxyMiddleware, which intelligently manages proxy usage to avoid blocks while minimizing costs. The middleware automatically detects when proxies are needed and escalates from direct connections to datacenter proxies, and finally to residential proxies.

How It Works

Auto Mode Strategy (Default)

  1. Start with direct connections (fast, free)
  2. Detect blocking (403/429 errors)
  3. Automatically retry with datacenter proxy (cheap, fast)
  4. Learn which domains need proxies
  5. Use proxy proactively for known-blocked domains
  6. Expert-in-the-loop if datacenter fails → ask user before using expensive residential

Smart Cost Control

  • Direct connections are faster and free
  • Proxies only used when necessary
  • Datacenter proxies preferred (cheaper)
  • Residential proxies require explicit user approval
  • Learns per-domain blocking patterns
  • Reduces proxy bandwidth costs by 80-90% compared with routing every request through a proxy
  • No surprise costs - expensive proxies need human approval

Proxy Types

Datacenter Proxies

Best for:
  • Most websites
  • High-speed scraping
  • Cost-effective scaling
  • General use cases
Advantages:
  • Fast (low latency)
  • Cheap (< $1/GB typical)
  • High bandwidth
  • Reliable
Limitations:
  • Some sites block datacenter IPs
  • Easier to detect

Residential Proxies

Use only when needed:
  • Sites that block datacenter IPs
  • Requires explicit --proxy-type residential flag
  • Higher cost ($3-15/GB typical)
Advantages:
  • Real residential IPs
  • Harder to block
  • Better for strict sites
Limitations:
  • Expensive
  • Slower (higher latency)
  • Lower bandwidth

⚠️ Running on VPS/Cloud Servers

Remote servers (AWS, DigitalOcean, Hetzner, etc.) often have poor IP reputation! Anti-bot services like Cloudflare, DataDome, and PerimeterX track datacenter IP ranges and may block requests from:
  • AWS EC2, Lambda
  • DigitalOcean Droplets
  • Google Cloud Compute
  • Azure VMs
  • Hetzner, OVH, Vultr
  • Any VPS/cloud provider
Symptoms:
  • Getting blocked even without high request rates
  • 403/429 errors immediately on first request
  • Cloudflare challenges that never resolve
  • “Access Denied” messages
Solution: Use residential proxies with --proxy-type residential flag. Residential IPs have clean reputation and bypass IP-based blocking.
Why this happens:
  • Datacenter IPs are flagged in threat intelligence databases
  • Many bots, scrapers, and malicious traffic originate from cloud providers
  • Anti-bot services preemptively block entire datacenter IP ranges
  • Your server IP may be on blocklists even if YOU never abused it
When you MUST use residential proxies:
  • Running ScrapAI on any cloud/VPS server
  • Scraping sites with strong anti-bot protection
  • Getting blocked despite using datacenter proxies
  • Target site explicitly blocks datacenter IPs
Local development vs Production:
  • Local (home ISP): Datacenter proxies usually work fine
  • VPS/Cloud servers: Residential proxies often required
  • Test locally first, then switch to residential when deploying to cloud

Configuration

Setup Datacenter Proxy

Add credentials to .env:
# Datacenter Proxy (default - used automatically)
DATACENTER_PROXY_USERNAME=your_username
DATACENTER_PROXY_PASSWORD=your_password
DATACENTER_PROXY_HOST=proxy.provider.com
DATACENTER_PROXY_PORT=8080  # Check your provider's documentation

Setup Residential Proxy

Add credentials to .env:
# Residential Proxy (used with --proxy-type residential flag)
RESIDENTIAL_PROXY_USERNAME=your_username
RESIDENTIAL_PROXY_PASSWORD=your_password
RESIDENTIAL_PROXY_HOST=residential.provider.com
RESIDENTIAL_PROXY_PORT=8080  # Check your provider's documentation

Environment Variables

DATACENTER_PROXY_USERNAME
string
Username for datacenter proxy authentication.
DATACENTER_PROXY_PASSWORD
string
Password for datacenter proxy authentication.
DATACENTER_PROXY_HOST
string
Datacenter proxy server hostname. Example: proxy.provider.com
DATACENTER_PROXY_PORT
number
Datacenter proxy server port. Example: 8080 (check your provider’s documentation for the correct port)
RESIDENTIAL_PROXY_USERNAME
string
Username for residential proxy authentication.
RESIDENTIAL_PROXY_PASSWORD
string
Password for residential proxy authentication.
RESIDENTIAL_PROXY_HOST
string
Residential proxy server hostname. Example: residential.provider.com
RESIDENTIAL_PROXY_PORT
number
Residential proxy server port. Example: 8080 (check your provider’s documentation for the correct port)
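With all eight variables defined, a quick sanity check that a tier is fully configured can look like this — a minimal sketch (not part of ScrapAI) that assumes the .env file has already been loaded into the process environment:

```python
import os

def proxy_config(prefix):
    """Return the four credentials for a proxy tier, or None if any is missing."""
    keys = ("USERNAME", "PASSWORD", "HOST", "PORT")
    values = {k: os.environ.get(f"{prefix}_PROXY_{k}") for k in keys}
    return values if all(values.values()) else None

# Simulate what loading .env would do:
os.environ.update({
    "DATACENTER_PROXY_USERNAME": "your_username",
    "DATACENTER_PROXY_PASSWORD": "your_password",
    "DATACENTER_PROXY_HOST": "proxy.provider.com",
    "DATACENTER_PROXY_PORT": "8080",
})

print(proxy_config("DATACENTER") is not None)  # True — all four variables set
```

A partially configured tier (for example, a missing port) returns None, which matches the middleware's behavior of treating an incomplete tier as unavailable.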

Example Configuration

Datacenter Proxies

# Get credentials from your proxy provider
DATACENTER_PROXY_USERNAME=your_username
DATACENTER_PROXY_PASSWORD=your_password
DATACENTER_PROXY_HOST=proxy.provider.com
DATACENTER_PROXY_PORT=8080  # Check your provider's documentation

Residential Proxies

# Get credentials from your proxy provider
RESIDENTIAL_PROXY_USERNAME=your_username
RESIDENTIAL_PROXY_PASSWORD=your_password
RESIDENTIAL_PROXY_HOST=residential.provider.com
RESIDENTIAL_PROXY_PORT=8080  # Check your provider's documentation
Check your proxy provider’s documentation for the correct host, port, and whether they offer rotating vs sticky IPs. ScrapAI’s SmartProxyMiddleware works with any standard HTTP proxy.

Usage

Auto Mode (Default)

Smart escalation with cost control:
# Auto mode (default) - smart escalation
./scrapai crawl spider_name --project proj --limit 10

# Explicit auto mode
./scrapai crawl spider_name --project proj --limit 10 --proxy-type auto
How auto mode works:
  1. ✅ Start with direct connections (fast, free)
  2. ✅ On block (403/429) → Try datacenter proxy (cheap, fast)
  3. ⚠️ Datacenter fails → Expert-in-the-loop prompt:
    ⚠️  EXPERT-IN-THE-LOOP: Datacenter proxy failed for some domains
    🏠 Residential proxy is available but may incur HIGHER COSTS
    
    Blocked domains: example.com, site.org
    
    To proceed with residential proxy, run:
      ./scrapai crawl spider_name --project proj --proxy-type residential
    
  4. 👤 User decides whether to use expensive residential proxies
Cost protection: Residential proxies require explicit user approval - no surprise costs!

Datacenter Only

Force datacenter proxy only (even if residential configured):
./scrapai crawl spider_name --project proj --limit 10 --proxy-type datacenter

Residential Only

Force residential proxy (explicit approval given):
./scrapai crawl spider_name --project proj --limit 10 --proxy-type residential
All modes follow the same smart strategy:
  • ✅ Start with direct connections (fast, free)
  • ✅ Only use proxy when blocked (403/429 errors)
  • ✅ Learn which domains need proxies
  • ✅ Use proxy proactively for blocked domains
The --proxy-type flag controls escalation behavior and cost limits.
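One way to picture the flag is as a ceiling on the escalation ladder. The sketch below is illustrative only — the tier names mirror the flag values, but this is not ScrapAI's internal code:

```python
def allowed_tiers(proxy_type):
    """Escalation ladder permitted by --proxy-type: every mode starts direct."""
    if proxy_type == "auto":
        # Full ladder; the residential step still needs explicit user approval.
        return ["direct", "datacenter", "residential"]
    # Explicit type: go direct first, escalate only to the named tier.
    return ["direct", proxy_type]

print(allowed_tiers("datacenter"))  # ['direct', 'datacenter']
```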

Statistics Tracking

SmartProxyMiddleware tracks proxy usage:
  • Direct requests - Connections without proxy
  • Proxy requests - Connections using proxy
  • Blocked retries - Requests that hit 403/429 and retried with proxy
  • Blocked domains - Domains that consistently need proxies
Statistics are logged when spider closes:
📊 Proxy Statistics for 'spider_name':
   Direct requests: 1847
   Proxy requests: 153
   Blocked & retried: 153
   Blocked domains: 2
   Domains that needed proxy: example.com, protected-site.com

Proxy Providers

SmartProxyMiddleware works with any standard HTTP proxy provider, so use whichever service you prefer.

Residential Proxies

  • Use with --proxy-type residential flag on crawl command
  • Same smart strategy (direct first, proxy only when blocked)
  • Most providers offer both datacenter and residential proxies - configure RESIDENTIAL_PROXY_* vars in .env

Technical Details

Middleware Logic

  1. On first request to domain → try direct connection
  2. If response is 403/429 → mark domain as blocked, retry with proxy
  3. On subsequent requests to blocked domain → use proxy immediately
  4. Blocked domains remembered for spider lifetime
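The four steps above amount to a small per-domain state machine. The sketch below is a simplified illustration of that bookkeeping, not ScrapAI's actual middleware (which lives in middlewares.py as a Scrapy downloader middleware); the proxy URL is a placeholder:

```python
from urllib.parse import urlsplit

BLOCKED_STATUSES = {403, 429}

class SmartProxySketch:
    """Simplified model of the blocked-domain logic (not ScrapAI's real code)."""

    def __init__(self, proxy_url):
        self.proxy_url = proxy_url
        self.blocked_domains = set()  # remembered for the spider's lifetime

    def proxy_for(self, url):
        """Steps 1 & 3: proxy only domains already known to be blocked."""
        domain = urlsplit(url).hostname
        return self.proxy_url if domain in self.blocked_domains else None

    def on_response(self, url, status):
        """Step 2: a 403/429 marks the domain blocked; caller retries via proxy."""
        domain = urlsplit(url).hostname
        if status in BLOCKED_STATUSES:
            self.blocked_domains.add(domain)
            return True  # retry this request through the proxy
        return False

mw = SmartProxySketch("http://user:pass@proxy.example.com:8080")
print(mw.proxy_for("https://example.com/page"))   # None — first request goes direct
mw.on_response("https://example.com/page", 403)   # blocked → remember the domain
print(mw.proxy_for("https://example.com/other"))  # proxy URL — used immediately now
```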

Proxy URL Format

http://username:password@proxy.example.com:8080
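Assuming standard HTTP proxy authentication, this URL can be assembled from the four environment variables. One caveat worth encoding: usernames or passwords containing @ or : must be percent-encoded, or the URL becomes ambiguous. A small helper (hypothetical, not part of ScrapAI):

```python
from urllib.parse import quote

def build_proxy_url(username, password, host, port):
    """Assemble http://user:pass@host:port, percent-encoding the credentials."""
    return f"http://{quote(username, safe='')}:{quote(password, safe='')}@{host}:{port}"

print(build_proxy_url("user", "p@ss:word", "proxy.example.com", 8080))
# http://user:p%40ss%3Aword@proxy.example.com:8080
```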

Implementation

  • Location: middlewares.py
  • Class: SmartProxyMiddleware
  • Priority: 350 (in settings.py)
  • Type: Scrapy downloader middleware

No Spider Configuration Needed

SmartProxyMiddleware works automatically for all spiders once configured in .env. The middleware is enabled by default in settings.py with priority 350.

Troubleshooting

Proxy not being used

  1. Check .env has all 4 variables set (USERNAME, PASSWORD, HOST, PORT)
  2. Verify proxy credentials are correct
  3. Test proxy manually:
    curl -x http://user:pass@host:port https://httpbin.org/ip
    
  4. Check logs for “Datacenter proxy available” message on spider start
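If curl isn't available, the same manual test can be done from Python's standard library. The credentials below are placeholders — substitute your real values:

```python
import urllib.request

proxy_url = "http://user:pass@host:port"  # placeholder credentials
opener = urllib.request.build_opener(
    urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
)
# Uncomment for a live check — the response body shows the IP the target sees:
# print(opener.open("https://httpbin.org/ip", timeout=10).read().decode())
```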

Still getting blocked with proxy

  1. Check if proxy IP is already blocked by target site
  2. Try different proxy provider
  3. Add delays between requests (set DOWNLOAD_DELAY in spider config)
  4. Reduce concurrency (set CONCURRENT_REQUESTS in spider config)
  5. Switch to residential proxies:
    ./scrapai crawl spider_name --project proj --proxy-type residential
    

Proxy costs too high

SmartProxyMiddleware should already minimize costs by using direct connections first. If costs are still high:
  1. Check which domains are marked as blocked (in stats at spider close)
  2. Verify those domains actually need proxies
  3. Consider if site has changed and unblocking is possible
  4. Some sites may require proxies for all requests - this is expected
  5. Consider switching to cheaper proxy provider

Authentication failed

# Verify credentials
echo $DATACENTER_PROXY_USERNAME
echo $DATACENTER_PROXY_HOST

# Test proxy connection
curl -x http://$DATACENTER_PROXY_USERNAME:$DATACENTER_PROXY_PASSWORD@$DATACENTER_PROXY_HOST:$DATACENTER_PROXY_PORT https://httpbin.org/ip

Connection timeout

  1. Check proxy server is reachable:
    ping $DATACENTER_PROXY_HOST
    
  2. Verify firewall allows outbound connections
  3. Try different proxy port
  4. Contact proxy provider support

Best Practices

  1. Start with datacenter proxies
    • Cheaper and faster
    • Works for most sites
  2. Use auto mode
    • Minimizes costs automatically
    • Expert-in-the-loop prevents surprise charges
  3. Monitor statistics
    • Review blocked domains in logs
    • Adjust strategy based on patterns
  4. Test without proxies first
    • Many sites don’t require proxies
    • Save costs where possible
  5. Respect rate limits
    • Add delays between requests
    • Reduce concurrency if needed
    • Proxies don’t make aggressive scraping acceptable

When to Use Proxies

Recommend proxy setup when:
  • User asks about proxies or rate limiting
  • Spider is getting blocked (403/429 errors in logs)
  • User needs to scrape at scale (1000s of pages)
  • User mentions proxy provider (Bright Data, Oxylabs, Smartproxy, etc.)
  • Crawls are failing with “Access Denied” or “Too Many Requests”