ScrapAI includes SmartProxyMiddleware that intelligently manages proxy usage to avoid blocks while minimizing costs.

How It Works

Auto Mode Strategy (default):

1. Start with direct connections - fast and free, no proxy overhead
2. Detect blocking - automatic detection of 403/429 errors
3. Retry with datacenter proxy - cheap and fast fallback option
4. Learn domain patterns - remembers which domains need proxies
5. Proactive proxy usage - uses proxy immediately for known-blocked domains
6. Expert-in-the-loop - asks the user before using expensive residential proxies
Smart cost control:
  • Direct connections are faster and free
  • Proxies only used when necessary
  • Datacenter proxies preferred (cheaper)
  • Residential proxies require explicit user approval
  • Learns per-domain blocking patterns
  • Reduces proxy bandwidth costs by 80-90%
  • No surprise costs - expensive proxies need human approval
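The escalation strategy above can be sketched as a small decision helper. This is a hypothetical illustration of the logic, not the actual SmartProxyMiddleware source; all names here are invented:

```python
class ProxyEscalation:
    """Sketch of auto-mode escalation: direct first, datacenter proxy
    only after a block, learned per-domain for the rest of the crawl."""

    BLOCK_CODES = {403, 429}

    def __init__(self):
        self.blocked_domains = set()  # domains learned to need a proxy

    def proxy_for(self, domain):
        # Known-blocked domains go through the datacenter proxy immediately;
        # everything else starts with a free direct connection.
        return "datacenter" if domain in self.blocked_domains else None

    def record_response(self, domain, status):
        # A 403/429 marks the domain as blocked and triggers a proxied retry.
        if status in self.BLOCK_CODES:
            self.blocked_domains.add(domain)
            return "retry_with_proxy"
        return "ok"
```

The residential tier is deliberately absent here: per the expert-in-the-loop design, escalation beyond datacenter requires a human decision rather than automatic code.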

Setup

Add proxy credentials to .env:
# Datacenter Proxy (default - used with --proxy-type datacenter or no flag)
DATACENTER_PROXY_USERNAME=your_username
DATACENTER_PROXY_PASSWORD=your_password
DATACENTER_PROXY_HOST=dc.yourproxy.com
DATACENTER_PROXY_PORT=10000
Proxy Configuration: Check your proxy provider’s documentation for:
  • Correct hostname and port for datacenter vs residential proxies
  • Rotating vs sticky IP options
  • Authentication requirements
Since SmartProxyMiddleware uses a single proxy connection, use rotating IP ports when available for better IP distribution.
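Assuming the middleware combines these four variables into a standard `http://user:pass@host:port` proxy URL, a hypothetical loader might look like the following (the function name is invented for illustration):

```python
import os

def datacenter_proxy_url():
    """Build a proxy URL from the .env variables above.
    Returns None if any variable is missing (proxy disabled)."""
    user = os.environ.get("DATACENTER_PROXY_USERNAME")
    password = os.environ.get("DATACENTER_PROXY_PASSWORD")
    host = os.environ.get("DATACENTER_PROXY_HOST")
    port = os.environ.get("DATACENTER_PROXY_PORT")
    if not all([user, password, host, port]):
        return None
    return f"http://{user}:{password}@{host}:{port}"
```

Note that credentials containing special characters (`@`, `:`) would need URL-encoding before being embedded like this.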

Usage

Auto Mode (Default) - Expert-in-the-Loop

# Auto mode (default) - smart escalation
./scrapai crawl spider_name --project proj --limit 10
Auto mode flow:

1. Start with direct connections - ✅ fast, free connections, no proxy overhead
2. Detect block (403/429) - ✅ automatically retry with datacenter proxy (cheap, fast)
3. Datacenter fails - ⚠️ expert-in-the-loop prompt appears:

   ⚠️  EXPERT-IN-THE-LOOP: Datacenter proxy failed for some domains
   🏠 Residential proxy is available but may incur HIGHER COSTS

   Blocked domains: example.com, site.org

   To proceed with residential proxy, run:
     ./scrapai crawl spider_name --project proj --proxy-type residential

4. User decides - 👤 you choose whether to use expensive residential proxies

Cost protection: Residential proxies require explicit user approval - no surprise costs!

Explicit Proxy Modes

# Force datacenter proxy only (even if residential configured)
./scrapai crawl spider_name --project proj --limit 10 --proxy-type datacenter
All modes follow smart strategy:
  • ✅ Start with direct connections (fast, free)
  • ✅ Only use proxy when blocked (403/429 errors)
  • ✅ Learn which domains need proxies
  • ✅ Use proxy proactively for blocked domains
The --proxy-type flag controls escalation behavior and cost limits.

Configuration

No spider configuration needed! SmartProxyMiddleware works automatically for all spiders once configured in .env.
The middleware is enabled by default in settings.py with priority 350.

Statistics Tracking

SmartProxyMiddleware tracks detailed usage statistics:
  • Direct requests - Connections without proxy
  • Proxy requests - Connections using proxy
  • Blocked retries - Requests that hit 403/429 and retried with proxy
  • Blocked domains - Domains that consistently need proxies
Statistics logged when spider closes:
📊 Proxy Statistics for 'spider_name':
   Direct requests: 1847
   Proxy requests: 153
   Blocked & retried: 153
   Blocked domains: 2
   Domains that needed proxy: example.com, protected-site.com
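The counters behind this report could be modeled roughly as follows. This is a hypothetical `ProxyStats` helper sketching how the four metrics relate, not the real implementation:

```python
from collections import Counter

class ProxyStats:
    """Sketch of per-spider proxy usage counters."""

    def __init__(self):
        self.counts = Counter()
        self.blocked_domains = set()

    def record(self, domain, used_proxy, was_retry=False):
        # Every request is either direct or proxied; retries after a
        # 403/429 are additionally counted and attributed to the domain.
        self.counts["proxy" if used_proxy else "direct"] += 1
        if was_retry:
            self.counts["blocked_retries"] += 1
            self.blocked_domains.add(domain)

    def summary(self):
        return (
            f"Direct requests: {self.counts['direct']}\n"
            f"Proxy requests: {self.counts['proxy']}\n"
            f"Blocked & retried: {self.counts['blocked_retries']}\n"
            f"Blocked domains: {len(self.blocked_domains)}"
        )
```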

Implementation Details

Middleware Logic (from source: middlewares.py):

1. First request to domain - try direct connection (no proxy)
2. If response is 403/429 - mark domain as blocked, retry with proxy
3. Subsequent requests to blocked domain - use proxy immediately (learned behavior)
4. Domain patterns remembered - blocked domains tracked for spider lifetime
Proxy URL Format:
http://username:password@proxy.example.com:8080
Implementation location:
  • File: middlewares.py:16-277
  • Class: SmartProxyMiddleware
  • Priority: 350 (configured in settings.py)
  • Type: Scrapy downloader middleware
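The four steps above can be sketched as a Scrapy-style downloader middleware. This is an illustrative reconstruction under assumed names, not the code in middlewares.py; it uses only the standard `request.meta["proxy"]` convention Scrapy's proxy support relies on:

```python
from urllib.parse import urlparse

class SmartProxySketch:
    """Illustrative downloader-middleware logic: direct first,
    mark domain on 403/429, proxy immediately thereafter."""

    BLOCK_CODES = (403, 429)

    def __init__(self, proxy_url):
        self.proxy_url = proxy_url        # http://user:pass@host:port
        self.blocked_domains = set()

    def process_request(self, request, spider=None):
        domain = urlparse(request.url).netloc
        if domain in self.blocked_domains:
            # Learned behavior: known-blocked domains get the proxy up front.
            request.meta["proxy"] = self.proxy_url
        return None  # let the request continue through the chain

    def process_response(self, request, response, spider=None):
        if response.status in self.BLOCK_CODES and "proxy" not in request.meta:
            # First block on a direct request: remember the domain and
            # re-issue the same request through the proxy.
            self.blocked_domains.add(urlparse(request.url).netloc)
            retry = request.copy()
            retry.meta["proxy"] = self.proxy_url
            # (In real Scrapy you would also create the retry with
            # dont_filter=True so the duplicate filter does not drop it.)
            return retry
        return response
```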

Proxy Providers

SmartProxyMiddleware works with any HTTP proxy provider. Use whatever proxy service you prefer; popular options include Bright Data, Oxylabs, IPRoyal, Smartproxy, and others.

Residential Proxies

For sites that block datacenter IPs:
Use the --proxy-type residential flag on the crawl command. The same smart strategy applies (direct first, proxy only when blocked). Configure RESIDENTIAL_PROXY_* vars in .env with your provider’s residential proxy credentials.
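Mirroring the datacenter block above, a residential entry in .env would look like this (the exact variable names are assumed to parallel the DATACENTER_PROXY_* set; check your provider for the residential hostname and port):

# Residential Proxy (used with --proxy-type residential)
RESIDENTIAL_PROXY_USERNAME=your_username
RESIDENTIAL_PROXY_PASSWORD=your_password
RESIDENTIAL_PROXY_HOST=res.yourproxy.com
RESIDENTIAL_PROXY_PORT=20000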

Troubleshooting

Proxy Not Being Used

1. Check .env configuration - verify all 4 variables are set (USERNAME, PASSWORD, HOST, PORT)
2. Verify credentials - ensure the proxy credentials are correct
3. Test the proxy manually:
   curl -x http://user:pass@host:port https://httpbin.org/ip
4. Check logs - look for the “Datacenter proxy available” message on spider start

Still Getting Blocked with Proxy

1. Check if the proxy IP is blocked - the proxy IP may already be blocked by the target site
2. Try a different provider - switch to an alternative proxy provider
3. Add delays - increase DOWNLOAD_DELAY in the spider config:
   {"settings": {"DOWNLOAD_DELAY": 2}}
4. Reduce concurrency - lower CONCURRENT_REQUESTS in the spider config:
   {"settings": {"CONCURRENT_REQUESTS": 8}}

Proxy Costs Too High

SmartProxyMiddleware should already minimize costs by using direct connections first.
If costs are still high:
  1. Check which domains are marked as blocked (in stats at spider close)
  2. Verify those domains actually need proxies
  3. Consider if site has changed and unblocking is possible
  4. Some sites may require proxies for all requests - this is expected

When to Mention to Users

Recommend proxy setup when:
  • User asks about proxies or rate limiting
  • Spider is getting blocked (403/429 errors in logs)
  • User needs to scrape at scale (1000s of pages)
  • User mentions proxy provider (Bright Data, Oxylabs, Smartproxy, etc.)
  • Crawls are failing with “Access Denied” or “Too Many Requests”