
Overview

ScrapAI includes SmartProxyMiddleware, which intelligently manages proxy usage to avoid blocks while minimizing costs. The middleware automatically detects when proxies are needed and escalates from direct connections to datacenter proxies, and finally to residential proxies.

How It Works

Auto Mode Strategy (Default)

  1. Start with direct connections (fast, free)
  2. Detect blocking (403/429 errors)
  3. Automatically retry with datacenter proxy (cheap, fast)
  4. Learn which domains need proxies
  5. Use proxy proactively for known-blocked domains
  6. Expert-in-the-loop if datacenter fails → ask user before using expensive residential

Smart Cost Control

  • Direct connections are faster and free
  • Proxies only used when necessary
  • Datacenter proxies preferred (cheaper)
  • Residential proxies require explicit user approval
  • Learns per-domain blocking patterns
  • Reduces proxy bandwidth costs by 80-90% compared with routing every request through a proxy
  • No surprise costs - expensive proxies need human approval

Proxy Types

Datacenter Proxies

Best for:
  • Most websites
  • High-speed scraping
  • Cost-effective scaling
  • General use cases
Advantages:
  • Fast (low latency)
  • Cheap (< $1/GB typical)
  • High bandwidth
  • Reliable
Limitations:
  • Some sites block datacenter IPs
  • Easier to detect

Residential Proxies

Use only when needed:
  • Sites that block datacenter IPs
  • Requires explicit --proxy-type residential flag
  • Higher cost ($3-15/GB typical)
Advantages:
  • Real residential IPs
  • Harder to block
  • Better for strict sites
Limitations:
  • Expensive
  • Slower (higher latency)
  • Lower bandwidth

⚠️ Running on VPS/Cloud Servers

Remote servers (AWS, DigitalOcean, Hetzner, etc.) often have poor IP reputation! Anti-bot services like Cloudflare, DataDome, and PerimeterX track datacenter IP ranges and may block requests from:
  • AWS EC2, Lambda
  • DigitalOcean Droplets
  • Google Cloud Compute
  • Azure VMs
  • Hetzner, OVH, Vultr
  • Any VPS/cloud provider
Symptoms:
  • Getting blocked even without high request rates
  • 403/429 errors immediately on first request
  • Cloudflare challenges that never resolve
  • “Access Denied” messages
Solution: Use residential proxies with --proxy-type residential flag. Residential IPs have clean reputation and bypass IP-based blocking.
Why this happens:
  • Datacenter IPs are flagged in threat intelligence databases
  • Many bots, scrapers, and malicious traffic originate from cloud providers
  • Anti-bot services preemptively block entire datacenter IP ranges
  • Your server IP may be on blocklists even if YOU never abused it
When you MUST use residential proxies:
  • Running ScrapAI on any cloud/VPS server
  • Scraping sites with strong anti-bot protection
  • Getting blocked despite using datacenter proxies
  • Target site explicitly blocks datacenter IPs
Local development vs Production:
  • Local (home ISP): Datacenter proxies usually work fine
  • VPS/Cloud servers: Residential proxies often required
  • Test locally first, then switch to residential when deploying to cloud

Configuration

Setup Datacenter Proxy

Add credentials to .env:
# Datacenter Proxy (default - used automatically)
DATACENTER_PROXY_USERNAME=your_username
DATACENTER_PROXY_PASSWORD=your_password
DATACENTER_PROXY_HOST=proxy.provider.com
DATACENTER_PROXY_PORT=8080  # Check your provider's documentation

Setup Residential Proxy

Add credentials to .env:
# Residential Proxy (used with --proxy-type residential flag)
RESIDENTIAL_PROXY_USERNAME=your_username
RESIDENTIAL_PROXY_PASSWORD=your_password
RESIDENTIAL_PROXY_HOST=residential.provider.com
RESIDENTIAL_PROXY_PORT=8080  # Check your provider's documentation

Environment Variables

DATACENTER_PROXY_USERNAME
string
Username for datacenter proxy authentication.
DATACENTER_PROXY_PASSWORD
string
Password for datacenter proxy authentication.
DATACENTER_PROXY_HOST
string
Datacenter proxy server hostname. Example: proxy.provider.com
DATACENTER_PROXY_PORT
number
Datacenter proxy server port. Example: 8080 (check your provider’s documentation for the correct port)
RESIDENTIAL_PROXY_USERNAME
string
Username for residential proxy authentication.
RESIDENTIAL_PROXY_PASSWORD
string
Password for residential proxy authentication.
RESIDENTIAL_PROXY_HOST
string
Residential proxy server hostname. Example: residential.provider.com
RESIDENTIAL_PROXY_PORT
number
Residential proxy server port. Example: 8080 (check your provider’s documentation for the correct port)
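With all eight variables defined, a quick sanity check that a tier is fully configured can look like this — a minimal sketch (not part of ScrapAI) that assumes the .env file has already been loaded into the process environment:

```python
import os

def proxy_config(prefix):
    """Return the four credentials for a proxy tier, or None if any is missing."""
    keys = ("USERNAME", "PASSWORD", "HOST", "PORT")
    values = {k: os.environ.get(f"{prefix}_PROXY_{k}") for k in keys}
    return values if all(values.values()) else None

# Simulate what loading .env would do:
os.environ.update({
    "DATACENTER_PROXY_USERNAME": "your_username",
    "DATACENTER_PROXY_PASSWORD": "your_password",
    "DATACENTER_PROXY_HOST": "proxy.provider.com",
    "DATACENTER_PROXY_PORT": "8080",
})

print(proxy_config("DATACENTER") is not None)  # True — all four variables set
```

A partially configured tier (for example, a missing port) returns None, which matches the middleware's behavior of treating an incomplete tier as unavailable.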

Example Configuration

Datacenter Proxies

# Get credentials from your proxy provider
DATACENTER_PROXY_USERNAME=your_username
DATACENTER_PROXY_PASSWORD=your_password
DATACENTER_PROXY_HOST=proxy.provider.com
DATACENTER_PROXY_PORT=8080  # Check your provider's documentation

Residential Proxies

# Get credentials from your proxy provider
RESIDENTIAL_PROXY_USERNAME=your_username
RESIDENTIAL_PROXY_PASSWORD=your_password
RESIDENTIAL_PROXY_HOST=residential.provider.com
RESIDENTIAL_PROXY_PORT=8080  # Check your provider's documentation
Check your proxy provider’s documentation for the correct host, port, and whether they offer rotating vs sticky IPs. ScrapAI’s SmartProxyMiddleware works with any standard HTTP proxy.

Usage

Auto Mode (Default)

Smart escalation with cost control:
# Auto mode (default) - smart escalation
./scrapai crawl spider_name --project proj --limit 10

# Explicit auto mode
./scrapai crawl spider_name --project proj --limit 10 --proxy-type auto
How auto mode works:
  1. ✅ Start with direct connections (fast, free)
  2. ✅ On block (403/429) → Try datacenter proxy (cheap, fast)
  3. ⚠️ Datacenter fails → Expert-in-the-loop prompt:
    ⚠️  EXPERT-IN-THE-LOOP: Datacenter proxy failed for some domains
    🏠 Residential proxy is available but may incur HIGHER COSTS
    
    Blocked domains: example.com, site.org
    
    To proceed with residential proxy, run:
      ./scrapai crawl spider_name --project proj --proxy-type residential
    
  4. 👤 User decides whether to use expensive residential proxies
Cost protection: Residential proxies require explicit user approval - no surprise costs!

Datacenter Only

Force datacenter proxy only (even if residential configured):
./scrapai crawl spider_name --project proj --limit 10 --proxy-type datacenter

Residential Only

Force residential proxy (explicit approval given):
./scrapai crawl spider_name --project proj --limit 10 --proxy-type residential
All modes follow the same smart strategy:
  • ✅ Start with direct connections (fast, free)
  • ✅ Only use proxy when blocked (403/429 errors)
  • ✅ Learn which domains need proxies
  • ✅ Use proxy proactively for blocked domains
The --proxy-type flag controls escalation behavior and cost limits.
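One way to picture the flag is as a ceiling on the escalation ladder. The sketch below is illustrative only — the tier names mirror the flag values, but this is not ScrapAI's internal code:

```python
def allowed_tiers(proxy_type):
    """Escalation ladder permitted by --proxy-type: every mode starts direct."""
    if proxy_type == "auto":
        # Full ladder; the residential step still needs explicit user approval.
        return ["direct", "datacenter", "residential"]
    # Explicit type: go direct first, escalate only to the named tier.
    return ["direct", proxy_type]

print(allowed_tiers("datacenter"))  # ['direct', 'datacenter']
```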

Statistics Tracking

SmartProxyMiddleware tracks proxy usage:
  • Direct requests - Connections without proxy
  • Proxy requests - Connections using proxy
  • Blocked retries - Requests that hit 403/429 and retried with proxy
  • Blocked domains - Domains that consistently need proxies
Statistics are logged when spider closes:
📊 Proxy Statistics for 'spider_name':
   Direct requests: 1847
   Proxy requests: 153
   Blocked & retried: 153
   Blocked domains: 2
   Domains that needed proxy: example.com, protected-site.com

Proxy Providers

SmartProxyMiddleware works with any standard HTTP proxy provider, so use whichever service you prefer.

Residential Proxies

  • Use with --proxy-type residential flag on crawl command
  • Same smart strategy (direct first, proxy only when blocked)
  • Most providers offer both datacenter and residential proxies - configure RESIDENTIAL_PROXY_* vars in .env

Technical Details

Middleware Logic

  1. On first request to domain → try direct connection
  2. If response is 403/429 → mark domain as blocked, retry with proxy
  3. On subsequent requests to blocked domain → use proxy immediately
  4. Blocked domains remembered for spider lifetime
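The four steps above amount to a small per-domain state machine. The sketch below is a simplified illustration of that bookkeeping, not ScrapAI's actual middleware (which lives in middlewares.py as a Scrapy downloader middleware); the proxy URL is a placeholder:

```python
from urllib.parse import urlsplit

BLOCKED_STATUSES = {403, 429}

class SmartProxySketch:
    """Simplified model of the blocked-domain logic (not ScrapAI's real code)."""

    def __init__(self, proxy_url):
        self.proxy_url = proxy_url
        self.blocked_domains = set()  # remembered for the spider's lifetime

    def proxy_for(self, url):
        """Steps 1 & 3: proxy only domains already known to be blocked."""
        domain = urlsplit(url).hostname
        return self.proxy_url if domain in self.blocked_domains else None

    def on_response(self, url, status):
        """Step 2: a 403/429 marks the domain blocked; caller retries via proxy."""
        domain = urlsplit(url).hostname
        if status in BLOCKED_STATUSES:
            self.blocked_domains.add(domain)
            return True  # retry this request through the proxy
        return False

mw = SmartProxySketch("http://user:pass@proxy.example.com:8080")
print(mw.proxy_for("https://example.com/page"))   # None — first request goes direct
mw.on_response("https://example.com/page", 403)   # blocked → remember the domain
print(mw.proxy_for("https://example.com/other"))  # proxy URL — used immediately now
```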

Proxy URL Format

http://username:password@proxy.example.com:8080
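Assuming standard HTTP proxy authentication, this URL can be assembled from the four environment variables. One caveat worth encoding: usernames or passwords containing @ or : must be percent-encoded, or the URL becomes ambiguous. A small helper (hypothetical, not part of ScrapAI):

```python
from urllib.parse import quote

def build_proxy_url(username, password, host, port):
    """Assemble http://user:pass@host:port, percent-encoding the credentials."""
    return f"http://{quote(username, safe='')}:{quote(password, safe='')}@{host}:{port}"

print(build_proxy_url("user", "p@ss:word", "proxy.example.com", 8080))
# http://user:p%40ss%3Aword@proxy.example.com:8080
```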

Implementation

  • Location: middlewares.py
  • Class: SmartProxyMiddleware
  • Priority: 350 (in settings.py)
  • Type: Scrapy downloader middleware

No Spider Configuration Needed

SmartProxyMiddleware works automatically for all spiders once configured in .env. The middleware is enabled by default in settings.py with priority 350.

Troubleshooting

Proxy not being used

  1. Check .env has all 4 variables set (USERNAME, PASSWORD, HOST, PORT)
  2. Verify proxy credentials are correct
  3. Test proxy manually:
    curl -x http://user:pass@host:port https://httpbin.org/ip
    
  4. Check logs for “Datacenter proxy available” message on spider start
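If curl isn't available, the same manual test can be done from Python's standard library. The credentials below are placeholders — substitute your real values:

```python
import urllib.request

proxy_url = "http://user:pass@host:port"  # placeholder credentials
opener = urllib.request.build_opener(
    urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
)
# Uncomment for a live check — the response body shows the IP the target sees:
# print(opener.open("https://httpbin.org/ip", timeout=10).read().decode())
```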

Still getting blocked with proxy

  1. Check if proxy IP is already blocked by target site
  2. Try different proxy provider
  3. Add delays between requests (set DOWNLOAD_DELAY in spider config)
  4. Reduce concurrency (set CONCURRENT_REQUESTS in spider config)
  5. Switch to residential proxies:
    ./scrapai crawl spider_name --project proj --proxy-type residential
    

Proxy costs too high

SmartProxyMiddleware should already minimize costs by using direct connections first. If costs are still high:
  1. Check which domains are marked as blocked (in stats at spider close)
  2. Verify those domains actually need proxies
  3. Consider if site has changed and unblocking is possible
  4. Some sites may require proxies for all requests - this is expected
  5. Consider switching to cheaper proxy provider

Authentication failed

# Verify credentials
echo $DATACENTER_PROXY_USERNAME
echo $DATACENTER_PROXY_HOST

# Test proxy connection
curl -x http://$DATACENTER_PROXY_USERNAME:$DATACENTER_PROXY_PASSWORD@$DATACENTER_PROXY_HOST:$DATACENTER_PROXY_PORT https://httpbin.org/ip

Connection timeout

  1. Check proxy server is reachable:
    ping $DATACENTER_PROXY_HOST
    
  2. Verify firewall allows outbound connections
  3. Try different proxy port
  4. Contact proxy provider support

Best Practices

  1. Start with datacenter proxies
    • Cheaper and faster
    • Works for most sites
  2. Use auto mode
    • Minimizes costs automatically
    • Expert-in-the-loop prevents surprise charges
  3. Monitor statistics
    • Review blocked domains in logs
    • Adjust strategy based on patterns
  4. Test without proxies first
    • Many sites don’t require proxies
    • Save costs where possible
  5. Respect rate limits
    • Add delays between requests
    • Reduce concurrency if needed
    • Proxies don’t make aggressive scraping acceptable

When to Use Proxies

Recommend proxy setup when:
  • User asks about proxies or rate limiting
  • Spider is getting blocked (403/429 errors in logs)
  • User needs to scrape at scale (1000s of pages)
  • User mentions proxy provider (Bright Data, Oxylabs, Smartproxy, etc.)
  • Crawls are failing with “Access Denied” or “Too Many Requests”