Only enable Cloudflare bypass when the site explicitly requires it. Always test WITHOUT --browser first.
Detection Indicators
Your site needs Cloudflare bypass if you see:
- "Checking your browser" or "Just a moment" messages
- 403/503 HTTP errors with Cloudflare branding
- Challenge pages before content loads
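The indicators above can also be checked programmatically. A minimal heuristic sketch (a hypothetical helper for illustration, not part of scrapai) that flags a likely Cloudflare challenge from a response's status, headers, and body:

```python
# Heuristic check for a Cloudflare challenge response.
# Hypothetical helper for illustration -- not part of scrapai.

CHALLENGE_PHRASES = ("checking your browser", "just a moment")

def looks_like_cf_challenge(status: int, headers: dict, body: str) -> bool:
    """Return True if the response resembles a Cloudflare challenge."""
    server = headers.get("server", "").lower()
    text = body.lower()
    # 403/503 with Cloudflare branding in the Server header
    if status in (403, 503) and "cloudflare" in server:
        return True
    # Interstitial "Just a moment" / "Checking your browser" pages
    return any(phrase in text for phrase in CHALLENGE_PHRASES)
```

If this returns True during a plain-HTTP test run, that is the signal to enable bypass for the site.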
VPS/Cloud Server IP Reputation Issue: If running on AWS, DigitalOcean, Hetzner, or any other cloud provider, Cloudflare may block your server's IP even with browser bypass enabled. Cloud/datacenter IPs are often flagged as high-risk.

Solution: Combine --browser with residential proxies:

```bash
./scrapai crawl spider --project proj --proxy-type residential --browser
```
See Proxy Configuration for details.
Display Requirements
Cloudflare bypass requires a visible browser (not headless). Cloudflare detects and blocks headless browsers.
Platform support:
- Windows: Uses native display automatically ✓
- macOS: Uses native display automatically ✓
- Linux desktop: Uses native display automatically ✓
- Linux servers (VPS without GUI): Auto-detects missing display and uses Xvfb (virtual display) ✓
Installing Xvfb on Linux servers:
sudo apt-get install xvfb
The crawler automatically detects your environment and uses Xvfb when no display is available on Linux.
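The auto-detection described above can be approximated as: Windows and macOS always have a native display, and on Linux a display is considered present when an X11 or Wayland session variable is set. A simplified sketch of that decision (illustrative only, not scrapai's actual code):

```python
def needs_xvfb(platform: str, env: dict) -> bool:
    """Return True when a virtual display (Xvfb) is required.

    Windows and macOS always provide a native display; on Linux we
    check for an X11 (DISPLAY) or Wayland session.  Illustrative
    sketch only -- the crawler's real detection may differ.
    """
    if platform.startswith(("win", "darwin")):
        return False
    return not (env.get("DISPLAY") or env.get("WAYLAND_DISPLAY"))
```

On a headless VPS this returns True, which corresponds to the crawler launching Xvfb automatically.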
Inspector Usage
Start with default HTTP (fast)
Works for most sites:

```bash
./scrapai inspect https://example.com --project proj
```
Try browser mode if JS-rendered
For JavaScript-heavy sites:

```bash
./scrapai inspect https://example.com --project proj --browser
```
Use Cloudflare bypass only when blocked
For Cloudflare-protected sites:

```bash
./scrapai inspect https://example.com --project proj --browser
```
Smart resource usage: start with default HTTP (fast), escalate to --browser if content is JS-rendered, and enable Cloudflare bypass only when you see challenge pages or 403/503 errors.
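The escalation ladder above can be summed up as a small decision function (purely illustrative, not part of scrapai):

```python
def pick_fetch_mode(js_rendered: bool, cf_blocked: bool) -> str:
    """Choose the cheapest fetch mode that works.

    Escalation order: plain HTTP -> browser -> browser + CF bypass.
    Illustrative only; the flags mirror what the inspector reveals.
    """
    if cf_blocked:
        return "browser + cloudflare bypass"
    if js_rendered:
        return "browser"
    return "http"
```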
Strategies
Hybrid Mode (Recommended)
Browser verification runs once per 10 minutes; all other requests use fast HTTP with cached cookies. 20-100x faster than browser-only mode.
Do NOT set CONCURRENT_REQUESTS; hybrid mode uses the Scrapy default of 16 for optimal performance.
```json
{
  "settings": {
    "CLOUDFLARE_ENABLED": true,
    "CLOUDFLARE_STRATEGY": "hybrid",
    "CLOUDFLARE_COOKIE_REFRESH_THRESHOLD": 600,
    "CF_MAX_RETRIES": 5,
    "CF_RETRY_INTERVAL": 1,
    "CF_POST_DELAY": 5
  }
}
```
How it works:
- Browser verifies Cloudflare once and caches cookies
- Subsequent requests use fast HTTP with cached cookies
- Auto-refreshes cookies every 10 minutes
- Falls back to browser if cookies become invalid
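The lifecycle above amounts to a time-bounded cookie cache. A condensed sketch of the idea (hypothetical class and names; scrapai's internals may differ), where the threshold corresponds to CLOUDFLARE_COOKIE_REFRESH_THRESHOLD:

```python
import time

class CookieCache:
    """Cache CF clearance cookies, refreshing after a threshold.

    Illustrative sketch of hybrid mode: `refresh` stands in for the
    browser verification step; cached cookies back fast HTTP requests.
    """

    def __init__(self, refresh, threshold=600, clock=time.monotonic):
        self._refresh = refresh      # callable: runs browser verification
        self._threshold = threshold  # seconds before cookies go stale
        self._clock = clock
        self._cookies = None
        self._fetched_at = None

    def get(self):
        stale = (
            self._cookies is None
            or self._clock() - self._fetched_at >= self._threshold
        )
        if stale:
            self._cookies = self._refresh()  # browser verifies CF once
            self._fetched_at = self._clock()
        return self._cookies

    def invalidate(self):
        """Force a browser re-verification on the next request."""
        self._cookies = None
```

The `invalidate` path corresponds to the fallback when cached cookies stop working mid-crawl.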
Browser-Only Mode (Legacy)
Only use this if hybrid mode fails. It launches the browser for every request and is much slower. Requires CONCURRENT_REQUESTS: 1 to prevent browser conflicts.
```json
{
  "settings": {
    "CLOUDFLARE_ENABLED": true,
    "CLOUDFLARE_STRATEGY": "browser_only",
    "CONCURRENT_REQUESTS": 1
  }
}
```
Settings Reference
| Setting | Default | Description |
|---|---|---|
| CLOUDFLARE_ENABLED | false | Enable CF bypass |
| CLOUDFLARE_STRATEGY | "hybrid" | "hybrid" or "browser_only" |
| CLOUDFLARE_COOKIE_REFRESH_THRESHOLD | 600 | Seconds before cookie refresh |
| CF_MAX_RETRIES | 5 | Max verification attempts |
| CF_RETRY_INTERVAL | 1 | Seconds between retries |
| CF_POST_DELAY | 5 | Seconds after successful verification |
| CF_WAIT_SELECTOR | — | CSS selector to wait for before extracting |
| CF_WAIT_TIMEOUT | 10 | Max seconds to wait for selector |
| CF_PAGE_TIMEOUT | 120000 | Page navigation timeout (ms) |
| CONCURRENT_REQUESTS | — | Must be 1 for browser-only mode |
Complete Spider Example
```json
{
  "name": "mysite",
  "allowed_domains": ["example.com"],
  "start_urls": ["https://www.example.com/articles"],
  "rules": [
    {
      "allow": ["/article/[^/]+$"],
      "callback": "parse_article",
      "follow": false,
      "priority": 100
    },
    {
      "allow": ["/articles/"],
      "callback": null,
      "follow": true,
      "priority": 50
    }
  ],
  "settings": {
    "CLOUDFLARE_ENABLED": true,
    "CLOUDFLARE_STRATEGY": "hybrid",
    "CLOUDFLARE_COOKIE_REFRESH_THRESHOLD": 600,
    "CF_MAX_RETRIES": 5,
    "CF_RETRY_INTERVAL": 1,
    "CF_POST_DELAY": 5,
    "CF_WAIT_SELECTOR": "h1.title-med-1",
    "DOWNLOAD_DELAY": 2
  }
}
```
Timeouts & Hang Prevention
Browser operation timeout: 300 seconds (5 minutes) per operation to prevent infinite hangs.
If a browser operation exceeds 300 seconds, the crawl fails with a TimeoutError instead of hanging forever. This protects against:
- Browser subprocess hangs
- Network stalls
- Infinite CF challenge loops
- Cross-thread asyncio deadlocks
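The 300-second bound can be pictured as an asyncio.wait_for wrapper around each browser operation. A minimal sketch (a hypothetical wrapper, not scrapai's actual code), demonstrated with a deliberately short timeout so it runs quickly:

```python
import asyncio

BROWSER_OP_TIMEOUT = 300  # seconds per browser operation

async def run_browser_op(op, timeout=BROWSER_OP_TIMEOUT):
    """Run one browser coroutine, raising TimeoutError instead of hanging."""
    try:
        return await asyncio.wait_for(op(), timeout=timeout)
    except asyncio.TimeoutError:
        raise TimeoutError(f"browser operation exceeded {timeout}s") from None

async def _demo():
    async def quick():
        return "ok"

    async def hung():  # stands in for a stalled CF challenge loop
        await asyncio.sleep(3600)

    assert await run_browser_op(quick) == "ok"
    try:
        await run_browser_op(hung, timeout=0.01)
        return "no timeout"
    except TimeoutError:
        return "timed out"

result = asyncio.run(_demo())
```

The key property is that a hung subprocess surfaces as a clean TimeoutError rather than blocking the crawl indefinitely.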
Typical operation times:
- CF verification: 10-60 seconds
- Page load: 5-30 seconds
- Cookie refresh: 10-30 seconds
If you consistently hit the 300s timeout, investigate:
- Network connectivity issues
- Site blocking your IP/region
- Browser/Chrome subprocess problems
- System resource constraints (CPU/memory)
Troubleshooting
Crawl Hangs at "Getting/refreshing CF cookies"
Symptoms: Browser opens but never navigates. Logs show "Getting/refreshing CF cookies" but no progress.
Possible causes:
- Asyncio event loop mismatch (fixed in latest version)
- Browser subprocess issues - Chrome/nodriver incompatible with thread-based event loop
- Display/X11 issues on Linux servers
- Network/firewall blocking browser traffic
Solutions:
- Update to the latest version: ensure you're running the latest release with the timeout fix.
- Verify the browser opens: check that a browser window actually appears (not a headless failure).
- Check the display (Linux servers): verify Xvfb is installed: sudo apt-get install xvfb
- Test with the inspector first, using the --browser flag: ./scrapai inspect https://example.com --project proj --browser
- Check system resources: verify CPU, memory, and disk space availability.
Works on One Machine But Not Another
Environmental factors affecting browser subprocesses:
- Python/asyncio version differences
- Display environment (X11 vs Wayland vs headless)
- Chrome/Chromium version and availability
- System resources and timing (race conditions)
- Network conditions (DNS, latency, firewalls)
- Security software interfering with browser
Debugging steps:
- Test the inspector on both machines: ./scrapai inspect https://example.com --project proj --browser
- Check the Chrome installation.
- Verify the display (Linux): echo $DISPLAY (should show :99 with Xvfb)
- Review the logs for specific error messages.
- Try a different strategy: switch between hybrid and browser_only modes.
Diagnosing via Logs
Hybrid mode indicators:
Cached N cookies (cf_clearance: ...)
// Cookies working properly
Browser-only mode indicators:
Cloudflare verified successfully
Opened persistent browser
Closed browser
// Normal lifecycle
Title Contamination
If extracted titles show wrong text (e.g., "Related Articles" instead of the actual title), set CF_WAIT_SELECTOR to the main title element.
```json
{
  "settings": {
    "CF_WAIT_SELECTOR": "h1.article-title"
  }
}
```
This captures HTML before related content loads, preventing contamination.
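The effect of CF_WAIT_SELECTOR can be pictured as a poll-until-present loop before the HTML snapshot is taken. A generic sketch (hypothetical; the real crawler uses the browser's own wait API), with the timeout mirroring CF_WAIT_TIMEOUT:

```python
import time

def wait_for_selector(query, selector, timeout=10.0, interval=0.1,
                      clock=time.monotonic, sleep=time.sleep):
    """Poll `query(selector)` until it returns an element or time runs out.

    `query` stands in for a DOM lookup (e.g. the browser's querySelector).
    Returns the element, or raises TimeoutError when the deadline passes.
    Illustrative sketch only.
    """
    deadline = clock() + timeout
    while True:
        element = query(selector)
        if element is not None:
            return element
        if clock() >= deadline:
            raise TimeoutError(f"selector {selector!r} not found in {timeout}s")
        sleep(interval)
```

Extraction then runs only after the target element (e.g. `h1.article-title`) exists, so trailing widgets such as related-article blocks cannot leak into the title.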