Only enable Cloudflare bypass when the site explicitly requires it. Always test WITHOUT --browser first.

Detection Indicators

Your site needs Cloudflare bypass if you see:
  • “Checking your browser” or “Just a moment” messages
  • 403/503 HTTP errors with Cloudflare branding
  • Challenge pages before content loads
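These indicators can be checked programmatically before deciding to escalate. A minimal sketch; the function name and heuristics are illustrative, not part of scrapai:

```python
def needs_cf_bypass(status: int, body: str, headers: dict) -> bool:
    """Heuristic version of the indicators above: challenge-page text,
    or a 403/503 response with Cloudflare branding in the headers."""
    challenge_phrases = ("checking your browser", "just a moment")
    if any(p in body.lower() for p in challenge_phrases):
        return True
    if status in (403, 503) and "cloudflare" in headers.get("server", "").lower():
        return True
    return False
```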
VPS/Cloud Server IP Reputation Issue: If running on AWS, DigitalOcean, Hetzner, or any cloud provider, Cloudflare may block your server’s IP even with browser bypass enabled. Cloud/datacenter IPs are often flagged as high-risk.

Solution: Combine --browser with residential proxies:
./scrapai crawl spider --project proj --proxy-type residential --browser
See Proxy Configuration for details.

Display Requirements

Cloudflare bypass requires a visible browser (not headless). Cloudflare detects and blocks headless browsers.
Platform support:
  • Windows: Uses native display automatically ✓
  • macOS: Uses native display automatically ✓
  • Linux desktop: Uses native display automatically ✓
  • Linux servers (VPS without GUI): Auto-detects missing display and uses Xvfb (virtual display) ✓
Installing Xvfb on Linux servers:
sudo apt-get install xvfb
The crawler automatically detects your environment and uses Xvfb when no display is available on Linux.
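The auto-detection described above can be sketched roughly as follows. `pick_display_backend` is a hypothetical name, not the crawler's actual API; parameters default to the real environment but can be passed explicitly:

```python
import os
import platform
import shutil

def pick_display_backend(system=None, display=None, have_xvfb=None):
    """Illustrative sketch of the display auto-detection described above."""
    system = platform.system() if system is None else system
    display = os.environ.get("DISPLAY", "") if display is None else display
    have_xvfb = (shutil.which("Xvfb") is not None) if have_xvfb is None else have_xvfb

    if system in ("Windows", "Darwin"):
        return "native"   # Windows and macOS always have a display
    if display:
        return "native"   # Linux desktop with a running X11/Wayland session
    if have_xvfb:
        return "xvfb"     # headless Linux server: start a virtual display
    raise RuntimeError("No display found and Xvfb is not installed; "
                       "run: sudo apt-get install xvfb")
```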

Inspector Usage

1. Start with default HTTP (fast)

   Works for most sites:

   ./scrapai inspect https://example.com --project proj

2. Try browser mode if JS-rendered

   For JavaScript-heavy sites:

   ./scrapai inspect https://example.com --project proj --browser

3. Use Cloudflare bypass only when blocked

   For Cloudflare-protected sites:

   ./scrapai inspect https://example.com --project proj --browser
Smart resource usage: Start with the default HTTP mode (fast). Escalate to --browser if content is JS-rendered. Enable Cloudflare bypass only when you see challenge pages or 403/503 errors.

Strategies

Hybrid Mode (Recommended)

Browser verification runs once per 10 minutes; all other requests use fast HTTP with cached cookies. 20-100x faster than browser-only mode.
Do NOT set CONCURRENT_REQUESTS in hybrid mode; the crawler uses the Scrapy default of 16 for optimal performance.
spider.json
{
  "settings": {
    "CLOUDFLARE_ENABLED": true,
    "CLOUDFLARE_STRATEGY": "hybrid",
    "CLOUDFLARE_COOKIE_REFRESH_THRESHOLD": 600,
    "CF_MAX_RETRIES": 5,
    "CF_RETRY_INTERVAL": 1,
    "CF_POST_DELAY": 5
  }
}
How it works:
  1. Browser verifies Cloudflare once and caches cookies
  2. Subsequent requests use fast HTTP with cached cookies
  3. Auto-refreshes cookies every 10 minutes
  4. Falls back to browser if cookies become invalid
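The cookie lifecycle above can be sketched as a small cache keyed on a refresh threshold. This is an illustration of the strategy, not the crawler's internals; names are hypothetical:

```python
import time

class CookieCache:
    """Minimal sketch of the hybrid strategy: verify once in a browser,
    reuse cookies over fast HTTP, refresh after a threshold."""

    def __init__(self, refresh_threshold=600, verify=None):
        self.refresh_threshold = refresh_threshold  # CLOUDFLARE_COOKIE_REFRESH_THRESHOLD
        self.verify = verify        # callable that runs the browser verification
        self.cookies = None
        self.fetched_at = 0.0

    def get(self, now=None):
        now = time.time() if now is None else now
        if self.cookies is None or now - self.fetched_at > self.refresh_threshold:
            self.cookies = self.verify()   # steps 1 and 3: browser verifies, cookies cached
            self.fetched_at = now
        return self.cookies                # step 2: fast HTTP reuses these

    def invalidate(self):
        self.cookies = None                # step 4: force a browser fallback
```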

Browser-Only Mode (Legacy)

Only use browser-only mode if hybrid mode fails. It launches the browser for every request and is much slower. Requires CONCURRENT_REQUESTS: 1 to prevent browser conflicts.
spider.json
{
  "settings": {
    "CLOUDFLARE_ENABLED": true,
    "CLOUDFLARE_STRATEGY": "browser_only",
    "CONCURRENT_REQUESTS": 1
  }
}

Settings Reference

| Setting | Default | Description |
|---|---|---|
| CLOUDFLARE_ENABLED | false | Enable CF bypass |
| CLOUDFLARE_STRATEGY | "hybrid" | "hybrid" or "browser_only" |
| CLOUDFLARE_COOKIE_REFRESH_THRESHOLD | 600 | Seconds before cookie refresh |
| CF_MAX_RETRIES | 5 | Max verification attempts |
| CF_RETRY_INTERVAL | 1 | Seconds between retries |
| CF_POST_DELAY | 5 | Seconds after successful verification |
| CF_WAIT_SELECTOR | (none) | CSS selector to wait for before extracting |
| CF_WAIT_TIMEOUT | 10 | Max seconds to wait for selector |
| CF_PAGE_TIMEOUT | 120000 | Page navigation timeout (ms) |
| CONCURRENT_REQUESTS | (unset) | Must be 1 for browser-only mode |
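One way to read this table is as a set of defaults that spider.json settings override. A sketch of that merge, with the browser-only constraint enforced; this is an illustration, not the crawler's actual code:

```python
CF_DEFAULTS = {
    "CLOUDFLARE_ENABLED": False,
    "CLOUDFLARE_STRATEGY": "hybrid",
    "CLOUDFLARE_COOKIE_REFRESH_THRESHOLD": 600,
    "CF_MAX_RETRIES": 5,
    "CF_RETRY_INTERVAL": 1,
    "CF_POST_DELAY": 5,
    "CF_WAIT_SELECTOR": None,
    "CF_WAIT_TIMEOUT": 10,
    "CF_PAGE_TIMEOUT": 120000,
}

def resolve_settings(spider_settings: dict) -> dict:
    """Merge spider.json settings over the defaults from the table above."""
    merged = {**CF_DEFAULTS, **spider_settings}
    if (merged["CLOUDFLARE_STRATEGY"] == "browser_only"
            and merged.get("CONCURRENT_REQUESTS") != 1):
        raise ValueError("browser_only mode requires CONCURRENT_REQUESTS: 1")
    return merged
```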

Complete Spider Example

spider.json
{
  "name": "mysite",
  "allowed_domains": ["example.com"],
  "start_urls": ["https://www.example.com/articles"],
  "rules": [
    {
      "allow": ["/article/[^/]+$"],
      "callback": "parse_article",
      "follow": false,
      "priority": 100
    },
    {
      "allow": ["/articles/"],
      "callback": null,
      "follow": true,
      "priority": 50
    }
  ],
  "settings": {
    "CLOUDFLARE_ENABLED": true,
    "CLOUDFLARE_STRATEGY": "hybrid",
    "CLOUDFLARE_COOKIE_REFRESH_THRESHOLD": 600,
    "CF_MAX_RETRIES": 5,
    "CF_RETRY_INTERVAL": 1,
    "CF_POST_DELAY": 5,
    "CF_WAIT_SELECTOR": "h1.title-med-1",
    "DOWNLOAD_DELAY": 2
  }
}
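The rules in this example follow CrawlSpider-style semantics: each URL is tested against the allow patterns and the highest-priority matching rule wins. A sketch of that matching, assuming regex patterns and a numeric priority as shown above (not the tool's exact implementation):

```python
import re

def match_rule(url_path: str, rules: list):
    """Pick the highest-priority rule whose allow pattern matches the path."""
    hits = [r for r in rules
            if any(re.search(p, url_path) for p in r["allow"])]
    return max(hits, key=lambda r: r["priority"]) if hits else None

rules = [
    {"allow": ["/article/[^/]+$"], "callback": "parse_article",
     "follow": False, "priority": 100},
    {"allow": ["/articles/"], "callback": None,
     "follow": True, "priority": 50},
]
```

With these rules, article detail pages hit the extraction callback while listing pages are only followed for links.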

Timeouts & Hang Prevention

Browser operation timeout: 300 seconds (5 minutes) per operation to prevent infinite hangs.
If a browser operation exceeds 300 seconds, the crawl fails with a TimeoutError instead of hanging forever. This protects against:
  • Browser subprocess hangs
  • Network stalls
  • Infinite CF challenge loops
  • Cross-thread asyncio deadlocks
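The fail-fast behavior amounts to wrapping each browser operation in a hard deadline. A minimal sketch using asyncio; the wrapper name is illustrative, not scrapai's code:

```python
import asyncio

BROWSER_OP_TIMEOUT = 300  # seconds (5 minutes), as described above

async def run_browser_op(coro, timeout=BROWSER_OP_TIMEOUT):
    """Run a browser operation with a hard deadline so the crawl
    fails with TimeoutError instead of hanging forever."""
    return await asyncio.wait_for(coro, timeout=timeout)
```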
Typical operation times:
  • CF verification: 10-60 seconds
  • Page load: 5-30 seconds
  • Cookie refresh: 10-30 seconds
If you consistently hit the 300s timeout, investigate:
  • Network connectivity issues
  • Site blocking your IP/region
  • Browser/Chrome subprocess problems
  • System resource constraints (CPU/memory)

Troubleshooting

Crawl Hangs at “Getting/refreshing CF cookies”

Symptoms: Browser opens but never navigates. Logs show “Getting/refreshing CF cookies” but no progress.

Possible causes:
  1. Asyncio event loop mismatch (fixed in latest version)
  2. Browser subprocess issues - Chrome/nodriver incompatible with thread-based event loop
  3. Display/X11 issues on Linux servers
  4. Network/firewall blocking browser traffic
Solutions:

1. Update to latest version

   Ensure you’re on the latest version with the timeout fix.

2. Verify browser opens

   Check that the browser actually opens (not failing silently in headless mode).

3. Check display (Linux servers)

   Verify Xvfb is installed: sudo apt-get install xvfb

4. Test with inspector

   Test with the --browser flag on the inspector first:

   ./scrapai inspect https://example.com --project proj --browser

5. Check system resources

   Verify CPU, memory, and disk space availability.

Works on One Machine But Not Another

Environmental factors affecting browser subprocesses:
  • Python/asyncio version differences
  • Display environment (X11 vs Wayland vs headless)
  • Chrome/Chromium version and availability
  • System resources and timing (race conditions)
  • Network conditions (DNS, latency, firewalls)
  • Security software interfering with browser
Debugging steps:

1. Test inspector on both machines

   ./scrapai inspect https://example.com --project proj --browser

2. Check Chrome installation

   google-chrome --version

3. Verify display (Linux)

   echo $DISPLAY  # Should show :99 with Xvfb

4. Review logs for errors

   Check logs for specific error messages.

5. Try different strategy

   Switch between hybrid and browser_only modes.
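When comparing two machines, it helps to capture the same environment facts on both. A hypothetical diagnostic helper, not part of the CLI:

```python
import os
import platform
import shutil
import subprocess
import sys

def environment_report():
    """Collect the environment facts compared across machines above."""
    report = {
        "python": sys.version.split()[0],
        "platform": platform.system(),
        "display": os.environ.get("DISPLAY"),  # X11 display, if any
        "chrome": None,
    }
    chrome = shutil.which("google-chrome") or shutil.which("chromium")
    if chrome:
        out = subprocess.run([chrome, "--version"],
                             capture_output=True, text=True)
        report["chrome"] = out.stdout.strip()
    return report
```

Run it on both machines and diff the output before digging into logs.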

Diagnosing via Logs

Hybrid mode indicators:
Cached N cookies (cf_clearance: ...)
// Cookies working properly
Browser-only mode indicators:
Cloudflare verified successfully
Opened persistent browser
Closed browser
// Normal lifecycle

Title Contamination

If extracted titles show wrong text (e.g., “Related Articles” instead of actual title), set CF_WAIT_SELECTOR to the main title element.
{
  "settings": {
    "CF_WAIT_SELECTOR": "h1.article-title"
  }
}
This captures HTML before related content loads, preventing contamination.
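The effect of CF_WAIT_SELECTOR with CF_WAIT_TIMEOUT can be pictured as a polling loop that refuses to snapshot the page until the target element appears. A sketch, where `get_html` stands in for any callable returning the current page HTML and a plain substring check stands in for a real CSS selector query:

```python
import time

def wait_for_selector(get_html, selector_text, timeout=10, interval=0.25):
    """Poll the page until selector_text appears in the HTML, then
    return that HTML; raise TimeoutError after `timeout` seconds.
    Illustrative only; the real crawler queries CSS selectors."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        html = get_html()
        if selector_text in html:
            return html
        time.sleep(interval)
    raise TimeoutError(f"selector {selector_text!r} not found within {timeout}s")
```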