Only enable Cloudflare bypass when the site explicitly requires it. Always test WITHOUT --browser first.

Detection Indicators

Your site needs Cloudflare bypass if you see:
  • “Checking your browser” or “Just a moment” messages
  • 403/503 HTTP errors with Cloudflare branding
  • Challenge pages before content loads
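These indicators can be checked programmatically before deciding to escalate. A minimal sketch; the function name and heuristics are illustrative, not part of scrapai:

```python
def needs_cf_bypass(status: int, body: str, headers: dict) -> bool:
    """Heuristic version of the indicators above: challenge-page text,
    or a 403/503 response with Cloudflare branding in the headers."""
    challenge_phrases = ("checking your browser", "just a moment")
    if any(p in body.lower() for p in challenge_phrases):
        return True
    if status in (403, 503) and "cloudflare" in headers.get("server", "").lower():
        return True
    return False
```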
VPS/Cloud Server IP Reputation Issue: If running on AWS, DigitalOcean, Hetzner, or any cloud provider, Cloudflare may block your server’s IP even with browser bypass enabled. Cloud/datacenter IPs are often flagged as high-risk.

Solution: Combine --browser with residential proxies:
./scrapai crawl spider --project proj --proxy-type residential --browser
See Proxy Configuration for details.

Display Requirements

Cloudflare bypass requires a visible browser (not headless). Cloudflare detects and blocks headless browsers.
Platform support:
  • Windows: Uses native display automatically ✓
  • macOS: Uses native display automatically ✓
  • Linux desktop: Uses native display automatically ✓
  • Linux servers (VPS without GUI): Auto-detects missing display and uses Xvfb (virtual display) ✓
Installing Xvfb on Linux servers:
sudo apt-get install xvfb
The crawler automatically detects your environment and uses Xvfb when no display is available on Linux.
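The auto-detection described above can be sketched roughly as follows. `pick_display_backend` is a hypothetical name, not the crawler's actual API; parameters default to the real environment but can be passed explicitly:

```python
import os
import platform
import shutil

def pick_display_backend(system=None, display=None, have_xvfb=None):
    """Illustrative sketch of the display auto-detection described above."""
    system = platform.system() if system is None else system
    display = os.environ.get("DISPLAY", "") if display is None else display
    have_xvfb = (shutil.which("Xvfb") is not None) if have_xvfb is None else have_xvfb

    if system in ("Windows", "Darwin"):
        return "native"   # Windows and macOS always have a display
    if display:
        return "native"   # Linux desktop with a running X11/Wayland session
    if have_xvfb:
        return "xvfb"     # headless Linux server: start a virtual display
    raise RuntimeError("No display found and Xvfb is not installed; "
                       "run: sudo apt-get install xvfb")
```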

Inspector Usage

1. Start with default HTTP (fast)

   Works for most sites:

   ./scrapai inspect https://example.com --project proj

2. Try browser mode if JS-rendered

   For JavaScript-heavy sites:

   ./scrapai inspect https://example.com --project proj --browser

3. Use Cloudflare bypass only when blocked

   For Cloudflare-protected sites:

   ./scrapai inspect https://example.com --project proj --browser
Smart resource usage: Start with the default HTTP mode (fast). Escalate to --browser if content is JS-rendered. Enable Cloudflare bypass only when you see challenge pages or 403/503 errors.

Strategies

Hybrid Mode (Recommended)

Browser verification runs once per 10 minutes; all other requests use fast HTTP with cached cookies. 20-100x faster than browser-only mode.
Do NOT set CONCURRENT_REQUESTS in hybrid mode; the crawler uses the Scrapy default of 16 for optimal performance.
spider.json
{
  "settings": {
    "CLOUDFLARE_ENABLED": true,
    "CLOUDFLARE_STRATEGY": "hybrid",
    "CLOUDFLARE_COOKIE_REFRESH_THRESHOLD": 600,
    "CF_MAX_RETRIES": 5,
    "CF_RETRY_INTERVAL": 1,
    "CF_POST_DELAY": 5
  }
}
How it works:
  1. Browser verifies Cloudflare once and caches cookies
  2. Subsequent requests use fast HTTP with cached cookies
  3. Auto-refreshes cookies every 10 minutes
  4. Falls back to browser if cookies become invalid
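The cookie lifecycle above can be sketched as a small cache keyed on a refresh threshold. This is an illustration of the strategy, not the crawler's internals; names are hypothetical:

```python
import time

class CookieCache:
    """Minimal sketch of the hybrid strategy: verify once in a browser,
    reuse cookies over fast HTTP, refresh after a threshold."""

    def __init__(self, refresh_threshold=600, verify=None):
        self.refresh_threshold = refresh_threshold  # CLOUDFLARE_COOKIE_REFRESH_THRESHOLD
        self.verify = verify        # callable that runs the browser verification
        self.cookies = None
        self.fetched_at = 0.0

    def get(self, now=None):
        now = time.time() if now is None else now
        if self.cookies is None or now - self.fetched_at > self.refresh_threshold:
            self.cookies = self.verify()   # steps 1 and 3: browser verifies, cookies cached
            self.fetched_at = now
        return self.cookies                # step 2: fast HTTP reuses these

    def invalidate(self):
        self.cookies = None                # step 4: force a browser fallback
```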

Browser-Only Mode (Legacy)

Only use browser-only mode if hybrid mode fails. It launches the browser for every request and is much slower. Requires CONCURRENT_REQUESTS: 1 to prevent browser conflicts.
spider.json
{
  "settings": {
    "CLOUDFLARE_ENABLED": true,
    "CLOUDFLARE_STRATEGY": "browser_only",
    "CONCURRENT_REQUESTS": 1
  }
}

Settings Reference

| Setting | Default | Description |
|---|---|---|
| CLOUDFLARE_ENABLED | false | Enable CF bypass |
| CLOUDFLARE_STRATEGY | "hybrid" | "hybrid" or "browser_only" |
| CLOUDFLARE_COOKIE_REFRESH_THRESHOLD | 600 | Seconds before cookie refresh |
| CF_MAX_RETRIES | 5 | Max verification attempts |
| CF_RETRY_INTERVAL | 1 | Seconds between retries |
| CF_POST_DELAY | 5 | Seconds after successful verification |
| CF_WAIT_SELECTOR | (none) | CSS selector to wait for before extracting |
| CF_WAIT_TIMEOUT | 10 | Max seconds to wait for selector |
| CF_PAGE_TIMEOUT | 120000 | Page navigation timeout (ms) |
| CONCURRENT_REQUESTS | (unset) | Must be 1 for browser-only mode |
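One way to read this table is as a set of defaults that spider.json settings override. A sketch of that merge, with the browser-only constraint enforced; this is an illustration, not the crawler's actual code:

```python
CF_DEFAULTS = {
    "CLOUDFLARE_ENABLED": False,
    "CLOUDFLARE_STRATEGY": "hybrid",
    "CLOUDFLARE_COOKIE_REFRESH_THRESHOLD": 600,
    "CF_MAX_RETRIES": 5,
    "CF_RETRY_INTERVAL": 1,
    "CF_POST_DELAY": 5,
    "CF_WAIT_SELECTOR": None,
    "CF_WAIT_TIMEOUT": 10,
    "CF_PAGE_TIMEOUT": 120000,
}

def resolve_settings(spider_settings: dict) -> dict:
    """Merge spider.json settings over the defaults from the table above."""
    merged = {**CF_DEFAULTS, **spider_settings}
    if (merged["CLOUDFLARE_STRATEGY"] == "browser_only"
            and merged.get("CONCURRENT_REQUESTS") != 1):
        raise ValueError("browser_only mode requires CONCURRENT_REQUESTS: 1")
    return merged
```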

Complete Spider Example

spider.json
{
  "name": "mysite",
  "allowed_domains": ["example.com"],
  "start_urls": ["https://www.example.com/articles"],
  "rules": [
    {
      "allow": ["/article/[^/]+$"],
      "callback": "parse_article",
      "follow": false,
      "priority": 100
    },
    {
      "allow": ["/articles/"],
      "callback": null,
      "follow": true,
      "priority": 50
    }
  ],
  "settings": {
    "CLOUDFLARE_ENABLED": true,
    "CLOUDFLARE_STRATEGY": "hybrid",
    "CLOUDFLARE_COOKIE_REFRESH_THRESHOLD": 600,
    "CF_MAX_RETRIES": 5,
    "CF_RETRY_INTERVAL": 1,
    "CF_POST_DELAY": 5,
    "CF_WAIT_SELECTOR": "h1.title-med-1",
    "DOWNLOAD_DELAY": 2
  }
}
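The rules in this example follow CrawlSpider-style semantics: each URL is tested against the allow patterns and the highest-priority matching rule wins. A sketch of that matching, assuming regex patterns and a numeric priority as shown above (not the tool's exact implementation):

```python
import re

def match_rule(url_path: str, rules: list):
    """Pick the highest-priority rule whose allow pattern matches the path."""
    hits = [r for r in rules
            if any(re.search(p, url_path) for p in r["allow"])]
    return max(hits, key=lambda r: r["priority"]) if hits else None

rules = [
    {"allow": ["/article/[^/]+$"], "callback": "parse_article",
     "follow": False, "priority": 100},
    {"allow": ["/articles/"], "callback": None,
     "follow": True, "priority": 50},
]
```

With these rules, article detail pages hit the extraction callback while listing pages are only followed for links.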

Timeouts & Hang Prevention

Browser operation timeout: 300 seconds (5 minutes) per operation to prevent infinite hangs.
If a browser operation exceeds 300 seconds, the crawl fails with a TimeoutError instead of hanging forever. This protects against:
  • Browser subprocess hangs
  • Network stalls
  • Infinite CF challenge loops
  • Cross-thread asyncio deadlocks
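The fail-fast behavior amounts to wrapping each browser operation in a hard deadline. A minimal sketch using asyncio; the wrapper name is illustrative, not scrapai's code:

```python
import asyncio

BROWSER_OP_TIMEOUT = 300  # seconds (5 minutes), as described above

async def run_browser_op(coro, timeout=BROWSER_OP_TIMEOUT):
    """Run a browser operation with a hard deadline so the crawl
    fails with TimeoutError instead of hanging forever."""
    return await asyncio.wait_for(coro, timeout=timeout)
```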
Typical operation times:
  • CF verification: 10-60 seconds
  • Page load: 5-30 seconds
  • Cookie refresh: 10-30 seconds
If you consistently hit the 300s timeout, investigate:
  • Network connectivity issues
  • Site blocking your IP/region
  • Browser/Chrome subprocess problems
  • System resource constraints (CPU/memory)

Troubleshooting

Crawl Hangs at “Getting/refreshing CF cookies”

Symptoms: Browser opens but never navigates. Logs show “Getting/refreshing CF cookies” but no progress.

Possible causes:
  1. Asyncio event loop mismatch (fixed in latest version)
  2. Browser subprocess issues - Chrome/nodriver incompatible with thread-based event loop
  3. Display/X11 issues on Linux servers
  4. Network/firewall blocking browser traffic
Solutions:

1. Update to latest version

   Ensure you’re on the latest version with the timeout fix.

2. Verify browser opens

   Check that the browser actually opens (not failing silently in headless mode).

3. Check display (Linux servers)

   Verify Xvfb is installed: sudo apt-get install xvfb

4. Test with inspector

   Test with the --browser flag on the inspector first:

   ./scrapai inspect https://example.com --project proj --browser

5. Check system resources

   Verify CPU, memory, and disk space availability.

Works on One Machine But Not Another

Environmental factors affecting browser subprocesses:
  • Python/asyncio version differences
  • Display environment (X11 vs Wayland vs headless)
  • Chrome/Chromium version and availability
  • System resources and timing (race conditions)
  • Network conditions (DNS, latency, firewalls)
  • Security software interfering with browser
Debugging steps:

1. Test inspector on both machines

   ./scrapai inspect https://example.com --project proj --browser

2. Check Chrome installation

   google-chrome --version

3. Verify display (Linux)

   echo $DISPLAY  # Should show :99 with Xvfb

4. Review logs for errors

   Check logs for specific error messages.

5. Try different strategy

   Switch between hybrid and browser_only modes.
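When comparing two machines, it helps to capture the same environment facts on both. A hypothetical diagnostic helper, not part of the CLI:

```python
import os
import platform
import shutil
import subprocess
import sys

def environment_report():
    """Collect the environment facts compared across machines above."""
    report = {
        "python": sys.version.split()[0],
        "platform": platform.system(),
        "display": os.environ.get("DISPLAY"),  # X11 display, if any
        "chrome": None,
    }
    chrome = shutil.which("google-chrome") or shutil.which("chromium")
    if chrome:
        out = subprocess.run([chrome, "--version"],
                             capture_output=True, text=True)
        report["chrome"] = out.stdout.strip()
    return report
```

Run it on both machines and diff the output before digging into logs.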

Diagnosing via Logs

Hybrid mode indicators:
Cached N cookies (cf_clearance: ...)
// Cookies working properly
Browser-only mode indicators:
Cloudflare verified successfully
Opened persistent browser
Closed browser
// Normal lifecycle

Title Contamination

If extracted titles show wrong text (e.g., “Related Articles” instead of actual title), set CF_WAIT_SELECTOR to the main title element.
{
  "settings": {
    "CF_WAIT_SELECTOR": "h1.article-title"
  }
}
This captures HTML before related content loads, preventing contamination.
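The effect of CF_WAIT_SELECTOR with CF_WAIT_TIMEOUT can be pictured as a polling loop that refuses to snapshot the page until the target element appears. A sketch, where `get_html` stands in for any callable returning the current page HTML and a plain substring check stands in for a real CSS selector query:

```python
import time

def wait_for_selector(get_html, selector_text, timeout=10, interval=0.25):
    """Poll the page until selector_text appears in the HTML, then
    return that HTML; raise TimeoutError after `timeout` seconds.
    Illustrative only; the real crawler queries CSS selectors."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        html = get_html()
        if selector_text in html:
            return html
        time.sleep(interval)
    raise TimeoutError(f"selector {selector_text!r} not found within {timeout}s")
```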