Crawl commands run spiders stored in the database. ScrapAI supports two modes: test mode (limited item count, saved to the database) and production mode (full crawl, exported to files with checkpoint support).

crawl

Run a spider by name.
./scrapai crawl <spider> --project <name> [options]

Arguments & Options

spider
string
required
Spider name (from database).
--project
string
required
Project name containing the spider.
--limit, -l
integer
Limit number of items to scrape. Enables test mode when specified.
--output, -o
string
Custom output file path (default: timestamped file in data/<project>/<spider>/crawls/).
--timeout, -t
integer
Maximum runtime in seconds. Triggers graceful shutdown when exceeded.
--proxy-type
choice
default:"auto"
Proxy strategy:
  • auto: Smart escalation with expert-in-the-loop (default)
  • datacenter: Use datacenter proxy explicitly
  • residential: Use residential proxy explicitly
--browser
boolean
Use browser automation for JavaScript-rendered sites and Cloudflare bypass. Enables hybrid mode: browser for initial challenge, then fast HTTP requests.
--reset-deltafetch
boolean
Clear DeltaFetch cache to re-crawl previously scraped URLs.
--save-html
boolean
Save raw HTML in output files (production mode only). Increases file size.
--scrapy-args
string
Additional Scrapy arguments. Format: "-s SETTING=value -L DEBUG".

Test Mode (with --limit)

./scrapai crawl bbc_co_uk --project news --limit 5
Behavior:
  • Stops after N items
  • Saves to database (scraped_items table)
  • No HTML content stored
  • View with show command
  • No checkpoint support
Output:
🚀 Running DB spider: bbc_co_uk
🔄 Proxy mode: auto (smart escalation with expert-in-the-loop)
🧪 Test mode: Saving to database (limit: 5 items)
   Use './scrapai show bbc_co_uk' to verify results

[Scrapy crawl output...]

2026-02-28 15:30:42 [scrapy.core.engine] INFO: Spider closed (closespider_itemcount)
Use test mode to verify spider configuration before running a full production crawl.

Production Mode (no limit)

./scrapai crawl bbc_co_uk --project news
Behavior:
  • Crawls all matching URLs
  • Exports to timestamped JSONL: data/news/bbc_co_uk/crawls/crawl_28022026_153042.jsonl
  • Includes full HTML content
  • Database disabled (performance)
  • Checkpoint enabled (Ctrl+C to pause, resume with same command)
Output:
🚀 Running DB spider: bbc_co_uk
🔄 Proxy mode: auto (smart escalation with expert-in-the-loop)
📁 Production mode: Exporting to files (database disabled)
💾 Checkpoint enabled: data/news/bbc_co_uk/checkpoint
   Press Ctrl+C to pause, run same command to resume
   Output: data/news/bbc_co_uk/crawls/crawl_28022026_153042.jsonl (includes HTML)

[Scrapy crawl output...]

2026-02-28 17:45:30 [scrapy.core.engine] INFO: Spider closed (finished)
Checkpoint cleaned up (successful completion)
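
The timestamped filename shown above follows a DDMMYYYY_HHMMSS pattern. A minimal sketch of generating such a name; the exact format string is an assumption inferred from the example output, not confirmed by ScrapAI's source:

```python
from datetime import datetime

def crawl_filename(ts: datetime) -> str:
    # Pattern inferred from crawl_28022026_153042.jsonl:
    # day-month-year, underscore, hour-minute-second.
    return ts.strftime("crawl_%d%m%Y_%H%M%S.jsonl")

# The production-mode example above started at 2026-02-28 15:30:42:
print(crawl_filename(datetime(2026, 2, 28, 15, 30, 42)))
# crawl_28022026_153042.jsonl
```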

Checkpoint Pause/Resume

Press Ctrl+C to pause. Checkpoint is saved at data/news/bbc_co_uk/checkpoint/. Resume by running the same command:
./scrapai crawl bbc_co_uk --project news
🚀 Running DB spider: bbc_co_uk
💾 Checkpoint enabled: data/news/bbc_co_uk/checkpoint
   Resuming from previous crawl...
Checkpoint stores:
  • URL queue state
  • Visited URLs (for deduplication)
  • Spider state variables
  • Proxy type used (if proxy type changes, checkpoint is cleared)

Proxy Modes

Auto (Smart Escalation)

./scrapai crawl myspider --project myproject --proxy-type auto
Default behavior. Starts with direct connections and escalates to a datacenter proxy when blocked (HTTP 403/429). Residential proxy requires explicit opt-in.

Datacenter Proxy

./scrapai crawl myspider --project myproject --proxy-type datacenter
Requires .env configuration:
DATACENTER_PROXY_USERNAME=your_username
DATACENTER_PROXY_PASSWORD=your_password
DATACENTER_PROXY_HOST=proxy.example.com
DATACENTER_PROXY_PORT=10000

Residential Proxy

./scrapai crawl myspider --project myproject --proxy-type residential
Residential proxies are expensive; use them sparingly. Requires .env configuration.
Changing proxy type mid-crawl clears the checkpoint to ensure all URLs are retried with the new proxy.

Timeout

./scrapai crawl myspider --project myproject --timeout 7200  # 2 hours
Triggers graceful shutdown when reached (finishes in-flight requests, saves checkpoint).

Cloudflare Bypass

For spiders with CLOUDFLARE_ENABLED: true:
./scrapai crawl cloudflare_spider --project myproject
  • Linux: requires xvfb (sudo apt-get install xvfb); the crawl is automatically wrapped with xvfb-run -a.
  • macOS/Windows: uses the native browser with the system display.

Sitemap Spider

For spiders with USE_SITEMAP: true, crawls from XML sitemap instead of following links.

crawl-all

Run all active spiders in a project sequentially.
./scrapai crawl-all --project <name> [--limit <N>]
--project
string
required
Project name.
--limit, -l
integer
Limit items per spider (enables test mode).
./scrapai crawl-all --project news --limit 10
For parallel execution, use the bin/parallel-crawl <project> wrapper, which runs spiders concurrently via GNU parallel.

Output Formats

JSONL (Production)

Each line is a JSON object:
{"url": "https://bbc.co.uk/news/article-1", "title": "Article Title", "content": "...", "html": "<!DOCTYPE html>...", "scraped_at": "2026-02-28T15:30:42"}
{"url": "https://bbc.co.uk/news/article-2", "title": "Another Article", "content": "...", "html": "<!DOCTYPE html>...", "scraped_at": "2026-02-28T15:31:15"}
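
Because each line is an independent JSON object, output files can be processed one line at a time with flat memory use, even when full HTML is included. A minimal sketch, with field names taken from the example records above:

```python
import json

def iter_items(lines):
    """Yield one parsed item per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

# In real use, pass an open file handle: iter_items(open(path))
sample = [
    '{"url": "https://bbc.co.uk/news/article-1", "title": "Article Title", "scraped_at": "2026-02-28T15:30:42"}',
]
for item in iter_items(sample):
    print(item["url"], item["title"])
```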

Database (Test Mode)

Stored in scraped_items table:
SELECT id, url, title, scraped_at 
FROM scraped_items 
WHERE spider_id = (SELECT id FROM spiders WHERE name = 'bbc_co_uk');
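
If the backing store is SQLite (an assumption; the database engine is not specified here), the same query can be issued from Python. The schema below is reconstructed from the query above purely for illustration:

```python
import sqlite3

# In-memory database with a schema reconstructed from the query above (an assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE spiders (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE scraped_items (id INTEGER PRIMARY KEY, spider_id INTEGER,
                            url TEXT, title TEXT, scraped_at TEXT);
INSERT INTO spiders VALUES (1, 'bbc_co_uk');
INSERT INTO scraped_items VALUES (1, 1, 'https://bbc.co.uk/news/article-1',
                                  'Article Title', '2026-02-28T15:30:42');
""")
rows = conn.execute("""
    SELECT id, url, title, scraped_at
    FROM scraped_items
    WHERE spider_id = (SELECT id FROM spiders WHERE name = ?)
""", ("bbc_co_uk",)).fetchall()
for row in rows:
    print(row)
```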

Performance Tips

Concurrent Requests: Default 8. Higher = faster, but risks bans.
{"settings": {"CONCURRENT_REQUESTS": 16}}
Download Delay: Default 0. Recommended 1-3s (aggressive), 5+s (polite).
{"settings": {"DOWNLOAD_DELAY": 2}}
Incremental Crawling: Enable DeltaFetch to skip scraped URLs (80-90% bandwidth reduction).
{"settings": {"DELTAFETCH_ENABLED": true}}
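
A back-of-the-envelope check helps when choosing these values: Scrapy spaces requests to the same domain by DOWNLOAD_DELAY, so for a single-domain spider the delay, not concurrency, gates throughput. A rough lower-bound sketch:

```python
def rough_crawl_seconds(num_requests: int, download_delay: float) -> float:
    """Rough lower bound for a single-domain crawl.

    With DOWNLOAD_DELAY > 0, Scrapy spaces requests to the same domain
    by the delay, so raising CONCURRENT_REQUESTS does not speed up a
    one-domain crawl; total time is at least requests * delay.
    """
    return num_requests * download_delay

# 10,000 pages at DOWNLOAD_DELAY=2:
print(f"{rough_crawl_seconds(10_000, 2.0) / 3600:.1f} hours")  # 5.6 hours
```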

Troubleshooting

Spider Not Found

Spider 'myspider' not found in database.
Import the spider first:
./scrapai spiders import myspider.json --project myproject

Permission Denied

PermissionError: [Errno 13] Permission denied: 'data/news/spider/crawls/crawl.jsonl'
Check permissions: chmod -R u+w data/

Cloudflare Bypass Failed

⚠️  WARNING: Cloudflare bypass enabled but no display available and xvfb not installed
Linux: sudo apt-get install xvfb

Checkpoint Corruption

If resume fails:
rm -rf data/<project>/<spider>/checkpoint
./scrapai crawl <spider> --project <project>

Next Steps

View Scraped Data

Inspect and export crawl results

Queue Management

Batch process multiple websites