Crawl commands execute spiders from the database. ScrapAI supports two modes: test mode (limited items, saved to database) and production mode (full crawl, exported to files with checkpoint support).

crawl

Run a spider by name.

Syntax

./scrapai crawl <spider> --project <name> [options]

Arguments

spider
string
required
Spider name (from database).

Options

--project
string
required
Project name containing the spider.
--limit, -l
integer
Limit number of items to scrape. Enables test mode when specified.
--output, -o
string
Custom output file path. If not specified, uses timestamped filename in data/<project>/<spider>/crawls/.
--timeout, -t
integer
Maximum runtime in seconds. Triggers graceful shutdown when exceeded.
--proxy-type
choice
default:"auto"
Proxy strategy:
  • auto: Smart escalation with expert-in-the-loop (default)
  • datacenter: Use datacenter proxy explicitly
  • residential: Use residential proxy explicitly
--browser
boolean
Use browser automation for JavaScript-rendered sites and Cloudflare bypass. Enables hybrid mode: browser for initial challenge, then fast HTTP requests.
--reset-deltafetch
boolean
Clear DeltaFetch cache to re-crawl all URLs. Useful when you need to refresh previously scraped content.
--save-html
boolean
Save raw HTML content in output files. Increases file size but useful for debugging or post-processing. Only applies to JSONL exports (production mode).
--scrapy-args
string
Additional Scrapy command-line arguments to pass through. Format: "-s SETTING=value -L DEBUG". Advanced users only.

Test Mode (with --limit)

Limited crawl for testing and verification:
./scrapai crawl bbc_co_uk --project news --limit 5
Behavior:
  • Stops after scraping N items
  • Saves items to database (scraped_items table)
  • No HTML content stored (smaller database)
  • Use show command to view results
  • No checkpoint (starts fresh each time)
Output:
🚀 Running DB spider: bbc_co_uk
🔄 Proxy mode: auto (smart escalation with expert-in-the-loop)
🧪 Test mode: Saving to database (limit: 5 items)
   Use './scrapai show bbc_co_uk' to verify results

[Scrapy crawl output...]

2026-02-28 15:30:42 [scrapy.core.engine] INFO: Spider closed (closespider_itemcount)
Use test mode to verify spider configuration before running a full production crawl.

Production Mode (no limit)

Full crawl with checkpoint support:
./scrapai crawl bbc_co_uk --project news
Behavior:
  • Crawls all matching URLs
  • Exports to timestamped JSONL file: data/news/bbc_co_uk/crawls/crawl_28022026_153042.jsonl
  • Includes full HTML content
  • Database writes disabled (performance)
  • Checkpoint enabled: Press Ctrl+C to pause, run same command to resume
Output:
🚀 Running DB spider: bbc_co_uk
🔄 Proxy mode: auto (smart escalation with expert-in-the-loop)
📁 Production mode: Exporting to files (database disabled)
💾 Checkpoint enabled: data/news/bbc_co_uk/checkpoint
   Press Ctrl+C to pause, run same command to resume
   Output: data/news/bbc_co_uk/crawls/crawl_28022026_153042.jsonl (includes HTML)

[Scrapy crawl output...]

2026-02-28 17:45:30 [scrapy.core.engine] INFO: Spider closed (finished)
 Checkpoint cleaned up (successful completion)

Checkpoint Pause/Resume

Press Ctrl+C during a production crawl to pause:
# Start crawl
./scrapai crawl bbc_co_uk --project news

# Press Ctrl+C after 30 minutes
^C
2026-02-28 16:00:15 [scrapy.core.engine] INFO: Spider closed (shutdown)
Checkpoint is saved at data/news/bbc_co_uk/checkpoint/. Resume by running the same command:
./scrapai crawl bbc_co_uk --project news
🚀 Running DB spider: bbc_co_uk
💾 Checkpoint enabled: data/news/bbc_co_uk/checkpoint
   Resuming from previous crawl...
Checkpoint stores:
  • URL queue state
  • Visited URLs (for deduplication)
  • Spider state variables
  • Proxy type used (if proxy type changes, checkpoint is cleared)
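The checkpoint behaves much like Scrapy's JOBDIR persistence: a pending request queue plus a set of visited URLs, written to disk so a later run can pick up where the previous one stopped. A minimal Python sketch of the idea — the file name and structure here are illustrative, not ScrapAI's actual on-disk format:

```python
import json
import os

# Hypothetical checkpoint file; ScrapAI's real layout may differ.
CHECKPOINT = "checkpoint_state.json"

def save_checkpoint(queue, seen):
    """Persist the pending URL queue and the visited-URL set."""
    with open(CHECKPOINT, "w", encoding="utf-8") as f:
        json.dump({"queue": queue, "seen": sorted(seen)}, f)

def load_checkpoint():
    """Resume from a previous run, or start fresh if no checkpoint exists."""
    if not os.path.exists(CHECKPOINT):
        return [], set()
    with open(CHECKPOINT, encoding="utf-8") as f:
        state = json.load(f)
    return state["queue"], set(state["seen"])

# Simulated pause: two URLs already crawled, one still queued.
save_checkpoint(["https://example.com/page3"],
                {"https://example.com/page1", "https://example.com/page2"})

# Simulated resume: the pending queue and dedup set come back intact.
queue, seen = load_checkpoint()
print(queue)
print(len(seen))
os.remove(CHECKPOINT)  # analogous to cleanup after successful completion
```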

Proxy Modes

Auto (Smart Escalation)

Default behavior. Starts with direct connections, escalates to proxy if blocked:
./scrapai crawl myspider --project myproject --proxy-type auto
  • First request: Direct connection
  • If blocked (403/429): Retry via datacenter proxy
  • Domain remembered for subsequent crawls
  • Residential proxy requires explicit opt-in

Datacenter Proxy

Use datacenter proxy for all requests:
./scrapai crawl myspider --project myproject --proxy-type datacenter
Requires proxy configuration in .env:
DATACENTER_PROXY_USERNAME=your_username
DATACENTER_PROXY_PASSWORD=your_password
DATACENTER_PROXY_HOST=proxy.example.com
DATACENTER_PROXY_PORT=10000
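These values are typically combined into a standard http://user:pass@host:port proxy URL. How ScrapAI composes them internally is an assumption here; for illustration:

```python
import os

# Hypothetical sketch: composing the .env values into a conventional
# proxy URL. ScrapAI's actual handling of these variables may differ.
os.environ.setdefault("DATACENTER_PROXY_USERNAME", "your_username")
os.environ.setdefault("DATACENTER_PROXY_PASSWORD", "your_password")
os.environ.setdefault("DATACENTER_PROXY_HOST", "proxy.example.com")
os.environ.setdefault("DATACENTER_PROXY_PORT", "10000")

proxy_url = "http://{}:{}@{}:{}".format(
    os.environ["DATACENTER_PROXY_USERNAME"],
    os.environ["DATACENTER_PROXY_PASSWORD"],
    os.environ["DATACENTER_PROXY_HOST"],
    os.environ["DATACENTER_PROXY_PORT"],
)
print(proxy_url)
```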

Residential Proxy

Use residential proxy for all requests (expensive, use sparingly):
./scrapai crawl myspider --project myproject --proxy-type residential
Requires residential proxy configuration in .env.
Changing proxy type mid-crawl clears the checkpoint to ensure all URLs are retried with the new proxy.

Timeout

Set maximum runtime for long crawls:
# Max 2 hours (7200 seconds)
./scrapai crawl myspider --project myproject --timeout 7200
Output:
⏱️  Max runtime: 2.0 hours (graceful stop)
When timeout is reached, Scrapy triggers graceful shutdown (finishes in-flight requests, saves checkpoint).

Cloudflare Bypass

For spiders with CLOUDFLARE_ENABLED: true in settings:
./scrapai crawl cloudflare_spider --project myproject
Linux (headless server):
🖥️  Headless environment detected - using xvfb for Cloudflare bypass
Requires xvfb installed:
sudo apt-get install xvfb
ScrapAI automatically wraps the command with xvfb-run -a when needed.
macOS/Windows:
🖥️  Display available - using native browser for Cloudflare bypass
Uses system display for browser automation.

Sitemap Spider

For spiders with USE_SITEMAP: true in settings:
🗺️  Using sitemap spider
Crawls from XML sitemap instead of following links. Faster for sites with good sitemaps.

crawl-all

Run all active spiders in a project sequentially.

Syntax

./scrapai crawl-all --project <name> [--limit <N>]

Options

--project
string
required
Project name.
--limit, -l
integer
Limit items per spider (test mode).

Example

./scrapai crawl-all --project news --limit 10
Output:
🚀 Running all spiders for project: news
🕷️  Spiders: bbc_co_uk, cnn_com, reuters_com

==================================================
Running: bbc_co_uk
==================================================
🚀 Running DB spider: bbc_co_uk
[...]

==================================================
Running: cnn_com
==================================================
🚀 Running DB spider: cnn_com
[...]

==================================================
Running: reuters_com
==================================================
🚀 Running DB spider: reuters_com
[...]
For parallel execution, use GNU parallel with bin/parallel-crawl <project>.

Output Formats

JSONL (Production)

Each line is a JSON object:
{"url": "https://bbc.co.uk/news/article-1", "title": "Article Title", "content": "...", "html": "<!DOCTYPE html>...", "scraped_at": "2026-02-28T15:30:42"}
{"url": "https://bbc.co.uk/news/article-2", "title": "Another Article", "content": "...", "html": "<!DOCTYPE html>...", "scraped_at": "2026-02-28T15:31:15"}
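Because each line is an independent JSON object, exports can be processed line by line without loading the whole file into memory. A minimal reader sketch (field names follow the sample records above):

```python
import json
import os

def iter_crawl_items(path):
    """Yield one dict per non-empty line of a JSONL crawl export."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Usage with a tiny sample file (abbreviated records for illustration):
sample = (
    '{"url": "https://bbc.co.uk/news/article-1", "title": "Article Title"}\n'
    '{"url": "https://bbc.co.uk/news/article-2", "title": "Another Article"}\n'
)
with open("sample_crawl.jsonl", "w", encoding="utf-8") as f:
    f.write(sample)

titles = [item["title"] for item in iter_crawl_items("sample_crawl.jsonl")]
print(titles)  # ['Article Title', 'Another Article']
os.remove("sample_crawl.jsonl")
```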

Database (Test Mode)

Stored in scraped_items table:
SELECT id, url, title, scraped_at 
FROM scraped_items 
WHERE spider_id = (SELECT id FROM spiders WHERE name = 'bbc_co_uk');

Performance Tips

Concurrent Requests

Adjust in spider settings:
{
  "settings": {
    "CONCURRENT_REQUESTS": 16
  }
}
Default: 8. Higher values = faster crawls, but risk IP bans.

Download Delay

Polite crawling delay between requests:
{
  "settings": {
    "DOWNLOAD_DELAY": 2
  }
}
Default: 0 (no delay). Recommended: 1-3 seconds for most sites, 5+ seconds for polite crawling of rate-sensitive sites.

Incremental Crawling

Enable DeltaFetch to skip already-scraped URLs:
{
  "settings": {
    "DELTAFETCH_ENABLED": true
  }
}
Typically reduces bandwidth by 80-90% on routine re-crawls, since only new or changed URLs are fetched.
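DeltaFetch works by fingerprinting each request and skipping any fingerprint recorded in a previous run. A simplified sketch of that filtering idea (not the extension's actual implementation):

```python
import hashlib

def fingerprint(url):
    """Stable key for a request; DeltaFetch uses a similar request hash."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()

# Fingerprints persisted from an earlier crawl (illustrative).
seen = {fingerprint("https://example.com/old-article")}

def should_fetch(url):
    """Skip URLs whose fingerprint was stored by a previous run."""
    return fingerprint(url) not in seen

urls = ["https://example.com/old-article", "https://example.com/new-article"]
to_fetch = [u for u in urls if should_fetch(u)]
print(to_fetch)  # only the new article is fetched
```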

Troubleshooting

Spider Not Found

 Spider 'myspider' not found in database.
Solution: Import the spider first:
./scrapai spiders import myspider.json --project myproject
./scrapai spiders list --project myproject

Permission Denied (Output File)

PermissionError: [Errno 13] Permission denied: 'data/news/spider/crawls/crawl.jsonl'
Solution: Check DATA_DIR permissions:
ls -la data/
chmod -R u+w data/

Cloudflare Bypass Failed

⚠️  WARNING: Cloudflare bypass enabled but no display available and xvfb not installed
Solution (Linux):
sudo apt-get install xvfb

Checkpoint Corruption

If resume fails with errors:
rm -rf data/<project>/<spider>/checkpoint
./scrapai crawl <spider> --project <project>

Next Steps