Crawl commands execute spiders from the database. ScrapAI supports two modes: test mode (limited items, saved to database) and production mode (full crawl, exported to files with checkpoint support).
crawl
Run a spider by name.
Syntax
./scrapai crawl <spider> --project <name> [options]
Arguments
<spider>: Spider name (from database).
Options
--project <name>: Project name containing the spider.
--limit <N>: Limit the number of items to scrape. Enables test mode when specified.
Custom output file path. If not specified, uses timestamped filename in data/<project>/<spider>/crawls/.
--timeout <seconds>: Maximum runtime in seconds. Triggers a graceful shutdown when exceeded.
--proxy-type <mode>: Proxy strategy:
auto: Smart escalation with expert-in-the-loop (default)
datacenter: Use datacenter proxy explicitly
residential: Use residential proxy explicitly
Use browser automation for JavaScript-rendered sites and Cloudflare bypass. Enables hybrid mode: browser for initial challenge, then fast HTTP requests.
Clear DeltaFetch cache to re-crawl all URLs. Useful when you need to refresh previously scraped content.
Save raw HTML content in output files. Increases file size but useful for debugging or post-processing. Only applies to JSONL exports (production mode).
Additional Scrapy command-line arguments to pass through. Format: "-s SETTING=value -L DEBUG". Advanced users only.
Test Mode (with --limit)
Limited crawl for testing and verification:
./scrapai crawl bbc_co_uk --project news --limit 5
Behavior:
- Stops after scraping N items
- Saves items to database (scraped_items table)
- No HTML content stored (smaller database)
- Use the show command to view results
- No checkpoint (starts fresh each time)
Output:
🚀 Running DB spider: bbc_co_uk
🔄 Proxy mode: auto (smart escalation with expert-in-the-loop)
🧪 Test mode: Saving to database (limit: 5 items)
Use './scrapai show bbc_co_uk' to verify results
[Scrapy crawl output...]
2026-02-28 15:30:42 [scrapy.core.engine] INFO: Spider closed (closespider_itemcount)
Use test mode to verify spider configuration before running a full production crawl.
Production Mode (no limit)
Full crawl with checkpoint support:
./scrapai crawl bbc_co_uk --project news
Behavior:
- Crawls all matching URLs
- Exports to timestamped JSONL file:
data/news/bbc_co_uk/crawls/crawl_28022026_153042.jsonl
- Includes full HTML content
- Database writes disabled (performance)
- Checkpoint enabled: Press Ctrl+C to pause, run same command to resume
Output:
🚀 Running DB spider: bbc_co_uk
🔄 Proxy mode: auto (smart escalation with expert-in-the-loop)
📁 Production mode: Exporting to files (database disabled)
💾 Checkpoint enabled: data/news/bbc_co_uk/checkpoint
Press Ctrl+C to pause, run same command to resume
Output: data/news/bbc_co_uk/crawls/crawl_28022026_153042.jsonl (includes HTML)
[Scrapy crawl output...]
2026-02-28 17:45:30 [scrapy.core.engine] INFO: Spider closed (finished)
✓ Checkpoint cleaned up (successful completion)
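Note that the day-first timestamp in the filename (DDMMYYYY_HHMMSS) does not sort chronologically by name, so pick the latest export by modification time instead. A minimal sketch using a throwaway mock directory (real exports live under data/<project>/<spider>/crawls/):

```shell
# Mock two exports with distinct modification times.
dir=$(mktemp -d)
touch -t 202602270900 "$dir/crawl_27022026_090000.jsonl"
touch -t 202602281530 "$dir/crawl_28022026_153042.jsonl"
# ls -t sorts by modification time, newest first.
latest=$(ls -t "$dir"/*.jsonl | head -n 1)
echo "Latest export: $latest"
```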
Checkpoint Pause/Resume
Press Ctrl+C during a production crawl to pause:
# Start crawl
./scrapai crawl bbc_co_uk --project news
# Press Ctrl+C after 30 minutes
^C
2026-02-28 16:00:15 [scrapy.core.engine] INFO: Spider closed (shutdown)
Checkpoint is saved at data/news/bbc_co_uk/checkpoint/. Resume by running the same command:
./scrapai crawl bbc_co_uk --project news
🚀 Running DB spider: bbc_co_uk
💾 Checkpoint enabled: data/news/bbc_co_uk/checkpoint
Resuming from previous crawl...
Checkpoint stores:
- URL queue state
- Visited URLs (for deduplication)
- Spider state variables
- Proxy type used (if proxy type changes, checkpoint is cleared)
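Before resuming, you can check whether a checkpoint survived the interruption: the same crawl command resumes whenever the checkpoint directory exists and is non-empty. A sketch using a mock directory (the real path is data/<project>/<spider>/checkpoint/, and the state filename below is made up for illustration):

```shell
root=$(mktemp -d)
checkpoint="$root/checkpoint"                 # stands in for data/news/bbc_co_uk/checkpoint
mkdir -p "$checkpoint"
printf 'queue-state' > "$checkpoint/state"    # hypothetical state file
# A non-empty checkpoint directory means the next run will resume.
if [ -d "$checkpoint" ] && [ -n "$(ls -A "$checkpoint")" ]; then
  mode="resume"
else
  mode="fresh"
fi
echo "Next run will: $mode"
```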
Proxy Modes
Auto (Smart Escalation)
Default behavior. Starts with direct connections, escalates to proxy if blocked:
./scrapai crawl myspider --project myproject --proxy-type auto
- First request: Direct connection
- If blocked (403/429): Retry via datacenter proxy
- Domain remembered for subsequent crawls
- Residential proxy requires explicit opt-in
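The escalation rule above can be sketched as a tiny decision function (status codes are the block signals named in the list; the function name is illustrative, not part of ScrapAI):

```shell
# Map an HTTP status to the proxy tier the next retry should use.
# 403/429 are the block signals; anything else stays on a direct connection.
# Residential is never chosen automatically: it needs explicit opt-in.
next_proxy_for_status() {
  case "$1" in
    403|429) echo "datacenter" ;;
    *)       echo "direct" ;;
  esac
}
next_proxy_for_status 403   # datacenter
next_proxy_for_status 200   # direct
```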
Datacenter Proxy
Use datacenter proxy for all requests:
./scrapai crawl myspider --project myproject --proxy-type datacenter
Requires proxy configuration in .env:
DATACENTER_PROXY_USERNAME=your_username
DATACENTER_PROXY_PASSWORD=your_password
DATACENTER_PROXY_HOST=proxy.example.com
DATACENTER_PROXY_PORT=10000
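These four values are conventionally combined into an authenticated HTTP proxy URL. A sketch using the placeholder values above (the exact URL form ScrapAI builds internally is not documented here, so treat this as the standard convention):

```shell
# Placeholder credentials, matching the .env example above.
DATACENTER_PROXY_USERNAME=your_username
DATACENTER_PROXY_PASSWORD=your_password
DATACENTER_PROXY_HOST=proxy.example.com
DATACENTER_PROXY_PORT=10000
# Conventional authenticated HTTP proxy URL: http://user:pass@host:port
proxy_url="http://${DATACENTER_PROXY_USERNAME}:${DATACENTER_PROXY_PASSWORD}@${DATACENTER_PROXY_HOST}:${DATACENTER_PROXY_PORT}"
echo "$proxy_url"
```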
Residential Proxy
Use residential proxy for all requests (expensive, use sparingly):
./scrapai crawl myspider --project myproject --proxy-type residential
Requires residential proxy configuration in .env.
Changing proxy type mid-crawl clears the checkpoint to ensure all URLs are retried with the new proxy.
Timeout
Set maximum runtime for long crawls:
# Max 2 hours (7200 seconds)
./scrapai crawl myspider --project myproject --timeout 7200
Output:
⏱️ Max runtime: 2.0 hours (graceful stop)
When timeout is reached, Scrapy triggers graceful shutdown (finishes in-flight requests, saves checkpoint).
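Since --timeout takes seconds, longer runtimes are easiest to express with shell arithmetic:

```shell
# Convert hours to the seconds value --timeout expects.
hours=2
timeout_seconds=$((hours * 3600))
echo "$timeout_seconds"   # 7200
```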
Cloudflare Bypass
For spiders with CLOUDFLARE_ENABLED: true in settings:
./scrapai crawl cloudflare_spider --project myproject
Linux (headless server):
🖥️ Headless environment detected - using xvfb for Cloudflare bypass
Requires xvfb installed:
sudo apt-get install xvfb
ScrapAI automatically wraps the command with xvfb-run -a when needed.
macOS/Windows:
🖥️ Display available - using native browser for Cloudflare bypass
Uses system display for browser automation.
Sitemap Spider
For spiders with USE_SITEMAP: true in settings:
Crawls from XML sitemap instead of following links. Faster for sites with good sitemaps.
crawl-all
Run all active spiders in a project sequentially.
Syntax
./scrapai crawl-all --project <name> [--limit <N>]
Options
--limit <N>: Limit items per spider (test mode).
Example
./scrapai crawl-all --project news --limit 10
Output:
🚀 Running all spiders for project: news
🕷️ Spiders: bbc_co_uk, cnn_com, reuters_com
==================================================
Running: bbc_co_uk
==================================================
🚀 Running DB spider: bbc_co_uk
[...]
==================================================
Running: cnn_com
==================================================
🚀 Running DB spider: cnn_com
[...]
==================================================
Running: reuters_com
==================================================
🚀 Running DB spider: reuters_com
[...]
For parallel execution across spiders, use the GNU parallel-based wrapper: bin/parallel-crawl <project>.
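The sequential behavior shown above amounts to a simple loop over the project's active spiders. A standalone sketch (the actual crawl invocation is commented out so the loop runs on its own; spider names are from the example output):

```shell
ran=""
for spider in bbc_co_uk cnn_com reuters_com; do
  echo "Running: $spider"
  # ./scrapai crawl "$spider" --project news --limit 10
  ran="$ran $spider"
done
echo "Done:$ran"
```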
JSONL (Production)
Each line is a JSON object:
{"url": "https://bbc.co.uk/news/article-1", "title": "Article Title", "content": "...", "html": "<!DOCTYPE html>...", "scraped_at": "2026-02-28T15:30:42"}
{"url": "https://bbc.co.uk/news/article-2", "title": "Another Article", "content": "...", "html": "<!DOCTYPE html>...", "scraped_at": "2026-02-28T15:31:15"}
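Because each item occupies exactly one line, ordinary line-oriented tools work directly on exports. A sketch against a two-line mock file (fields trimmed from the example above):

```shell
f=$(mktemp)
# Mock export: one JSON object per line, as in production output.
cat > "$f" <<'EOF'
{"url": "https://bbc.co.uk/news/article-1", "title": "Article Title"}
{"url": "https://bbc.co.uk/news/article-2", "title": "Another Article"}
EOF
# Line count == item count.
count=$(wc -l < "$f" | tr -d ' ')
echo "Items exported: $count"
```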
Database (Test Mode)
Stored in scraped_items table:
SELECT id, url, title, scraped_at
FROM scraped_items
WHERE spider_id = (SELECT id FROM spiders WHERE name = 'bbc_co_uk');
Concurrent Requests
Adjust in spider settings:
{
  "settings": {
    "CONCURRENT_REQUESTS": 16
  }
}
Default: 8. Higher values = faster crawls, but risk IP bans.
Download Delay
Polite crawling delay between requests:
{
  "settings": {
    "DOWNLOAD_DELAY": 2
  }
}
Default: 0. Recommended: 1-3 seconds for faster crawls, 5 or more for polite crawling.
Incremental Crawling
Enable DeltaFetch to skip already-scraped URLs:
{
  "settings": {
    "DELTAFETCH_ENABLED": true
  }
}
Reduces bandwidth by 80-90% on routine re-crawls.
Troubleshooting
Spider Not Found
❌ Spider 'myspider' not found in database.
Solution: Import the spider first:
./scrapai spiders import myspider.json --project myproject
./scrapai spiders list --project myproject
Permission Denied (Output File)
PermissionError: [Errno 13] Permission denied: 'data/news/spider/crawls/crawl.jsonl'
Solution: Check DATA_DIR permissions:
ls -la data/
chmod -R u+w data/
Cloudflare Bypass Failed
⚠️ WARNING: Cloudflare bypass enabled but no display available and xvfb not installed
Solution (Linux):
sudo apt-get install xvfb
Checkpoint Corruption
If resume fails with errors:
rm -rf data/<project>/<spider>/checkpoint
./scrapai crawl <spider> --project <project>
Next Steps