Crawl commands execute spiders from the database. ScrapAI supports two modes: test mode (limited items, saved to database) and production mode (full crawl, exported to files with checkpoint support).
crawl
Run a spider by name.
./scrapai crawl <spider> --project <name> [options]
Arguments & Options
<spider>: Spider name (from database).
--project <name>: Project name containing the spider.
--limit <N>: Limit the number of items to scrape. Enables test mode when specified.
Custom output file path (default: timestamped file in data/<project>/<spider>/crawls/).
--timeout <seconds>: Maximum runtime in seconds. Triggers graceful shutdown when exceeded.
--proxy-type <mode>: Proxy strategy:
auto: Smart escalation with expert-in-the-loop (default)
datacenter: Use datacenter proxy explicitly
residential: Use residential proxy explicitly
Use browser automation for JavaScript-rendered sites and Cloudflare bypass. Enables hybrid mode: browser for initial challenge, then fast HTTP requests.
Clear DeltaFetch cache to re-crawl previously scraped URLs.
Save raw HTML in output files (production mode only). Increases file size.
Additional Scrapy arguments. Format: "-s SETTING=value -L DEBUG".
Test Mode (with --limit)
./scrapai crawl bbc_co_uk --project news --limit 5
Behavior:
Stops after N items
Saves to database (scraped_items table)
No HTML content stored
View with show command
No checkpoint support
Output:
🚀 Running DB spider: bbc_co_uk
🔄 Proxy mode: auto (smart escalation with expert-in-the-loop)
🧪 Test mode: Saving to database (limit: 5 items)
Use './scrapai show bbc_co_uk' to verify results
[Scrapy crawl output...]
2026-02-28 15:30:42 [scrapy.core.engine] INFO: Spider closed (closespider_itemcount)
Use test mode to verify spider configuration before running a full production crawl.
Production Mode (no limit)
./scrapai crawl bbc_co_uk --project news
Behavior:
Crawls all matching URLs
Exports to timestamped JSONL: data/news/bbc_co_uk/crawls/crawl_28022026_153042.jsonl
Includes full HTML content
Database disabled (performance)
Checkpoint enabled (Ctrl+C to pause, resume with same command)
Output:
🚀 Running DB spider: bbc_co_uk
🔄 Proxy mode: auto (smart escalation with expert-in-the-loop)
📁 Production mode: Exporting to files (database disabled)
💾 Checkpoint enabled: data/news/bbc_co_uk/checkpoint
Press Ctrl+C to pause, run same command to resume
Output: data/news/bbc_co_uk/crawls/crawl_28022026_153042.jsonl (includes HTML)
[Scrapy crawl output...]
2026-02-28 17:45:30 [scrapy.core.engine] INFO: Spider closed (finished)
✓ Checkpoint cleaned up (successful completion)
Checkpoint Pause/Resume
Press Ctrl+C to pause. Checkpoint is saved at data/news/bbc_co_uk/checkpoint/. Resume by running the same command:
./scrapai crawl bbc_co_uk --project news
🚀 Running DB spider: bbc_co_uk
💾 Checkpoint enabled: data/news/bbc_co_uk/checkpoint
Resuming from previous crawl...
Checkpoint stores:
URL queue state
Visited URLs (for deduplication)
Spider state variables
Proxy type used (if proxy type changes, checkpoint is cleared)
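The proxy-type rule above can be sketched as follows. The `meta.json` filename and its layout are assumptions for illustration only, not ScrapAI's actual checkpoint format:

```python
# Sketch: discard the checkpoint when the recorded proxy type differs
# from the one requested now, so every URL is retried with the new proxy.
# The "meta.json" file is a hypothetical stand-in for checkpoint metadata.
import json
import shutil
from pathlib import Path

def load_or_clear_checkpoint(checkpoint_dir, proxy_type):
    meta = Path(checkpoint_dir) / "meta.json"
    if not meta.exists():
        return None                          # no previous crawl to resume
    state = json.loads(meta.read_text())
    if state.get("proxy_type") != proxy_type:
        shutil.rmtree(checkpoint_dir)        # proxy changed: start fresh
        return None
    return state                             # resume with saved state
```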
Proxy Modes
Auto (Smart Escalation)
./scrapai crawl myspider --project myproject --proxy-type auto
Default behavior. Starts with direct connections, escalates to datacenter proxy if blocked (403/429). Residential proxy requires explicit opt-in.
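The escalation policy can be sketched as a small state function. `next_proxy` and its arguments are illustrative names, not part of ScrapAI's API:

```python
# Sketch of the "auto" policy: keep the current connection mode until a
# block (HTTP 403/429), then escalate direct -> datacenter. Residential
# is only reached with explicit opt-in, mirroring the documented behavior.
BLOCK_CODES = {403, 429}

def next_proxy(current, status, allow_residential=False):
    if status not in BLOCK_CODES:
        return current                       # not blocked: no escalation
    if current == "direct":
        return "datacenter"                  # first escalation step
    if current == "datacenter" and allow_residential:
        return "residential"                 # requires explicit opt-in
    return current                           # no further automatic step
```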
Datacenter Proxy
./scrapai crawl myspider --project myproject --proxy-type datacenter
Requires .env configuration:
DATACENTER_PROXY_USERNAME=your_username
DATACENTER_PROXY_PASSWORD=your_password
DATACENTER_PROXY_HOST=proxy.example.com
DATACENTER_PROXY_PORT=10000
Residential Proxy
./scrapai crawl myspider --project myproject --proxy-type residential
Expensive; use sparingly. Requires .env configuration.
Changing proxy type mid-crawl clears the checkpoint to ensure all URLs are retried with the new proxy.
Timeout
./scrapai crawl myspider --project myproject --timeout 7200 # 2 hours
Triggers graceful shutdown when reached (finishes in-flight requests, saves checkpoint).
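A minimal sketch of a graceful timeout loop in this spirit; `fetch` and `save_checkpoint` are stand-in callables, not ScrapAI internals:

```python
# Sketch: once the deadline passes, stop scheduling new URLs and persist
# the remaining queue as a checkpoint instead of aborting mid-request.
import time

def crawl_with_timeout(urls, timeout, fetch, save_checkpoint):
    start = time.monotonic()
    remaining = list(urls)
    while remaining:
        if time.monotonic() - start >= timeout:
            save_checkpoint(remaining)       # graceful shutdown: save queue
            return "paused"
        fetch(remaining.pop(0))              # process next URL
    return "finished"
```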
Cloudflare Bypass
For spiders with CLOUDFLARE_ENABLED: true:
./scrapai crawl cloudflare_spider --project myproject
Linux: Requires xvfb (sudo apt-get install xvfb). Automatically wraps with xvfb-run -a.
macOS/Windows: Uses native browser with system display.
Sitemap Spider
For spiders with USE_SITEMAP: true, crawls from XML sitemap instead of following links.
crawl-all
Run all active spiders in a project sequentially.
./scrapai crawl-all --project <name> [--limit <N>]
--limit <N>: Limit items per spider (enables test mode).
./scrapai crawl-all --project news --limit 10
For parallel execution, use GNU parallel with bin/parallel-crawl <project>.
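The sequential behavior of `crawl-all` can be sketched as a loop over the project's spiders. `build_cmd`, `crawl_all`, and the spider list are illustrative assumptions, not ScrapAI code:

```python
# Sketch: run each active spider one after another, invoking the same
# `./scrapai crawl` command documented above for every spider.
import subprocess

def build_cmd(spider, project, limit=None):
    cmd = ["./scrapai", "crawl", spider, "--project", project]
    if limit is not None:
        cmd += ["--limit", str(limit)]       # per-spider test-mode limit
    return cmd

def crawl_all(spiders, project, limit=None, runner=subprocess.run):
    for spider in spiders:                   # strictly sequential
        runner(build_cmd(spider, project, limit), check=False)
```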
JSONL (Production)
Each line is a JSON object:
{"url": "https://bbc.co.uk/news/article-1", "title": "Article Title", "content": "...", "html": "<!DOCTYPE html>...", "scraped_at": "2026-02-28T15:30:42"}
{"url": "https://bbc.co.uk/news/article-2", "title": "Another Article", "content": "...", "html": "<!DOCTYPE html>...", "scraped_at": "2026-02-28T15:31:15"}
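Because each line is an independent JSON object, the file can be streamed line by line. This sketch collects `url` and `title` while skipping the bulky `html` field:

```python
# Sketch: stream a production JSONL crawl file one record at a time,
# keeping only lightweight fields instead of loading full HTML into memory.
import json

def read_items(path):
    items = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            rec = json.loads(line)           # one item per line
            items.append({"url": rec["url"], "title": rec.get("title")})
    return items
```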
Database (Test Mode)
Stored in scraped_items table:
SELECT id, url, title, scraped_at
FROM scraped_items
WHERE spider_id = ( SELECT id FROM spiders WHERE name = 'bbc_co_uk' );
Concurrent Requests: Default 8. Higher = faster, but risks bans.
{"settings": {"CONCURRENT_REQUESTS": 16}}
Download Delay: Default 0. Recommended 1-3s (aggressive), 5+s (polite).
{"settings": {"DOWNLOAD_DELAY": 2}}
Incremental Crawling: Enable DeltaFetch to skip scraped URLs (80-90% bandwidth reduction).
{"settings": {"DELTAFETCH_ENABLED": true}}
Troubleshooting
Spider Not Found
❌ Spider 'myspider' not found in database.
Import the spider first:
./scrapai spiders import myspider.json --project myproject
Permission Denied
PermissionError: [Errno 13] Permission denied: 'data/news/spider/crawls/crawl.jsonl'
Check permissions: chmod -R u+w data/
Cloudflare Bypass Failed
⚠️ WARNING: Cloudflare bypass enabled but no display available and xvfb not installed
Linux: sudo apt-get install xvfb
Checkpoint Corruption
If resume fails:
rm -rf data/<project>/<spider>/checkpoint
./scrapai crawl <spider> --project <project>
Next Steps
View Scraped Data Inspect and export crawl results
Queue Management Batch process multiple websites