Overview
The parallel-crawl script uses GNU parallel to run multiple ScrapAI spiders concurrently. It automatically detects system resources (CPU cores, available memory) and calculates optimal parallelism based on spider types (regular vs. Cloudflare-enabled).
Quick Start
Install GNU Parallel
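GNU parallel is packaged as `parallel` on the common package managers. The commands below are the usual ones for each platform and are listed as comments so nothing is installed by accident; the `have_parallel` helper is an illustrative name, not part of the script.

```shell
# Typical install commands (run the one that matches your system):
#   Debian/Ubuntu:  sudo apt-get install parallel
#   Fedora:         sudo dnf install parallel
#   macOS:          brew install parallel

# Succeeds when a `parallel` executable is on PATH.
have_parallel() {
  command -v parallel >/dev/null 2>&1
}

if have_parallel; then
  # GNU parallel prints its name and version on the first line.
  parallel --version | head -n 1
else
  echo "parallel not found; install it with one of the commands above" >&2
fi
```

Some systems ship a different `parallel` from the moreutils package, so it is worth confirming that the version line mentions GNU parallel.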
Run All Spiders in Project
Run Specific Spiders
How It Works
From bin/parallel-crawl:1-134:
Resource Calculation
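As a rough illustration of the memory detection step, the sketch below reads the kernel's `MemAvailable` figure on Linux. It assumes `/proc/meminfo` is present; the function name `mem_available_mb` is hypothetical, and the script itself may detect memory differently.

```shell
# mem_available_mb [MEMINFO_FILE]
# Prints the MemAvailable estimate in MB, rounded down.
# /proc/meminfo reports the value in kB, so divide by 1024.
mem_available_mb() {
  awk '/^MemAvailable:/ { printf "%d\n", $2 / 1024 }' "${1:-/proc/meminfo}"
}
```

Passing a file path as the first argument makes the function easy to try against a saved copy of `/proc/meminfo`.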
Memory-Based Parallelism
The script allocates memory per spider type:
- Regular spiders: 200 MB each
- Cloudflare spiders: 500 MB each (browser automation overhead)
- Mixed fleet: Weighted average
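The per-type budgets above could be turned into a job count roughly as follows. This is a sketch under assumptions: it divides available memory by the fleet's weighted-average budget using integer arithmetic, and the name `mem_jobs` is illustrative. The script's exact formula may differ.

```shell
# Per-spider memory budgets from the list above, in MB.
REGULAR_MB=200
CLOUDFLARE_MB=500

# mem_jobs AVAILABLE_MB N_REGULAR N_CLOUDFLARE
# Prints how many spiders fit in AVAILABLE_MB, based on the
# weighted-average budget of the fleet (rounded down, minimum 1).
mem_jobs() (
  available_mb=$1
  n_regular=$2
  n_cloudflare=$3
  total=$((n_regular + n_cloudflare))
  if [ "$total" -eq 0 ]; then
    echo 0   # nothing to run
  else
    avg_mb=$(( (n_regular * REGULAR_MB + n_cloudflare * CLOUDFLARE_MB) / total ))
    jobs=$((available_mb / avg_mb))
    if [ "$jobs" -lt 1 ]; then jobs=1; fi
    echo "$jobs"
  fi
)
```

For example, `mem_jobs 8000 3 1` averages (3 × 200 + 1 × 500) / 4 = 275 MB per spider and prints 29.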
CPU-Based Parallelism
Final Parallelism
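A minimal sketch of how the memory and CPU limits might be combined. It assumes the CPU limit is one job per core and that the final value is the smaller of the two limits; both are assumptions, and `cpu_count` and `final_jobs` are illustrative names rather than functions from the script.

```shell
# cpu_count: number of online CPU cores, with portable fallbacks.
cpu_count() {
  if command -v nproc >/dev/null 2>&1; then
    nproc
  elif getconf _NPROCESSORS_ONLN >/dev/null 2>&1; then
    getconf _NPROCESSORS_ONLN
  else
    echo 1
  fi
}

# final_jobs MEM_JOBS CPU_JOBS
# Prints the smaller of the two limits, but never less than 1.
final_jobs() (
  jobs=$1
  if [ "$2" -lt "$jobs" ]; then jobs=$2; fi
  if [ "$jobs" -lt 1 ]; then jobs=1; fi
  echo "$jobs"
)
```

With a memory limit of 29 jobs on an 8-core machine, `final_jobs 29 8` prints 8, so the CPU count is the binding constraint.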
Example Output
Advanced Usage
Custom Parallelism
Override auto-detection:
Timeout Control
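At the GNU parallel level, a fixed job count and a per-job timeout correspond to the `--jobs` and `--timeout` options. The sketch below only prints such a command so it can be inspected first. The flags accepted by parallel-crawl itself are not shown on this page, and `./run-one-spider.sh` and `spiders.txt` are hypothetical placeholders.

```shell
# parallel_cmd JOBS TIMEOUT_SECONDS SPIDER_LIST_FILE
# Prints (does not run) a GNU parallel invocation:
#   --jobs N      run at most N jobs at once
#   --timeout S   kill any job that runs longer than S seconds
#   :::: FILE     read one argument per line from FILE
# ./run-one-spider.sh is a hypothetical per-spider wrapper.
parallel_cmd() {
  printf 'parallel --jobs %s --timeout %s ./run-one-spider.sh {} :::: %s\n' \
    "$1" "$2" "$3"
}

parallel_cmd 4 3600 spiders.txt
```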
Failure Handling
From bin/parallel-crawl:127:
Progress Monitoring
Job Log
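GNU parallel's `--joblog` option writes a tab-separated file with the columns Seq, Host, Starttime, JobRuntime, Send, Receive, Exitval, Signal, and Command. Assuming the script produces such a file (its location is not stated here), failed jobs could be listed with a small helper like the one below; `failed_jobs` is an illustrative name.

```shell
# failed_jobs JOBLOG
# Prints the command of every job that exited non-zero (column 7)
# or was killed by a signal (column 8). Line 1 is the header.
failed_jobs() {
  awk -F '\t' 'NR > 1 && ($7 != 0 || $8 != 0) { print $9 }' "$1"
}
```

GNU parallel can also rerun the failed entries of a job log directly with `parallel --retry-failed --joblog FILE`.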
Resource Management
Memory Limits
Why 200 MB for regular spiders?
- Scrapy framework: ~50 MB
- Downloaded pages in memory: ~100 MB
- Extraction libraries: ~50 MB
Why 500 MB for Cloudflare spiders?
- Regular spider base: 200 MB
- Browser process (Chromium): ~200 MB
- Rendering overhead: ~100 MB
CPU Scheduling
GNU parallel starts each job as an ordinary process, so the operating system's scheduler shares CPU time among them:
- Jobs share CPU time roughly equally
- I/O-bound tasks (most scrapers) yield CPU automatically
- Network-bound tasks have minimal CPU impact
Disk I/O
Each spider writes to a separate output file:
Patterns and Best Practices
Small Fleet (< 10 spiders)
Medium Fleet (10-50 spiders)
Large Fleet (50+ spiders)
Split by type:
Memory-Constrained Systems
Comparison with Airflow
| Feature | Parallel-Crawl | Airflow |
|---|---|---|
| Setup | None (just GNU parallel) | Docker + configuration |
| Scheduling | Cron jobs | Built-in scheduler |
| Monitoring | Terminal output + logs | Web UI + graphs |
| Parallelism | Auto-detected | Manual configuration |
| Retry logic | Manual (rerun command) | Automatic with backoff |
| Use case | Ad-hoc batch crawls | Production scheduling |
Use parallel-crawl for:
- One-time crawls of many sites
- Testing a spider fleet
- Resource-constrained environments
- Simple cron-based scheduling
Use Airflow for:
- Production deployments
- Complex dependencies between spiders
- Team collaboration
- Historical execution tracking