Overview
The parallel-crawl script uses GNU parallel to run multiple ScrapAI spiders concurrently. It automatically detects system resources (CPU cores, available memory) and calculates the optimal level of parallelism based on the mix of spider types (regular vs. Cloudflare-enabled).
Quick Start
Install GNU Parallel
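GNU parallel is packaged for most systems; typical install commands (package names may vary by distribution):

```shell
# Debian/Ubuntu
sudo apt-get install parallel

# macOS (Homebrew)
brew install parallel

# Fedora
sudo dnf install parallel

# Verify the install
parallel --version | head -1
```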
Run All Spiders in Project
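A minimal sketch, assuming the script crawls every spider in the project when given no arguments:

```shell
# Run every spider with auto-detected parallelism
bin/parallel-crawl
```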
Run Specific Spiders
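Assuming spider names are passed as positional arguments (the names below are placeholders):

```shell
# Crawl only the named spiders
bin/parallel-crawl shop_a shop_b shop_c
```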
How It Works
From bin/parallel-crawl:1-134:
Resource Calculation
The script automatically calculates optimal parallelism:

- Memory allocation: Regular spiders (200 MB), Cloudflare spiders (500 MB)
- CPU limit: 80% of available cores
- Final parallelism: Minimum of memory and CPU limits, clamped between 2-20
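The calculation above can be sketched in shell (the sample values are assumptions; the real script detects free memory and core count from the system):

```shell
avail_mb=8000          # assumed free memory in MB (the script would detect this)
cores=8                # assumed core count (e.g. via nproc)
mem_per_spider=200     # 200 MB per regular spider; 500 MB for Cloudflare spiders

mem_limit=$(( avail_mb / mem_per_spider ))   # 40 jobs fit in memory
cpu_limit=$(( cores * 80 / 100 ))            # 6 jobs at 80% of 8 cores

# Final parallelism: the smaller of the two limits, clamped to 2-20
parallelism=$(( mem_limit < cpu_limit ? mem_limit : cpu_limit ))
[ "$parallelism" -lt 2 ] && parallelism=2
[ "$parallelism" -gt 20 ] && parallelism=20

echo "$parallelism"    # → 6
```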
Example Output
Advanced Usage
Custom Parallelism
Override auto-detection:

Timeout Control
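Both the job count and the per-job timeout map onto standard GNU parallel flags; a minimal sketch (the `scrapy crawl` command is an assumption about how spiders are launched, and the spider names are placeholders):

```shell
# -j overrides the auto-detected job count; --timeout kills any job
# that runs longer than the given number of seconds (here 30 minutes).
parallel -j 4 --timeout 1800 'scrapy crawl {}' ::: shop_a shop_b shop_c
```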
Failure Handling
From bin/parallel-crawl:127:
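The script's exact handling lives at that line; as a general pattern, GNU parallel can record per-job results and rerun only the failures (the flags are standard GNU parallel, the crawl command is a placeholder):

```shell
# Keep going on individual failures, but record every job's exit status
parallel --joblog crawl.log 'scrapy crawl {}' ::: shop_a shop_b shop_c

# Later: rerun only the jobs that failed in the previous run
parallel --resume-failed --joblog crawl.log 'scrapy crawl {}' ::: shop_a shop_b shop_c
```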
Progress Monitoring
Job Log
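The --joblog file is tab-separated with columns Seq, Host, Starttime, JobRuntime, Send, Receive, Exitval, Signal, Command. A failed-job filter over a sample log (the log contents are illustrative):

```shell
# Create a sample joblog (field 7 is the exit value)
printf 'Seq\tHost\tStarttime\tJobRuntime\tSend\tReceive\tExitval\tSignal\tCommand\n' > /tmp/joblog.tsv
printf '1\t:\t1700000000.0\t12.3\t0\t0\t0\t0\tscrapy crawl shop_a\n' >> /tmp/joblog.tsv
printf '2\t:\t1700000001.0\t45.6\t0\t0\t1\t0\tscrapy crawl shop_b\n' >> /tmp/joblog.tsv

# List the commands of failed jobs (Exitval != 0), skipping the header
awk -F'\t' 'NR > 1 && $7 != 0 {print $9}' /tmp/joblog.tsv
# → scrapy crawl shop_b
```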
Resource Management
- Regular spiders: 200 MB allocation
- Cloudflare spiders: 500 MB allocation (includes browser overhead)
- Output files: Separate paths per spider prevent I/O contention
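When memory is the binding constraint, GNU parallel can also defer new jobs until enough memory is free (--memfree is a standard flag; 500M matches the Cloudflare allocation above, and the command is a placeholder):

```shell
# Do not start another job unless at least 500 MB of memory is free
parallel --memfree 500M 'scrapy crawl {}' ::: cf_shop_a cf_shop_b
```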
Best Practices
By Fleet Size
Memory-Constrained Systems
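On small machines, deriving the job count directly from available memory is the simplest safeguard. A sketch for Linux, at 500 MB per Cloudflare-enabled spider (the commented-out invocation is a placeholder):

```shell
# Read available memory in MB from /proc/meminfo (Linux only)
free_mb=$(awk '/MemAvailable/ {print int($2 / 1024)}' /proc/meminfo)
jobs=$(( free_mb / 500 ))
[ "$jobs" -lt 1 ] && jobs=1   # always allow at least one job

echo "running $jobs spiders at a time"
# parallel -j "$jobs" 'scrapy crawl {}' ::: cf_shop_a cf_shop_b
```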
Comparison with Airflow
| Feature | Parallel-Crawl | Airflow |
|---|---|---|
| Setup | None (just GNU parallel) | Docker + configuration |
| Scheduling | Cron jobs | Built-in scheduler |
| Monitoring | Terminal output + logs | Web UI + graphs |
| Parallelism | Auto-detected | Manual configuration |
| Retry logic | Manual (rerun command) | Automatic with backoff |
| Use case | Ad-hoc batch crawls | Production scheduling |
Use parallel-crawl for:

- One-time crawls of many sites
- Testing spider fleet
- Resource-constrained environments
- Simple cron-based scheduling
Use Airflow for:

- Production deployments
- Complex dependencies between spiders
- Team collaboration
- Historical execution tracking
Integration with Cron
Daily Crawl of All Spiders
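A sketch of a crontab entry (the install path, log path, and schedule are assumptions):

```
# Run the full fleet every day at 02:00, appending output to a log
0 2 * * * cd /opt/scrapai && bin/parallel-crawl >> /var/log/scrapai/parallel-crawl.log 2>&1
```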
Weekday vs. Weekend
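For example, run the whole fleet on weekends and only a fast subset on weekdays (paths, times, and spider names are placeholders):

```
# Full fleet on Saturday and Sunday at 01:00
0 1 * * 6,0 cd /opt/scrapai && bin/parallel-crawl

# Fast spiders only, Monday through Friday at 03:00
0 3 * * 1-5 cd /opt/scrapai && bin/parallel-crawl shop_a shop_b
```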
Staggered Batches
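Splitting the fleet into batches a few hours apart spreads the load (batch contents and times are placeholders):

```
# Batch 1 at 01:00, batch 2 at 03:00
0 1 * * * cd /opt/scrapai && bin/parallel-crawl shop_a shop_b shop_c
0 3 * * * cd /opt/scrapai && bin/parallel-crawl shop_d shop_e shop_f
```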
Troubleshooting
GNU Parallel Not Found
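A portable check for whether the binary is on PATH (the check_dep helper is ours, not part of the script):

```shell
# Report whether a command exists on PATH
check_dep() {
  command -v "$1" >/dev/null 2>&1 && echo "ok" || echo "missing: $1"
}

check_dep sh         # → ok (sh is always present)
check_dep parallel   # prints "missing: parallel" if GNU parallel is not installed
```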
Out of Memory Errors
Symptom: Spiders crash with “Killed” or OOM errors.

Solution: Reduce parallelism or split the fleet into smaller batches.

Some Spiders Timeout

Symptom: “SIGTERM” or timeout messages in the output.

Solution: Increase the timeout or exclude the slow spiders.

Jobs Not Starting
Check if parallel is actually running:

See Also
Airflow Integration
Production scheduling with Apache Airflow
Checkpoint Resume
Pause and resume long crawls