Crawl hundreds of websites in parallel with automatic resource management and intelligent parallelism detection.

Overview

The parallel-crawl script uses GNU parallel to run multiple ScrapAI spiders concurrently. It automatically detects system resources (CPU cores, available memory) and calculates optimal parallelism based on spider types (regular vs. Cloudflare-enabled).

Quick Start

Install GNU Parallel

brew install parallel

Run All Spiders in Project

bin/parallel-crawl myproject

Run Specific Spiders

bin/parallel-crawl myproject spider1 spider2 spider3

How It Works

From bin/parallel-crawl:1-134:
#!/bin/bash
# Parallel crawler using GNU parallel

set -euo pipefail

PROJECT="$1"
shift

# Get spider list
if [ $# -eq 0 ]; then
    SPIDERS=$(./scrapai spiders list --project "$PROJECT" | grep '•' | awk '{print $2}')
else
    SPIDERS="$*"
fi

# Count Cloudflare-enabled spiders
CF_COUNT=$(python3 -c "
import sys

from core.db import get_db
from core.models import Spider

db = next(get_db())
names = sys.argv[1:]
count = 0
for name in names:
    spider = db.query(Spider).filter(Spider.name == name).first()
    if spider:
        for s in spider.settings:
            if s.key == 'CLOUDFLARE_ENABLED' and str(s.value).lower() in ('true', '1'):
                count += 1
                break
print(count)
" $SPIDERS)

# Auto-detect parallelism from system resources
CPU_CORES=$(nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 4)
# free(1) is Linux-only; assume a conservative 4096 MB elsewhere
AVAILABLE_MEM_MB=$(free -m 2>/dev/null | awk '/^Mem:/ {print $7}' || echo 4096)

SPIDER_COUNT=$(echo "$SPIDERS" | wc -w)
REGULAR_COUNT=$(( SPIDER_COUNT - CF_COUNT ))

# Memory per spider: regular 200MB, Cloudflare 500MB
if [ "$CF_COUNT" -eq 0 ]; then
    MEM_PER_SPIDER=200
elif [ "$CF_COUNT" -eq "$SPIDER_COUNT" ]; then
    MEM_PER_SPIDER=500
else
    MEM_PER_SPIDER=$(( (REGULAR_COUNT * 200 + CF_COUNT * 500) / SPIDER_COUNT ))
fi

MEM_PARALLEL=$(( (AVAILABLE_MEM_MB - 2048) / MEM_PER_SPIDER ))
CPU_PARALLEL=$(( CPU_CORES * 80 / 100 ))
PARALLEL=$(( MEM_PARALLEL < CPU_PARALLEL ? MEM_PARALLEL : CPU_PARALLEL ))

# Clamp to a safe range (2-20)
if [ "$PARALLEL" -lt 2 ]; then PARALLEL=2; fi
if [ "$PARALLEL" -gt 20 ]; then PARALLEL=20; fi

# Run crawls in parallel
echo "$SPIDERS" | tr ' ' '\n' | parallel \
    -j "$PARALLEL" \
    --timeout 8h \
    --halt soon,fail=50% \
    --line-buffer \
    --tagstring "[{.}]" \
    "./scrapai crawl {} --project $PROJECT"

Resource Calculation

The script automatically calculates optimal parallelism:
  • Memory allocation: Regular spiders (200 MB), Cloudflare spiders (500 MB)
  • CPU limit: 80% of available cores
  • Final parallelism: Minimum of memory and CPU limits, clamped between 2-20
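The same arithmetic can be traced through by hand. The sketch below assumes 8 cores, 16 GB of free memory, and the 47-spider fleet (12 Cloudflare, 35 regular) from the example output:

```shell
# Sketch of the parallelism arithmetic with assumed inputs:
# 8 CPU cores, 16384 MB available memory, 47 spiders (12 Cloudflare, 35 regular).
CPU_CORES=8
AVAILABLE_MEM_MB=16384
SPIDER_COUNT=47
CF_COUNT=12
REGULAR_COUNT=$(( SPIDER_COUNT - CF_COUNT ))

# Weighted average memory per spider: (35*200 + 12*500) / 47 = 276 MB
MEM_PER_SPIDER=$(( (REGULAR_COUNT * 200 + CF_COUNT * 500) / SPIDER_COUNT ))

# Reserve 2 GB for the OS, then take the smaller of the memory and CPU limits
MEM_PARALLEL=$(( (AVAILABLE_MEM_MB - 2048) / MEM_PER_SPIDER ))
CPU_PARALLEL=$(( CPU_CORES * 80 / 100 ))
PARALLEL=$(( MEM_PARALLEL < CPU_PARALLEL ? MEM_PARALLEL : CPU_PARALLEL ))

# Clamp to the documented 2-20 range
if [ "$PARALLEL" -lt 2 ]; then PARALLEL=2; fi
if [ "$PARALLEL" -gt 20 ]; then PARALLEL=20; fi

echo "$PARALLEL"
```

With these inputs the memory limit is 51 jobs but the CPU limit is 6, so CPU is the binding constraint and 6 jobs run in parallel.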

Example Output

$ bin/parallel-crawl news

==========================================
Parallel Crawler
==========================================
Project:  news
Spiders:  47 (12 CF + 35 regular)
Parallel: 8 jobs
Timeout:  8h per spider
==========================================

Continue? (y/N): y

Starting parallel crawl...

[bbc_co_uk]  Starting crawl...
[guardian]   Starting crawl...
[reuters]    Starting crawl...
[cnn]        Starting crawl...
[bbc_co_uk]  ✓ Crawled 1,247 pages
[guardian]   ✓ Crawled 892 pages
[ap_news]    Starting crawl...
[reuters]    ✓ Crawled 2,103 pages
...

Advanced Usage

Custom Parallelism

Override auto-detection:
# Force 4 parallel jobs
echo "$SPIDERS" | tr ' ' '\n' | parallel -j 4 \
    "./scrapai crawl {} --project myproject"

Timeout Control

# 2-hour timeout per spider
parallel --timeout 2h ...

# No timeout (dangerous for stuck spiders)
parallel --timeout 0 ...

Failure Handling

From bin/parallel-crawl:127:
--halt soon,fail=50%
Stops scheduling new jobs once 50% or more have failed (jobs already running are left to finish). This prevents wasting resources on a broken configuration. Other halt strategies:
--halt now,fail=1     # Stop immediately on first failure
--halt soon,fail=10%  # Stop if 10% fail
--halt never          # Continue even if all fail

Progress Monitoring

# Add progress bar
parallel --progress ...

# Show ETA
parallel --eta ...

# Both
parallel --progress --eta ...

Job Log

# Log all job completions
parallel --joblog crawl_log.txt ...

# Resume from log (skip completed jobs)
parallel --joblog crawl_log.txt --resume ...
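The joblog is tab-separated with the columns Seq, Host, Starttime, JobRuntime, Send, Receive, Exitval, Signal, and Command, so failed crawls can be pulled out with awk. The two-line log below is fabricated for illustration:

```shell
# Find failed jobs in a GNU parallel joblog (fabricated sample for illustration).
# Joblog columns: Seq Host Starttime JobRuntime Send Receive Exitval Signal Command
printf 'Seq\tHost\tStarttime\tJobRuntime\tSend\tReceive\tExitval\tSignal\tCommand\n' >  crawl_log.txt
printf '1\t:\t1700000000.0\t120.5\t0\t4096\t0\t0\t./scrapai crawl bbc_co_uk --project news\n' >> crawl_log.txt
printf '2\t:\t1700000000.0\t30.2\t0\t512\t1\t0\t./scrapai crawl broken_spider --project news\n' >> crawl_log.txt

# Exitval is column 7; print the command (column 9) of every failed job
FAILED=$(awk -F'\t' 'NR > 1 && $7 != 0 {print $9}' crawl_log.txt)
echo "$FAILED"
```

The failed commands can then be re-run by hand, or skipped automatically with --resume as shown above.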

Resource Management

  • Regular spiders: 200 MB allocation
  • Cloudflare spiders: 500 MB allocation (includes browser overhead)
  • Output files: Separate paths per spider prevent I/O contention
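For a quick sanity check before launching, the peak footprint of one batch can be estimated from these allocations. The parallelism and spider mix below are assumed values:

```shell
# Estimate peak memory for one batch: regular spiders at 200 MB, Cloudflare at 500 MB.
# Assumed batch: 8 concurrent jobs, 2 of them Cloudflare-enabled.
PARALLEL=8
CF_IN_BATCH=2
REGULAR_IN_BATCH=$(( PARALLEL - CF_IN_BATCH ))
BATCH_MB=$(( REGULAR_IN_BATCH * 200 + CF_IN_BATCH * 500 ))
echo "Peak batch footprint: ${BATCH_MB} MB"
```

If the result approaches available memory minus the 2 GB OS reserve, lower the parallelism.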

Best Practices

By Fleet Size

# Small fleet (< 10 spiders): Run all at once
bin/parallel-crawl myproject

# Medium fleet (10-50): Prioritize important spiders first
bin/parallel-crawl myproject priority_spider1 priority_spider2

# Large fleet (50+): Split by type
bin/parallel-crawl myproject $(./scrapai spiders list --project myproject | grep -i cloudflare | awk '{print $2}')
bin/parallel-crawl myproject $(./scrapai spiders list --project myproject | grep -v cloudflare | awk '{print $2}')

Memory-Constrained Systems

# Reduce parallelism or run sequentially
echo "$SPIDERS" | tr ' ' '\n' | parallel -j 2 ...

Comparison with Airflow

Feature       Parallel-Crawl             Airflow
Setup         None (just GNU parallel)   Docker + configuration
Scheduling    Cron jobs                  Built-in scheduler
Monitoring    Terminal output + logs     Web UI + graphs
Parallelism   Auto-detected              Manual configuration
Retry logic   Manual (rerun command)     Automatic with backoff
Use case      Ad-hoc batch crawls        Production scheduling

When to use parallel-crawl:
  • One-time crawls of many sites
  • Testing spider fleet
  • Resource-constrained environments
  • Simple cron-based scheduling
When to use Airflow:
  • Production deployments
  • Complex dependencies between spiders
  • Team collaboration
  • Historical execution tracking

Integration with Cron

Daily Crawl of All Spiders

# crontab -e
0 2 * * * cd /path/to/scrapai-cli && bin/parallel-crawl news >> logs/crawl.log 2>&1
Runs at 2am daily.

Weekday vs. Weekend

# Weekdays: full crawl
0 2 * * 1-5 cd /path/to/scrapai-cli && bin/parallel-crawl news

# Weekends: high-priority only
0 2 * * 0,6 cd /path/to/scrapai-cli && bin/parallel-crawl news priority1 priority2

Staggered Batches

# Multiple batches throughout the day
0 2 * * * cd /path/to/scrapai-cli && bin/parallel-crawl news $(cat batch1.txt)
0 8 * * * cd /path/to/scrapai-cli && bin/parallel-crawl news $(cat batch2.txt)
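One way to produce those batch files is to split the full spider list in half with head and tail. The spider names below are placeholders; the real list would come from ./scrapai spiders list:

```shell
# Split a spider list into two batch files for staggered cron runs.
# Spider names here are placeholders.
printf '%s\n' bbc_co_uk guardian reuters cnn > all_spiders.txt

TOTAL=$(wc -l < all_spiders.txt)
HALF=$(( (TOTAL + 1) / 2 ))

# First half goes to batch1, the remainder to batch2
head -n "$HALF" all_spiders.txt > batch1.txt
tail -n +"$(( HALF + 1 ))" all_spiders.txt > batch2.txt

echo "batch1: $(tr '\n' ' ' < batch1.txt)"
echo "batch2: $(tr '\n' ' ' < batch2.txt)"
```
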

Troubleshooting

GNU Parallel Not Found

 GNU parallel is not installed

# Install on macOS
brew install parallel

# Install on Linux
sudo apt-get install parallel

Out of Memory Errors

Symptom: Spiders crash with “Killed” or OOM errors.

Solution: Reduce parallelism or split the fleet.
# Check available memory
free -h

# Reduce parallelism
echo "$SPIDERS" | tr ' ' '\n' | parallel -j 2 ...

Some Spiders Timeout

Symptom: “SIGTERM” or timeout messages.

Solution: Increase the timeout or exclude slow spiders.
# Increase timeout
parallel --timeout 12h ...

# Run slow spiders separately
bin/parallel-crawl news fast_spider1 fast_spider2
./scrapai crawl slow_spider --project news  # Run alone

Jobs Not Starting

Check if parallel is actually running:
ps aux | grep parallel
Check logs:
tail -f ~/.parallel/tmp/*

See Also

Airflow Integration

Production scheduling with Apache Airflow

Checkpoint Resume

Pause and resume long crawls