Crawl hundreds of websites in parallel with automatic resource management and intelligent parallelism detection.

Overview

The parallel-crawl script uses GNU parallel to run multiple ScrapAI spiders concurrently. It automatically detects system resources (CPU cores, available memory) and calculates optimal parallelism based on spider types (regular vs. Cloudflare-enabled).

Quick Start

Install GNU Parallel

brew install parallel

Run All Spiders in Project

bin/parallel-crawl myproject

Run Specific Spiders

bin/parallel-crawl myproject spider1 spider2 spider3

How It Works

From bin/parallel-crawl:1-134:
#!/bin/bash
# Parallel crawler using GNU parallel

set -euo pipefail

PROJECT="$1"
shift

# Get spider list
if [ $# -eq 0 ]; then
    SPIDERS=$(./scrapai spiders list --project "$PROJECT" | grep '•' | awk '{print $2}')
else
    SPIDERS="$*"
fi

# Count Cloudflare-enabled spiders
CF_COUNT=$(python3 -c "
import sys

from core.db import get_db
from core.models import Spider

db = next(get_db())
names = sys.argv[1:]
count = 0
for name in names:
    spider = db.query(Spider).filter(Spider.name == name).first()
    if spider:
        for s in spider.settings:
            if s.key == 'CLOUDFLARE_ENABLED' and str(s.value).lower() in ('true', '1'):
                count += 1
                break
print(count)
" $SPIDERS)

# Count totals for the weighted-average memory calculation
SPIDER_COUNT=$(echo "$SPIDERS" | wc -w)
REGULAR_COUNT=$(( SPIDER_COUNT - CF_COUNT ))

# Auto-detect parallelism from system resources
CPU_CORES=$(nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 4)
AVAILABLE_MEM_MB=$(free -m 2>/dev/null | awk '/^Mem:/ {print $7}')
AVAILABLE_MEM_MB=${AVAILABLE_MEM_MB:-8192}  # free(1) is unavailable on macOS; assume 8 GB

# Memory per spider: regular 200MB, Cloudflare 500MB
if [ "$CF_COUNT" -eq 0 ]; then
    MEM_PER_SPIDER=200
elif [ "$CF_COUNT" -eq "$SPIDER_COUNT" ]; then
    MEM_PER_SPIDER=500
else
    MEM_PER_SPIDER=$(( (REGULAR_COUNT * 200 + CF_COUNT * 500) / SPIDER_COUNT ))
fi

MEM_PARALLEL=$(( (AVAILABLE_MEM_MB - 2048) / MEM_PER_SPIDER ))
CPU_PARALLEL=$(( CPU_CORES * 80 / 100 ))
PARALLEL=$(( MEM_PARALLEL < CPU_PARALLEL ? MEM_PARALLEL : CPU_PARALLEL ))

# Run crawls in parallel
echo "$SPIDERS" | tr ' ' '\n' | parallel \
    -j "$PARALLEL" \
    --timeout 8h \
    --halt soon,fail=50% \
    --line-buffer \
    --tagstring "[{.}]" \
    "./scrapai crawl {} --project $PROJECT"

Resource Calculation

Memory-Based Parallelism

The script allocates memory per spider type:
  • Regular spiders: 200 MB each
  • Cloudflare spiders: 500 MB each (browser automation overhead)
  • Mixed fleet: Weighted average
Formula:
AVAILABLE_MEM_MB=$(free -m | awk '/^Mem:/ {print $7}')
MEM_PARALLEL=$(( (AVAILABLE_MEM_MB - 2048) / MEM_PER_SPIDER ))
Reserves 2 GB for the system, then divides the remaining memory by the per-spider allocation.

CPU-Based Parallelism

CPU_CORES=$(nproc)
CPU_PARALLEL=$(( CPU_CORES * 80 / 100 ))
Uses 80% of available cores to avoid saturating the system.

Final Parallelism

PARALLEL=$(( MEM_PARALLEL < CPU_PARALLEL ? MEM_PARALLEL : CPU_PARALLEL ))
[ "$PARALLEL" -lt 2 ] && PARALLEL=2
[ "$PARALLEL" -gt 20 ] && PARALLEL=20
Takes the minimum of memory-based and CPU-based limits, clamped between 2 and 20.
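
Plugging in assumed numbers (a hypothetical 10-core machine with 16 GB free, running the 47-spider fleet of 12 Cloudflare + 35 regular used in the example output below), the whole calculation can be traced in plain bash arithmetic:

```shell
# Assumed inputs -- hypothetical machine and fleet, not measured values
CPU_CORES=10
AVAILABLE_MEM_MB=16384
CF_COUNT=12
REGULAR_COUNT=35
SPIDER_COUNT=$(( CF_COUNT + REGULAR_COUNT ))

# Weighted average memory per spider: (35*200 + 12*500) / 47 = 276 MB
MEM_PER_SPIDER=$(( (REGULAR_COUNT * 200 + CF_COUNT * 500) / SPIDER_COUNT ))

# Memory limit: (16384 - 2048) / 276 = 51 jobs
MEM_PARALLEL=$(( (AVAILABLE_MEM_MB - 2048) / MEM_PER_SPIDER ))

# CPU limit: 10 * 80 / 100 = 8 jobs
CPU_PARALLEL=$(( CPU_CORES * 80 / 100 ))

# Minimum of the two, clamped to [2, 20]
PARALLEL=$(( MEM_PARALLEL < CPU_PARALLEL ? MEM_PARALLEL : CPU_PARALLEL ))
[ "$PARALLEL" -lt 2 ] && PARALLEL=2
[ "$PARALLEL" -gt 20 ] && PARALLEL=20

echo "$PARALLEL"   # 8 -- CPU-bound on this machine
```

On this machine memory would allow 51 jobs, so the CPU limit of 8 wins.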

Example Output

$ bin/parallel-crawl news

==========================================
Parallel Crawler
==========================================
Project:  news
Spiders:  47 (12 CF + 35 regular)
Parallel: 8 jobs
Timeout:  8h per spider
==========================================

Continue? (y/N): y

Starting parallel crawl...

[bbc_co_uk]  Starting crawl...
[guardian]   Starting crawl...
[reuters]    Starting crawl...
[cnn]        Starting crawl...
[bbc_co_uk]  ✓ Crawled 1,247 pages
[guardian]   ✓ Crawled 892 pages
[ap_news]    Starting crawl...
[reuters]    ✓ Crawled 2,103 pages
...

Advanced Usage

Custom Parallelism

Override auto-detection:
# Force 4 parallel jobs
echo "$SPIDERS" | tr ' ' '\n' | parallel -j 4 \
    "./scrapai crawl {} --project myproject"

Timeout Control

# 2-hour timeout per spider
parallel --timeout 2h ...

# No timeout (dangerous for stuck spiders)
parallel --timeout 0 ...

Failure Handling

From bin/parallel-crawl:127:
--halt soon,fail=50%
Stops launching new jobs once 50% of them have failed (jobs already running are allowed to finish). This prevents wasting resources on a broken configuration. Other halt strategies:
--halt now,fail=1     # Stop immediately on first failure
--halt soon,fail=10%  # Stop if 10% fail
--halt never          # Continue even if all fail

Progress Monitoring

# Add progress bar
parallel --progress ...

# Show ETA
parallel --eta ...

# Both
parallel --progress --eta ...

Job Log

# Log all job completions
parallel --joblog crawl_log.txt ...

# Resume from log (skip completed jobs)
parallel --joblog crawl_log.txt --resume ...
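
The joblog is a tab-separated file whose columns, per the GNU parallel documentation, are Seq, Host, Starttime, JobRuntime, Send, Receive, Exitval, Signal, and Command. That makes it easy to pull failed crawls out for a rerun; a sketch, using a fabricated two-entry log in place of a real crawl_log.txt:

```shell
# Build a stand-in joblog (a real one comes from --joblog crawl_log.txt)
printf 'Seq\tHost\tStarttime\tJobRuntime\tSend\tReceive\tExitval\tSignal\tCommand\n' >  sample_log.txt
printf '1\t:\t1700000000\t42.1\t0\t0\t0\t0\t./scrapai crawl bbc --project news\n'    >> sample_log.txt
printf '2\t:\t1700000000\t13.7\t0\t0\t1\t0\t./scrapai crawl cnn --project news\n'    >> sample_log.txt

# Column 7 is the exit status; print the command of every failed job
awk -F'\t' 'NR > 1 && $7 != 0 { print $9 }' sample_log.txt
```

Here the cnn job exited non-zero, so only its command is printed, ready to be piped back into parallel or a shell loop.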

Resource Management

Memory Limits

Why 200MB for regular spiders?
  • Scrapy framework: ~50 MB
  • Downloaded pages in memory: ~100 MB
  • Extraction libraries: ~50 MB
Why 500MB for Cloudflare spiders?
  • Above base: 200 MB
  • Browser process (Chromium): ~200 MB
  • Rendering overhead: ~100 MB

CPU Scheduling

GNU parallel uses fair CPU scheduling:
  • Jobs share CPU time equally
  • I/O-bound tasks (most scrapers) yield CPU automatically
  • Network-bound tasks have minimal CPU impact

Disk I/O

Each spider writes to separate output file:
data/{spider_name}/YYYY-MM-DD/crawl_HHMMSS.jsonl
No I/O contention between spiders.
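
A sketch of that layout with two hypothetical spiders and dummy records; because every spider owns its own file, parallel writers never touch the same path:

```shell
# Recreate the per-spider output layout with dummy data (hypothetical spider names)
TODAY=$(date +%F)
mkdir -p "data/bbc_co_uk/$TODAY" "data/guardian/$TODAY"
echo '{"url": "https://www.bbc.co.uk/news"}'     > "data/bbc_co_uk/$TODAY/crawl_020000.jsonl"
echo '{"url": "https://www.theguardian.com/uk"}' > "data/guardian/$TODAY/crawl_020015.jsonl"

# Aggregate today's record counts across the whole fleet
wc -l data/*/"$TODAY"/crawl_*.jsonl
```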

Patterns and Best Practices

Small Fleet (< 10 spiders)

# Just run them all
bin/parallel-crawl myproject
Auto-detection handles everything.

Medium Fleet (10-50 spiders)

# Prioritize by importance
bin/parallel-crawl myproject high_priority_1 high_priority_2 ...

# Then run the rest
bin/parallel-crawl myproject

Large Fleet (50+ spiders)

Split by type:
# Run Cloudflare spiders first (slower, memory-intensive)
bin/parallel-crawl myproject $(./scrapai spiders list --project myproject | \
    grep -i cloudflare | awk '{print $2}')

# Then run regular spiders
bin/parallel-crawl myproject $(./scrapai spiders list --project myproject | \
    grep -v cloudflare | awk '{print $2}')
Split by schedule:
# Morning batch (9am cron)
bin/parallel-crawl news bbc guardian cnn reuters

# Evening batch (9pm cron)
bin/parallel-crawl news nytimes wapo ft bloomberg

Memory-Constrained Systems

# Reduce parallelism
echo "$SPIDERS" | tr ' ' '\n' | parallel -j 2 ...

# Or run sequentially
for spider in $SPIDERS; do
    ./scrapai crawl $spider --project myproject
done

Comparison with Airflow

Feature        Parallel-Crawl              Airflow
Setup          None (just GNU parallel)    Docker + configuration
Scheduling     Cron jobs                   Built-in scheduler
Monitoring     Terminal output + logs      Web UI + graphs
Parallelism    Auto-detected               Manual configuration
Retry logic    Manual (rerun command)      Automatic with backoff
Use case       Ad-hoc batch crawls         Production scheduling
When to use parallel-crawl:
  • One-time crawls of many sites
  • Testing spider fleet
  • Resource-constrained environments
  • Simple cron-based scheduling
When to use Airflow:
  • Production deployments
  • Complex dependencies between spiders
  • Team collaboration
  • Historical execution tracking

Integration with Cron

Daily Crawl of All Spiders

# crontab -e
0 2 * * * cd /path/to/scrapai-cli && bin/parallel-crawl news >> logs/crawl.log 2>&1
Runs at 2am daily.

Weekday vs. Weekend

# Weekdays: full crawl
0 2 * * 1-5 cd /path/to/scrapai-cli && bin/parallel-crawl news

# Weekends: high-priority only
0 2 * * 0,6 cd /path/to/scrapai-cli && bin/parallel-crawl news priority1 priority2

Staggered Batches

# Batch 1: 2am
0 2 * * * cd /path/to/scrapai-cli && bin/parallel-crawl news $(cat batch1.txt)

# Batch 2: 8am
0 8 * * * cd /path/to/scrapai-cli && bin/parallel-crawl news $(cat batch2.txt)

# Batch 3: 2pm
0 14 * * * cd /path/to/scrapai-cli && bin/parallel-crawl news $(cat batch3.txt)
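
The batch files themselves can be generated by dealing the spider list out round-robin; a sketch, with a static list standing in for the parsed output of ./scrapai spiders list:

```shell
# Deal six spiders round-robin into batch1.txt .. batch3.txt
printf '%s\n' bbc guardian cnn reuters nytimes wapo |
    awk '{ print > ("batch" ((NR - 1) % 3 + 1) ".txt") }'

cat batch1.txt   # bbc and reuters
```

Round-robin keeps each batch roughly the same size even when the fleet grows.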

Troubleshooting

GNU Parallel Not Found

 GNU parallel is not installed

# Install on macOS
brew install parallel

# Install on Linux
sudo apt-get install parallel

Out of Memory Errors

Symptom: Spiders crash with “Killed” or OOM errors.

Solution: Reduce parallelism or split the fleet.
# Check available memory
free -h

# Reduce parallelism
echo "$SPIDERS" | tr ' ' '\n' | parallel -j 2 ...

Some Spiders Timeout

Symptom: “SIGTERM” or timeout messages in the output.

Solution: Increase the timeout or exclude slow spiders from the batch.
# Increase timeout
parallel --timeout 12h ...

# Run slow spiders separately
bin/parallel-crawl news fast_spider1 fast_spider2
./scrapai crawl slow_spider --project news  # Run alone

Jobs Not Starting

Check if parallel is actually running:
ps aux | grep parallel
Check logs:
tail -f ~/.parallel/tmp/*

See Also