Airflow Integration
Schedule and monitor ScrapAI spiders at scale with Apache Airflow. Each spider becomes a DAG with automatic discovery, project-based organization, and optional S3 upload.

Overview

The Airflow integration provides:
  • Automatic DAG generation from your spider database
  • Project-based organization with filtering and access control
  • Scheduled crawls with configurable intervals
  • Real-time monitoring with logs and execution history
  • S3 upload with gzip compression (optional)

Architecture

┌─────────────────────┐
│   Airflow Web UI    │  Port 8080
│   (Browse/Trigger)  │
└──────────┬──────────┘
           │
┌──────────▼──────────┐
│  Airflow Scheduler  │  Reads DAG files
│  (Manages Schedule) │  every few minutes
└──────────┬──────────┘
           │
┌──────────▼──────────┐
│  DAG Generator      │  Queries ScrapAI DB
│  (Python script)    │  Generates DAGs dynamically
└──────────┬──────────┘
           │
┌──────────▼──────────┐
│  ScrapAI Database   │  Your spider configs
│  (PostgreSQL)       │
└──────────┬──────────┘
           │
┌──────────▼──────────┐
│  Bash Operator      │  Executes:
│  (Run Task)         │  ./scrapai crawl {name}
└─────────────────────┘

Quick Start

1. Configure Environment

Add to your .env file:
# Airflow admin credentials
_AIRFLOW_WWW_USER_USERNAME=admin
_AIRFLOW_WWW_USER_PASSWORD=your_secure_password

# Set Airflow UID to match your host user. Note: docker compose does not
# shell-expand .env values, so write the numeric UID itself, e.g. by running
# echo "AIRFLOW_UID=$(id -u)" >> .env
AIRFLOW_UID=1000

# ScrapAI Database connection (PostgreSQL required)
# Note: Must use connection string format, not individual variables
DATABASE_URL=postgresql://user:password@host.docker.internal:5432/scrapai

2. Start Airflow

docker compose -f docker-compose.airflow.yml up -d
Wait 1-2 minutes for initialization.

3. Access Web UI

Open http://localhost:8080 and log in with your credentials. You’ll see DAGs for each spider in your database, named {project}_{spider_name}.

DAG Generation

DAGs are generated dynamically from your spider database. The generator runs on scheduler refresh (every few minutes).

DAG Naming Convention

Pattern: {project}_{spider_name}

Examples:
  • news_bbc_co_uk
  • climate_team_climate_news
  • default_example_spider (if no project set)

DAG Configuration

Each DAG includes:
dag = DAG(
    dag_id=f"{project}_{spider_name}",
    schedule_interval=None,  # Manual triggering by default
    tags=['scrapai', f'project:{project}', 'spider'],
    catchup=False,
    max_active_runs=1,  # Prevent concurrent runs
)
Code examples in this guide are simplified for clarity. The actual bash commands in the DAG include path changes (cd {SCRAPAI_PATH}) and virtual environment activation (source .venv/bin/activate).
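
For reference, the generator follows Airflow's standard dynamic-DAG pattern: query the database, build one DAG per row, and register each DAG in the module's globals() so the scheduler discovers it. The sketch below is illustrative rather than the exact generator code; fetch_spiders() is a hypothetical stand-in for the real query against the spiders table.

from datetime import datetime
from types import SimpleNamespace

from airflow import DAG

def fetch_spiders():
    # Hypothetical stub standing in for the real database query
    return [SimpleNamespace(name='bbc_co_uk', project='news', schedule_interval=None)]

for spider in fetch_spiders():
    project = spider.project or 'default'
    dag_id = f"{project}_{spider.name}"

    dag = DAG(
        dag_id=dag_id,
        schedule_interval=spider.schedule_interval,  # None => manual triggering only
        start_date=datetime(2024, 1, 1),
        catchup=False,
        max_active_runs=1,
        tags=['scrapai', f'project:{project}', 'spider'],
    )

    # Airflow only discovers DAGs reachable from module globals
    globals()[dag_id] = dag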

Task Structure

Each DAG has 2-3 tasks (a sketch of the wiring follows the list):
  1. crawl_spider: Runs ./scrapai crawl {spider_name} --project {project} --timeout 28800
    • 8-hour graceful timeout
    • 9-hour hard kill as fallback
  2. verify_results: Runs ./scrapai show {spider_name} --project {project} --limit 5
    • Verifies data was extracted
    • Shows sample of results
  3. upload_to_s3 (optional): Compresses and uploads to S3
    • Only runs if S3 credentials are configured
    • Gzip compression before upload
    • Preserves folder structure
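
A hedged sketch of how these tasks might be wired, reusing the spider, project, and dag names from the generation sketch above (the cd and venv activation match the note in DAG Configuration; the exact commands live in the generator):

from datetime import timedelta

from airflow.operators.bash import BashOperator

crawl_task = BashOperator(
    task_id='crawl_spider',
    bash_command=(
        f'cd {SCRAPAI_PATH} && source .venv/bin/activate && '
        f'./scrapai crawl {spider.name} --project {project} --timeout 28800'
    ),
    execution_timeout=timedelta(hours=9),  # hard-kill fallback above the 8h graceful timeout
    dag=dag,
)

verify_task = BashOperator(
    task_id='verify_results',
    bash_command=(
        f'cd {SCRAPAI_PATH} && source .venv/bin/activate && '
        f'./scrapai show {spider.name} --project {project} --limit 5'
    ),
    dag=dag,
)

crawl_task >> verify_task  # upload_to_s3 is appended only when S3 is configured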

Scheduling Spiders

By default, spiders have no schedule (manual triggering only). To add scheduling:

Option 1: Database Column

Add a schedule_interval column to your spiders table:
ALTER TABLE spiders ADD COLUMN schedule_interval VARCHAR(50);

-- Set daily schedule for a spider
UPDATE spiders SET schedule_interval = '0 0 * * *' WHERE name = 'bbc_co_uk';
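
The generator can then feed the stored value straight into the DAG. A minimal sketch, assuming spider rows expose the new column as spider.schedule_interval and reusing names from the generation sketch above:

# NULL/empty in the database keeps the spider manual-only; Airflow accepts
# both cron strings ('0 0 * * *') and presets ('@daily') in this field.
schedule_interval = spider.schedule_interval or None

dag = DAG(
    dag_id=f"{project}_{spider.name}",
    schedule_interval=schedule_interval,
    start_date=datetime(2024, 1, 1),  # required once a schedule is set
    catchup=False,
)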

Option 2: Edit DAG Generator

Modify airflow/dags/scrapai_spider_dags.py:
# Custom schedule logic
if spider.name.startswith('news_'):
    schedule_interval = '@daily'
elif spider.name.startswith('research_'):
    schedule_interval = '@weekly'
else:
    schedule_interval = None

Common Schedules

Interval   Cron Expression   Description
@hourly    0 * * * *         Every hour at minute 0
@daily     0 0 * * *         Daily at midnight
@weekly    0 0 * * 0         Weekly on Sunday
Custom     0 */6 * * *       Every 6 hours
Custom     0 9 * * 1-5       Weekdays at 9am

Project-Based Organization

Filtering by Project

  1. Go to Airflow UI → DAGs page
  2. Click a project tag: project:your_project_name
  3. See only that project’s spiders

Environment Variable Filter

Limit which projects appear in Airflow:
# In .env
AIRFLOW_PROJECT_FILTER=news,research,climate
Only spiders from those projects will generate DAGs.
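
Inside the generator this amounts to an allow-list check. A sketch of how the filter might be applied, reusing fetch_spiders() from the generation sketch above:

import os

# Comma-separated allow-list; unset or empty means "all projects"
raw_filter = os.getenv('AIRFLOW_PROJECT_FILTER', '')
allowed = {p.strip() for p in raw_filter.split(',') if p.strip()}

for spider in fetch_spiders():
    project = spider.project or 'default'
    if allowed and project not in allowed:
        continue  # project filtered out: no DAG generated
    ...  # build and register the DAG as usual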

Triggering Crawls

Via Web UI

  1. Go to DAGs page
  2. Find your spider DAG
  3. Click the “Play” button (▶)
  4. Monitor progress in real-time

Via CLI

# Trigger a specific spider
docker compose -f docker-compose.airflow.yml exec airflow-webserver \
  airflow dags trigger {project}_{spider_name}

# Example
docker compose -f docker-compose.airflow.yml exec airflow-webserver \
  airflow dags trigger news_bbc_co_uk

Via REST API

curl -X POST \
  http://localhost:8080/api/v1/dags/{project}_{spider_name}/dagRuns \
  -H "Content-Type: application/json" \
  -u "admin:your_password" \
  -d '{"conf": {}}'

Monitoring

View Execution Logs

  1. Click DAG name
  2. Select a DAG run (date/time)
  3. Click task (green/red box)
  4. Click “Log” button

Execution History

Each DAG shows:
  • Last run status (success/fail)
  • Run duration
  • Success rate over time
  • Task dependencies graph

Stats Available

  • Duration: How long each crawl took
  • Records scraped: From verify task output
  • Failures: Which spiders are broken
  • Trends: Performance over time

S3 Integration

Upload crawl results to S3-compatible storage with automatic gzip compression.

Configuration

Add to .env:
S3_ACCESS_KEY=your_access_key
S3_SECRET_KEY=your_secret_key
S3_ENDPOINT=https://s3.amazonaws.com
S3_BUCKET=scrapai-crawls
The DAG generator automatically enables S3 upload if all credentials are present.
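
A sketch of that check (illustrative; the generator's actual logic may differ):

import os

S3_VARS = ('S3_ACCESS_KEY', 'S3_SECRET_KEY', 'S3_ENDPOINT', 'S3_BUCKET')

# The upload_to_s3 task is only added when every variable is set and non-empty
s3_enabled = all(os.getenv(var) for var in S3_VARS)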

Upload Behavior

From airflow/dags/scrapai_spider_dags.py:61-140:
def upload_to_s3(spider_name: str, project: str, **context):
    # (gzip, shutil, glob, and Path are imported at the top of the file)
    # Find latest crawl file (includes project in path)
    data_dir = SCRAPAI_PATH / 'data' / project / spider_name
    crawl_files = sorted(glob(str(data_dir / '**' / 'crawl_*.jsonl'), recursive=True), reverse=True)
    if not crawl_files:
        raise FileNotFoundError(f'No crawl output found under {data_dir}')
    latest_path = Path(crawl_files[0])

    # Compress to .jsonl.gz alongside the original
    gz_path = latest_path.with_suffix('.jsonl.gz')
    with open(latest_path, 'rb') as f_in:
        with gzip.open(gz_path, 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)

    # Preserve folder structure: project/spider_name/date/filename.gz
    relative_path = gz_path.relative_to(SCRAPAI_PATH / 'data')
    s3_key = str(relative_path)

    # Upload
    s3_client.upload_file(str(gz_path), s3_bucket, s3_key)

    # Clean up local files after successful upload
    gz_path.unlink()
    latest_path.unlink()
Compression savings: typically 70-90% for JSONL text data.

S3 path structure: s3://bucket/project/spider_name/YYYY-MM-DD/crawl_HHMMSS.jsonl.gz

Access Control (RBAC)

Creating Project-Specific Roles

  1. Go to Security → List Roles
  2. Click "+" to add a new role
  3. Name: project_news_admin
  4. Select permissions:
    • can_read on DAG:news_*
    • can_edit on DAG:news_*
    • can_trigger on DAG:news_*

Creating Users

  1. Go to Security → List Users
  2. Click "+" to add a new user
  3. Assign role: project_news_admin

Permission Levels

Role           Can View      Can Trigger   Can Edit   Can Delete
Admin          All DAGs      Yes           Yes        Yes
Project Admin  Project DAGs  Yes           Yes        Yes
Project User   Project DAGs  Yes           Yes        No
Viewer         Project DAGs  No            No         No

Programmatic Access Control

Uncomment in airflow/dags/scrapai_spider_dags.py:193-196:
dag = DAG(
    # ... other settings ...
    access_control={
        f'{project}_admin': {'can_read', 'can_edit', 'can_delete'},
        f'{project}_user': {'can_read', 'can_edit'},
    },
)
Then create matching roles in Airflow UI.
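
Roles and users can also be created from the CLI rather than the UI (standard airflow commands; substitute your own role and user names):

docker compose -f docker-compose.airflow.yml exec airflow-webserver \
  airflow roles create news_admin news_user

docker compose -f docker-compose.airflow.yml exec airflow-webserver \
  airflow users create --username alice --password changeme \
    --firstname Alice --lastname Doe \
    --email alice@example.com --role news_admin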

Alerting

Email Notifications

Edit DEFAULT_DAG_ARGS in scrapai_spider_dags.py:50-58:
DEFAULT_DAG_ARGS = {
    'owner': 'scrapai',
    'email': ['your-email@example.com'],
    'email_on_failure': True,
    'email_on_retry': False,
    # ... other settings ...
}

Configure SMTP

Add to docker-compose.airflow.yml environment:
AIRFLOW__SMTP__SMTP_HOST: smtp.gmail.com
AIRFLOW__SMTP__SMTP_PORT: 587
AIRFLOW__SMTP__SMTP_USER: your-email@gmail.com
AIRFLOW__SMTP__SMTP_PASSWORD: your-app-password
AIRFLOW__SMTP__SMTP_MAIL_FROM: your-email@gmail.com

Custom Alerts

Add custom task after verify:
notify_task = BashOperator(
    task_id='send_notification',
    bash_command=(
        'curl -X POST https://your-webhook.com/notify '
        f'-d \'{{"spider": "{spider.name}", "status": "complete"}}\''
    ),
)

crawl_task >> verify_task >> notify_task

Management Commands

# Start Airflow
docker compose -f docker-compose.airflow.yml up -d

# Stop Airflow
docker compose -f docker-compose.airflow.yml down

# View logs
docker compose -f docker-compose.airflow.yml logs -f airflow-scheduler

# Restart scheduler (to pick up DAG changes)
docker compose -f docker-compose.airflow.yml restart airflow-scheduler

# List all DAGs
docker compose -f docker-compose.airflow.yml exec airflow-webserver \
  airflow dags list

# Pause/unpause DAG
docker compose -f docker-compose.airflow.yml exec airflow-webserver \
  airflow dags pause {dag_id}

docker compose -f docker-compose.airflow.yml exec airflow-webserver \
  airflow dags unpause {dag_id}

# Reset everything (WARNING: deletes all Airflow data)
docker compose -f docker-compose.airflow.yml down -v

Troubleshooting

DAGs Not Showing Up

Check DAG file for errors:
docker compose -f docker-compose.airflow.yml exec airflow-webserver \
  python /opt/airflow/dags/scrapai_spider_dags.py
Check scheduler logs:
docker compose -f docker-compose.airflow.yml logs airflow-scheduler
Verify database connection:
docker compose -f docker-compose.airflow.yml exec airflow-webserver \
  python -c "from core.db import SessionLocal; print(SessionLocal())"

Spider Crawls Failing

Check task logs in Airflow UI:
  1. Click failed task (red box)
  2. Click “Log” button
  3. Look for error messages
Test spider manually:
# Open a shell inside the container
docker compose -f docker-compose.airflow.yml exec airflow-webserver bash

# Try running spider
cd /opt/scrapai
source .venv/bin/activate
./scrapai crawl {spider_name} --project {project}

Database Connection Issues

Use host.docker.internal in DATABASE_URL (on Linux, map it with extra_hosts: ["host.docker.internal:host-gateway"] in the compose service):
# In .env (correct format)
DATABASE_URL=postgresql://user:password@host.docker.internal:5432/scrapai
Test connectivity from container:
docker compose -f docker-compose.airflow.yml exec airflow-webserver \
  ping -c 3 host.docker.internal

Best Practices

Resource Management

  • Set max_active_runs=1 to prevent concurrent runs
  • Use execution_timeout to prevent runaway tasks
  • Monitor memory usage for large crawls

Scheduling Strategy

  • High-frequency sites (news): @hourly or 0 */6 * * *
  • Daily updates: @daily (midnight) or 0 9 * * * (9am)
  • Weekly archives: 0 0 * * 0 (Sunday midnight)
  • Manual only: None (on-demand triggering)

Monitoring

  • Set up email alerts for failures
  • Review execution times weekly
  • Check success rates for broken spiders
  • Monitor S3 storage growth

See Also