Schedule and monitor ScrapAI spiders at scale with Apache Airflow. Each spider becomes a DAG with automatic discovery, project-based organization, and optional S3 upload.
Overview
The Airflow integration provides:
- Automatic DAG generation from your spider database
- Project-based organization with filtering and access control
- Scheduled crawls with configurable intervals
- Real-time monitoring with logs and execution history
- S3 upload with gzip compression (optional)
Architecture
┌─────────────────────┐
│ Airflow Web UI │ Port 8080
│ (Browse/Trigger) │
└──────────┬──────────┘
│
┌──────────▼──────────┐
│ Airflow Scheduler │ Reads DAG files
│ (Manages Schedule) │ every few minutes
└──────────┬──────────┘
│
┌──────────▼──────────┐
│ DAG Generator │ Queries ScrapAI DB
│ (Python script) │ Generates DAGs dynamically
└──────────┬──────────┘
│
┌──────────▼──────────┐
│ ScrapAI Database │ Your spider configs
│ (PostgreSQL) │
└─────────────────────┘
│
┌──────────▼──────────┐
│ Bash Operator │ Executes:
│ (Run Task) │ ./scrapai crawl {name}
└─────────────────────┘
Quick Start
1. Configure Environment
Add to your .env file:
# Airflow admin credentials
_AIRFLOW_WWW_USER_USERNAME=admin
_AIRFLOW_WWW_USER_PASSWORD=your_secure_password
# Set Airflow UID to match your user
AIRFLOW_UID=$(id -u)
# ScrapAI Database connection (PostgreSQL required)
# Note: Must use connection string format, not individual variables
DATABASE_URL=postgresql://user:password@host.docker.internal:5432/scrapai
2. Start Airflow
docker compose -f docker-compose.airflow.yml up -d
Wait 1-2 minutes for initialization.
3. Access Web UI
Open http://localhost:8080 and log in with your credentials.
You’ll see DAGs for each spider in your database, named {project}_{spider_name}.
DAG Generation
DAGs are generated dynamically from your spider database. The generator runs on scheduler refresh (every few minutes).
DAG Naming Convention
Pattern: {project}_{spider_name}
Examples:
news_bbc_co_uk
climate_team_climate_news
default_example_spider (if no project set)
DAG Configuration
Each DAG includes:
dag = DAG(
dag_id=f"{project}_{spider_name}",
schedule_interval=None, # Manual triggering by default
tags=['scrapai', f'project:{project}', 'spider'],
catchup=False,
max_active_runs=1, # Prevent concurrent runs
)
Code examples in this guide are simplified for clarity. The actual bash commands in the DAG include path changes (cd {SCRAPAI_PATH}) and virtual environment activation (source .venv/bin/activate).
Task Structure
Each DAG has two or three tasks:
1. crawl_spider: Runs ./scrapai crawl {spider_name} --project {project} --timeout 28800
  - 8-hour graceful timeout
  - 9-hour hard kill as fallback
2. verify_results: Runs ./scrapai show {spider_name} --project {project} --limit 5
  - Verifies data was extracted
  - Shows a sample of results
3. upload_to_s3 (optional): Compresses results and uploads them to S3
  - Only runs if S3 credentials are configured
  - Applies gzip compression before upload
  - Preserves folder structure
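As noted earlier, the real bash command wraps the crawl in a directory change and virtual-environment activation. A sketch of how that command string might be assembled (`crawl_command` and the `/opt/scrapai` default are illustrative assumptions, not the project's actual helper):

```python
def crawl_command(spider_name: str, project: str,
                  scrapai_path: str = "/opt/scrapai",
                  timeout: int = 28800) -> str:
    """Assemble the bash command for the crawl task: change into the
    ScrapAI checkout, activate the venv, then run the crawl with a timeout."""
    return (
        f"cd {scrapai_path} && "
        f"source .venv/bin/activate && "
        f"./scrapai crawl {spider_name} --project {project} --timeout {timeout}"
    )
```

A BashOperator's bash_command would then be set to this string for each spider.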
Scheduling Spiders
By default, spiders have no schedule (manual triggering only). To add scheduling:
Option 1: Database Column
Add a schedule_interval column to your spiders table:
ALTER TABLE spiders ADD COLUMN schedule_interval VARCHAR(50);
-- Set daily schedule for a spider
UPDATE spiders SET schedule_interval = '0 0 * * *' WHERE name = 'bbc_co_uk';
Option 2: Edit DAG Generator
Modify airflow/dags/scrapai_spider_dags.py:
# Custom schedule logic
if spider.name.startswith('news_'):
schedule_interval = '@daily'
elif spider.name.startswith('research_'):
schedule_interval = '@weekly'
else:
schedule_interval = None
Common Schedules
| Interval | Cron Expression | Description |
|---|---|---|
| @hourly | 0 * * * * | Every hour at minute 0 |
| @daily | 0 0 * * * | Daily at midnight |
| @weekly | 0 0 * * 0 | Weekly on Sunday |
| Custom | 0 */6 * * * | Every 6 hours |
| Custom | 0 9 * * 1-5 | Weekdays at 9am |
Project-Based Organization
Filtering by Project
- Go to the Airflow UI → DAGs page
- Click a project tag (e.g. project:your_project_name)
- See only that project's spiders
Environment Variable Filter
Limit which projects appear in Airflow:
# In .env
AIRFLOW_PROJECT_FILTER=news,research,climate
Only spiders from those projects will generate DAGs.
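A minimal sketch of how such a filter could be applied inside the DAG generator (the helper names `allowed_projects` and `should_generate` are assumptions for illustration; only the AIRFLOW_PROJECT_FILTER variable comes from the integration itself):

```python
import os

def allowed_projects() -> set[str]:
    """Parse AIRFLOW_PROJECT_FILTER into a set; an empty set means 'no filter'."""
    raw = os.environ.get("AIRFLOW_PROJECT_FILTER", "")
    return {p.strip() for p in raw.split(",") if p.strip()}

def should_generate(project: str) -> bool:
    """A spider gets a DAG if no filter is set, or its project is listed."""
    allowed = allowed_projects()
    return not allowed or project in allowed
```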
Triggering Crawls
Via Web UI
- Go to DAGs page
- Find your spider DAG
- Click the “Play” button (▶)
- Monitor progress in real-time
Via CLI
# Trigger a specific spider
docker compose -f docker-compose.airflow.yml exec airflow-webserver \
airflow dags trigger {project}_{spider_name}
# Example
docker compose -f docker-compose.airflow.yml exec airflow-webserver \
airflow dags trigger news_bbc_co_uk
Via REST API
curl -X POST \
http://localhost:8080/api/v1/dags/{project}_{spider_name}/dagRuns \
-H "Content-Type: application/json" \
-u "admin:your_password" \
-d '{"conf": {}}'
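The same call can be made from Python with only the standard library. `build_trigger_request` is a hypothetical helper mirroring the curl command above; it builds the request without sending it:

```python
import base64
import json
import urllib.request

def build_trigger_request(dag_id: str, user: str, password: str,
                          host: str = "http://localhost:8080") -> urllib.request.Request:
    """Build a POST to Airflow's stable REST API, equivalent to the curl call."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        f"{host}/api/v1/dags/{dag_id}/dagRuns",
        data=json.dumps({"conf": {}}).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )

# To actually fire the run:
# urllib.request.urlopen(build_trigger_request("news_bbc_co_uk", "admin", "your_password"))
```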
Monitoring
View Execution Logs
- Click DAG name
- Select a DAG run (date/time)
- Click task (green/red box)
- Click “Log” button
Execution History
Each DAG shows:
- Last run status (success/fail)
- Run duration
- Success rate over time
- Task dependencies graph
Stats Available
- Duration: How long each crawl took
- Records scraped: From verify task output
- Failures: Which spiders are broken
- Trends: Performance over time
S3 Integration
Upload crawl results to S3-compatible storage with automatic gzip compression.
Configuration
Add to .env:
S3_ACCESS_KEY=your_access_key
S3_SECRET_KEY=your_secret_key
S3_ENDPOINT=https://s3.amazonaws.com
S3_BUCKET=scrapai-crawls
The DAG generator automatically enables S3 upload if all credentials are present.
Upload Behavior
From airflow/dags/scrapai_spider_dags.py:61-140:
def upload_to_s3(spider_name: str, project: str, **context):
    # Find the latest crawl file (the path includes the project)
    data_dir = SCRAPAI_PATH / 'data' / project / spider_name
    crawl_files = sorted(glob(str(data_dir / '**' / 'crawl_*.jsonl'), recursive=True), reverse=True)
    latest_path = Path(crawl_files[0])

    # Compress to .jsonl.gz alongside the original
    gz_path = latest_path.parent / (latest_path.name + '.gz')
    with open(latest_path, 'rb') as f_in:
        with gzip.open(gz_path, 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)

    # Preserve folder structure: project/spider_name/date/filename.gz
    relative_path = gz_path.relative_to(SCRAPAI_PATH / 'data')
    s3_key = str(relative_path)

    # Upload
    s3_client.upload_file(str(gz_path), s3_bucket, s3_key)

    # Clean up local files after a successful upload
    gz_path.unlink()
    latest_path.unlink()
Compression savings: Typically 70-90% for JSONL text data.
S3 path structure: s3://bucket/project/spider_name/YYYY-MM-DD/crawl_HHMMSS.jsonl.gz
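You can sanity-check the savings figure on data shaped like your own. The record below is made up for illustration; real-world ratios depend on how repetitive your JSONL content is:

```python
import gzip

# A made-up but representative JSONL record, repeated to mimic a crawl file
record = b'{"url": "https://example.com/article-1", "title": "Example", "body": "..."}\n'
raw = record * 10_000

compressed = gzip.compress(raw)
savings = 1 - len(compressed) / len(raw)
print(f"{savings:.0%} smaller")  # repetitive JSONL compresses very well
```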
Access Control (RBAC)
Creating Project-Specific Roles
- Go to Security → List Roles
- Click "+" to add a new role
- Name it project_news_admin
- Select permissions:
  - can_read on DAG:news_*
  - can_edit on DAG:news_*
  - can_trigger on DAG:news_*
Creating Users
- Go to Security → List Users
- Click "+" to add a new user
- Assign the role project_news_admin
Permission Levels
| Role | Can View | Can Trigger | Can Edit | Can Delete |
|---|---|---|---|---|
| Admin | All DAGs | Yes | Yes | Yes |
| Project Admin | Project DAGs | Yes | Yes | Yes |
| Project User | Project DAGs | Yes | Yes | No |
| Viewer | Project DAGs | No | No | No |
Programmatic Access Control
Uncomment in airflow/dags/scrapai_spider_dags.py:193-196:
dag = DAG(
# ... other settings ...
access_control={
f'{project}_admin': {'can_read', 'can_edit', 'can_delete'},
f'{project}_user': {'can_read', 'can_edit'},
},
)
Then create matching roles in Airflow UI.
Alerting
Email Notifications
Edit DEFAULT_DAG_ARGS in scrapai_spider_dags.py:50-58:
DEFAULT_DAG_ARGS = {
'owner': 'scrapai',
'email': ['your-email@example.com'],
'email_on_failure': True,
'email_on_retry': False,
# ... other settings ...
}
Add to docker-compose.airflow.yml environment:
AIRFLOW__SMTP__SMTP_HOST: smtp.gmail.com
AIRFLOW__SMTP__SMTP_PORT: 587
AIRFLOW__SMTP__SMTP_USER: your-email@gmail.com
AIRFLOW__SMTP__SMTP_PASSWORD: your-app-password
AIRFLOW__SMTP__SMTP_MAIL_FROM: your-email@gmail.com
Custom Alerts
Add custom task after verify:
notify_task = BashOperator(
task_id='send_notification',
bash_command=f'curl -X POST https://your-webhook.com/notify \\
-d "{{\"spider\": \"{spider.name}\", \"status\": \"complete\"}}"',
)
crawl_task >> verify_task >> notify_task
Management Commands
# Start Airflow
docker compose -f docker-compose.airflow.yml up -d
# Stop Airflow
docker compose -f docker-compose.airflow.yml down
# View logs
docker compose -f docker-compose.airflow.yml logs -f airflow-scheduler
# Restart scheduler (to pick up DAG changes)
docker compose -f docker-compose.airflow.yml restart airflow-scheduler
# List all DAGs
docker compose -f docker-compose.airflow.yml exec airflow-webserver \
airflow dags list
# Pause/unpause DAG
docker compose -f docker-compose.airflow.yml exec airflow-webserver \
airflow dags pause {dag_id}
docker compose -f docker-compose.airflow.yml exec airflow-webserver \
airflow dags unpause {dag_id}
# Reset everything (WARNING: deletes all Airflow data)
docker compose -f docker-compose.airflow.yml down -v
Troubleshooting
DAGs Not Showing Up
Check DAG file for errors:
docker compose -f docker-compose.airflow.yml exec airflow-webserver \
python /opt/airflow/dags/scrapai_spider_dags.py
Check scheduler logs:
docker compose -f docker-compose.airflow.yml logs airflow-scheduler
Verify database connection:
docker compose -f docker-compose.airflow.yml exec airflow-webserver \
python -c "from core.db import SessionLocal; print(SessionLocal())"
Spider Crawls Failing
Check task logs in Airflow UI:
- Click failed task (red box)
- Click “Log” button
- Look for error messages
Test spider manually:
# SSH into container
docker compose -f docker-compose.airflow.yml exec airflow-webserver bash
# Try running spider
cd /opt/scrapai
source .venv/bin/activate
./scrapai crawl {spider_name} --project {project}
Database Connection Issues
Use host.docker.internal in DATABASE_URL:
# In .env (correct format)
DATABASE_URL=postgresql://user:password@host.docker.internal:5432/scrapai
Test connectivity from container:
docker compose -f docker-compose.airflow.yml exec airflow-webserver \
ping -c 3 host.docker.internal
Best Practices
Resource Management
- Set max_active_runs=1 to prevent concurrent runs
- Use execution_timeout to prevent runaway tasks
- Monitor memory usage for large crawls
Scheduling Strategy
- High-frequency sites (news): @hourly or 0 */6 * * *
- Daily updates: @daily (midnight) or 0 9 * * * (9am)
- Weekly archives: 0 0 * * 0 (Sunday midnight)
- Manual only: None (on-demand triggering)
Monitoring
- Set up email alerts for failures
- Review execution times weekly
- Check success rates for broken spiders
- Monitor S3 storage growth
See Also