Overview
The Airflow integration provides:

- Automatic DAG generation from your spider database
- Project-based organization with filtering and access control
- Scheduled crawls with configurable intervals
- Real-time monitoring with logs and execution history
- S3 upload with gzip compression (optional)
Architecture
Quick Start
1. Configure Environment
Add to your `.env` file:
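A minimal sketch of the entries involved (the `_AIRFLOW_WWW_USER_*` variables follow the official Airflow docker-compose convention; all values here are placeholders, not the project's actual settings):

```shell
# Database the DAG generator reads spiders from; from inside the Airflow
# containers the host machine is reachable as host.docker.internal
DATABASE_URL=postgresql://scrapai:changeme@host.docker.internal:5432/scrapai

# Initial web UI login (placeholders - change before deploying)
_AIRFLOW_WWW_USER_USERNAME=admin
_AIRFLOW_WWW_USER_PASSWORD=changeme
```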
2. Start Airflow
3. Access Web UI
Open http://localhost:8080 and log in with your credentials. You’ll see one DAG per spider in your database, named `{project}_{spider_name}`.
DAG Generation
DAG Naming Convention
Pattern: `{project}_{spider_name}`
Examples:
- `news_bbc_co_uk`
- `climate_team_climate_news`
- `default_example_spider` (if no project is set)
DAG Configuration
Each DAG is generated with a shared set of defaults (see `DEFAULT_DAG_ARGS` under Alerting below).

Task Structure
Each DAG has 2-3 tasks:

- `crawl_spider`: Runs `./scrapai crawl {spider_name} --project {project} --timeout 28800`
  - 8-hour graceful timeout
  - 9-hour hard kill as fallback
- `verify_results`: Runs `./scrapai show {spider_name} --project {project} --limit 5`
  - Verifies data was extracted
  - Shows a sample of results
- `upload_to_s3` (optional): Compresses and uploads results to S3
  - Only runs if S3 credentials are configured
  - Gzip compression before upload
  - Preserves folder structure
Scheduling Spiders
By default, spiders have no schedule (manual triggering only). To add scheduling:

Option 1: Database Column
Add a `schedule_interval` column to your spiders table:
Option 2: Edit DAG Generator
Modify `airflow/dags/scrapai_spider_dags.py`:
Common Schedules
| Interval | Cron Expression | Description |
|---|---|---|
| `@hourly` | `0 * * * *` | Every hour at minute 0 |
| `@daily` | `0 0 * * *` | Daily at midnight |
| `@weekly` | `0 0 * * 0` | Weekly on Sunday |
| Custom | `0 */6 * * *` | Every 6 hours |
| Custom | `0 9 * * 1-5` | Weekdays at 9am |
Project-Based Organization
Filtering by Project
- Go to Airflow UI → DAGs page
- Click a project tag: `project:your_project_name`
- See only that project’s spiders
Environment Variable Filter
An environment variable filter can limit which projects appear in Airflow.

Triggering Crawls
Via Web UI
- Go to DAGs page
- Find your spider DAG
- Click the “Play” button (▶)
- Monitor progress in real-time
Via CLI
Via REST API
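A trigger request can be built against Airflow’s stable REST API (`POST /api/v1/dags/{dag_id}/dagRuns`). The host, credentials, and DAG id below are placeholder assumptions for a default local setup:

```python
import base64
import json
import urllib.request

AIRFLOW_URL = "http://localhost:8080/api/v1"
USER, PASSWORD = "admin", "admin"  # placeholder credentials

def build_trigger_request(dag_id: str) -> urllib.request.Request:
    """Build a POST that triggers a new run of the given DAG."""
    payload = json.dumps({"conf": {}}).encode()
    token = base64.b64encode(f"{USER}:{PASSWORD}".encode()).decode()
    return urllib.request.Request(
        f"{AIRFLOW_URL}/dags/{dag_id}/dagRuns",
        data=payload,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )

req = build_trigger_request("news_bbc_co_uk")
print(req.full_url)  # http://localhost:8080/api/v1/dags/news_bbc_co_uk/dagRuns
# urllib.request.urlopen(req)  # uncomment to actually trigger against a running Airflow
```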
Monitoring
View Execution Logs
- Click DAG name
- Select a DAG run (date/time)
- Click task (green/red box)
- Click “Log” button
Execution History
- Last run status (success/fail)
- Run duration
- Success rate over time
- Task dependencies graph
- Records scraped from verify task output
S3 Integration
Configuration
Add to `.env`:
Upload Behavior
From `airflow/dags/scrapai_spider_dags.py:61-140`:
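The upload behavior described above (gzip, then S3, preserving folder structure) can be sketched as follows. This is a simplified illustration, not the file’s actual code; the bucket and prefix names are placeholders, and the upload itself requires `boto3` plus AWS credentials:

```python
import gzip
import shutil
from pathlib import Path

def gzip_file(path: Path) -> Path:
    """Compress a result file, returning the new .gz path."""
    gz_path = path.parent / (path.name + ".gz")
    with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return gz_path

def s3_key(local_root: Path, path: Path, prefix: str = "crawls") -> str:
    """Mirror the local folder structure under a bucket prefix."""
    return f"{prefix}/{path.relative_to(local_root).as_posix()}"

def upload(path: Path, bucket: str, key: str) -> None:
    import boto3  # requires boto3 and AWS credentials in the environment
    boto3.client("s3").upload_file(str(path), bucket, key)

# Example mapping:
#   results/news/bbc_co_uk/items.jl -> s3://<bucket>/crawls/news/bbc_co_uk/items.jl.gz
```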
Access Control (RBAC)
Creating Project-Specific Roles
- Go to Security → List Roles
- Click ”+” to add new role
- Name: `project_news_admin`
- Select permissions:
  - `can_read` on `DAG:news_*`
  - `can_edit` on `DAG:news_*`
  - `can_trigger` on `DAG:news_*`
Creating Users
- Go to Security → List Users
- Click ”+” to add new user
- Assign role: `project_news_admin`
Permission Levels
| Role | Can View | Can Trigger | Can Edit | Can Delete |
|---|---|---|---|---|
| Admin | All DAGs | Yes | Yes | Yes |
| Project Admin | Project DAGs | Yes | Yes | Yes |
| Project User | Project DAGs | Yes | Yes | No |
| Viewer | Project DAGs | No | No | No |
Alerting
Email Notifications
Edit `DEFAULT_DAG_ARGS` in `scrapai_spider_dags.py:50-58`:
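The failure-notification fields are standard Airflow `default_args` keys. A sketch of what the edited dict might look like (the address and retry values are placeholders, not the project’s actual settings):

```python
from datetime import timedelta

DEFAULT_DAG_ARGS = {
    "owner": "scrapai",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
    "email": ["alerts@example.com"],   # placeholder address
    "email_on_failure": True,          # send mail when a task fails
    "email_on_retry": False,
}
```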
Configure SMTP
Add to the `docker-compose.airflow.yml` environment:
Custom Alerts
Custom alert logic can be added as an extra task after the verify step.

Management Commands
Troubleshooting
DAGs Not Showing Up
Check the DAG file for errors.

Spider Crawls Failing
Check task logs in the Airflow UI:

- Click failed task (red box)
- Click “Log” button
- Look for error messages
Database Connection Issues
Use `host.docker.internal` in `DATABASE_URL`.

See Also
- Parallel Crawling: run multiple spiders simultaneously with GNU parallel
- Security: security validation and agent safety features