ScrapAI stores spiders as database rows, not Python files. This architectural choice enables powerful management patterns impossible with file-based scrapers.

The Traditional Approach: File-Based Spiders

In a typical Scrapy project:
Traditional Scrapy Project
project/
├── spiders/
│   ├── bbc_spider.py
│   ├── cnn_spider.py
│   ├── nytimes_spider.py
│   ├── guardian_spider.py
│   └── ... 96 more files
├── settings.py
└── scrapy.cfg
Problems at scale:

No Central Inventory

Which spiders exist? Grep the filesystem. Which are active? Check each file.

Hard to Batch Update

Change DOWNLOAD_DELAY across 100 spiders? Edit 100 files or write a script.

No Metadata

When was this spider created? By whom? For what project? Add comments and hope.

Code Drift

5 developers write spiders in 5 different styles. No consistency, harder to review.

The ScrapAI Approach: Database-First

Three tables: spiders (metadata, domains, URLs), spider_rules (URL patterns, callbacks), spider_settings (key-value configs). One generic DatabaseSpider class loads any config at runtime.
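To make the runtime flow concrete, here is a minimal sketch of how a generic DatabaseSpider might assemble one spider's configuration from those three tables at startup. Table and column names are simplified assumptions for illustration, not ScrapAI's exact schema:

```python
import sqlite3

# Illustrative stand-in for the spiders / spider_rules / spider_settings
# tables -- real column sets will differ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE spiders (id INTEGER PRIMARY KEY, name TEXT, project TEXT);
    CREATE TABLE spider_rules (spider_id INTEGER, pattern TEXT, callback TEXT);
    CREATE TABLE spider_settings (spider_id INTEGER, key TEXT, value TEXT);
    INSERT INTO spiders VALUES (1, 'bbc_co_uk', 'news');
    INSERT INTO spider_rules VALUES (1, '/news/.*', 'parse_article');
    INSERT INTO spider_settings VALUES (1, 'DOWNLOAD_DELAY', '2');
""")

def load_spider_config(conn, name):
    """Assemble one spider's full config from the three tables."""
    sid, project = conn.execute(
        "SELECT id, project FROM spiders WHERE name = ?", (name,)
    ).fetchone()
    rules = conn.execute(
        "SELECT pattern, callback FROM spider_rules WHERE spider_id = ?", (sid,)
    ).fetchall()
    settings = dict(conn.execute(
        "SELECT key, value FROM spider_settings WHERE spider_id = ?", (sid,)
    ).fetchall())
    return {"name": name, "project": project, "rules": rules, "settings": settings}

config = load_spider_config(conn, "bbc_co_uk")
```

Because the spider class itself is generic, adding a new spider means inserting rows, not deploying code.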

Benefits of Database-First

1. Central Inventory

./scrapai spiders list --project news
Or direct SQL: SELECT name, active, created_at FROM spiders WHERE project = 'news'. No filesystem traversal, no parsing Python files.

2. Batch Updates

-- Increase delay for all news spiders
UPDATE spider_settings
SET value = '2'
WHERE key = 'DOWNLOAD_DELAY'
  AND spider_id IN (SELECT id FROM spiders WHERE project = 'news');
With files: edit 100 Python files. With database: one SQL query.

3. Rich Metadata

Every spider tracks creation time, update time, project, and activity status:
-- Count spiders per project
SELECT project, COUNT(*) FROM spiders GROUP BY project;

-- Find stale spiders (not updated in 90 days)
SELECT name, updated_at FROM spiders WHERE updated_at < NOW() - INTERVAL '90 days';

4. Import/Export as Data

./scrapai spiders export bbc_co_uk --project news > bbc_spider.json
./scrapai spiders import bbc_spider.json --project news-backup
Spiders are JSON configs. Export, share, version control, backup/restore like any data file.
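A round-trip through JSON is what makes these workflows possible. The field names below are hypothetical, chosen only to illustrate the idea, not ScrapAI's actual export format:

```python
import json

# Hypothetical export shape -- field names are illustrative only.
exported = {
    "name": "bbc_co_uk",
    "allowed_domains": ["bbc.co.uk"],
    "rules": [{"pattern": "/news/.*", "callback": "parse_article"}],
    "settings": {"DOWNLOAD_DELAY": "2"},
}

# Plain JSON survives serialization unchanged, so a config can be
# diffed, committed to git, or restored into another project.
restored = json.loads(json.dumps(exported, indent=2))
assert restored == exported
```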

5. Consistency and Validation

All configs validated via Pydantic: strict naming (^[a-zA-Z0-9_-]+$), HTTP/HTTPS URLs only, private IPs blocked. No malformed configs reach the database.
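The same three checks can be sketched with the standard library alone. ScrapAI performs them via Pydantic models; this stdlib version exists only to show what each rule rejects:

```python
import ipaddress
import re
from urllib.parse import urlparse

NAME_RE = re.compile(r"^[a-zA-Z0-9_-]+$")

def validate_spider_name(name: str) -> str:
    """Reject names outside the strict alphanumeric/underscore/hyphen set."""
    if not NAME_RE.match(name):
        raise ValueError(f"invalid spider name: {name!r}")
    return name

def validate_start_url(url: str) -> str:
    """Allow only HTTP/HTTPS URLs, and block private IP literals."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"only HTTP/HTTPS allowed: {url!r}")
    try:
        ip = ipaddress.ip_address(parsed.hostname or "")
    except ValueError:
        return url  # hostname is a domain name, not an IP literal
    if ip.is_private:
        raise ValueError(f"private IP blocked: {url!r}")
    return url
```

For example, `validate_start_url("http://192.168.1.1/admin")` raises, while `validate_start_url("https://bbc.co.uk/news")` passes.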

Querying the Database

ScrapAI provides a safe SQL query interface:
Read-Only Queries
# List all spiders
./scrapai db query "SELECT name, project, active FROM spiders"

# Count items per spider
./scrapai db query "SELECT spider_id, COUNT(*) FROM scraped_items GROUP BY spider_id"

# Find spiders with Cloudflare enabled
./scrapai db query "SELECT s.name FROM spiders s JOIN spider_settings ss ON s.id = ss.spider_id WHERE ss.key = 'CLOUDFLARE_ENABLED' AND ss.value = 'true'"
UPDATE/DELETE queries require confirmation via the --confirm flag, which takes the expected row count, to prevent accidental data loss.
./scrapai db query "DELETE FROM spiders WHERE name = 'test_spider'" --confirm 1

Real-World Patterns

Pattern 1: Fleet Health Check

Run test crawls on all spiders monthly:
Test All Spiders
# Get active spiders
./scrapai db query "SELECT name FROM spiders WHERE active = true AND project = 'news'" > active_spiders.txt

# Test each spider (5 sample URLs)
while read spider; do
  echo "Testing $spider..."
  ./scrapai crawl "$spider" --project news --limit 5
done < active_spiders.txt

# Check for extraction failures
./scrapai db query "SELECT spider_id, COUNT(*) FROM scraped_items WHERE scraped_at > NOW() - INTERVAL '1 day' GROUP BY spider_id HAVING COUNT(*) = 0"

Pattern 2: Bulk Configuration Changes

Rate Limit All Spiders
-- Increase delay for aggressive sites
UPDATE spider_settings 
SET value = '3'
WHERE key = 'DOWNLOAD_DELAY' 
  AND spider_id IN (
    SELECT id FROM spiders 
    WHERE allowed_domains::text LIKE '%amazon%' 
       OR allowed_domains::text LIKE '%ebay%'
  );

Pattern 3: Spider Versioning

Export before making changes:
Backup Before Update
# Export current config
./scrapai spiders export bbc_co_uk --project news > bbc_v1.json

# Make changes (via agent or manual edit)
# ... modify rules, settings, etc.

# Test new version
./scrapai crawl bbc_co_uk --project news --limit 10

# Rollback if needed
./scrapai spiders import bbc_v1.json --project news

Pattern 4: Multi-Project Management

Project Isolation
# Create separate projects for different use cases
./scrapai spiders import news_spider.json --project production
./scrapai spiders import news_spider.json --project staging
./scrapai spiders import news_spider.json --project testing

# Each project has independent configs and data
./scrapai crawl news_spider --project production
./scrapai crawl news_spider --project testing --limit 5

PostgreSQL vs SQLite

SQLite (default): single-user, simple deployment, WAL mode auto-enabled. Good for fewer than 100 spiders.

PostgreSQL (production): multi-user teams, 100+ spiders, high concurrency. Migrate with:
# Update .env with PostgreSQL URL
./scrapai db migrate
./scrapai db transfer sqlite:///scrapai.db
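WAL (write-ahead logging) is what lets readers keep querying while a crawl is writing. ScrapAI enables it automatically, but the underlying SQLite call is a single pragma:

```python
import os
import sqlite3
import tempfile

# WAL requires a file-backed database; an in-memory DB reports "memory".
path = os.path.join(tempfile.mkdtemp(), "scrapai.db")
conn = sqlite3.connect(path)

# Switch the journal mode; SQLite echoes back the mode now in effect.
(mode,) = conn.execute("PRAGMA journal_mode=WAL").fetchone()
print(mode)  # -> wal
conn.close()
```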

Next Steps

Spider Schema

Detailed schema reference for Spider, SpiderRule, and SpiderSetting

CLI Reference

Commands for spider management: list, import, export, delete