The Traditional Approach: File-Based Spiders
In a typical Scrapy project, every spider is its own Python file, which creates four recurring problems.
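A minimal sketch of the file-per-spider layout (the spider name, domains, and settings below are hypothetical; a real Scrapy spider would subclass scrapy.Spider, omitted here to keep the sketch self-contained):

```python
# spiders/news_spider.py -- one file per spider (hypothetical layout).
# Identity, URLs, and settings all live in code, scattered across files.

class NewsSpider:
    name = "news"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com/articles"]
    custom_settings = {"DOWNLOAD_DELAY": 2.0}  # per-file override

    def parse(self, response):
        # extraction logic, also per file
        ...
```

Multiply this by a hundred spiders and every question about the fleet becomes a question about the filesystem.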
No Central Inventory
Which spiders exist? Grep the filesystem. Which are active? Check each file.
Hard to Batch Update
Change DOWNLOAD_DELAY across 100 spiders? Edit 100 files or write a script.
No Metadata
When was this spider created? By whom? For what project? Add comments and hope.
Code Drift
5 developers write spiders in 5 different styles. No consistency, harder to review.
The ScrapAI Approach: Database-First
Three tables: spiders (metadata, domains, URLs), spider_rules (URL patterns, callbacks), and spider_settings (key-value configs). One generic DatabaseSpider class loads any config at runtime.
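The three tables might look like this in SQLite (column names here are illustrative assumptions based on the description above, not ScrapAI's actual schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE spiders (
    id INTEGER PRIMARY KEY,
    name TEXT UNIQUE NOT NULL,
    project TEXT,
    domains TEXT,                 -- allowed domains
    start_urls TEXT,
    active INTEGER DEFAULT 1,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP,
    updated_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE spider_rules (
    id INTEGER PRIMARY KEY,
    spider_id INTEGER REFERENCES spiders(id),
    url_pattern TEXT NOT NULL,    -- pattern the rule matches
    callback TEXT NOT NULL        -- parser to invoke on a match
);
CREATE TABLE spider_settings (
    id INTEGER PRIMARY KEY,
    spider_id INTEGER REFERENCES spiders(id),
    key TEXT NOT NULL,            -- e.g. 'DOWNLOAD_DELAY'
    value TEXT NOT NULL
);
""")
```

A generic spider class then needs only a spider id or name to assemble its rules and settings at runtime.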
Benefits of Database-First
1. Central Inventory
SELECT name, active, created_at FROM spiders WHERE project = 'news'. No filesystem traversal, no parsing Python files.
2. Batch Updates
A single SQL UPDATE changes a setting for every spider at once; no per-file edits, no one-off scripts.
3. Rich Metadata
Every spider tracks creation time, update time, project, and activity status.
4. Import/Export as Data
Spider configs export to and import from structured files, so they can be backed up, versioned, and moved between environments like any other data.
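Treating a spider as data means one config can be serialized whole. A sketch of what an export might bundle (the JSON shape and field names are assumptions for illustration):

```python
import json

def export_spider(spider: dict, rules: list[dict], settings: dict) -> str:
    """Bundle one spider's config into a portable JSON document."""
    return json.dumps(
        {"spider": spider, "rules": rules, "settings": settings},
        indent=2, sort_keys=True,
    )

doc = export_spider(
    {"name": "news", "project": "news", "active": True},
    [{"url_pattern": r"/articles/\d+", "callback": "parse_article"}],
    {"DOWNLOAD_DELAY": "2.0"},
)
```

Importing is the reverse: parse the document, validate it, and insert rows.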
5. Consistency and Validation
All configs validated via Pydantic: strict naming (^[a-zA-Z0-9_-]+$), HTTP/HTTPS URLs only, private IPs blocked. No malformed configs reach the database.
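ScrapAI does this with Pydantic; the sketch below reimplements the same three checks with only the standard library, to show what they amount to (the function name and error messages are illustrative, not ScrapAI's actual validator):

```python
import ipaddress
import re
from urllib.parse import urlparse

NAME_RE = re.compile(r"^[a-zA-Z0-9_-]+$")

def validate_spider(name: str, start_url: str) -> None:
    """Reject bad names, non-HTTP(S) schemes, and private-IP hosts."""
    if not NAME_RE.match(name):
        raise ValueError(f"invalid spider name: {name!r}")
    parsed = urlparse(start_url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"only HTTP/HTTPS URLs allowed: {start_url!r}")
    try:
        ip = ipaddress.ip_address(parsed.hostname or "")
    except ValueError:
        return  # hostname is a domain name, not an IP literal
    if ip.is_private:
        raise ValueError(f"private IP blocked: {parsed.hostname}")
```

Running validation at the database boundary is the point: a bad config fails at import time, not mid-crawl.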
Querying the Database
ScrapAI provides a safe SQL query interface: all queries are read-only.
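One common way to enforce read-only access (an illustration of the idea, not ScrapAI's implementation) is to accept a single statement and reject anything that is not a query:

```python
import sqlite3

READ_ONLY_PREFIXES = ("select", "with", "explain")

def safe_query(conn: sqlite3.Connection, sql: str) -> list[tuple]:
    """Run a statement only if it looks read-only; one statement at a time."""
    stripped = sql.strip().rstrip(";")
    if ";" in stripped:
        raise ValueError("multiple statements are not allowed")
    if not stripped.lower().startswith(READ_ONLY_PREFIXES):
        raise ValueError("only read-only queries are allowed")
    return conn.execute(stripped).fetchall()
```

For defense in depth, the connection itself can also be opened read-only (in SQLite, via a `file:...?mode=ro` URI), so even a query that slips past the check cannot write.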
Real-World Patterns
Pattern 1: Fleet Health Check
Run test crawls on all spiders monthly.
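A health sweep might iterate the active spiders and record outcomes, along these lines (the `last_checked` column and the injected `run_test_crawl` callable are assumptions; the real check would launch an actual test crawl):

```python
import sqlite3
from datetime import datetime, timezone

def health_check(conn: sqlite3.Connection, run_test_crawl) -> dict[str, bool]:
    """Run `run_test_crawl(name)` for every active spider; record outcomes."""
    results = {}
    rows = conn.execute("SELECT name FROM spiders WHERE active = 1").fetchall()
    for (name,) in rows:
        results[name] = run_test_crawl(name)  # hypothetical: True on success
        conn.execute(
            "UPDATE spiders SET last_checked = ? WHERE name = ?",
            (datetime.now(timezone.utc).isoformat(), name),
        )
    return results
```

Because the fleet lives in one table, "all spiders" is a query, not a directory walk.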
Pattern 2: Bulk Configuration Changes
For example, apply a new rate limit to every spider at once.
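A bulk change reduces to one UPDATE over the settings table (table and column names are assumed from the schema description earlier):

```python
import sqlite3

def set_setting_for_all(conn: sqlite3.Connection, key: str, value: str) -> int:
    """Change one setting for every spider that defines it; return rows touched."""
    cur = conn.execute(
        "UPDATE spider_settings SET value = ? WHERE key = ?",
        (value, key),
    )
    return cur.rowcount
```

The file-based equivalent is editing every spider module by hand or writing a throwaway refactoring script.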
Pattern 3: Spider Versioning
Export a backup before making changes, so a known-good configuration can be restored.
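With SQLite as the store, the standard library already covers this: `Connection.iterdump()` yields a full SQL snapshot that can be saved to a file and replayed later (a generic sketch, not a ScrapAI command):

```python
import sqlite3

def backup_sql(conn: sqlite3.Connection) -> str:
    """Serialize the whole spider database to restorable SQL text."""
    return "\n".join(conn.iterdump())
```

Restoring is `executescript()` on the saved text against a fresh database.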
Pattern 4: Multi-Project Management
Keep each project's spiders isolated: query, update, and deactivate by project without touching the rest.
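Project isolation is just a WHERE clause on the project column. A sketch of per-project operations (schema assumed from the description earlier):

```python
import sqlite3

def project_spiders(conn: sqlite3.Connection, project: str) -> list[str]:
    """List spider names belonging to one project."""
    rows = conn.execute(
        "SELECT name FROM spiders WHERE project = ? ORDER BY name", (project,))
    return [name for (name,) in rows]

def deactivate_project(conn: sqlite3.Connection, project: str) -> int:
    """Turn off every spider in a project; return how many were affected."""
    return conn.execute(
        "UPDATE spiders SET active = 0 WHERE project = ?", (project,)).rowcount
```

Spiders in other projects are untouched, because the scope of every operation is explicit in the query.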
PostgreSQL vs SQLite
SQLite (default): single-user, simple deployment, WAL mode enabled automatically. Good for fewer than 100 spiders.
PostgreSQL (production): multi-user teams, 100+ spiders, high concurrency. A migration path from SQLite to PostgreSQL is provided.
Next Steps
Spider Schema
Detailed schema reference for Spider, SpiderRule, and SpiderSetting
CLI Reference
Commands for spider management: list, import, export, delete