ScrapAI stores spiders as database rows, not Python files. This architectural choice enables management patterns that are impractical with file-based scrapers.

The Traditional Approach: File-Based Spiders

In a typical Scrapy project:
Traditional Scrapy Project
project/
├── spiders/
│   ├── bbc_spider.py
│   ├── cnn_spider.py
│   ├── nytimes_spider.py
│   ├── guardian_spider.py
│   └── ... 96 more files
├── settings.py
└── scrapy.cfg
Problems at scale:

No Central Inventory

Which spiders exist? Grep the filesystem. Which are active? Check each file.

Hard to Batch Update

Change DOWNLOAD_DELAY across 100 spiders? Edit 100 files or write a script.

No Metadata

When was this spider created? By whom? For what project? Add comments and hope.

Code Drift

5 developers write spiders in 5 different styles. No consistency, harder to review.

The ScrapAI Approach: Database-First

ScrapAI Database Schema
spiders
├── id (primary key)
├── name (unique, indexed)
├── allowed_domains (JSON)
├── start_urls (JSON)
├── source_url (text)
├── active (boolean)
├── project (string)
├── callbacks_config (JSON)
├── created_at (timestamp)
└── updated_at (timestamp)

spider_rules
├── id (primary key)
├── spider_id (foreign key)
├── allow_patterns (JSON)
├── deny_patterns (JSON)
├── callback (string)
├── follow (boolean)
└── priority (integer)

spider_settings
├── id (primary key)
├── spider_id (foreign key)
├── key (string)
├── value (string)
└── type (string)
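As a sketch, the spiders table above can be rendered in plain sqlite3. Column types here are illustrative, and ScrapAI's real migrations may differ:

```python
import sqlite3
import json

# One plausible rendering of the spiders table (types are illustrative)
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE spiders (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        name TEXT UNIQUE NOT NULL,
        allowed_domains TEXT NOT NULL,   -- JSON-encoded list
        start_urls TEXT NOT NULL,        -- JSON-encoded list
        source_url TEXT,
        active INTEGER DEFAULT 1,
        project TEXT DEFAULT 'default',
        callbacks_config TEXT,           -- JSON
        created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
        updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    )
""")
conn.execute(
    "INSERT INTO spiders (name, allowed_domains, start_urls, project) VALUES (?, ?, ?, ?)",
    ("bbc_co_uk", json.dumps(["bbc.co.uk"]), json.dumps(["https://www.bbc.co.uk/"]), "news"),
)
row = conn.execute("SELECT name, active FROM spiders WHERE project = 'news'").fetchone()
```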
One spider class, many configurations. The DatabaseSpider loads any config from the database at runtime.
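Hydrating one spider class from a config row might look like the following sketch. The class and field names here are assumptions for illustration, not ScrapAI's actual DatabaseSpider implementation:

```python
# Illustrative sketch only — not ScrapAI's actual DatabaseSpider.
class DatabaseSpider:
    """One spider class; all behavior comes from a config dict loaded at runtime."""

    def __init__(self, config: dict):
        self.name = config["name"]
        self.allowed_domains = config["allowed_domains"]
        self.start_urls = config["start_urls"]
        # Each spider_rules row becomes one (patterns, callback, follow) rule
        self.rules = [
            (row["allow_patterns"], row.get("callback"), row.get("follow", True))
            for row in config.get("rules", [])
        ]

bbc = DatabaseSpider({
    "name": "bbc_co_uk",
    "allowed_domains": ["bbc.co.uk"],
    "start_urls": ["https://www.bbc.co.uk/"],
    "rules": [{"allow_patterns": ["/news/articles/.*"], "callback": "parse_article"}],
})
```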

Benefits of Database-First

1. Central Inventory

List all spiders with a single query:
CLI
./scrapai spiders list --project news
SQL
SELECT name, active, created_at, updated_at 
FROM spiders 
WHERE project = 'news'
ORDER BY created_at DESC;
Output:
Name            Active  Created              Updated
─────────────────────────────────────────────────────
bbc_co_uk       Yes     2026-01-15 10:23:11  2026-02-20 14:55:32
cnn_com         Yes     2026-01-15 11:10:45  2026-01-15 11:10:45
guardian_co_uk  No      2026-01-14 09:12:33  2026-02-18 16:22:10
No filesystem traversal, no parsing Python files, no guessing.

2. Batch Updates

Change settings across multiple spiders:
-- Increase delay for all news spiders
UPDATE spider_settings 
SET value = '2'
WHERE key = 'DOWNLOAD_DELAY' 
  AND spider_id IN (SELECT id FROM spiders WHERE project = 'news');
With files: Write a script to parse and edit 100 Python files. Hope regex doesn’t break syntax. With database: One SQL query. Done.

3. Rich Metadata

Every spider carries created_at and updated_at timestamps, enabling temporal queries:
Temporal Queries
-- Spiders created in the last 30 days
SELECT name, created_at 
FROM spiders 
WHERE created_at > NOW() - INTERVAL '30 days';

-- Spiders not updated in 90 days (stale?)
SELECT name, updated_at 
FROM spiders 
WHERE updated_at < NOW() - INTERVAL '90 days';

4. Import/Export as Data

Spiders are JSON-serializable data structures:
{
  "name": "bbc_co_uk",
  "source_url": "https://bbc.co.uk/",
  "allowed_domains": ["bbc.co.uk"],
  "start_urls": ["https://www.bbc.co.uk/"],
  "rules": [
    {
      "allow": ["/news/articles/.*"],
      "callback": "parse_article"
    }
  ],
  "settings": {
    "EXTRACTOR_ORDER": ["newspaper", "trafilatura"],
    "DOWNLOAD_DELAY": 1
  }
}
Use cases:
  • Share spider configs between projects
  • Version control configs (not code)
  • Backup/restore spider configurations
  • Distribute configs to other teams
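Because the config is plain JSON, the export/import round trip needs nothing beyond the standard library. A minimal sketch:

```python
import json

config = {
    "name": "bbc_co_uk",
    "allowed_domains": ["bbc.co.uk"],
    "start_urls": ["https://www.bbc.co.uk/"],
    "rules": [{"allow": ["/news/articles/.*"], "callback": "parse_article"}],
    "settings": {"DOWNLOAD_DELAY": 1},
}

# Export: serialize to a stable, version-controllable string
exported = json.dumps(config, indent=2, sort_keys=True)

# Import: parse it back — the round trip is lossless
restored = json.loads(exported)
```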

5. Consistency and Validation

All configs go through the same validation pipeline:
Validation via Pydantic
from pydantic import BaseModel, Field, field_validator

class RuleConfig(BaseModel):
    # Minimal stub mirroring the spider_rules table
    allow: list[str] = []
    deny: list[str] = []
    callback: str | None = None
    follow: bool = True

class SpiderConfig(BaseModel):
    name: str = Field(pattern=r"^[a-zA-Z0-9_-]+$")  # Strict naming
    allowed_domains: list[str] = Field(min_length=1)
    start_urls: list[str] = Field(min_length=1)
    rules: list[RuleConfig] = []
    settings: dict = {}
    
    @field_validator("start_urls")
    @classmethod
    def validate_urls(cls, urls):
        for url in urls:
            if not url.startswith(("http://", "https://")):
                raise ValueError(f"Invalid URL: {url}")
            # Block private/loopback hosts
            if any(host in url for host in ("127.0.0.1", "localhost", "10.", "192.168.")):
                raise ValueError(f"Private IP blocked: {url}")
        return urls
Result: Uniform structure, validated inputs, no malformed configs.

Database Schema Deep Dive

Spider Table

id
integer
required
Primary key, auto-increment
name
string
required
Unique spider identifier. Must match ^[a-zA-Z0-9_-]+$. Example: bbc_co_uk, amazon_electronics
allowed_domains
JSON
required
List of domains the spider can crawl. Scrapy’s domain restriction. Example: ["bbc.co.uk", "www.bbc.co.uk"]
start_urls
JSON
required
Initial URLs to start crawling from. Example: ["https://www.bbc.co.uk/news"]
source_url
string
The original URL provided by the user when creating the spider. Example: https://bbc.co.uk/
active
boolean
default:"true"
Enable/disable spider without deletion. Inactive spiders are skipped.
project
string
default:"default"
Project grouping for multi-project setups. Example: news, ecommerce, research
callbacks_config
JSON
Custom callback definitions for non-article content (products, jobs, listings).
{
  "parse_product": {
    "extract": {
      "name": {"css": "h1.product-name::text"},
      "price": {"css": "span.price::text"}
    }
  }
}
created_at
timestamp
When the spider was created (auto-set).
updated_at
timestamp
Last update timestamp (auto-updated on change).

SpiderRule Table

Maps to Scrapy’s Rule and LinkExtractor:
Scrapy Equivalent
Rule(
    LinkExtractor(
        allow=[r"/news/articles/.*"],
        deny=[r"/news/.*#comments"]
    ),
    callback="parse_article",
    follow=False
)
Database representation:
allow_patterns
JSON
Regex patterns to match URLs for crawling. Example: ["/news/articles/.*", "/sport/.*/articles/.*"]
deny_patterns
JSON
Regex patterns to exclude URLs. Example: ["/news/.*#comments", "/gallery/.*"]
callback
string
Callback method name (parse_article, parse_product, etc.) or null for link following only.
follow
boolean
default:"true"
Whether to follow links matched by this rule.
priority
integer
default:"0"
Rule execution order (higher priority first).
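The mapping from a spider_rules row to Rule arguments can be sketched as follows. The function name is illustrative; the field names mirror the table above:

```python
# Sketch: translate a spider_rules row into Rule() keyword arguments.
def row_to_rule_kwargs(row: dict) -> dict:
    return {
        "allow": row.get("allow_patterns", []),
        "deny": row.get("deny_patterns", []),
        "callback": row.get("callback"),  # None means follow-only rule
        "follow": row.get("follow", True),
    }

rows = [
    {"allow_patterns": ["/sport/.*"], "follow": True, "priority": 0},
    {"allow_patterns": ["/news/articles/.*"], "callback": "parse_article",
     "follow": False, "priority": 10},
]
# Higher priority rules are applied first
ordered = [row_to_rule_kwargs(r)
           for r in sorted(rows, key=lambda r: r["priority"], reverse=True)]
```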

SpiderSetting Table

Key-value pairs for Scrapy settings:
Key                  Value Example                  Description
─────────────────────────────────────────────────────────────────────────
EXTRACTOR_ORDER      ["newspaper", "trafilatura"]   Extraction fallback order
DOWNLOAD_DELAY       2                              Seconds between requests
CONCURRENT_REQUESTS  16                             Parallel request limit
CLOUDFLARE_ENABLED   true                           Enable Cloudflare bypass
ROBOTSTXT_OBEY       true                           Respect robots.txt
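Since the table stores every value as a string alongside a type tag, settings must be coerced back to typed values when loaded. One plausible coercion sketch (function name is an assumption, not ScrapAI's API):

```python
import json

# Sketch: coerce a spider_settings (value, type) pair back to a Python value
def coerce(value: str, type_: str):
    if type_ == "int":
        return int(value)
    if type_ == "bool":
        return value.lower() == "true"
    if type_ == "json":
        return json.loads(value)
    return value  # plain string

delay = coerce("2", "int")
cloudflare = coerce("true", "bool")
order = coerce('["newspaper", "trafilatura"]', "json")
```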

Querying the Database

ScrapAI provides a safe SQL query interface:
Read-Only Queries
# List all spiders
./scrapai db query "SELECT name, project, active FROM spiders"

# Count items per spider
./scrapai db query "SELECT spider_id, COUNT(*) FROM scraped_items GROUP BY spider_id"

# Find spiders with Cloudflare enabled
./scrapai db query "SELECT s.name FROM spiders s JOIN spider_settings ss ON s.id = ss.spider_id WHERE ss.key = 'CLOUDFLARE_ENABLED' AND ss.value = 'true'"
UPDATE/DELETE queries require confirmation via the --confirm flag, which takes the expected row count, to prevent accidental data loss.
./scrapai db query "DELETE FROM spiders WHERE name = 'test_spider'" --confirm 1

Real-World Patterns

Pattern 1: Fleet Health Check

Run test crawls on all spiders monthly:
Test All Spiders
# Get active spiders
./scrapai db query "SELECT name FROM spiders WHERE active = true AND project = 'news'" > active_spiders.txt

# Test each spider (5 sample URLs)
while read spider; do
  echo "Testing $spider..."
  ./scrapai crawl "$spider" --project news --limit 5
done < active_spiders.txt

# Check for extraction failures
./scrapai db query "SELECT spider_id, COUNT(*) FROM scraped_items WHERE scraped_at > NOW() - INTERVAL '1 day' GROUP BY spider_id HAVING COUNT(*) = 0"

Pattern 2: Bulk Configuration Changes

Rate Limit All Spiders
-- Increase delay for aggressive sites
UPDATE spider_settings 
SET value = '3'
WHERE key = 'DOWNLOAD_DELAY' 
  AND spider_id IN (
    SELECT id FROM spiders 
    WHERE allowed_domains::text LIKE '%amazon%' 
       OR allowed_domains::text LIKE '%ebay%'
  );

Pattern 3: Spider Versioning

Export before making changes:
Backup Before Update
# Export current config
./scrapai spiders export bbc_co_uk --project news > bbc_v1.json

# Make changes (via agent or manual edit)
# ... modify rules, settings, etc.

# Test new version
./scrapai crawl bbc_co_uk --project news --limit 10

# Rollback if needed
./scrapai spiders import bbc_v1.json --project news

Pattern 4: Multi-Project Management

Project Isolation
# Create separate projects for different use cases
./scrapai spiders import news_spider.json --project production
./scrapai spiders import news_spider.json --project staging
./scrapai spiders import news_spider.json --project testing

# Each project has independent configs and data
./scrapai crawl news_spider --project production
./scrapai crawl news_spider --project testing --limit 5

PostgreSQL vs SQLite

SQLite

Best for:
  • Single-user development
  • Small to medium scale (< 100 spiders)
  • Simple deployment (no external database)
Performance:
  • Read-heavy: Excellent (with WAL mode)
  • Write-heavy: Good (sequential writes)
  • Concurrent access: Limited (single writer)
Configuration:
.env
DATABASE_URL=sqlite:///scrapai.db
ScrapAI auto-enables WAL mode for better concurrency:
cursor.execute("PRAGMA journal_mode=WAL")
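A minimal check that the pragma takes effect on a file-backed database (in-memory databases report memory instead of wal):

```python
import sqlite3
import tempfile
import os

# WAL mode only applies to file-backed databases
path = os.path.join(tempfile.mkdtemp(), "scrapai.db")
conn = sqlite3.connect(path)
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
conn.close()
```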

Key Takeaways

Spiders are Data

Not files, not code—structured data in a database. Query, update, export like any other dataset.

One Spider Class

The DatabaseSpider loads any config at runtime. No code generation, no Python files per site.

Rich Metadata

Track creation time, update time, project, and activity status. Files can't track any of this reliably.

Batch Operations

Change settings across 100 spiders with one SQL query. Update, disable, export in bulk.

Next Steps