ScrapAI is a database-backed scraping orchestration layer built on Scrapy. AI agents generate JSON spider configs instead of writing Python files.

Analysis and Review

During site analysis, agents write detailed notes in sections.md documenting URL patterns, site structure, and extraction strategy. Review the analysis, correct assumptions, and refine the approach before finalizing configs.

Full Control

Write configs by hand, edit generated ones, override settings per spider, or write custom callbacks with your own CSS/XPath selectors:
./scrapai spiders import my_config.json --project myproject

Team Benefits

All configs follow the same schema. Uniform structure across the fleet means easier code review, debugging, and onboarding. One developer can pick up another’s spider without decoding personal style choices.

Architecture

ScrapAI is an orchestration layer on top of Scrapy. Instead of writing a Python spider file per website, an AI agent generates a JSON config and stores it in a database. A single generic spider (DatabaseSpider) loads any config at runtime.
You (plain English) → AI Agent → JSON config → Database → Scrapy crawl
                       (once)                               (forever)
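Conceptually, DatabaseSpider turns stored JSON rows into the rule objects Scrapy needs at crawl time. A minimal pure-Python sketch of that translation step (function and field names here are illustrative, not the actual implementation):

```python
import json
import re

def build_rules(config: dict) -> list[dict]:
    """Translate a stored JSON config into rule dicts that a generic
    spider could hand to Scrapy's Rule/LinkExtractor machinery."""
    rules = []
    for rule in config.get("rules", []):
        rules.append({
            "allow": [re.compile(p) for p in rule.get("allow", [])],
            "callback": rule.get("callback"),  # None means "follow only"
            "follow": rule.get("follow", False),
        })
    return rules

config = json.loads("""
{
  "name": "bbc_co_uk",
  "rules": [
    {"allow": ["/news/articles/[^/]+$"], "callback": "parse_article", "follow": false}
  ]
}
""")

rules = build_rules(config)
```

The point is that the spider code never changes; only the data driving it does.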

Component Overview

Component                         What it does
scrapai                           Entry point, auto-activates venv, delegates to CLI
cli/                              Click-based CLI: spiders, queue, crawl, show, export, inspect
spiders/database_spider.py        Generic spider that loads config from database at runtime
spiders/sitemap_spider.py         Sitemap-based spider for sites with XML sitemaps
core/extractors.py                Extraction chain: newspaper, trafilatura, custom CSS, Playwright
core/models.py                    SQLAlchemy models: Spider, SpiderRule, SpiderSetting, ScrapedItem
handlers/cloudflare_handler.py    Cloudflare bypass with cookie caching
middlewares.py                    SmartProxyMiddleware, direct-to-proxy escalation
pipelines.py                      Batched database writes and JSONL export
alembic/                          Database migrations
airflow/                          Production scheduling with Apache Airflow

Codebase

Small and readable: ~4,000 lines of code. Built on Scrapy, SQLAlchemy, and Alembic, tools you already know. Read the whole thing in an afternoon. Measured with pygount, counting actual code lines only (no blanks, comments, or docstrings); tests, examples, and docs are excluded.
Metric           Count
Files            37
Code Lines       4,028
Comment Lines    895
Comment %        14%
Compare this to other scraping frameworks:
  • Scrapling: 5,875 lines (21% comments)
  • crawl4ai: 26,850 lines (21% comments)
ScrapAI is intentionally small. The complexity lives in Scrapy, SQLAlchemy, and the extraction libraries. Our contribution is the orchestration.

Writing Spider Configs

Here’s what an AI-generated spider config looks like:
{
  "name": "bbc_co_uk",
  "allowed_domains": ["bbc.co.uk"],
  "start_urls": ["https://www.bbc.co.uk/news"],
  "rules": [
    {
      "allow": ["/news/articles/[^/]+$"],
      "callback": "parse_article",
      "follow": false
    },
    {
      "allow": ["/news/?$"],
      "follow": true
    }
  ],
  "settings": {
    "EXTRACTOR_ORDER": ["newspaper", "trafilatura"],
    "DOWNLOAD_DELAY": 2
  }
}
You can write this by hand; no AI is needed. See Spider Schema for the complete specification.

Custom Extractors

For non-article content (products, jobs, listings), write custom callbacks with field-level selectors:
{
  "callbacks": {
    "parse_product": {
      "extract": {
        "title": {"css": "h1.product-name::text"},
        "price": {
          "css": "span.price::text",
          "processors": [
            {"type": "strip"},
            {"type": "regex", "pattern": "\\$([\\d.]+)"},
            {"type": "cast", "to": "float"}
          ]
        },
        "rating": {
          "css": "div.stars::attr(data-rating)",
          "processors": [{"type": "cast", "to": "float"}]
        }
      }
    }
  }
}
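The processors above form a pipeline that each raw extracted value passes through in order. A hedged sketch of how such a chain might be applied (illustrative only, not the actual ScrapAI implementation):

```python
import re

def apply_processors(value, processors):
    """Run a raw value through a list of processor specs, in order."""
    for proc in processors:
        if proc["type"] == "strip":
            value = value.strip()
        elif proc["type"] == "regex":
            match = re.search(proc["pattern"], value)
            value = match.group(1) if match else value
        elif proc["type"] == "cast" and proc["to"] == "float":
            value = float(value)
    return value

price = apply_processors("  $19.99  ", [
    {"type": "strip"},
    {"type": "regex", "pattern": r"\$([\d.]+)"},
    {"type": "cast", "to": "float"},
])
# price == 19.99
```

Ordering matters: the regex capture has to run before the cast, or `float("$19.99")` would raise.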
See Custom Callbacks for complete examples.

Database Schema

All configuration lives in PostgreSQL (or SQLite for development):

Spider Table

CREATE TABLE spiders (
    id INTEGER PRIMARY KEY,
    name VARCHAR(255) UNIQUE NOT NULL,
    project VARCHAR(255) NOT NULL,
    allowed_domains JSON NOT NULL,
    start_urls JSON NOT NULL,
    created_at TIMESTAMP,
    updated_at TIMESTAMP
);

SpiderRule Table

CREATE TABLE spider_rules (
    id INTEGER PRIMARY KEY,
    spider_id INTEGER REFERENCES spiders(id),
    allow_patterns JSON,
    deny_patterns JSON,
    callback VARCHAR(255),
    follow BOOLEAN,
    priority INTEGER
);

SpiderSetting Table

CREATE TABLE spider_settings (
    id INTEGER PRIMARY KEY,
    spider_id INTEGER REFERENCES spiders(id),
    key VARCHAR(255) NOT NULL,
    value TEXT NOT NULL
);

ScrapedItem Table

CREATE TABLE scraped_items (
    id INTEGER PRIMARY KEY,
    spider_name VARCHAR(255),
    project VARCHAR(255),
    url TEXT,
    title TEXT,
    content TEXT,
    author VARCHAR(255),
    published_date TIMESTAMP,
    metadata_json JSON,
    scraped_at TIMESTAMP
);
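The DDL above maps directly onto SQLite for development. A quick sketch using Python's stdlib sqlite3 with parameterized bindings (JSON columns are stored as serialized text here; this mirrors the schema but is not ScrapAI's own persistence code):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE spiders (
    id INTEGER PRIMARY KEY,
    name VARCHAR(255) UNIQUE NOT NULL,
    project VARCHAR(255) NOT NULL,
    allowed_domains JSON NOT NULL,
    start_urls JSON NOT NULL
)
""")
# Parameterized bindings, never string interpolation
conn.execute(
    "INSERT INTO spiders (name, project, allowed_domains, start_urls)"
    " VALUES (?, ?, ?, ?)",
    ("bbc_co_uk", "news",
     json.dumps(["bbc.co.uk"]),
     json.dumps(["https://www.bbc.co.uk/news"])),
)
row = conn.execute("SELECT name, allowed_domains FROM spiders").fetchone()
```

In production the same schema lives in PostgreSQL, where the JSON columns become native JSON types.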

Extending ScrapAI

Adding a New Extractor

Create a new extractor class in core/extractors.py:
from core.extractors import BaseExtractor

class MyCustomExtractor(BaseExtractor):
    def extract(self, response):
        return {
            'title': response.css('h1::text').get(),
            'content': response.css('article::text').getall(),
            'author': response.css('.author::text').get(),
            'published_date': None
        }
Register it in the extraction chain:
{
  "settings": {
    "EXTRACTOR_ORDER": ["my_custom", "newspaper", "trafilatura"]
  }
}
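EXTRACTOR_ORDER implies a fallback chain: each extractor is tried in the configured order, and the first one that yields usable content wins. A sketch of that control flow (names and the "non-empty content" success test are assumptions for illustration):

```python
def run_extraction_chain(response, extractors, order):
    """Try extractors in the configured order; first usable result wins."""
    for name in order:
        extractor = extractors.get(name)
        if extractor is None:
            continue  # unknown name in EXTRACTOR_ORDER, skip it
        result = extractor.extract(response)
        if result and result.get("content"):
            return result
    return None

class StubExtractor:
    """Stand-in for newspaper/trafilatura/custom extractors."""
    def __init__(self, result):
        self.result = result
    def extract(self, response):
        return self.result

extractors = {
    "my_custom": StubExtractor({"title": None, "content": None}),     # yields nothing
    "newspaper": StubExtractor({"title": "Hello", "content": "Body"}),  # succeeds
}
item = run_extraction_chain(None, extractors, ["my_custom", "newspaper", "trafilatura"])
```

Putting your extractor first means it gets priority, with the library extractors as fallbacks.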

Adding Custom Middleware

Add middleware to middlewares.py:
class MyCustomMiddleware:
    def process_request(self, request, spider):
        # Modify request before it's sent
        return None

    def process_response(self, request, response, spider):
        # Modify response after it's received
        return response
Enable it in scrapy_settings.py:
DOWNLOADER_MIDDLEWARES = {
    'middlewares.MyCustomMiddleware': 350,
}

Adding CLI Commands

Add commands to cli/:
# cli/mycommand.py
import click

@click.command()
@click.argument('spider_name')
@click.option('--project', required=True)
def mycommand(spider_name, project):
    """My custom command"""
    click.echo(f"Running command for {spider_name}")
Register in cli/__init__.py:
from cli.mycommand import mycommand

cli.add_command(mycommand)

Storage Modes

Test mode (--limit N): saves to the database; inspect results via the show command
./scrapai crawl myspider --project news --limit 10
./scrapai show myspider --project news
Production mode (no limit): exports to timestamped JSONL files and enables checkpointing
./scrapai crawl myspider --project news
# Creates: data/news/myspider/2026-03-01_143022.jsonl
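The export path follows a simple pattern: data/&lt;project&gt;/&lt;spider&gt;/&lt;timestamp&gt;.jsonl. A sketch of generating it, with the timestamp format inferred from the example filename above:

```python
from datetime import datetime
from pathlib import Path

def export_path(project: str, spider: str, now: datetime) -> Path:
    """Build a timestamped JSONL path like the example above."""
    stamp = now.strftime("%Y-%m-%d_%H%M%S")
    return Path("data") / project / spider / f"{stamp}.jsonl"

path = export_path("news", "myspider", datetime(2026, 3, 1, 14, 30, 22))
# path == Path("data/news/myspider/2026-03-01_143022.jsonl")
```

One file per crawl run keeps exports append-only and easy to diff between runs.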

Migrating Existing Scrapers

Point the agent at your existing Python scripts (Scrapy spiders, BeautifulSoup, Scrapling, whatever) and it’ll read them, understand the extraction logic, and write the equivalent ScrapAI JSON config.
You: "Migrate my spider at scripts/bbc_spider.py to ScrapAI"
Agent: [Reads Python, extracts URL patterns and selectors, writes JSON config, tests, saves to database]
Your existing scrapers keep running while you verify. No big bang migration required.

Security

All input is validated through Pydantic schemas. Spider configs, URLs, and settings are validated before touching the database or crawler. SQL queries use parameterized bindings. ScrapAI uses a config-only architecture where agents write JSON, not code. See Security-First Design for the full security model.
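ScrapAI's actual validation uses Pydantic schemas; the same idea can be illustrated with plain stdlib checks. In this sketch, the spider-name rule (`[a-z0-9_]+`) and the allowed URL schemes are assumptions, not the project's real constraints:

```python
import re
from urllib.parse import urlparse

def validate_config(config: dict) -> list[str]:
    """Reject a bad config before it reaches the database or crawler."""
    errors = []
    name = config.get("name", "")
    # Assumed naming rule: lowercase letters, digits, underscores only
    if not re.fullmatch(r"[a-z0-9_]+", name):
        errors.append(f"invalid spider name: {name!r}")
    for url in config.get("start_urls", []):
        parsed = urlparse(url)
        # Only http(s) URLs with a host are accepted
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            errors.append(f"invalid start URL: {url!r}")
    return errors

errors = validate_config({"name": "bbc_co_uk",
                          "start_urls": ["https://www.bbc.co.uk/news"]})
# errors == []
```

Because agents emit JSON rather than Python, validation like this is the whole trust boundary: a config that passes the schema cannot execute arbitrary code.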

Contributing

Contributions welcome. Areas where help would be particularly valuable:

Structural Change Detection

Automatic detection of website structural changes

Extraction Modules

Additional extraction modules (images, tables, PDFs)

Anti-Bot Support

Anti-bot support beyond Cloudflare

Authentication

Authentication and session management

Development Setup

git clone https://github.com/discourselab/scrapai-cli.git
cd scrapai-cli
./scrapai setup
./scrapai verify

Running Tests

# Run all tests
pytest

# Run specific test file
pytest tests/test_extractors.py

# Run with coverage
pytest --cov=core --cov=cli --cov-report=html

Code Style

We follow PEP 8 with these exceptions:
  • Line length: 120 characters
  • Docstrings: Google style
# Format code
black .

# Check linting
flake8 core/ cli/ spiders/

Limitations

Current limitations (pull requests welcome):
  • Authentication: No login support, no paywall bypass, no persistent sessions
  • Advanced anti-bot: We handle Cloudflare. Not DataDome, PerimeterX, Akamai, or CAPTCHA-solving services
  • Interactive content: No form submission, no click-based pagination
The codebase is designed to be extended. The crawling infrastructure is done; what’s missing is mostly parsing logic for additional content types.

See Also

  • Architecture: technical architecture and design decisions
  • Spider Schema: complete JSON schema reference
  • Custom Callbacks: write custom field extractors
  • Security: security model and validation