ScrapAI is a database-backed scraping orchestration layer built on Scrapy. AI agents generate JSON spider configs instead of writing Python files.

Analysis and Review

During site analysis, agents write detailed notes in sections.md documenting URL patterns, site structure, and extraction strategy. Review the analysis, correct assumptions, and refine the approach before finalizing configs.

Full Control

Write configs by hand, edit generated ones, override settings per spider, or write custom callbacks with your own CSS/XPath selectors:
./scrapai spiders import my_config.json --project myproject

Team Benefits

All configs follow the same schema. Uniform structure across the fleet means easier code review, debugging, and onboarding. One developer can pick up another’s spider without decoding personal style choices.

Architecture

ScrapAI is an orchestration layer on top of Scrapy. Instead of writing a Python spider file per website, an AI agent generates a JSON config and stores it in a database. A single generic spider (DatabaseSpider) loads any config at runtime.
You (plain English) → AI Agent → JSON config → Database → Scrapy crawl
                       (once)                               (forever)
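Conceptually, DatabaseSpider turns stored JSON rows into the rule objects Scrapy needs at crawl time. A minimal pure-Python sketch of that translation step (function and field names here are illustrative, not the actual implementation):

```python
import json
import re

def build_rules(config: dict) -> list[dict]:
    """Translate a stored JSON config into rule dicts that a generic
    spider could hand to Scrapy's Rule/LinkExtractor machinery."""
    rules = []
    for rule in config.get("rules", []):
        rules.append({
            "allow": [re.compile(p) for p in rule.get("allow", [])],
            "callback": rule.get("callback"),  # None means "follow only"
            "follow": rule.get("follow", False),
        })
    return rules

config = json.loads("""
{
  "name": "bbc_co_uk",
  "rules": [
    {"allow": ["/news/articles/[^/]+$"], "callback": "parse_article", "follow": false}
  ]
}
""")

rules = build_rules(config)
```

The point is that the spider code never changes; only the data driving it does.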

Component Overview

Component                         What it does
scrapai                           Entry point, auto-activates venv, delegates to CLI
cli/                              Click-based CLI: spiders, queue, crawl, show, export, inspect
spiders/database_spider.py        Generic spider that loads config from database at runtime
spiders/sitemap_spider.py         Sitemap-based spider for sites with XML sitemaps
core/extractors.py                Extraction chain: newspaper, trafilatura, custom CSS, Playwright
core/models.py                    SQLAlchemy models: Spider, SpiderRule, SpiderSetting, ScrapedItem
handlers/cloudflare_handler.py    Cloudflare bypass with cookie caching
middlewares.py                    SmartProxyMiddleware, direct-to-proxy escalation
pipelines.py                      Batched database writes and JSONL export
alembic/                          Database migrations
airflow/                          Production scheduling with Apache Airflow

Codebase

Small and readable: ~4,000 lines of code. Built on Scrapy, SQLAlchemy, and Alembic, tools you already know. Read the whole thing in an afternoon. Measured with pygount, counting actual code lines only (no blanks, comments, or docstrings); tests, examples, and docs are excluded.
Metric           Count
Files            37
Code Lines       4,028
Comment Lines    895
Comment %        14%
Compare this to other scraping frameworks:
  • Scrapling: 5,875 lines (21% comments)
  • crawl4ai: 26,850 lines (21% comments)
ScrapAI is intentionally small. The complexity lives in Scrapy, SQLAlchemy, and the extraction libraries. Our contribution is the orchestration.

Writing Spider Configs

Here’s what an AI-generated spider config looks like:
{
  "name": "bbc_co_uk",
  "allowed_domains": ["bbc.co.uk"],
  "start_urls": ["https://www.bbc.co.uk/news"],
  "rules": [
    {
      "allow": ["/news/articles/[^/]+$"],
      "callback": "parse_article",
      "follow": false
    },
    {
      "allow": ["/news/?$"],
      "follow": true
    }
  ],
  "settings": {
    "EXTRACTOR_ORDER": ["newspaper", "trafilatura"],
    "DOWNLOAD_DELAY": 2
  }
}
You can write this by hand; no AI is needed. See Spider Schema for the complete specification.

Custom Extractors

For non-article content (products, jobs, listings), write custom callbacks with field-level selectors:
{
  "callbacks": {
    "parse_product": {
      "extract": {
        "title": {"css": "h1.product-name::text"},
        "price": {
          "css": "span.price::text",
          "processors": [
            {"type": "strip"},
            {"type": "regex", "pattern": "\\$([\\d.]+)"},
            {"type": "cast", "to": "float"}
          ]
        },
        "rating": {
          "css": "div.stars::attr(data-rating)",
          "processors": [{"type": "cast", "to": "float"}]
        }
      }
    }
  }
}
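The processors above form a pipeline that each raw extracted value passes through in order. A hedged sketch of how such a chain might be applied (illustrative only, not the actual ScrapAI implementation):

```python
import re

def apply_processors(value, processors):
    """Run a raw value through a list of processor specs, in order."""
    for proc in processors:
        if proc["type"] == "strip":
            value = value.strip()
        elif proc["type"] == "regex":
            match = re.search(proc["pattern"], value)
            value = match.group(1) if match else value
        elif proc["type"] == "cast" and proc["to"] == "float":
            value = float(value)
    return value

price = apply_processors("  $19.99  ", [
    {"type": "strip"},
    {"type": "regex", "pattern": r"\$([\d.]+)"},
    {"type": "cast", "to": "float"},
])
# price == 19.99
```

Ordering matters: the regex capture has to run before the cast, or `float("$19.99")` would raise.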
See Custom Callbacks for complete examples.

Database Schema

All configuration lives in PostgreSQL (or SQLite for development):

Spider Table

CREATE TABLE spiders (
    id INTEGER PRIMARY KEY,
    name VARCHAR(255) UNIQUE NOT NULL,
    project VARCHAR(255) NOT NULL,
    allowed_domains JSON NOT NULL,
    start_urls JSON NOT NULL,
    created_at TIMESTAMP,
    updated_at TIMESTAMP
);

SpiderRule Table

CREATE TABLE spider_rules (
    id INTEGER PRIMARY KEY,
    spider_id INTEGER REFERENCES spiders(id),
    allow_patterns JSON,
    deny_patterns JSON,
    callback VARCHAR(255),
    follow BOOLEAN,
    priority INTEGER
);

SpiderSetting Table

CREATE TABLE spider_settings (
    id INTEGER PRIMARY KEY,
    spider_id INTEGER REFERENCES spiders(id),
    key VARCHAR(255) NOT NULL,
    value TEXT NOT NULL
);

ScrapedItem Table

CREATE TABLE scraped_items (
    id INTEGER PRIMARY KEY,
    spider_name VARCHAR(255),
    project VARCHAR(255),
    url TEXT,
    title TEXT,
    content TEXT,
    author VARCHAR(255),
    published_date TIMESTAMP,
    metadata_json JSON,
    scraped_at TIMESTAMP
);
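The DDL above maps directly onto SQLite for development. A quick sketch using Python's stdlib sqlite3 with parameterized bindings (JSON columns are stored as serialized text here; this mirrors the schema but is not ScrapAI's own persistence code):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE spiders (
    id INTEGER PRIMARY KEY,
    name VARCHAR(255) UNIQUE NOT NULL,
    project VARCHAR(255) NOT NULL,
    allowed_domains JSON NOT NULL,
    start_urls JSON NOT NULL
)
""")
# Parameterized bindings, never string interpolation
conn.execute(
    "INSERT INTO spiders (name, project, allowed_domains, start_urls)"
    " VALUES (?, ?, ?, ?)",
    ("bbc_co_uk", "news",
     json.dumps(["bbc.co.uk"]),
     json.dumps(["https://www.bbc.co.uk/news"])),
)
row = conn.execute("SELECT name, allowed_domains FROM spiders").fetchone()
```

In production the same schema lives in PostgreSQL, where the JSON columns become native JSON types.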

Extending ScrapAI

Adding a New Extractor

Create a new extractor class in core/extractors.py:
from core.extractors import BaseExtractor

class MyCustomExtractor(BaseExtractor):
    def extract(self, response):
        return {
            'title': response.css('h1::text').get(),
            'content': response.css('article::text').getall(),
            'author': response.css('.author::text').get(),
            'published_date': None
        }
Register it in the extraction chain:
{
  "settings": {
    "EXTRACTOR_ORDER": ["my_custom", "newspaper", "trafilatura"]
  }
}
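EXTRACTOR_ORDER implies a fallback chain: each extractor is tried in the configured order, and the first one that yields usable content wins. A sketch of that control flow (names and the "non-empty content" success test are assumptions for illustration):

```python
def run_extraction_chain(response, extractors, order):
    """Try extractors in the configured order; first usable result wins."""
    for name in order:
        extractor = extractors.get(name)
        if extractor is None:
            continue  # unknown name in EXTRACTOR_ORDER, skip it
        result = extractor.extract(response)
        if result and result.get("content"):
            return result
    return None

class StubExtractor:
    """Stand-in for newspaper/trafilatura/custom extractors."""
    def __init__(self, result):
        self.result = result
    def extract(self, response):
        return self.result

extractors = {
    "my_custom": StubExtractor({"title": None, "content": None}),     # yields nothing
    "newspaper": StubExtractor({"title": "Hello", "content": "Body"}),  # succeeds
}
item = run_extraction_chain(None, extractors, ["my_custom", "newspaper", "trafilatura"])
```

Putting your extractor first means it gets priority, with the library extractors as fallbacks.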

Adding Custom Middleware

Add middleware to middlewares.py:
class MyCustomMiddleware:
    def process_request(self, request, spider):
        # Modify request before it's sent
        return None

    def process_response(self, request, response, spider):
        # Modify response after it's received
        return response
Enable it in scrapy_settings.py:
DOWNLOADER_MIDDLEWARES = {
    'middlewares.MyCustomMiddleware': 350,
}

Adding CLI Commands

Add commands to cli/:
# cli/mycommand.py
import click

@click.command()
@click.argument('spider_name')
@click.option('--project', required=True)
def mycommand(spider_name, project):
    """My custom command"""
    click.echo(f"Running command for {spider_name}")
Register in cli/__init__.py:
from cli.mycommand import mycommand

cli.add_command(mycommand)

Storage Modes

Test mode (--limit N): saves to the database; inspect results via the show command
./scrapai crawl myspider --project news --limit 10
./scrapai show myspider --project news
Production mode (no limit): exports to timestamped JSONL files and enables checkpointing
./scrapai crawl myspider --project news
# Creates: data/news/myspider/2026-03-01_143022.jsonl
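The export path follows a simple pattern: data/&lt;project&gt;/&lt;spider&gt;/&lt;timestamp&gt;.jsonl. A sketch of generating it, with the timestamp format inferred from the example filename above:

```python
from datetime import datetime
from pathlib import Path

def export_path(project: str, spider: str, now: datetime) -> Path:
    """Build a timestamped JSONL path like the example above."""
    stamp = now.strftime("%Y-%m-%d_%H%M%S")
    return Path("data") / project / spider / f"{stamp}.jsonl"

path = export_path("news", "myspider", datetime(2026, 3, 1, 14, 30, 22))
# path == Path("data/news/myspider/2026-03-01_143022.jsonl")
```

One file per crawl run keeps exports append-only and easy to diff between runs.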

Migrating Existing Scrapers

Point the agent at your existing Python scripts (Scrapy spiders, BeautifulSoup, Scrapling, whatever) and it’ll read them, understand the extraction logic, and write the equivalent ScrapAI JSON config.
You: "Migrate my spider at scripts/bbc_spider.py to ScrapAI"
Agent: [Reads Python, extracts URL patterns and selectors, writes JSON config, tests, saves to database]
Your existing scrapers keep running while you verify. No big bang migration required.

Security

All input is validated through Pydantic schemas. Spider configs, URLs, and settings are validated before touching the database or crawler. SQL queries use parameterized bindings. ScrapAI uses a config-only architecture where agents write JSON, not code. See Security-First Design for the full security model.
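ScrapAI's actual validation uses Pydantic schemas; the same idea can be illustrated with plain stdlib checks. In this sketch, the spider-name rule (`[a-z0-9_]+`) and the allowed URL schemes are assumptions, not the project's real constraints:

```python
import re
from urllib.parse import urlparse

def validate_config(config: dict) -> list[str]:
    """Reject a bad config before it reaches the database or crawler."""
    errors = []
    name = config.get("name", "")
    # Assumed naming rule: lowercase letters, digits, underscores only
    if not re.fullmatch(r"[a-z0-9_]+", name):
        errors.append(f"invalid spider name: {name!r}")
    for url in config.get("start_urls", []):
        parsed = urlparse(url)
        # Only http(s) URLs with a host are accepted
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            errors.append(f"invalid start URL: {url!r}")
    return errors

errors = validate_config({"name": "bbc_co_uk",
                          "start_urls": ["https://www.bbc.co.uk/news"]})
# errors == []
```

Because agents emit JSON rather than Python, validation like this is the whole trust boundary: a config that passes the schema cannot execute arbitrary code.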

Contributing

Contributions welcome. Areas where help would be particularly valuable:

Structural Change Detection

Automatic detection of website structural changes

Extraction Modules

Additional extraction modules (images, tables, PDFs)

Anti-Bot Support

Anti-bot support beyond Cloudflare

Authentication

Authentication and session management

Development Setup

git clone https://github.com/discourselab/scrapai-cli.git
cd scrapai-cli
./scrapai setup
./scrapai verify

Running Tests

# Run all tests
pytest

# Run specific test file
pytest tests/test_extractors.py

# Run with coverage
pytest --cov=core --cov=cli --cov-report=html

Code Style

We follow PEP 8 with these exceptions:
  • Line length: 120 characters
  • Docstrings: Google style
# Format code
black .

# Check linting
flake8 core/ cli/ spiders/

Limitations

Current limitations (pull requests welcome):
  • Authentication: No login support, no paywall bypass, no persistent sessions
  • Advanced anti-bot: We handle Cloudflare. Not DataDome, PerimeterX, Akamai, or CAPTCHA-solving services
  • Interactive content: No form submission, no click-based pagination
The codebase is designed to be extended. The crawling infrastructure is done; what’s missing is mostly parsing logic for additional content types.

See Also

  • Architecture: technical architecture and design decisions
  • Spider Schema: complete JSON schema reference
  • Custom Callbacks: write custom field extractors
  • Security: security model and validation