ScrapAI is an orchestration layer on top of Scrapy. Instead of writing Python spider files, an AI agent generates JSON configs stored in a database. A single generic spider loads any config at runtime.

High-Level Architecture

Component Breakdown

Entry Point: scrapai Script

scrapai Script
#!/usr/bin/env bash
# Auto-activates virtualenv, delegates to CLI
./scrapai crawl bbc_co_uk --project news
The scrapai entry point:
  • Auto-activates the virtual environment (no manual source venv/bin/activate)
  • Delegates commands to the Click-based CLI
  • Handles environment setup and validation

CLI Layer (cli/)

Built with Click, the CLI provides commands for:

Spider Management

spiders list, spiders import, spiders delete

Crawling

crawl <spider> with test mode (--limit) and production mode

Data Access

show <spider>, export <spider> (CSV/JSON/JSONL/Parquet)

Queue Management

queue add, queue bulk, queue list, queue next
CLI Structure
cli/
├── __init__.py        # Main CLI entry point
├── spiders.py         # Spider CRUD commands
├── crawl.py           # Crawl execution
├── data.py            # Show and export commands
├── queue.py           # Batch processing queue
└── inspect.py         # URL inspection tool

Database Layer (core/models.py, core/db.py)

ScrapAI uses SQLAlchemy with support for both SQLite (default) and PostgreSQL (production).

Core Models

Spider Model
class Spider(Base):
    __tablename__ = "spiders"
    
    id = Column(Integer, primary_key=True)
    name = Column(String, unique=True, index=True)
    allowed_domains = Column(JSON)  # ["example.com"]
    start_urls = Column(JSON)       # ["https://example.com"]
    source_url = Column(String)     # Original URL provided by user
    active = Column(Boolean)        # Enable/disable without deletion
    project = Column(String)        # Project grouping
    callbacks_config = Column(JSON) # Custom callback definitions
    created_at = Column(DateTime)
    updated_at = Column(DateTime)
    
    # Relationships
    rules = relationship("SpiderRule")
    settings = relationship("SpiderSetting")
    items = relationship("ScrapedItem")
Key Point: Spiders are rows, not files. Adding a website means inserting a row.
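The same idea can be sketched with the standard library's sqlite3 module (a hypothetical minimal table; ScrapAI's real schema is the SQLAlchemy model above): adding a website is a single INSERT, not a new Python file.

```python
import json
import sqlite3

# In-memory DB stands in for scrapai.db; columns mirror the Spider model.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE spiders (
           id INTEGER PRIMARY KEY,
           name TEXT UNIQUE,
           allowed_domains TEXT,  -- JSON-encoded list
           start_urls TEXT,       -- JSON-encoded list
           active INTEGER DEFAULT 1
       )"""
)

def add_spider(name, domains, urls):
    """Adding a website == inserting a row."""
    conn.execute(
        "INSERT INTO spiders (name, allowed_domains, start_urls) VALUES (?, ?, ?)",
        (name, json.dumps(domains), json.dumps(urls)),
    )

add_spider("bbc_co_uk", ["bbc.co.uk"], ["https://www.bbc.co.uk/news"])
row = conn.execute(
    "SELECT allowed_domains FROM spiders WHERE name = ?", ("bbc_co_uk",)
).fetchone()
print(json.loads(row[0]))  # ['bbc.co.uk']
```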

Database Configuration

core/db.py
import os

from sqlalchemy import create_engine, event
from sqlalchemy.orm import sessionmaker

# SQLite (default) or PostgreSQL from .env
DATABASE_URL = os.getenv("DATABASE_URL", "sqlite:///scrapai.db")

engine = create_engine(DATABASE_URL)

# SQLite optimizations
@event.listens_for(engine, "connect")
def set_sqlite_pragma(dbapi_conn, connection_record):
    if "sqlite" in DATABASE_URL:
        cursor = dbapi_conn.cursor()
        cursor.execute("PRAGMA journal_mode=WAL")  # Write-Ahead Logging
        cursor.execute("PRAGMA synchronous=NORMAL")
        cursor.execute("PRAGMA cache_size=-64000")  # 64MB cache
        cursor.close()
SQLite vs PostgreSQL: SQLite is perfect for development and small-scale production. Switch to PostgreSQL for multi-user access or high concurrency.

Spider Layer (spiders/database_spider.py)

The magic happens here: one spider class for all websites.
DatabaseSpider Core
class DatabaseSpider(BaseDBSpiderMixin, CrawlSpider):
    name = "database_spider"
    
    def __init__(self, spider_name=None, *args, **kwargs):
        self.spider_name = spider_name
        self._load_config()  # Load from database
        super().__init__(*args, **kwargs)
    
    def _load_config(self):
        """Load spider configuration from database"""
        db = next(get_db())
        spider = db.query(Spider).filter(Spider.name == self.spider_name).first()
        
        if not spider:
            raise ValueError(f"Spider '{self.spider_name}' not found")
        
        # Apply config to spider instance
        self.allowed_domains = spider.allowed_domains
        self.start_urls = spider.start_urls
        
        # Compile rules from database
        self.rules = []
        for r in spider.rules:
            le_kwargs = {}
            if r.allow_patterns:
                le_kwargs["allow"] = r.allow_patterns
            if r.deny_patterns:
                le_kwargs["deny"] = r.deny_patterns
            
            self.rules.append(
                Rule(LinkExtractor(**le_kwargs), 
                     callback=r.callback, 
                     follow=r.follow)
            )
Execution Flow:
  1. CLI runs: ./scrapai crawl bbc_co_uk --project news
  2. DatabaseSpider instantiated with spider_name="bbc_co_uk"
  3. _load_config() queries the database for bbc_co_uk spider
  4. Config applied: domains, URLs, rules, settings
  5. Scrapy engine starts crawling with the loaded config
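Ignoring Scrapy internals, the config-to-rule translation in step 4 can be sketched in plain Python; regex matching stands in for LinkExtractor, and the dict keys mirror the SpiderRule columns (names are illustrative, not ScrapAI's API):

```python
import re
from dataclasses import dataclass, field

@dataclass
class Rule:
    """Minimal stand-in for a compiled crawl rule."""
    allow: list = field(default_factory=list)
    deny: list = field(default_factory=list)
    callback: str = None
    follow: bool = True

    def matches(self, url):
        # Deny patterns win; an empty allow list matches everything.
        if any(re.search(p, url) for p in self.deny):
            return False
        return not self.allow or any(re.search(p, url) for p in self.allow)

def compile_rules(rule_rows):
    """Translate database rows (plain dicts here) into Rule objects."""
    return [
        Rule(
            allow=row.get("allow_patterns") or [],
            deny=row.get("deny_patterns") or [],
            callback=row.get("callback"),
            follow=row.get("follow", True),
        )
        for row in rule_rows
    ]

rules = compile_rules([
    {"allow_patterns": [r"/news/"], "deny_patterns": [r"/news/live/"],
     "callback": "parse_article"},
])
print(rules[0].matches("https://www.bbc.co.uk/news/world-123"))  # True
print(rules[0].matches("https://www.bbc.co.uk/news/live/abc"))   # False
```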

Extraction Layer (core/extractors.py)

ScrapAI uses a fallback chain of extractors:
NewspaperExtractor
class NewspaperExtractor(BaseExtractor):
    def extract(self, url, html, title_hint=None):
        article = newspaper.Article(url)
        article.download(input_html=html)
        article.parse()
        
        return ScrapedArticle(
            url=url,
            title=article.title or title_hint,
            content=article.text,
            author=", ".join(article.authors),
            published_date=article.publish_date,
            source="newspaper4k",
            metadata={
                "top_image": article.top_image,
                "keywords": article.keywords
            }
        )
Best for: News articles, blogs, standard article layouts
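The fallback chain itself reduces to "try each extractor in order, return the first non-None result". A minimal sketch (the extractor callables here are placeholders, not the real newspaper4k/trafilatura wrappers):

```python
def extract_with_fallback(extractors, url, html):
    """Try each (name, extractor) pair in order; return the first hit."""
    for name, extractor in extractors:
        try:
            result = extractor(url, html)
        except Exception:
            result = None  # a crash counts as a miss; move down the chain
        if result is not None:
            return name, result
    return None, None

# Placeholder extractors: the first fails, the second succeeds.
chain = [
    ("newspaper", lambda url, html: None),
    ("trafilatura", lambda url, html: {"title": "Example", "content": html}),
]
name, article = extract_with_fallback(chain, "https://example.com", "<p>body</p>")
print(name)  # trafilatura
```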
Extractor Configuration:
Spider Settings
{
  "settings": {
    "EXTRACTOR_ORDER": ["newspaper", "trafilatura"],
    "CUSTOM_SELECTORS": {
      "title": "h1.article-title",
      "content": "div.article-body"
    }
  }
}

Handlers and Middleware

The Cloudflare handler bypasses Cloudflare protection using nodriver for browser automation.
How it works:
  1. Browser solves the Cloudflare challenge once
  2. Extract session cookies (cf_clearance)
  3. Switch to fast HTTP requests with cached cookies
  4. Refresh cookies every 10 minutes
Performance: ~5-10s for initial challenge, then ~200-500ms per request.
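The cookie-caching step can be sketched as a timestamped per-domain cache; solve_challenge is a placeholder for the slow browser step, and the 600-second TTL mirrors the 10-minute refresh above:

```python
import time

COOKIE_TTL = 600  # refresh cookies every 10 minutes

class CookieCache:
    """Cache Cloudflare clearance cookies per domain, re-solving on expiry."""
    def __init__(self, solve_challenge):
        self.solve_challenge = solve_challenge  # slow browser step (~5-10s)
        self._cache = {}  # domain -> (cookies, solved_at)

    def get(self, domain, now=None):
        now = time.time() if now is None else now
        entry = self._cache.get(domain)
        if entry is None or now - entry[1] > COOKIE_TTL:
            cookies = self.solve_challenge(domain)  # browser solves once
            self._cache[domain] = (cookies, now)
            return cookies
        return entry[0]  # fast path: reuse cached cookies

solves = []
cache = CookieCache(lambda d: solves.append(d) or {"cf_clearance": "token"})
cache.get("example.com", now=0)
cache.get("example.com", now=100)   # within TTL: no re-solve
cache.get("example.com", now=700)   # expired: browser solves again
print(len(solves))  # 2
```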
Enable Cloudflare Bypass
{
  "settings": {
    "CLOUDFLARE_ENABLED": true
  }
}
The proxy middleware automatically escalates to proxies when requests are blocked (403/429 errors).
Escalation Flow:
  1. Start with direct connections
  2. On 403/429, retry with a datacenter proxy
  3. Remember the domain for future requests
  4. Residential proxies require explicit opt-in
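The escalation flow amounts to a per-domain memory of which tier last worked. A sketch with placeholder fetch functions (the real middleware wires this into Scrapy's retry machinery):

```python
from urllib.parse import urlparse

BLOCK_CODES = {403, 429}

class ProxyEscalation:
    """Remember, per domain, whether direct requests get blocked."""
    def __init__(self):
        self.needs_proxy = set()  # domains that required a datacenter proxy

    def fetch(self, url, fetch_direct, fetch_via_proxy):
        domain = urlparse(url).netloc
        if domain not in self.needs_proxy:
            status, body = fetch_direct(url)
            if status not in BLOCK_CODES:
                return body
            self.needs_proxy.add(domain)  # remember for future requests
        return fetch_via_proxy(url)[1]

esc = ProxyEscalation()
direct = lambda url: (403, None)          # direct connection is blocked
proxied = lambda url: (200, "page body")  # datacenter proxy succeeds
print(esc.fetch("https://example.com/a", direct, proxied))  # page body
# Later requests to the same domain skip the doomed direct attempt:
print("example.com" in esc.needs_proxy)  # True
```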
Configuration:
.env
DATACENTER_PROXY_USERNAME=your_username
DATACENTER_PROXY_PASSWORD=your_password
DATACENTER_PROXY_HOST=proxy.example.com
DATACENTER_PROXY_PORT=10000

Pipeline Layer (pipelines.py)

The item pipeline handles storage:
DatabasePipeline
class DatabasePipeline:
    def __init__(self):
        self.items_buffer = []
        self.buffer_size = 50  # Batch writes to cut transaction overhead
    
    def process_item(self, item, spider):
        self.items_buffer.append(item)
        
        if len(self.items_buffer) >= self.buffer_size:
            self._flush_to_db()
        
        return item
    
    def close_spider(self, spider):
        # Write out any items left in a partially filled buffer
        if self.items_buffer:
            self._flush_to_db()
    
    def _flush_to_db(self):
        db = next(get_db())
        try:
            for item_data in self.items_buffer:
                item = ScrapedItem(
                    spider_id=item_data["spider_id"],
                    url=item_data["url"],
                    title=item_data.get("title"),
                    content=item_data.get("content"),
                    # ... other fields
                )
                db.add(item)
            db.commit()
        finally:
            db.close()
        self.items_buffer.clear()
Storage Modes:
Test Mode
./scrapai crawl bbc_co_uk --project news --limit 10
  • Saves to database (scraped_items table)
  • Inspect with: ./scrapai show bbc_co_uk --project news
  • Export with: ./scrapai export bbc_co_uk --project news --format csv

Data Flow: End-to-End

1. User Runs Crawl Command

./scrapai crawl bbc_co_uk --project news --limit 5

2. CLI Invokes Scrapy

cli/crawl.py constructs the Scrapy process:
process = CrawlerProcess(settings)
process.crawl(DatabaseSpider, spider_name="bbc_co_uk")
process.start()

3. DatabaseSpider Loads Config

Queries the database for the bbc_co_uk spider and applies its domains, URLs, rules, and settings.

4. Scrapy Engine Starts

The scheduler queues start URLs, the downloader fetches pages, and the spider processes responses.

5. Extraction

For each response:
  • Try newspaper4k → trafilatura → custom CSS → Playwright
  • Return ScrapedArticle or None

6. Pipeline Storage

Items are buffered and batch-written to the database or JSONL files.

7. Output Available

./scrapai show bbc_co_uk --project news
./scrapai export bbc_co_uk --project news --format csv

Key Design Decisions

Generic Spider

One spider class loads any config at runtime. No code generation, no Python files per site.

Database as Config Store

Spiders are rows, not files. Change settings across 100 spiders with one SQL query.
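For example, deactivating every spider in a project is one UPDATE rather than a hundred file edits. A sqlite3 sketch with a hypothetical minimal schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE spiders (name TEXT, project TEXT, active INTEGER)")
conn.executemany(
    "INSERT INTO spiders VALUES (?, ?, 1)",
    [(f"spider_{i}", "news") for i in range(100)],
)

# One query changes all 100 spiders at once:
cur = conn.execute("UPDATE spiders SET active = 0 WHERE project = 'news'")
print(cur.rowcount)  # 100
```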

Fallback Extraction

Multiple extractors in a chain. If newspaper fails, try trafilatura. If that fails, try custom CSS.

Validation Before Execution

All configs validated through Pydantic schemas. Malformed configs fail before execution.
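ScrapAI uses Pydantic for this; the shape of the check can be sketched with a stdlib dataclass (field names follow the Spider model above, but the specific validation rules here are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class SpiderConfig:
    """Reject malformed configs before any crawl starts."""
    name: str
    allowed_domains: list
    start_urls: list = field(default_factory=list)

    def __post_init__(self):
        if not self.name:
            raise ValueError("spider name is required")
        if not self.allowed_domains:
            raise ValueError("at least one allowed domain is required")
        for url in self.start_urls:
            if not url.startswith(("http://", "https://")):
                raise ValueError(f"not an absolute URL: {url}")

SpiderConfig("bbc_co_uk", ["bbc.co.uk"], ["https://www.bbc.co.uk/news"])  # ok
try:
    SpiderConfig("bad", ["example.com"], ["ftp://example.com"])
except ValueError as e:
    print(e)  # not an absolute URL: ftp://example.com
```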

Next Steps