ScrapAI is an orchestration layer on top of Scrapy. Instead of writing Python spider files, an AI agent generates JSON configs stored in a database. A single generic spider loads any config at runtime.

High-Level Architecture

Simple flow: CLI stores spider configs in database → Scrapy loads config and crawls → Data exported to files or database

Component Breakdown

Entry Point: scrapai Script

#!/usr/bin/env bash
# Auto-activates virtualenv, delegates to CLI

# Example invocation:
./scrapai crawl bbc_co_uk --project news
The scrapai entry point:
  • Auto-activates the virtual environment (no manual source venv/bin/activate)
  • Delegates commands to the Click-based CLI
  • Handles environment setup and validation

CLI Layer (cli/)

Built with Click, the CLI provides commands for:

Spider Management

spiders list, spiders import, spiders delete

Crawling

crawl <spider> with test mode (--limit) and production mode

Data Access

show <spider>, export <spider> (CSV/JSON/JSONL/Parquet)

Queue Management

queue add, queue bulk, queue list, queue next
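A minimal sketch of how these command groups could be wired together with Click (command and option names follow this page; the bodies are illustrative stubs, not the real implementations):

```python
import click

@click.group()
def cli():
    """ScrapAI command-line interface (illustrative stub)."""

@cli.group()
def spiders():
    """Spider CRUD commands: list, import, delete."""

@spiders.command("list")
def spiders_list():
    click.echo("bbc_co_uk")  # stand-in for a database query

@cli.command()
@click.argument("spider")
@click.option("--project", required=True)
@click.option("--limit", type=int, default=None,
              help="Test mode: stop after N items")
def crawl(spider, project, limit):
    # --limit switches between the test and production modes described above
    mode = "test" if limit is not None else "production"
    click.echo(f"crawling {spider} ({project}) in {mode} mode")
```

Grouping subcommands under `spiders`, `queue`, etc. is what produces the `spiders list` / `queue add` style of invocation shown above.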
CLI Structure
cli/
├── __init__.py        # Main CLI entry point
├── spiders.py         # Spider CRUD commands
├── crawl.py           # Crawl execution
├── data.py            # Show and export commands
├── queue.py           # Batch processing queue
└── inspect.py         # URL inspection tool

Database Layer (core/models.py, core/db.py)

ScrapAI uses SQLAlchemy with support for both SQLite (default) and PostgreSQL (production).

Core Models

Spider

Stores spider configuration: name, domains, start URLs, project, callbacks

SpiderRule

URL patterns (allow/deny), callback mapping, follow behavior

SpiderSetting

Spider-specific settings (delays, concurrency, extractors)

ScrapedItem

Scraped data: URL, title, content, author, date, metadata
Key Point: Spiders are rows, not files. Adding a website means inserting a row.

Use SQLite (the default) for development and small-scale production, and PostgreSQL for multi-user access or high concurrency. Configure via DATABASE_URL in .env.
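A sketch of what two of these SQLAlchemy models might look like; the column names are inferred from the descriptions above and are assumptions, not the project's actual schema in core/models.py:

```python
from sqlalchemy import Column, Integer, String, JSON, ForeignKey, create_engine
from sqlalchemy.orm import declarative_base, relationship, Session

Base = declarative_base()

class Spider(Base):
    """One row per target website (illustrative columns)."""
    __tablename__ = "spiders"
    id = Column(Integer, primary_key=True)
    name = Column(String, unique=True, nullable=False)
    project = Column(String)
    domains = Column(JSON)      # e.g. ["bbc.co.uk"]
    start_urls = Column(JSON)   # e.g. ["https://www.bbc.co.uk/news"]
    rules = relationship("SpiderRule", back_populates="spider")

class SpiderRule(Base):
    """URL pattern -> callback mapping for one spider."""
    __tablename__ = "spider_rules"
    id = Column(Integer, primary_key=True)
    spider_id = Column(Integer, ForeignKey("spiders.id"))
    allow = Column(String)     # URL pattern to follow
    callback = Column(String)  # e.g. "parse_article"
    spider = relationship("Spider", back_populates="rules")

# "Adding a website means inserting a row":
engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)
with Session(engine) as session:
    session.add(Spider(name="bbc_co_uk", project="news",
                       domains=["bbc.co.uk"],
                       start_urls=["https://www.bbc.co.uk/news"]))
    session.commit()
```

Because the models are plain SQLAlchemy, the same code runs against the SQLite default or a PostgreSQL DATABASE_URL unchanged.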

Spider Layer (spiders/database_spider.py)

One spider class for all websites. DatabaseSpider loads config from the database at runtime:
  1. Instantiated with spider_name parameter
  2. Queries database for spider config
  3. Applies domains, URLs, rules, and settings
  4. Scrapy engine starts crawling with loaded config
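The runtime-loading pattern behind steps 1–3 can be sketched without Scrapy itself; here a plain dict stands in for the database, and the attribute names follow Scrapy's spider conventions:

```python
# Stand-in for the spider config table queried in step 2.
CONFIG_STORE = {
    "bbc_co_uk": {
        "allowed_domains": ["bbc.co.uk"],
        "start_urls": ["https://www.bbc.co.uk/news"],
        "settings": {"DOWNLOAD_DELAY": 1.0},
    }
}

class DatabaseSpider:
    """Illustrative sketch of the one-generic-spider pattern."""
    name = "database_spider"

    def __init__(self, spider_name):
        # Step 2: query the config store for this spider's definition
        config = CONFIG_STORE.get(spider_name)
        if config is None:
            raise ValueError(f"No spider config named {spider_name!r}")
        # Step 3: apply domains, URLs, and settings to this instance
        self.allowed_domains = config["allowed_domains"]
        self.start_urls = config["start_urls"]
        self.custom_settings = config["settings"]
```

The real class subclasses a Scrapy spider and queries the database, but the shape is the same: configuration is data looked up at instantiation time, not code.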

Extraction Layer (core/extractors.py)

ScrapAI uses a fallback chain of extractors:

newspaper4k

News articles, blogs, standard article layouts

trafilatura

Articles, documentation, text-heavy content

Custom CSS

Non-standard layouts, structured data extraction with custom selectors
All extractors can use CloakBrowser for JS-heavy or Cloudflare-protected sites. Configure the extraction order in spider settings, e.g. "EXTRACTOR_ORDER": ["newspaper", "trafilatura"].
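The fallback chain reduces to a small loop over the configured order; the extractor functions below are stand-ins for the real newspaper4k / trafilatura / custom-CSS wrappers:

```python
def extract_with_newspaper(html):
    return None  # pretend newspaper4k found nothing usable here

def extract_with_trafilatura(html):
    return {"title": "Example", "text": "Body text"}

def extract_with_css(html):
    return None

# Registry keyed by the names used in EXTRACTOR_ORDER
EXTRACTORS = {
    "newspaper": extract_with_newspaper,
    "trafilatura": extract_with_trafilatura,
    "css": extract_with_css,
}

def extract(html, order=("newspaper", "trafilatura", "css")):
    """Try each configured extractor in turn; first non-None result wins."""
    for name in order:
        result = EXTRACTORS[name](html)
        if result is not None:
            return result
    return None  # every extractor failed
```

This is why reordering "EXTRACTOR_ORDER" per spider changes behavior without any code changes: the chain is data-driven.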

Handlers and Middleware

Cloudflare bypass uses CloakBrowser: it solves the challenge once, extracts the cookies, then continues over plain, fast HTTP. Enable with "CLOUDFLARE_ENABLED": true.

Proxy middleware auto-escalates on 403/429 errors: requests start direct, and blocked domains are remembered and routed through proxies from then on. Configure datacenter and residential proxies in .env.
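Stripped of Scrapy's middleware plumbing, the escalation logic might look like this (class and method names are hypothetical; the real middleware hooks into Scrapy's response processing):

```python
BLOCK_CODES = {403, 429}  # statuses that trigger escalation

class ProxyEscalator:
    """Start direct; remember domains that block us and proxy them."""

    def __init__(self, proxy_url):
        self.proxy_url = proxy_url
        self.blocked_domains = set()

    def proxy_for(self, domain):
        """Return a proxy URL for blocked domains, None for direct access."""
        return self.proxy_url if domain in self.blocked_domains else None

    def record_response(self, domain, status):
        """Called per response; a 403/429 flags the domain for proxying."""
        if status in BLOCK_CODES:
            self.blocked_domains.add(domain)
```

Keeping the blocked set per-domain means one hostile site does not force every other crawl through (slower, costlier) proxies.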

Pipeline Layer (pipelines.py)

Handles storage with batched writes (50 items per batch).

Storage modes:
  • Test mode (--limit N): saves to database for inspection
  • Production mode: exports to timestamped JSONL files, enables checkpoint pause/resume
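The batched-write pattern can be sketched as follows (the batch size of 50 comes from this page; the writer callback stands in for the database and JSONL backends):

```python
class BatchedPipeline:
    """Buffer items and hand them to a writer in fixed-size batches."""

    def __init__(self, write_batch, batch_size=50):
        self.write_batch = write_batch  # callable(list_of_items)
        self.batch_size = batch_size
        self.buffer = []

    def process_item(self, item):
        self.buffer.append(item)
        if len(self.buffer) >= self.batch_size:
            self.flush()
        return item

    def flush(self):
        if self.buffer:
            self.write_batch(self.buffer)
            self.buffer = []

    def close_spider(self):
        self.flush()  # don't lose a partial final batch
```

Flushing on close is the important detail: without it, up to 49 items from the tail of a crawl would silently vanish.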

Data Flow: End-to-End

1. User Runs Crawl Command

./scrapai crawl bbc_co_uk --project news --limit 5
2. CLI Invokes Scrapy

cli/crawl.py runs Scrapy in-process via its crawler API:
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings)
process.crawl(DatabaseSpider, spider_name="bbc_co_uk")
process.start()
3. DatabaseSpider Loads Config

Queries database for bbc_co_uk spider, applies domains/URLs/rules/settings.
4. Scrapy Engine Starts

Scheduler queues start URLs, Downloader fetches pages, Spider processes responses.
5. Extraction

For each response:
  • Try newspaper4k → trafilatura → custom CSS → Playwright
  • Return ScrapedArticle or None
6. Pipeline Storage

Items buffered and batch-written to database or JSONL files.
7. Output Available

./scrapai show bbc_co_uk --project news
./scrapai export bbc_co_uk --project news --format csv

Key Design Decisions

Generic Spider

One spider class loads any config at runtime. No code generation, no Python files per site.

Database as Config Store

Spiders are rows, not files. Change settings across 100 spiders with one SQL query.
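To make the "one SQL query" point concrete, here is a toy sqlite3 session against a hypothetical spider_settings table (the real schema in core/models.py differs):

```python
import sqlite3

# Hypothetical minimal schema: one settings row per spider.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE spider_settings (spider TEXT, key TEXT, value TEXT)")
con.executemany(
    "INSERT INTO spider_settings VALUES (?, 'DOWNLOAD_DELAY', '0.5')",
    [(f"spider_{i}",) for i in range(100)],
)

# One UPDATE changes the delay for all 100 spiders at once:
cur = con.execute(
    "UPDATE spider_settings SET value = '2.0' WHERE key = 'DOWNLOAD_DELAY'"
)
con.commit()
```

With file-per-site spiders the same change would mean editing (and redeploying) 100 Python files.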

Fallback Extraction

Multiple extractors in a chain. If newspaper fails, try trafilatura. If that fails, try custom CSS.

Validation Before Execution

All configs validated through Pydantic schemas. Malformed configs fail before execution.
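A sketch of validation-before-execution with Pydantic (the field names are illustrative, not the project's actual schema):

```python
from typing import Optional
from pydantic import BaseModel, ValidationError

class SpiderConfig(BaseModel):
    """Illustrative config schema; the real models have more fields."""
    name: str
    project: str
    domains: list[str]
    start_urls: list[str]

def load_config(raw: dict) -> Optional[SpiderConfig]:
    """Reject a malformed config before any crawling starts."""
    try:
        return SpiderConfig(**raw)
    except ValidationError:
        return None
```

A config missing required fields (or with the wrong types) fails here, at load time, rather than partway through a crawl.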

Next Steps

Database-First Philosophy

Learn why spiders live in the database

Extractors Guide

Understand the extraction chain in detail