High-Level Architecture
Simple flow: CLI stores spider configs in the database → Scrapy loads the config and crawls → data is exported to files or the database.
Component Breakdown
Entry Point: scrapai Script
The scrapai entry point:
- Auto-activates the virtual environment (no manual source venv/bin/activate)
- Delegates commands to the Click-based CLI
- Handles environment setup and validation
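A minimal sketch of what such a wrapper can do in pure Python: re-exec under the project venv's interpreter if we aren't already using it, then hand off to the CLI. The paths and the cli.main module name are illustrative assumptions, not the actual ScrapAI layout.

```python
# Hypothetical scrapai-style entry point (sketch). Assumes a venv/ directory
# next to this script and a Click CLI importable as cli.main.cli.
import os
import sys
from pathlib import Path

def venv_python(project_root: str) -> Path:
    """Path to the interpreter inside the project's virtualenv."""
    return Path(project_root) / "venv" / "bin" / "python"

def ensure_venv(project_root: str) -> None:
    """Re-exec under the venv interpreter unless we are already using it."""
    target = venv_python(project_root)
    if target.exists() and Path(sys.executable).resolve() != target.resolve():
        # Replace the current process with the venv's Python running us again.
        os.execv(str(target), [str(target), *sys.argv])

if __name__ == "__main__":
    root = str(Path(__file__).resolve().parent)
    ensure_venv(root)
    from cli.main import cli  # assumed CLI entry module
    cli()
```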
CLI Layer (cli/)
Built with Click, the CLI provides commands for:
Spider Management
spiders list, spiders import, spiders delete
Crawling
crawl <spider> with test mode (--limit) and production mode
Data Access
show <spider>, export <spider> (CSV/JSON/JSONL/Parquet)
Queue Management
queue add, queue bulk, queue list, queue next
CLI Structure
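A Click skeleton mirroring the command groups above can look like the following sketch. The command bodies are placeholder stubs, not the real ScrapAI implementations.

```python
# Minimal Click CLI sketch with spiders/crawl/queue commands (stubs only).
import click

@click.group()
def cli():
    """ScrapAI command-line interface (sketch)."""

@cli.group()
def spiders():
    """Spider management."""

@spiders.command("list")
def spiders_list():
    click.echo("bbc_co_uk")  # placeholder output

@cli.command()
@click.argument("spider")
@click.option("--limit", type=int, default=None, help="Test mode: stop after N items")
def crawl(spider, limit):
    mode = f"test (limit={limit})" if limit else "production"
    click.echo(f"crawling {spider} in {mode} mode")

@cli.group()
def queue():
    """Queue management."""

@queue.command("add")
@click.argument("url")
def queue_add(url):
    click.echo(f"queued {url}")
```

Grouping related commands under `spiders` and `queue` keeps the top-level namespace small while letting each group grow independently.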
Database Layer (core/models.py, core/db.py)
ScrapAI uses SQLAlchemy with support for both SQLite (default) and PostgreSQL (production).
Core Models
Spider
Stores spider configuration: name, domains, start URLs, project, callbacks
SpiderRule
URL patterns (allow/deny), callback mapping, follow behavior
SpiderSetting
Spider-specific settings (delays, concurrency, extractors)
ScrapedItem
Scraped data: URL, title, content, author, date, metadata
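A hedged SQLAlchemy sketch of two of these tables; the column names and relationships are illustrative, and the real models in core/models.py will differ.

```python
# Illustrative Spider / SpiderRule models (not the actual ScrapAI schema).
from sqlalchemy import Boolean, Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()

class Spider(Base):
    __tablename__ = "spiders"
    id = Column(Integer, primary_key=True)
    name = Column(String, unique=True, nullable=False)
    domains = Column(String)      # e.g. comma-separated or JSON-encoded
    start_urls = Column(String)
    rules = relationship("SpiderRule", back_populates="spider")

class SpiderRule(Base):
    __tablename__ = "spider_rules"
    id = Column(Integer, primary_key=True)
    spider_id = Column(Integer, ForeignKey("spiders.id"))
    allow = Column(String)        # URL pattern to follow
    deny = Column(String)         # URL pattern to skip
    callback = Column(String)     # parse method name
    follow = Column(Boolean, default=True)
    spider = relationship("Spider", back_populates="rules")
```

Because spiders are rows, the same models work unchanged against SQLite or PostgreSQL; only the engine URL differs.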
SQLite (default) for development and small-scale production. PostgreSQL for multi-user access or high concurrency. Configure via DATABASE_URL in .env.
Spider Layer (spiders/database_spider.py)
One spider class for all websites. DatabaseSpider loads config from the database at runtime:
- Instantiated with a spider_name parameter
- Queries the database for spider config
- Applies domains, URLs, rules, and settings
- Scrapy engine starts crawling with loaded config
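The runtime-loading step above can be sketched in pure Python, with the Scrapy plumbing omitted. fetch_spider_config is a stand-in for the real database query, and the in-memory config dict is hypothetical.

```python
# Pure-Python sketch of DatabaseSpider's config load (no Scrapy imports).
def fetch_spider_config(spider_name):
    # In ScrapAI this is a database lookup; here, a hypothetical in-memory row.
    configs = {
        "bbc_co_uk": {
            "allowed_domains": ["bbc.co.uk"],
            "start_urls": ["https://www.bbc.co.uk/news"],
            "rules": [{"allow": r"/news/.*", "callback": "parse_article"}],
            "settings": {"DOWNLOAD_DELAY": 1.0},
        }
    }
    return configs[spider_name]

class DatabaseSpiderSketch:
    """One class, any site: all per-site behavior comes from the config row."""

    def __init__(self, spider_name):
        cfg = fetch_spider_config(spider_name)
        self.name = spider_name
        self.allowed_domains = cfg["allowed_domains"]
        self.start_urls = cfg["start_urls"]
        self.rules = cfg["rules"]
        self.custom_settings = cfg["settings"]
```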
Extraction Layer (core/extractors.py)
ScrapAI uses a fallback chain of extractors:
newspaper4k
News articles, blogs, standard article layouts
trafilatura
Articles, documentation, text-heavy content
Custom CSS
Non-standard layouts, structured data extraction with custom selectors
"EXTRACTOR_ORDER": ["newspaper", "trafilatura"]
Handlers and Middleware
CloudflareHandler (handlers/cloudflare_handler.py)
Bypasses Cloudflare using CloakBrowser. Solves the challenge once, extracts cookies, then uses fast HTTP. Enable with "CLOUDFLARE_ENABLED": true.
SmartProxyMiddleware (middlewares.py)
Auto-escalates to proxies on 403/429 errors. Starts direct, remembers blocked domains. Configure datacenter and residential proxies in .env.
Pipeline Layer (pipelines.py)
Handles storage with batched writes (50 items per batch).
Storage Modes:
- Test mode (--limit N): Saves to the database for inspection
- Production mode: Exports to timestamped JSONL files, enables checkpoint pause/resume
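The batching behavior can be sketched as a buffer that flushes every 50 items plus a final flush at close; write_batch is a stand-in for the real database/JSONL writer.

```python
# Buffer items and write them in batches of BATCH_SIZE (50 in ScrapAI).
BATCH_SIZE = 50

class BatchedPipeline:
    def __init__(self, write_batch):
        self._write = write_batch
        self._buffer = []

    def process_item(self, item):
        self._buffer.append(item)
        if len(self._buffer) >= BATCH_SIZE:
            self.flush()
        return item

    def flush(self):
        """Write any buffered items; called at batch boundaries and on close."""
        if self._buffer:
            self._write(list(self._buffer))
            self._buffer.clear()
```

Batching trades a small amount of durability (up to 49 unflushed items on a crash) for far fewer writes, which is why checkpointing matters in production mode.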
Data Flow: End-to-End
DatabaseSpider Loads Config
Queries the database for the bbc_co_uk spider, applies domains/URLs/rules/settings.
Scrapy Engine Starts
Scheduler queues start URLs, Downloader fetches pages, Spider processes responses.
Extraction
For each response:
- Try newspaper4k → trafilatura → custom CSS → Playwright
- Return ScrapedArticle or None
Key Design Decisions
Generic Spider
One spider class loads any config at runtime. No code generation, no Python files per site.
Database as Config Store
Spiders are rows, not files. Change settings across 100 spiders with one SQL query.
Fallback Extraction
Multiple extractors in a chain. If newspaper fails, try trafilatura. If that fails, try custom CSS.
Validation Before Execution
All configs validated through Pydantic schemas. Malformed configs fail before execution.
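A hedged Pydantic sketch of validation before execution; the field names are illustrative, not the actual ScrapAI schema.

```python
# A config model rejects malformed input at load time, before any crawl runs.
from pydantic import BaseModel, ValidationError

class SpiderConfig(BaseModel):
    name: str
    allowed_domains: list[str]
    start_urls: list[str]
    settings: dict = {}
```

Validating at import time means a typo in one of a hundred stored configs surfaces as an immediate error rather than a silent mid-crawl failure.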
Next Steps
Database-First Philosophy
Learn why spiders live in the database
Extractors Guide
Understand the extraction chain in detail