ScrapAI is an orchestration layer on top of Scrapy. Instead of writing Python spider files, an AI agent generates JSON configs stored in a database. A single generic spider loads any config at runtime.
High-Level Architecture
Component Breakdown
Entry Point: scrapai Script
#!/usr/bin/env bash
# Auto-activates virtualenv, delegates to CLI
./scrapai crawl bbc_co_uk --project news
The scrapai entry point:
Auto-activates the virtual environment (no manual source venv/bin/activate)
Delegates commands to the Click-based CLI
Handles environment setup and validation
CLI Layer (cli/)
Built with Click, the CLI provides commands for:
Spider Management: spiders list, spiders import, spiders delete
Crawling: crawl <spider> with test mode (--limit) and production mode
Data Access: show <spider>, export <spider> (CSV/JSON/JSONL/Parquet)
Queue Management: queue add, queue bulk, queue list, queue next
cli/
├── __init__.py   # Main CLI entry point
├── spiders.py    # Spider CRUD commands
├── crawl.py      # Crawl execution
├── data.py       # Show and export commands
├── queue.py      # Batch processing queue
└── inspect.py    # URL inspection tool
Database Layer (core/models.py, core/db.py)
ScrapAI uses SQLAlchemy with support for both SQLite (default) and PostgreSQL (production).
Core Models
Spider
SpiderRule
SpiderSetting
ScrapedItem
class Spider(Base):
    __tablename__ = "spiders"

    id = Column(Integer, primary_key=True)
    name = Column(String, unique=True, index=True)
    allowed_domains = Column(JSON)    # ["example.com"]
    start_urls = Column(JSON)         # ["https://example.com"]
    source_url = Column(String)       # Original URL provided by user
    active = Column(Boolean)          # Enable/disable without deletion
    project = Column(String)          # Project grouping
    callbacks_config = Column(JSON)   # Custom callback definitions
    created_at = Column(DateTime)
    updated_at = Column(DateTime)

    # Relationships
    rules = relationship("SpiderRule")
    settings = relationship("SpiderSetting")
    items = relationship("ScrapedItem")
Key Point: Spiders are rows, not files. Adding a website means inserting a row.

class SpiderRule(Base):
    __tablename__ = "spider_rules"

    id = Column(Integer, primary_key=True)
    spider_id = Column(Integer, ForeignKey("spiders.id"))
    allow_patterns = Column(JSON)     # ["/news/articles/.*"]
    deny_patterns = Column(JSON)      # ["/news/.*#comments"]
    restrict_xpaths = Column(JSON)    # Limit link extraction scope
    restrict_css = Column(JSON)       # CSS-based link restriction
    callback = Column(String)         # "parse_article" or None
    follow = Column(Boolean)          # Follow links from this rule?
    priority = Column(Integer)        # Rule execution order
Scrapy Mapping: These map directly to Scrapy’s Rule and LinkExtractor.

class SpiderSetting(Base):
    __tablename__ = "spider_settings"

    id = Column(Integer, primary_key=True)
    spider_id = Column(Integer, ForeignKey("spiders.id"))
    key = Column(String)     # "DOWNLOAD_DELAY"
    value = Column(String)   # "2" (stored as string)
    type = Column(String)    # "int", "float", "bool", "string"
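Because value is stored as a string alongside a type tag, reading a setting back requires a small coercion step. A minimal sketch (hypothetical helper, not the project's actual loader):

```python
def coerce_setting(value: str, type_name: str):
    """Convert a SpiderSetting's stored string value back to its Python type."""
    if type_name == "int":
        return int(value)
    if type_name == "float":
        return float(value)
    if type_name == "bool":
        # Accept common truthy spellings; anything else is False
        return value.lower() in ("1", "true", "yes")
    return value  # "string" and unknown types pass through unchanged
```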
Examples: EXTRACTOR_ORDER, CLOUDFLARE_ENABLED, CONCURRENT_REQUESTS

class ScrapedItem(Base):
    __tablename__ = "scraped_items"

    id = Column(Integer, primary_key=True)
    spider_id = Column(Integer, ForeignKey("spiders.id"))
    url = Column(String, unique=True, index=True)
    title = Column(String)
    content = Column(Text)
    published_date = Column(DateTime)
    author = Column(String)
    scraped_at = Column(DateTime)
    metadata_json = Column(JSON)   # Custom fields go here
Storage Modes:
Test mode (--limit N): Saves to database for inspection
Production mode: Exports to JSONL files, optionally saves to DB
Database Configuration
import os

from sqlalchemy import create_engine, event
from sqlalchemy.orm import sessionmaker

# SQLite (default) or PostgreSQL from .env
DATABASE_URL = os.getenv("DATABASE_URL", "sqlite:///scrapai.db")
engine = create_engine(DATABASE_URL)

# SQLite optimizations
@event.listens_for(engine, "connect")
def set_sqlite_pragma(dbapi_conn, connection_record):
    if "sqlite" in DATABASE_URL:
        cursor = dbapi_conn.cursor()
        cursor.execute("PRAGMA journal_mode=WAL")    # Write-Ahead Logging
        cursor.execute("PRAGMA synchronous=NORMAL")
        cursor.execute("PRAGMA cache_size=-64000")   # 64MB cache
        cursor.close()
SQLite vs PostgreSQL : SQLite is perfect for development and small-scale production. Switch to PostgreSQL for multi-user access or high concurrency.
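The PRAGMA tuning above can be tried directly with Python's stdlib sqlite3 module (illustrative only; note that an in-memory database reports "memory" rather than "wal" for the journal mode):

```python
import sqlite3

# Apply the same PRAGMAs the connect-event hook sets, then read them back.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("PRAGMA journal_mode=WAL")
mode = cur.fetchone()[0]                 # in-memory DBs answer "memory"
cur.execute("PRAGMA synchronous=NORMAL")
cur.execute("PRAGMA cache_size=-64000")  # negative value = size in KiB, i.e. 64 MB
cur.execute("PRAGMA cache_size")
cache = cur.fetchone()[0]
conn.close()
```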
Spider Layer (spiders/database_spider.py)
The magic happens here: one spider class for all websites.
class DatabaseSpider(BaseDBSpiderMixin, CrawlSpider):
    name = "database_spider"

    def __init__(self, spider_name=None, *args, **kwargs):
        self.spider_name = spider_name
        self._load_config()  # Load from database
        super().__init__(*args, **kwargs)

    def _load_config(self):
        """Load spider configuration from database"""
        db = next(get_db())
        spider = db.query(Spider).filter(Spider.name == self.spider_name).first()
        if not spider:
            raise ValueError(f"Spider '{self.spider_name}' not found")

        # Apply config to spider instance
        self.allowed_domains = spider.allowed_domains
        self.start_urls = spider.start_urls

        # Compile rules from database
        self.rules = []
        for r in spider.rules:
            le_kwargs = {}
            if r.allow_patterns:
                le_kwargs["allow"] = r.allow_patterns
            if r.deny_patterns:
                le_kwargs["deny"] = r.deny_patterns
            self.rules.append(
                Rule(LinkExtractor(**le_kwargs),
                     callback=r.callback,
                     follow=r.follow)
            )
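The _load_config() snippet above does not show how the priority column is used. One plausible reading (an assumption; the schema only says "Rule execution order") is to sort rule rows before compiling them, with lower values first:

```python
from collections import namedtuple

# Stand-in for SpiderRule rows; the real objects come from SQLAlchemy.
RuleRow = namedtuple("RuleRow", ["priority", "callback"])

def order_rules(rules):
    """Sort rule rows by priority before building Scrapy Rules.

    Assumption: lower priority values run first, and a missing
    priority sorts as 0.
    """
    return sorted(rules, key=lambda r: r.priority if r.priority is not None else 0)

ordered = order_rules([RuleRow(5, "parse_b"), RuleRow(1, "parse_a"), RuleRow(None, "parse_c")])
```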
Execution Flow:
CLI runs: ./scrapai crawl bbc_co_uk --project news
DatabaseSpider instantiated with spider_name="bbc_co_uk"
_load_config() queries the database for bbc_co_uk spider
Config applied: domains, URLs, rules, settings
Scrapy engine starts crawling with the loaded config
ScrapAI uses a fallback chain of extractors:
newspaper4k
trafilatura
Custom CSS
Playwright
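The chain above can be sketched as a loop that tries each extractor in order and falls through on failure or empty output (illustrative stubs below, not the real extractor classes):

```python
class StubExtractor:
    """Stand-in extractor for illustration; real ones wrap newspaper4k etc."""
    def __init__(self, name, result):
        self.name = name
        self.result = result

    def extract(self, url, html, title_hint=None):
        if self.result is None:
            raise RuntimeError(f"{self.name} failed")
        return self.result

def extract_with_fallback(extractors, url, html, title_hint=None):
    """Try each extractor in order; return the first non-empty result."""
    for extractor in extractors:
        try:
            article = extractor.extract(url, html, title_hint=title_hint)
        except Exception:
            continue  # a failing extractor just hands off to the next one
        if article:
            return article
    return None

chain = [StubExtractor("newspaper", None),
         StubExtractor("trafilatura", {"title": "Hello"})]
result = extract_with_fallback(chain, "https://example.com", "<html></html>")
```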
class NewspaperExtractor(BaseExtractor):
    def extract(self, url, html, title_hint=None):
        article = newspaper.Article(url)
        article.download(input_html=html)
        article.parse()
        return ScrapedArticle(
            url=url,
            title=article.title or title_hint,
            content=article.text,
            author=", ".join(article.authors),
            published_date=article.publish_date,
            source="newspaper4k",
            metadata={
                "top_image": article.top_image,
                "keywords": article.keywords
            }
        )
Best for: News articles, blogs, standard article layouts

class TrafilaturaExtractor(BaseExtractor):
    def extract(self, url, html, title_hint=None):
        data = trafilatura.bare_extraction(html, url=url)
        return ScrapedArticle(
            url=url,
            title=data.get("title") or title_hint,
            content=data.get("text"),
            author=data.get("author"),
            published_date=data.get("date"),
            source="trafilatura",
            metadata={
                "description": data.get("description"),
                "categories": data.get("categories")
            }
        )
Best for: Articles, documentation, text-heavy content

class CustomExtractor(BaseExtractor):
    def __init__(self, selectors):
        # selectors = {"title": "h1.title", "content": "div.article"}
        self.selectors = selectors

    def extract(self, url, html, title_hint=None):
        soup = BeautifulSoup(html, "lxml")
        title = self._extract_text(soup, self.selectors.get("title"))
        content = self._extract_text(soup, self.selectors.get("content"))
        author = self._extract_text(soup, self.selectors.get("author"))
        return ScrapedArticle(
            url=url,
            title=title or title_hint,
            content=content,
            author=author,
            source="custom"
        )
Best for: Non-standard layouts, structured data extraction

async def _extract_with_playwright_async(self, url, ...):
    from utils.browser import BrowserClient
    async with BrowserClient() as browser:
        await browser.goto(url)
        html = await browser.get_html()
        # Try trafilatura on rendered HTML
        return TrafilaturaExtractor().extract(url, html)
Best for: JavaScript-rendered content, dynamic pages
Extractor Configuration:
{
  "settings": {
    "EXTRACTOR_ORDER": ["newspaper", "trafilatura"],
    "CUSTOM_SELECTORS": {
      "title": "h1.article-title",
      "content": "div.article-body"
    }
  }
}
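Turning EXTRACTOR_ORDER names into concrete extractor instances might look like the following (the registry and its skip-unknown behavior are assumptions, not the project's actual API):

```python
def build_extractor_chain(order, registry):
    """Resolve EXTRACTOR_ORDER names into extractor instances.

    `registry` is a hypothetical name-to-class mapping; unknown names are
    skipped rather than raising, so a config typo degrades gracefully.
    """
    return [registry[name]() for name in order if name in registry]

# Dummy classes standing in for the real extractors.
class FakeNewspaper: ...
class FakeTrafilatura: ...

registry = {"newspaper": FakeNewspaper, "trafilatura": FakeTrafilatura}
chain = build_extractor_chain(["newspaper", "playwright", "trafilatura"], registry)
```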
Handlers and Middleware
CloudflareHandler (handlers/cloudflare_handler.py)
Bypasses Cloudflare protection using nodriver for browser automation. How it works:
Browser solves the Cloudflare challenge once
Extract session cookies (cf_clearance)
Switch to fast HTTP requests with cached cookies
Refresh cookies every 10 minutes
Performance: ~5-10s for the initial challenge, then ~200-500ms per request.

{
  "settings": {
    "CLOUDFLARE_ENABLED": true
  }
}
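The cookie-reuse step can be modeled as a small TTL cache: solve the challenge once, hand back cached cookies until the 10-minute window lapses (a sketch, not the actual handler):

```python
import time

class ClearanceCookieCache:
    """TTL cache for cf_clearance cookies: solve once, reuse, refresh after TTL."""
    def __init__(self, ttl_seconds=600):  # 10 minutes, matching the refresh interval
        self.ttl = ttl_seconds
        self._cookies = None
        self._fetched_at = 0.0

    def get(self, solve_challenge):
        """Return cached cookies, invoking the slow browser solve only when stale."""
        now = time.monotonic()
        if self._cookies is None or now - self._fetched_at > self.ttl:
            self._cookies = solve_challenge()  # the ~5-10s browser step
            self._fetched_at = now
        return self._cookies

calls = []
cache = ClearanceCookieCache(ttl_seconds=600)
cache.get(lambda: calls.append(1) or {"cf_clearance": "abc"})
cookies = cache.get(lambda: calls.append(1) or {"cf_clearance": "abc"})
```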
SmartProxyMiddleware (middlewares.py)
Automatically escalates to proxies on blocks (403/429 errors). Escalation Flow:
Start with direct connections
On 403/429, retry with datacenter proxy
Remember domain for future requests
Residential proxies require explicit opt-in
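The escalation flow above amounts to remembering which domains returned block codes. A stdlib-only sketch (hypothetical class, not the real middleware):

```python
from urllib.parse import urlparse

class ProxyEscalationTracker:
    """Remember domains that returned block responses so later requests go via proxy."""
    BLOCK_CODES = {403, 429}

    def __init__(self):
        self.proxied_domains = set()

    def should_proxy(self, url):
        """True if this domain previously triggered proxy escalation."""
        return urlparse(url).netloc in self.proxied_domains

    def record(self, url, status):
        """Return True if this response means the request should retry via proxy."""
        if status in self.BLOCK_CODES:
            self.proxied_domains.add(urlparse(url).netloc)
            return True
        return False

tracker = ProxyEscalationTracker()
tracker.record("https://example.com/a", 200)
retry = tracker.record("https://example.com/b", 403)
```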
Configuration:

DATACENTER_PROXY_USERNAME=your_username
DATACENTER_PROXY_PASSWORD=your_password
DATACENTER_PROXY_HOST=proxy.example.com
DATACENTER_PROXY_PORT=10000
Pipeline Layer (pipelines.py)
The item pipeline handles storage:
class DatabasePipeline:
    def __init__(self):
        self.items_buffer = []
        self.buffer_size = 50  # Batch writes

    def process_item(self, item, spider):
        self.items_buffer.append(item)
        if len(self.items_buffer) >= self.buffer_size:
            self._flush_to_db()
        return item

    def _flush_to_db(self):
        with get_db() as db:
            for item_data in self.items_buffer:
                item = ScrapedItem(
                    spider_id=item_data["spider_id"],
                    url=item_data["url"],
                    title=item_data.get("title"),
                    content=item_data.get("content"),
                    # ... other fields
                )
                db.add(item)
            db.commit()
            self.items_buffer.clear()
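The batching pattern in process_item/_flush_to_db generalizes to a small buffered writer. Note that leftover items below the batch size still need one final flush when the crawl ends (illustrative model, not the project's pipeline):

```python
class BufferedWriter:
    """Minimal model of the pipeline's batching: collect items, flush in chunks."""
    def __init__(self, flush_fn, buffer_size=50):
        self.flush_fn = flush_fn
        self.buffer_size = buffer_size
        self.buffer = []

    def add(self, item):
        self.buffer.append(item)
        if len(self.buffer) >= self.buffer_size:
            self.flush()

    def flush(self):
        """Write out whatever is buffered; also needed once when the crawl ends."""
        if self.buffer:
            self.flush_fn(list(self.buffer))
            self.buffer.clear()

batches = []
writer = BufferedWriter(batches.append, buffer_size=3)
for i in range(7):
    writer.add(i)
writer.flush()  # final flush for leftover items, as a close_spider hook would do
```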
Storage Modes:
Test Mode
Production Mode
./scrapai crawl bbc_co_uk --project news --limit 10
Saves to database (scraped_items table)
Inspect with: ./scrapai show bbc_co_uk --project news
Export with: ./scrapai export bbc_co_uk --project news --format csv
./scrapai crawl bbc_co_uk --project news
Exports to timestamped JSONL: data/news/bbc_co_uk/crawl_20260228_143022.jsonl
Enables checkpoint pause/resume (Ctrl+C to pause)
Optional database storage via settings
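Production-mode JSONL output of the form shown above can be sketched with stdlib helpers (jsonl_path and append_jsonl are hypothetical names, not the project's API):

```python
import datetime
import json
import pathlib
import tempfile

def jsonl_path(base, project, spider_name, now=None):
    """Build a timestamped path like data/news/bbc_co_uk/crawl_20260228_143022.jsonl."""
    now = now or datetime.datetime.now()
    stamp = now.strftime("%Y%m%d_%H%M%S")
    return pathlib.Path(base) / project / spider_name / f"crawl_{stamp}.jsonl"

def append_jsonl(path, item):
    """Append one item as a single JSON line, creating parent dirs on first write."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(item, ensure_ascii=False) + "\n")

with tempfile.TemporaryDirectory() as tmp:
    path = jsonl_path(tmp, "news", "bbc_co_uk",
                      now=datetime.datetime(2026, 2, 28, 14, 30, 22))
    append_jsonl(path, {"url": "https://example.com", "title": "Hi"})
    lines = path.read_text().splitlines()
```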
Data Flow: End-to-End
User Runs Crawl Command
./scrapai crawl bbc_co_uk --project news --limit 5
CLI Invokes Scrapy
cli/crawl.py invokes Scrapy programmatically:

process = CrawlerProcess(settings)
process.crawl(DatabaseSpider, spider_name="bbc_co_uk")
process.start()
DatabaseSpider Loads Config
Queries database for bbc_co_uk spider, applies domains/URLs/rules/settings.
Scrapy Engine Starts
Scheduler queues start URLs, Downloader fetches pages, Spider processes responses.
Extraction
For each response:
Try newspaper4k → trafilatura → custom CSS → Playwright
Return ScrapedArticle or None
Pipeline Storage
Items buffered and batch-written to database or JSONL files.
Output Available
./scrapai show bbc_co_uk --project news
./scrapai export bbc_co_uk --project news --format csv
Key Design Decisions
Generic Spider: One spider class loads any config at runtime. No code generation, no Python files per site.
Database as Config Store: Spiders are rows, not files. Change settings across 100 spiders with one SQL query.
Fallback Extraction: Multiple extractors in a chain. If newspaper fails, try trafilatura. If that fails, try custom CSS.
Validation Before Execution: All configs validated through Pydantic schemas. Malformed configs fail before execution.
Next Steps