ScrapAI is an orchestration layer on top of Scrapy. Instead of writing Python spider files, an AI agent generates JSON configs stored in a database. A single generic spider loads any config at runtime.

High-Level Architecture

Component Breakdown

Entry Point: scrapai Script

scrapai Script
#!/usr/bin/env bash
# Auto-activates virtualenv, delegates to CLI
./scrapai crawl bbc_co_uk --project news
The scrapai entry point:
  • Auto-activates the virtual environment (no manual source venv/bin/activate)
  • Delegates commands to the Click-based CLI
  • Handles environment setup and validation

CLI Layer (cli/)

Built with Click, the CLI provides commands for:

Spider Management

spiders list, spiders import, spiders delete

Crawling

crawl <spider> with test mode (--limit) and production mode

Data Access

show <spider>, export <spider> (CSV/JSON/JSONL/Parquet)

Queue Management

queue add, queue bulk, queue list, queue next
CLI Structure
cli/
├── __init__.py        # Main CLI entry point
├── spiders.py         # Spider CRUD commands
├── crawl.py           # Crawl execution
├── data.py            # Show and export commands
├── queue.py           # Batch processing queue
└── inspect.py         # URL inspection tool

Database Layer (core/models.py, core/db.py)

ScrapAI uses SQLAlchemy with support for both SQLite (default) and PostgreSQL (production).

Core Models

Spider Model
class Spider(Base):
    __tablename__ = "spiders"
    
    id = Column(Integer, primary_key=True)
    name = Column(String, unique=True, index=True)
    allowed_domains = Column(JSON)  # ["example.com"]
    start_urls = Column(JSON)       # ["https://example.com"]
    source_url = Column(String)     # Original URL provided by user
    active = Column(Boolean)        # Enable/disable without deletion
    project = Column(String)        # Project grouping
    callbacks_config = Column(JSON) # Custom callback definitions
    created_at = Column(DateTime)
    updated_at = Column(DateTime)
    
    # Relationships
    rules = relationship("SpiderRule")
    settings = relationship("SpiderSetting")
    items = relationship("ScrapedItem")
Key Point: Spiders are rows, not files. Adding a website means inserting a row.
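The same idea can be sketched with the standard library's sqlite3 module (a hypothetical minimal table; ScrapAI's real schema is the SQLAlchemy model above): adding a website is a single INSERT, not a new Python file.

```python
import json
import sqlite3

# In-memory DB stands in for scrapai.db; columns mirror the Spider model.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE spiders (
           id INTEGER PRIMARY KEY,
           name TEXT UNIQUE,
           allowed_domains TEXT,  -- JSON-encoded list
           start_urls TEXT,       -- JSON-encoded list
           active INTEGER DEFAULT 1
       )"""
)

def add_spider(name, domains, urls):
    """Adding a website == inserting a row."""
    conn.execute(
        "INSERT INTO spiders (name, allowed_domains, start_urls) VALUES (?, ?, ?)",
        (name, json.dumps(domains), json.dumps(urls)),
    )

add_spider("bbc_co_uk", ["bbc.co.uk"], ["https://www.bbc.co.uk/news"])
row = conn.execute(
    "SELECT allowed_domains FROM spiders WHERE name = ?", ("bbc_co_uk",)
).fetchone()
print(json.loads(row[0]))  # ['bbc.co.uk']
```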

Database Configuration

core/db.py
import os

from sqlalchemy import create_engine, event
from sqlalchemy.orm import sessionmaker

# SQLite (default) or PostgreSQL from .env
DATABASE_URL = os.getenv("DATABASE_URL", "sqlite:///scrapai.db")

engine = create_engine(DATABASE_URL)

# SQLite optimizations
@event.listens_for(engine, "connect")
def set_sqlite_pragma(dbapi_conn, connection_record):
    if "sqlite" in DATABASE_URL:
        cursor = dbapi_conn.cursor()
        cursor.execute("PRAGMA journal_mode=WAL")  # Write-Ahead Logging
        cursor.execute("PRAGMA synchronous=NORMAL")
        cursor.execute("PRAGMA cache_size=-64000")  # 64MB cache
        cursor.close()
SQLite vs PostgreSQL: SQLite is perfect for development and small-scale production. Switch to PostgreSQL for multi-user access or high concurrency.

Spider Layer (spiders/database_spider.py)

The magic happens here: one spider class for all websites.
DatabaseSpider Core
class DatabaseSpider(BaseDBSpiderMixin, CrawlSpider):
    name = "database_spider"
    
    def __init__(self, spider_name=None, *args, **kwargs):
        self.spider_name = spider_name
        self._load_config()  # Load from database
        super().__init__(*args, **kwargs)
    
    def _load_config(self):
        """Load spider configuration from database"""
        db = next(get_db())
        spider = db.query(Spider).filter(Spider.name == self.spider_name).first()
        
        if not spider:
            raise ValueError(f"Spider '{self.spider_name}' not found")
        
        # Apply config to spider instance
        self.allowed_domains = spider.allowed_domains
        self.start_urls = spider.start_urls
        
        # Compile rules from database
        self.rules = []
        for r in spider.rules:
            le_kwargs = {}
            if r.allow_patterns:
                le_kwargs["allow"] = r.allow_patterns
            if r.deny_patterns:
                le_kwargs["deny"] = r.deny_patterns
            
            self.rules.append(
                Rule(LinkExtractor(**le_kwargs), 
                     callback=r.callback, 
                     follow=r.follow)
            )
Execution Flow:
  1. CLI runs: ./scrapai crawl bbc_co_uk --project news
  2. DatabaseSpider instantiated with spider_name="bbc_co_uk"
  3. _load_config() queries the database for bbc_co_uk spider
  4. Config applied: domains, URLs, rules, settings
  5. Scrapy engine starts crawling with the loaded config
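Ignoring Scrapy internals, the config-to-rule translation in step 4 can be sketched in plain Python; regex matching stands in for LinkExtractor, and the dict keys mirror the SpiderRule columns (names are illustrative, not ScrapAI's API):

```python
import re
from dataclasses import dataclass, field

@dataclass
class Rule:
    """Minimal stand-in for a compiled crawl rule."""
    allow: list = field(default_factory=list)
    deny: list = field(default_factory=list)
    callback: str = None
    follow: bool = True

    def matches(self, url):
        # Deny patterns win; an empty allow list matches everything.
        if any(re.search(p, url) for p in self.deny):
            return False
        return not self.allow or any(re.search(p, url) for p in self.allow)

def compile_rules(rule_rows):
    """Translate database rows (plain dicts here) into Rule objects."""
    return [
        Rule(
            allow=row.get("allow_patterns") or [],
            deny=row.get("deny_patterns") or [],
            callback=row.get("callback"),
            follow=row.get("follow", True),
        )
        for row in rule_rows
    ]

rules = compile_rules([
    {"allow_patterns": [r"/news/"], "deny_patterns": [r"/news/live/"],
     "callback": "parse_article"},
])
print(rules[0].matches("https://www.bbc.co.uk/news/world-123"))  # True
print(rules[0].matches("https://www.bbc.co.uk/news/live/abc"))   # False
```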

Extraction Layer (core/extractors.py)

ScrapAI uses a fallback chain of extractors:
NewspaperExtractor
class NewspaperExtractor(BaseExtractor):
    def extract(self, url, html, title_hint=None):
        article = newspaper.Article(url)
        article.download(input_html=html)
        article.parse()
        
        return ScrapedArticle(
            url=url,
            title=article.title or title_hint,
            content=article.text,
            author=", ".join(article.authors),
            published_date=article.publish_date,
            source="newspaper4k",
            metadata={
                "top_image": article.top_image,
                "keywords": article.keywords
            }
        )
Best for: News articles, blogs, standard article layouts
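The fallback chain itself reduces to "try each extractor in order, return the first non-None result". A minimal sketch (the extractor callables here are placeholders, not the real newspaper4k/trafilatura wrappers):

```python
def extract_with_fallback(extractors, url, html):
    """Try each (name, extractor) pair in order; return the first hit."""
    for name, extractor in extractors:
        try:
            result = extractor(url, html)
        except Exception:
            result = None  # a crash counts as a miss; move down the chain
        if result is not None:
            return name, result
    return None, None

# Placeholder extractors: the first fails, the second succeeds.
chain = [
    ("newspaper", lambda url, html: None),
    ("trafilatura", lambda url, html: {"title": "Example", "content": html}),
]
name, article = extract_with_fallback(chain, "https://example.com", "<p>body</p>")
print(name)  # trafilatura
```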
Extractor Configuration:
Spider Settings
{
  "settings": {
    "EXTRACTOR_ORDER": ["newspaper", "trafilatura"],
    "CUSTOM_SELECTORS": {
      "title": "h1.article-title",
      "content": "div.article-body"
    }
  }
}

Handlers and Middleware

The Cloudflare handler bypasses Cloudflare protection using nodriver for browser automation.
How it works:
  1. Browser solves the Cloudflare challenge once
  2. Extract session cookies (cf_clearance)
  3. Switch to fast HTTP requests with cached cookies
  4. Refresh cookies every 10 minutes
Performance: ~5-10s for initial challenge, then ~200-500ms per request.
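The cookie-caching step can be sketched as a timestamped per-domain cache; solve_challenge is a placeholder for the slow browser step, and the 600-second TTL mirrors the 10-minute refresh above:

```python
import time

COOKIE_TTL = 600  # refresh cookies every 10 minutes

class CookieCache:
    """Cache Cloudflare clearance cookies per domain, re-solving on expiry."""
    def __init__(self, solve_challenge):
        self.solve_challenge = solve_challenge  # slow browser step (~5-10s)
        self._cache = {}  # domain -> (cookies, solved_at)

    def get(self, domain, now=None):
        now = time.time() if now is None else now
        entry = self._cache.get(domain)
        if entry is None or now - entry[1] > COOKIE_TTL:
            cookies = self.solve_challenge(domain)  # browser solves once
            self._cache[domain] = (cookies, now)
            return cookies
        return entry[0]  # fast path: reuse cached cookies

solves = []
cache = CookieCache(lambda d: solves.append(d) or {"cf_clearance": "token"})
cache.get("example.com", now=0)
cache.get("example.com", now=100)   # within TTL: no re-solve
cache.get("example.com", now=700)   # expired: browser solves again
print(len(solves))  # 2
```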
Enable Cloudflare Bypass
{
  "settings": {
    "CLOUDFLARE_ENABLED": true
  }
}
The proxy middleware automatically escalates to proxies when requests are blocked (403/429 errors).
Escalation Flow:
  1. Start with direct connections
  2. On 403/429, retry with a datacenter proxy
  3. Remember the domain for future requests
  4. Residential proxies require explicit opt-in
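The escalation flow amounts to a per-domain memory of which tier last worked. A sketch with placeholder fetch functions (the real middleware wires this into Scrapy's retry machinery):

```python
from urllib.parse import urlparse

BLOCK_CODES = {403, 429}

class ProxyEscalation:
    """Remember, per domain, whether direct requests get blocked."""
    def __init__(self):
        self.needs_proxy = set()  # domains that required a datacenter proxy

    def fetch(self, url, fetch_direct, fetch_via_proxy):
        domain = urlparse(url).netloc
        if domain not in self.needs_proxy:
            status, body = fetch_direct(url)
            if status not in BLOCK_CODES:
                return body
            self.needs_proxy.add(domain)  # remember for future requests
        return fetch_via_proxy(url)[1]

esc = ProxyEscalation()
direct = lambda url: (403, None)          # direct connection is blocked
proxied = lambda url: (200, "page body")  # datacenter proxy succeeds
print(esc.fetch("https://example.com/a", direct, proxied))  # page body
# Later requests to the same domain skip the doomed direct attempt:
print("example.com" in esc.needs_proxy)  # True
```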
Configuration:
.env
DATACENTER_PROXY_USERNAME=your_username
DATACENTER_PROXY_PASSWORD=your_password
DATACENTER_PROXY_HOST=proxy.example.com
DATACENTER_PROXY_PORT=10000

Pipeline Layer (pipelines.py)

The item pipeline handles storage:
DatabasePipeline
class DatabasePipeline:
    def __init__(self):
        self.items_buffer = []
        self.buffer_size = 50  # Batch writes to cut transaction overhead
    
    def process_item(self, item, spider):
        self.items_buffer.append(item)
        
        if len(self.items_buffer) >= self.buffer_size:
            self._flush_to_db()
        
        return item
    
    def close_spider(self, spider):
        # Write out any items left in a partially filled buffer
        if self.items_buffer:
            self._flush_to_db()
    
    def _flush_to_db(self):
        db = next(get_db())
        try:
            for item_data in self.items_buffer:
                item = ScrapedItem(
                    spider_id=item_data["spider_id"],
                    url=item_data["url"],
                    title=item_data.get("title"),
                    content=item_data.get("content"),
                    # ... other fields
                )
                db.add(item)
            db.commit()
        finally:
            db.close()
        self.items_buffer.clear()
Storage Modes:
Test Mode
./scrapai crawl bbc_co_uk --project news --limit 10
  • Saves to database (scraped_items table)
  • Inspect with: ./scrapai show bbc_co_uk --project news
  • Export with: ./scrapai export bbc_co_uk --project news --format csv

Data Flow: End-to-End

1. User Runs Crawl Command

./scrapai crawl bbc_co_uk --project news --limit 5

2. CLI Invokes Scrapy

cli/crawl.py constructs the Scrapy process:
process = CrawlerProcess(settings)
process.crawl(DatabaseSpider, spider_name="bbc_co_uk")
process.start()

3. DatabaseSpider Loads Config

Queries the database for the bbc_co_uk spider and applies its domains, URLs, rules, and settings.

4. Scrapy Engine Starts

The scheduler queues start URLs, the downloader fetches pages, and the spider processes responses.

5. Extraction

For each response:
  • Try newspaper4k → trafilatura → custom CSS → Playwright
  • Return ScrapedArticle or None

6. Pipeline Storage

Items are buffered and batch-written to the database or JSONL files.

7. Output Available

./scrapai show bbc_co_uk --project news
./scrapai export bbc_co_uk --project news --format csv

Key Design Decisions

Generic Spider

One spider class loads any config at runtime. No code generation, no Python files per site.

Database as Config Store

Spiders are rows, not files. Change settings across 100 spiders with one SQL query.
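For example, deactivating every spider in a project is one UPDATE rather than a hundred file edits. A sqlite3 sketch with a hypothetical minimal schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE spiders (name TEXT, project TEXT, active INTEGER)")
conn.executemany(
    "INSERT INTO spiders VALUES (?, ?, 1)",
    [(f"spider_{i}", "news") for i in range(100)],
)

# One query changes all 100 spiders at once:
cur = conn.execute("UPDATE spiders SET active = 0 WHERE project = 'news'")
print(cur.rowcount)  # 100
```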

Fallback Extraction

Multiple extractors in a chain. If newspaper fails, try trafilatura. If that fails, try custom CSS.

Validation Before Execution

All configs validated through Pydantic schemas. Malformed configs fail before execution.
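ScrapAI uses Pydantic for this; the shape of the check can be sketched with a stdlib dataclass (field names follow the Spider model above, but the specific validation rules here are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class SpiderConfig:
    """Reject malformed configs before any crawl starts."""
    name: str
    allowed_domains: list
    start_urls: list = field(default_factory=list)

    def __post_init__(self):
        if not self.name:
            raise ValueError("spider name is required")
        if not self.allowed_domains:
            raise ValueError("at least one allowed domain is required")
        for url in self.start_urls:
            if not url.startswith(("http://", "https://")):
                raise ValueError(f"not an absolute URL: {url}")

SpiderConfig("bbc_co_uk", ["bbc.co.uk"], ["https://www.bbc.co.uk/news"])  # ok
try:
    SpiderConfig("bad", ["example.com"], ["ftp://example.com"])
except ValueError as e:
    print(e)  # not an absolute URL: ftp://example.com
```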

Next Steps