ScrapAI uses AI at build time to generate scraper configurations, then runs them deterministically with Scrapy. You pay the inference cost once per website, not per page.

The Core Principle

When building scrapers at scale, you have three options:

Web Scraping Services

Pay per page, per request, or per API call. Fine for small volumes, expensive at scale.

AI-Powered Runtime

Call an LLM on every page to extract data. Smart but costly—10,000 pages = 10,000 inference calls.

AI Once, Deterministic Forever

Use AI at build time to analyze the site and write extraction rules. Then run with Scrapy—no AI in the loop.
ScrapAI implements option 3. The cost is per website, not per page. After the initial analysis, you own the scraper and can run it indefinitely without additional AI costs.

The Flow

1. Describe in Plain English. You tell the agent what you want: "Add https://bbc.co.uk to my news project"
2. AI Agent Analyzes. The agent fetches sample pages, identifies URL patterns, determines the site structure, and chooses an extraction strategy.
3. Generate JSON Config. The agent produces a validated JSON configuration with URL rules, extraction settings, and spider metadata.
4. Store in Database. The config is saved as a database row. No Python files, no code generation—just structured data.
5. Run with Scrapy. A generic spider (DatabaseSpider) loads the config at runtime and executes the crawl. Same spider for every website.

AI Once, Run Forever

The key insight: the agent writes config, not code.
# Agent analyzes the site and generates config
./scrapai spiders import bbc_spider.json --project news
Every execution after the first is pure Scrapy: fast, deterministic, and free.

Why JSON Configs Instead of AI-Generated Python?

An agent that writes and executes Python has the same power as an unsupervised developer. If it hallucinates, gets prompt-injected by a malicious page, or loses context, it can do real damage.
Constraining the agent to writing JSON configs removes that risk. All configs go through strict Pydantic validation before they touch the database or crawler:
  • Spider names restricted to ^[a-zA-Z0-9_-]+$
  • URLs validated: HTTP/HTTPS only, no private IPs (127.0.0.1, 10.x, 192.168.x)
  • Callback names validated with reserved names blocked
  • Settings whitelisted: bounded concurrency (1-32), bounded delays (0-60s)
  • SQL via SQLAlchemy ORM with parameterized bindings
Malformed configs fail validation before execution.
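These checks can be sketched with a stdlib-only validator. This is a simplified illustration of the rules listed above; the function name, dict shape, and defaults are assumptions, not ScrapAI's actual Pydantic models:

```python
import re
from ipaddress import ip_address
from urllib.parse import urlparse

NAME_RE = re.compile(r"^[a-zA-Z0-9_-]+$")

def validate_config(cfg: dict) -> list:
    """Return a list of validation errors; an empty list means the config is safe."""
    errors = []
    if not NAME_RE.match(cfg.get("name", "")):
        errors.append("name must match ^[a-zA-Z0-9_-]+$")
    for url in cfg.get("start_urls", []):
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https"):
            errors.append(f"{url}: HTTP/HTTPS only")
            continue
        try:
            # Reject literal private/loopback IPs (127.0.0.1, 10.x, 192.168.x)
            if ip_address(parsed.hostname or "").is_private:
                errors.append(f"{url}: private IPs not allowed")
        except ValueError:
            pass  # a hostname, not a literal IP address
    settings = cfg.get("settings", {})
    if not 1 <= settings.get("CONCURRENT_REQUESTS", 16) <= 32:
        errors.append("CONCURRENT_REQUESTS must be 1-32")
    if not 0 <= settings.get("DOWNLOAD_DELAY", 0) <= 60:
        errors.append("DOWNLOAD_DELAY must be 0-60s")
    return errors
```

A well-formed config returns no errors; a config with a bad name, private IP, or out-of-bounds setting is rejected before it ever reaches the database or crawler.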
The agent produces data, not code. The worst case is a bad config that extracts wrong fields—caught in the test crawl and trivially fixable. At runtime, Scrapy executes deterministically with no AI in the loop.
JSON configs are portable data structures. Export a spider config, import it into another project, share it with your team, or version control it like any other data file.
When 5 developers write 100 spiders, you get 5 different styles. ScrapAI produces uniform configs with the same schema, validation, and structure.

Example: A Real Spider Config

Here’s what an AI-generated spider config looks like for BBC News:
BBC News Spider Config
{
  "name": "bbc_co_uk",
  "source_url": "https://bbc.co.uk/",
  "allowed_domains": ["bbc.co.uk", "www.bbc.co.uk"],
  "start_urls": ["https://www.bbc.co.uk/"],
  "rules": [
    {
      "allow": ["/news/articles/.*"],
      "deny": ["/news/articles/.*#comments"],
      "callback": "parse_article"
    },
    {
      "allow": ["/sport/.*/articles/.*"],
      "callback": "parse_article"
    }
  ],
  "settings": {
    "EXTRACTOR_ORDER": ["newspaper", "trafilatura"],
    "DOWNLOAD_DELAY": 1,
    "CONCURRENT_REQUESTS": 16,
    "ROBOTSTXT_OBEY": true
  }
}
This config defines URL patterns to match, extraction strategies to use, and Scrapy settings. The DatabaseSpider loads it at runtime and executes the crawl.
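To make that runtime concrete, here is a minimal sketch of how a generic spider could route discovered URLs to callbacks using the rules above. It is stdlib-only; `route_url` is a hypothetical helper for illustration, not ScrapAI's actual DatabaseSpider internals:

```python
import re
from typing import Optional

def route_url(url: str, rules: list) -> Optional[str]:
    """Return the callback for the first rule whose allow pattern matches
    the URL and whose deny patterns do not; None means skip the URL."""
    for rule in rules:
        if any(re.search(p, url) for p in rule.get("deny", [])):
            continue  # explicitly denied by this rule
        if any(re.search(p, url) for p in rule.get("allow", [])):
            return rule["callback"]
    return None

# The rules from the BBC config above
rules = [
    {"allow": ["/news/articles/.*"],
     "deny": ["/news/articles/.*#comments"],
     "callback": "parse_article"},
    {"allow": ["/sport/.*/articles/.*"], "callback": "parse_article"},
]
```

Scrapy's CrawlSpider implements the same idea with LinkExtractor rules; the sketch only shows the allow/deny precedence that the config encodes.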

What the Agent Figured Out

From a single instruction ("Add https://bbc.co.uk to my news project"), the agent:
  1. Discovered URL patterns: /news/articles/.* for news, /sport/.*/articles/.* for sports
  2. Chose extractors: newspaper4k and trafilatura work well for BBC’s article structure
  3. Set rate limits: 1-second delay between requests to be respectful
  4. Filtered noise: Excluded comment sections with the #comments deny pattern
No CSS selectors to write, no HTML inspection, no trial and error.

The Cost Equation

Traditional AI Scraping

Per-page inference: 10,000 pages × $0.001 per call = $10 per crawl. Run weekly for a year: $520 per site.

ScrapAI

One-time inference: ~20 pages analyzed = $0.02 once. Run forever: $0.02 total.
For 100 websites scraped weekly:
  • Traditional AI: $52,000/year
  • ScrapAI: $2 once
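The arithmetic behind these numbers, with the assumed prices made explicit ($0.001 per inference call, ~20 pages analyzed per site):

```python
COST_PER_CALL = 0.001      # assumed price per LLM call (one call per page)
PAGES_PER_CRAWL = 10_000
CRAWLS_PER_YEAR = 52       # weekly
SITES = 100

# Traditional AI scraping: one inference call per page, on every crawl
traditional_yearly = PAGES_PER_CRAWL * COST_PER_CALL * CRAWLS_PER_YEAR * SITES

# ScrapAI: ~20 pages analyzed once per site, then zero AI cost
scrapai_total = 20 * COST_PER_CALL * SITES

print(f"Traditional: ${traditional_yearly:,.0f}/year")  # Traditional: $52,000/year
print(f"ScrapAI:     ${scrapai_total:,.2f} once")       # ScrapAI:     $2.00 once
```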

When the Site Changes

Websites redesign. Layouts change. Scrapers break. With ScrapAI:
Fixing a Broken Spider
# Detect the breakage (via test crawl)
./scrapai crawl bbc_co_uk --project news --limit 5
# Output shows missing fields

# Point the agent at the broken spider
"Fix the bbc_co_uk spider - the site redesigned"

# Agent re-analyzes, updates the config, verifies the fix
# New config imported to database, spider ready to run
The re-analysis is another AI call (a few cents), then you’re back to deterministic execution.

Key Takeaways

One-Time Cost

AI inference happens once during spider creation. Every subsequent crawl is pure Scrapy—fast and free.

No Code Generation

The agent writes validated JSON configs, not executable Python. Safer, more portable, easier to maintain.

Database-First

Spiders are database rows, not files. Change settings across 100 spiders with one SQL query.

Production Ready

Every generated spider comes with a test config (5 sample URLs). Run test crawls to verify before production.
