ScrapAI uses AI at build time to generate scraper configurations, then runs them deterministically with Scrapy. You pay the inference cost once per website, not per page.

The Core Principle

When building scrapers at scale, you have three options:

Web Scraping Services

Pay per page, per request, or per API call. Fine for small volumes, expensive at scale.

AI-Powered Runtime

Call an LLM on every page to extract data. Smart but costly—10,000 pages = 10,000 inference calls.

AI Once, Deterministic Forever

Use AI at build time to analyze the site and write extraction rules. Then run with Scrapy—no AI in the loop.
ScrapAI implements option 3. The cost is per website, not per page. After the initial analysis, you own the scraper and can run it indefinitely without additional AI costs.

The Flow

1. Describe in Plain English. You tell the agent what you want: "Add https://bbc.co.uk to my news project"
2. AI Agent Analyzes. The agent fetches sample pages, identifies URL patterns, determines the site structure, and chooses an extraction strategy.
3. Generate JSON Config. The agent produces a validated JSON configuration with URL rules, extraction settings, and spider metadata.
4. Store in Database. The config is saved as a database row. No Python files, no code generation, just structured data.
5. Run with Scrapy. A generic spider (DatabaseSpider) loads the config at runtime and executes the crawl. Same spider for every website.
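The idea behind the final step, one generic spider parameterized entirely by stored data, can be sketched in a few lines. This is an illustrative stdlib-only sketch, not ScrapAI's actual DatabaseSpider; the field names are borrowed from the example config shown later on this page.

```python
import json

# Hypothetical sketch (not ScrapAI's real implementation): a single
# generic class configured by data, so the same code serves every site.
class GenericSpider:
    def __init__(self, config: dict):
        self.name = config["name"]
        self.start_urls = config["start_urls"]
        self.allowed_domains = config["allowed_domains"]
        self.rules = config.get("rules", [])
        self.settings = config.get("settings", {})

    @classmethod
    def from_row(cls, row: str) -> "GenericSpider":
        # The database stores the config as JSON text; no Python files.
        return cls(json.loads(row))

# The same class serves any website: only the data changes.
row = ('{"name": "bbc_co_uk", '
       '"start_urls": ["https://www.bbc.co.uk/"], '
       '"allowed_domains": ["bbc.co.uk"]}')
bbc = GenericSpider.from_row(row)
```

Adding a new website means inserting a new row, not deploying new code.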

AI Once, Run Forever

The key insight: the agent writes config, not code.
```shell
# Agent analyzes the site and generates config
./scrapai spiders import bbc_spider.json --project news
```
Every execution is pure Scrapy: fast, deterministic, and free.

Why JSON Configs Instead of AI-Generated Python?

An agent that writes and executes Python has the same power as an unsupervised developer. If it hallucinates, gets prompt-injected by a malicious page, or loses context, it can do real damage.
By constraining the agent to write JSON configs, you limit what a hallucination or injected prompt can do. All configs go through strict Pydantic validation before they touch the database or crawler:
  • Spider names restricted to ^[a-zA-Z0-9_-]+$
  • URLs validated: HTTP/HTTPS only, no private IPs (127.0.0.1, 10.x, 192.168.x)
  • Callback names validated with reserved names blocked
  • Settings whitelisted: bounded concurrency (1-32), bounded delays (0-60s)
  • SQL via SQLAlchemy ORM with parameterized bindings
Malformed configs fail validation before execution.
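To make the checks above concrete, here is a simplified, stdlib-only approximation of that validation layer. It is not ScrapAI's actual Pydantic model, which is stricter and covers more cases (callback names, settings whitelisting, IPv6 ranges, and so on).

```python
import ipaddress
import re
from urllib.parse import urlparse

NAME_RE = re.compile(r"^[a-zA-Z0-9_-]+$")

def validate_config(config: dict) -> list:
    """Return a list of validation errors; empty means the config passes."""
    errors = []
    if not NAME_RE.match(config.get("name", "")):
        errors.append("invalid spider name")
    for url in config.get("start_urls", []):
        parts = urlparse(url)
        if parts.scheme not in ("http", "https"):
            errors.append(f"non-HTTP URL: {url}")
            continue
        try:
            if ipaddress.ip_address(parts.hostname).is_private:
                errors.append(f"private IP: {url}")
        except ValueError:
            pass  # hostname is a domain name, not a literal IP
    settings = config.get("settings", {})
    if not 0 <= settings.get("DOWNLOAD_DELAY", 0) <= 60:
        errors.append("DOWNLOAD_DELAY out of bounds (0-60s)")
    if not 1 <= settings.get("CONCURRENT_REQUESTS", 16) <= 32:
        errors.append("CONCURRENT_REQUESTS out of bounds (1-32)")
    return errors
```

A config with a malformed name or a private-IP start URL is rejected before it ever reaches the database or the crawler.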
The agent produces data, not code. The worst case is a bad config that extracts the wrong fields, which is caught in the test crawl and trivially fixable. At runtime, Scrapy executes deterministically with no AI in the loop.
JSON configs are portable data structures. Export a spider config, import it into another project, share it with your team, or version control it like any other data file.
All configs follow the same schema, validation, and structure.

Example: A Real Spider Config

Here’s what an AI-generated spider config looks like for BBC News:
BBC News Spider Config
```json
{
  "name": "bbc_co_uk",
  "source_url": "https://bbc.co.uk/",
  "allowed_domains": ["bbc.co.uk", "www.bbc.co.uk"],
  "start_urls": ["https://www.bbc.co.uk/"],
  "rules": [
    {
      "allow": ["/news/articles/.*"],
      "deny": ["/news/articles/.*#comments"],
      "callback": "parse_article"
    },
    {
      "allow": ["/sport/.*/articles/.*"],
      "callback": "parse_article"
    }
  ],
  "settings": {
    "EXTRACTOR_ORDER": ["newspaper", "trafilatura"],
    "DOWNLOAD_DELAY": 1,
    "CONCURRENT_REQUESTS": 16,
    "ROBOTSTXT_OBEY": true
  }
}
```
This config defines URL patterns to match, extraction strategies to use, and Scrapy settings. The DatabaseSpider loads it at runtime and executes the crawl.
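To see how the allow/deny rules behave, here is a hypothetical helper (not part of ScrapAI) that applies a config's patterns the way a link-extraction rule would: the first rule whose allow patterns match, and whose deny patterns do not, wins.

```python
import re
from typing import Optional

def resolve_callback(rules: list, path: str) -> Optional[str]:
    """Return the callback of the first matching rule, or None.

    Illustrative only; Scrapy's LinkExtractor applies allow/deny
    patterns with its own additional normalization.
    """
    for rule in rules:
        allowed = any(re.search(p, path) for p in rule.get("allow", []))
        denied = any(re.search(p, path) for p in rule.get("deny", []))
        if allowed and not denied:
            return rule["callback"]
    return None
```

With the BBC rules above, a news article URL routes to parse_article, while its #comments variant is denied and a URL matching no rule is simply not followed.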

The Cost Equation

Traditional AI Scraping

Per-page inference: 10,000 pages × $0.001 per call = $10 per crawl. Run weekly for a year: $520 per site.

ScrapAI

One-time inference: ~20 pages analyzed = $0.02 once. Run forever: $0.02 total.
For 100 websites scraped weekly:
  • Traditional AI: $52,000/year
  • ScrapAI: $2 once
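The arithmetic behind these numbers is easy to check, using the assumed rates from this section ($0.001 per inference call, 10,000 pages per crawl, ~20 pages analyzed once per site):

```python
# Assumed figures from the comparison above.
PAGES_PER_CRAWL = 10_000
COST_PER_CALL = 0.001   # dollars per inference call
CRAWLS_PER_YEAR = 52    # weekly crawls
SITES = 100
PAGES_ANALYZED = 20     # pages the agent samples per site, once

# Per-page AI: every page of every crawl pays inference.
per_page_ai = SITES * PAGES_PER_CRAWL * CRAWLS_PER_YEAR * COST_PER_CALL

# Build-time AI: each site pays once, regardless of crawl count.
build_time_ai = SITES * PAGES_ANALYZED * COST_PER_CALL

print(f"Per-page AI:   ${per_page_ai:,.0f}/year")   # $52,000/year
print(f"Build-time AI: ${build_time_ai:,.2f} once") # $2.00 once
```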

When the Site Changes

When a site redesigns, use AI-assisted maintenance to detect and fix broken spiders. The agent re-analyzes and updates the config—another AI call (a few cents), then back to deterministic execution.

Next Steps

Architecture

Understand the system components and data flow

Database-First Philosophy

Learn why spiders live in the database, not in files