The Core Principle
When building scrapers at scale, you have three options:

Web Scraping Services
Pay per page, per request, or per API call. Fine for small volumes, expensive at scale.
AI-Powered Runtime
Call an LLM on every page to extract data. Smart but costly—10,000 pages = 10,000 inference calls.
AI Once, Deterministic Forever
Use AI at build time to analyze the site and write extraction rules. Then run with Scrapy—no AI in the loop.
The Flow
Describe in Plain English
You tell the agent what you want:
"Add https://bbc.co.uk to my news project"

AI Agent Analyzes
The agent fetches sample pages, identifies URL patterns, determines the site structure, and chooses an extraction strategy.
Generate JSON Config
The agent produces a validated JSON configuration with URL rules, extraction settings, and spider metadata.
Store in Database
The config is saved as a database row. No Python files, no code generation—just structured data.
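A "spider as a database row" can be sketched in a few lines. This is an illustrative sketch only, using SQLite for brevity; the table and column names are assumptions, not the project's actual schema:

```python
# Illustrative sketch: a spider stored as a single database row
# (table/column names are assumptions, not the project's real schema).
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE spiders (name TEXT PRIMARY KEY, config TEXT NOT NULL)")

config = {"start_urls": ["https://www.bbc.co.uk/news"], "follow": [r"/news/"]}
conn.execute(
    "INSERT INTO spiders (name, config) VALUES (?, ?)",
    ("bbc_news", json.dumps(config)),
)

# Loading a spider later is just a SELECT plus json.loads -- no Python files.
row = conn.execute(
    "SELECT config FROM spiders WHERE name = ?", ("bbc_news",)
).fetchone()
print(json.loads(row[0])["start_urls"][0])
```

Because the config is plain structured data, adding or removing a spider is an INSERT or DELETE, not a deployment.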
AI Once, Run Forever
The key insight: the agent writes config, not code.

Why JSON Configs Instead of AI-Generated Python?
By constraining the agent to write JSON configs:

Security: Validation Before Execution
All configs go through strict Pydantic validation before they touch the database or crawler:
- Spider names restricted to `^[a-zA-Z0-9_-]+$`
- URLs validated: HTTP/HTTPS only, no private IPs (127.0.0.1, 10.x, 192.168.x)
- Callback names validated with reserved names blocked
- Settings whitelisted: bounded concurrency (1-32), bounded delays (0-60s)
- SQL via SQLAlchemy ORM with parameterized bindings
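The checks above can be sketched as a small Pydantic model. This is a minimal illustration of the validation style, not the project's actual schema; field names and bounds here mirror the list above but are otherwise assumptions:

```python
# Minimal sketch of config validation with Pydantic v2
# (field names are assumptions; bounds mirror the list above).
import ipaddress
from urllib.parse import urlparse

from pydantic import BaseModel, Field, field_validator

class SpiderConfig(BaseModel):
    name: str = Field(pattern=r"^[a-zA-Z0-9_-]+$")
    start_urls: list[str]
    concurrency: int = Field(default=8, ge=1, le=32)       # bounded 1-32
    download_delay: float = Field(default=0.0, ge=0, le=60)  # bounded 0-60s

    @field_validator("start_urls")
    @classmethod
    def http_only_no_private_ips(cls, urls):
        for url in urls:
            parsed = urlparse(url)
            if parsed.scheme not in ("http", "https"):
                raise ValueError(f"unsupported scheme: {url}")
            if parsed.hostname is None:
                raise ValueError(f"invalid URL: {url}")
            try:
                ip = ipaddress.ip_address(parsed.hostname)
            except ValueError:
                continue  # a hostname, not a literal IP address
            if ip.is_private or ip.is_loopback:
                raise ValueError(f"private IP rejected: {url}")
        return urls
```

A config that names a private IP, an odd scheme, or an out-of-range setting fails validation before it ever reaches the database or the crawler.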
Safety: No Code Execution
The agent produces data, not code. The worst case is a bad config that extracts wrong fields—caught in the test crawl and trivially fixable.

At runtime, Scrapy executes deterministically with no AI in the loop.
Portability: Export and Share

Because configs are plain JSON, spiders can be exported, shared, and imported into another project's database without touching code.
Maintainability: Uniform Configs
All configs follow the same schema, validation, and structure.
Example: A Real Spider Config
Here’s what an AI-generated spider config looks like for BBC News:

BBC News Spider Config
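The original config listing did not survive extraction, so the fragment below is an illustrative sketch of its likely shape—URL rules, extraction settings, and spider metadata as described in this doc. All field names and selectors are assumptions:

```json
{
  "name": "bbc_news",
  "start_urls": ["https://www.bbc.co.uk/news"],
  "url_rules": {
    "follow": ["^https://www\\.bbc\\.co\\.uk/news/"],
    "deny": ["/live/", "/av/"]
  },
  "extraction": {
    "strategy": "css",
    "fields": {
      "headline": "h1::text",
      "body": "article p::text"
    }
  },
  "settings": {
    "CONCURRENT_REQUESTS": 8,
    "DOWNLOAD_DELAY": 1.0
  }
}
```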
This config defines URL patterns to match, extraction strategies to use, and Scrapy settings. The DatabaseSpider loads it at runtime and executes the crawl.

The Cost Equation
Traditional AI Scraping
Per-page inference: 10,000 pages × $0.001/page = $10 per crawl. Run weekly for a year: $520 per site.
ScrapAI
One-time inference: ~20 pages analyzed = $0.02, once. Run forever: $0.02 total.
- Traditional AI: $52,000/year (100 sites at $520 each)
- ScrapAI: $2 once (100 sites at $0.02 each)
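The arithmetic behind these totals checks out directly. The per-call price and the 100-site fleet size below are assumptions inferred from the figures quoted above:

```python
# Checking the cost comparison (assumes $0.001 per inference call and a
# 100-site fleet, both inferred from the quoted totals).
pages_per_crawl = 10_000
cost_per_call = 0.001
crawls_per_year = 52   # weekly crawls
sites = 100

traditional_per_site = pages_per_crawl * cost_per_call * crawls_per_year
scrapai_per_site = 20 * cost_per_call  # ~20 sample pages, analyzed once

print(f"Traditional AI: ${traditional_per_site * sites:,.0f}/year")  # $52,000/year
print(f"ScrapAI: ${scrapai_per_site * sites:,.2f} once")             # $2.00 once
```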
When the Site Changes
When a site redesigns, use AI-assisted maintenance to detect and fix broken spiders. The agent re-analyzes and updates the config—another AI call (a few cents), then back to deterministic execution.

Next Steps
Architecture
Understand the system components and data flow
Database-First Philosophy
Learn why spiders live in the database, not in files