The Core Principle
When building scrapers at scale, you have three options:
Web Scraping Services
Pay per page, per request, or per API call. Fine for small volumes, expensive at scale.
AI-Powered Runtime
Call an LLM on every page to extract data. Smart but costly—10,000 pages = 10,000 inference calls.
AI Once, Deterministic Forever
Use AI at build time to analyze the site and write extraction rules. Then run with Scrapy—no AI in the loop.
The Flow
Describe in Plain English
You tell the agent what you want:
"Add https://bbc.co.uk to my news project"
AI Agent Analyzes
The agent fetches sample pages, identifies URL patterns, determines the site structure, and chooses an extraction strategy.
Generate JSON Config
The agent produces a validated JSON configuration with URL rules, extraction settings, and spider metadata.
Store in Database
The config is saved as a database row. No Python files, no code generation—just structured data.
AI Once, Run Forever
The key insight: the agent writes config, not code.
Why JSON Configs Instead of AI-Generated Python?
By constraining the agent to write JSON configs, you get four properties:
Security: Validation Before Execution
All configs go through strict Pydantic validation before they touch the database or crawler:
- Spider names restricted to ^[a-zA-Z0-9_-]+$
- URLs validated: HTTP/HTTPS only, no private IPs (127.0.0.1, 10.x, 192.168.x)
- Callback names validated, with reserved names blocked
- Settings whitelisted: bounded concurrency (1-32), bounded delays (0-60s)
- SQL via SQLAlchemy ORM with parameterized bindings
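The checks above can be sketched as a Pydantic model (assuming Pydantic v2; the field names and the reserved-name set are illustrative, not ScrapAI's actual schema):

```python
import re
from ipaddress import ip_address
from urllib.parse import urlparse

from pydantic import BaseModel, Field, field_validator

NAME_RE = re.compile(r"^[a-zA-Z0-9_-]+$")
RESERVED_NAMES = {"__init__", "parse_start_url"}  # illustrative reserved set


class SpiderConfig(BaseModel):
    name: str
    start_urls: list[str]
    callback: str = "parse"
    concurrency: int = Field(default=8, ge=1, le=32)         # bounded 1-32
    download_delay: float = Field(default=1.0, ge=0, le=60)  # bounded 0-60s

    @field_validator("name", "callback")
    @classmethod
    def identifier_is_safe(cls, v: str) -> str:
        # Restrict identifiers to the documented character class
        if not NAME_RE.fullmatch(v):
            raise ValueError("must match ^[a-zA-Z0-9_-]+$")
        if v in RESERVED_NAMES:
            raise ValueError(f"reserved name: {v}")
        return v

    @field_validator("start_urls")
    @classmethod
    def urls_are_public_http(cls, v: list[str]) -> list[str]:
        for url in v:
            parts = urlparse(url)
            if parts.scheme not in ("http", "https"):
                raise ValueError(f"HTTP/HTTPS only: {url}")
            try:
                ip = ip_address(parts.hostname or "")
            except ValueError:
                continue  # hostname is a domain name, not an IP literal
            if ip.is_private or ip.is_loopback:
                raise ValueError(f"private IP rejected: {url}")
        return v
```

A config that fails any check never reaches the database, so validation errors surface at creation time rather than mid-crawl.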
Safety: No Code Execution
The agent produces data, not code. The worst case is a bad config that extracts wrong fields—caught in the test crawl and trivially fixable. At runtime, Scrapy executes deterministically with no AI in the loop.
Portability: Export and Share
Because spiders are plain JSON rows rather than Python files, configs can be exported, versioned, and shared across projects and deployments.
Maintainability: No Code Drift
When 5 developers write 100 spiders, you get 5 different styles. ScrapAI produces uniform configs with the same schema, validation, and structure.
Example: A Real Spider Config
Here’s what an AI-generated spider config looks like for BBC News:
BBC News Spider Config
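A sketch of what such a config could contain; the field names are illustrative, not ScrapAI's actual schema, with values drawn from the analysis described in the next section:

```json
{
  "name": "bbc_news",
  "project": "news",
  "start_urls": ["https://www.bbc.co.uk"],
  "url_rules": {
    "allow": ["/news/articles/.*", "/sport/.*/articles/.*"],
    "deny": ["#comments"]
  },
  "extraction": {
    "strategies": ["newspaper4k", "trafilatura"]
  },
  "settings": {
    "DOWNLOAD_DELAY": 1.0
  }
}
```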
This config defines URL patterns to match, extraction strategies to use, and Scrapy settings. The DatabaseSpider loads it at runtime and executes the crawl.
What the Agent Figured Out
From a single instruction ("Add https://bbc.co.uk to my news project"), the agent:
- Discovered URL patterns: /news/articles/.* for news, /sport/.*/articles/.* for sports
- Chose extractors: newspaper4k and trafilatura work well for BBC’s article structure
- Set rate limits: 1-second delay between requests to be respectful
- Filtered noise: deny comments sections with #comments
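The allow/deny rules above behave like ordinary regex filters; a minimal sketch of how they classify URL paths (sample paths are invented for illustration):

```python
import re

# Patterns the agent reported discovering for BBC, as listed above
ALLOW = [re.compile(p) for p in (r"/news/articles/.*", r"/sport/.*/articles/.*")]
DENY = [re.compile(r"#comments")]


def should_crawl(path: str) -> bool:
    """Deny rules win; otherwise the path must match an allow rule."""
    if any(d.search(path) for d in DENY):
        return False
    return any(a.search(path) for a in ALLOW)


print(should_crawl("/news/articles/c5y8z1j2k3lo"))           # True
print(should_crawl("/sport/football/articles/ce9x0q"))       # True
print(should_crawl("/news/articles/c5y8z1j2k3lo#comments"))  # False
print(should_crawl("/weather/forecast"))                     # False
```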
The Cost Equation
Traditional AI Scraping
Per-page inference: 10,000 pages × $0.001 per call = $10 per crawl. Run weekly for a year: $520 per site.
ScrapAI
One-time inference: ~20 pages analyzed = $0.02, once. Run forever: $0.02 total.
At 100 sites:
- Traditional AI: $52,000/year
- ScrapAI: $2 once
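The arithmetic behind these figures, taking the per-call price ($0.001 per page-level LLM call) as an assumption:

```python
PAGES_PER_CRAWL = 10_000
COST_PER_CALL = 0.001   # assumption: $0.001 per page-level inference call
WEEKS = 52
SITES = 100

# Traditional AI scraping: every page of every crawl pays for inference
traditional_per_site = PAGES_PER_CRAWL * COST_PER_CALL * WEEKS  # $520 per site

# ScrapAI: ~20 sample pages analyzed once per site
scrapai_per_site = 20 * COST_PER_CALL  # $0.02 per site, once

print(round(traditional_per_site * SITES))  # 52000 -> $52,000/year
print(round(scrapai_per_site * SITES))      # 2 -> $2, paid once
```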
When the Site Changes
Websites redesign. Layouts change. Scrapers break. With ScrapAI:
Fixing a Broken Spider
Key Takeaways
One-Time Cost
AI inference happens once during spider creation. Every subsequent crawl is pure Scrapy—fast and free.
No Code Generation
The agent writes validated JSON configs, not executable Python. Safer, more portable, easier to maintain.
Database-First
Spiders are database rows, not files. Change settings across 100 spiders with one SQL query.
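The "one SQL query" claim can be sketched with SQLite; the table and column names here are illustrative, not ScrapAI's actual schema:

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE spiders (name TEXT PRIMARY KEY, settings TEXT)")

# 100 spiders, each stored as a row holding its JSON settings
db.executemany(
    "INSERT INTO spiders VALUES (?, ?)",
    [(f"spider_{i}", json.dumps({"DOWNLOAD_DELAY": 1.0})) for i in range(100)],
)

# One parameterized UPDATE changes the crawl delay for all 100 spiders --
# no Python files to edit, no redeploy
db.execute(
    "UPDATE spiders SET settings = ?",
    (json.dumps({"DOWNLOAD_DELAY": 2.0}),),
)

delays = {json.loads(s)["DOWNLOAD_DELAY"]
          for (s,) in db.execute("SELECT settings FROM spiders")}
print(delays)  # {2.0}
```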
Production Ready
Every generated spider comes with a test config (5 sample URLs). Run test crawls to verify before production.