You’re Always in the Loop
The agent doesn’t just run off and do things. During site analysis, it writes detailed notes in `sections.md`: what URL patterns it found, what sections the site has, what extraction strategy it chose and why. Plain language, easy to read.
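For example, the notes might read like this (a hypothetical excerpt; the contents are invented for illustration):

```markdown
## URL patterns
- Articles live under /news/<year>/<slug>; category pages under /topics/<name>

## Extraction strategy
Chose trafilatura: article pages are server-rendered, no JS required.
Custom CSS fallback for the byline, which trafilatura misses.
```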
You can review at any point, correct the agent’s assumptions, and bring your expertise into the process.
Hand-Write, Edit, or Override Anything
Write your own JSON configs from scratch. Edit AI-generated ones. Override settings per spider. Write custom callbacks with your own CSS/XPath selectors and data processors.

Consistency Across the Fleet
When 5 developers write 100 spiders, you get 5 different styles, naming conventions, and quality levels. ScrapAI produces uniform configs with the same schema, validation, and structure. Easier to review, easier to debug, easier to onboard new people.

Architecture
ScrapAI is an orchestration layer on top of Scrapy. Instead of writing a Python spider file per website, an AI agent generates a JSON config and stores it in a database. A single generic spider (DatabaseSpider) loads any config at runtime.
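The core pattern is small enough to sketch. The real `DatabaseSpider` is a Scrapy spider backed by SQLAlchemy; this stdlib-only sketch (table and column names are assumptions, not the project's actual schema) shows the load-config-at-runtime idea:

```python
import json
import sqlite3


def load_spider_config(db_path: str, name: str) -> dict:
    """Fetch one spider's JSON config from the database at runtime."""
    con = sqlite3.connect(db_path)
    try:
        row = con.execute(
            "SELECT config FROM spiders WHERE name = ?", (name,)
        ).fetchone()
    finally:
        con.close()
    if row is None:
        raise KeyError(f"no spider named {name!r}")
    return json.loads(row[0])


# One generic class then serves every site; in real ScrapAI this
# subclasses scrapy.Spider and gets its name via crawler arguments.
class DatabaseSpider:
    def __init__(self, db_path: str, name: str):
        self.name = name
        self.config = load_spider_config(db_path, name)
        self.start_urls = self.config.get("start_urls", [])
```

Adding a site means inserting a row, not deploying code.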
Component Overview
| Component | What it does |
|---|---|
| `scrapai` | Entry point; auto-activates the venv and delegates to the CLI |
| `cli/` | Click-based CLI: spiders, queue, crawl, show, export, inspect |
| `spiders/database_spider.py` | Generic spider that loads config from the database at runtime |
| `spiders/sitemap_spider.py` | Sitemap-based spider for sites with XML sitemaps |
| `core/extractors.py` | Extraction chain: newspaper, trafilatura, custom CSS, Playwright |
| `core/models.py` | SQLAlchemy models: Spider, SpiderRule, SpiderSetting, ScrapedItem |
| `handlers/cloudflare_handler.py` | Cloudflare bypass with cookie caching |
| `middlewares.py` | SmartProxyMiddleware, direct-to-proxy escalation |
| `pipelines.py` | Batched database writes and JSONL export |
| `alembic/` | Database migrations |
| `airflow/` | Production scheduling with Apache Airflow |
Codebase
Small and readable: ~4,000 lines of code. Built on Scrapy, SQLAlchemy, Alembic — tools you already know. Read the whole thing in an afternoon.

Measured with pygount, counting actual code lines only (no blanks, no comments, no docstrings). Tests, examples, and docs excluded.

| Metric | Count |
|---|---|
| Files | 37 |
| Code Lines | 4,028 |
| Comment Lines | 895 |
| Comment % | 14% |
For comparison, measured the same way:
- Scrapling: 5,875 lines (21% comments)
- crawl4ai: 26,850 lines (21% comments)
Writing Spider Configs
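The agent emits one JSON document per spider. The project's actual schema isn't reproduced here, so every field name in this sketch is an assumption; the bounds and extractor names echo the validation rules described in the Security section:

```json
{
  "name": "example_news",
  "start_urls": ["https://news.example.com"],
  "rules": [
    {"allow": "/articles/\\d+", "callback": "article"}
  ],
  "settings": {
    "extractor": "trafilatura",
    "concurrency": 4,
    "download_delay": 1.0
  }
}
```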
Each config bundles the spider’s metadata, crawl rules, and per-spider settings in one schema-validated JSON document, imported and stored in the database.

Custom Extractors
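For pages that aren't articles, a callback largely reduces to a map from output fields to selectors plus optional processors. A hypothetical fragment (the shape, selector syntax, and processor names here are assumptions):

```json
{
  "callback": "product",
  "fields": {
    "title": {"css": "h1.product-title::text"},
    "price": {"css": ".price::text", "processor": "to_float"},
    "sku": {"xpath": "//span[@itemprop='sku']/text()"}
  }
}
```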
For non-article content (products, jobs, listings), write custom callbacks with field-level selectors.

Database Schema
All configuration lives in PostgreSQL (or SQLite for development):

Spider Table
SpiderRule Table
SpiderSetting Table
ScrapedItem Table
Extending ScrapAI
Adding a New Extractor
Create a new extractor class in `core/extractors.py` and add it to the extraction chain.
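The extractor interface isn't documented here, so assume a minimal contract: each extractor takes HTML and a URL and returns extracted fields, or `None` to let the chain fall through to the next extractor. A stdlib-only sketch under that assumption:

```python
import re


class RegexTitleExtractor:
    """Hypothetical extractor: pulls <title> with a regex, else defers.

    Assumes the chain calls extract(html, url) on each extractor in
    order and keeps the first non-None result (interface assumed).
    """

    name = "regex_title"  # would need whitelisting in settings validation

    def extract(self, html: str, url: str):
        m = re.search(r"<title[^>]*>(.*?)</title>", html, re.I | re.S)
        if not m:
            return None  # fall through to the next extractor in the chain
        return {"title": m.group(1).strip(), "url": url}
```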
Adding Custom Middleware
Add your middleware class to `middlewares.py`.
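Scrapy middleware hooks are plain methods like `process_request`. As a sketch in the spirit of the project's direct-to-proxy escalation (the behavior and proxy address below are assumptions, not the actual `SmartProxyMiddleware` logic):

```python
class ProxyEscalationMiddleware:
    """Hypothetical middleware: go direct first, escalate to a proxy on retry."""

    def __init__(self, proxy_url: str = "http://proxy.example.com:8080"):
        self.proxy_url = proxy_url  # placeholder proxy address

    def process_request(self, request, spider):
        # Scrapy's RetryMiddleware sets meta["retry_times"] on retried requests,
        # so a non-zero value means the direct attempt already failed.
        if request.meta.get("retry_times", 0) > 0:
            request.meta["proxy"] = self.proxy_url
        return None  # None tells Scrapy to keep processing the request
```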
Then register it in `scrapy_settings.py` with the usual `DOWNLOADER_MIDDLEWARES` entry.
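Registration is the standard Scrapy pattern: map the class path to a priority (the path and priority below are illustrative assumptions):

```python
DOWNLOADER_MIDDLEWARES = {
    # Hypothetical class from the middleware example; 543 slots it
    # between Scrapy's built-in retry and proxy middlewares.
    "middlewares.ProxyEscalationMiddleware": 543,
}
```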
Adding CLI Commands
Add commands under `cli/`.
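The CLI is Click-based, so a new command is a decorated function. A hypothetical `count` command (the name, option, and output are invented; a real one would query the database):

```python
import click


@click.command("count")
@click.option("--spider", default=None, help="Limit the count to one spider.")
def count(spider):
    """Hypothetical command: report how many items have been scraped."""
    # A real implementation would query the ScrapedItem table here;
    # this sketch only shows the Click plumbing.
    target = spider or "all spiders"
    click.echo(f"counting items for {target}")
```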
Register the command in `cli/__init__.py` so it shows up under `./scrapai`.
Storage Modes
Test mode (`--limit N`): saves to the database; inspect via the `show` command
Migrating Existing Scrapers
Point the agent at your existing Python scripts (Scrapy spiders, BeautifulSoup, Scrapling, whatever) and it’ll read them, understand the extraction logic, and write the equivalent ScrapAI JSON config.

Security
All input is validated through Pydantic schemas before it touches the database or the crawler:

- Spider configs: strict schema validation (`extra="forbid"`), spider names restricted to `^[a-zA-Z0-9_-]+$`, callback names validated with reserved names blocked
- URLs: HTTP/HTTPS only; private IP and localhost blocking (127.0.0.1, 10.x, 172.16.x, 192.168.x, 169.254.x); 2048-char limit
- Settings: whitelisted extractor names, bounded concurrency (1-32), bounded delays (0-60s)
- SQL: all queries go through the SQLAlchemy ORM with parameterized bindings; `db query` validates table names against a whitelist; UPDATE/DELETE require row-count confirmation
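The URL rules can be sketched with the stdlib alone; in the real project these checks live inside Pydantic validators, but the logic is the same (`is_private` covers the 10.x, 172.16.x, and 192.168.x ranges, `is_link_local` covers 169.254.x):

```python
import ipaddress
from urllib.parse import urlsplit


def validate_url(url: str) -> str:
    """Reject non-HTTP(S), over-long, and private/localhost URLs."""
    if len(url) > 2048:
        raise ValueError("URL exceeds 2048 characters")
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        raise ValueError("only http/https URLs are allowed")
    host = parts.hostname or ""
    if host == "localhost":
        raise ValueError("localhost is blocked")
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return url  # a hostname, not a literal IP: allowed
    if ip.is_private or ip.is_loopback or ip.is_link_local:
        raise ValueError("private/loopback/link-local IPs are blocked")
    return url
```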
Agent Safety
When you pair an AI agent with a scraping framework, the agent can potentially modify code, run arbitrary commands, and interact with untrusted web content. This isn’t theoretical: in February 2026, an OpenClaw agent deleted 200+ emails after context compaction caused it to lose its safety constraints. ScrapAI’s approach: the agent writes config, not code.

- With Claude Code, permission rules block all Python file modifications (`Write/Edit/Update/MultiEdit(**/*.py)`), sensitive files (`.env`, `secrets/**`), web access (`WebFetch`, `WebSearch`), and destructive shell commands at the tool level
- The agent interacts only through a defined CLI (`./scrapai inspect`, `./scrapai spiders import`, etc.)
- JSON configs are validated through Pydantic before import; malformed configs, SSRF URLs, and injection attempts fail validation
- At runtime, Scrapy executes deterministically with no AI in the loop

These permission rules are installed by `./scrapai setup`. Other agents get instructions but not enforcement; only Claude Code guarantees the agent can’t sidestep them.
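In Claude Code, such rules live in the project's settings file. A hypothetical excerpt of what setup might write to `.claude/settings.json` (the exact rule strings are assumptions, not copied from the project):

```json
{
  "permissions": {
    "deny": [
      "Write(**/*.py)",
      "Edit(**/*.py)",
      "Read(.env)",
      "Read(secrets/**)",
      "WebFetch",
      "WebSearch"
    ]
  }
}
```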
See Comparison for the full analysis.
Contributing
Contributions welcome. Areas where help would be particularly valuable:

Structural Change Detection
Automatic detection of website structural changes
Extraction Modules
Additional extraction modules (images, tables, PDFs)
Anti-Bot Support
Anti-bot support beyond Cloudflare
Authentication
Authentication and session management
Development Setup
Running Tests
Code Style
We follow PEP 8 with these exceptions:
- Line length: 120 characters
- Docstrings: Google style
Limitations
Current limitations (pull requests welcome):
- Authentication: No login support, no paywall bypass, no persistent sessions
- Advanced anti-bot: We handle Cloudflare. Not DataDome, PerimeterX, Akamai, or CAPTCHA-solving services
- Interactive content: No form submission, no click-based pagination