Analysis and Review
During site analysis, agents write detailed notes in sections.md documenting URL patterns, site structure, and extraction strategy. Review the analysis, correct assumptions, and refine the approach before finalizing configs.
Full Control
Write configs by hand, edit generated ones, override settings per spider, or write custom callbacks with your own CSS/XPath selectors.

Team Benefits
All configs follow the same schema. Uniform structure across the fleet means easier code review, debugging, and onboarding. One developer can pick up another’s spider without decoding personal style choices.

Architecture
ScrapAI is an orchestration layer on top of Scrapy. Instead of writing a Python spider file per website, an AI agent generates a JSON config and stores it in a database. A single generic spider (DatabaseSpider) loads any config at runtime.
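That runtime flow can be sketched as follows. This is a hedged illustration, not the actual implementation: the real loading logic lives in spiders/database_spider.py and uses the SQLAlchemy models in core/models.py, and the table and column names below are assumptions.

```python
import json
import sqlite3


def load_spider_config(db_path: str, spider_name: str) -> dict:
    """Fetch the stored JSON config for one spider.

    Hedged sketch: ScrapAI uses SQLAlchemy models rather than raw SQL,
    and the `spiders`/`config` names here are illustrative assumptions.
    """
    conn = sqlite3.connect(db_path)
    try:
        row = conn.execute(
            "SELECT config FROM spiders WHERE name = ?",  # parameterized binding
            (spider_name,),
        ).fetchone()
    finally:
        conn.close()
    if row is None:
        raise KeyError(f"no spider named {spider_name!r}")
    return json.loads(row[0])
```

The generic spider then configures itself from the returned dict at startup, which is why adding a new website never requires a new Python file.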
Component Overview
| Component | What it does |
|---|---|
| scrapai | Entry point, auto-activates venv, delegates to CLI |
| cli/ | Click-based CLI: spiders, queue, crawl, show, export, inspect |
| spiders/database_spider.py | Generic spider that loads config from database at runtime |
| spiders/sitemap_spider.py | Sitemap-based spider for sites with XML sitemaps |
| core/extractors.py | Extraction chain: newspaper, trafilatura, custom CSS, Playwright |
| core/models.py | SQLAlchemy models: Spider, SpiderRule, SpiderSetting, ScrapedItem |
| handlers/cloudflare_handler.py | Cloudflare bypass with cookie caching |
| middlewares.py | SmartProxyMiddleware, direct-to-proxy escalation |
| pipelines.py | Batched database writes and JSONL export |
| alembic/ | Database migrations |
| airflow/ | Production scheduling with Apache Airflow |
Codebase
Small and readable: ~4,000 lines of code. Built on Scrapy, SQLAlchemy, and Alembic, tools you already know. Read the whole thing in an afternoon.

Measured with pygount, counting actual code lines only (no blanks, no comments, no docstrings). Tests, examples, and docs excluded.

| Metric | Count |
|---|---|
| Files | 37 |
| Code Lines | 4,028 |
| Comment Lines | 895 |
| Comment % | 14% |
For comparison:
- Scrapling: 5,875 lines (21% comments)
- crawl4ai: 26,850 lines (21% comments)
Writing Spider Configs
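As a hedged sketch, a generated config might look like the following. Every key name here is an illustrative assumption (the Spider Schema reference is authoritative); the point is that the entire spider is declarative JSON, not code.

```python
import json

# Illustrative only: key names are assumptions, not the official Spider Schema.
config = {
    "name": "example_news",
    "start_urls": ["https://example.com/news"],
    "rules": [
        # Crawl rule: which URLs to follow and which callback handles them.
        {"allow": r"/news/\d{4}/", "callback": "extract_article"},
    ],
    "settings": {
        "DOWNLOAD_DELAY": 1.0,
        # Extractors are tried in order until one succeeds (see core/extractors.py).
        "extractor_chain": ["newspaper", "trafilatura"],
    },
}

print(json.dumps(config, indent=2))
```

Because the config is plain data, the AI agent can generate, diff, and repair it without ever touching Python.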
Spider configs are plain JSON stored in the database; the Spider Schema reference documents the complete format.

Custom Extractors
For non-article content (products, jobs, listings), write custom callbacks with field-level selectors.
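A custom callback can be sketched like this. It is a hedged illustration: the response object follows Scrapy's selector API, and every selector and field name below is an assumption for a hypothetical product page.

```python
# Hedged sketch of a custom callback with field-level selectors.
# Assumes a Scrapy-style response (.css(...).get()/.getall()); the
# CSS selectors and field names are illustrative, not from ScrapAI.

def extract_product(response):
    """Field-level extraction for a hypothetical product listing page."""
    return {
        "title": response.css("h1.product-title::text").get(),
        "price": response.css("span.price::text").get(),
        "sku": response.css("[data-sku]::attr(data-sku)").get(),
        "description": " ".join(
            response.css("div.description ::text").getall()
        ).strip(),
    }
```

Field-level selectors keep failures independent: a missing price yields None for that one field instead of breaking the whole item.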
All configuration lives in PostgreSQL (or SQLite for development):

Spider Table
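As a hedged sketch of this table's likely shape (the real definition is the SQLAlchemy Spider model in core/models.py; every column name below is an assumption), expressed as SQLite DDL:

```python
import sqlite3

# Column names are assumptions for illustration; see core/models.py for
# the authoritative SQLAlchemy model.
DDL = """
CREATE TABLE spiders (
    id INTEGER PRIMARY KEY,
    name TEXT UNIQUE NOT NULL,
    start_urls TEXT NOT NULL,   -- JSON array
    enabled INTEGER DEFAULT 1,
    created_at TEXT
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
conn.execute(
    "INSERT INTO spiders (name, start_urls) VALUES (?, ?)",
    ("example_news", '["https://example.com/news"]'),
)
row = conn.execute("SELECT name, enabled FROM spiders").fetchone()
```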
SpiderRule Table
SpiderSetting Table
ScrapedItem Table
Extending ScrapAI
Adding a New Extractor
Create a new extractor class in core/extractors.py:
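A hedged sketch of what such a class might look like. The actual base-class API in core/extractors.py may differ; this assumes extractors expose extract(html, url) and return None to let the chain fall through to the next extractor.

```python
import json
import re


class JsonLdExtractor:
    """Pull headline/body from embedded JSON-LD (illustrative example).

    Hedged sketch: the extractor interface assumed here (extract() returning
    a dict or None) is an assumption, not the documented ScrapAI API.
    """

    name = "jsonld"

    def extract(self, html: str, url: str):
        match = re.search(
            r'<script type="application/ld\+json">(.*?)</script>',
            html,
            re.DOTALL,
        )
        if not match:
            return None  # fall through to the next extractor in the chain
        data = json.loads(match.group(1))
        return {
            "title": data.get("headline"),
            "text": data.get("articleBody"),
            "url": url,
        }
```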
Adding Custom Middleware
Add middleware to middlewares.py, then register it in scrapy_settings.py:
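A minimal sketch, assuming a standard Scrapy downloader middleware; the class name, header, and priority number are illustrative, not from the ScrapAI codebase.

```python
# middlewares.py (hedged sketch): a Scrapy downloader middleware that
# tags every outgoing request. The header name is an assumption.

class ExtraHeadersMiddleware:
    """Attach a custom header to every outgoing request."""

    def process_request(self, request, spider):
        request.headers.setdefault("X-Scraper", "scrapai")
        return None  # None tells Scrapy to continue normal processing

# scrapy_settings.py (hedged sketch): register it with a priority.
# DOWNLOADER_MIDDLEWARES = {
#     "middlewares.ExtraHeadersMiddleware": 543,
# }
```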
Adding CLI Commands
Add commands to cli/ and register them in cli/__init__.py:
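A hedged sketch using Click, which the CLI is built on. The group, command, and option names here are illustrative assumptions; the real commands live in cli/ and are registered on the group in cli/__init__.py.

```python
import click

# Illustrative stub: command and option names are assumptions.

@click.group()
def cli():
    """ScrapAI command line (illustrative stub)."""


@cli.command()
@click.argument("spider_name")
@click.option("--limit", type=int, default=None, help="Stop after N items (test mode).")
def stats(spider_name, limit):
    """Print a one-line summary for SPIDER_NAME (hypothetical command)."""
    click.echo(f"{spider_name}: limit={limit}")
```

Registering the command on the shared group is what makes it appear alongside spiders, queue, crawl, and the other built-ins.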
Storage Modes
Test mode (--limit N): saves to the database; inspect via the show command
Migrating Existing Scrapers
Point the agent at your existing Python scripts (Scrapy spiders, BeautifulSoup, Scrapling, whatever) and it’ll read them, understand the extraction logic, and write the equivalent ScrapAI JSON config.

Security
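ScrapAI's validation layer can be sketched as follows. The real checks are Pydantic schemas applied to whole spider configs; this stdlib sketch (function name and checks assumed) only illustrates the flavor of one URL check.

```python
from urllib.parse import urlparse


def validate_start_urls(urls):
    """Reject anything that is not an absolute http(s) URL.

    Illustrative only: ScrapAI's actual validation uses Pydantic schemas
    covering the full config, not this hand-rolled helper.
    """
    if not isinstance(urls, list) or not urls:
        raise ValueError("start_urls must be a non-empty list")
    for url in urls:
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"invalid start URL: {url!r}")
    return urls
```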
All input is validated through Pydantic schemas. Spider configs, URLs, and settings are validated before touching the database or crawler. SQL queries use parameterized bindings. ScrapAI uses a config-only architecture where agents write JSON, not code. See Security-First Design for the full security model.

Contributing
Contributions welcome. Areas where help would be particularly valuable:

Structural Change Detection
Automatic detection of website structural changes
Extraction Modules
Additional extraction modules (images, tables, PDFs)
Anti-Bot Support
Anti-bot support beyond Cloudflare
Authentication
Authentication and session management
Development Setup
Running Tests
Code Style
We follow PEP 8 with these exceptions:
- Line length: 120 characters
- Docstrings: Google style
Limitations
Current limitations (pull requests welcome):
- Authentication: No login support, no paywall bypass, no persistent sessions
- Advanced anti-bot: We handle Cloudflare, but not DataDome, PerimeterX, Akamai, or CAPTCHA-solving services
- Interactive content: No form submission, no click-based pagination
Related Documentation
Architecture
Technical architecture and design decisions
Spider Schema
Complete JSON schema reference
Custom Callbacks
Write custom field extractors
Security
Security model and validation