How It Works
Instead of writing Python spider files manually, an AI agent generates a JSON config and stores it in a database. A single generic spider (DatabaseSpider) loads any config at runtime.
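A minimal sketch of the idea — one generic spider class that hydrates itself from a stored JSON config. The field names (`name`, `start_urls`, `selectors`) are illustrative assumptions, not ScrapAI's actual schema:

```python
import json

class DatabaseSpider:
    """Illustrative sketch: one generic spider, any config at runtime.

    Field names here are assumptions for illustration only.
    """

    def __init__(self, config_json: str):
        cfg = json.loads(config_json)      # config row fetched from the database
        self.name = cfg["name"]
        self.start_urls = cfg["start_urls"]
        self.selectors = cfg["selectors"]  # CSS rules written by the agent

# Hypothetical config row as it might be stored in the database
spider = DatabaseSpider(
    '{"name": "news_site",'
    ' "start_urls": ["https://example.com/news"],'
    ' "selectors": {"title": "h1::text"}}'
)
```

Because the spider code never changes, adding a new site means adding a new row, not a new file.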
What the Agent Does
The agent replicates what expert Python web scraping engineers do:

Identify sections
Discovers content categories (blog, news, reports, etc.) and understands the site organization
Analyze content pages
Opens sample articles and examines the HTML structure to identify title, content, author, date
Write extraction rules
Creates CSS selectors or configures generic extractors (newspaper, trafilatura)
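The output of these three steps is a JSON spider config. A hypothetical sketch — every field name below is an illustrative assumption, not the framework's documented schema:

```json
{
  "name": "example_blog",
  "start_urls": ["https://example.com/blog"],
  "extractor": "trafilatura",
  "selectors": {
    "title": "h1.post-title::text",
    "author": ".byline a::text",
    "date": "time::attr(datetime)"
  }
}
```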
Database-First Spider Management
The problem: most web scraping is one-off scripts that get rewritten every time you need the same data. ScrapAI's solution: write the spider once, save it to a database, reuse it forever. The spider itself is just an AI-generated JSON config stored as a database row.

Supported Agents
Claude Code (Recommended)
Claude Code is what we use and test with. The complete workflow instructions fit in ~5k tokens, and ./scrapai setup configures permission rules that block the agent from modifying framework code.
Claude Code enforces permission rules at the tool level, blocking all Python file modifications (Write/Edit/Update/MultiEdit(**/*.py)), sensitive files, web access, and destructive shell commands. This is the only agent with guaranteed enforcement via .claude/settings.local.json.

Other Coding Agents
OpenCode, Cursor, Windsurf, Antigravity, and any other agent that can read instructions and run shell commands should work. An AGENTS.md file is included for these agents.
Claws
ScrapAI works with any Claw that can read instructions and execute shell commands. We tested with NanoClaw for autonomous operation via Telegram. More rigorous testing is in progress with other Claws like PicoClaw, IronClaw, and Nanobot.

Agent Safety
When you pair an AI agent with a scraping framework, the agent can potentially modify code, run arbitrary commands, and interact with untrusted web content. ScrapAI's approach: the agent writes config, not code.

Permission rules (Claude Code only)
Permission rules block all Python file modifications (Write/Edit/Update/MultiEdit(**/*.py)), sensitive files (.env, secrets/**), web access (WebFetch, WebSearch), and destructive commands (Bash(rm:*)) at the tool level.

CLI-only interaction
The agent interacts only through a defined CLI (./scrapai inspect, ./scrapai spiders import, etc.).

Strict validation
JSON configs are validated through Pydantic before import. Malformed configs, SSRF URLs, and injection attempts fail validation.
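A hedged sketch of what the deny rules might look like in .claude/settings.local.json — the entries below are assumptions derived from the rules listed in this document, not the shipped file:

```json
{
  "permissions": {
    "deny": [
      "Write(**/*.py)",
      "Edit(**/*.py)",
      "Read(.env)",
      "Read(secrets/**)",
      "WebFetch",
      "WebSearch",
      "Bash(rm:*)"
    ]
  }
}
```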
What’s Validated
All input is validated through Pydantic schemas before it touches the database or the crawler:

- Spider configs: strict schema validation (extra="forbid"), spider names restricted to ^[a-zA-Z0-9_-]+$, callback names validated with reserved names blocked
- URLs: HTTP/HTTPS only, private IP and localhost blocking (127.0.0.1, 10.x, 172.16.x, 192.168.x, 169.254.x), 2048-char limit
- Settings: whitelisted extractor names, bounded concurrency (1-32), bounded delays (0-60s)
- SQL: all queries through SQLAlchemy ORM with parameterized bindings; db query validates table names against a whitelist; UPDATE/DELETE require row count confirmation
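The real framework enforces these rules through Pydantic schemas; the stdlib sketch below just mirrors the checks listed above so you can see the logic (the table whitelist contents are hypothetical):

```python
import ipaddress
import re
from urllib.parse import urlparse

NAME_RE = re.compile(r"^[a-zA-Z0-9_-]+$")
ALLOWED_TABLES = {"spiders", "items"}  # hypothetical whitelist, as used by `db query`

def valid_spider_name(name: str) -> bool:
    """Spider names restricted to ^[a-zA-Z0-9_-]+$."""
    return bool(NAME_RE.fullmatch(name))

def valid_url(url: str) -> bool:
    """HTTP/HTTPS only, no private/loopback/link-local IPs, 2048-char limit."""
    if len(url) > 2048:
        return False
    parts = urlparse(url)
    if parts.scheme not in ("http", "https"):
        return False
    try:
        ip = ipaddress.ip_address(parts.hostname or "")
    except ValueError:
        return True  # a hostname, not an IP literal; resolver checks out of scope here
    return not (ip.is_private or ip.is_loopback or ip.is_link_local)

def valid_table(table: str) -> bool:
    """Table names must match the whitelist before any query runs."""
    return table in ALLOWED_TABLES
```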
Example Interactions
Single Site Analysis
Batch Processing
You’re Always in the Loop
The agent doesn’t just run off and do things. During site analysis, it writes detailed notes in sections.md: what URL patterns it found, what sections the site has, and what extraction strategy it chose and why.
Plain language, easy to read. You can review at any point, correct the agent’s assumptions, and bring your expertise into the process.
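A hypothetical excerpt of what such notes might look like — the site details below are invented purely for illustration:

```markdown
## example.com — sections

- /blog/ — long-form posts, paginated (?page=N)
- /news/ — short updates, infinite scroll; fell back to sitemap URLs

Strategy: CSS selectors for /blog/ (stable class names);
generic extractor fallback for /news/ (markup varies per post).
```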