ScrapAI is designed to work with AI coding agents that read workflow instructions, analyze websites, and produce validated JSON configs through the CLI. The agent becomes your scraping assistant—you describe what you want in plain English, and it handles the technical work.

How It Works

Instead of writing Python spider files manually, an AI agent generates a JSON config and stores it in a database. A single generic spider (DatabaseSpider) loads any config at runtime.
You (plain English) → AI Agent → JSON config → Database → Scrapy crawl
                       (once)                               (forever)
Why JSON configs instead of AI-generated Python? An agent that writes and executes Python has the same power as an unsupervised developer. If it hallucinates, gets prompt-injected by a malicious page, or loses context, it can do real damage. An agent that writes JSON configs produces data, not code. That data goes through strict validation (Pydantic schemas, SSRF checks, reserved name blocking) before it reaches the database. The worst case is a bad config that extracts wrong fields, caught in the test crawl and trivially fixable.
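The validation gate described above can be illustrated with a minimal stand-in sketch. The real framework uses Pydantic schemas; the reserved-name list and error messages here are hypothetical, chosen only to mirror the checks the docs describe (unknown keys rejected, spider names restricted, reserved callback names blocked):

```python
import re

ALLOWED_KEYS = {"name", "allowed_domains", "start_urls", "rules", "settings"}
RESERVED_NAMES = {"parse", "start_requests", "closed"}  # hypothetical reserved list
NAME_RE = re.compile(r"^[a-zA-Z0-9_-]+$")

def validate_config(config: dict) -> list[str]:
    """Return a list of validation errors (an empty list means the config passes)."""
    errors = []
    extra = set(config) - ALLOWED_KEYS
    if extra:
        # mirrors Pydantic's extra="forbid" behavior
        errors.append(f"unknown keys: {sorted(extra)}")
    name = config.get("name", "")
    if not NAME_RE.match(name):
        errors.append(f"invalid spider name: {name!r}")
    for rule in config.get("rules", []):
        if rule.get("callback") in RESERVED_NAMES:
            errors.append(f"reserved callback name: {rule['callback']!r}")
    return errors
```

A bad config produces errors instead of executing anything, which is the point: the worst-case failure mode is a rejected row, not a running program.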

What the Agent Does

The agent replicates what expert Python web scraping engineers do:
1. Inspect the website: opens the homepage and analyzes the page structure.
2. Identify sections: discovers content categories (blog, news, reports, etc.) and maps how the site is organized.
3. Write URL patterns: creates rules that match specific sections (e.g., /blog/* for blog posts).
4. Analyze content pages: opens sample articles and examines the HTML structure to identify title, content, author, and date.
5. Write extraction rules: creates CSS selectors or configures generic extractors (newspaper, trafilatura).
6. Test and verify: runs test crawls on sample pages to verify extraction quality.
7. Save to database: stores the complete spider configuration for reuse.
Next time you need to scrape the same website? Just use the existing spider from the database. No rebuilding, no rewriting.
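The URL-pattern step above boils down to ordered regexes where the first match decides what happens to a page. A simplified sketch (Scrapy's actual link-following logic lives in its LinkExtractor; the `action` labels here are illustrative):

```python
import re

# Ordered rules: first matching pattern wins, unmatched paths are skipped.
rules = [
    {"allow": r"/news/articles/[^/]+$", "action": "extract"},
    {"allow": r"/news/?$", "action": "follow"},
]

def classify(url_path: str) -> str:
    """Return the action of the first rule whose pattern matches, else 'skip'."""
    for rule in rules:
        if re.search(rule["allow"], url_path):
            return rule["action"]
    return "skip"
```

So `/news/articles/some-slug` is extracted as an article, `/news` is crawled for links, and `/sport/football` is ignored entirely.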

Database-First Spider Management

The problem: most web scraping is one-off scripts that get rewritten every time you need the same data.
ScrapAI's solution: write the spider once, save it to a database, reuse it forever.
Here's what an AI-generated spider config looks like:
{
  "name": "bbc_co_uk",
  "allowed_domains": ["bbc.co.uk"],
  "start_urls": ["https://www.bbc.co.uk/news"],
  "rules": [
    {
      "allow": ["/news/articles/[^/]+$"],
      "callback": "parse_article",
      "follow": false
    },
    {
      "allow": ["/news/?$"],
      "follow": true
    }
  ],
  "settings": {
    "EXTRACTOR_ORDER": ["newspaper", "trafilatura"],
    "DOWNLOAD_DELAY": 2
  }
}
Adding a new website means adding a new row. Spiders are rows in a database, not Python files on disk.

Supported Agents

Claude Code is what we use and test with. The complete workflow instructions fit in ~5k tokens, and ./scrapai setup configures permission rules that block the agent from modifying framework code.
claude
You: "Add https://bbc.com to my news project"
Agent: [Analyzes site, generates rules, tests extraction, deploys spider]

You: "Here's a CSV with 200 websites, add them all to the queue"
Agent: [Queues them, processes in parallel batches]
Claude Code enforces permission rules at the tool level, blocking all Python file modifications (Write/Edit/Update/MultiEdit(**/*.py)), sensitive files, web access, and destructive shell commands. This is the only agent with guaranteed enforcement via .claude/settings.local.json.

Other Coding Agents

ScrapAI should work with any coding agent that can read instructions and run shell commands, including OpenCode, Cursor, Windsurf, and Antigravity. An AGENTS.md file is included for these agents.
These agents lack Claude Code’s permission enforcement, so review changes carefully. They receive instructions but cannot enforce tool-level blocks.

Claws

ScrapAI works with any Claw that can read instructions and execute shell commands. We tested with NanoClaw for autonomous operation via Telegram. More rigorous testing is in progress with other Claws like PicoClaw, IronClaw, and Nanobot.

Agent Safety

When you pair an AI agent with a scraping framework, the agent can potentially modify code, run arbitrary commands, and interact with untrusted web content. ScrapAI’s approach: the agent writes config, not code.
1. Permission rules (Claude Code only): block all Python file modifications (Write/Edit/Update/MultiEdit(**/*.py)), sensitive files (.env, secrets/**), web access (WebFetch, WebSearch), and destructive commands (Bash(rm:*)) at the tool level.
2. CLI-only interaction: the agent interacts only through a defined CLI (./scrapai inspect, ./scrapai spiders import, etc.).
3. Strict validation: JSON configs are validated through Pydantic before import; malformed configs, SSRF URLs, and injection attempts fail validation.
4. Deterministic execution: at runtime, Scrapy executes deterministically with no AI in the loop.

What’s Validated

All input is validated through Pydantic schemas before it touches the database or the crawler:
  • Spider configs: strict schema validation (extra="forbid"), spider names restricted to ^[a-zA-Z0-9_-]+$, callback names validated with reserved names blocked
  • URLs: HTTP/HTTPS only, private IP and localhost blocking (127.0.0.1, 10.x, 172.16.x, 192.168.x, 169.254.x), 2048-char limit
  • Settings: whitelisted extractor names, bounded concurrency (1-32), bounded delays (0-60s)
  • SQL: all queries through SQLAlchemy ORM with parameterized bindings; db query validates table names against a whitelist; UPDATE/DELETE require row count confirmation
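The URL checks above can be sketched in a few lines of stdlib Python. This is a simplified illustration of the described policy, not the framework's actual validator; real SSRF protection must also resolve hostnames and re-check the resulting addresses, which is noted but omitted here:

```python
import ipaddress
from urllib.parse import urlparse

MAX_URL_LEN = 2048

def is_safe_url(url: str) -> bool:
    """Reject non-HTTP(S) schemes, oversized URLs, and private/loopback/link-local hosts."""
    if len(url) > MAX_URL_LEN:
        return False
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    if parsed.hostname == "localhost":
        return False
    try:
        addr = ipaddress.ip_address(parsed.hostname)
    except ValueError:
        return True  # hostname, not an IP literal; real code would resolve and re-check
    return not (addr.is_private or addr.is_loopback or addr.is_link_local)
```

This blocks the classic SSRF targets: loopback (127.0.0.1), RFC 1918 ranges (10.x, 172.16.x, 192.168.x), and link-local metadata endpoints (169.254.x).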

Example Interactions

Single Site Analysis

You: "Add https://techcrunch.com to my news project"

Agent: I'll analyze TechCrunch and create a spider for your news project.

Phase 1: Analyzing site structure...
- Inspected homepage
- Identified sections: articles, startups, AI news
- Documented URL patterns in sections.md

Phase 2: Testing extractors...
- Generic extractors work well (newspaper + trafilatura)
- Articles have clean semantic HTML

Phase 3: Creating spider config...
- Created test_spider.json with 5 sample URLs
- Created final_spider.json with full configuration

Phase 4: Testing and importing...
- Test crawl completed: 5/5 articles extracted successfully
- Titles, content, authors, and dates all present
- Imported final spider to database

Spider 'techcrunch_com' is ready for production use.
Run: ./scrapai crawl techcrunch_com --project news

Batch Processing

You: "Here's a CSV with 50 news websites. Add them all to the queue and process them."

Agent: I'll bulk-add all 50 websites to the queue and process them in parallel batches.

Added 50 websites to queue.

Processing batch 1 (5 websites)...
[Processes 5 sites in parallel, each through Phase 1-4]

Batch 1 complete: 4 succeeded, 1 failed (Cloudflare challenge)

Processing batch 2 (5 websites)...
[Continues until all batches complete]

Final results: 47 spiders created, 3 failed (retry needed)

You’re Always in the Loop

The agent doesn’t just run off and do things. During site analysis, it writes detailed notes in sections.md: what URL patterns it found, what sections the site has, what extraction strategy it chose and why. Plain language, easy to read. You can review at any point, correct the agent’s assumptions, and bring your expertise into the process.

Next Steps