Claude Code is the recommended AI agent for ScrapAI. The complete workflow instructions are in CLAUDE.md (~5k tokens), and ./scrapai setup automatically configures permission rules that block the agent from modifying framework code.

Why Claude Code?

Permission enforcement at the tool level. Claude Code is the only agent that can enforce hard blocks on file operations and shell commands. Once configured, it cannot write or edit Python files, even if it tries to.

Sub-agent parallelization without context compaction. Claude Code can spawn sub-agents (via the Task tool) that work independently, so you can process multiple websites in one session without losing context. In our internal experiments, we processed 40 websites in a single Claude session before hitting conversation compaction.

Other agents (OpenCode, Cursor, Windsurf) receive the same instructions but lack enforcement: they can choose to ignore the rules. Claude Code can't.

Setup

1. Install ScrapAI

git clone https://github.com/discourselab/scrapai-cli.git
cd scrapai-cli
./scrapai setup
./scrapai verify
./scrapai setup creates the virtual environment, installs dependencies (including browser drivers), initializes SQLite, and configures Claude Code permissions.
2. Launch Claude Code

claude
The agent automatically reads CLAUDE.md on startup and understands the complete workflow.
3. Start building spiders

You: "Add https://bbc.com to my news project"
The agent will analyze the site, generate rules, test extraction, and deploy the spider through all 4 phases.

Permission Rules

./scrapai setup creates .claude/settings.local.json with tool-level enforcement:
  • Blocked: Python file modifications, sensitive files (.env, secrets), web access (WebFetch/WebSearch), destructive commands (rm)
  • Allowed: reading all files, JSON operations, CLI commands (./scrapai, sqlite3, psql), file search (Glob/Grep), safe git operations
The agent can read Python files to understand the framework but cannot modify them.
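Claude Code reads permission rules from a settings file with allow and deny lists of tool patterns. The sketch below illustrates the general shape of such a file; the specific patterns are examples for illustration, not the file ./scrapai setup actually generates:

```json
{
  "permissions": {
    "allow": [
      "Bash(./scrapai:*)",
      "Bash(sqlite3:*)",
      "Glob",
      "Grep"
    ],
    "deny": [
      "Edit(**/*.py)",
      "Write(**/*.py)",
      "Read(.env)",
      "WebFetch",
      "WebSearch",
      "Bash(rm:*)"
    ]
  }
}
```

Deny rules take precedence, which is what makes the Python-file block a hard guarantee rather than a suggestion in the prompt.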

How the Agent Works

The agent reads CLAUDE.md which contains:
  1. Identity and purpose: “You are ScrapAI, a web scraping assistant”
  2. 4-phase workflow: Analysis → Rules → Import → Test (detailed in workflow documentation)
  3. Critical rules: Always use --project, never skip phases, run commands one at a time
  4. CLI reference: Complete command documentation with examples
  5. Settings reference: Generic extractors, custom selectors, Cloudflare, Playwright

Context Management

The full agent instructions fit in ~5k tokens. Additional docs (Cloudflare, proxies, callbacks, etc.) are loaded only when needed, not upfront. Most of the context window goes to actual site analysis, not reading a manual. When the agent needs specialized knowledge:
  • Cloudflare bypass → reads docs/cloudflare.md
  • Custom callbacks → reads docs/callbacks.md
  • Queue operations → reads docs/queue.md
This keeps context usage low and scraping analysis high.

Allowed Tools

  • ./scrapai CLI: All commands (inspect, analyze, crawl, spiders, queue, db, export)
  • File operations: Read, Write, Edit (JSON/Markdown/text only, no .py files), Glob, Grep
  • Bash: git operations, npm, docker. Prefer the built-in tools over shell equivalents: use ./scrapai inspect instead of curl/wget, and Read/Grep/Glob instead of cat/grep/find
  • Task (sub-agents): Spawn parallel sub-agents for batch processing (max 5 at a time)

Example Workflows

Single website:
You: "Add https://arstechnica.com to my tech project"
Agent: Analyzes site → Tests extractors → Creates config → Test crawl (5/5) → Imports
Spider 'arstechnica_com' ready: ./scrapai crawl arstechnica_com --project tech
Batch processing (parallel sub-agents):
You: "Process all 12 sites in queue"
Agent: Spawns 5 sub-agents → Batch 1 (5 sites) → Batch 2 (5 sites) → Batch 3 (2 sites)
Result: 11 spiders created, 1 failed (Cloudflare - retry needed)
Fixing broken spider:
You: "The BBC spider is broken. Fix it."
Agent: Re-analyzes site → Updates selectors → Tests on 5 articles → Imports update
Spider 'bbc_co_uk' updated and verified.

Common Questions

Can the agent modify the framework? No. The agent cannot edit Python files; it configures spider-specific settings in JSON configs only. Framework changes require manual edits.

What if the agent analyzes a site incorrectly? Guide it: "The blog section is at /articles/, not /blog/. Update the spider." The agent re-reads your notes and regenerates the config.

Can I edit configs by hand? Yes. Hand-edit JSON configs anytime, then: ./scrapai spiders import your_config.json --project yourproject
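Since spider configs are plain JSON, hand edits are just text edits followed by a re-import. The field names in this sketch are hypothetical, shown only to illustrate the kind of spider-specific settings that live in a config; check a generated config for the real schema:

```json
{
  "name": "arstechnica_com",
  "start_urls": ["https://arstechnica.com"],
  "selectors": {
    "title": "h1",
    "body": "article p"
  },
  "playwright": false
}
```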

Troubleshooting

  • Agent tries to edit Python files: the permission rules are working correctly. Remind it: "You cannot edit Python files. Use CLI commands only."
  • Agent skips phases: remind it: "Complete all 4 phases. Don't skip Phase 2."
  • Permission rules not working: check that ./scrapai setup completed, verify .claude/settings.local.json exists, and restart Claude Code. Remember: only Claude Code enforces permissions.

Next Steps

4-Phase Workflow

Learn the complete analysis → rules → import → test workflow

CLI Reference

Complete CLI command documentation