Claude Code is the recommended AI agent for ScrapAI. The complete workflow instructions are in CLAUDE.md (~5k tokens), and ./scrapai setup automatically configures permission rules that block the agent from modifying framework code.

Why Claude Code?

Permission enforcement at the tool level. Claude Code is the only agent that can enforce hard blocks on file operations and shell commands. When configured, it cannot write or edit Python files, even if it wants to. Other agents (OpenCode, Cursor, Windsurf) receive the same instructions but lack enforcement. They can choose to ignore the rules. Claude Code can’t.

Setup

1. Install ScrapAI

git clone https://github.com/discourselab/scrapai-cli.git
cd scrapai-cli
./scrapai setup
./scrapai verify
./scrapai setup creates the virtual environment, installs dependencies (including browser drivers), initializes SQLite, and configures Claude Code permissions.
2. Launch Claude Code

claude
The agent automatically reads CLAUDE.md on startup and understands the complete workflow.
3. Start building spiders

You: "Add https://bbc.com to my news project"
The agent will analyze the site, generate rules, test extraction, and deploy the spider through all 4 phases.

Permission Rules

When you run ./scrapai setup, it creates .claude/settings.local.json with hard enforcement of allow/deny lists:
.claude/settings.local.json
{
  "permissions": {
    "allow": [
      "Read",
      "Write",
      "Edit",
      "Update",
      "Glob",
      "Grep",
      "Bash(./scrapai:*)",
      "Bash(source:*)",
      "Bash(sqlite3:*)",
      "Bash(psql:*)",
      "Bash(xvfb-run:*)"
    ],
    "deny": [
      "Edit(scrapai)",
      "Update(scrapai)",
      "Edit(.claude/*)",
      "Update(.claude/*)",
      "Write(**/*.py)",
      "Edit(**/*.py)",
      "Update(**/*.py)",
      "MultiEdit(**/*.py)",
      "Write(.env)",
      "Write(secrets/**)",
      "Write(config/**/*.key)",
      "Write(**/*password*)",
      "Write(**/*secret*)",
      "WebFetch",
      "WebSearch",
      "Bash(rm:*)"
    ]
  }
}

What’s Blocked

  • Python file modifications: Cannot create, edit, or update any .py files
  • Core file modifications: Cannot edit the scrapai entry script or .claude/* settings
  • Sensitive files: Cannot write .env, secrets, API keys, passwords
  • Web access: WebFetch and WebSearch are blocked (agent works with local files only)
  • Destructive commands: Cannot run rm commands

What’s Allowed

  • Reading files: Can read any file (Python, JSON, HTML, sections.md, etc.)
  • JSON operations: Can create, read, and edit JSON configs
  • CLI commands: Can run ./scrapai commands, source, database clients (sqlite3, psql), and xvfb-run
  • File operations: Read, Write, Edit, Update on allowed patterns
  • File search: Can use Glob and Grep to explore the codebase
  • Git operations: Can run safe git commands (status, diff, log, commit, push)
The agent can still read Python files to understand how the framework works, but it cannot modify them. This is intentional—understanding the code helps it generate better configs.
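As a concrete sketch, a spider config the agent writes might look like this (the exact schema is defined by the framework; the field names here are illustrative assumptions, not the documented format):

```json
{
  "name": "example_com",
  "project": "news",
  "start_urls": ["https://example.com"],
  "url_patterns": ["/[year]/[month]/[slug]/"]
}
```

Because the config is plain JSON, the agent can create and edit it freely while the Python code that consumes it stays locked.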

How the Agent Works

The agent reads CLAUDE.md which contains:
  1. Identity and purpose: “You are ScrapAI, a web scraping assistant”
  2. 4-phase workflow: Analysis → Rules → Import → Test (detailed in workflow documentation)
  3. Critical rules: Always use --project, never skip phases, run commands one at a time
  4. CLI reference: Complete command documentation with examples
  5. Settings reference: Generic extractors, custom selectors, Cloudflare, Playwright

Context Management

The full agent instructions fit in ~5k tokens. Additional docs (Cloudflare, proxies, callbacks, etc.) are loaded only when needed, not upfront. Most of the context window goes to actual site analysis, not reading a manual. When the agent needs specialized knowledge:
  • Cloudflare bypass → reads docs/cloudflare.md
  • Custom callbacks → reads docs/callbacks.md
  • Queue operations → reads docs/queue.md
This keeps context usage low and scraping analysis high.

Allowed Tools

The agent has access to these tools:
  • ./scrapai inspect <url> — fetch and save HTML
  • ./scrapai analyze <html> — analyze HTML structure, test selectors
  • ./scrapai extract-urls --file <html> — extract URLs from saved HTML
  • ./scrapai spiders import <json> — import/update spider config
  • ./scrapai crawl <name> — run test or production crawl
  • ./scrapai show <name> — view scraped items
  • ./scrapai export <name> — export data (CSV, JSON, JSONL, Parquet)
  • ./scrapai queue add/list/next/complete — queue operations
  • ./scrapai db migrate/stats/query — database operations
  • Read: Read any file (Python, JSON, HTML, etc.)
  • Write: Create new files (JSON, Markdown, text files)
  • Edit: Modify existing files (JSON, Markdown, text files)
  • Glob: Search for files by pattern
  • Grep: Search file contents
Blocked: writing or editing .py files
Bash commands:
  • Git operations (status, diff, log, commit, push)
  • npm, docker, and other system commands
Blocked: fetch, curl, wget (use ./scrapai inspect), grep/rg/awk/sed (use the Grep tool), cat/head/tail (use the Read tool), find/ls (use the Glob tool), mkdir (directories are auto-created), python/python3 (use ./scrapai analyze)
For batch processing, the agent can spawn parallel subagents (max 5 at a time). Each subagent processes one website through all 4 phases sequentially.
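For orientation, a single-site run maps onto the CLI commands listed above roughly like this (the URLs and filenames are placeholders, not output from a real run):

```shell
# Phase 1: fetch the homepage and analyze its structure
./scrapai inspect https://example.com
./scrapai analyze homepage.html
./scrapai extract-urls --file homepage.html

# Phase 2: test extractors on a sample article
./scrapai inspect https://example.com/2024/01/sample-article/
./scrapai analyze article.html

# Phase 3-4: write configs as JSON, then import and test
./scrapai spiders import test_spider.json --project news
./scrapai crawl example_com --project news
./scrapai show example_com
```

Every step is either a read or a CLI call, which is why the workflow fits inside the permission rules without any Python edits.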

Example Workflows

Single Website

You: "Add https://arstechnica.com to my tech project"

Agent:
I'll analyze Ars Technica and create a spider for your tech project.

Phase 1: Analyzing site structure...
✓ Inspected homepage at https://arstechnica.com
✓ Extracted 127 URLs
✓ Identified sections:
  - Articles: /[year]/[month]/[slug]/
  - Features: /features/[year]/[month]/[slug]/
  - Reviews: /reviews/[year]/[month]/[slug]/
✓ Documented URL patterns in sections.md

Phase 2: Testing extractors...
✓ Inspected sample article
✓ Generic extractors work well (newspaper + trafilatura)
✓ All fields extracted: title, content, author, date

Phase 3: Creating spider config...
✓ Created test_spider.json (5 sample URLs)
✓ Created final_spider.json (full config)

Phase 4: Testing and importing...
✓ Imported test spider
✓ Ran test crawl: 5/5 articles extracted successfully
✓ Verified output: all fields present and accurate
✓ Imported final spider to database

Spider 'arstechnica_com' is ready for production.

Run production crawl with:
./scrapai crawl arstechnica_com --project tech

Batch Processing (Parallel)

You: "Process all 12 sites in queue for my news project"

Agent:
I'll process all 12 websites in the queue in parallel batches.

Batch 1 (5 websites):
[Spawns 5 Task agents, each processing one site through Phase 1-4]

Completed:
✓ techcrunch.com → spider created
✓ theverge.com → spider created
✓ wired.com → spider created
✓ arstechnica.com → spider created
✗ engadget.com → failed (Cloudflare challenge, needs --browser flag)

Batch 2 (5 websites):
[Continues...]

Batch 3 (2 websites):
[Completes remaining sites]

Final results:
✓ 11 spiders created successfully
✗ 1 failed (engadget.com - retry with Cloudflare enabled)

All successful spiders are ready for production crawls.

Fixing a Broken Spider

You: "The BBC spider is broken. Fix it."

Agent:
I'll re-analyze the BBC site and update the spider.

✓ Inspected current BBC article page
✓ Detected structure change: new CSS classes for title and content
✓ Updated custom selectors:
  - title: h1.article-headline → h1[data-testid="headline"]
  - content: div.article-body → article[data-component="text-block"]
✓ Tested on 5 recent articles: all fields extracted correctly
✓ Imported updated config (same spider name, auto-updates)

Spider 'bbc_co_uk' has been updated and verified.

Common Questions

Can the agent change the framework's Scrapy settings?
No. The agent cannot edit scrapy_settings.py or any Python files. It can only configure spider-specific settings in JSON configs (e.g., DOWNLOAD_DELAY, CONCURRENT_REQUESTS, EXTRACTOR_ORDER).
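For example, spider-level overrides can live in the JSON config rather than in Python (a sketch only; the exact key names and placement are assumptions based on the settings named above):

```json
{
  "name": "example_com",
  "settings": {
    "DOWNLOAD_DELAY": 2,
    "CONCURRENT_REQUESTS": 4,
    "EXTRACTOR_ORDER": ["newspaper", "trafilatura"]
  }
}
```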
What if I find a bug in the framework code?
You need to fix it manually. The agent can identify the bug and suggest a fix, but it cannot apply the fix itself due to permission rules. This is intentional—framework code is privileged and should require human review.
Can the agent create new extractor types?
No. The agent works with the existing extractors (newspaper, trafilatura, playwright, custom CSS selectors) and custom callbacks. If you need a new extractor type, you'll need to add it manually in Python.
What if the agent misreads the site structure?
You can guide it by providing more context or correcting its assumptions. For example:
You: "The blog section is at /articles/, not /blog/. Update the spider."
The agent will re-read its analysis notes, correct the URL pattern, and regenerate the config.
Can I edit the generated configs by hand?
Yes. You can hand-edit the JSON configs at any time. The agent-generated configs are starting points, not final products. After editing, run:
./scrapai spiders import your_config.json --project yourproject

Troubleshooting

Agent tries to edit Python files

If you see errors like “Permission denied: cannot edit *.py”, the permission rules are working correctly. The agent should recognize this and work within the constraints. If it keeps trying, remind it:
You: "You cannot edit Python files. Use the CLI commands only."

Agent skips phases

The workflow is sequential: Phase 1 → 2 → 3 → 4. If the agent tries to skip steps, remind it:
You: "Complete all 4 phases. Don't skip Phase 2."

Agent runs commands too quickly

The agent should run commands one at a time and read the output before proceeding. If it chains commands with &&, remind it:
You: "Run commands one at a time. Read the output before the next command."

Permission rules not working

If you’re using Claude Code and permission rules aren’t enforced:
  1. Check that ./scrapai setup completed successfully
  2. Verify .claude/settings.local.json exists with the permission rules
  3. Restart Claude Code
If you’re using a different agent, remember that only Claude Code enforces permissions. Other agents receive instructions but can ignore them.

Next Steps