CLAUDE.md (~5k tokens), and ./scrapai setup automatically configures permission rules that block the agent from modifying framework code.
Why Claude Code?
Permission enforcement at the tool level. Claude Code is the only agent that can enforce hard blocks on file operations and shell commands. When configured, it cannot write or edit Python files, even if it wants to. Other agents (OpenCode, Cursor, Windsurf) receive the same instructions but lack enforcement. They can choose to ignore the rules. Claude Code can't.

Setup
Install ScrapAI
./scrapai setup creates the virtual environment, installs dependencies (including browser drivers), initializes SQLite, and configures Claude Code permissions.

Launch Claude Code
Claude Code reads CLAUDE.md on startup and understands the complete workflow.

Permission Rules
When you run ./scrapai setup, it creates .claude/settings.local.json with hard enforcement of allow/deny lists:
.claude/settings.local.json
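A sketch of what this file might contain, using Claude Code's Tool(specifier) permission syntax. The exact rules that ./scrapai setup writes may differ; the patterns below are illustrative:

```json
{
  "permissions": {
    "allow": [
      "Bash(./scrapai:*)",
      "Bash(git status:*)",
      "Read(**)",
      "Edit(configs/**/*.json)",
      "Write(configs/**/*.json)"
    ],
    "deny": [
      "Edit(**/*.py)",
      "Write(**/*.py)",
      "Write(.env*)",
      "WebFetch",
      "WebSearch",
      "Bash(rm:*)"
    ]
  }
}
```

Deny rules take precedence over allow rules, which is what makes the Python-file block a hard stop rather than a suggestion.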
What’s Blocked
- Python file modifications: Cannot create, edit, or update any .py files
- Core file modifications: Cannot edit the scrapai entry script or .claude/* settings
- Sensitive files: Cannot write .env, secrets, API keys, passwords
- Web access: WebFetch and WebSearch are blocked (agent works with local files only)
- Destructive commands: Cannot run rm commands
What’s Allowed
- Reading files: Can read any file (Python, JSON, HTML, sections.md, etc.)
- JSON operations: Can create, read, and edit JSON configs
- CLI commands: Can run ./scrapai commands, source, database clients (sqlite3, psql), and xvfb-run
- File operations: Read, Write, Edit, Update, Glob, Grep on allowed patterns
- File search: Can use glob and grep to explore the codebase
- Git operations: Can run safe git commands (status, diff, log, commit, push)
The agent can still read Python files to understand how the framework works, but it cannot modify them. This is intentional—understanding the code helps it generate better configs.
How the Agent Works
The agent reads CLAUDE.md, which contains:
- Identity and purpose: “You are ScrapAI, a web scraping assistant”
- 4-phase workflow: Analysis → Rules → Import → Test (detailed in workflow documentation)
- Critical rules: Always use --project, never skip phases, run commands one at a time
- CLI reference: Complete command documentation with examples
- Settings reference: Generic extractors, custom selectors, Cloudflare, Playwright
Context Management
The full agent instructions fit in ~5k tokens. Additional docs (Cloudflare, proxies, callbacks, etc.) are loaded only when needed, not upfront. Most of the context window goes to actual site analysis, not reading a manual. When the agent needs specialized knowledge:

- Cloudflare bypass → reads docs/cloudflare.md
- Custom callbacks → reads docs/callbacks.md
- Queue operations → reads docs/queue.md
Allowed Tools
The agent has access to these tools:

./scrapai CLI (all commands)
- ./scrapai inspect <url> — fetch and save HTML
- ./scrapai analyze <html> — analyze HTML structure, test selectors
- ./scrapai extract-urls --file <html> — extract URLs from saved HTML
- ./scrapai spiders import <json> — import/update spider config
- ./scrapai crawl <name> — run test or production crawl
- ./scrapai show <name> — view scraped items
- ./scrapai export <name> — export data (CSV, JSON, JSONL, Parquet)
- ./scrapai queue add/list/next/complete — queue operations
- ./scrapai db migrate/stats/query — database operations
File operations
- Read: Read any file (Python, JSON, HTML, etc.)
- Write: Create new files (JSON, Markdown, text files)
- Edit: Modify existing files (JSON, Markdown, text files)
- Glob: Search for files by pattern
- Grep: Search file contents
Blocked: .py files

Bash (limited)
- Allowed: git operations (status, diff, log, commit, push)
- Blocked: npm, docker, system commands
- Redirected: fetch, curl, wget (use ./scrapai inspect); grep/rg/awk/sed (use Grep tool); cat/head/tail (use Read tool); find/ls (use Glob tool); mkdir (auto-created); python/python3 (use ./scrapai analyze)
Task (parallel subagents)
Spawn parallel subagents for batch processing (max 5 at a time). Each subagent processes one website through all 4 phases sequentially.
Example Workflows
Single Website
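A hypothetical end-to-end session for one site, using the commands from the CLI reference above. The URL, file paths, spider name, and flag placement are all invented for illustration:

```shell
# Phase 1 (Analysis): fetch the page, then inspect its structure
./scrapai inspect https://example.com/news
./scrapai analyze example-news.html

# Phase 3 (Import): after writing rules (Phase 2) into a JSON config
./scrapai spiders import example.json --project news

# Phase 4 (Test): run a test crawl and check the results
./scrapai crawl example --project news
./scrapai show example --project news
```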
Batch Processing (Parallel)
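One way this might look, using the queue commands from the CLI reference, with the agent spawning up to 5 subagents to drain the queue. URLs and arguments are illustrative:

```shell
# Fill the queue with target sites
./scrapai queue add https://site-a.example
./scrapai queue add https://site-b.example
./scrapai queue list

# Each subagent pulls the next site, runs phases 1-4 for it,
# then marks it complete before pulling another
./scrapai queue next
./scrapai queue complete <id>
```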
Fixing a Broken Spider
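A hypothetical repair session, assuming the spider broke because the site changed its markup (names and paths are invented):

```shell
# Reproduce the failure
./scrapai crawl example --project news

# Re-fetch and re-analyze the page to find the new structure
./scrapai inspect https://example.com/news
./scrapai analyze example-news.html

# Fix the selectors in the JSON config, then re-import and re-test
./scrapai spiders import example.json --project news
./scrapai crawl example --project news
```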
Common Questions
Can the agent modify Scrapy settings?
No. The agent cannot edit scrapy_settings.py or any Python files. It can only configure spider-specific settings in JSON configs (e.g., DOWNLOAD_DELAY, CONCURRENT_REQUESTS, EXTRACTOR_ORDER).
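As a sketch, a spider config carrying those settings might look like this; the surrounding schema is a guess, and only the three setting names come from this answer:

```json
{
  "name": "example_news",
  "settings": {
    "DOWNLOAD_DELAY": 2.0,
    "CONCURRENT_REQUESTS": 4,
    "EXTRACTOR_ORDER": ["trafilatura", "newspaper"]
  }
}
```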
What if I want the agent to fix a bug in the framework?
You need to do that manually. The agent can identify the bug and suggest a fix, but it cannot apply the fix itself due to permission rules. This is intentional: framework code is privileged and should require human review.
Can the agent create custom extractor classes?
No. The agent works with the existing extractors (newspaper, trafilatura, playwright, custom CSS selectors) and custom callbacks. If you need a new extractor type, you’ll need to add it manually in Python.
What if the agent gets stuck or confused?
You can guide it by providing more context or correcting its assumptions. For example, if it has inferred the wrong article URL pattern, tell it the correct one. The agent will re-read its analysis notes, correct the URL pattern, and regenerate the config.
Can I override the agent's decisions?
Yes. You can hand-edit the JSON configs at any time. The agent-generated configs are starting points, not final products. After editing, re-import the config with ./scrapai spiders import <json> so your changes take effect.
Troubleshooting
Agent tries to edit Python files
If you see errors like “Permission denied: cannot edit *.py”, the permission rules are working correctly. The agent should recognize this and work within the constraints. If it keeps trying, remind it that framework code is off-limits and that all changes belong in JSON configs.

Agent skips phases
The workflow is sequential: Phase 1 → 2 → 3 → 4. If the agent tries to skip steps, remind it to finish each phase before starting the next.

Agent runs commands too quickly
The agent should run commands one at a time and read the output before proceeding. If it chains commands with &&, remind it to run them individually.
Permission rules not working
If you’re using Claude Code and permission rules aren’t enforced:

- Check that ./scrapai setup completed successfully
- Verify .claude/settings.local.json exists with the permission rules
- Restart Claude Code