CLAUDE.md (~5k tokens), and ./scrapai setup automatically configures permission rules that block the agent from modifying framework code.
Why Claude Code?
Permission enforcement at the tool level. Claude Code is the only agent that can enforce hard blocks on file operations and shell commands. When configured, it cannot write or edit Python files, even if it wants to.

Sub-agent parallelization without context compaction. Claude Code can spawn sub-agents (via the Task tool) that work independently, so you can process multiple websites in one session without losing context. In our internal experiments, we processed 40 websites in a single Claude session before hitting conversation compaction.

Other agents (OpenCode, Cursor, Windsurf) receive the same instructions but lack enforcement: they can choose to ignore the rules. Claude Code can't.

Setup
Install ScrapAI
./scrapai setup creates the virtual environment, installs dependencies (including browser drivers), initializes SQLite, and configures Claude Code permissions.

Launch Claude Code
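Assuming Claude Code is installed and exposed as the `claude` CLI (an assumption; installing it is outside this doc), getting started is two commands:

```shell
./scrapai setup   # one-time: venv, dependencies, SQLite, permission rules
claude            # launch Claude Code from the repository root
```

Launching from the repository root matters: Claude Code picks up CLAUDE.md and .claude/settings.local.json from the current project directory.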
The agent reads CLAUDE.md on startup and understands the complete workflow.

Permission Rules
./scrapai setup creates .claude/settings.local.json with tool-level enforcement:
Blocked: Python file modifications, sensitive files (.env, secrets), web access (WebFetch/WebSearch), destructive commands (rm)
Allowed: Reading all files, JSON operations, CLI commands (./scrapai, sqlite3, psql), file search (Glob/Grep), safe git operations
The agent can read Python files to understand the framework but cannot modify them.
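Assuming Claude Code's standard permissions schema (allow/deny arrays of tool rules), the generated file looks roughly like the sketch below. The specific patterns are illustrative, not the literal rules ./scrapai setup writes:

```json
{
  "permissions": {
    "deny": [
      "Edit(**/*.py)",
      "Write(**/*.py)",
      "Read(.env)",
      "WebFetch",
      "WebSearch",
      "Bash(rm:*)"
    ],
    "allow": [
      "Read(**)",
      "Edit(**/*.json)",
      "Bash(./scrapai:*)",
      "Bash(sqlite3:*)",
      "Bash(git status)"
    ]
  }
}
```

Deny rules take precedence, which is what makes the Python-file block a hard guarantee rather than a suggestion.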
How the Agent Works
The agent reads CLAUDE.md, which contains:
- Identity and purpose: “You are ScrapAI, a web scraping assistant”
- 4-phase workflow: Analysis → Rules → Import → Test (detailed in workflow documentation)
- Critical rules: Always use --project, never skip phases, run commands one at a time
- CLI reference: Complete command documentation with examples
- Settings reference: Generic extractors, custom selectors, Cloudflare, Playwright
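Put together, a rough skeleton of the file (the headings and wording here are illustrative, not the literal contents):

```markdown
# ScrapAI
You are ScrapAI, a web scraping assistant.

## Workflow
Phase 1: Analysis → Phase 2: Rules → Phase 3: Import → Phase 4: Test

## Critical rules
- Always use --project
- Never skip phases; run commands one at a time

## CLI reference
./scrapai inspect <url> --project <name>
(full command documentation with examples)

## Settings reference
Generic extractors, custom selectors, Cloudflare, Playwright
```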
Context Management
The full agent instructions fit in ~5k tokens. Additional docs (Cloudflare, proxies, callbacks, etc.) are loaded only when needed, not upfront, so most of the context window goes to actual site analysis rather than reading a manual. When the agent needs specialized knowledge:
- Cloudflare bypass → reads docs/cloudflare.md
- Custom callbacks → reads docs/callbacks.md
- Queue operations → reads docs/queue.md
Allowed Tools
- ./scrapai CLI: All commands (inspect, analyze, crawl, spiders, queue, db, export)
- File operations: Read, Write, Edit (JSON/Markdown/text only, no .py files), Glob, Grep
- Bash: Git operations, npm, docker. Use CLI/tools instead of: curl/wget (use ./scrapai inspect), cat/grep/find (use Read/Grep/Glob)
- Task (sub-agents): Spawn parallel sub-agents for batch processing (max 5 at a time)
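For batch processing, the prompt you give the main agent can request the fan-out explicitly. An illustrative example (the wording and the file name sites.txt are not prescribed by ScrapAI):

```text
Process the 10 URLs in sites.txt. Spawn sub-agents in batches
of 5: each runs the full 4-phase workflow for one site, then
reports whether its spider imported and tested successfully.
```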
Example Workflows
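These workflows reduce to driving the ./scrapai CLI through the 4 phases. A hedged sketch of one full session (the URL, project name, config filename, and any flags other than --project are placeholders, not the framework's exact syntax):

```shell
# Phase 1 – Analysis: fetch and inspect the target site
./scrapai inspect https://example.com --project demo

# Phase 2 – Rules: the agent writes extraction rules
# into a JSON config (hand-editable at any time)

# Phase 3 – Import: register the spider from the config
./scrapai spiders import demo_config.json --project demo

# Phase 4 – Test: run a crawl and export the results
./scrapai crawl demo --project demo
./scrapai export demo --project demo
```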
Single website: point the agent at one URL and let it complete the 4-phase workflow (Analysis → Rules → Import → Test).

Common Questions
Can the agent modify Scrapy settings or fix framework bugs?
No. The agent cannot edit Python files. It configures spider-specific settings in JSON configs only; framework changes require manual edits.
What if the agent gets stuck?
Guide it, for example: "The blog section is at /articles/, not /blog/. Update the spider." The agent re-reads its notes and regenerates the config.
Can I override the agent's decisions?
Yes. Hand-edit JSON configs anytime, then:
./scrapai spiders import your_config.json --project yourproject

Troubleshooting
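What such a config contains depends on the framework's schema; as a hedged illustration only (every field name below is an assumption, not the real format), a hand-edited config might look like:

```json
{
  "name": "yourproject_blog",
  "start_urls": ["https://example.com/articles/"],
  "extractors": {
    "title": "h1::text",
    "body": "article .content"
  },
  "settings": {
    "playwright": false,
    "cloudflare_bypass": false
  }
}
```

Because configs are plain JSON, the permission rules let the agent (and you) edit them freely while the Python framework stays locked.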
- Agent tries to edit Python files: the permission rules are working correctly. Remind it: "You cannot edit Python files. Use CLI commands only."
- Agent skips phases: remind it: "Complete all 4 phases. Don't skip Phase 2."
- Permission rules not working: check that ./scrapai setup completed, verify that .claude/settings.local.json exists, and restart Claude Code. Remember: only Claude Code enforces permissions.
Next Steps
4-Phase Workflow
Learn the complete analysis → rules → import → test workflow
CLI Reference
Complete CLI command documentation