Security Model
ScrapAI’s approach: the agent writes config, not code. JSON configs are validated through Pydantic before import, and at runtime Scrapy executes deterministically with no AI in the loop.

Threat model:
- Malicious user input: URLs, spider names, settings
- AI agent hallucination: Generates bad configs or tries to modify framework code
- Prompt injection from scraped pages: Untrusted web content influencing agent behavior
- SSRF attacks: Crawling internal/private network resources
- SQL injection: Through spider names or settings
Input Validation
All spider configs go through strict Pydantic validation before touching the database or crawler.

Spider Config Schema
- extra="forbid": Unknown fields are rejected (prevents injection of arbitrary data)
- Type enforcement: Strings must be strings, lists must be lists, etc.
- Required fields: Missing data causes validation error
- Field length limits: Prevents excessively large inputs
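A minimal sketch of such a schema, assuming illustrative field names and limits (the actual ScrapAI schema may differ):

```python
from pydantic import BaseModel, ConfigDict, Field

class SpiderConfig(BaseModel):
    """Illustrative spider config schema; field names and limits are assumptions."""
    model_config = ConfigDict(extra="forbid")  # unknown fields raise a ValidationError

    name: str = Field(min_length=1, max_length=64)
    start_urls: list[str] = Field(min_length=1, max_length=100)
    allowed_domains: list[str] = Field(default_factory=list, max_length=50)
```

With extra="forbid", a config containing an unexpected key fails validation outright instead of being silently stored.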
Spider Name Validation
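A name check of this kind is typically a strict character whitelist; the exact pattern below is an assumption, not ScrapAI's actual rule:

```python
import re

# Assumed pattern: lowercase identifier, max 64 chars
SPIDER_NAME_RE = re.compile(r"^[a-z][a-z0-9_]{0,63}$")

def validate_spider_name(name: str) -> str:
    """Reject names containing SQL metacharacters, spaces, or path separators."""
    if not SPIDER_NAME_RE.fullmatch(name):
        raise ValueError(f"invalid spider name: {name!r}")
    return name
```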
SSRF Protection
Server-Side Request Forgery (SSRF) attacks trick the crawler into accessing internal resources.

URL Scheme Validation
Only http:// and https:// URLs are accepted; file://, ftp://, gopher://, dict://, ldap://, and custom schemes are rejected.
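A scheme check along these lines can be sketched with the standard library:

```python
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"http", "https"}

def check_scheme(url: str) -> None:
    """Reject any URL whose scheme is not plain HTTP(S)."""
    scheme = urlparse(url).scheme.lower()
    if scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"blocked URL scheme: {scheme!r}")
```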
Localhost Protection
Requests to localhost, 0.0.0.0, 127.0.0.1, and ::1 (IPv6 loopback) are blocked.
Private IP Protection
URL Length Limit
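The host and length checks above might be combined as follows; the 2048-character limit and the exact host list are assumptions for illustration:

```python
import ipaddress
from urllib.parse import urlparse

MAX_URL_LENGTH = 2048  # assumed limit
BLOCKED_HOSTS = {"localhost", "0.0.0.0", "127.0.0.1", "::1"}

def check_ssrf(url: str) -> None:
    """Reject oversized URLs, loopback hosts, and private/link-local IPs."""
    if len(url) > MAX_URL_LENGTH:
        raise ValueError("URL exceeds length limit")
    host = (urlparse(url).hostname or "").lower()
    if host in BLOCKED_HOSTS:
        raise ValueError(f"blocked host: {host!r}")
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return  # a hostname, not an IP literal; DNS-time checks could go here
    if ip.is_private or ip.is_loopback or ip.is_link_local:
        raise ValueError(f"private IP blocked: {host!r}")
```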
Settings Validation
Extractor Whitelist
Concurrency and Delay Limits
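A sketch of extractor whitelisting plus concurrency/delay bounds; the extractor names, defaults, and limits are assumptions:

```python
ALLOWED_EXTRACTORS = {"css", "xpath", "regex"}  # assumed extractor names
MAX_CONCURRENT_REQUESTS = 32                     # assumed upper bound

def validate_settings(settings: dict) -> dict:
    """Reject unknown extractors and out-of-range concurrency/delay values."""
    extractor = settings.get("extractor", "css")
    if extractor not in ALLOWED_EXTRACTORS:
        raise ValueError(f"unknown extractor: {extractor!r}")
    concurrency = int(settings.get("concurrent_requests", 8))
    if not 1 <= concurrency <= MAX_CONCURRENT_REQUESTS:
        raise ValueError("concurrent_requests out of range")
    delay = float(settings.get("download_delay", 1.0))
    if delay < 0:
        raise ValueError("download_delay must be non-negative")
    return settings
```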
Processor Whitelist
Callback Validation
Reserved Names Protection
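Callback and reserved-name checks of the kind named above might look like this; the name pattern and reserved set (Scrapy's own method names) are assumptions:

```python
import re

RESERVED_NAMES = {"parse", "start_requests", "closed"}  # assumed: Scrapy internals
CALLBACK_RE = re.compile(r"^[a-z_][a-z0-9_]*$")

def validate_callback(name: str) -> str:
    """Allow only plain Python identifiers that don't shadow framework methods."""
    if not CALLBACK_RE.fullmatch(name):
        raise ValueError(f"invalid callback name: {name!r}")
    if name in RESERVED_NAMES:
        raise ValueError(f"callback name is reserved: {name!r}")
    return name
```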
Cross-Validation
SQL Injection Protection
All database queries use SQLAlchemy ORM with parameterized bindings.

Safe Query Pattern
CLI Database Access
The db query command validates table names against a whitelist; UPDATE and DELETE statements require row-count confirmation.
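Because a table name is an identifier, it cannot be bound as a SQL parameter; whitelisting is the standard defense. A sketch, with an assumed table list:

```python
ALLOWED_TABLES = {"spiders", "items", "jobs"}  # assumed whitelist

def validate_table_name(table: str) -> str:
    """Identifiers can't be parameterized, so only known table names pass."""
    if table not in ALLOWED_TABLES:
        raise ValueError(f"table not in whitelist: {table!r}")
    return table
```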
Agent Safety
ScrapAI’s approach: the agent writes config, not code.

Permission Rules (Claude Code)
Configured via ./scrapai setup in .claude/settings.local.json.

The agent can:
- Run ./scrapai commands, database clients, and allowed shell commands
- Read any file (Python, JSON, configs, documentation)
- Write, edit, and update JSON configs and analysis files
- Use file search tools (Glob, Grep)

The agent cannot:
- Modify Python framework code
- Delete files
- Run privileged commands
- Execute arbitrary scripts
Validation Before Import
All configs must pass Pydantic validation before import.

Deterministic Runtime
At runtime, no AI models are called, no LLM inference runs on scraped content, and Scrapy executes deterministically based on the validated config.

Prompt Injection Resistance
Scraped pages cannot influence the agent:
- AI is not in the crawl loop: Scrapy runs without LLM inference
- Configs are static: Once imported, extraction rules don’t change based on page content
- Validation is deterministic: Pydantic schemas don’t depend on context
Security Checklist
Before deploying ScrapAI:
- Review all spider configs for localhost/private IPs
- Confirm extra="forbid" in Pydantic schemas
- Verify SQLAlchemy ORM usage (no raw SQL)
- Test SSRF protection with internal hostnames
- Configure Claude Code permissions (if using AI agent)
- Set up database backups
- Enable SSL/TLS for PostgreSQL connections
- Rotate S3 credentials if using S3 upload
- Monitor logs for validation errors
- Set up alerts for repeated validation failures (possible attack)
Reporting Vulnerabilities
Please DO NOT report security vulnerabilities through public GitHub issues. Email us directly: dev@discourselab.ai

Include:
- Type of vulnerability (SQL injection, command injection, SSRF, etc.)
- Affected component (CLI command, spider, handler)
- Steps to reproduce
- Impact assessment
In Scope
- Injection vulnerabilities (SQL, command, code)
- Path traversal / directory access
- Remote code execution
- Sensitive data exposure
- Server-side request forgery (SSRF)
- Insecure defaults
Out of Scope
- Web scraping ethics (scraping public websites is not a vulnerability)
- Cloudflare bypass techniques (core feature, not a bug)
- Robots.txt violations (user responsibility)
- Outdated dependencies (unless actively exploitable)
See Also
- Migration: see validation in action during config import
- Custom Callbacks: write safe extraction logic with validated processors