Security Model
From README.md:213-234:

ScrapAI’s approach: the agent writes config, not code.

- With Claude Code, permission rules block Write(**/*.py) and Edit(**/*.py) at the tool level
- The agent interacts only through a defined CLI
- JSON configs are validated through Pydantic before import
- At runtime, Scrapy executes deterministically with no AI in the loop

Threat model:

- Malicious user input: URLs, spider names, settings
- AI agent hallucination: generates bad configs or tries to modify framework code
- Prompt injection from scraped pages: untrusted web content influencing agent behavior
- SSRF attacks: crawling internal/private network resources
- SQL injection: through spider names or settings
Input Validation
All spider configs go through strict Pydantic validation before touching the database or crawler. From core/schemas.py:1-402:
Spider Config Schema
- extra="forbid": Unknown fields are rejected (prevents injection of arbitrary data)
- Type enforcement: Strings must be strings, lists must be lists, etc.
- Required fields: Missing data causes validation error
- Field length limits: Prevents excessively large inputs
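In Pydantic v2 the pattern looks roughly like this. A minimal sketch — the field names are illustrative, not the actual schema from core/schemas.py:

```python
from pydantic import BaseModel, ConfigDict, Field, ValidationError

class SpiderConfig(BaseModel):
    """Illustrative subset of a spider config schema (not the real one)."""
    model_config = ConfigDict(extra="forbid")  # unknown fields are rejected

    name: str = Field(min_length=1, max_length=64)  # required, length-limited
    start_urls: list[str] = Field(min_length=1)     # must be a non-empty list

# A config smuggling in an unknown field is rejected outright:
try:
    SpiderConfig(name="demo", start_urls=["https://example.com"], evil="x")
except ValidationError as e:
    print("rejected:", e.errors()[0]["type"])  # extra_forbidden
```

Because `extra="forbid"` fails loudly instead of silently dropping unknown keys, a hallucinated or injected field surfaces as a validation error rather than reaching the database.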
Spider Name Validation
From core/schemas.py:233-245:
- SQL injection: No quotes, semicolons, or SQL keywords
- Path traversal: No slashes or dots
- Command injection: No spaces, pipes, or shell metacharacters
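A single restrictive character class covers all three attack surfaces at once. This is a sketch of the approach, not the exact rule from core/schemas.py:233-245:

```python
import re

# Hypothetical validator mirroring the rules above: lowercase alphanumerics
# and underscores only, which rules out quotes/semicolons (SQL), slashes and
# dots (path traversal), and spaces/pipes (shell metacharacters) in one stroke.
SPIDER_NAME_RE = re.compile(r"^[a-z][a-z0-9_]{0,63}$")

def validate_spider_name(name: str) -> str:
    if not SPIDER_NAME_RE.fullmatch(name):
        raise ValueError(f"invalid spider name: {name!r}")
    return name

validate_spider_name("news_site")                  # accepted
# validate_spider_name("x; DROP TABLE spiders")    # would raise ValueError
```

Allowlisting characters is safer than blocklisting metacharacters: anything not explicitly permitted fails, including payloads nobody anticipated.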
SSRF Protection
Server-Side Request Forgery (SSRF) attacks trick the crawler into accessing internal resources.

URL Scheme Validation
From core/schemas.py:247-271:
Blocked schemes:

- file://: Local filesystem access
- ftp://: FTP servers
- gopher://: Gopher protocol (used in SSRF attacks)
- dict://: Dictionary server protocol
- ldap://: LDAP directory access
- Custom schemes that could invoke handlers
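The simplest robust implementation is an allowlist rather than enumerating bad schemes — a sketch of that pattern (not the literal code from core/schemas.py):

```python
from urllib.parse import urlparse

# Allowlist: only http/https survive. file://, ftp://, gopher://, dict://,
# ldap://, and any custom scheme are rejected without needing to be named.
ALLOWED_SCHEMES = {"http", "https"}

def validate_scheme(url: str) -> str:
    scheme = urlparse(url).scheme.lower()
    if scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"disallowed URL scheme: {scheme!r}")
    return url

validate_scheme("https://example.com")     # accepted
# validate_scheme("file:///etc/passwd")    # would raise ValueError
```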
Localhost Protection
From core/schemas.py:273-284:
Blocked hostnames and addresses:

- localhost
- 0.0.0.0
- 127.0.0.1
- ::1 (IPv6 loopback)
Private IP Protection
From core/schemas.py:285-319:
- 10.0.0.0/8: Private network (RFC 1918)
- 172.16.0.0/12: Private network (RFC 1918)
- 192.168.0.0/16: Private network (RFC 1918)
- 127.0.0.0/8: Loopback
- 169.254.0.0/16: Link-local (APIPA)
- 224.0.0.0/4: Multicast
- 240.0.0.0/4: Reserved
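Python's `ipaddress` module already classifies every range listed above, so both checks can be sketched together. This is an illustrative version of the pattern, not the code from core/schemas.py:273-319 (the real implementation may also resolve hostnames to IPs before checking, which catches DNS-based tricks):

```python
import ipaddress
from urllib.parse import urlparse

BLOCKED_HOSTS = {"localhost", "0.0.0.0", "127.0.0.1", "::1"}

def check_host(url: str) -> str:
    """Reject localhost names and literal IPs in private, loopback,
    link-local, multicast, or reserved ranges."""
    host = (urlparse(url).hostname or "").lower()
    if host in BLOCKED_HOSTS:
        raise ValueError(f"blocked host: {host}")
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return url  # a hostname, not a literal IP; resolve-and-check omitted here
    if (ip.is_private or ip.is_loopback or ip.is_link_local
            or ip.is_multicast or ip.is_reserved):
        raise ValueError(f"blocked IP range: {host}")
    return url

check_host("https://example.com/")   # accepted
# check_host("http://10.0.0.5/")     # would raise ValueError (RFC 1918)
```

`is_private` covers 10.0.0.0/8, 172.16.0.0/12, and 192.168.0.0/16; `is_loopback`, `is_link_local`, `is_multicast`, and `is_reserved` cover the remaining ranges in the list above.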
URL Length Limit
From core/schemas.py:320-322:

Excessively long URLs are rejected, guarding against:
- Buffer overflow attempts
- Denial of service via memory exhaustion
- Excessively large database fields
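The check itself is a one-liner applied before any parsing or storage. The cap value here is an assumption, not the one set in core/schemas.py:

```python
MAX_URL_LENGTH = 2048  # illustrative cap

def validate_url_length(url: str) -> str:
    if len(url) > MAX_URL_LENGTH:
        raise ValueError(f"URL exceeds {MAX_URL_LENGTH} characters")
    return url
```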
Settings Validation
Extractor Whitelist
From core/schemas.py:105-116:
Concurrency Limits
From core/schemas.py:97:
Delay Limits
From core/schemas.py:98:
Processor Whitelist
From core/schemas.py:136-156:
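All four settings checks share one shape: whitelists for names, bounded ranges for numbers. A sketch of the pattern in Pydantic v2 — the whitelist entries, field names, and bounds here are illustrative, not the real values from core/schemas.py:97-156:

```python
from pydantic import BaseModel, ConfigDict, Field, ValidationError, field_validator

# Illustrative whitelists; the real ones live in core/schemas.py.
ALLOWED_EXTRACTORS = {"css", "xpath"}
ALLOWED_PROCESSORS = {"strip", "join", "first"}

class SpiderSettings(BaseModel):
    model_config = ConfigDict(extra="forbid")

    extractor: str = "css"
    processors: list[str] = []
    concurrent_requests: int = Field(default=8, ge=1, le=32)     # bounded concurrency
    download_delay: float = Field(default=1.0, ge=0.0, le=60.0)  # bounded delay

    @field_validator("extractor")
    @classmethod
    def _check_extractor(cls, v: str) -> str:
        if v not in ALLOWED_EXTRACTORS:
            raise ValueError(f"unknown extractor: {v!r}")
        return v

    @field_validator("processors")
    @classmethod
    def _check_processors(cls, v: list[str]) -> list[str]:
        for p in v:
            if p not in ALLOWED_PROCESSORS:
                raise ValueError(f"unknown processor: {p!r}")
        return v
```

Bounding concurrency and delay keeps a bad config from turning the crawler into a denial-of-service tool; whitelisting extractors and processors keeps configs from naming arbitrary callables.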
Callback Validation
Reserved Names Protection
From core/schemas.py:326-358:

Callback names are checked against a reserved list, preventing:

- Overwriting built-in Scrapy methods
- Python injection via eval(callback_name)
- Namespace collisions
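A sketch of such a check — the reserved set below is an illustrative guess at Scrapy's built-in method names, not the actual list from core/schemas.py:326-358:

```python
# Illustrative reserved set (hypothetical; not the real list).
RESERVED_CALLBACK_NAMES = {
    "parse", "start_requests", "closed", "from_crawler", "__init__",
}

def validate_callback_name(name: str) -> str:
    if not name.isidentifier():
        # Also blocks payloads like "eval('x')" that aren't plain names.
        raise ValueError(f"not a valid identifier: {name!r}")
    if name in RESERVED_CALLBACK_NAMES or name.startswith("__"):
        raise ValueError(f"reserved callback name: {name!r}")
    return name

validate_callback_name("parse_article")   # accepted
# validate_callback_name("parse")         # would raise ValueError
```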
Cross-Validation
From core/schemas.py:359-376:
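Cross-validation means checking fields against each other, which single-field validators can't do. The exact cross-checks in core/schemas.py:359-376 aren't shown above; a plausible sketch of the mechanism (a Pydantic model-level validator, with hypothetical fields) is:

```python
from urllib.parse import urlparse
from pydantic import BaseModel, ValidationError, model_validator

class CrawlScope(BaseModel):
    """Hypothetical example: a start URL outside allowed_domains can only be
    caught by a validator that sees both fields at once."""
    start_urls: list[str]
    allowed_domains: list[str]

    @model_validator(mode="after")
    def _urls_within_domains(self) -> "CrawlScope":
        for url in self.start_urls:
            host = urlparse(url).hostname or ""
            if not any(host == d or host.endswith("." + d)
                       for d in self.allowed_domains):
                raise ValueError(f"{url} is outside allowed_domains")
        return self
```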
SQL Injection Protection
All database queries use SQLAlchemy ORM with parameterized bindings.

Safe Query Pattern
CLI Database Access
From README.md:219:
db query validates table names against a whitelist; UPDATE/DELETE require row count confirmation
Whitelist validation:
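Table names can't be bound as SQL parameters, which is why the CLI falls back to a whitelist check before building the query. A sketch — the table names below are assumptions, not the project's actual whitelist:

```python
ALLOWED_TABLES = {"spiders", "items", "crawl_stats"}  # illustrative whitelist

def validate_table(table: str) -> str:
    """Reject any table name not in the fixed whitelist. The real CLI would
    additionally prompt for row-count confirmation before UPDATE/DELETE."""
    if table not in ALLOWED_TABLES:
        raise ValueError(f"unknown table: {table!r}")
    return table

validate_table("spiders")                        # accepted
# validate_table("spiders; DROP TABLE items")    # would raise ValueError
```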
Agent Safety
From README.md:222-234:
When you pair an AI agent with a scraping framework, the agent can potentially modify code, run arbitrary commands, and interact with untrusted web content. ScrapAI’s approach: the agent writes config, not code.
Permission Rules (Claude Code)
Configured via ./scrapai setup, which creates .claude/settings.local.json.
The agent can:

- Run ./scrapai commands, database clients, and allowed shell commands
- Read any file (Python, JSON, configs, documentation)
- Write, edit, and update JSON configs and analysis files
- Use file search tools (Glob, Grep)

The agent cannot:

- Modify Python framework code
- Delete files
- Run privileged commands
- Execute arbitrary scripts
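The rules the setup script actually writes aren't reproduced here, but a Claude Code settings file implementing the policy above would look roughly like this (the exact rule strings are illustrative):

```json
{
  "permissions": {
    "allow": [
      "Bash(./scrapai:*)",
      "Edit(configs/**/*.json)"
    ],
    "deny": [
      "Write(**/*.py)",
      "Edit(**/*.py)"
    ]
  }
}
```

Because the deny rules act at the tool level, the agent is blocked from writing Python files even if a prompt (or an injected page) asks it to.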
Validation Before Import
Even if an agent generates a malicious config, it must pass Pydantic validation before import.

Deterministic Runtime
From README.md:65:
An agent that writes JSON configs produces data, not code. That data goes through strict validation before it reaches the database. The worst case is a bad config that extracts wrong fields, caught in the test crawl and trivially fixable.

At runtime:
- No AI models are called
- No LLM inference on scraped content
- Scrapy executes deterministically based on validated config
- No dynamic code evaluation or eval()
Prompt Injection Resistance
Scraped pages cannot influence the agent because:

- AI is not in the crawl loop: Scrapy runs without LLM inference
- Configs are static: once imported, extraction rules don’t change based on page content
- Validation is deterministic: Pydantic schemas don’t depend on context

Even if a scraped page contains adversarial instructions:

- The content is extracted as data (a text field in JSONL)
- The AI agent never sees it during the crawl (only during initial analysis)
- Even if seen during analysis, it cannot bypass validation
Security Checklist
Before deploying ScrapAI:

- Review all spider configs for localhost/private IPs
- Confirm extra="forbid" in Pydantic schemas
- Verify SQLAlchemy ORM usage (no raw SQL)
- Test SSRF protection with internal hostnames
- Configure Claude Code permissions (if using AI agent)
- Set up database backups
- Enable SSL/TLS for PostgreSQL connections
- Rotate S3 credentials if using S3 upload
- Monitor logs for validation errors
- Set up alerts for repeated validation failures (possible attack)
Reporting Vulnerabilities
From SECURITY.md:1-44:
Please DO NOT report security vulnerabilities through public GitHub issues. Email us directly: dev@discourselab.ai

Include:

- Type of vulnerability (SQL injection, command injection, SSRF, etc.)
- Affected component (CLI command, spider, handler)
- Steps to reproduce
- Impact assessment

We’ll acknowledge within 72 hours and work with you on a fix.
In Scope
- Injection vulnerabilities (SQL, command, code)
- Path traversal / directory access
- Remote code execution
- Sensitive data exposure
- Server-side request forgery (SSRF)
- Insecure defaults
Out of Scope
- Web scraping ethics (scraping public websites is not a vulnerability)
- Cloudflare bypass techniques (core feature, not a bug)
- Robots.txt violations (user responsibility)
- Outdated dependencies (unless actively exploitable)