Security

scrapai implements defense-in-depth security with input validation, SSRF protection, and agent safety controls.

Security Model

scrapai’s approach: the agent writes config, not code. JSON configs are validated through Pydantic before import, and at runtime, Scrapy executes deterministically with no AI in the loop. Threat model:

Malicious user input: URLs, spider names, settings
AI agent hallucination: Generates bad configs or tries to modify framework code
Prompt injection from scraped pages: Untrusted web content influencing agent behavior
SSRF attacks: Crawling internal/private network resources
SQL injection: Through spider names or settings

Input Validation

All spider configs go through strict Pydantic validation before touching the database or crawler.

Spider Config Schema

class SpiderConfigSchema(BaseModel):
    model_config = ConfigDict(extra="forbid")  # Reject unknown fields
    
    name: str = Field(..., min_length=1, max_length=255)
    source_url: str = Field(..., min_length=1)
    allowed_domains: List[str] = Field(..., min_items=1)
    start_urls: List[str] = Field(..., min_items=1)
    rules: List[SpiderRuleSchema] = Field(default_factory=list)
    settings: SpiderSettingsSchema = Field(default_factory=SpiderSettingsSchema)
    callbacks: Optional[Dict[str, CallbackSchema]] = Field(default=None)

Key protections:

extra="forbid": Unknown fields are rejected (prevents injection of arbitrary data)
Type enforcement: Strings must be strings, lists must be lists, etc.
Required fields: Missing data causes validation error
Field length limits: Prevents excessively large inputs

Spider Name Validation

@field_validator("name")
@classmethod
def validate_name(cls, v):
    if not re.match(r"^[a-zA-Z0-9_-]+$", v):
        raise ValueError(
            f"Invalid spider name: {v}. "
            "Only alphanumeric characters, underscores, and hyphens allowed."
        )
    return v

Prevents: SQL injection, path traversal, and command injection. Examples:

✓ "bbc_co_uk"
✓ "news-spider-2024"
✗ "spider; DROP TABLE spiders--"
✗ "../etc/passwd"

SSRF Protection

Server-Side Request Forgery (SSRF) attacks trick the crawler into accessing internal resources.

URL Scheme Validation

@field_validator("source_url", "start_urls")
@classmethod
def validate_urls(cls, v):
    allowed_schemes = {"http", "https"}

    url_lower = url.lower()
    if not any(url_lower.startswith(f"{scheme}://") for scheme in allowed_schemes):
        raise ValueError(
            f"Invalid URL scheme: {url}. Only HTTP and HTTPS are allowed."
        )

Blocked schemes: file://, ftp://, gopher://, dict://, ldap://, and custom schemes.

Localhost Protection

parsed = urlparse(url)
hostname = parsed.hostname

if hostname in ("localhost", "0.0.0.0"):
    raise ValueError(f"URL points to localhost: {url}")

Blocked hostnames: localhost, 0.0.0.0, 127.0.0.1, ::1 (IPv6 loopback)

Private IP Protection

try:
    ip = ipaddress.ip_address(hostname)
    if ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved:
        raise ValueError(f"URL points to private/reserved IP: {url}")
except ValueError:
    try:
        results = socket.getaddrinfo(hostname, None, socket.AF_UNSPEC)
        for family, _, _, _, sockaddr in results:
            ip = ipaddress.ip_address(sockaddr[0])
            if ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved:
                raise ValueError(f"URL hostname '{hostname}' resolves to private IP {ip}")
    except socket.gaierror:
        pass

Blocked IP ranges: Private networks (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16), loopback (127.0.0.0/8), link-local (169.254.0.0/16), multicast, and reserved ranges. DNS resolution check catches public hostnames that resolve to private IPs.

URL Length Limit

if len(url) > 2048:
    raise ValueError(f"URL too long (max 2048 chars): {url[:50]}...")

Prevents: Buffer overflow, DoS via memory exhaustion, and excessively large database fields.

Settings Validation

Extractor Whitelist

@field_validator("EXTRACTOR_ORDER")
@classmethod
def validate_extractor_order(cls, v):
    if v is not None:
        allowed = {"newspaper", "trafilatura", "custom", "playwright"}
        for extractor in v:
            if extractor not in allowed:
                raise ValueError(f"Unknown extractor: {extractor}")
    return v

Prevents: Loading arbitrary Python modules as extractors.

Concurrency and Delay Limits

CONCURRENT_REQUESTS: Optional[int] = Field(default=None, ge=1, le=32)
DOWNLOAD_DELAY: Optional[float] = Field(default=None, ge=0, le=60)

Prevents: DoS via excessive concurrency and zero-delay hammering.

Processor Whitelist

@field_validator("type")
@classmethod
def validate_processor_type(cls, v):
    allowed = {"strip", "replace", "regex", "cast", "join", "default", "lowercase", "parse_datetime"}
    if v not in allowed:
        raise ValueError(f"Unknown processor type: {v}")
    return v

Prevents: Arbitrary code execution through custom processors.

Callback Validation

Reserved Names Protection

@field_validator("callbacks")
@classmethod
def validate_callbacks(cls, v):
    if v is None:
        return v

    reserved_names = {"parse_article", "parse_start_url", "start_requests", "from_crawler", "closed", "parse"}

    for callback_name in v.keys():
        if not re.match(r"^[a-zA-Z_][a-zA-Z0-9_]*$", callback_name):
            raise ValueError(f"Invalid callback name: '{callback_name}'")

        if callback_name in reserved_names:
            raise ValueError(f"Callback name '{callback_name}' is reserved")

Prevents: Overwriting built-in Scrapy methods, Python injection, and namespace collisions.

Cross-Validation

@model_validator(mode="after")
def validate_rule_callbacks(self):
    if not self.callbacks or not self.rules:
        return self

    defined_callbacks = set(self.callbacks.keys())
    defined_callbacks.update({"parse_article", None})

    for idx, rule in enumerate(self.rules):
        if rule.callback and rule.callback not in defined_callbacks:
            raise ValueError(f"Rule {idx} references undefined callback: '{rule.callback}'")

Prevents: Runtime errors from calling non-existent callbacks.

SQL Injection Protection

All database queries use SQLAlchemy ORM with parameterized bindings.

Safe Query Pattern

# SAFE: Parameterized query
spider = session.query(Spider).filter(Spider.name == spider_name).first()

# UNSAFE (never used in scrapai):
# query = f"SELECT * FROM spiders WHERE name = '{spider_name}'"
# session.execute(query)

CLI Database Access

db query validates table names against a whitelist; UPDATE/DELETE require row count confirmation.

ALLOWED_TABLES = {'spiders', 'spider_rules', 'spider_settings', 'scraped_items', 'queue'}

Agent Safety

scrapai’s approach: the agent writes config, not code.

Permission Rules (Claude Code)

Configured via ./scrapai setup:

.claude/settings.local.json

{
  "permissions": {
    "allow": [
      "Read",
      "Write",
      "Edit",
      "Update",
      "Glob",
      "Grep",
      "Bash(./scrapai:*)",
      "Bash(source:*)",
      "Bash(sqlite3:*)",
      "Bash(psql:*)",
      "Bash(xvfb-run:*)"
    ],
    "deny": [
      "Edit(scrapai)",
      "Update(scrapai)",
      "Edit(.claude/*)",
      "Update(.claude/*)",
      "Write(**/*.py)",
      "Edit(**/*.py)",
      "Update(**/*.py)",
      "MultiEdit(**/*.py)",
      "Write(.env)",
      "Write(secrets/**)",
      "Write(config/**/*.key)",
      "Write(**/*password*)",
      "Write(**/*secret*)",
      "WebFetch",
      "WebSearch",
      "Bash(rm:*)"
    ]
  }
}

What the agent can do:

Run ./scrapai commands, database clients, and allowed shell commands
Read any file (Python, JSON, configs, documentation)
Write, edit, and update JSON configs and analysis files
Use file search tools (Glob, Grep)

What the agent cannot do:

Modify Python framework code
Delete files
Run privileged commands
Execute arbitrary scripts

Validation Before Import

All configs must pass Pydantic validation before import:

$ ./scrapai spiders import malicious_config.json --project test
❌ Validation error:
  - URL points to localhost: http://127.0.0.1:8080
  - Callback name '__import__' is invalid
  - Extractor 'eval' not in allowed list

Deterministic Runtime

At runtime, no AI models are called, no LLM inference on scraped content, and Scrapy executes deterministically based on validated config.

Prompt Injection Resistance

Scraped pages cannot influence the agent:

AI is not in the crawl loop: Scrapy runs without LLM inference
Configs are static: Once imported, extraction rules don’t change based on page content
Validation is deterministic: Pydantic schemas don’t depend on context

Malicious page content is extracted as data only and the AI agent never sees it during crawl.

Security Checklist

Before deploying scrapai:

Reporting Vulnerabilities

Please DO NOT report security vulnerabilities through public GitHub issues. Email us directly: dev@discourselab.ai Include:

Type of vulnerability (SQL injection, command injection, SSRF, etc.)
Affected component (CLI command, spider, handler)
Steps to reproduce
Impact assessment

We’ll acknowledge within 72 hours and work with you on a fix.

In Scope

Injection vulnerabilities (SQL, command, code)
Path traversal / directory access
Remote code execution
Sensitive data exposure
Server-side request forgery (SSRF)
Insecure defaults

Out of Scope

Web scraping ethics (scraping public websites is not a vulnerability)
Cloudflare bypass techniques (core feature, not a bug)
Robots.txt violations (user responsibility)
Outdated dependencies (unless actively exploitable)

Migration

See validation in action during config import

Custom Callbacks

Write safe extraction logic with validated processors

Migrating Existing Scrapers

Get Started

Core Concepts

AI Agents

Guides

Configuration

Advanced

Security Model

Input Validation

Spider Config Schema

Spider Name Validation

SSRF Protection

URL Scheme Validation

Localhost Protection

Private IP Protection

URL Length Limit

Settings Validation

Extractor Whitelist

Concurrency and Delay Limits

Processor Whitelist

Callback Validation

Reserved Names Protection

Cross-Validation

SQL Injection Protection

Safe Query Pattern

CLI Database Access

Agent Safety

Permission Rules (Claude Code)

Validation Before Import

Deterministic Runtime

Prompt Injection Resistance

Security Checklist

Reporting Vulnerabilities

In Scope

Out of Scope

See Also

Migration

Custom Callbacks

​Security Model

​Input Validation

​Spider Config Schema

​Spider Name Validation

​SSRF Protection

​URL Scheme Validation

​Localhost Protection

​Private IP Protection

​URL Length Limit

​Settings Validation

​Extractor Whitelist

​Concurrency and Delay Limits

​Processor Whitelist

​Callback Validation

​Reserved Names Protection

​Cross-Validation

​SQL Injection Protection

​Safe Query Pattern

​CLI Database Access

​Agent Safety

​Permission Rules (Claude Code)

​Validation Before Import

​Deterministic Runtime

​Prompt Injection Resistance

​Security Checklist

​Reporting Vulnerabilities

​In Scope

​Out of Scope

​See Also

Migration

Custom Callbacks

Security Model

Input Validation

Spider Config Schema

Spider Name Validation

SSRF Protection

URL Scheme Validation

Localhost Protection

Private IP Protection

URL Length Limit

Settings Validation

Extractor Whitelist

Concurrency and Delay Limits

Processor Whitelist

Callback Validation

Reserved Names Protection

Cross-Validation

SQL Injection Protection

Safe Query Pattern

CLI Database Access

Agent Safety

Permission Rules (Claude Code)

Validation Before Import

Deterministic Runtime

Prompt Injection Resistance

Security Checklist

Reporting Vulnerabilities

In Scope

Out of Scope

See Also