ScrapAI implements defense-in-depth security with input validation, SSRF protection, and agent safety controls.

Security Model

ScrapAI’s approach: the agent writes config, not code. JSON configs are validated through Pydantic before import, and at runtime, Scrapy executes deterministically with no AI in the loop. Threat model:
  1. Malicious user input: URLs, spider names, settings
  2. AI agent hallucination: Generates bad configs or tries to modify framework code
  3. Prompt injection from scraped pages: Untrusted web content influencing agent behavior
  4. SSRF attacks: Crawling internal/private network resources
  5. SQL injection: Through spider names or settings

Input Validation

All spider configs go through strict Pydantic validation before touching the database or crawler.

Spider Config Schema

class SpiderConfigSchema(BaseModel):
    model_config = ConfigDict(extra="forbid")  # Reject unknown fields
    
    name: str = Field(..., min_length=1, max_length=255)
    source_url: str = Field(..., min_length=1)
    allowed_domains: List[str] = Field(..., min_items=1)
    start_urls: List[str] = Field(..., min_items=1)
    rules: List[SpiderRuleSchema] = Field(default_factory=list)
    settings: SpiderSettingsSchema = Field(default_factory=SpiderSettingsSchema)
    callbacks: Optional[Dict[str, CallbackSchema]] = Field(default=None)
Key protections:
  • extra="forbid": Unknown fields are rejected (prevents injection of arbitrary data)
  • Type enforcement: Strings must be strings, lists must be lists, etc.
  • Required fields: Missing data causes validation error
  • Field length limits: Prevents excessively large inputs
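As a quick illustration of the first protection (a minimal sketch, not the actual ScrapAI schema), extra="forbid" makes Pydantic raise a validation error for any unrecognized field instead of silently dropping it:

```python
from pydantic import BaseModel, ConfigDict, ValidationError

class StrictConfig(BaseModel):
    """Minimal stand-in for SpiderConfigSchema, showing extra="forbid"."""
    model_config = ConfigDict(extra="forbid")
    name: str

StrictConfig(name="bbc_co_uk")  # known fields pass

try:
    # An attacker-supplied field is rejected, not silently ignored
    StrictConfig(name="bbc_co_uk", injected="payload")
    rejected = False
except ValidationError:
    rejected = True
```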

Spider Name Validation

@field_validator("name")
@classmethod
def validate_name(cls, v):
    if not re.match(r"^[a-zA-Z0-9_-]+$", v):
        raise ValueError(
            f"Invalid spider name: {v}. "
            "Only alphanumeric characters, underscores, and hyphens allowed."
        )
    return v
Prevents: SQL injection, path traversal, and command injection. Examples:
  • Accepted: "bbc_co_uk", "news-spider-2024"
  • Rejected: "spider; DROP TABLE spiders--", "../etc/passwd"
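The allow-list regex can be exercised directly; a small sketch showing how the example names above are classified:

```python
import re

# Same pattern as validate_name: letters, digits, underscore, hyphen only
NAME_RE = re.compile(r"^[a-zA-Z0-9_-]+$")

accepted = [n for n in ("bbc_co_uk", "news-spider-2024") if NAME_RE.match(n)]
rejected = [n for n in ("spider; DROP TABLE spiders--", "../etc/passwd")
            if not NAME_RE.match(n)]
```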

SSRF Protection

Server-Side Request Forgery (SSRF) attacks trick the crawler into accessing internal resources.

URL Scheme Validation

@field_validator("source_url", "start_urls")
@classmethod
def validate_urls(cls, v):
    allowed_schemes = {"http", "https"}

    # start_urls is a list; source_url is a single string
    urls = v if isinstance(v, list) else [v]
    for url in urls:
        url_lower = url.lower()
        if not any(url_lower.startswith(f"{scheme}://") for scheme in allowed_schemes):
            raise ValueError(
                f"Invalid URL scheme: {url}. Only HTTP and HTTPS are allowed."
            )
    return v
Blocked schemes: file://, ftp://, gopher://, dict://, ldap://, and custom schemes.

Localhost Protection

parsed = urlparse(url)
hostname = parsed.hostname

if hostname in ("localhost", "0.0.0.0", "127.0.0.1", "::1"):
    raise ValueError(f"URL points to localhost: {url}")
Blocked hostnames: localhost, 0.0.0.0, 127.0.0.1, ::1 (IPv6 loopback)

Private IP Protection

try:
    ip = ipaddress.ip_address(hostname)
except ValueError:
    ip = None  # Not an IP literal; fall through to DNS resolution

if ip is not None:
    if ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved:
        raise ValueError(f"URL points to private/reserved IP: {url}")
else:
    try:
        results = socket.getaddrinfo(hostname, None, socket.AF_UNSPEC)
        for family, _, _, _, sockaddr in results:
            resolved = ipaddress.ip_address(sockaddr[0])
            if resolved.is_private or resolved.is_loopback or resolved.is_link_local or resolved.is_reserved:
                raise ValueError(f"URL hostname '{hostname}' resolves to private IP {resolved}")
    except socket.gaierror:
        pass  # Unresolvable hostnames fail later at crawl time
Blocked IP ranges: Private networks (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16), loopback (127.0.0.0/8), link-local (169.254.0.0/16), multicast, and reserved ranges. DNS resolution check catches public hostnames that resolve to private IPs.
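The range classification comes from the stdlib ipaddress module; a small sketch of the checks the validator relies on:

```python
import ipaddress

def is_blocked(ip_str: str) -> bool:
    """True if the address falls in a range the SSRF check rejects."""
    ip = ipaddress.ip_address(ip_str)
    return ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved

# Private, loopback, and link-local addresses are all flagged
assert is_blocked("10.0.0.1")
assert is_blocked("192.168.1.1")
assert is_blocked("127.0.0.1")
assert is_blocked("169.254.0.1")
# Ordinary public addresses are not
assert not is_blocked("93.184.216.34")
```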

URL Length Limit

if len(url) > 2048:
    raise ValueError(f"URL too long (max 2048 chars): {url[:50]}...")
Prevents: DoS via memory exhaustion and oversized database fields.

Settings Validation

Extractor Whitelist

@field_validator("EXTRACTOR_ORDER")
@classmethod
def validate_extractor_order(cls, v):
    if v is not None:
        allowed = {"newspaper", "trafilatura", "custom", "playwright"}
        for extractor in v:
            if extractor not in allowed:
                raise ValueError(f"Unknown extractor: {extractor}")
    return v
Prevents: Loading arbitrary Python modules as extractors.

Concurrency and Delay Limits

CONCURRENT_REQUESTS: Optional[int] = Field(default=None, ge=1, le=32)
DOWNLOAD_DELAY: Optional[float] = Field(default=None, ge=0, le=60)
Prevents: DoS via excessive concurrency and zero-delay hammering.
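Pydantic's ge/le bounds enforce these limits at validation time. A minimal sketch (the field names match the snippet above; the model itself is illustrative):

```python
from typing import Optional
from pydantic import BaseModel, Field, ValidationError

class SettingsSketch(BaseModel):
    """Illustrative subset of the settings schema with bounds."""
    CONCURRENT_REQUESTS: Optional[int] = Field(default=None, ge=1, le=32)
    DOWNLOAD_DELAY: Optional[float] = Field(default=None, ge=0, le=60)

SettingsSketch(CONCURRENT_REQUESTS=16, DOWNLOAD_DELAY=1.5)  # within bounds

try:
    SettingsSketch(CONCURRENT_REQUESTS=500)  # over the cap of 32
    capped = False
except ValidationError:
    capped = True
```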

Processor Whitelist

@field_validator("type")
@classmethod
def validate_processor_type(cls, v):
    allowed = {"strip", "replace", "regex", "cast", "join", "default", "lowercase", "parse_datetime"}
    if v not in allowed:
        raise ValueError(f"Unknown processor type: {v}")
    return v
Prevents: Arbitrary code execution through custom processors.

Callback Validation

Reserved Names Protection

@field_validator("callbacks")
@classmethod
def validate_callbacks(cls, v):
    if v is None:
        return v

    reserved_names = {"parse_article", "parse_start_url", "start_requests", "from_crawler", "closed", "parse"}

    for callback_name in v.keys():
        if not re.match(r"^[a-zA-Z_][a-zA-Z0-9_]*$", callback_name):
            raise ValueError(f"Invalid callback name: '{callback_name}'")

        if callback_name in reserved_names:
            raise ValueError(f"Callback name '{callback_name}' is reserved")

    return v
Prevents: Overwriting built-in Scrapy methods, Python injection, and namespace collisions.

Cross-Validation

@model_validator(mode="after")
def validate_rule_callbacks(self):
    if not self.callbacks or not self.rules:
        return self

    defined_callbacks = set(self.callbacks.keys())
    defined_callbacks.update({"parse_article", None})

    for idx, rule in enumerate(self.rules):
        if rule.callback and rule.callback not in defined_callbacks:
            raise ValueError(f"Rule {idx} references undefined callback: '{rule.callback}'")

    return self
Prevents: Runtime errors from calling non-existent callbacks.

SQL Injection Protection

All database queries use SQLAlchemy ORM with parameterized bindings.

Safe Query Pattern

# SAFE: Parameterized query
spider = session.query(Spider).filter(Spider.name == spider_name).first()

# UNSAFE (never used in ScrapAI):
# query = f"SELECT * FROM spiders WHERE name = '{spider_name}'"
# session.execute(query)

CLI Database Access

The db query command validates table names against a whitelist; UPDATE and DELETE statements require row-count confirmation.
ALLOWED_TABLES = {'spiders', 'spider_rules', 'spider_settings', 'scraped_items', 'queue'}
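A hypothetical sketch of such a whitelist check (the actual CLI implementation may differ):

```python
ALLOWED_TABLES = {"spiders", "spider_rules", "spider_settings", "scraped_items", "queue"}

def check_table(name: str) -> str:
    """Reject any table name not on the whitelist before it reaches a query."""
    if name not in ALLOWED_TABLES:
        raise ValueError(f"Table not allowed: {name}")
    return name

check_table("spiders")  # passes

try:
    check_table("sqlite_master")  # introspection table: rejected
    allowed = True
except ValueError:
    allowed = False
```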

Agent Safety

ScrapAI’s approach: the agent writes config, not code.

Permission Rules (Claude Code)

Configured via ./scrapai setup:
.claude/settings.local.json
{
  "permissions": {
    "allow": [
      "Read",
      "Write",
      "Edit",
      "Update",
      "Glob",
      "Grep",
      "Bash(./scrapai:*)",
      "Bash(source:*)",
      "Bash(sqlite3:*)",
      "Bash(psql:*)",
      "Bash(xvfb-run:*)"
    ],
    "deny": [
      "Edit(scrapai)",
      "Update(scrapai)",
      "Edit(.claude/*)",
      "Update(.claude/*)",
      "Write(**/*.py)",
      "Edit(**/*.py)",
      "Update(**/*.py)",
      "MultiEdit(**/*.py)",
      "Write(.env)",
      "Write(secrets/**)",
      "Write(config/**/*.key)",
      "Write(**/*password*)",
      "Write(**/*secret*)",
      "WebFetch",
      "WebSearch",
      "Bash(rm:*)"
    ]
  }
}
What the agent can do:
  • Run ./scrapai commands, database clients, and allowed shell commands
  • Read any file (Python, JSON, configs, documentation)
  • Write, edit, and update JSON configs and analysis files
  • Use file search tools (Glob, Grep)
What the agent cannot do:
  • Modify Python framework code
  • Delete files
  • Run privileged commands
  • Execute arbitrary scripts

Validation Before Import

All configs must pass Pydantic validation before import:
$ ./scrapai spiders import malicious_config.json --project test
Validation error:
  - URL points to localhost: http://127.0.0.1:8080
  - Callback name '__import__' is invalid
  - Extractor 'eval' not in allowed list

Deterministic Runtime

At runtime, no AI models are called, no LLM inference on scraped content, and Scrapy executes deterministically based on validated config.

Prompt Injection Resistance

Scraped pages cannot influence the agent:
  1. AI is not in the crawl loop: Scrapy runs without LLM inference
  2. Configs are static: Once imported, extraction rules don’t change based on page content
  3. Validation is deterministic: Pydantic schemas don’t depend on context
Malicious page content is extracted as data only; the AI agent never sees it during the crawl.

Security Checklist

Before deploying ScrapAI:
  • Review all spider configs for localhost/private IPs
  • Confirm extra="forbid" in Pydantic schemas
  • Verify SQLAlchemy ORM usage (no raw SQL)
  • Test SSRF protection with internal hostnames
  • Configure Claude Code permissions (if using AI agent)
  • Set up database backups
  • Enable SSL/TLS for PostgreSQL connections
  • Rotate S3 credentials if using S3 upload
  • Monitor logs for validation errors
  • Set up alerts for repeated validation failures (possible attack)

Reporting Vulnerabilities

Please DO NOT report security vulnerabilities through public GitHub issues. Email us directly at dev@discourselab.ai. Include:
  1. Type of vulnerability (SQL injection, command injection, SSRF, etc.)
  2. Affected component (CLI command, spider, handler)
  3. Steps to reproduce
  4. Impact assessment
We’ll acknowledge within 72 hours and work with you on a fix.

In Scope

  • Injection vulnerabilities (SQL, command, code)
  • Path traversal / directory access
  • Remote code execution
  • Sensitive data exposure
  • Server-side request forgery (SSRF)
  • Insecure defaults

Out of Scope

  • Web scraping ethics (scraping public websites is not a vulnerability)
  • Cloudflare bypass techniques (core feature, not a bug)
  • Robots.txt violations (user responsibility)
  • Outdated dependencies (unless actively exploitable)

See Also

Migration

See validation in action during config import

Custom Callbacks

Write safe extraction logic with validated processors