Complete JSON schema for spider configuration. Import spiders from JSON files using scrapai spiders import <file> --project <name>.

Root Schema

name
string
required
Spider identifier (alphanumeric, underscores, hyphens only)
Validation:
  • Min length: 1
  • Max length: 255
  • Pattern: ^[a-zA-Z0-9_-]+$
Example: "bbc_co_uk", "example_shop"
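The name constraints above can be checked with a plain regex. The helper below is a sketch of the documented rules, not the importer's actual code; `is_valid_spider_name` is a hypothetical name.

```python
import re

# Documented constraints on `name`: 1-255 chars, pattern ^[a-zA-Z0-9_-]+$
NAME_PATTERN = re.compile(r"^[a-zA-Z0-9_-]+$")

def is_valid_spider_name(name: str) -> bool:
    """Return True if `name` satisfies the documented pattern and length limits."""
    return 1 <= len(name) <= 255 and NAME_PATTERN.match(name) is not None
```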
source_url
string
required
Original website URL
Validation:
  • Must use http:// or https:// scheme
  • Max length: 2048 characters
  • No localhost or private IPs (SSRF protection)
Example: "https://www.bbc.co.uk/"
allowed_domains
string[]
required
List of domains the spider can crawl
Validation:
  • Min items: 1
  • Valid domain format
  • No localhost or private domains
Example:
["bbc.co.uk", "www.bbc.co.uk"]
start_urls
string[]
required
Initial URLs to crawl from
Validation:
  • Min items: 1
  • Must use HTTP/HTTPS
  • No localhost or private IPs
  • Max 2048 chars per URL
Example:
["https://www.bbc.co.uk/"]
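The URL rules shared by source_url and start_urls (scheme, length, no localhost) can be sketched as a standalone check. This is an illustration of the documented constraints using only the standard library; `is_valid_start_url` is a hypothetical helper, and the real validator also rejects private IP ranges via DNS resolution (see Security Validation below).

```python
from urllib.parse import urlparse

LOCALHOST_NAMES = {"localhost", "127.0.0.1", "0.0.0.0", "::1"}

def is_valid_start_url(url: str) -> bool:
    """Check the documented URL rules: http/https scheme,
    at most 2048 characters, and no localhost host."""
    if len(url) > 2048:
        return False
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    if parsed.hostname in LOCALHOST_NAMES:
        return False
    return True
```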
rules
SpiderRuleSchema[]
default: []
URL matching and routing rules
See Spider Rules for detailed schema.
Example:
[
  {
    "allow": ["/news/articles/.*"],
    "deny": ["/news/articles/.*#comments"],
    "callback": "parse_article"
  }
]
settings
SpiderSettingsSchema
default: {}
Spider configuration settings
See Spider Settings for all available options.
Example:
{
  "EXTRACTOR_ORDER": ["newspaper", "trafilatura"],
  "DOWNLOAD_DELAY": 1,
  "CONCURRENT_REQUESTS": 16
}
callbacks
object
default: null
Named callback extraction configurations
Keys must be valid Python identifiers. See Callbacks for schema.
Reserved names (cannot use):
  • parse_article
  • parse_start_url
  • start_requests
  • from_crawler
  • closed
  • parse
Example:
{
  "parse_product": {
    "extract": {
      "name": {"css": "h1.product-name::text"},
      "price": {"css": "span.price::text"}
    }
  }
}
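A callback key therefore has to pass two tests: it must be a valid Python identifier, and it must not collide with the reserved names listed above. A minimal sketch of that check (the helper name and use of `keyword.iskeyword` are illustrative assumptions):

```python
import keyword

# Reserved names from the documentation above.
RESERVED_CALLBACKS = {
    "parse_article", "parse_start_url", "start_requests",
    "from_crawler", "closed", "parse",
}

def is_valid_callback_name(name: str) -> bool:
    """True if `name` is a valid, non-keyword Python identifier
    that does not shadow a reserved spider method."""
    return (
        name.isidentifier()
        and not keyword.iskeyword(name)
        and name not in RESERVED_CALLBACKS
    )
```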

Complete Example

News Site (BBC)

{
  "name": "bbc_co_uk",
  "source_url": "https://bbc.co.uk/",
  "allowed_domains": ["bbc.co.uk", "www.bbc.co.uk"],
  "start_urls": ["https://www.bbc.co.uk/"],
  "rules": [
    {
      "allow": ["/news/articles/.*"],
      "deny": ["/news/articles/.*#comments"],
      "callback": "parse_article"
    },
    {
      "allow": ["/sport/.*/articles/.*"],
      "deny": ["/sport/.*/articles/.*#comments"],
      "callback": "parse_article"
    }
  ],
  "settings": {
    "EXTRACTOR_ORDER": ["newspaper", "trafilatura"],
    "DOWNLOAD_DELAY": 1,
    "CONCURRENT_REQUESTS": 16,
    "ROBOTSTXT_OBEY": true
  }
}

E-commerce Site

{
  "name": "example_shop",
  "source_url": "https://shop.example.com",
  "allowed_domains": ["shop.example.com"],
  "start_urls": ["https://shop.example.com/products"],
  "rules": [
    {
      "allow": ["/product/[^/]+$"],
      "callback": "parse_product",
      "follow": false
    },
    {
      "allow": ["/products", "/category/"],
      "callback": null,
      "follow": true
    }
  ],
  "callbacks": {
    "parse_product": {
      "extract": {
        "name": {
          "css": "h1.product-name::text",
          "processors": [{"type": "strip"}]
        },
        "price": {
          "css": "span.price::text",
          "processors": [
            {"type": "strip"},
            {"type": "regex", "pattern": "\\$([\\d,.]+)"},
            {"type": "replace", "old": ",", "new": ""},
            {"type": "cast", "to": "float"}
          ]
        },
        "rating": {
          "css": "span.rating::attr(data-rating)",
          "processors": [{"type": "cast", "to": "float"}]
        }
      }
    }
  }
}
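The price field's processor chain above can be traced by hand: strip whitespace, capture the digits after the currency symbol, drop the thousands separator, then cast. The function below replays that chain in plain Python as an illustration; the semantics of each processor type are assumed from its name, not taken from the library's source.

```python
import re

def run_price_processors(raw: str) -> float:
    """Replay the documented processor chain for `price` on a raw extraction."""
    value = raw.strip()                    # {"type": "strip"}
    match = re.search(r"\$([\d,.]+)", value)  # {"type": "regex", "pattern": ...}
    value = match.group(1)
    value = value.replace(",", "")         # {"type": "replace", "old": ",", "new": ""}
    return float(value)                    # {"type": "cast", "to": "float"}
```

For example, a raw extraction of `"  $1,299.99  "` comes out as the float `1299.99`.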

Validation Rules

Cross-Field Validation

Rule callbacks must be defined:
  • If a rule references a callback name, that callback must exist in the callbacks object
  • Built-in callback parse_article is always available
  • Use "callback": null for navigation-only rules
Example error:
Rule 0 references undefined callback: 'parse_product'. 
Defined callbacks: parse_article
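This cross-field rule can be expressed as a small checker: collect the defined callback names plus the built-in, then flag any rule whose callback is neither null nor defined. The function and error wording below are a sketch modeled on the example error above, not the validator's actual implementation.

```python
BUILTIN_CALLBACKS = {"parse_article"}

def check_rule_callbacks(rules, callbacks):
    """Return a list of error strings for rules that reference
    callbacks missing from both `callbacks` and the built-ins."""
    defined = BUILTIN_CALLBACKS | set(callbacks or {})
    errors = []
    for i, rule in enumerate(rules):
        cb = rule.get("callback")
        if cb is not None and cb not in defined:
            errors.append(
                f"Rule {i} references undefined callback: '{cb}'. "
                f"Defined callbacks: {', '.join(sorted(defined))}"
            )
    return errors
```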

Security Validation

SSRF Protection:
  • URLs cannot point to localhost (localhost, 127.0.0.1, 0.0.0.0, ::1)
  • URLs cannot point to private IP ranges
  • DNS resolution checked for private IPs
Injection Prevention:
  • Spider names sanitized (alphanumeric + underscore/hyphen only)
  • Callback names must be valid Python identifiers
  • Reserved names blocked
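The DNS-resolution step of the SSRF check can be sketched with the standard library: resolve the hostname and reject any address that is private, loopback, or link-local. This is an assumed implementation of the behavior described above (the real validator may differ, e.g. in how it treats resolution failures, which this sketch conservatively flags as unsafe).

```python
import ipaddress
import socket

def resolves_to_private_ip(hostname: str) -> bool:
    """Resolve `hostname` and flag private, loopback, or link-local
    addresses. Resolution failure is treated as unsafe."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return True
    for info in infos:
        addr = ipaddress.ip_address(info[4][0])
        if addr.is_private or addr.is_loopback or addr.is_link_local:
            return True
    return False
```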

Import Command

scrapai spiders import spider.json --project myproject
Validation errors will be displayed with specific field paths and error messages.