Complete JSON schema for spider configuration. Import spiders from JSON files using scrapai spiders import <file> --project <name>.

Root Schema

name
string
required
Spider identifier (alphanumeric, underscores, hyphens only)
Validation:
  • Min length: 1
  • Max length: 255
  • Pattern: ^[a-zA-Z0-9_-]+$
Example: "bbc_co_uk", "example_shop"
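The name constraints above can be checked with a plain regex. The helper below is a sketch of the documented rules, not the importer's actual code; `is_valid_spider_name` is a hypothetical name.

```python
import re

# Documented constraints on `name`: 1-255 chars, pattern ^[a-zA-Z0-9_-]+$
NAME_PATTERN = re.compile(r"^[a-zA-Z0-9_-]+$")

def is_valid_spider_name(name: str) -> bool:
    """Return True if `name` satisfies the documented pattern and length limits."""
    return 1 <= len(name) <= 255 and NAME_PATTERN.match(name) is not None
```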
source_url
string
required
Original website URL
Validation:
  • Must use http:// or https:// scheme
  • Max length: 2048 characters
  • No localhost or private IPs (SSRF protection)
Example: "https://www.bbc.co.uk/"
allowed_domains
string[]
required
List of domains the spider can crawl
Validation:
  • Min items: 1
  • Valid domain format
  • No localhost or private domains
Example:
["bbc.co.uk", "www.bbc.co.uk"]
start_urls
string[]
required
Initial URLs to crawl from
Validation:
  • Min items: 1
  • Must use HTTP/HTTPS
  • No localhost or private IPs
  • Max 2048 chars per URL
Example:
["https://www.bbc.co.uk/"]
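The URL rules shared by source_url and start_urls (scheme, length, no localhost) can be sketched as a standalone check. This is an illustration of the documented constraints using only the standard library; `is_valid_start_url` is a hypothetical helper, and the real validator also rejects private IP ranges via DNS resolution (see Security Validation below).

```python
from urllib.parse import urlparse

LOCALHOST_NAMES = {"localhost", "127.0.0.1", "0.0.0.0", "::1"}

def is_valid_start_url(url: str) -> bool:
    """Check the documented URL rules: http/https scheme,
    at most 2048 characters, and no localhost host."""
    if len(url) > 2048:
        return False
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    if parsed.hostname in LOCALHOST_NAMES:
        return False
    return True
```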
rules
SpiderRuleSchema[]
default: []
URL matching and routing rules
See Spider Rules for detailed schema.
Example:
[
  {
    "allow": ["/news/articles/.*"],
    "deny": ["/news/articles/.*#comments"],
    "callback": "parse_article"
  }
]
settings
SpiderSettingsSchema
default: {}
Spider configuration settings
See Spider Settings for all available options.
Example:
{
  "EXTRACTOR_ORDER": ["newspaper", "trafilatura"],
  "DOWNLOAD_DELAY": 1,
  "CONCURRENT_REQUESTS": 16
}
callbacks
object
default: null
Named callback extraction configurations
Keys must be valid Python identifiers. See Callbacks for schema.
Reserved names (cannot use):
  • parse_article
  • parse_start_url
  • start_requests
  • from_crawler
  • closed
  • parse
Example:
{
  "parse_product": {
    "extract": {
      "name": {"css": "h1.product-name::text"},
      "price": {"css": "span.price::text"}
    }
  }
}
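A callback key therefore has to pass two tests: it must be a valid Python identifier, and it must not collide with the reserved names listed above. A minimal sketch of that check (the helper name and use of `keyword.iskeyword` are illustrative assumptions):

```python
import keyword

# Reserved names from the documentation above.
RESERVED_CALLBACKS = {
    "parse_article", "parse_start_url", "start_requests",
    "from_crawler", "closed", "parse",
}

def is_valid_callback_name(name: str) -> bool:
    """True if `name` is a valid, non-keyword Python identifier
    that does not shadow a reserved spider method."""
    return (
        name.isidentifier()
        and not keyword.iskeyword(name)
        and name not in RESERVED_CALLBACKS
    )
```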

Complete Example

News Site (BBC)

{
  "name": "bbc_co_uk",
  "source_url": "https://bbc.co.uk/",
  "allowed_domains": ["bbc.co.uk", "www.bbc.co.uk"],
  "start_urls": ["https://www.bbc.co.uk/"],
  "rules": [
    {
      "allow": ["/news/articles/.*"],
      "deny": ["/news/articles/.*#comments"],
      "callback": "parse_article"
    },
    {
      "allow": ["/sport/.*/articles/.*"],
      "deny": ["/sport/.*/articles/.*#comments"],
      "callback": "parse_article"
    }
  ],
  "settings": {
    "EXTRACTOR_ORDER": ["newspaper", "trafilatura"],
    "DOWNLOAD_DELAY": 1,
    "CONCURRENT_REQUESTS": 16,
    "ROBOTSTXT_OBEY": true
  }
}

E-commerce Site

{
  "name": "example_shop",
  "source_url": "https://shop.example.com",
  "allowed_domains": ["shop.example.com"],
  "start_urls": ["https://shop.example.com/products"],
  "rules": [
    {
      "allow": ["/product/[^/]+$"],
      "callback": "parse_product",
      "follow": false
    },
    {
      "allow": ["/products", "/category/"],
      "callback": null,
      "follow": true
    }
  ],
  "callbacks": {
    "parse_product": {
      "extract": {
        "name": {
          "css": "h1.product-name::text",
          "processors": [{"type": "strip"}]
        },
        "price": {
          "css": "span.price::text",
          "processors": [
            {"type": "strip"},
            {"type": "regex", "pattern": "\\$([\\d,.]+)"},
            {"type": "replace", "old": ",", "new": ""},
            {"type": "cast", "to": "float"}
          ]
        },
        "rating": {
          "css": "span.rating::attr(data-rating)",
          "processors": [{"type": "cast", "to": "float"}]
        }
      }
    }
  }
}
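The price field's processor chain above can be traced by hand: strip whitespace, capture the digits after the currency symbol, drop the thousands separator, then cast. The function below replays that chain in plain Python as an illustration; the semantics of each processor type are assumed from its name, not taken from the library's source.

```python
import re

def run_price_processors(raw: str) -> float:
    """Replay the documented processor chain for `price` on a raw extraction."""
    value = raw.strip()                    # {"type": "strip"}
    match = re.search(r"\$([\d,.]+)", value)  # {"type": "regex", "pattern": ...}
    value = match.group(1)
    value = value.replace(",", "")         # {"type": "replace", "old": ",", "new": ""}
    return float(value)                    # {"type": "cast", "to": "float"}
```

For example, a raw extraction of `"  $1,299.99  "` comes out as the float `1299.99`.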

Validation Rules

Cross-Field Validation

Rule callbacks must be defined:
  • If a rule references a callback name, that callback must exist in the callbacks object
  • Built-in callback parse_article is always available
  • Use "callback": null for navigation-only rules
Example error:
Rule 0 references undefined callback: 'parse_product'. 
Defined callbacks: parse_article
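This cross-field rule can be expressed as a small checker: collect the defined callback names plus the built-in, then flag any rule whose callback is neither null nor defined. The function and error wording below are a sketch modeled on the example error above, not the validator's actual implementation.

```python
BUILTIN_CALLBACKS = {"parse_article"}

def check_rule_callbacks(rules, callbacks):
    """Return a list of error strings for rules that reference
    callbacks missing from both `callbacks` and the built-ins."""
    defined = BUILTIN_CALLBACKS | set(callbacks or {})
    errors = []
    for i, rule in enumerate(rules):
        cb = rule.get("callback")
        if cb is not None and cb not in defined:
            errors.append(
                f"Rule {i} references undefined callback: '{cb}'. "
                f"Defined callbacks: {', '.join(sorted(defined))}"
            )
    return errors
```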

Security Validation

SSRF Protection:
  • URLs cannot point to localhost (localhost, 127.0.0.1, 0.0.0.0, ::1)
  • URLs cannot point to private IP ranges
  • DNS resolution checked for private IPs
Injection Prevention:
  • Spider names sanitized (alphanumeric + underscore/hyphen only)
  • Callback names must be valid Python identifiers
  • Reserved names blocked
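The DNS-resolution step of the SSRF check can be sketched with the standard library: resolve the hostname and reject any address that is private, loopback, or link-local. This is an assumed implementation of the behavior described above (the real validator may differ, e.g. in how it treats resolution failures, which this sketch conservatively flags as unsafe).

```python
import ipaddress
import socket

def resolves_to_private_ip(hostname: str) -> bool:
    """Resolve `hostname` and flag private, loopback, or link-local
    addresses. Resolution failure is treated as unsafe."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return True
    for info in infos:
        addr = ipaddress.ip_address(info[4][0])
        if addr.is_private or addr.is_loopback or addr.is_link_local:
            return True
    return False
```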

Import Command

scrapai spiders import spider.json --project myproject
Validation errors will be displayed with specific field paths and error messages.