Spider Rules

Rules control which URLs are followed and how they are processed. Each rule defines URL patterns and optional callbacks for extraction.

SpiderRuleSchema

allow

string[]

default:"null"

Python regex patterns for URLs to allow (non-empty strings).

Example:

"allow": ["/news/articles/.*", "/blog/.*"]

deny

string[]

default:"null"

Python regex patterns for URLs to deny (takes precedence over allow).

Example:

"deny": ["/news/articles/.*#comments", ".*\\?page=.*"]

restrict_xpaths

string[]

default:"null"

Only follow links found in these XPath expressions.

Example:

"restrict_xpaths": ["//div[@class='main-content']//a"]

restrict_css

string[]

default:"null"

Only follow links found in these CSS selectors.

Example:

"restrict_css": ["div.article-list a", "nav.pagination a"]

callback

string

default:"null"

Callback function name for processing matched URLs. Must be a valid Python identifier and defined in the callbacks object.

Built-in callbacks:

parse_article - Extract article content using configured extractors

Reserved names:

parse_article, parse_start_url, start_requests, from_crawler, closed, parse

Example:

"callback": "parse_product"

Use null for navigation-only rules (follow links but don’t extract).

follow

boolean

default:"true"

Whether to follow links matching this rule.

Example:

"follow": false

priority

integer

default:"0"

Rule priority (higher = processed first). Range: 0-1000.

Example:

"priority": 10

Rule Matching

Allow/Deny Precedence

If URL matches any deny pattern → rejected
If URL matches any allow pattern → accepted
If no allow patterns defined → accepted by default
Otherwise → rejected
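
The precedence steps above can be sketched as a small Python function. This is a minimal illustration, not the library's implementation: the function name url_allowed is hypothetical, and it assumes patterns are applied with re.search against the full URL.

```python
import re

def url_allowed(url, allow=None, deny=None):
    """Apply the allow/deny precedence described above (illustrative only)."""
    allow = allow or []
    deny = deny or []
    # Any deny match rejects the URL, regardless of allow patterns.
    if any(re.search(pattern, url) for pattern in deny):
        return False
    # With no allow patterns, everything that was not denied is accepted.
    if not allow:
        return True
    # Otherwise at least one allow pattern has to match.
    return any(re.search(pattern, url) for pattern in allow)

allow = ["/news/articles/.*"]
deny = ["/news/articles/.*#comments"]
print(url_allowed("https://example.com/news/articles/42", allow, deny))
print(url_allowed("https://example.com/news/articles/42#comments", allow, deny))
```

The first call is accepted and the second is rejected, because the deny pattern is checked before any allow pattern.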

Restriction Scopes

restrict_xpaths and restrict_css limit where links are extracted from
Links outside these scopes are ignored, even if they match allow patterns
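
To make the scoping behavior concrete, here is a rough standard-library sketch. It assumes a well-formed page and uses xml.etree.ElementTree, which supports only a subset of XPath; a real crawler would use a full HTML parser. The page markup and the helper name links_in_scope are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical page: both links match the allow pattern,
# but only one sits inside the restricted scope.
PAGE = """
<html><body>
  <div class="main-content"><a href="/article/1">One</a></div>
  <div class="sidebar"><a href="/article/2">Two</a></div>
</body></html>
"""

def links_in_scope(markup, scope_xpath, allow):
    """Collect hrefs from anchors selected by scope_xpath, then apply allow."""
    root = ET.fromstring(markup)
    hrefs = [a.get("href") for a in root.findall(scope_xpath)]
    return [h for h in hrefs if h and any(re.search(p, h) for p in allow)]

followed = links_in_scope(PAGE, ".//div[@class='main-content']//a", ["/article/.*"])
print(followed)
```

The sidebar link is never considered, even though its URL matches the allow pattern, because it lies outside the selected scope.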

Examples

News Site

{
  "rules": [
    {
      "allow": ["/news/articles/.*"],
      "deny": ["/news/articles/.*#comments"],
      "callback": "parse_article",
      "follow": false
    },
    {
      "allow": ["/news/", "/sport/"],
      "callback": null,
      "follow": true
    }
  ]
}

E-commerce

{
  "rules": [
    {
      "allow": ["/product/[^/]+$"],
      "callback": "parse_product",
      "follow": false,
      "priority": 5
    },
    {
      "allow": ["/products", "/category/"],
      "callback": null,
      "follow": true
    }
  ]
}

Forum/Discussion

{
  "rules": [
    {
      "allow": ["/item\\?id=\\d+"],
      "deny": ["/vote", "/reply", "/user"],
      "callback": "parse_discussion",
      "follow": false
    }
  ]
}

Restrict Link Sources

{
  "rules": [
    {
      "allow": ["/article/.*"],
      "restrict_css": ["div.main-content a", "nav.pagination a"],
      "callback": "parse_article"
    }
  ]
}

Common Patterns

Exact Path Match

"allow": ["/about$", "/contact$"]

Exclude Query Parameters

"deny": [".*\\?.*"]

Match Numeric IDs

"allow": ["/product/\\d+$", "/job/\\d+$"]
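
The following sketch checks how the patterns above behave. It assumes patterns are applied with search semantics against the full URL (which is what the "^https://" anchors in the Multiple Domains example suggest); the URLs are invented for illustration. It also shows that the doubled backslash in JSON becomes a single backslash once the config is parsed.

```python
import json
import re

# JSON requires "\\d"; after parsing, the regex engine sees "\d".
config = json.loads(r'{"allow": ["/product/\\d+$", "/about$"]}')
numeric_id, about = config["allow"]

def matches(pattern, url):
    """True if the pattern is found anywhere in the URL (search semantics)."""
    return re.search(pattern, url) is not None

checks = {
    "numeric id": matches(numeric_id, "https://example.com/product/123"),
    "slug instead of id": matches(numeric_id, "https://example.com/product/blue-shirt"),
    "trailing query string": matches(numeric_id, "https://example.com/product/123?ref=home"),
    "about page": matches(about, "https://example.com/about"),
    "nested about page": matches(about, "https://example.com/team/about"),
}
for label, result in checks.items():
    print(f"{label}: {result}")
```

Under this assumption, "/about$" only anchors the end of the URL, so it also matches /team/about. To pin the whole path, anchor the start as well, as in the Multiple Domains pattern.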

Multiple Domains

{
  "allowed_domains": ["example.com", "blog.example.com"],
  "rules": [
    {"allow": ["^https://example\\.com/products/.*"]},
    {"allow": ["^https://blog\\.example\\.com/posts/.*"]}
  ]
}

Pagination

{
  "allow": ["/page/\\d+$"],
  "callback": null,
  "follow": true
}

Rule Order

Rules are processed in array order. Use priority to control processing order within Scrapy; rules with a higher priority are processed first.
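
One way to picture how array order and priority could interact is the sketch below. It is an assumption, not documented behavior: it treats priority as the primary key and array order as the tie-breaker. The helper name processing_order is hypothetical.

```python
rules = [
    {"allow": ["/products", "/category/"]},          # priority defaults to 0
    {"allow": ["/product/[^/]+$"], "priority": 5},
    {"allow": ["/page/\\d+$"]},                      # priority defaults to 0
]

def processing_order(rules):
    """Higher priority first; sorted() is stable, so ties keep array order."""
    return sorted(rules, key=lambda rule: rule.get("priority", 0), reverse=True)

for rule in processing_order(rules):
    print(rule.get("priority", 0), rule["allow"])
```

Here the product rule moves to the front, and the two priority-0 rules stay in their original relative order.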

Validation Errors

Undefined Callback

Rule 0 references undefined callback: 'parse_product'. 
Defined callbacks: parse_article

Fix: Add the callback to the callbacks object, or use the built-in parse_article

Invalid Callback Name

Invalid callback name: 'parse-product'. 
Must be a valid Python identifier.

Fix: Use underscores instead of hyphens: parse_product

Empty Patterns

Patterns must be non-empty strings

Fix: Remove empty strings from allow/deny arrays
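
The three checks above can be approximated before submitting a configuration. This is a minimal sketch, not the actual validator: the function name validate_rules and the message wording are illustrative, and it assumes parse_article is the only built-in callback.

```python
import keyword

BUILT_IN_CALLBACKS = {"parse_article"}  # assumption: the only built-in callback

def validate_rules(rules, callbacks):
    """Return a list of human-readable problems found in the rules."""
    errors = []
    defined = BUILT_IN_CALLBACKS | set(callbacks)
    for index, rule in enumerate(rules):
        # Empty Patterns: every allow/deny entry must be a non-empty string.
        for key in ("allow", "deny"):
            for pattern in rule.get(key) or []:
                if not isinstance(pattern, str) or not pattern:
                    errors.append(f"Rule {index}: {key} patterns must be non-empty strings")
        name = rule.get("callback")
        if name is None:
            continue  # navigation-only rule, nothing to check
        # Invalid Callback Name: must be a valid Python identifier.
        if not name.isidentifier() or keyword.iskeyword(name):
            errors.append(f"Rule {index}: invalid callback name: '{name}'")
        # Undefined Callback: must be built in or present in callbacks.
        elif name not in defined:
            errors.append(f"Rule {index} references undefined callback: '{name}'")
    return errors

problems = validate_rules(
    [
        {"allow": ["/product/\\d+$", ""], "callback": "parse-product"},
        {"allow": ["/item/.*"], "callback": "parse_item"},
        {"allow": ["/news/.*"], "callback": "parse_article"},
    ],
    callbacks={},
)
for problem in problems:
    print(problem)
```

Running the check locally catches hyphenated names, missing callback definitions, and empty patterns in one pass.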

Related

Spider Schema - Complete configuration reference
Callbacks - Custom extraction configuration
Settings - Spider behavior settings
