Rules control which URLs are followed and how they are processed. Each rule defines URL patterns and optional callbacks for extraction.

SpiderRuleSchema

allow
string[]
default:"null"
Regex patterns for URLs to allow.
Validation:
  • Must be a list of non-empty strings
  • Patterns are Python regular expressions
Example:
"allow": ["/news/articles/.*", "/blog/.*"]
deny
string[]
default:"null"
Regex patterns for URLs to deny. Deny patterns take precedence over allow patterns.
Validation:
  • Must be a list of non-empty strings
  • Patterns are Python regular expressions
Example:
"deny": ["/news/articles/.*#comments", ".*\\?page=.*"]
restrict_xpaths
string[]
default:"null"
Only follow links found within elements matched by these XPath expressions.
Example:
"restrict_xpaths": ["//div[@class='main-content']//a"]
restrict_css
string[]
default:"null"
Only follow links found within elements matched by these CSS selectors.
Example:
"restrict_css": ["div.article-list a", "nav.pagination a"]
callback
string
default:"null"
Name of the callback function that processes matched URLs.
Validation:
  • Must be a valid Python identifier (regex: ^[a-zA-Z_][a-zA-Z0-9_]*$)
  • Custom callbacks cannot use a reserved name
  • Must be defined in the callbacks object, or be the built-in parse_article
Built-in callbacks:
  • parse_article - Extract article content using configured extractors
Reserved names (cannot be used for custom callbacks):
  • parse_article, parse_start_url, start_requests, from_crawler, closed, parse
Example:
"callback": "parse_product"
Use null for navigation-only rules (follow links but don’t extract).
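The naming rules above can be sketched as a small check. This is only an illustration, not the validator the schema actually uses: the function name `check_custom_callback_name` is hypothetical, the wording of the reserved-name message is invented, and it assumes the reserved-name restriction applies to callbacks you define yourself.

```python
import re
from typing import Optional

# Identifier regex and reserved names are copied from the field description.
CALLBACK_NAME_RE = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]*$")
RESERVED_NAMES = {
    "parse_article", "parse_start_url", "start_requests",
    "from_crawler", "closed", "parse",
}

def check_custom_callback_name(name: str) -> Optional[str]:
    """Return an error message for a custom callback name, or None if it is acceptable."""
    if not CALLBACK_NAME_RE.match(name):
        return f"Invalid callback name: '{name}'. Must be a valid Python identifier."
    if name in RESERVED_NAMES:
        # Hypothetical wording; the real validator's message may differ.
        return f"Callback name '{name}' is reserved."
    return None

for candidate in ("parse_product", "parse-product", "parse"):
    print(candidate, "->", check_custom_callback_name(candidate))
```

A hyphenated name fails the identifier check, while `parse` passes it but is rejected as reserved.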
follow
boolean
default:"true"
Whether to follow links matching this rule.
Common patterns:
  • true - Keep following links found on matched pages (e.g., category pages)
  • false - Process the matched page without following its links (e.g., article pages, product pages)
Example:
"follow": false
priority
integer
default:"0"
Rule priority (higher = processed first).
Validation:
  • Min: 0
  • Max: 1000
Use cases:
  • Prioritize important pages
  • Control crawl order
Example:
"priority": 10

Rule Matching

Allow/Deny Precedence

  1. If URL matches any deny pattern → rejected
  2. If URL matches any allow pattern → accepted
  3. If no allow patterns defined → accepted by default
  4. Otherwise → rejected
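The four steps above can be sketched as a single function. This is a minimal illustration, assuming patterns are applied with `re.search` (a match anywhere in the URL); the function name `url_is_accepted` is hypothetical and not part of the schema.

```python
import re
from typing import Optional, Sequence

def url_is_accepted(
    url: str,
    allow: Optional[Sequence[str]] = None,
    deny: Optional[Sequence[str]] = None,
) -> bool:
    """Apply the allow/deny precedence steps to a single URL."""
    # Step 1: any deny match rejects the URL, whatever allow says.
    if deny and any(re.search(pattern, url) for pattern in deny):
        return False
    # Steps 2 and 4: with allow patterns defined, at least one must match.
    if allow:
        return any(re.search(pattern, url) for pattern in allow)
    # Step 3: no allow patterns means the URL is accepted by default.
    return True

allow = ["/news/articles/.*"]
deny = ["/news/articles/.*#comments"]
print(url_is_accepted("https://example.com/news/articles/42", allow, deny))
print(url_is_accepted("https://example.com/news/articles/42#comments", allow, deny))
```

The second URL matches both lists and is still rejected, because the deny check runs first.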

Restriction Scopes

  • restrict_xpaths and restrict_css limit where links are extracted from
  • Links outside these scopes are ignored, even if they match allow patterns
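To make the scoping behaviour concrete, the sketch below first collects links inside a restricted region and only then applies the allow pattern. It uses the standard library's `xml.etree.ElementTree` as a stand-in for the real selector engine, which it is not: ElementTree supports only a subset of XPath and needs well-formed markup. The sample page and the `links_to_follow` helper are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# A tiny, well-formed sample page (hypothetical content).
PAGE = """
<html><body>
  <div class="main-content">
    <p><a href="/article/first">First</a></p>
    <a href="/about">About</a>
  </div>
  <div class="sidebar">
    <a href="/article/second">Second</a>
  </div>
</body></html>
""".strip()

def links_to_follow(page, scope_xpath, allow):
    """Collect hrefs inside the scope, then keep those matching an allow pattern."""
    root = ET.fromstring(page)
    scoped = [a.get("href") for a in root.findall(scope_xpath)]
    return [href for href in scoped
            if href and any(re.search(pattern, href) for pattern in allow)]

# ElementTree requires a leading "." on the path; a full XPath engine would
# accept "//div[@class='main-content']//a" as written in the config.
followed = links_to_follow(PAGE, ".//div[@class='main-content']//a", ["/article/.*"])
print(followed)  # ['/article/first']
```

`/article/second` matches the allow pattern but sits in the sidebar, so it is never considered; `/about` is in scope but fails the allow pattern.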

Examples

News Site

{
  "rules": [
    {
      "allow": ["/news/articles/.*"],
      "deny": ["/news/articles/.*#comments"],
      "callback": "parse_article",
      "follow": false
    },
    {
      "allow": ["/news/", "/sport/"],
      "callback": null,
      "follow": true
    }
  ]
}
Behavior:
  • Article pages (/news/articles/*) → Extract content, don’t follow links
  • Category pages (/news/, /sport/) → Follow links, don’t extract
  • Comments sections → Ignored

E-commerce

{
  "rules": [
    {
      "allow": ["/product/[^/]+$"],
      "callback": "parse_product",
      "follow": false,
      "priority": 5
    },
    {
      "allow": ["/products", "/category/"],
      "callback": null,
      "follow": true
    }
  ]
}
Behavior:
  • Product pages (/product/xyz) → Extract with custom callback, don’t follow
  • Listing pages → Follow links to discover products

Forum/Discussion

{
  "rules": [
    {
      "allow": ["/item\\?id=\\d+"],
      "deny": ["/vote", "/reply", "/user"],
      "callback": "parse_discussion",
      "follow": false
    }
  ]
}
Behavior:
  • Discussion threads → Extract
  • Vote/reply/user pages → Ignored

Restricted Link Extraction

{
  "rules": [
    {
      "allow": ["/article/.*"],
      "restrict_css": ["div.main-content a", "nav.pagination a"],
      "callback": "parse_article"
    }
  ]
}
Behavior:
  • Only follow article links from main content and pagination
  • Ignore sidebar, footer, and navigation links

Common Patterns

Exact Path Match

"allow": ["/about$", "/contact$"]

Exclude Query Parameters

"deny": [".*\\?.*"]

Match Numeric IDs

"allow": ["/product/\\d+$", "/job/\\d+$"]

Multiple Domains

{
  "allowed_domains": ["example.com", "blog.example.com"],
  "rules": [
    {"allow": ["^https://example\\.com/products/.*"]},
    {"allow": ["^https://blog\\.example\\.com/posts/.*"]}
  ]
}
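The doubled backslashes in these patterns are JSON escaping, not part of the regex. The sketch below loads the same config with `json.loads` and shows what the regex engine receives; it assumes patterns are applied to the full URL with `re.search`, and the `matches` helper is hypothetical.

```python
import json
import re

# The config as written on disk; the raw string keeps the doubled
# backslashes exactly as they appear in the JSON file.
raw_config = r'''
{
  "allowed_domains": ["example.com", "blog.example.com"],
  "rules": [
    {"allow": ["^https://example\\.com/products/.*"]},
    {"allow": ["^https://blog\\.example\\.com/posts/.*"]}
  ]
}
'''

config = json.loads(raw_config)
shop_pattern = config["rules"][0]["allow"][0]

# JSON decoding collapses each "\\" to a single "\", so the regex engine
# sees "\." (a literal dot) rather than "." (any character).
print(shop_pattern)

def matches(pattern: str, url: str) -> bool:
    # Assumption: the pattern may match anywhere in the URL unless anchored.
    return re.search(pattern, url) is not None

print(matches(shop_pattern, "https://example.com/products/42"))
print(matches(shop_pattern, "https://blog.example.com/products/42"))
```

Because the pattern starts with `^https://example\.com`, the blog subdomain does not match the first rule even though both hosts are in `allowed_domains`.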

Pagination

{
  "allow": ["/page/\\d+$"],
  "callback": null,
  "follow": true
}
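A quick way to check what a pattern like this accepts is to run it against sample URLs. The sketch assumes `re.search` semantics, so the pattern can match anywhere in the URL and only the trailing `$` pins it to the end; the URLs and the `is_pagination_url` helper are hypothetical.

```python
import re

# The JSON value "/page/\\d+$" becomes this regex after decoding.
PAGINATION = r"/page/\d+$"

def is_pagination_url(url: str) -> bool:
    """True when the URL ends with /page/<digits>."""
    return re.search(PAGINATION, url) is not None

candidates = [
    "https://example.com/blog/page/2",           # ends with /page/2
    "https://example.com/blog/page/2/",          # trailing slash breaks the "$" anchor
    "https://example.com/blog/page/2?sort=asc",  # query string follows the digits
    "https://example.com/blog/page/latest",      # not numeric
]
for url in candidates:
    print(url, is_pagination_url(url))
```

Only the first URL matches. If a site emits trailing slashes or query strings on pagination links, the pattern would need to allow for them.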

Rule Order

Rules are evaluated in the order they appear in the rules array. Use priority to influence the order in which Scrapy processes the matched URLs (higher values are processed first).
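One way to picture order-dependent evaluation is a first-match lookup over the rules array. The sketch below assumes that when a URL satisfies several rules, the earliest one in the array handles it, which is how Scrapy's CrawlSpider treats its own rules; whether this schema behaves identically is an assumption. Priority is not modelled, and `first_matching_rule` is a hypothetical helper.

```python
import re
from typing import Optional

# Rules taken from the News Site example.
RULES = [
    {"allow": ["/news/articles/.*"], "deny": ["/news/articles/.*#comments"],
     "callback": "parse_article", "follow": False},
    {"allow": ["/news/", "/sport/"], "callback": None, "follow": True},
]

def first_matching_rule(url: str, rules: list) -> Optional[int]:
    """Return the index of the first rule that accepts the URL, or None."""
    for index, rule in enumerate(rules):
        deny = rule.get("deny") or []
        allow = rule.get("allow") or []
        if any(re.search(pattern, url) for pattern in deny):
            continue  # denied by this rule; try the next one
        if not allow or any(re.search(pattern, url) for pattern in allow):
            return index
    return None

for url in ("https://example.com/news/articles/budget",
            "https://example.com/sport/",
            "https://example.com/weather/"):
    print(url, "->", first_matching_rule(url, RULES))
```

The article URL also matches `/news/` in the second rule, but under the first-match assumption the first rule handles it, so `parse_article` runs and the page's links are not followed.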

Validation Errors

Undefined Callback

Rule 0 references undefined callback: 'parse_product'. 
Defined callbacks: parse_article
Fix: Add the callback to the callbacks object, or use the built-in parse_article

Invalid Callback Name

Invalid callback name: 'parse-product'. 
Must be a valid Python identifier.
Fix: Use underscores instead of hyphens: parse_product

Empty Patterns

Patterns must be non-empty strings
Fix: Remove empty strings from allow/deny arrays
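The sketch below shows how a pre-flight check could surface the undefined-callback and empty-pattern errors before a crawl starts. It is not the schema's actual validator: `validate_rules` is a hypothetical helper, the callbacks object is assumed to be a mapping keyed by callback name, and the messages are modelled on the ones shown above.

```python
BUILT_IN_CALLBACKS = {"parse_article"}

def validate_rules(rules, callbacks):
    """Return a list of error messages for a rules array (empty if valid)."""
    errors = []
    defined = set(callbacks) | BUILT_IN_CALLBACKS
    for index, rule in enumerate(rules):
        # Empty Patterns: allow/deny must be lists of non-empty strings.
        for key in ("allow", "deny"):
            patterns = rule.get(key)
            if patterns is None:
                continue
            if not isinstance(patterns, list) or any(
                not isinstance(p, str) or not p for p in patterns
            ):
                errors.append("Patterns must be non-empty strings")
        # Undefined Callback: null is fine (navigation-only rule).
        callback = rule.get("callback")
        if callback is not None and callback not in defined:
            errors.append(
                f"Rule {index} references undefined callback: '{callback}'. "
                f"Defined callbacks: {', '.join(sorted(defined))}"
            )
    return errors

problems = validate_rules(
    [{"allow": ["/product/[^/]+$", ""], "callback": "parse_product"}],
    callbacks={},
)
for message in problems:
    print(message)
```

Collecting every message instead of stopping at the first lets a config with several mistakes be fixed in one pass.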