# SpiderRuleSchema
## Fields

### `allow`

Python regex patterns for URLs to allow (non-empty strings).

### `deny`

Python regex patterns for URLs to deny (takes precedence over `allow`).

### `restrict_xpaths`

Only follow links found in these XPath expressions.

### `restrict_css`

Only follow links found in these CSS selectors.

### `callback`

Callback function name for processing matched URLs. Must be a valid Python identifier and defined in the `callbacks` object.

Built-in callbacks:

- `parse_article` - Extract article content using configured extractors

Reserved names: `parse_article`, `parse_start_url`, `start_requests`, `from_crawler`, `closed`, `parse`

Use `null` for navigation-only rules (follow links but don't extract).

### `follow`

Whether to follow links matching this rule.

### `priority`

Rule priority (higher = processed first). Range: 0-1000.
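Putting the fields together, a complete rule might look like the following sketch (Python dict notation; all values are illustrative examples, not defaults from the schema):

```python
# Illustrative rule using every SpiderRuleSchema field described above;
# the URL patterns, selectors, and priority are examples only.
rule = {
    "allow": [r"/articles/\d+"],           # URL must match one of these regexes
    "deny": [r"/articles/\d+/comments"],   # deny takes precedence over allow
    "restrict_xpaths": ["//main"],         # only extract links inside <main>
    "restrict_css": ["article a"],         # ...or links matching this selector
    "callback": "parse_article",           # built-in article extractor
    "follow": True,                        # keep following matched links
    "priority": 100,                       # 0-1000, higher is processed first
}
```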
## Rule Matching
### Allow/Deny Precedence
- If URL matches any `deny` pattern → rejected
- If URL matches any `allow` pattern → accepted
- If no `allow` patterns are defined → accepted by default
- Otherwise → rejected
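The precedence above can be sketched as a small helper (a hypothetical function, not part of the schema or of Scrapy):

```python
import re

def url_allowed(url, allow=(), deny=()):
    """Minimal sketch of the allow/deny precedence described above."""
    if any(re.search(p, url) for p in deny):
        return False               # any deny match rejects the URL
    if any(re.search(p, url) for p in allow):
        return True                # any allow match accepts it
    return not allow               # no allow patterns -> accept by default
```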
### Restriction Scopes
- `restrict_xpaths` and `restrict_css` limit where links are extracted from
- Links outside these scopes are ignored, even if they match `allow` patterns
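For example, a navigation-only rule can combine both scopes; the selectors and XPath below are illustrative:

```python
# Links are extracted only from the listed page regions; everything
# else is ignored even when it matches `allow`.
nav_rule = {
    "allow": [r"/section/"],
    "restrict_css": ["nav.sidebar"],            # illustrative selector
    "restrict_xpaths": ["//div[@id='related']"],  # illustrative XPath
    "callback": None,   # navigation-only: nothing to extract
    "follow": True,
}
```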
## Examples
### News Site
### E-commerce
### Forum/Discussion
### Restrict Link Sources
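To keep the crawler out of headers, footers, and ad blocks, a rule can name the only regions links may come from (selectors and XPath below are illustrative):

```python
restricted_rule = {
    "allow": [r"/articles/"],
    "restrict_css": ["main.content", "ul.pagination"],  # illustrative selectors
    "restrict_xpaths": ["//div[@id='article-body']"],   # illustrative XPath
    "callback": "parse_article",
    "follow": True,
}
```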
## Common Patterns
### Exact Path Match
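Anchoring both ends of the pattern restricts it to one exact path (the URL below is illustrative):

```python
import re

# ^ and $ prevent the pattern from matching deeper paths
exact = r"^https://example\.com/about/$"

re.search(exact, "https://example.com/about/")        # matches
re.search(exact, "https://example.com/about/team/")   # no match
```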
### Exclude Query Parameters
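A `deny` pattern for any URL carrying a query string; note the `?` must be escaped because it is a regex metacharacter:

```python
import re

# Deny entry: rejects any URL containing "?"
deny_query = r"\?"
```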
### Match Numeric IDs
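A pattern for detail pages whose last path segment is a numeric ID (path shape is illustrative):

```python
import re

# \d+ matches the numeric ID; $ keeps sub-pages from matching
numeric_id = r"/article/\d+$"
```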
### Multiple Domains
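One pattern can cover several allowed hosts; anchoring at `^` keeps look-alike domains out (domains below are illustrative):

```python
import re

# Optional www. prefix, then exactly one of the allowed hosts
multi_domain = r"^https?://(www\.)?(site-a\.com|site-b\.org)/"
```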
### Pagination
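A pattern covering both path-style and query-style pagination (URL shapes are illustrative):

```python
import re

# Matches /page/3 in the path or page=3 in the query string
pagination = r"(/page/\d+|[?&]page=\d+)"
```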
## Rule Order
Rules are processed in array order. Use `priority` to control processing order within Scrapy.
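One way to picture this, assuming higher `priority` wins and array order breaks ties (a sketch, not the engine's actual implementation):

```python
# Stable sort: equal priorities keep their original array order
rules = [
    {"name": "articles", "priority": 100},
    {"name": "nav", "priority": 0},
    {"name": "products", "priority": 100},
]
ordered = sorted(rules, key=lambda r: -r.get("priority", 0))
# "articles" stays ahead of "products"; "nav" is processed last
```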
## Validation Errors
### Undefined Callback

The named callback does not exist. Define it in the `callbacks` object or use the built-in `parse_article`.
### Invalid Callback Name

Callback names must be valid Python identifiers, e.g. `parse_product`.
### Empty Patterns

Patterns in the `allow`/`deny` arrays must be non-empty strings.
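The three errors above can be sketched as a hypothetical validator (not the schema's actual implementation; `parse_article` is assumed to be the only built-in):

```python
def validate_rule(rule, defined_callbacks=("parse_article",)):
    """Collect the validation errors described above for one rule."""
    errors = []
    cb = rule.get("callback")
    if cb is not None:
        if not cb.isidentifier():
            errors.append("Invalid callback name: %r" % cb)
        elif cb not in defined_callbacks:
            errors.append("Undefined callback: %r" % cb)
    for field in ("allow", "deny"):
        if any(not p for p in rule.get(field, [])):
            errors.append("%s patterns must be non-empty strings" % field)
    return errors
```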
## Related
- Spider Schema - Complete configuration reference
- Callbacks - Custom extraction configuration
- Settings - Spider behavior settings