SpiderRuleSchema
Regex patterns for URLs to allow.
Validation:
- Must be list of non-empty strings
- Patterns are Python regex
Regex patterns for URLs to deny (takes precedence over allow).
Validation:
- Must be list of non-empty strings
- Patterns are Python regex
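The two validation rules for allow and deny lists can be sketched in a few lines. This is a minimal illustration, not the schema's actual validator; the function name validate_patterns and the error message wording are assumptions.

```python
import re


def validate_patterns(patterns):
    """Return a list of error messages for an allow/deny pattern list.

    Hypothetical helper: mirrors the two documented rules
    (list of non-empty strings, each a valid Python regex).
    """
    if not isinstance(patterns, list):
        return ["must be a list"]
    errors = []
    for index, pattern in enumerate(patterns):
        if not isinstance(pattern, str) or not pattern:
            errors.append(f"item {index}: must be a non-empty string")
            continue
        try:
            re.compile(pattern)  # raises re.error on invalid syntax
        except re.error as exc:
            errors.append(f"item {index}: invalid regex ({exc})")
    return errors
```

An empty result means the list passed both checks; each problem item produces one message.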
Only follow links found in these XPath expressions.
Only follow links found in these CSS selectors.
Callback function name for processing matched URLs.
Validation:
- Must be a valid Python identifier (regex: ^[a-zA-Z_][a-zA-Z0-9_]*$)
- Cannot be a reserved name
- Must be defined in the callbacks object (or use the built-in parse_article)
Built-in callbacks:
- parse_article - Extract article content using configured extractors
Reserved names: parse_article, parse_start_url, start_requests, from_crawler, closed, parse
Use null for navigation-only rules (follow links but don’t extract).
Whether to follow links matching this rule.
Common patterns:
- true - Follow and extract (e.g., category pages)
- false - Extract only (e.g., article pages, product pages)
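The callback name rules above (valid identifier, not reserved, defined in the callbacks object) might be checked roughly as follows. The function name check_callback and its return strings are hypothetical, and the sketch assumes that the built-in parse_article is accepted directly even though it also appears among the reserved names.

```python
import re

IDENTIFIER_RE = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]*$")
BUILTIN_CALLBACKS = {"parse_article"}
RESERVED_NAMES = {
    "parse_article", "parse_start_url", "start_requests",
    "from_crawler", "closed", "parse",
}


def check_callback(name, defined_callbacks):
    """Return None if the callback is acceptable, else an error string.

    `defined_callbacks` stands in for the names declared in the
    callbacks object (any container supporting `in`).
    """
    if name is None:
        return None  # navigation-only rule: follow links, no extraction
    if not isinstance(name, str) or not IDENTIFIER_RE.match(name):
        return "invalid callback name"
    if name in BUILTIN_CALLBACKS:
        return None  # assumption: built-ins bypass the reserved check
    if name in RESERVED_NAMES:
        return "reserved callback name"
    if name not in defined_callbacks:
        return "undefined callback"
    return None
```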
Rule priority (higher = processed first).
Validation:
- Min: 0
- Max: 1000
Use cases:
- Prioritize important pages
- Control crawl order
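A range check for the priority bounds could look like the sketch below. The helper name check_priority is hypothetical, and requiring an integer value is an assumption; the text above only states the minimum and maximum.

```python
def check_priority(value):
    """Return None if `value` is within the documented 0..1000 range."""
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        return "priority must be an integer"  # assumption, see lead-in
    if not 0 <= value <= 1000:
        return "priority must be between 0 and 1000"
    return None
```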
Rule Matching
Allow/Deny Precedence
- If URL matches any deny pattern → rejected
- If URL matches any allow pattern → accepted
- If no allow patterns defined → accepted by default
- Otherwise → rejected
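The four precedence steps translate directly into a small function. This sketch assumes patterns are applied with re.search (a match anywhere in the URL), which is how Scrapy's link extractor treats allow and deny; the function name url_allowed is hypothetical.

```python
import re


def url_allowed(url, allow=(), deny=()):
    """Apply the documented allow/deny precedence to a single URL."""
    if any(re.search(pattern, url) for pattern in deny):
        return False  # deny always wins
    if any(re.search(pattern, url) for pattern in allow):
        return True
    # No allow patterns: accept by default. Otherwise: reject.
    return not allow
```

For example, with allow set to /news/ and deny set to /news/comments, a comments URL is rejected even though it also matches the allow pattern.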
Restriction Scopes
- restrict_xpaths and restrict_css limit where links are extracted from
- Links outside these scopes are ignored, even if they match allow patterns
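To illustrate scoping, the sketch below collects links only from regions selected by an XPath expression. It uses the standard library's xml.etree.ElementTree, which supports a limited XPath subset and needs well-formed markup; a real crawler would rely on its HTML parser instead. The markup, the function name links_in_scope, and the element ids are invented for the example.

```python
import xml.etree.ElementTree as ET

PAGE = """
<html><body>
  <div id="content"><a href="/news/articles/1">Article</a></div>
  <div id="sidebar"><a href="/news/articles/2">Promo</a></div>
</body></html>
""".strip()


def links_in_scope(markup, scope_xpath):
    """Return hrefs of <a> elements found inside the scoped regions."""
    root = ET.fromstring(markup)
    return [
        anchor.get("href")
        for region in root.findall(scope_xpath)
        for anchor in region.iter("a")
    ]


# Only the link inside the content region is collected; the sidebar
# link is skipped even though its URL looks like an article.
scoped_links = links_in_scope(PAGE, ".//div[@id='content']")
```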
Examples
News Site
- Article pages (/news/articles/*) → Extract content, don’t follow links
- Category pages (/news/, /sport/) → Follow links, don’t extract
- Comments sections → Ignored
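A rule set with this behavior might be written as below, shown as Python dictionaries rather than the configuration file format. The key names callback and follow, the regex forms of the paths, and the /comments deny pattern are assumptions made for illustration.

```python
import re

news_rules = [
    # Article pages: extract content, do not follow links
    {"allow": [r"/news/articles/.+"], "deny": [r"/comments"],
     "callback": "parse_article", "follow": False},
    # Category pages: follow links, no extraction
    {"allow": [r"/news/$", r"/sport/$"], "deny": [r"/comments"],
     "callback": None, "follow": True},
]


def first_matching_rule(url, rules):
    """Return the first rule that accepts `url`, or None if ignored."""
    for rule in rules:
        if any(re.search(p, url) for p in rule.get("deny", [])):
            continue
        allow = rule.get("allow", [])
        if not allow or any(re.search(p, url) for p in allow):
            return rule
    return None
```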
E-commerce
- Product pages (/product/xyz) → Extract with custom callback, don’t follow
- Listing pages → Follow links to discover products
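The sketch below pairs a product rule with a custom callback and checks that every referenced callback is declared. The callback name parse_product, the /category/ listing path, the key names callback and follow, and the empty callback definition are all placeholders; the real structure of the callbacks object is described on the Callbacks page.

```python
import re

shop_config = {
    "rules": [
        # Product pages: custom callback, do not follow
        {"allow": [r"/product/[^/?#]+$"], "callback": "parse_product",
         "follow": False},
        # Listing pages: follow links to discover products
        {"allow": [r"/category/"], "callback": None, "follow": True},
    ],
    # Placeholder definition; contents intentionally left empty.
    "callbacks": {"parse_product": {}},
}


def undefined_callbacks(config):
    """List callbacks used by rules but neither built-in nor declared."""
    defined = set(config.get("callbacks", {})) | {"parse_article"}
    return [
        rule["callback"]
        for rule in config["rules"]
        if rule["callback"] is not None and rule["callback"] not in defined
    ]


product_pattern = shop_config["rules"][0]["allow"][0]
is_product = bool(re.search(product_pattern, "https://shop.example/product/xyz"))
```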
Forum/Discussion
- Discussion threads → Extract
- Vote/reply/user pages → Ignored
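For a forum, most of the work is done by deny patterns. The URL shapes below (/thread/, /vote, /reply, /user/) are invented, since the text does not give concrete paths, and the key name callback is an assumption.

```python
import re

forum_rule = {
    "allow": [r"/thread/\d+"],               # discussion threads
    "deny": [r"/vote", r"/reply", r"/user/"],  # ignored page types
    "callback": "parse_article",
}


def decide(url, rule):
    """Return 'extract' or 'ignored' for a URL under a single rule."""
    if any(re.search(p, url) for p in rule["deny"]):
        return "ignored"
    if any(re.search(p, url) for p in rule["allow"]):
        return "extract"
    return "ignored"
```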
Restrict Link Sources
- Only follow article links from main content and pagination
- Ignore sidebar, footer, and navigation links
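A rule of this kind could be expressed with either selector type, as sketched below. The selectors, the /articles/ path, and the follow key are hypothetical and depend entirely on the target site's markup; the two variants are similar in intent but not guaranteed to select identical elements.

```python
# Scope link extraction with CSS selectors
rule_with_css = {
    "allow": [r"/articles/"],
    "restrict_css": ["main .article-list", "nav.pagination"],
    "follow": True,
}

# A similar scope written as XPath expressions
rule_with_xpath = {
    "allow": [r"/articles/"],
    "restrict_xpaths": [
        "//main//div[@class='article-list']",
        "//nav[@class='pagination']",
    ],
    "follow": True,
}
```

Links in the sidebar, footer, or site navigation fall outside both scopes, so they are never considered, regardless of the allow pattern.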
Common Patterns
Exact Path Match
Exclude Query Parameters
Match Numeric IDs
Multiple Domains
Pagination
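The five pattern types listed under Common Patterns could look like the following. These regexes and URLs are illustrative guesses, not the originals, and they assume patterns are applied with re.search. The query-parameter pattern is meant for a deny list, so its "match" URL is the one that would be excluded.

```python
import re

# name -> (pattern, URL expected to match, URL expected not to match)
COMMON_PATTERNS = {
    "exact_path": (r"^https?://[^/]+/about/?$",
                   "https://example.com/about",
                   "https://example.com/about/team"),
    "query_params_for_deny": (r"\?",
                              "https://example.com/news/1?utm=x",
                              "https://example.com/news/1"),
    "numeric_id": (r"/article/\d+/?$",
                   "https://example.com/article/12345",
                   "https://example.com/article/about"),
    "multiple_domains": (r"^https?://(www\.)?(example\.com|example\.org)/",
                         "https://www.example.org/page",
                         "https://example.net/page"),
    "pagination": (r"[?&]page=\d+",
                   "https://example.com/news?page=2",
                   "https://example.com/news?homepage=2"),
}


def pattern_behaves(name):
    """Check that a pattern matches its hit URL and not its miss URL."""
    pattern, hit, miss = COMMON_PATTERNS[name]
    return bool(re.search(pattern, hit)) and not re.search(pattern, miss)
```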
Rule Order
Rules are processed in the order defined in the rules array. Use priority to control processing order within Scrapy.
Validation Errors
Undefined Callback
Define the callback in the callbacks object or use parse_article.
Invalid Callback Name
Use a valid Python identifier such as parse_product.
Empty Patterns
Remove empty strings from the allow/deny arrays.
Related
- Spider Schema - Complete configuration reference
- Callbacks - Custom extraction configuration
- Settings - Spider behavior settings