`scrapai spiders import <file> --project <name>`
Root Schema
Spider identifier (alphanumeric, underscores, hyphens only). Validation:
- Min length: 1
- Max length: 255
- Pattern: `^[a-zA-Z0-9_-]+$`

Examples: `"bbc_co_uk"`, `"example_shop"`

Original website URL. Validation:
- Must use `http://` or `https://` scheme
- Max length: 2048 characters
- No localhost or private IPs (SSRF protection)

Example: `"https://www.bbc.co.uk/"`

List of domains the spider can crawl. Validation:
- Min items: 1
- Valid domain format
- No localhost or private domains

Initial URLs to crawl from. Validation:
- Min items: 1
- Must use HTTP/HTTPS
- No localhost or private IPs
- Max 2048 chars per URL

Named callback extraction configurations. Keys must be valid Python identifiers. See Callbacks for the schema.

Reserved names (cannot be used): `parse_article`, `parse_start_url`, `start_requests`, `from_crawler`, `closed`, `parse`
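The name constraints above can be checked in a few lines. This is an illustrative sketch of the documented rules, not scrapai's own validator:

```python
import re

# Pattern and length limits taken from the schema description above
NAME_PATTERN = re.compile(r"^[a-zA-Z0-9_-]+$")

def is_valid_spider_name(name: str) -> bool:
    """Apply the documented constraints: 1-255 chars, ^[a-zA-Z0-9_-]+$."""
    return 1 <= len(name) <= 255 and NAME_PATTERN.fullmatch(name) is not None
```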
Complete Example
News Site (BBC)
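A sketch of what a complete news-site config might look like. Note the top-level key names (`name`, `url`, `allowed_domains`, `start_urls`, `rules`) and the rule key `allow` are assumed for illustration; only `callback` and the `callbacks` object are confirmed by this page. The built-in `parse_article` callback is used, so no custom callbacks are needed:

```json
{
  "name": "bbc_co_uk",
  "url": "https://www.bbc.co.uk/",
  "allowed_domains": ["bbc.co.uk"],
  "start_urls": ["https://www.bbc.co.uk/news"],
  "rules": [
    {"allow": "/news/articles/", "callback": "parse_article"},
    {"allow": "/news/topics/", "callback": null}
  ]
}
```

The second rule uses `"callback": null`, so it only follows links without extracting content.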
E-commerce Site
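A sketch of an e-commerce config under the same assumed key names (`name`, `url`, `allowed_domains`, `start_urls`, `rules`, `allow`). The `parse_product` callback and its body are placeholders, left empty here because the callback schema is documented separately under Callbacks:

```json
{
  "name": "example_shop",
  "url": "https://shop.example.com/",
  "allowed_domains": ["shop.example.com"],
  "start_urls": ["https://shop.example.com/catalog"],
  "rules": [
    {"allow": "/product/", "callback": "parse_product"},
    {"allow": "/catalog", "callback": null}
  ],
  "callbacks": {
    "parse_product": {}
  }
}
```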
Validation Rules
Cross-Field Validation
Rule callbacks must be defined:
- If a rule references a callback name, that callback must exist in the `callbacks` object
- The built-in callback `parse_article` is always available
- Use `"callback": null` for navigation-only rules
Security Validation
SSRF Protection:
- URLs cannot point to localhost (`localhost`, `127.0.0.1`, `0.0.0.0`, `::1`)
- URLs cannot point to private IP ranges
- DNS resolution checked for private IPs
- Spider names sanitized (alphanumeric + underscore/hyphen only)
- Callback names must be valid Python identifiers
- Reserved names blocked
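The URL-level checks above can be sketched as follows. This is an illustration of the documented behavior (scheme check, length limit, localhost rejection, DNS resolution against private ranges), not scrapai's actual code:

```python
import ipaddress
import socket
from urllib.parse import urlparse

def is_url_allowed(url: str) -> bool:
    """Reject URLs that fail the documented SSRF checks (sketch only)."""
    if len(url) > 2048:
        return False
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    host = parsed.hostname or ""
    # Literal localhost forms are rejected before DNS resolution
    if host in ("localhost", "127.0.0.1", "0.0.0.0", "::1"):
        return False
    # Resolve the hostname and reject private/loopback/link-local addresses
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return False
    for info in infos:
        addr = ipaddress.ip_address(info[4][0])
        if addr.is_private or addr.is_loopback or addr.is_link_local:
            return False
    return True
```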
Import Command
Spider config files are imported with `scrapai spiders import <file> --project <name>`.
Related
- Spider Rules - URL matching configuration
- Spider Settings - Available settings
- Callbacks - Custom field extraction
- Extractors Overview - Content extraction strategies