When to Use
Use callbacks for:
- E-commerce (products, prices, ratings)
- Job boards (titles, companies, salaries)
- Real estate (properties, prices, features)
- Forums (posts, authors, replies)
- Any non-article structured data
Use parse_article for:
- News, blogs, documentation
- Content with title/content/author/date structure
CallbackSchema
Field extraction rules (mapping of field name → FieldExtractSchema). Min 1 field required.
FieldExtractSchema
- CSS selector for extraction. Syntax: "h1" (element), "h1::text" (text), "img::attr(src)" (attribute)
- XPath expression for extraction
- Extract all matches (returns list)
- Value transformations applied sequentially. Available: strip, replace, regex, cast, join, default, lowercase, parse_datetime
Nested Lists
- Field type. Use "nested_list" for extracting lists of objects.
- CSS selector for nested list items. Required when type: "nested_list".
- Nested extraction config (field name → FieldExtractSchema). Required when type: "nested_list". Max depth: 3 levels.
ProcessorSchema
Processor type: strip, replace, regex, cast, join, default, lowercase, parse_datetime
- replace: old, new (strings)
- regex: pattern (string), group (int, default: 1)
- cast: to ("int" | "float" | "bool" | "str")
- join: separator (string, default: " ")
- default: default (any)
- parse_datetime: format (string, optional)
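To make "applied sequentially" concrete, here is a minimal Python sketch of how a processor chain could behave. It is not the tool's actual implementation: the function name apply_processors, the "type" key, and the handling of failed regex matches are assumptions, and parse_datetime is left out for brevity.

```python
import re
from typing import Any


def apply_processors(value: Any, processors: list[dict]) -> Any:
    """Run each processor over the value in order (illustrative sketch only)."""
    casts = {"int": int, "float": float, "bool": bool, "str": str}
    for proc in processors:
        kind = proc["type"]  # assumed key name for the processor type
        # Assumption: a missing value skips everything except "default".
        if value is None and kind != "default":
            continue
        if kind == "strip":
            value = value.strip()
        elif kind == "lowercase":
            value = value.lower()
        elif kind == "replace":
            value = value.replace(proc["old"], proc["new"])
        elif kind == "regex":
            match = re.search(proc["pattern"], value)
            # Assumption: a failed match yields None.
            value = match.group(proc.get("group", 1)) if match else None
        elif kind == "cast":
            value = casts[proc["to"]](value)
        elif kind == "join":
            value = proc.get("separator", " ").join(value)
        elif kind == "default":
            if value is None or value == "":
                value = proc["default"]
        else:
            raise ValueError(f"processor not covered by this sketch: {kind}")
    return value


# Clean a raw price string step by step: strip, capture, drop commas, cast.
price_processors = [
    {"type": "strip"},
    {"type": "regex", "pattern": r"\$([\d,.]+)"},
    {"type": "replace", "old": ",", "new": ""},
    {"type": "cast", "to": "float"},
]
price = apply_processors("  $1,299.00 ", price_processors)
```

Order matters in a chain like this: casting to float before removing the comma would fail, which is why the replace step comes first.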
Examples
E-commerce Product
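As a rough illustration, a product callback might look like the sketch below, written as a Python dict that can be dumped to JSON. The "extract" and "processors" key names, the callback name parse_product, and every selector are assumptions made for this example. Check them against the Spider Schema before use.

```python
import json

# Hypothetical e-commerce callback. Key names ("extract", "processors")
# and all CSS selectors are assumptions for illustration only.
parse_product = {
    "extract": {
        "name": {
            "css": "h1.product-title::text",
            "processors": [{"type": "strip"}],
        },
        "price": {
            "css": "span.price::text",
            "processors": [
                {"type": "regex", "pattern": r"([\d.]+)"},
                {"type": "cast", "to": "float"},
            ],
        },
        "rating": {
            "css": "div.rating::attr(data-score)",
            "processors": [{"type": "cast", "to": "float"}],
        },
        "image": {"css": "img.main::attr(src)"},
    }
}

product_json = json.dumps(parse_product, indent=2)
```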
Job Listing
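A job listing could be sketched the same way, this time mixing a CSS selector, an XPath expression, and a default processor for a field that is often missing. The "extract" and "processors" key names, the callback name parse_job, and the selectors are hypothetical.

```python
import json

# Hypothetical job-listing callback. Key names ("extract", "processors")
# and all selectors are assumptions for illustration only.
parse_job = {
    "extract": {
        "title": {
            "css": "h1.job-title::text",
            "processors": [{"type": "strip"}],
        },
        "company": {"xpath": "//span[@class='company']/text()"},
        "salary": {
            "css": "div.salary::text",
            "processors": [
                {"type": "strip"},
                {"type": "default", "default": "Not specified"},
            ],
        },
    }
}

job_json = json.dumps(parse_job, indent=2)
```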
Forum with Nested Comments
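The sketch below shows one way a forum thread with posts and replies could be expressed using type, selector, and extract. The outer "extract" key, the callback name parse_thread, and the selectors are assumptions. The nesting_depth helper is also hypothetical, and counting one level per nested_list is only one reading of the 3-level limit.

```python
# Hypothetical forum callback: posts contain replies (two nested_list levels).
# The outer "extract" key and all selectors are assumptions.
parse_thread = {
    "extract": {
        "title": {"css": "h1.thread-title::text"},
        "posts": {
            "type": "nested_list",
            "selector": "div.post",
            "extract": {
                "author": {"css": "span.author::text"},
                "body": {"css": "div.body::text"},
                "replies": {
                    "type": "nested_list",
                    "selector": "div.reply",
                    "extract": {
                        "author": {"css": "span.author::text"},
                        "text": {"css": "p::text"},
                    },
                },
            },
        },
    }
}


def nesting_depth(fields: dict) -> int:
    """Count nested_list levels in a field mapping (one level per nested_list)."""
    depths = [
        1 + nesting_depth(spec["extract"])
        for spec in fields.values()
        if spec.get("type") == "nested_list"
    ]
    return max(depths, default=0)


depth = nesting_depth(parse_thread["extract"])
```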
Field Storage
Standard fields map to database columns: url, title, content, author, published_date
Custom fields are stored in the metadata_json column and flattened in exports
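The split between standard and custom fields can be pictured with a small Python sketch. The function split_item and the exact JSON encoding are assumptions. Only the column names and the metadata_json column come from the text above.

```python
import json

# Standard fields listed above; everything else is treated as custom.
STANDARD_FIELDS = {"url", "title", "content", "author", "published_date"}


def split_item(item: dict) -> dict:
    """Keep standard fields as columns and pack the rest into metadata_json."""
    row = {k: v for k, v in item.items() if k in STANDARD_FIELDS}
    custom = {k: v for k, v in item.items() if k not in STANDARD_FIELDS}
    row["metadata_json"] = json.dumps(custom, sort_keys=True)
    return row


row = split_item(
    {"url": "https://example.com/p/1", "title": "Widget", "price": 19.99, "rating": 4.5}
)
```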
Reserved Names
The following cannot be used as callback names: parse_article, parse_start_url, start_requests, from_crawler, closed, parse
Validation
Callback names: Must be valid Python identifiers (e.g., parse_product, not parse-product)
Field extraction: Must have a css or xpath selector, or type: "nested_list" with both selector and extract
Cross-validation: Rules must reference defined callbacks
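These checks can be approximated locally before importing a spider. The following Python sketch mirrors the name and field rules as described here. The helper names are hypothetical and the tool's real validator may differ in details, such as how Python keywords are treated.

```python
import keyword
from typing import Optional

# Reserved names taken from the list in this document.
RESERVED = {
    "parse_article", "parse_start_url", "start_requests",
    "from_crawler", "closed", "parse",
}


def callback_name_error(name: str) -> Optional[str]:
    """Return a reason the name is rejected, or None if it looks valid."""
    # Assumption: Python keywords are rejected along with non-identifiers.
    if not name.isidentifier() or keyword.iskeyword(name):
        return "not a valid Python identifier"
    if name in RESERVED:
        return "reserved name"
    return None


def field_is_valid(spec: dict) -> bool:
    """A field needs css/xpath, or nested_list with selector and extract."""
    if spec.get("type") == "nested_list":
        return bool(spec.get("selector")) and bool(spec.get("extract"))
    return bool(spec.get("css") or spec.get("xpath"))
```

For example, callback_name_error("parse-product") reports an invalid identifier, while callback_name_error("parse_product") returns None.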
Workflow
- Analyze page: scrapai analyze page.html
- Test selectors: scrapai analyze page.html --test "h1.title"
- Build callback config with processors
- Import spider: scrapai spiders import spider.json --project proj
- Test extraction: scrapai crawl spider --limit 5 --project proj
- View results: scrapai show spider 1 --project proj
Related
- Spider Schema - Complete configuration
- Rules - URL matching and routing
- Custom Extractors - CSS selector details