When to Use
Use callbacks for:
- E-commerce (products, prices, ratings)
- Job boards (titles, companies, salaries)
- Real estate (properties, prices, features)
- Forums (posts, authors, replies)
- Any non-article structured data
Use parse_article for:
- News, blogs, documentation
- Content with title/content/author/date structure
CallbackSchema
Field extraction rules (mapping of field name → FieldExtractSchema).
Validation:
- Min 1 field required
- Each key is the field name
- Each value is a FieldExtractSchema
FieldExtractSchema
CSS selector for extraction.
Syntax:
- "h1" - Element
- "h1::text" - Text content
- "img::attr(src)" - Attribute value
- "div.class > p" - Nested selector
XPath expression for extraction.
Extract all matches (returns list).
Example output:
["WiFi", "Bluetooth", "GPS"]
Value transformations to apply sequentially.
Available processors:
- strip - Remove whitespace
- replace - Replace substring
- regex - Extract with pattern
- cast - Convert type (int, float, bool, str)
- join - Join list to string
- default - Fallback value
- lowercase - Convert to lowercase
- parse_datetime - Parse dates
Nested Lists
Field type (use "nested_list" for extracting lists of objects).
CSS selector for nested list items.
Required when: type: "nested_list"
Nested extraction config (field name → FieldExtractSchema).
Required when: type: "nested_list"
Max depth: 3 levels
ProcessorSchema
Processor type.
Allowed values: strip, replace, regex, cast, join, default, lowercase, parse_datetime
Parameters by processor type:
- replace: old (string), new (string)
- regex: pattern (string), group (int, default: 1)
- cast: to ("int" | "float" | "bool" | "str")
- join: separator (string, default: " ")
- default: default (any)
- parse_datetime: format (string, optional)
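To make "applied sequentially" concrete, here is a minimal Python sketch of a processor pipeline. It is not the tool's implementation: the function name `apply_processors` is hypothetical, the `type` key for the processor type is an assumption, and the exact semantics (when `default` fires, how `bool` is cast, how dates parse without a `format`) are guesses. Only the parameter names (`old`, `new`, `pattern`, `group`, `to`, `separator`, `default`, `format`) come from the list above.

```python
import re
from datetime import datetime


def apply_processors(value, processors):
    """Apply each processor to the value in order (illustrative semantics only)."""
    for proc in processors:
        kind = proc["type"]  # assumed key name for the processor type
        if kind == "default":
            # Assumption: the fallback is used when nothing was extracted.
            if value is None or value == "" or value == []:
                value = proc["default"]
        elif value is None:
            continue  # nothing to transform; a later "default" may still apply
        elif kind == "strip":
            value = value.strip()
        elif kind == "replace":
            value = value.replace(proc["old"], proc["new"])
        elif kind == "regex":
            match = re.search(proc["pattern"], value)
            value = match.group(proc.get("group", 1)) if match else None
        elif kind == "cast":
            if proc["to"] == "bool":
                # Assumption: a small set of truthy strings maps to True.
                value = str(value).strip().lower() in {"true", "1", "yes"}
            else:
                value = {"int": int, "float": float, "str": str}[proc["to"]](value)
        elif kind == "join":
            value = proc.get("separator", " ").join(value)
        elif kind == "lowercase":
            value = value.lower()
        elif kind == "parse_datetime":
            fmt = proc.get("format")
            # Assumption: ISO 8601 input when no format is given.
            value = datetime.strptime(value, fmt) if fmt else datetime.fromisoformat(value)
        else:
            raise ValueError(f"unknown processor type: {kind}")
    return value


price = apply_processors(
    "  $1,299.00 ",
    [
        {"type": "strip"},
        {"type": "replace", "old": ",", "new": ""},
        {"type": "regex", "pattern": r"\$(\d+\.\d+)"},
        {"type": "cast", "to": "float"},
    ],
)
print(price)  # 1299.0
```

Order matters in this sketch: the `cast` step only succeeds because `replace` and `regex` have already reduced the string to a plain number.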
Examples
E-commerce Product
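A minimal sketch of a product callback, written as a Python dict that mirrors the JSON config. The selectors are hypothetical, and the `extract`, `processors`, and processor `type` key names are assumptions rather than confirmed schema.

```python
import json

# Hypothetical product-page callback. Selectors are made up, and the
# "extract", "processors", and processor "type" keys are assumed names.
product_callback = {
    "extract": {
        "title": {
            "css": "h1.product-title::text",
            "processors": [{"type": "strip"}],
        },
        "price": {
            "css": "span.price::text",
            "processors": [
                {"type": "regex", "pattern": r"([\d.]+)"},
                {"type": "cast", "to": "float"},
            ],
        },
        "rating": {
            "css": "div.rating::attr(data-score)",
            "processors": [{"type": "cast", "to": "float"}],
        },
    }
}

# Serialize to the JSON form a spider config file would hold.
print(json.dumps(product_callback, indent=2))
```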
Job Listing
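A minimal sketch of a job-listing callback that mixes a CSS selector with an XPath expression. As before, the selectors are hypothetical and the `extract`, `processors`, and processor `type` key names are assumptions.

```python
import json

# Hypothetical job-listing callback. Selectors are made up, and the
# "extract", "processors", and processor "type" keys are assumed names.
job_callback = {
    "extract": {
        "title": {
            "css": "h1.job-title::text",
            "processors": [{"type": "strip"}],
        },
        "company": {
            "xpath": "//span[@class='company']/text()",
            "processors": [{"type": "strip"}],
        },
        "salary": {
            "css": "span.salary::text",
            "processors": [
                {"type": "replace", "old": ",", "new": ""},
                {"type": "regex", "pattern": r"(\d+)"},
                {"type": "cast", "to": "int"},
            ],
        },
    }
}

# Serialize to the JSON form a spider config file would hold.
print(json.dumps(job_callback, indent=2))
```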
Forum with Nested Comments
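A minimal sketch of a forum callback that uses `type: "nested_list"` with `selector` and `extract`, nested two levels deep (comments, then replies). The selectors are hypothetical, and the top-level `extract` key, the `processors` key, and the processor `type` key are assumptions.

```python
import json

# Hypothetical forum-thread callback. Each nested_list pairs a "selector"
# for the repeated item with an "extract" mapping applied inside each item.
forum_callback = {
    "extract": {
        "title": {"css": "h1.thread-title::text"},
        "author": {"css": "div.original-post .author::text"},
        "comments": {
            "type": "nested_list",
            "selector": "div.comment",
            "extract": {
                "author": {"css": ".comment-author::text"},
                "content": {
                    "css": ".comment-body::text",
                    "processors": [{"type": "strip"}],
                },
                "replies": {
                    "type": "nested_list",
                    "selector": "div.reply",
                    "extract": {
                        "author": {"css": ".reply-author::text"},
                        "content": {"css": ".reply-body::text"},
                    },
                },
            },
        },
    }
}

# Serialize to the JSON form a spider config file would hold.
print(json.dumps(forum_callback, indent=2))
```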
Field Storage
Standard Fields
Map to database columns:
- url → scraped_items.url
- title → scraped_items.title
- content → scraped_items.content
- author → scraped_items.author
- published_date → scraped_items.published_date
Custom Fields
Stored in metadata_json column:
- price, rating, category, etc.
- Displayed in show command
- Flattened in exports (CSV/JSON/JSONL)
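The split between standard and custom fields can be pictured roughly as follows. This is a sketch, not the tool's storage code: the helper name `split_for_storage` is hypothetical, and it assumes that missing standard fields are stored as empty values and that custom fields are serialized as a JSON object.

```python
import json

# Standard field names, taken from the column mapping above.
STANDARD_FIELDS = ("url", "title", "content", "author", "published_date")


def split_for_storage(item):
    """Separate standard columns from custom fields serialized as JSON."""
    columns = {name: item.get(name) for name in STANDARD_FIELDS}
    custom = {k: v for k, v in item.items() if k not in STANDARD_FIELDS}
    columns["metadata_json"] = json.dumps(custom, sort_keys=True)
    return columns


row = split_for_storage(
    {"url": "https://example.com/p/1", "title": "Widget", "price": 19.99, "rating": 4.5}
)
print(row["metadata_json"])  # {"price": 19.99, "rating": 4.5}
```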
Reserved Names
Cannot use as callback names:
- parse_article
- parse_start_url
- start_requests
- from_crawler
- closed
- parse
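A pre-import check against this list can be sketched in a few lines of Python. The helper name `check_callback_name` and the error message are hypothetical; only the reserved names themselves come from the list above.

```python
# Reserved names, copied from the list above.
RESERVED_CALLBACK_NAMES = frozenset({
    "parse_article",
    "parse_start_url",
    "start_requests",
    "from_crawler",
    "closed",
    "parse",
})


def check_callback_name(name):
    """Return the name unchanged, or raise ValueError if it is reserved."""
    if name in RESERVED_CALLBACK_NAMES:
        raise ValueError(f"'{name}' is reserved and cannot be used as a callback name")
    return name


print(check_callback_name("parse_product"))  # parse_product
```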
Validation
Callback Name
Field Extraction
Must have either:
- css or xpath selector, OR
- type: "nested_list" with selector and extract
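The rule above, together with the three-level nesting limit, can be sketched as a recursive check. This is an illustration rather than the actual validator: `validate_field` is a hypothetical helper, and the way depth is counted (the outermost nested_list is level 1) is an assumption.

```python
MAX_NESTED_DEPTH = 3  # "Max depth: 3 levels"; how levels are counted is assumed


def validate_field(name, field, depth=1):
    """Return a list of error strings for one field config."""
    errors = []
    if field.get("type") == "nested_list":
        if depth > MAX_NESTED_DEPTH:
            errors.append(f"{name}: nested_list exceeds max depth {MAX_NESTED_DEPTH}")
        if not field.get("selector"):
            errors.append(f"{name}: nested_list requires selector")
        nested = field.get("extract")
        if not nested:
            errors.append(f"{name}: nested_list requires extract")
        else:
            # Each nested field must itself satisfy the same rule.
            for child_name, child in nested.items():
                errors.extend(validate_field(f"{name}.{child_name}", child, depth + 1))
    elif not (field.get("css") or field.get("xpath")):
        errors.append(f"{name}: needs a css or xpath selector")
    return errors


good = {
    "type": "nested_list",
    "selector": "div.comment",
    "extract": {"author": {"css": ".author::text"}},
}
bad = {"type": "nested_list", "extract": {"body": {}}}

print(validate_field("comments", good))  # []
print(validate_field("comments", bad))
```

The second call reports two problems: the missing `selector` on `comments`, and the missing css/xpath selector on `comments.body`.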
Cross-Validation
Rules must reference defined callbacks.
Workflow
- Analyze page: scrapai analyze page.html
- Test selectors: scrapai analyze page.html --test "h1.title"
- Build callback config with processors
- Import spider: scrapai spiders import spider.json --project proj
- Test extraction: scrapai crawl spider --limit 5 --project proj
- View results: scrapai show spider 1 --project proj
Related
- Spider Schema - Complete configuration
- Rules - URL matching and routing
- Custom Extractors - CSS selector details