Extract custom fields from any structured data - not just articles. Perfect for e-commerce, job boards, real estate, forums, and any non-article content.
When to Use
Custom callbacks are ideal for:
E-commerce (products, prices, ratings)
Job boards (titles, companies, salaries)
Real estate (properties, prices, features)
Forums (posts, authors, replies)
Any non-article structured data
Default parse_article works for:
News sites, blogs, documentation
Content with title/content/author/date structure
Standard article formats
Basic Structure
{
"rules" : [
{
"allow" : [ "/product/.*" ],
"callback" : "parse_product"
}
],
"callbacks" : {
"parse_product" : {
"extract" : {
"name" : { "css" : "h1::text" },
"price" : {
"css" : "span.price::text" ,
"processors" : [
{ "type" : "strip" },
{ "type" : "regex" , "pattern" : "\\$([\\d.]+)" },
{ "type" : "cast" , "to" : "float" }
]
}
}
}
}
}
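To make the rule-to-callback link concrete, here is a minimal Python sketch of how a URL could be routed to a callback name. It assumes that `allow` patterns are regular expressions searched anywhere in the URL and that the first matching rule wins; both are assumptions borrowed from how Scrapy-style link extractors usually behave, not confirmed by this page. `callback_for` is a hypothetical helper, and the `/category/` rule is the navigation-only example from the Troubleshooting section.

```python
import re

# Rules shaped like the "rules" array in the config above.
RULES = [
    {"allow": ["/product/.*"], "callback": "parse_product"},
    {"allow": ["/category/"], "callback": None, "follow": True},
]

def callback_for(url, rules):
    """Return the callback name of the first rule whose allow pattern matches."""
    for rule in rules:
        if any(re.search(pattern, url) for pattern in rule["allow"]):
            return rule["callback"]
    return None  # no rule matched (or the matching rule is navigation-only)

product_callback = callback_for("https://example.com/product/blue-widget", RULES)
other_callback = callback_for("https://example.com/about", RULES)
```

In this sketch a product URL resolves to `parse_product`, whose `extract` block then defines the fields to pull from the page.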
Selectors
CSS Selector
Attribute Selector
XPath Selector
List Extraction
Nested Lists
Extract complex nested data structures:
{
"reviews" : {
"type" : "nested_list" ,
"selector" : "div.review" ,
"extract" : {
"author" : { "css" : "span.author::text" },
"rating" : {
"css" : "span.stars::attr(data-rating)" ,
"processors" : [{ "type" : "cast" , "to" : "int" }]
},
"comment" : { "css" : "p.text::text" }
}
}
}
Max nesting depth: 3 levels
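The shape of the result is easier to see with a small Python sketch. It is not how ScrapAI evaluates `nested_list`; it only illustrates the idea that the outer `selector` yields one element per review and the inner selectors are evaluated relative to each element. To stay dependency-free it uses the standard-library XML parser on a well-formed snippet and ElementTree's XPath subset in place of CSS selectors; the sample markup and `extract_reviews` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (real pages need an HTML parser).
MARKUP = """
<div>
  <div class="review">
    <span class="author">Ana</span>
    <span class="stars" data-rating="5"></span>
    <p class="text">Great value.</p>
  </div>
  <div class="review">
    <span class="author">Ben</span>
    <span class="stars" data-rating="3"></span>
    <p class="text">Average.</p>
  </div>
</div>
"""

def extract_reviews(markup):
    root = ET.fromstring(markup.strip())
    reviews = []
    # Outer selector: one node per review.
    for node in root.findall(".//div[@class='review']"):
        # Inner selectors run relative to the current review node.
        reviews.append({
            "author": node.find("span[@class='author']").text,
            "rating": int(node.find("span[@class='stars']").get("data-rating")),
            "comment": node.find("p[@class='text']").text,
        })
    return reviews

reviews = extract_reviews(MARKUP)
```

The result is a list with one dictionary per review, each holding `author`, `rating` (cast to an integer), and `comment`.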
Field Processors
Eight processors are available for data transformation, including:
regex: Extract with a pattern
cast: Convert type (int, float, bool, str)
lowercase: Convert to lowercase
parse_datetime: Parse dates (stored as ISO strings)
The examples on this page also use strip, replace, and default.
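As a rough illustration of what "stores as ISO strings" means for parse_datetime, the Python sketch below normalizes a date to an ISO 8601 string. Which input formats the real processor accepts is not stated here; `to_iso` and its `fmt` parameter are hypothetical and exist only for this example.

```python
from datetime import datetime

def to_iso(value, fmt=None):
    """Parse a date string and return it as an ISO 8601 string."""
    if fmt is None:
        parsed = datetime.fromisoformat(value)   # input is already ISO-like
    else:
        parsed = datetime.strptime(value, fmt)   # explicit format for other inputs
    return parsed.isoformat()

# A typical <time datetime="..."> attribute value.
iso_from_attr = to_iso("2024-03-01T09:30:00+00:00")
# A day/month/year string with an explicit format.
iso_from_text = to_iso("01/03/2024", "%d/%m/%Y")
```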
Chaining Processors
Processors execute sequentially, each passing its output to the next processor:
{
"price" : {
"css" : "span.price::text" ,
"processors" : [
{ "type" : "strip" },
{ "type" : "regex" , "pattern" : "\\$([\\d.]+)" },
{ "type" : "cast" , "to" : "float" }
]
}
}
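The chain above can be sketched in Python to show how a value flows through it. This is a simplified model, not ScrapAI's implementation: the processor functions, the `PROCESSORS` table, and `run_chain` are hypothetical, and stopping as soon as a step yields `None` is an assumption.

```python
import re

def strip(value):
    return value.strip()

def regex(value, pattern):
    match = re.search(pattern, value)
    if match is None:
        return None
    # Return the first capture group when present, else the whole match.
    return match.group(1) if match.groups() else match.group(0)

def cast(value, to):
    return {"int": int, "float": float, "str": str, "bool": bool}[to](value)

PROCESSORS = {"strip": strip, "regex": regex, "cast": cast}

def run_chain(value, chain):
    """Apply each processor in order, feeding its output to the next one."""
    for step in chain:
        if value is None:  # an earlier step produced nothing; stop early
            return None
        options = {key: val for key, val in step.items() if key != "type"}
        value = PROCESSORS[step["type"]](value, **options)
    return value

chain = [
    {"type": "strip"},
    {"type": "regex", "pattern": r"\$([\d.]+)"},
    {"type": "cast", "to": "float"},
]
price = run_chain("  $19.99 ", chain)
no_price = run_chain("Call for price", chain)
```

Here `"  $19.99 "` becomes `"$19.99"`, then `"19.99"`, then the float `19.99`. A string without a dollar amount falls out of the chain as `None`, which is one way a field can end up empty.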
Templates
Complete working examples in templates/:
E-commerce (templates/spider-ecommerce.json): Product pages with prices, ratings, stock
Job Boards (templates/spider-jobs.json): Job listings with companies, salaries
Real Estate (templates/spider-realestate.json): Property listings with prices, features
Use templates as starting points - adjust selectors to match your target site.
Common Patterns
{
"price" : {
"css" : "span.price::text" ,
"processors" : [
{ "type" : "strip" },
{ "type" : "regex" , "pattern" : "\\$([\\d,.]+)" },
{ "type" : "replace" , "old" : "," , "new" : "" },
{ "type" : "cast" , "to" : "float" }
]
}
}
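The `replace` step in this pattern matters because a captured value such as `1,299.00` cannot be converted to a float directly. The Python sketch below walks through the same steps by hand; it assumes the `cast` processor behaves like Python's `float()`, which is not confirmed here, and the sample string is made up.

```python
import re

raw = " $1,299.00 "

# strip + regex: capture digits, commas, and dots after the dollar sign.
captured = re.search(r"\$([\d,.]+)", raw.strip()).group(1)

# Casting the captured text directly fails because of the thousands separator.
try:
    float(captured)
    cast_without_replace = "ok"
except ValueError:
    cast_without_replace = "ValueError"

# replace + cast: drop the commas first, then convert.
price = float(captured.replace(",", ""))
```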
{
"in_stock" : {
"css" : "span.availability::text" ,
"processors" : [
{ "type" : "lowercase" },
{ "type" : "regex" , "pattern" : "(yes|true|available)" },
{ "type" : "cast" , "to" : "bool" }
]
}
}
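Two details of this pattern are worth checking against your own data. The Python sketch below assumes the bool cast follows Python truthiness and that a failed regex yields no value; neither is confirmed here, and `in_stock` is a hypothetical helper. Under those assumptions a non-matching string gives `None` rather than `False`, and the unanchored pattern also matches inside "unavailable". Adding word boundaries (`\b`) is one possible way to avoid the second problem.

```python
import re

def in_stock(text, pattern=r"(yes|true|available)"):
    """Mimic lowercase -> regex -> cast(bool) for an availability string."""
    match = re.search(pattern, text.lower())
    if match is None:
        return None              # no match: the field is empty, not False
    return bool(match.group(1))  # any non-empty match is truthy

available = in_stock("Available")
sold_out = in_stock("Sold out")
unavailable_loose = in_stock("Unavailable")
unavailable_strict = in_stock("Unavailable", r"\b(yes|true|available)\b")
```

With the original pattern, "Unavailable" evaluates to `True` in this sketch; with the word-boundary variant it yields no value.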
Handle Missing Fields
{
"optional_field" : {
"css" : "span.optional::text" ,
"processors" : [
{ "type" : "strip" },
{ "type" : "default" , "default" : null }
]
}
}
Storage Behavior
Standard fields (url, title, content, author, published_date) → Main DB columns
Custom fields → metadata_json column
The show command displays custom fields
Exports flatten custom fields to top-level columns/keys
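The split between standard columns and `metadata_json`, and the flattening on export, can be modeled with a short Python sketch. The field names come from the list above; `to_row` and `flatten_for_export` are hypothetical helpers, and the exact serialization ScrapAI uses is an assumption.

```python
import json

STANDARD_FIELDS = ("url", "title", "content", "author", "published_date")

def to_row(item):
    """Split an extracted item into standard columns plus a JSON blob."""
    row = {name: item.get(name) for name in STANDARD_FIELDS}
    custom = {key: val for key, val in item.items() if key not in STANDARD_FIELDS}
    row["metadata_json"] = json.dumps(custom, sort_keys=True)
    return row

def flatten_for_export(row):
    """Lift custom fields out of metadata_json to top-level keys."""
    flat = {key: val for key, val in row.items() if key != "metadata_json"}
    flat.update(json.loads(row["metadata_json"]))
    return flat

item = {
    "url": "https://example.com/product/blue-widget",
    "title": "Blue Widget",
    "price": 19.99,
    "brand": "Acme",
}
row = to_row(item)
exported = flatten_for_export(row)
```

In this model `price` and `brand` live inside `metadata_json` in storage and reappear next to `title` and `url` in the exported record.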
Workflow
Analyze sample page
./scrapai analyze page.html
Discover page structure and identify fields
Test selectors
./scrapai analyze page.html --test "h1::text"
./scrapai analyze page.html --test "span.price::text"
Verify selectors extract correct data
Build callback config
Create callback with selectors and processors
Test on multiple pages
Verify selectors work across different pages
Import and test
./scrapai crawl spider --limit 5 --project proj
Run test crawl to validate extraction
Complete Examples
E-commerce Product
{
"name" : "mystore" ,
"allowed_domains" : [ "example.com" ],
"start_urls" : [ "https://example.com/products" ],
"rules" : [
{
"allow" : [ "/product/[^/]+$" ],
"callback" : "parse_product" ,
"follow" : false
}
],
"callbacks" : {
"parse_product" : {
"extract" : {
"title" : { "css" : "h1.product-name::text" },
"content" : { "css" : "div.product-description::text" },
"price" : {
"css" : "span.price-value::text" ,
"processors" : [
{ "type" : "strip" },
{ "type" : "regex" , "pattern" : "\\$([\\d,.]+)" },
{ "type" : "replace" , "old" : "," , "new" : "" },
{ "type" : "cast" , "to" : "float" }
]
},
"rating" : {
"css" : "div.star-rating::attr(data-rating)" ,
"processors" : [{ "type" : "cast" , "to" : "float" }]
},
"stock" : { "css" : "span.availability::text" },
"brand" : { "css" : "div.brand-name::text" }
}
}
}
}
Job Listing
{
"name" : "jobboard" ,
"allowed_domains" : [ "jobs.example.com" ],
"start_urls" : [ "https://jobs.example.com/listings" ],
"rules" : [
{
"allow" : [ "/job/[^/]+$" ],
"callback" : "parse_job" ,
"follow" : false
}
],
"callbacks" : {
"parse_job" : {
"extract" : {
"title" : { "css" : "h1.job-title::text" },
"company" : { "css" : "span.company-name::text" },
"content" : { "css" : "div.job-description::text" },
"salary" : { "css" : "span.salary-range::text" },
"location" : { "css" : "span.location::text" },
"job_type" : { "css" : "span.job-type::text" },
"date" : {
"css" : "time.posted-date::attr(datetime)" ,
"processors" : [{ "type" : "parse_datetime" }]
}
}
}
}
}
Forum Posts
{
"name" : "forum" ,
"allowed_domains" : [ "forum.example.com" ],
"start_urls" : [ "https://forum.example.com/threads" ],
"rules" : [
{
"allow" : [ "/thread/[^/]+$" ],
"callback" : "parse_thread" ,
"follow" : false
}
],
"callbacks" : {
"parse_thread" : {
"extract" : {
"title" : { "css" : "h1.thread-title::text" },
"author" : { "css" : "span.username::text" },
"content" : { "css" : "div.post-content::text" },
"date" : {
"css" : "time.post-date::attr(datetime)" ,
"processors" : [{ "type" : "parse_datetime" }]
},
"upvotes" : {
"css" : "span.vote-count::text" ,
"processors" : [{ "type" : "cast" , "to" : "int" }]
},
"category" : { "css" : "a.category-link::text" }
}
}
}
}
Reserved Names
Never use these reserved callback names:
parse_article
parse_start_url
start_requests
from_crawler
closed
parse
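A quick pre-import check for reserved names could look like the Python sketch below. The reserved set is copied from the list above; `reserved_callbacks` is a hypothetical helper and not part of the ScrapAI CLI.

```python
RESERVED_NAMES = {
    "parse_article",
    "parse_start_url",
    "start_requests",
    "from_crawler",
    "closed",
    "parse",
}

def reserved_callbacks(config):
    """Return the callback names in a config that collide with reserved names."""
    return sorted(set(config.get("callbacks", {})) & RESERVED_NAMES)

good_config = {"callbacks": {"parse_product": {"extract": {}}}}
bad_config = {"callbacks": {"parse": {"extract": {}}, "parse_job": {"extract": {}}}}

good_conflicts = reserved_callbacks(good_config)
bad_conflicts = reserved_callbacks(bad_config)
```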
Troubleshooting
Field Returns None
Test selector
./scrapai analyze page.html --test "your-selector"
Check if page needs browser rendering
./scrapai inspect https://example.com --project proj --browser
Verify processor chain
Check if processor is failing and returning None
Wrong Type in Output
Add cast processor to convert type:
{ "processors" : [{ "type" : "cast" , "to" : "float" }]}
Rule References Undefined Callback
Add callback to callbacks dict
Ensure callback is defined in callbacks section
Or use null for navigation-only
{ "allow" : [ "/category/" ], "callback" : null , "follow" : true }
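The same check can be run before importing a config. The Python sketch below lists every rule callback that is neither defined in `callbacks` nor `null`. It treats `parse_article` as built in because this page describes it as the default, which is an assumption about how the validation works; `undefined_callbacks` is a hypothetical helper.

```python
def undefined_callbacks(config):
    """Return rule callbacks that are missing from the callbacks section."""
    defined = set(config.get("callbacks", {}))
    missing = []
    for rule in config.get("rules", []):
        name = rule.get("callback")
        # null means navigation-only; parse_article is assumed to be built in.
        if name is None or name == "parse_article":
            continue
        if name not in defined:
            missing.append(name)
    return missing

config = {
    "rules": [
        {"allow": ["/product/.*"], "callback": "parse_product"},
        {"allow": ["/job/[^/]+$"], "callback": "parse_job"},
        {"allow": ["/category/"], "callback": None, "follow": True},
    ],
    "callbacks": {"parse_product": {"extract": {}}},
}
missing = undefined_callbacks(config)
```

Here only `parse_job` is reported, since `parse_product` is defined and the category rule is navigation-only.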