Custom Callbacks & Field Extraction

Extract custom fields from any structured data - not just articles. Perfect for e-commerce, job boards, real estate, forums, and any non-article content.

When to Use

Use callbacks for:

E-commerce (products, prices, ratings)
Job boards (titles, companies, salaries)
Real estate (properties, prices, features)
Forums (posts, authors, replies)
Any non-article structured data

Use parse_article for:

Article content

Basic Structure

spider.json

{
  "rules": [
    {
      "allow": ["/product/.*"],
      "callback": "parse_product"
    }
  ],
  "callbacks": {
    "parse_product": {
      "extract": {
        "name": {"css": "h1::text"},
        "price": {
          "css": "span.price::text",
          "processors": [
            {"type": "strip"},
            {"type": "regex", "pattern": "\\$([\\d.]+)"},
            {"type": "cast", "to": "float"}
          ]
        }
      }
    }
  }
}
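To see how a rule's allow patterns decide which pages reach a callback, here is a minimal Python sketch. It assumes the patterns are regular expressions searched against the URL, as in Scrapy link extractors; ScrapAI's actual matching may differ. The function name route and the sample URLs are illustrative only.

```python
import re

def route(url, rules):
    """Return the first rule whose allow patterns match the URL, else None."""
    for rule in rules:
        # Assumption: each pattern is a regex searched anywhere in the URL.
        if any(re.search(pattern, url) for pattern in rule["allow"]):
            return rule
    return None

loose = [{"allow": ["/product/.*"], "callback": "parse_product"}]
strict = [{"allow": ["/product/[^/]+$"], "callback": "parse_product"}]

product_url = "https://example.com/product/blue-widget"
reviews_url = "https://example.com/product/blue-widget/reviews"

print(route(product_url, loose)["callback"])   # callback for a product page
print(route(reviews_url, loose) is not None)   # loose pattern also matches sub-pages
print(route(reviews_url, strict) is not None)  # anchored pattern rejects sub-pages
```

Under these assumptions, the loose pattern /product/.* would also send sub-pages such as /product/blue-widget/reviews to the callback, while the anchored form /product/[^/]+$ used in the complete examples would not.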

Field Extraction

Selectors

{"css": "h1::text"}

Nested Lists

Extract complex nested data structures:

{
  "reviews": {
    "type": "nested_list",
    "selector": "div.review",
    "extract": {
      "author": {"css": "span.author::text"},
      "rating": {
        "css": "span.stars::attr(data-rating)",
        "processors": [{"type": "cast", "to": "int"}]
      },
      "comment": {"css": "p.text::text"}
    }
  }
}

Max nesting depth: 3 levels
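As a rough illustration of the shape a nested_list field produces, the sketch below builds one dict per matching container element. It uses Python's standard-library ElementTree on a small, well-formed, hypothetical page instead of CSS selectors, so it only approximates the idea and is not how ScrapAI itself extracts data.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup for a product page.
PAGE = """
<html><body>
  <div class="review">
    <span class="author">Ana</span>
    <span class="stars" data-rating="5"></span>
    <p class="text">Great value.</p>
  </div>
  <div class="review">
    <span class="author">Ben</span>
    <span class="stars" data-rating="3"></span>
    <p class="text">Average.</p>
  </div>
</body></html>
"""

def extract_reviews(page):
    root = ET.fromstring(page.strip())
    reviews = []
    # Outer "selector": one result dict per div.review container.
    for node in root.findall(".//div[@class='review']"):
        reviews.append({
            # Inner "extract": fields are looked up relative to the container.
            "author": node.findtext("span[@class='author']"),
            "rating": int(node.find("span[@class='stars']").get("data-rating")),
            "comment": node.findtext("p[@class='text']"),
        })
    return reviews

reviews = extract_reviews(PAGE)
print(reviews)
```

The point to notice is that the inner selectors are evaluated relative to each matched container, so every review gets its own author, rating, and comment.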

Field Processors

8 processors available for data transformation:

strip: Remove whitespace
replace: Replace substring
regex: Extract with pattern
cast: Convert type (int, float, bool, str)
join: Join list to string
default: Fallback value
lowercase: Convert to lowercase
parse_datetime: Parse dates (stores as ISO strings)

See Data Processors for complete reference.
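For intuition, here is a hedged Python sketch of what four of these processors might do conceptually. The function names mirror the processor names, but the parameter names (separator, fallback) and the exact behavior are assumptions, not ScrapAI's implementation; the Data Processors reference is authoritative.

```python
from datetime import datetime

def join(values, separator=" "):
    # Assumed behavior: concatenate a list of strings into one string.
    return separator.join(values)

def default(value, fallback):
    # Assumed behavior: substitute the fallback when nothing was extracted.
    return fallback if value in (None, "") else value

def lowercase(value):
    return value.lower()

def parse_datetime(value):
    # Parses an ISO 8601 string and returns it as an ISO string again.
    return datetime.fromisoformat(value).isoformat()

print(join(["red", "blue"], ", "))
print(default(None, "n/a"))
print(lowercase("In Stock"))
print(parse_datetime("2024-03-15T10:30:00"))
```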

Chaining Processors

Processors run in the order listed, each passing its output to the next:

{
  "price": {
    "css": "span.price::text",
    "processors": [
      {"type": "strip"},
      {"type": "regex", "pattern": "\\$([\\d.]+)"},
      {"type": "cast", "to": "float"}
    ]
  }
}
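The chain above can be pictured as function composition. The following Python sketch mimics strip, regex, and cast on a sample value. It assumes that a regex with one capture group yields that group, and that a failed step produces None which stops the chain; both are illustrative assumptions rather than documented ScrapAI behavior.

```python
import re

def strip(value):
    return value.strip()

def regex(pattern):
    def apply(value):
        match = re.search(pattern, value)
        # Assumption: the first capture group is the processor's output.
        return match.group(1) if match else None
    return apply

def cast_float(value):
    return float(value)

def run_chain(value, processors):
    for processor in processors:
        if value is None:
            # Assumption: a missing value short-circuits the rest of the chain.
            return None
        value = processor(value)
    return value

# The JSON string "\\$([\\d.]+)" decodes to the regex \$([\d.]+).
chain = [strip, regex(r"\$([\d.]+)"), cast_float]

price = run_chain("  $19.99 \n", chain)
missing = run_chain("Call for price", chain)
print(price, missing)
```

Order matters here: casting before the regex step would fail because the raw text still contains the currency symbol.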

Templates

Complete working examples are in templates/:

E-commerce: templates/spider-ecommerce.json (product pages with prices, ratings, stock)
Job Boards: templates/spider-jobs.json (job listings with companies, salaries)
Real Estate: templates/spider-realestate.json (property listings with prices, features)

Common Patterns

Extract Price

{
  "price": {
    "css": "span.price::text",
    "processors": [
      {"type": "strip"},
      {"type": "regex", "pattern": "\\$([\\d,.]+)"},
      {"type": "replace", "old": ",", "new": ""},
      {"type": "cast", "to": "float"}
    ]
  }
}
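Before committing to a price pattern, it can help to try it against a few representative strings. This Python sketch reproduces the strip, regex, replace, and cast steps with the standard re module; the helper name parse_price and the sample strings are hypothetical.

```python
import re

PRICE_PATTERN = r"\$([\d,.]+)"  # same pattern as in the config, JSON-unescaped

def parse_price(text):
    match = re.search(PRICE_PATTERN, text.strip())
    if match is None:
        return None
    # Drop thousands separators before converting to a number.
    return float(match.group(1).replace(",", ""))

samples = ["$1,299.99", " $45 ", "From $3,000", "Sold out"]
prices = [parse_price(sample) for sample in samples]
print(prices)
```

One caveat of this character class: it also captures a trailing period, so text like "Only $19.99." would yield "19.99." and fail the float conversion. The pattern also assumes a dollar sign and a comma as the thousands separator.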

Extract Boolean

{
  "in_stock": {
    "css": "span.availability::text",
    "processors": [
      {"type": "lowercase"},
      {"type": "regex", "pattern": "(yes|true|available)"},
      {"type": "cast", "to": "bool"}
    ]
  }
}
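A quick way to sanity-check the boolean pattern is to run it on sample availability strings. The sketch below assumes that a non-matching regex yields None and that casting to bool follows Python truthiness; how ScrapAI's cast actually treats strings is not confirmed here. The STRICT variant with word boundaries is a suggested alternative, not part of the original config.

```python
import re

LOOSE = r"(yes|true|available)"         # pattern from the config above
STRICT = r"\b(yes|true|available)\b"    # hypothetical stricter variant

def in_stock(text, pattern=LOOSE):
    match = re.search(pattern, text.lower())
    captured = match.group(1) if match else None
    # Assumption: None becomes False, any captured text becomes True.
    return bool(captured)

print(in_stock("In Stock: Available"))
print(in_stock("Sold out"))
print(in_stock("Unavailable"))          # loose pattern matches inside the word
print(in_stock("Unavailable", STRICT))  # word boundaries avoid that match
```

Note that the loose pattern matches "available" inside "unavailable", so sites that use that wording may need the word-boundary form or a different selector.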

Handle Missing Fields

{
  "optional_field": {
    "css": "span.optional::text",
    "processors": [
      {"type": "strip"},
      {"type": "default", "default": null}
    ]
  }
}

Storage Behavior

Standard fields (url, title, content, author, published_date) → Main DB columns
Custom fields → metadata_json column
show command displays custom fields
Exports flatten custom fields to top-level columns/keys
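The storage rules above can be sketched in Python as a split on the way in and a merge on the way out. The function names split_item and flatten_for_export are hypothetical, and the real database layer may serialize metadata differently.

```python
import json

STANDARD_FIELDS = {"url", "title", "content", "author", "published_date"}

def split_item(item):
    """Keep standard fields as columns; pack everything else into metadata_json."""
    row = {k: v for k, v in item.items() if k in STANDARD_FIELDS}
    custom = {k: v for k, v in item.items() if k not in STANDARD_FIELDS}
    row["metadata_json"] = json.dumps(custom, sort_keys=True)
    return row

def flatten_for_export(row):
    """Lift custom fields back to top-level keys, as exports are described to do."""
    flat = {k: v for k, v in row.items() if k != "metadata_json"}
    flat.update(json.loads(row["metadata_json"]))
    return flat

item = {
    "url": "https://example.com/product/blue-widget",
    "title": "Blue Widget",
    "price": 19.99,
    "brand": "Acme",
}
row = split_item(item)
print(row["metadata_json"])
print(flatten_for_export(row) == item)
```

A practical consequence is that a custom field named like a standard one (for example title) lands in the main column rather than in metadata_json, which is why the complete examples reuse title and content.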

Workflow

Analyze sample page

./scrapai analyze page.html

Test selectors

./scrapai analyze page.html --test "h1::text"
./scrapai analyze page.html --test "span.price::text"

Build callback config

Create callback with selectors and processors

Test on multiple pages

Verify selectors work across different pages

Import and test

./scrapai crawl spider --limit 5 --project proj

Complete Examples

E-commerce Product

{
  "name": "mystore",
  "allowed_domains": ["example.com"],
  "start_urls": ["https://example.com/products"],
  "rules": [
    {
      "allow": ["/product/[^/]+$"],
      "callback": "parse_product",
      "follow": false
    }
  ],
  "callbacks": {
    "parse_product": {
      "extract": {
        "title": {"css": "h1.product-name::text"},
        "content": {"css": "div.product-description::text"},
        "price": {
          "css": "span.price-value::text",
          "processors": [
            {"type": "strip"},
            {"type": "regex", "pattern": "\\$([\\d,.]+)"},
            {"type": "replace", "old": ",", "new": ""},
            {"type": "cast", "to": "float"}
          ]
        },
        "rating": {
          "css": "div.star-rating::attr(data-rating)",
          "processors": [{"type": "cast", "to": "float"}]
        },
        "stock": {"css": "span.availability::text"},
        "brand": {"css": "div.brand-name::text"}
      }
    }
  }
}

Job Listing

{
  "name": "jobboard",
  "allowed_domains": ["jobs.example.com"],
  "start_urls": ["https://jobs.example.com/listings"],
  "rules": [
    {
      "allow": ["/job/[^/]+$"],
      "callback": "parse_job",
      "follow": false
    }
  ],
  "callbacks": {
    "parse_job": {
      "extract": {
        "title": {"css": "h1.job-title::text"},
        "company": {"css": "span.company-name::text"},
        "content": {"css": "div.job-description::text"},
        "salary": {"css": "span.salary-range::text"},
        "location": {"css": "span.location::text"},
        "job_type": {"css": "span.job-type::text"},
        "date": {
          "css": "time.posted-date::attr(datetime)",
          "processors": [{"type": "parse_datetime"}]
        }
      }
    }
  }
}

Forum Posts

{
  "name": "forum",
  "allowed_domains": ["forum.example.com"],
  "start_urls": ["https://forum.example.com/threads"],
  "rules": [
    {
      "allow": ["/thread/[^/]+$"],
      "callback": "parse_thread",
      "follow": false
    }
  ],
  "callbacks": {
    "parse_thread": {
      "extract": {
        "title": {"css": "h1.thread-title::text"},
        "author": {"css": "span.username::text"},
        "content": {"css": "div.post-content::text"},
        "date": {
          "css": "time.post-date::attr(datetime)",
          "processors": [{"type": "parse_datetime"}]
        },
        "upvotes": {
          "css": "span.vote-count::text",
          "processors": [{"type": "cast", "to": "int"}]
        },
        "category": {"css": "a.category-link::text"}
      }
    }
  }
}

Reserved Names

Never use these reserved callback names:

parse_article
parse_start_url
start_requests
from_crawler
closed
parse
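A small pre-flight check can catch reserved names before a spider is imported. This Python sketch inspects a config dict shaped like spider.json; the helper name reserved_callbacks is hypothetical, and ScrapAI may perform its own validation.

```python
RESERVED = {
    "parse_article",
    "parse_start_url",
    "start_requests",
    "from_crawler",
    "closed",
    "parse",
}

def reserved_callbacks(config):
    """Return the callback names in the config that collide with reserved names."""
    return sorted(set(config.get("callbacks", {})) & RESERVED)

config = {
    "callbacks": {
        "parse": {"extract": {}},          # collides with a reserved name
        "parse_product": {"extract": {}},  # fine
    }
}
print(reserved_callbacks(config))
```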

Troubleshooting

Field Returns None

Test selector

./scrapai analyze page.html --test "your-selector"

Check if page needs browser rendering

./scrapai inspect https://example.com --project proj --browser

Verify processor chain

Check if processor is failing and returning None

Wrong Type in Output

Add cast processor to convert type:

{"processors": [{"type": "cast", "to": "float"}]}

Rule References Undefined Callback

Add the callback to the callbacks dict

Ensure the callback named in the rule is defined in the callbacks section

Or set callback to null for navigation-only rules

{"allow": ["/category/"], "callback": null, "follow": true}
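To find such mismatches up front, the following Python sketch lists rule callbacks that have no definition. It assumes parse_article is built in and needs no entry in callbacks, and that a null (None) callback marks a navigation-only rule; the helper name undefined_callbacks is hypothetical.

```python
BUILT_IN = {"parse_article"}  # assumption: provided by the framework

def undefined_callbacks(config):
    """Return callback names referenced by rules but missing from callbacks."""
    defined = set(config.get("callbacks", {})) | BUILT_IN
    missing = []
    for rule in config.get("rules", []):
        name = rule.get("callback")
        # None means navigation-only, so there is nothing to look up.
        if name is not None and name not in defined:
            missing.append(name)
    return missing

config = {
    "rules": [
        {"allow": ["/category/"], "callback": None, "follow": True},
        {"allow": ["/product/[^/]+$"], "callback": "parse_product"},
        {"allow": ["/job/[^/]+$"], "callback": "parse_job"},
    ],
    "callbacks": {"parse_product": {"extract": {}}},
}
print(undefined_callbacks(config))
```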

Related Guides

Data Processors: Complete processor reference
Extractors: Content extraction strategies