Callbacks extract custom fields from structured data like products, jobs, listings, or forums. They define CSS/XPath selectors and processors for each field.

When to Use

Use callbacks for:
  • E-commerce (products, prices, ratings)
  • Job boards (titles, companies, salaries)
  • Real estate (properties, prices, features)
  • Forums (posts, authors, replies)
  • Any non-article structured data
Use parse_article for:
  • News, blogs, documentation
  • Content with title/content/author/date structure

CallbackSchema

extract
object
required
Field extraction rules (mapping of field name → FieldExtractSchema). Min 1 field required.
{
  "extract": {
    "name": {"css": "h1::text"},
    "price": {"css": "span.price::text"}
  }
}

FieldExtractSchema

css
string
default:"null"
CSS selector for extraction. Syntax: "h1" (element), "h1::text" (text), "img::attr(src)" (attribute)
{"css": "h1.product-name::text"}
xpath
string
default:"null"
XPath expression for extraction
{"xpath": "//h1/text()"}
get_all
boolean
default:"false"
Extract all matches (returns list)
{
  "css": "li.feature::text",
  "get_all": true
}
processors
ProcessorSchema[]
default:"null"
Value transformations applied sequentially. Available: strip, replace, regex, cast, join, default, lowercase, parse_datetime
{
  "css": "span.price::text",
  "processors": [
    {"type": "strip"},
    {"type": "regex", "pattern": "\\$([\\d,.]+)"},
    {"type": "replace", "old": ",", "new": ""},
    {"type": "cast", "to": "float"}
  ]
}

Nested Lists

type
string
default:"null"
Field type. Use "nested_list" for extracting lists of objects.
{"type": "nested_list"}
selector
string
default:"null"
CSS selector for nested list items. Required when type: "nested_list".
{
  "type": "nested_list",
  "selector": "div.review"
}
extract
object
default:"null"
Nested extraction config (field name → FieldExtractSchema). Required when type: "nested_list". Max depth: 3 levels.
{
  "type": "nested_list",
  "selector": "div.review",
  "extract": {
    "author": {"css": "span.author::text"},
    "rating": {
      "css": "span.stars::attr(data-rating)",
      "processors": [{"type": "cast", "to": "int"}]
    },
    "comment": {"css": "p.text::text"}
  }
}
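To make the output shape concrete, here is a rough Python sketch of what "a list of objects" means for the review config above: one dict per element matched by selector, with the nested fields evaluated relative to that element. It is an illustration only, not the tool's implementation. It uses the limited XPath support in the standard library's xml.etree.ElementTree instead of CSS selectors, and the sample markup and the extract_reviews name are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (not taken from a real site).
SAMPLE = """
<div>
  <div class="review">
    <span class="author">Ana</span>
    <span class="stars" data-rating="5"/>
    <p class="text">Great</p>
  </div>
  <div class="review">
    <span class="author">Ben</span>
    <span class="stars" data-rating="3"/>
    <p class="text">Okay</p>
  </div>
</div>
"""


def extract_reviews(markup):
    """Return one dict per review element, mirroring the nested_list idea."""
    root = ET.fromstring(markup.strip())
    reviews = []
    # Equivalent in spirit to "selector": "div.review"
    for item in root.findall(".//div[@class='review']"):
        reviews.append({
            # Each nested field is looked up relative to the current item.
            "author": item.findtext("span[@class='author']"),
            # Equivalent in spirit to ::attr(data-rating) plus a cast to int.
            "rating": int(item.find("span[@class='stars']").get("data-rating")),
            "comment": item.findtext("p[@class='text']"),
        })
    return reviews


print(extract_reviews(SAMPLE))
```

The point of the sketch is the scoping: nested selectors only see the current list item, so each review keeps its own author, rating, and comment.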

ProcessorSchema

type
string
required
Processor type: strip, replace, regex, cast, join, default, lowercase, parse_datetime
{"type": "strip"}
Processor parameters:
  • replace: old, new (strings)
  • regex: pattern (string), group (int, default: 1)
  • cast: to ("int" | "float" | "bool" | "str")
  • join: separator (string, default: " ")
  • default: default (any)
  • parse_datetime: format (string, optional)
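The sequential behaviour of a processor chain can be sketched in Python as below. This is a minimal illustration of the parameters listed above, not the tool's actual code. The apply_processors name is hypothetical, and two behaviours are assumptions: a regex that does not match yields None, and every processor except default passes None through unchanged. parse_datetime is left out because the fallback used when format is omitted is not specified here.

```python
import re


def apply_processors(value, processors):
    """Apply each processor in order; the output of one feeds the next."""
    casts = {"int": int, "float": float, "bool": bool, "str": str}
    for proc in processors:
        kind = proc["type"]
        if kind == "default":
            # Assumption: the default replaces a missing or empty value.
            if value is None or value == "":
                value = proc["default"]
        elif value is None:
            # Assumption: other processors leave a missing value untouched.
            continue
        elif kind == "strip":
            value = value.strip()
        elif kind == "lowercase":
            value = value.lower()
        elif kind == "replace":
            value = value.replace(proc["old"], proc["new"])
        elif kind == "regex":
            match = re.search(proc["pattern"], value)
            # "group" defaults to 1, as documented above.
            value = match.group(proc.get("group", 1)) if match else None
        elif kind == "cast":
            value = casts[proc["to"]](value)
        elif kind == "join":
            # "separator" defaults to a single space, as documented above.
            value = proc.get("separator", " ").join(value)
        else:
            raise ValueError(f"unsupported processor in this sketch: {kind}")
    return value


price_chain = [
    {"type": "strip"},
    {"type": "regex", "pattern": r"\$([\d,.]+)"},
    {"type": "replace", "old": ",", "new": ""},
    {"type": "cast", "to": "float"},
]
print(apply_processors("  $1,299.99 ", price_chain))
```

Order matters in a chain like this: the cast to float only works because the regex and replace steps have already removed the currency symbol and the thousands separator.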

Examples

E-commerce Product

{
  "callbacks": {
    "parse_product": {
      "extract": {
        "name": {
          "css": "h1.product-name::text",
          "processors": [{"type": "strip"}]
        },
        "price": {
          "css": "span.price::text",
          "processors": [
            {"type": "strip"},
            {"type": "regex", "pattern": "\\$([\\d,.]+)"},
            {"type": "replace", "old": ",", "new": ""},
            {"type": "cast", "to": "float"}
          ]
        },
        "rating": {
          "css": "span.rating::attr(data-rating)",
          "processors": [{"type": "cast", "to": "float"}]
        },
        "features": {
          "css": "li.feature::text",
          "get_all": true
        }
      }
    }
  }
}

Job Listing

{
  "callbacks": {
    "parse_job": {
      "extract": {
        "title": {
          "css": "h1.job-title::text",
          "processors": [{"type": "strip"}]
        },
        "company": {
          "css": "span.company-name::text",
          "processors": [{"type": "strip"}]
        },
        "salary_min": {
          "css": "span.salary-min::text",
          "processors": [
            {"type": "regex", "pattern": "\\$([\\d,]+)"},
            {"type": "replace", "old": ",", "new": ""},
            {"type": "cast", "to": "int"}
          ]
        },
        "remote": {
          "css": "span.remote::text",
          "processors": [
            {"type": "lowercase"},
            {"type": "regex", "pattern": "(remote|work from home)"},
            {"type": "cast", "to": "bool"}
          ]
        },
        "requirements": {
          "css": "li.requirement::text",
          "get_all": true
        }
      }
    }
  }
}

Forum with Nested Comments

{
  "callbacks": {
    "parse_discussion": {
      "extract": {
        "story_title": {
          "css": "tr.athing td.title a::text"
        },
        "story_points": {
          "css": "span.score::text",
          "processors": [
            {"type": "regex", "pattern": "(\\d+)"}
          ]
        },
        "comments": {
          "type": "nested_list",
          "selector": "tr.athing.comtr",
          "extract": {
            "author": {"css": "a.hnuser::text"},
            "timestamp": {"css": "span.age::attr(title)"},
            "comment_text": {
              "css": "div.commtext",
              "get_all": true,
              "processors": [
                {"type": "join", "separator": " "}
              ]
            }
          }
        }
      }
    }
  }
}

Field Storage

Standard fields map to database columns: url, title, content, author, published_date.
Custom fields are stored in the metadata_json column and flattened in exports.
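As a rough illustration of that split, the Python sketch below separates an extracted item into the standard columns and a JSON blob, then flattens it again for export. The function names, the use of None for missing standard fields, and the exact JSON encoding are assumptions made for the example; the actual storage layer may differ.

```python
import json

# The standard columns named above.
STANDARD_FIELDS = ("url", "title", "content", "author", "published_date")


def split_item(item):
    """Split an item into standard columns plus a metadata_json string."""
    row = {name: item.get(name) for name in STANDARD_FIELDS}
    custom = {k: v for k, v in item.items() if k not in STANDARD_FIELDS}
    row["metadata_json"] = json.dumps(custom, sort_keys=True)
    return row


def flatten_for_export(row):
    """Merge the custom fields back in as top-level keys."""
    flat = {k: v for k, v in row.items() if k != "metadata_json"}
    flat.update(json.loads(row["metadata_json"]))
    return flat


row = split_item({"url": "https://example.com/p/1", "title": "Widget", "price": 9.5})
print(row["metadata_json"])
print(flatten_for_export(row))
```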

Reserved Names

The following names cannot be used as callback names: parse_article, parse_start_url, start_requests, from_crawler, closed, parse

Validation

  • Callback names: must be valid Python identifiers (e.g., parse_product, not parse-product)
  • Field extraction: must have a css/xpath selector OR type: "nested_list" with selector and extract
  • Cross-validation: rules must reference defined callbacks
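The first two checks can be approximated locally before importing a spider. The Python sketch below is an assumption about how such checks might look, not the validator the tool ships. The function names and error messages are hypothetical, and cross-validation of rules is omitted because the rule schema is not described in this section.

```python
import keyword

# Reserved callback names listed in the "Reserved Names" section.
RESERVED_NAMES = {
    "parse_article", "parse_start_url", "start_requests",
    "from_crawler", "closed", "parse",
}


def validate_callback_name(name):
    """Return an error message, or None if the name looks acceptable."""
    if not name.isidentifier() or keyword.iskeyword(name):
        return f"{name!r} is not a valid Python identifier"
    if name in RESERVED_NAMES:
        return f"{name!r} is a reserved name"
    return None


def validate_field(field):
    """Return an error message, or None if the field config looks acceptable."""
    if field.get("type") == "nested_list":
        if not field.get("selector") or not field.get("extract"):
            return "nested_list requires both selector and extract"
        return None
    if not field.get("css") and not field.get("xpath"):
        return "field needs a css or xpath selector"
    return None


print(validate_callback_name("parse-product"))
print(validate_field({"type": "nested_list", "selector": "div.review"}))
```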

Workflow

  1. Analyze page: scrapai analyze page.html
  2. Test selectors: scrapai analyze page.html --test "h1.title"
  3. Build callback config with processors
  4. Import spider: scrapai spiders import spider.json --project proj
  5. Test extraction: scrapai crawl spider --limit 5 --project proj
  6. View results: scrapai show spider 1 --project proj