Callbacks - ScrapAI

Callbacks extract custom fields from structured pages such as product pages, job postings, listings, or forum threads. Each callback defines a CSS or XPath selector, plus optional processors, for every field it extracts.

When to Use

Use callbacks for:

E-commerce (products, prices, ratings)
Job boards (titles, companies, salaries)
Real estate (properties, prices, features)
Forums (posts, authors, replies)
Any non-article structured data

Use parse_article for:

News, blogs, documentation
Content with title/content/author/date structure

CallbackSchema

extract

object

required

Field extraction rules (mapping of field name → FieldExtractSchema). Min 1 field required.

{
  "extract": {
    "name": {"css": "h1::text"},
    "price": {"css": "span.price::text"}
  }
}

FieldExtractSchema

css

string

default:"null"

CSS selector for extraction. Syntax: "h1" (element), "h1::text" (text), "img::attr(src)" (attribute)

{"css": "h1.product-name::text"}

xpath

string

default:"null"

XPath expression for extraction

{"xpath": "//h1/text()"}

get_all

boolean

default:"false"

Extract all matches (returns list)

{
  "css": "li.feature::text",
  "get_all": true
}

processors

ProcessorSchema[]

default:"null"

Value transformations applied sequentially. Available: strip, replace, regex, cast, join, default, lowercase, parse_datetime

{
  "css": "span.price::text",
  "processors": [
    {"type": "strip"},
    {"type": "regex", "pattern": "\\$([\\d,.]+)"},
    {"type": "replace", "old": ",", "new": ""},
    {"type": "cast", "to": "float"}
  ]
}
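To see what this chain does to a typical value, the steps can be traced in plain Python. This is only a sketch of the intended transformations, not ScrapAI's own implementation: the sample input string is made up, and the regex step is assumed to keep capture group 1 (the documented default for group).

```python
import re

raw = "  $1,299.99 \n"                        # hypothetical scraped text

value = raw.strip()                           # strip   -> "$1,299.99"
match = re.search(r"\$([\d,.]+)", value)      # regex   -> match on the price
value = match.group(1)                        #            group 1: "1,299.99"
value = value.replace(",", "")                # replace -> "1299.99"
price = float(value)                          # cast    -> 1299.99

print(price)
```

The order matters: casting before the comma is removed would fail, because float("1,299.99") raises a ValueError in Python.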

Nested Lists

type

string

default:"null"

Field type. Use "nested_list" for extracting lists of objects.

{"type": "nested_list"}

selector

string

default:"null"

CSS selector for nested list items. Required when type: "nested_list".

{
  "type": "nested_list",
  "selector": "div.review"
}

extract

object

default:"null"

Nested extraction config (field name → FieldExtractSchema). Required when type: "nested_list". Max depth: 3 levels.

{
  "type": "nested_list",
  "selector": "div.review",
  "extract": {
    "author": {"css": "span.author::text"},
    "rating": {
      "css": "span.stars::attr(data-rating)",
      "processors": [{"type": "cast", "to": "int"}]
    },
    "comment": {"css": "p.text::text"}
  }
}
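Because nested extraction is limited to 3 levels, it can help to check a config before importing it. The helper below is a hypothetical sketch, not part of ScrapAI; it assumes depth is counted as the number of nested_list levels stacked inside one another, which may differ from how the real validator counts. The div.reply selector is invented for the example.

```python
def nesting_depth(extract):
    """Return the number of stacked nested_list levels in an extract mapping."""
    deepest = 0
    for field in extract.values():
        if field.get("type") == "nested_list":
            # One level for this list, plus whatever is nested inside it.
            inner = nesting_depth(field.get("extract", {}))
            deepest = max(deepest, 1 + inner)
    return deepest


# Reviews that each contain a list of replies: two nested_list levels.
config = {
    "reviews": {
        "type": "nested_list",
        "selector": "div.review",
        "extract": {
            "author": {"css": "span.author::text"},
            "replies": {
                "type": "nested_list",
                "selector": "div.reply",  # hypothetical selector
                "extract": {"text": {"css": "p::text"}},
            },
        },
    }
}

depth = nesting_depth(config)
print(depth, "levels;", "ok" if depth <= 3 else "too deep")
```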

ProcessorSchema

type

string

required

Processor type: strip, replace, regex, cast, join, default, lowercase, parse_datetime

{"type": "strip"}

Processor parameters:

replace: old, new (strings)
regex: pattern (string), group (int, default: 1)
cast: to ("int"|"float"|"bool"|"str")
join: separator (string, default: " ")
default: default (any)
parse_datetime: format (string, optional)
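The parameters above can be read as arguments to a small sequential pipeline. The following is a minimal sketch of such a pipeline in plain Python, written only to illustrate how the parameters fit together. The function name apply_processors, the handling of missing values, the plain Python casts, and the ISO 8601 fallback for parse_datetime are all assumptions, not ScrapAI's documented behavior.

```python
import re
from datetime import datetime

CASTS = {"int": int, "float": float, "bool": bool, "str": str}


def apply_processors(value, processors):
    """Apply each processor dict to value in order (illustrative only)."""
    for proc in processors:
        kind = proc["type"]
        if kind == "default":
            # Assumption: the fallback replaces None or empty values.
            if value is None or value == "":
                value = proc["default"]
            continue
        if value is None:
            continue  # Assumption: other processors skip missing values.
        if kind == "strip":
            value = value.strip()
        elif kind == "lowercase":
            value = value.lower()
        elif kind == "replace":
            value = value.replace(proc["old"], proc["new"])
        elif kind == "regex":
            m = re.search(proc["pattern"], value)
            value = m.group(proc.get("group", 1)) if m else None
        elif kind == "cast":
            # Assumption: plain Python casts; bool("false") is True here,
            # and the real processor may treat such strings differently.
            value = CASTS[proc["to"]](value)
        elif kind == "join":
            value = proc.get("separator", " ").join(value)
        elif kind == "parse_datetime":
            fmt = proc.get("format")
            # Assumption: without a format, the value is ISO 8601.
            value = datetime.strptime(value, fmt) if fmt else datetime.fromisoformat(value)
        else:
            raise ValueError(f"unknown processor type: {kind}")
    return value


print(apply_processors(["first", "second"], [{"type": "join", "separator": ", "}]))
print(apply_processors(None, [{"type": "default", "default": "n/a"}]))
```

This sketch only handles string inputs for the string processors; a field extracted with get_all returns a list, so it would need a join step (or per-item handling) before strip or regex could apply.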

Examples

E-commerce Product

{
  "callbacks": {
    "parse_product": {
      "extract": {
        "name": {
          "css": "h1.product-name::text",
          "processors": [{"type": "strip"}]
        },
        "price": {
          "css": "span.price::text",
          "processors": [
            {"type": "strip"},
            {"type": "regex", "pattern": "\\$([\\d,.]+)"},
            {"type": "replace", "old": ",", "new": ""},
            {"type": "cast", "to": "float"}
          ]
        },
        "rating": {
          "css": "span.rating::attr(data-rating)",
          "processors": [{"type": "cast", "to": "float"}]
        },
        "features": {
          "css": "li.feature::text",
          "get_all": true
        }
      }
    }
  }
}

Job Listing

{
  "callbacks": {
    "parse_job": {
      "extract": {
        "title": {
          "css": "h1.job-title::text",
          "processors": [{"type": "strip"}]
        },
        "company": {
          "css": "span.company-name::text",
          "processors": [{"type": "strip"}]
        },
        "salary_min": {
          "css": "span.salary-min::text",
          "processors": [
            {"type": "regex", "pattern": "\\$([\\d,]+)"},
            {"type": "replace", "old": ",", "new": ""},
            {"type": "cast", "to": "int"}
          ]
        },
        "remote": {
          "css": "span.remote::text",
          "processors": [
            {"type": "lowercase"},
            {"type": "regex", "pattern": "(remote|work from home)"},
            {"type": "cast", "to": "bool"}
          ]
        },
        "requirements": {
          "css": "li.requirement::text",
          "get_all": true
        }
      }
    }
  }
}

Forum with Nested Comments

{
  "callbacks": {
    "parse_discussion": {
      "extract": {
        "story_title": {
          "css": "tr.athing td.title a::text"
        },
        "story_points": {
          "css": "span.score::text",
          "processors": [
            {"type": "regex", "pattern": "(\\d+)"}
          ]
        },
        "comments": {
          "type": "nested_list",
          "selector": "tr.athing.comtr",
          "extract": {
            "author": {"css": "a.hnuser::text"},
            "timestamp": {"css": "span.age::attr(title)"},
            "comment_text": {
              "css": "div.commtext",
              "get_all": true,
              "processors": [
                {"type": "join", "separator": " "}
              ]
            }
          }
        }
      }
    }
  }
}

Field Storage

Standard fields map to database columns: url, title, content, author, published_date
Custom fields are stored in the metadata_json column and flattened in exports
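One way to picture this split is the sketch below. It is not ScrapAI's storage code: the helper names split_item and flatten_for_export are hypothetical, and the exact JSON layout of metadata_json is an assumption. Only the five standard field names and the metadata_json column come from the description above.

```python
import json

STANDARD_FIELDS = {"url", "title", "content", "author", "published_date"}


def split_item(item):
    """Separate standard columns from custom fields (illustrative only)."""
    row = {k: v for k, v in item.items() if k in STANDARD_FIELDS}
    custom = {k: v for k, v in item.items() if k not in STANDARD_FIELDS}
    row["metadata_json"] = json.dumps(custom, sort_keys=True)
    return row


def flatten_for_export(row):
    """Merge custom fields back to the top level, as an export might."""
    flat = {k: v for k, v in row.items() if k != "metadata_json"}
    flat.update(json.loads(row["metadata_json"]))
    return flat


item = {
    "url": "https://example.com/p/1",  # made-up sample data
    "title": "Widget",
    "price": 9.5,
    "features": ["small", "blue"],
}
row = split_item(item)
print(row)
print(flatten_for_export(row))
```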

Reserved Names

Cannot use as callback names: parse_article, parse_start_url, start_requests, from_crawler, closed, parse

Validation

Callback names: Must be valid Python identifiers (e.g., parse_product, not parse-product)
Field extraction: Must have css/xpath selector OR type: "nested_list" with selector and extract
Cross-validation: Rules must reference defined callbacks
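These checks, together with the reserved names listed earlier, can be approximated before importing a spider. The sketch below is a hypothetical pre-flight check, not ScrapAI's validator: the function names and error messages are invented, and treating parse_article as a built-in that rules may always reference is an assumption.

```python
RESERVED = {"parse_article", "parse_start_url", "start_requests",
            "from_crawler", "closed", "parse"}
BUILTIN = {"parse_article"}  # Assumption: rules may reference this directly.


def validate_field(path, field):
    """Check one field: css/xpath, or nested_list with selector and extract."""
    if field.get("type") == "nested_list":
        if not field.get("selector") or not field.get("extract"):
            return [f"{path}: nested_list needs selector and extract"]
        return []
    if not (field.get("css") or field.get("xpath")):
        return [f"{path}: needs a css or xpath selector"]
    return []


def validate_callbacks(callbacks, rule_callbacks=()):
    """Return a list of error strings; an empty list means no problems found."""
    errors = []
    for name, callback in callbacks.items():
        if not name.isidentifier():
            errors.append(f"{name}: not a valid Python identifier")
        if name in RESERVED:
            errors.append(f"{name}: reserved name")
        extract = callback.get("extract") or {}
        if not extract:
            errors.append(f"{name}: extract needs at least one field")
        for field_name, field in extract.items():
            errors.extend(validate_field(f"{name}.{field_name}", field))
    for ref in rule_callbacks:
        if ref not in callbacks and ref not in BUILTIN:
            errors.append(f"rule references undefined callback: {ref}")
    return errors


problems = validate_callbacks(
    {"parse-product": {"extract": {"name": {}}}},
    rule_callbacks=["parse_job"],
)
for problem in problems:
    print(problem)
```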

Workflow

Analyze page: scrapai analyze page.html
Test selectors: scrapai analyze page.html --test "h1.title"
Build callback config with processors
Import spider: scrapai spiders import spider.json --project proj
Test extraction: scrapai crawl spider --limit 5 --project proj
View results: scrapai show spider 1 --project proj

Related

Spider Schema - Complete configuration
Rules - URL matching and routing
Custom Extractors - CSS selector details