Callbacks extract custom fields from structured data like products, jobs, listings, or forums. They define CSS/XPath selectors and processors for each field.

When to Use

Use callbacks for:
  • E-commerce (products, prices, ratings)
  • Job boards (titles, companies, salaries)
  • Real estate (properties, prices, features)
  • Forums (posts, authors, replies)
  • Any non-article structured data
Use parse_article for:
  • News, blogs, documentation
  • Content with title/content/author/date structure

CallbackSchema

extract (object, required)
Field extraction rules (mapping of field name → FieldExtractSchema).
Validation:
  • Min 1 field required
  • Each key is the field name
  • Each value is a FieldExtractSchema
Example:
{
  "extract": {
    "name": {"css": "h1::text"},
    "price": {"css": "span.price::text"}
  }
}

FieldExtractSchema

css (string, default: null)
CSS selector for extraction.
Syntax:
  • "h1" - Element
  • "h1::text" - Text content
  • "img::attr(src)" - Attribute value
  • "div.class > p" - Nested selector
Example:
{"css": "h1.product-name::text"}
xpath (string, default: null)
XPath expression for extraction.
Example:
{"xpath": "//h1/text()"}
get_all (boolean, default: false)
Extract all matches (returns a list).
Example:
{
  "css": "li.feature::text",
  "get_all": true
}
Output: ["WiFi", "Bluetooth", "GPS"]
processors (ProcessorSchema[], default: null)
Value transformations to apply sequentially.
Available processors:
  • strip - Remove whitespace
  • replace - Replace substring
  • regex - Extract with pattern
  • cast - Convert type (int, float, bool, str)
  • join - Join list to string
  • default - Fallback value
  • lowercase - Convert to lowercase
  • parse_datetime - Parse dates
Example:
{
  "css": "span.price::text",
  "processors": [
    {"type": "strip"},
    {"type": "regex", "pattern": "\\$([\\d,.]+)"},
    {"type": "replace", "old": ",", "new": ""},
    {"type": "cast", "to": "float"}
  ]
}
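
To make the sequential behavior concrete, here is a minimal Python sketch of how such a chain could be applied to one extracted value. It is an illustration under stated assumptions, not the library's implementation: `apply_processors` is a hypothetical helper, it covers only the four processors used in the example above, and returning None on a failed regex match is assumed behavior.

```python
import re

def apply_processors(value, processors):
    """Hypothetical helper: run each processor on the output of the previous one."""
    for proc in processors:
        kind = proc["type"]
        if kind == "strip":
            value = value.strip()
        elif kind == "regex":
            # Assumed behavior: keep the capture group (default 1), or None if no match.
            match = re.search(proc["pattern"], value)
            value = match.group(proc.get("group", 1)) if match else None
        elif kind == "replace":
            value = value.replace(proc["old"], proc["new"])
        elif kind == "cast":
            casts = {"int": int, "float": float, "bool": bool, "str": str}
            value = casts[proc["to"]](value)
        else:
            raise ValueError(f"processor not covered by this sketch: {kind}")
        if value is None:
            break  # nothing left to transform
    return value

processors = [
    {"type": "strip"},
    {"type": "regex", "pattern": r"\$([\d,.]+)"},
    {"type": "replace", "old": ",", "new": ""},
    {"type": "cast", "to": "float"},
]
price = apply_processors("  $1,299.99 ", processors)
print(price)  # 1299.99
```

Order matters in this chain: the comma has to be removed before the cast, since `float("1,299.99")` would raise an error.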

Nested Lists

type (string, default: null)
Field type (use "nested_list" to extract lists of objects).
Example:
{"type": "nested_list"}
selector (string, default: null)
CSS selector for nested list items.
Required when: type: "nested_list"
Example:
{
  "type": "nested_list",
  "selector": "div.review"
}
extract (object, default: null)
Nested extraction config (field name → FieldExtractSchema).
Required when: type: "nested_list"
Max depth: 3 levels
Example:
{
  "type": "nested_list",
  "selector": "div.review",
  "extract": {
    "author": {"css": "span.author::text"},
    "rating": {
      "css": "span.stars::attr(data-rating)",
      "processors": [{"type": "cast", "to": "int"}]
    },
    "comment": {"css": "p.text::text"}
  }
}
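
The depth limit can be checked before importing a config. The sketch below is one possible way to count nesting, assuming a top-level nested_list counts as level 1; the reference does not spell out how levels are counted. `nesting_depth` and the `div.reply` selector are hypothetical.

```python
def nesting_depth(field):
    """Count nested_list levels in a field config (assumed counting: top level = 1)."""
    if field.get("type") != "nested_list":
        return 0
    children = field.get("extract", {}).values()
    return 1 + max((nesting_depth(child) for child in children), default=0)

reviews = {
    "type": "nested_list",
    "selector": "div.review",
    "extract": {
        "author": {"css": "span.author::text"},
        "replies": {
            "type": "nested_list",
            "selector": "div.reply",  # hypothetical selector
            "extract": {"text": {"css": "p::text"}},
        },
    },
}

MAX_DEPTH = 3  # from "Max depth: 3 levels"
depth = nesting_depth(reviews)
print(depth, depth <= MAX_DEPTH)  # 2 True
```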

ProcessorSchema

type (string, required)
Processor type.
Allowed values:
  • strip, replace, regex, cast, join, default, lowercase, parse_datetime
Example:
{"type": "strip"}
Processor-specific parameters:
  • replace: old (string), new (string)
  • regex: pattern (string), group (int, default: 1)
  • cast: to ("int"|"float"|"bool"|"str")
  • join: separator (string, default: " ")
  • default: default (any)
  • parse_datetime: format (string, optional)
See complete processor reference in related docs.
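
As a rough illustration of the parameters and defaults listed above, the following Python sketch mimics join, default, lowercase, and parse_datetime. `run` is a hypothetical helper. Two behaviors are assumptions rather than documented facts: when the default fallback triggers, and what parse_datetime does without a format.

```python
from datetime import datetime

def run(proc, value):
    """Hypothetical single-step processor for join, default, lowercase, parse_datetime."""
    kind = proc["type"]
    if kind == "join":
        return proc.get("separator", " ").join(value)  # documented default: " "
    if kind == "default":
        # Assumption: the fallback applies when the value is missing or empty.
        return proc["default"] if value in (None, "", []) else value
    if kind == "lowercase":
        return value.lower()
    if kind == "parse_datetime":
        fmt = proc.get("format")
        # Assumption: without a format, fall back to ISO 8601 parsing.
        return datetime.strptime(value, fmt) if fmt else datetime.fromisoformat(value)
    raise ValueError(f"processor not covered by this sketch: {kind}")

joined = run({"type": "join"}, ["WiFi", "Bluetooth", "GPS"])
fallback = run({"type": "default", "default": "unknown"}, None)
parsed = run({"type": "parse_datetime", "format": "%d/%m/%Y"}, "05/03/2024")
print(joined, fallback, parsed.isoformat())
# WiFi Bluetooth GPS unknown 2024-03-05T00:00:00
```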

Examples

E-commerce Product

{
  "callbacks": {
    "parse_product": {
      "extract": {
        "name": {
          "css": "h1.product-name::text",
          "processors": [{"type": "strip"}]
        },
        "price": {
          "css": "span.price::text",
          "processors": [
            {"type": "strip"},
            {"type": "regex", "pattern": "\\$([\\d,.]+)"},
            {"type": "replace", "old": ",", "new": ""},
            {"type": "cast", "to": "float"}
          ]
        },
        "rating": {
          "css": "span.rating::attr(data-rating)",
          "processors": [{"type": "cast", "to": "float"}]
        },
        "features": {
          "css": "li.feature::text",
          "get_all": true
        }
      }
    }
  }
}

Job Listing

{
  "callbacks": {
    "parse_job": {
      "extract": {
        "title": {
          "css": "h1.job-title::text",
          "processors": [{"type": "strip"}]
        },
        "company": {
          "css": "span.company-name::text",
          "processors": [{"type": "strip"}]
        },
        "salary_min": {
          "css": "span.salary-min::text",
          "processors": [
            {"type": "regex", "pattern": "\\$([\\d,]+)"},
            {"type": "replace", "old": ",", "new": ""},
            {"type": "cast", "to": "int"}
          ]
        },
        "remote": {
          "css": "span.remote::text",
          "processors": [
            {"type": "lowercase"},
            {"type": "regex", "pattern": "(remote|work from home)"},
            {"type": "cast", "to": "bool"}
          ]
        },
        "requirements": {
          "css": "li.requirement::text",
          "get_all": true
        }
      }
    }
  }
}

Forum with Nested Comments

{
  "callbacks": {
    "parse_discussion": {
      "extract": {
        "story_title": {
          "css": "tr.athing td.title a::text"
        },
        "story_points": {
          "css": "span.score::text",
          "processors": [
            {"type": "regex", "pattern": "(\\d+)"},
            {"type": "cast", "to": "int"}
          ]
        },
        "comments": {
          "type": "nested_list",
          "selector": "tr.athing.comtr",
          "extract": {
            "author": {"css": "a.hnuser::text"},
            "timestamp": {"css": "span.age::attr(title)"},
            "comment_text": {
              "css": "div.commtext",
              "get_all": true,
              "processors": [
                {"type": "join", "separator": " "}
              ]
            }
          }
        }
      }
    }
  }
}

Field Storage

Standard Fields

These fields map to database columns:
  • url → scraped_items.url
  • title → scraped_items.title
  • content → scraped_items.content
  • author → scraped_items.author
  • published_date → scraped_items.published_date

Custom Fields

All other fields (for example price, rating, category) are stored in the metadata_json column. They are:
  • Displayed by the show command
  • Flattened in exports (CSV/JSON/JSONL)
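
The split between standard and custom fields can be pictured with a short Python sketch. `split_for_storage` is a hypothetical helper, and serializing the custom fields with `json.dumps` is an assumption about how the metadata_json column is filled.

```python
import json

STANDARD_FIELDS = ("url", "title", "content", "author", "published_date")

def split_for_storage(item):
    """Hypothetical helper: separate standard columns from custom metadata."""
    columns = {k: v for k, v in item.items() if k in STANDARD_FIELDS}
    custom = {k: v for k, v in item.items() if k not in STANDARD_FIELDS}
    # Assumption: custom fields are serialized as one JSON document.
    columns["metadata_json"] = json.dumps(custom, sort_keys=True)
    return columns

row = split_for_storage({
    "url": "https://example.com/product/1",
    "title": "Example Phone",
    "price": 1299.99,
    "features": ["WiFi", "Bluetooth", "GPS"],
})
print(row["metadata_json"])
# {"features": ["WiFi", "Bluetooth", "GPS"], "price": 1299.99}
```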

Reserved Names

The following names are reserved and cannot be used as callback names:
  • parse_article
  • parse_start_url
  • start_requests
  • from_crawler
  • closed
  • parse
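
A pre-import check for callback names might look like the Python sketch below. The reserved set is taken from the list above. The regular expression is an assumption inferred from the naming rules under Validation (start with a letter or underscore, no hyphens or spaces), and `is_valid_callback_name` is a hypothetical helper.

```python
import re

RESERVED = {"parse_article", "parse_start_url", "start_requests",
            "from_crawler", "closed", "parse"}
# Assumed pattern: letter or underscore first, then letters, digits, underscores.
NAME_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def is_valid_callback_name(name):
    """True if the name has the assumed shape and is not reserved."""
    return bool(NAME_RE.fullmatch(name)) and name not in RESERVED

for name in ["parse_product", "parse-product", "123parse", "parse article", "parse"]:
    print(name, is_valid_callback_name(name))
# parse_product True
# parse-product False
# 123parse False
# parse article False
# parse False
```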

Validation

Callback Name

# Valid
"parse_product", "extract_job", "get_listing"

# Invalid
"parse-product"  # No hyphens
"123parse"       # Must start with letter/underscore
"parse article"  # No spaces

Field Extraction

Each field must have either:
  • css or xpath selector, OR
  • type: "nested_list" with selector and extract
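
The rule above can be expressed as a small Python check. This is a sketch of the stated rule only; `validate_field` and its error messages are hypothetical, and the real validator may report problems differently.

```python
def validate_field(name, field):
    """Hypothetical check mirroring the rule above; returns a list of error strings."""
    if field.get("type") == "nested_list":
        missing = [key for key in ("selector", "extract") if not field.get(key)]
        return [f"{name}: nested_list requires '{key}'" for key in missing]
    if not (field.get("css") or field.get("xpath")):
        return [f"{name}: needs a css or xpath selector"]
    return []

ok = validate_field("name", {"css": "h1::text"})
no_selector = validate_field("price", {"processors": [{"type": "strip"}]})
no_extract = validate_field("reviews", {"type": "nested_list", "selector": "div.review"})
print(ok)           # []
print(no_selector)  # ['price: needs a css or xpath selector']
print(no_extract)   # ["reviews: nested_list requires 'extract'"]
```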

Cross-Validation

Rules must reference defined callbacks:
{
  "rules": [
    {"allow": ["/product/.*"], "callback": "parse_product"}
  ],
  "callbacks": {
    "parse_product": { /* ... */ }
  }
}
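
A cross-validation pass over a spider config could be sketched in Python as follows. `undefined_callbacks` is a hypothetical helper, the `/job/.*` rule is made up for the demonstration, and treating parse_article as built in (so it needs no entry under "callbacks") is an assumption.

```python
BUILT_IN = {"parse_article"}  # assumption: built in, needs no definition

def undefined_callbacks(config):
    """Return callback names referenced by rules but not defined under "callbacks"."""
    defined = set(config.get("callbacks", {})) | BUILT_IN
    referenced = {rule["callback"] for rule in config.get("rules", []) if "callback" in rule}
    return sorted(referenced - defined)

config = {
    "rules": [
        {"allow": ["/product/.*"], "callback": "parse_product"},
        {"allow": ["/job/.*"], "callback": "parse_job"},  # hypothetical rule
    ],
    "callbacks": {"parse_product": {"extract": {"name": {"css": "h1::text"}}}},
}
missing = undefined_callbacks(config)
print(missing)  # ['parse_job']
```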

Workflow

  1. Analyze page: scrapai analyze page.html
  2. Test selectors: scrapai analyze page.html --test "h1.title"
  3. Build callback config with processors
  4. Import spider: scrapai spiders import spider.json --project proj
  5. Test extraction: scrapai crawl spider --limit 5 --project proj
  6. View results: scrapai show spider 1 --project proj