Custom callbacks extract custom fields from any structured data, not just articles. They suit e-commerce sites, job boards, real estate listings, forums, and any other non-article content.

When to Use

Custom callbacks are ideal for:
  • E-commerce (products, prices, ratings)
  • Job boards (titles, companies, salaries)
  • Real estate (properties, prices, features)
  • Forums (posts, authors, replies)
  • Any non-article structured data

Basic Structure

spider.json
{
  "rules": [
    {
      "allow": ["/product/.*"],
      "callback": "parse_product"
    }
  ],
  "callbacks": {
    "parse_product": {
      "extract": {
        "name": {"css": "h1::text"},
        "price": {
          "css": "span.price::text",
          "processors": [
            {"type": "strip"},
            {"type": "regex", "pattern": "\\$([\\d.]+)"},
            {"type": "cast", "to": "float"}
          ]
        }
      }
    }
  }
}
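
To illustrate how a rule routes a page to a callback, here is a minimal Python sketch. It assumes that each `allow` entry is a regular expression searched within the URL, which is how Scrapy-style rules usually behave; `pick_callback` is a hypothetical helper for illustration, not part of ScrapAI.

```python
import re

# Rule copied from the config above; allow patterns are assumed to be regexes.
RULES = [{"allow": ["/product/.*"], "callback": "parse_product"}]

def pick_callback(url, rules):
    """Return the callback of the first rule with a matching allow pattern, else None."""
    for rule in rules:
        if any(re.search(pattern, url) for pattern in rule["allow"]):
            return rule["callback"]
    return None

print(pick_callback("https://example.com/product/blue-widget", RULES))
print(pick_callback("https://example.com/about", RULES))
```

Under these assumptions the product URL resolves to `parse_product` and the about page matches no rule.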

Field Extraction

Selectors

{"css": "h1::text"}

Nested Lists

Extract complex nested data structures:
{
  "reviews": {
    "type": "nested_list",
    "selector": "div.review",
    "extract": {
      "author": {"css": "span.author::text"},
      "rating": {
        "css": "span.stars::attr(data-rating)",
        "processors": [{"type": "cast", "to": "int"}]
      },
      "comment": {"css": "p.text::text"}
    }
  }
}
Max nesting depth: 3 levels
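
As a rough picture of the data a nested list describes, the sketch below builds one dict per `div.review` element. It is only an approximation: it uses Python's standard `xml.etree.ElementTree` on well-formed sample markup instead of CSS selectors, and the list-of-dicts output shape is an assumption rather than documented ScrapAI behavior. `extract_reviews` and the sample page are hypothetical.

```python
import xml.etree.ElementTree as ET

# Well-formed sample markup; real pages would need a proper HTML parser.
PAGE = """
<div>
  <div class="review">
    <span class="author">Ana</span>
    <span class="stars" data-rating="5"></span>
    <p class="text">Great value.</p>
  </div>
  <div class="review">
    <span class="author">Ben</span>
    <span class="stars" data-rating="3"></span>
    <p class="text">Does the job.</p>
  </div>
</div>
"""

def extract_reviews(markup):
    """Return one dict per review node, mirroring the nested 'extract' block."""
    root = ET.fromstring(markup.strip())
    reviews = []
    for node in root.findall(".//div[@class='review']"):
        reviews.append({
            "author": node.find("span[@class='author']").text,
            # Equivalent of the cast-to-int processor on data-rating.
            "rating": int(node.find("span[@class='stars']").get("data-rating")),
            "comment": node.find("p[@class='text']").text,
        })
    return reviews

reviews = extract_reviews(PAGE)
print(reviews)
```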

Field Processors

Eight processors are available for data transformation:
  • strip: Remove whitespace
  • replace: Replace substring
  • regex: Extract with pattern
  • cast: Convert type (int, float, bool, str)
  • join: Join list to string
  • default: Fallback value
  • lowercase: Convert to lowercase
  • parse_datetime: Parse dates (stores as ISO strings)
See Data Processors for complete reference with examples.
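
To show what "stores as ISO strings" can look like in practice, this sketch parses a timestamp and serializes it back to ISO 8601 with Python's standard library. Which input formats the real parse_datetime processor accepts is not stated here, so the sketch only handles ISO input; `to_iso` is a hypothetical name.

```python
from datetime import datetime

def to_iso(value):
    """Parse an ISO 8601 timestamp and return it as an ISO string."""
    return datetime.fromisoformat(value).isoformat()

print(to_iso("2024-03-15T10:30:00+00:00"))
print(to_iso("2024-03-15"))
```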

Chaining Processors

Processors execute in order, each passing its output to the next:
{
  "price": {
    "css": "span.price::text",
    "processors": [
      {"type": "strip"},
      {"type": "regex", "pattern": "\\$([\\d.]+)"},
      {"type": "cast", "to": "float"}
    ]
  }
}
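
The chain above can be pictured as a simple loop over functions. This Python sketch is an illustration only, not ScrapAI's implementation: the helper names are hypothetical, and two behaviors are assumptions, namely that regex returns the first capture group and that the chain stops once a step yields None.

```python
import re

def strip(value):
    return value.strip()

def regex(value, pattern):
    """Return the first capture group (or the whole match), or None if nothing matches."""
    match = re.search(pattern, value)
    if match is None:
        return None
    return match.group(1) if match.groups() else match.group(0)

def cast_float(value):
    return float(value)

def run_chain(value, steps):
    """Feed each step's output into the next; stop early on None (assumed behavior)."""
    for step in steps:
        if value is None:
            break
        value = step(value)
    return value

steps = [strip, lambda v: regex(v, r"\$([\d.]+)"), cast_float]
price = run_chain("  $19.99 ", steps)
print(price)
```

Because each step only sees the previous step's output, order matters: casting before the regex has removed the dollar sign would fail.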

Templates

Complete working examples in templates/:

  • E-commerce: templates/spider-ecommerce.json (product pages with prices, ratings, stock)
  • Job Boards: templates/spider-jobs.json (job listings with companies, salaries)
  • Real Estate: templates/spider-realestate.json (property listings with prices, features)
Use templates as starting points - adjust selectors to match your target site.

Common Patterns

Extract Price

{
  "price": {
    "css": "span.price::text",
    "processors": [
      {"type": "strip"},
      {"type": "regex", "pattern": "\\$([\\d,.]+)"},
      {"type": "replace", "old": ",", "new": ""},
      {"type": "cast", "to": "float"}
    ]
  }
}
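
Tracing a sample value through the four steps shows why the replace step sits before the cast. This is a plain-Python walk-through with a made-up input string, not ScrapAI code.

```python
import re

raw = "  $1,299.99 "

stripped = raw.strip()                                    # strip
captured = re.search(r"\$([\d,.]+)", stripped).group(1)   # regex, first capture group
no_commas = captured.replace(",", "")                     # replace "," with ""
price = float(no_commas)                                  # cast to float

print(stripped, captured, no_commas, price)
```

Without the replace step, the cast would receive a string that still contains a thousands separator, and Python's `float` rejects such input.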

Extract Boolean

{
  "in_stock": {
    "css": "span.availability::text",
    "processors": [
      {"type": "lowercase"},
      {"type": "regex", "pattern": "(yes|true|available)"},
      {"type": "cast", "to": "bool"}
    ]
  }
}
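
The sketch below mimics the lowercase, regex, and bool-cast steps in plain Python. Whether a failed match produces False or None in ScrapAI is not stated here, so returning False is an assumption, and `in_stock` is a hypothetical function.

```python
import re

def in_stock(text, pattern=r"(yes|true|available)"):
    """Lowercase the text, look for a keyword, and cast the match to bool."""
    match = re.search(pattern, text.lower())
    # A non-empty captured string is truthy; no match is treated as False (assumption).
    return bool(match.group(1)) if match else False

print(in_stock("In Stock: Available"))
print(in_stock("Sold out"))
print(in_stock("Unavailable"))
```

In this sketch "Unavailable" also counts as in stock, because the pattern finds "available" inside it. Anchoring the pattern with word boundaries, for example `\b(yes|true|available)\b`, avoids that false positive.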

Handle Missing Fields

{
  "optional_field": {
    "css": "span.optional::text",
    "processors": [
      {"type": "strip"},
      {"type": "default", "default": null}
    ]
  }
}
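
A small sketch of the strip-then-default pattern follows. It assumes that a selector with no match yields None and that an empty string also counts as missing; both are assumptions, and the function names are hypothetical.

```python
def strip(value):
    """Strip strings; pass other values (such as None) through unchanged."""
    return value.strip() if isinstance(value, str) else value

def default(value, fallback):
    """Return the fallback when the value is None or empty (assumed rule)."""
    return fallback if value in (None, "") else value

def extract_optional(raw, fallback=None):
    return default(strip(raw), fallback)

print(extract_optional("  limited edition "))
print(extract_optional(None, "N/A"))
```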

Storage Behavior

  • Standard fields (url, title, content, author, published_date) → Main DB columns
  • Custom fields → metadata_json column
  • show command displays custom fields
  • Exports flatten custom fields to top-level columns/keys
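
The storage split and the export flattening can be sketched as two small functions. The standard field names come from the list above; everything else (function names, the exact JSON layout inside metadata_json) is an assumption for illustration.

```python
import json

STANDARD_FIELDS = {"url", "title", "content", "author", "published_date"}

def split_item(item):
    """Keep standard fields as columns and pack the rest into metadata_json."""
    row = {k: v for k, v in item.items() if k in STANDARD_FIELDS}
    custom = {k: v for k, v in item.items() if k not in STANDARD_FIELDS}
    row["metadata_json"] = json.dumps(custom, sort_keys=True)
    return row

def flatten_for_export(row):
    """Lift custom fields out of metadata_json to top-level keys."""
    flat = {k: v for k, v in row.items() if k != "metadata_json"}
    flat.update(json.loads(row.get("metadata_json") or "{}"))
    return flat

item = {"url": "https://example.com/product/1", "title": "Widget",
        "price": 19.99, "brand": "Acme"}
row = split_item(item)
print(row["metadata_json"])
print(flatten_for_export(row))
```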

Workflow

1. Analyze sample page

./scrapai analyze page.html
Discover the page structure and identify fields.

2. Test selectors

./scrapai analyze page.html --test "h1::text"
./scrapai analyze page.html --test "span.price::text"
Verify that the selectors extract the correct data.

3. Build callback config

Create a callback with selectors and processors.

4. Test on multiple pages

Verify that the selectors work across different pages.

5. Import and test

./scrapai crawl spider --limit 5 --project proj
Run a test crawl to validate extraction.

Complete Examples

E-commerce Product

{
  "name": "mystore",
  "allowed_domains": ["example.com"],
  "start_urls": ["https://example.com/products"],
  "rules": [
    {
      "allow": ["/product/[^/]+$"],
      "callback": "parse_product",
      "follow": false
    }
  ],
  "callbacks": {
    "parse_product": {
      "extract": {
        "title": {"css": "h1.product-name::text"},
        "content": {"css": "div.product-description::text"},
        "price": {
          "css": "span.price-value::text",
          "processors": [
            {"type": "strip"},
            {"type": "regex", "pattern": "\\$([\\d,.]+)"},
            {"type": "replace", "old": ",", "new": ""},
            {"type": "cast", "to": "float"}
          ]
        },
        "rating": {
          "css": "div.star-rating::attr(data-rating)",
          "processors": [{"type": "cast", "to": "float"}]
        },
        "stock": {"css": "span.availability::text"},
        "brand": {"css": "div.brand-name::text"}
      }
    }
  }
}

Job Listing

{
  "name": "jobboard",
  "allowed_domains": ["jobs.example.com"],
  "start_urls": ["https://jobs.example.com/listings"],
  "rules": [
    {
      "allow": ["/job/[^/]+$"],
      "callback": "parse_job",
      "follow": false
    }
  ],
  "callbacks": {
    "parse_job": {
      "extract": {
        "title": {"css": "h1.job-title::text"},
        "company": {"css": "span.company-name::text"},
        "content": {"css": "div.job-description::text"},
        "salary": {"css": "span.salary-range::text"},
        "location": {"css": "span.location::text"},
        "job_type": {"css": "span.job-type::text"},
        "date": {
          "css": "time.posted-date::attr(datetime)",
          "processors": [{"type": "parse_datetime"}]
        }
      }
    }
  }
}

Forum Posts

{
  "name": "forum",
  "allowed_domains": ["forum.example.com"],
  "start_urls": ["https://forum.example.com/threads"],
  "rules": [
    {
      "allow": ["/thread/[^/]+$"],
      "callback": "parse_thread",
      "follow": false
    }
  ],
  "callbacks": {
    "parse_thread": {
      "extract": {
        "title": {"css": "h1.thread-title::text"},
        "author": {"css": "span.username::text"},
        "content": {"css": "div.post-content::text"},
        "date": {
          "css": "time.post-date::attr(datetime)",
          "processors": [{"type": "parse_datetime"}]
        },
        "upvotes": {
          "css": "span.vote-count::text",
          "processors": [{"type": "cast", "to": "int"}]
        },
        "category": {"css": "a.category-link::text"}
      }
    }
  }
}

Reserved Names

Never use these reserved callback names:
  • parse_article
  • parse_start_url
  • start_requests
  • from_crawler
  • closed
  • parse
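
A config can be checked for reserved names before it is imported. The following is a hypothetical pre-flight check written for this page, not a ScrapAI command; it only uses the reserved names listed above.

```python
RESERVED = {
    "parse_article", "parse_start_url", "start_requests",
    "from_crawler", "closed", "parse",
}

def reserved_callbacks(config):
    """Return the callback names in the config that collide with reserved names."""
    return sorted(name for name in config.get("callbacks", {}) if name in RESERVED)

good = {"callbacks": {"parse_product": {"extract": {}}}}
bad = {"callbacks": {"parse": {"extract": {}}, "closed": {"extract": {}}}}

print(reserved_callbacks(good))
print(reserved_callbacks(bad))
```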

Troubleshooting

Field Returns None

1. Test selector

./scrapai analyze page.html --test "your-selector"

2. Check if page needs browser rendering

./scrapai inspect https://example.com --project proj --browser

3. Verify processor chain

Check whether a processor is failing and returning None.

Wrong Type in Output

Add cast processor to convert type:
{"processors": [{"type": "cast", "to": "float"}]}

Rule References Undefined Callback

1. Add callback to callbacks dict

Ensure the callback is defined in the callbacks section.

2. Or use null for navigation-only rules

{"allow": ["/category/"], "callback": null, "follow": true}
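
This kind of mismatch can be caught before a crawl with a small check over the config. The sketch below is a hypothetical helper, not part of ScrapAI: it treats a null (None) callback as navigation-only and does not special-case any built-in callback names.

```python
def undefined_callbacks(config):
    """Return callback names used by rules but missing from the callbacks section."""
    defined = set(config.get("callbacks", {}))
    missing = []
    for rule in config.get("rules", []):
        name = rule.get("callback")
        # None means navigation-only, so there is nothing to look up.
        if name is not None and name not in defined:
            missing.append(name)
    return missing

config = {
    "rules": [
        {"allow": ["/category/"], "callback": None, "follow": True},
        {"allow": ["/product/[^/]+$"], "callback": "parse_product"},
        {"allow": ["/review/[^/]+$"], "callback": "parse_review"},
    ],
    "callbacks": {"parse_product": {"extract": {}}},
}

print(undefined_callbacks(config))
```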