Processors transform extracted values (strip whitespace, cast types, apply regex, etc.).

Available Processors

  • strip: Remove leading and trailing whitespace
  • replace: Replace a substring in strings
  • regex: Extract a substring using a pattern
  • cast: Convert to a specified type
  • join: Join list values into a string
  • default: Return a fallback value if the input is empty
  • lowercase: Convert strings to lowercase
  • parse_datetime: Parse a datetime into ISO format

Processor Reference

1. strip

Remove leading and trailing whitespace from strings.
Parameters: None
Example:
{
  "css": "h1::text",
  "processors": [{"type": "strip"}]
}
Transformation:
"  Hello World  " → "Hello World"
Works on: Strings and lists of strings
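
To make the "strings and lists of strings" behavior concrete, here is a minimal Python sketch of what a strip step might do. The function name strip_value is hypothetical and this is not the tool's actual source; the pass-through for unsupported types follows the Error Handling section.

```python
def strip_value(value):
    """Strip whitespace from a string, or from each string in a list."""
    if isinstance(value, str):
        return value.strip()
    if isinstance(value, list):
        # Non-string items are left untouched.
        return [v.strip() if isinstance(v, str) else v for v in value]
    # Unsupported types pass through unchanged.
    return value


print(strip_value("  Hello World  "))       # Hello World
print(strip_value([" WiFi ", "GPS  "]))     # ['WiFi', 'GPS']
```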

2. replace

Replace a substring in strings.
Parameters:
  • old (required): Substring to replace
  • new (required): Replacement string
Example:
{
  "css": "span.price::text",
  "processors": [
    {"type": "replace", "old": "$", "new": ""},
    {"type": "replace", "old": ",", "new": ""}
  ]
}
Transformation:
"$1,299.99" → "1299.99"
Works on: Strings and lists of strings

3. regex

Extract a substring using a regular expression pattern.
Parameters:
  • pattern (required): Regex pattern to match
  • group (optional): Capture group to extract (default: 1)
Example:
{
  "css": "span::text",
  "processors": [
    {"type": "regex", "pattern": "Price: \\$([\\d.]+)"}
  ]
}
Transformation:
"Price: $99.99" → "99.99"
To select a capture group explicitly, set group:
{"type": "regex", "pattern": "(\\d+) items", "group": 1}
Returns the original value if the pattern does not match.
Works on: Strings only
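
The extraction rule can be sketched in Python as follows. This is an illustration under assumptions, not the tool's implementation: the name regex_extract is hypothetical, and search semantics (the pattern may match anywhere in the string) are inferred from the examples on this page.

```python
import re


def regex_extract(value, pattern, group=1):
    """Return the requested capture group, or the input unchanged."""
    if not isinstance(value, str):
        return value  # the processor is documented for strings only
    match = re.search(pattern, value)
    if match is None:
        return value  # no match: the original value is returned
    return match.group(group)


print(regex_extract("Price: $99.99", r"Price: \$([\d.]+)"))  # 99.99
print(regex_extract("out of stock", r"(\d+) items"))         # out of stock
```

Because a failed match returns the input rather than None, a later cast step is what turns unmatched text into None.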

4. cast

Convert a value to the specified type.
Parameters:
  • to (required): Target type - "int", "float", "bool", or "str"
Example:
{
  "css": "span.rating::attr(data-rating)",
  "processors": [
    {"type": "cast", "to": "float"}
  ]
}
Transformations:
"4.5" → 4.5 (float)
"42" → 42 (int)
"true" → True (bool)
Boolean conversion:
  • true, 1, yes, on → True
  • Everything else → False
Returns None if conversion fails.
Works on: Any type
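
A rough Python sketch of the conversion rules above may help when predicting output. The name cast_value is hypothetical, and the case-insensitive comparison for booleans is an assumption; the page only lists the accepted words.

```python
TRUTHY = {"true", "1", "yes", "on"}


def cast_value(value, to):
    """Convert value to the target type; return None if conversion fails."""
    try:
        if to == "int":
            return int(value)
        if to == "float":
            return float(value)
        if to == "str":
            return str(value)
        if to == "bool":
            # Assumption: comparison ignores case and surrounding whitespace.
            return str(value).strip().lower() in TRUTHY
    except (TypeError, ValueError):
        return None
    return None  # unknown target type (assumed behavior)


print(cast_value("4.5", "float"))      # 4.5
print(cast_value("in stock", "bool"))  # False
print(cast_value("abc", "int"))        # None
```

Note that under these rules arbitrary non-empty text such as "in stock" casts to False, not True.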

5. join

Join list values into a single string.
Parameters:
  • separator (optional): String to join with (default: " ")
Example:
{
  "css": "li.feature::text",
  "get_all": true,
  "processors": [
    {"type": "join", "separator": ", "}
  ]
}
Transformation:
["WiFi", "Bluetooth", "GPS"] → "WiFi, Bluetooth, GPS"
Filters out None values automatically.
Works on: Lists only
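
The None filtering can be pictured with a short Python sketch. The function name join_values is hypothetical; the default separator of a single space and the pass-through for non-list input come from this page.

```python
def join_values(value, separator=" "):
    """Join list items into one string, skipping None entries."""
    if not isinstance(value, list):
        return value  # non-list input is returned unchanged
    return separator.join(str(v) for v in value if v is not None)


print(join_values(["WiFi", None, "Bluetooth", "GPS"], ", "))
# WiFi, Bluetooth, GPS
```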

6. default

Return a default value if the input is None, an empty string, or an empty list.
Parameters:
  • default (required): Fallback value
Example:
{
  "css": "span.optional::text",
  "processors": [
    {"type": "default", "default": "N/A"}
  ]
}
Transformations:
None → "N/A"
"" → "N/A"
[] → "N/A"
"actual value" → "actual value"
Works on: Any type
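
The emptiness test described above covers exactly three cases, which the following Python sketch spells out. The name apply_default is hypothetical. Values such as 0 or False are not listed as empty on this page, so the sketch keeps them.

```python
def apply_default(value, default):
    """Return default when value is None, "" or []; otherwise return value."""
    if value is None or value == "" or value == []:
        return default
    return value


print(apply_default(None, "N/A"))            # N/A
print(apply_default("actual value", "N/A"))  # actual value
print(apply_default(0, "N/A"))               # 0
```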

7. lowercase

Convert strings to lowercase.
Parameters: None
Example:
{
  "css": "span.status::text",
  "processors": [
    {"type": "strip"},
    {"type": "lowercase"}
  ]
}
Transformation:
"IN STOCK" → "in stock"
Works on: Strings and lists of strings

8. parse_datetime

Parse a datetime string into ISO format.
Parameters:
  • format (optional): strptime format string (if omitted, the dateutil parser is used for flexible parsing)
Example with format:
{
  "css": "time.date::attr(datetime)",
  "processors": [
    {"type": "parse_datetime", "format": "%Y-%m-%d"}
  ]
}
Example without format (auto-detect):
{
  "css": "span.date::text",
  "processors": [
    {"type": "parse_datetime"}
  ]
}
Transformations:
"2024-02-24" → "2024-02-24T00:00:00" (ISO format)
"February 24, 2024" → "2024-02-24T00:00:00"
"24/02/2024" → "2024-02-24T00:00:00" (auto-detected)
The result is stored in the database as an ISO string (serialized automatically).
Returns None if parsing fails.
Works on: Strings only
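
For the explicit-format path, the behavior can be approximated with the standard library alone. This is a sketch, not the tool's code: parse_with_format is a hypothetical name, and the dateutil fallback is left out because dateutil is a third-party package.

```python
from datetime import datetime


def parse_with_format(value, fmt):
    """Parse value with a strptime format and return an ISO string, or None."""
    try:
        return datetime.strptime(value, fmt).isoformat()
    except (TypeError, ValueError):
        return None  # wrong type or text that does not fit the format


print(parse_with_format("2024-02-24", "%Y-%m-%d"))  # 2024-02-24T00:00:00
print(parse_with_format("24/02/2024", "%d/%m/%Y"))  # 2024-02-24T00:00:00
print(parse_with_format("not a date", "%Y-%m-%d"))  # None
```

An explicit format is the safer choice for numeric dates such as 03/04/2024, where day and month cannot be told apart from the text alone.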

Processor Chaining

Processors run sequentially. Output of one becomes input to the next.

Example 1: Clean and Convert Price

{
  "css": "span.price::text",
  "processors": [
    {"type": "strip"},                           // "  $99.99  " → "$99.99"
    {"type": "replace", "old": "$", "new": ""},  // "$99.99" → "99.99"
    {"type": "cast", "to": "float"}              // "99.99" → 99.99
  ]
}
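
The sequential flow in Example 1 can be traced with a few lines of Python. The helper run_chain and the lambda steps are stand-ins chosen for illustration; they mimic strip, replace, and cast rather than call the tool itself.

```python
def run_chain(value, steps):
    """Feed the output of each step into the next one."""
    for step in steps:
        value = step(value)
    return value


steps = [
    lambda v: v.strip(),           # "  $99.99  " -> "$99.99"
    lambda v: v.replace("$", ""),  # "$99.99" -> "99.99"
    float,                         # "99.99" -> 99.99
]

result = run_chain("  $99.99  ", steps)
print(result)  # 99.99
```

Order matters: moving the float step first would fail on "  $99.99  " because the dollar sign is still present.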

Example 2: Extract Rating Number

{
  "css": "div.rating::text",
  "processors": [
    {"type": "strip"},                           // "  Rating: 4.5 stars  " → "Rating: 4.5 stars"
    {"type": "regex", "pattern": "([\\d.]+)"},   // "Rating: 4.5 stars" → "4.5"
    {"type": "cast", "to": "float"}              // "4.5" → 4.5
  ]
}

Example 3: Normalize Text

{
  "css": "span.status::text",
  "processors": [
    {"type": "strip"},
    {"type": "lowercase"},
    {"type": "replace", "old": " ", "new": "_"}
  ]
}
Input: " In Stock "
Output: "in_stock"

Example 4: Handle Missing Values

{
  "css": "span.optional-field::text",
  "processors": [
    {"type": "strip"},
    {"type": "default", "default": "Not specified"}
  ]
}

Common Patterns

Extracting Currency Values

{
  "price": {
    "css": "span.price::text",
    "processors": [
      {"type": "strip"},
      {"type": "regex", "pattern": "\\$([\\d,.]+)"},
      {"type": "replace", "old": ",", "new": ""},
      {"type": "cast", "to": "float"}
    ]
  }
}
Handles: "$1,299.99", "Price: $99", " $42.50 "
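
To check the pattern against the three inputs listed above, the same steps can be reproduced in Python. The function parse_price is a hypothetical stand-in for the strip, regex, replace, and cast chain.

```python
import re

PATTERN = r"\$([\d,.]+)"  # same pattern as in the config, without JSON escaping


def parse_price(text):
    """Mimic strip -> regex -> replace(",") -> cast(float)."""
    match = re.search(PATTERN, text.strip())
    if match is None:
        return None  # in the real chain, cast would fail on the unmatched text
    try:
        return float(match.group(1).replace(",", ""))
    except ValueError:
        return None


for sample in ["$1,299.99", "Price: $99", " $42.50 "]:
    print(sample, "->", parse_price(sample))
```

The chain assumes a comma thousands separator and a dot decimal separator; a price written as "$1.299,99" would come out as 1.29999.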

Extracting Numbers from Text

{
  "quantity": {
    "css": "div.quantity::text",
    "processors": [
      {"type": "regex", "pattern": "(\\d+)"},
      {"type": "cast", "to": "int"}
    ]
  }
}
Handles: "23 items", "Quantity: 5", "42"

Boolean Fields

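{
  "in_stock": {
    "css": "span.availability::text",
    "processors": [
      {"type": "lowercase"},
      {"type": "regex", "pattern": "(in stock|available)"},
      {"type": "replace", "old": "in stock", "new": "true"},
      {"type": "replace", "old": "available", "new": "true"},
      {"type": "cast", "to": "bool"}
    ]
  }
}
Returns: True if the text contains "in stock" or "available", else False. The replace steps map the matched text to "true" because cast only treats true, 1, yes, and on as True.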

Date Fields

{
  "published_date": {
    "css": "time::attr(datetime)",
    "processors": [
      {"type": "parse_datetime"}
    ]
  }
}
Auto-detects the format and stores the result as an ISO string.

Lists to Comma-Separated String

{
  "tags": {
    "css": "li.tag::text",
    "get_all": true,
    "processors": [
      {"type": "join", "separator": ", "}
    ]
  }
}
Input: ["Python", "Web Scraping", "Automation"]
Output: "Python, Web Scraping, Automation"

Complete Examples

E-commerce Product

{
  "callbacks": {
    "parse_product": {
      "extract": {
        "name": {
          "css": "h1.product-name::text",
          "processors": [{"type": "strip"}]
        },
        "price": {
          "css": "span.price::text",
          "processors": [
            {"type": "strip"},
            {"type": "regex", "pattern": "\\$([\\d,.]+)"},
            {"type": "replace", "old": ",", "new": ""},
            {"type": "cast", "to": "float"}
          ]
        },
        "rating": {
          "css": "div.rating::attr(data-rating)",
          "processors": [{"type": "cast", "to": "float"}]
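        },
        "in_stock": {
          "css": "span.availability::text",
          "processors": [
            {"type": "lowercase"},
            {"type": "regex", "pattern": "(in stock|available)"},
            {"type": "replace", "old": "in stock", "new": "true"},
            {"type": "replace", "old": "available", "new": "true"},
            {"type": "cast", "to": "bool"}
          ]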
        },
        "features": {
          "css": "li.feature::text",
          "get_all": true,
          "processors": [{"type": "join", "separator": ", "}]
        }
      }
    }
  }
}

Job Listing

{
  "callbacks": {
    "parse_job": {
      "extract": {
        "title": {
          "css": "h1.job-title::text",
          "processors": [{"type": "strip"}]
        },
        "salary_min": {
          "css": "span.salary-min::text",
          "processors": [
            {"type": "strip"},
            {"type": "replace", "old": "$", "new": ""},
            {"type": "replace", "old": ",", "new": ""},
            {"type": "cast", "to": "int"}
          ]
        },
        "salary_max": {
          "css": "span.salary-max::text",
          "processors": [
            {"type": "strip"},
            {"type": "replace", "old": "$", "new": ""},
            {"type": "replace", "old": ",", "new": ""},
            {"type": "cast", "to": "int"}
          ]
        },
        "posted_date": {
          "css": "time.posted-date::attr(datetime)",
          "processors": [{"type": "parse_datetime"}]
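        },
        "remote": {
          "css": "span.job-type::text",
          "processors": [
            {"type": "lowercase"},
            {"type": "regex", "pattern": "(remote|work from home)"},
            {"type": "replace", "old": "work from home", "new": "true"},
            {"type": "replace", "old": "remote", "new": "true"},
            {"type": "cast", "to": "bool"}
          ]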
        },
        "skills": {
          "css": "span.skill::text",
          "get_all": true,
          "processors": [{"type": "join", "separator": ", "}]
        }
      }
    }
  }
}

Real Estate Listing

{
  "callbacks": {
    "parse_property": {
      "extract": {
        "address": {
          "css": "h1.property-address::text",
          "processors": [{"type": "strip"}]
        },
        "price": {
          "css": "span.property-price::text",
          "processors": [
            {"type": "strip"},
            {"type": "regex", "pattern": "\\$([\\d,.]+)"},
            {"type": "replace", "old": ",", "new": ""},
            {"type": "cast", "to": "float"}
          ]
        },
        "bedrooms": {
          "css": "span.bedrooms::text",
          "processors": [
            {"type": "regex", "pattern": "(\\d+)"},
            {"type": "cast", "to": "int"}
          ]
        },
        "bathrooms": {
          "css": "span.bathrooms::text",
          "processors": [
            {"type": "regex", "pattern": "([\\d.]+)"},
            {"type": "cast", "to": "float"}
          ]
        },
        "sqft": {
          "css": "span.square-feet::text",
          "processors": [
            {"type": "regex", "pattern": "([\\d,]+)"},
            {"type": "replace", "old": ",", "new": ""},
            {"type": "cast", "to": "int"}
          ]
        },
        "amenities": {
          "css": "li.amenity::text",
          "get_all": true,
          "processors": [{"type": "join", "separator": ", "}]
        }
      }
    }
  }
}

Error Handling

  • strip, replace, lowercase, join: Return the original value if the input type is not supported
  • regex: Returns the original value if the pattern does not match
  • cast: Returns None if conversion fails
  • parse_datetime: Returns None if parsing fails
  • Unknown processor type: Skipped, and a warning is logged
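
The skip-and-warn rule for unknown types can be sketched as a small dispatcher. Everything here is illustrative: PROCESSORS, apply_processors, and the logger name are hypothetical, and only two processors are registered to keep the example short.

```python
import logging

logger = logging.getLogger("processors")

# Hypothetical registry with two processors, enough to show the dispatch.
PROCESSORS = {
    "strip": lambda v: v.strip() if isinstance(v, str) else v,
    "lowercase": lambda v: v.lower() if isinstance(v, str) else v,
}


def apply_processors(value, specs):
    """Run each known processor in order; skip unknown types with a warning."""
    for spec in specs:
        func = PROCESSORS.get(spec.get("type"))
        if func is None:
            logger.warning("Unknown processor type %r, skipping", spec.get("type"))
            continue
        value = func(value)
    return value


chain = [{"type": "strip"}, {"type": "uppercase"}, {"type": "lowercase"}]
print(apply_processors("  IN STOCK ", chain))  # in stock
```

A misspelled type therefore does not stop the chain, which is why checking the logs is worthwhile when output looks only partly processed.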

Best Practices

  1. Always strip text fields

{"processors": [{"type": "strip"}]}

  2. Use regex before cast

[
  {"type": "regex", "pattern": "([\\d.]+)"},
  {"type": "cast", "to": "float"}
]

  3. Chain replace for complex cleaning

[
  {"type": "replace", "old": "$", "new": ""},
  {"type": "replace", "old": ",", "new": ""}
]

  4. Default at the end

[
  {"type": "strip"},
  {"type": "cast", "to": "float"},
  {"type": "default", "default": 0.0}
]

  5. Test selectors first

./scrapai analyze --test "selector"

  6. Validate processor output

./scrapai crawl spider --limit 5 --project proj
./scrapai show 1 --project proj

Troubleshooting

Processor Returns None

  1. Check processor type - Verify name is correct
  2. Validate input type - regex (strings), join (lists), parse_datetime (strings)
  3. Test without processors - See raw extracted value
  4. Check logs - Look for processor warnings

Wrong Output Type

Add a cast processor at the end of the chain:
{"processors": [{"type": "cast", "to": "float"}]}

Regex Not Matching

  1. Test pattern - Use regex101.com
  2. Check escaping - Double backslashes in JSON: {"pattern": "\\$([\\d.]+)"}
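  3. Add fallback - A regex that does not match returns the original value, so follow it with cast (which returns None on failure) and then default: [{"type": "regex", "pattern": "([\\d.]+)"}, {"type": "cast", "to": "float"}, {"type": "default", "default": 0.0}]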
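
The escaping rule in step 2 is easy to verify with Python's standard json module. The snippet below is a sketch for checking a pattern before putting it in a config; it does not involve the tool itself.

```python
import json
import re

# The raw string keeps the text exactly as it would appear in a JSON file.
raw = r'{"type": "regex", "pattern": "\\$([\\d.]+)"}'
spec = json.loads(raw)

# After JSON decoding, each doubled backslash becomes a single one.
print(spec["pattern"])  # \$([\d.]+)
print(re.search(spec["pattern"], "Price: $99.99").group(1))  # 99.99

# A single backslash before d is not a valid JSON escape and fails to load.
try:
    json.loads(r'{"pattern": "(\d+)"}')
    single_backslash_ok = True
except json.JSONDecodeError:
    single_backslash_ok = False
print(single_backslash_ok)  # False
```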

Date Parsing Fails

  1. Try without format - {"type": "parse_datetime"} (auto-detect)
  2. Specify format - {"type": "parse_datetime", "format": "%Y-%m-%d"}
  3. Check raw value - View extracted value to understand format
