Data Processors

Processors transform extracted values (strip whitespace, cast types, apply regex, etc.).

Available Processors

strip: Remove leading and trailing whitespace
replace: Replace substring in strings
regex: Extract substring using pattern
cast: Convert to specified type
join: Join list values to string
default: Return fallback value if empty
lowercase: Convert strings to lowercase
parse_datetime: Parse datetime to ISO format

Processor Reference

1. strip

Remove leading and trailing whitespace from strings.

Parameters: None

Example:

{
  "css": "h1::text",
  "processors": [{"type": "strip"}]
}

Transformation:

"  Hello World  " → "Hello World"

Works on: Strings and lists of strings
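The behaviour described above can be sketched in a few lines of Python. This is only an illustration of the documented input and output, not ScrapAI's actual implementation; the function name `strip_value` is hypothetical.

```python
def strip_value(value):
    """Strip whitespace from a string, or from each string in a list."""
    if isinstance(value, str):
        return value.strip()
    if isinstance(value, list):
        # Only string items are stripped; anything else is kept as is.
        return [v.strip() if isinstance(v, str) else v for v in value]
    # Other types (None, numbers, ...) are returned unchanged.
    return value


print(strip_value("  Hello World  "))   # Hello World
print(strip_value(["  a ", "b  "]))     # ['a', 'b']
```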

2. replace

Replace substring in strings.

Parameters:

old (required): Substring to replace
new (required): Replacement string

Example:

{
  "css": "span.price::text",
  "processors": [
    {"type": "replace", "old": "$", "new": ""},
    {"type": "replace", "old": ",", "new": ""}
  ]
}

Transformation:

"$1,299.99" → "1299.99"

Works on: Strings and lists of strings
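To make the two-step example concrete, here is a minimal Python sketch of a literal substring replace applied twice. It assumes `old` is treated as plain text rather than as a regular expression, which is what "substring" suggests; `replace_value` is a hypothetical name, not a ScrapAI function.

```python
def replace_value(value, old, new):
    """Replace a literal substring in a string or in each string of a list."""
    if isinstance(value, str):
        return value.replace(old, new)
    if isinstance(value, list):
        return [v.replace(old, new) if isinstance(v, str) else v for v in value]
    return value  # not a string or list: leave untouched


price = "$1,299.99"
for old, new in [("$", ""), (",", "")]:
    price = replace_value(price, old, new)

print(price)  # 1299.99
```

Because the match is literal, `$` needs no escaping here, unlike in the regex processor.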

3. regex

Extract substring using regular expression pattern.

Parameters:

pattern (required): Regex pattern to match
group (optional): Capture group to extract (default: 1)

Example:

{
  "css": "span::text",
  "processors": [
    {"type": "regex", "pattern": "Price: \\$([\\d.]+)"}
  ]
}

Transformation:

"Price: $99.99" → "99.99"

Selecting a capture group explicitly:

{"type": "regex", "pattern": "(\\d+) items", "group": 1}

Returns original value if no match.

Works on: Strings only
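A small Python sketch of the documented behaviour, including the fall-back to the original value. It assumes the pattern may match anywhere in the string (`re.search`), which is consistent with the examples in this guide but not confirmed; `regex_value` is a hypothetical name.

```python
import re


def regex_value(value, pattern, group=1):
    """Return the requested capture group, or the input if nothing matches."""
    if not isinstance(value, str):
        return value
    match = re.search(pattern, value)
    if match is None:
        return value  # no match: pass the original value through
    return match.group(group)


print(regex_value("Price: $99.99", r"Price: \$([\d.]+)"))  # 99.99
print(regex_value("12 items", r"(\d+) items", group=1))    # 12
print(regex_value("no digits here", r"(\d+)"))             # no digits here
```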

4. cast

Convert value to specified type.

Parameters:

to (required): Target type - "int", "float", "bool", or "str"

Example:

{
  "css": "span.rating::attr(data-rating)",
  "processors": [
    {"type": "cast", "to": "float"}
  ]
}

Transformations:

"4.5" → 4.5 (float)
"42" → 42 (int)
"true" → True (bool)

Boolean conversion:

true, 1, yes, on → True
Everything else → False

Returns None if conversion fails.

Works on: Any type
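The conversion rules above can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `cast_value` is a hypothetical name, and the case-insensitive comparison for booleans is an assumption.

```python
def cast_value(value, to):
    """Convert value to int, float, bool, or str; return None on failure."""
    try:
        if to == "int":
            return int(value)
        if to == "float":
            return float(value)
        if to == "str":
            return str(value)
        if to == "bool":
            # Only these tokens count as True (case-insensitivity is assumed).
            return str(value).strip().lower() in {"true", "1", "yes", "on"}
    except (TypeError, ValueError):
        return None
    return None  # unknown target type


print(cast_value("4.5", "float"))      # 4.5
print(cast_value("42", "int"))         # 42
print(cast_value("yes", "bool"))       # True
print(cast_value("in stock", "bool"))  # False
print(cast_value("abc", "int"))        # None
```

Note that under these rules a non-empty string such as "in stock" casts to False, so free text has to be mapped to one of the accepted tokens before a bool cast.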

5. join

Join list values into a single string.

Parameters:

separator (optional): String to join with (default: " ")

Example:

{
  "css": "li.feature::text",
  "get_all": true,
  "processors": [
    {"type": "join", "separator": ", "}
  ]
}

Transformation:

["WiFi", "Bluetooth", "GPS"] → "WiFi, Bluetooth, GPS"

Filters out None values automatically.

Works on: Lists only
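As a rough Python illustration of the join step with None filtering (the name `join_value` is hypothetical, and converting non-string items with `str` is an assumption):

```python
def join_value(value, separator=" "):
    """Join list items into one string, skipping None entries."""
    if not isinstance(value, list):
        return value  # not a list: nothing to join
    return separator.join(str(v) for v in value if v is not None)


print(join_value(["WiFi", "Bluetooth", "GPS"], ", "))  # WiFi, Bluetooth, GPS
print(join_value(["WiFi", None, "GPS"], ", "))         # WiFi, GPS
```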

6. default

Return default value if input is None, empty string, or empty list.

Parameters:

default (required): Fallback value

Example:

{
  "css": "span.optional::text",
  "processors": [
    {"type": "default", "default": "N/A"}
  ]
}

Transformations:

None → "N/A"
"" → "N/A"
[] → "N/A"
"actual value" → "actual value"

Works on: Any type
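The emptiness check can be sketched like this in Python. The sketch follows the three cases listed above literally, so values such as 0 or False are kept; whether ScrapAI treats them the same way is an assumption. `default_value` is a hypothetical name.

```python
def default_value(value, default):
    """Return the fallback only for None, an empty string, or an empty list."""
    if value is None or value == "" or value == []:
        return default
    return value


print(default_value(None, "N/A"))            # N/A
print(default_value("", "N/A"))              # N/A
print(default_value([], "N/A"))              # N/A
print(default_value("actual value", "N/A"))  # actual value
print(default_value(0, "N/A"))               # 0
```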

7. lowercase

Convert strings to lowercase.

Parameters: None

Example:

{
  "css": "span.status::text",
  "processors": [
    {"type": "strip"},
    {"type": "lowercase"}
  ]
}

Transformation:

"IN STOCK" → "in stock"

Works on: Strings and lists of strings

8. parse_datetime

Parse datetime string into ISO format.

Parameters:

format (optional): strptime format string (if None, uses dateutil parser for flexible parsing)

Example with format:

{
  "css": "time.date::attr(datetime)",
  "processors": [
    {"type": "parse_datetime", "format": "%Y-%m-%d"}
  ]
}

Example without format (auto-detect):

{
  "css": "span.date::text",
  "processors": [
    {"type": "parse_datetime"}
  ]
}

Transformations:

"2024-02-24" → "2024-02-24T00:00:00" (ISO format)
"February 24, 2024" → "2024-02-24T00:00:00"
"24/02/2024" → "2024-02-24T00:00:00" (auto-detected)

Stored as ISO string in database (automatically serialized). Returns None if parsing fails.

Works on: Strings only
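The two parsing modes can be sketched in Python as below. This is a hedged illustration, not the library's code: `parse_datetime_value` is a hypothetical name, returning None for non-string input is an assumption, and the sketch falls back to `datetime.fromisoformat` when the third-party `python-dateutil` package is not installed (the real processor is described as using dateutil).

```python
from datetime import datetime

try:
    from dateutil import parser as dateutil_parser  # third-party, optional here
except ImportError:
    dateutil_parser = None


def parse_datetime_value(value, fmt=None):
    """Parse a date string and return it in ISO 8601 form, or None on failure."""
    if not isinstance(value, str):
        return None  # assumption: non-strings count as a parse failure
    try:
        if fmt is not None:
            parsed = datetime.strptime(value, fmt)    # explicit format
        elif dateutil_parser is not None:
            parsed = dateutil_parser.parse(value)     # flexible auto-detection
        else:
            parsed = datetime.fromisoformat(value)    # stdlib stand-in
    except (ValueError, OverflowError):
        return None
    return parsed.isoformat()


print(parse_datetime_value("2024-02-24", "%Y-%m-%d"))  # 2024-02-24T00:00:00
print(parse_datetime_value("2024-02-24"))              # 2024-02-24T00:00:00
print(parse_datetime_value("not a date", "%Y-%m-%d"))  # None
```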

Processor Chaining

Processors run sequentially. Output of one becomes input to the next.
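Sequential execution amounts to a simple loop in which each step receives the previous step's result. The Python sketch below illustrates the idea with a tiny, hypothetical registry of three simplified processors; `run_chain` and `registry` are names invented for this example and are not part of ScrapAI.

```python
import logging

logger = logging.getLogger(__name__)


def run_chain(value, processors, registry):
    """Apply each processor spec in order, feeding the output forward."""
    for spec in processors:
        func = registry.get(spec["type"])
        if func is None:
            logger.warning("Unknown processor type: %s", spec["type"])
            continue  # unknown types are skipped
        params = {k: v for k, v in spec.items() if k != "type"}
        value = func(value, **params)
    return value


# Simplified stand-ins for three of the processors described in this guide.
registry = {
    "strip": lambda v: v.strip() if isinstance(v, str) else v,
    "replace": lambda v, old, new: v.replace(old, new) if isinstance(v, str) else v,
    "cast": lambda v, to: float(v) if to == "float" else v,
}

chain = [
    {"type": "strip"},
    {"type": "replace", "old": "$", "new": ""},
    {"type": "cast", "to": "float"},
]

print(run_chain("  $99.99  ", chain, registry))  # 99.99
```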

Example 1: Clean and Convert Price

{
  "css": "span.price::text",
  "processors": [
    {"type": "strip"},                           // "  $99.99  " → "$99.99"
    {"type": "replace", "old": "$", "new": ""},  // "$99.99" → "99.99"
    {"type": "cast", "to": "float"}              // "99.99" → 99.99
  ]
}

Example 2: Extract Rating Number

{
  "css": "div.rating::text",
  "processors": [
    {"type": "strip"},                           // "  Rating: 4.5 stars  " → "Rating: 4.5 stars"
    {"type": "regex", "pattern": "([\\d.]+)"},   // "Rating: 4.5 stars" → "4.5"
    {"type": "cast", "to": "float"}              // "4.5" → 4.5
  ]
}

Example 3: Normalize Text

{
  "css": "span.status::text",
  "processors": [
    {"type": "strip"},
    {"type": "lowercase"},
    {"type": "replace", "old": " ", "new": "_"}
  ]
}

Input: " In Stock "
Output: "in_stock"

Example 4: Handle Missing Values

{
  "css": "span.optional-field::text",
  "processors": [
    {"type": "strip"},
    {"type": "default", "default": "Not specified"}
  ]
}

Common Patterns

Extracting Currency Values

{
  "price": {
    "css": "span.price::text",
    "processors": [
      {"type": "strip"},
      {"type": "regex", "pattern": "\\$([\\d,.]+)"},
      {"type": "replace", "old": ",", "new": ""},
      {"type": "cast", "to": "float"}
    ]
  }
}

Handles: "$1,299.99", "Price: $99", " $42.50 "
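To check how the four steps treat each of the listed inputs, the same pipeline can be traced in plain Python. This mirrors the documented strip, regex, replace, and cast semantics; `extract_price` is a hypothetical helper, not a ScrapAI function.

```python
import re


def extract_price(text):
    """Mimic strip -> regex -> replace -> cast(float) on one raw string."""
    text = text.strip()
    match = re.search(r"\$([\d,.]+)", text)
    if match:
        text = match.group(1)      # regex keeps the original text on no match
    text = text.replace(",", "")
    try:
        return float(text)
    except ValueError:
        return None                # cast yields None when conversion fails


for raw in ["$1,299.99", "Price: $99", " $42.50 ", "Call for price"]:
    print(repr(raw), "->", extract_price(raw))
# '$1,299.99' -> 1299.99
# 'Price: $99' -> 99.0
# ' $42.50 ' -> 42.5
# 'Call for price' -> None
```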

Extracting Numbers from Text

{
  "quantity": {
    "css": "div.quantity::text",
    "processors": [
      {"type": "regex", "pattern": "(\\d+)"},
      {"type": "cast", "to": "int"}
    ]
  }
}

Handles: "23 items", "Quantity: 5", "42"

Boolean Fields

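{
  "in_stock": {
    "css": "span.availability::text",
    "processors": [
      {"type": "lowercase"},
      {"type": "regex", "pattern": "(in stock|available)"},
      {"type": "replace", "old": "in stock", "new": "true"},
      {"type": "replace", "old": "available", "new": "true"},
      {"type": "cast", "to": "bool"}
    ]
  }
}

Returns: True if the text contains "in stock" or "available", else False. The replace steps map the matched text to "true" because cast to bool treats only true, 1, yes, and on as True.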

Date Fields

{
  "published_date": {
    "css": "time::attr(datetime)",
    "processors": [
      {"type": "parse_datetime"}
    ]
  }
}

Auto-detects format, stores as ISO string.

Lists to Comma-Separated String

{
  "tags": {
    "css": "li.tag::text",
    "get_all": true,
    "processors": [
      {"type": "join", "separator": ", "}
    ]
  }
}

Input: ["Python", "Web Scraping", "Automation"]
Output: "Python, Web Scraping, Automation"

Complete Examples

E-commerce Product

{
  "callbacks": {
    "parse_product": {
      "extract": {
        "name": {
          "css": "h1.product-name::text",
          "processors": [{"type": "strip"}]
        },
        "price": {
          "css": "span.price::text",
          "processors": [
            {"type": "strip"},
            {"type": "regex", "pattern": "\\$([\\d,.]+)"},
            {"type": "replace", "old": ",", "new": ""},
            {"type": "cast", "to": "float"}
          ]
        },
        "rating": {
          "css": "div.rating::attr(data-rating)",
          "processors": [{"type": "cast", "to": "float"}]
        },
        "in_stock": {
          "css": "span.availability::text",
          "processors": [
            {"type": "lowercase"},
            {"type": "regex", "pattern": "(in stock|available)"},
            {"type": "cast", "to": "bool"}
          ]
        },
        "features": {
          "css": "li.feature::text",
          "get_all": true,
          "processors": [{"type": "join", "separator": ", "}]
        }
      }
    }
  }
}

Job Listing

{
  "callbacks": {
    "parse_job": {
      "extract": {
        "title": {
          "css": "h1.job-title::text",
          "processors": [{"type": "strip"}]
        },
        "salary_min": {
          "css": "span.salary-min::text",
          "processors": [
            {"type": "strip"},
            {"type": "replace", "old": "$", "new": ""},
            {"type": "replace", "old": ",", "new": ""},
            {"type": "cast", "to": "int"}
          ]
        },
        "salary_max": {
          "css": "span.salary-max::text",
          "processors": [
            {"type": "strip"},
            {"type": "replace", "old": "$", "new": ""},
            {"type": "replace", "old": ",", "new": ""},
            {"type": "cast", "to": "int"}
          ]
        },
        "posted_date": {
          "css": "time.posted-date::attr(datetime)",
          "processors": [{"type": "parse_datetime"}]
        },
        "remote": {
          "css": "span.job-type::text",
          "processors": [
            {"type": "lowercase"},
            {"type": "regex", "pattern": "(remote|work from home)"},
            {"type": "cast", "to": "bool"}
          ]
        },
        "skills": {
          "css": "span.skill::text",
          "get_all": true,
          "processors": [{"type": "join", "separator": ", "}]
        }
      }
    }
  }
}

Real Estate Listing

{
  "callbacks": {
    "parse_property": {
      "extract": {
        "address": {
          "css": "h1.property-address::text",
          "processors": [{"type": "strip"}]
        },
        "price": {
          "css": "span.property-price::text",
          "processors": [
            {"type": "strip"},
            {"type": "regex", "pattern": "\\$([\\d,.]+)"},
            {"type": "replace", "old": ",", "new": ""},
            {"type": "cast", "to": "float"}
          ]
        },
        "bedrooms": {
          "css": "span.bedrooms::text",
          "processors": [
            {"type": "regex", "pattern": "(\\d+)"},
            {"type": "cast", "to": "int"}
          ]
        },
        "bathrooms": {
          "css": "span.bathrooms::text",
          "processors": [
            {"type": "regex", "pattern": "([\\d.]+)"},
            {"type": "cast", "to": "float"}
          ]
        },
        "sqft": {
          "css": "span.square-feet::text",
          "processors": [
            {"type": "regex", "pattern": "([\\d,]+)"},
            {"type": "replace", "old": ",", "new": ""},
            {"type": "cast", "to": "int"}
          ]
        },
        "amenities": {
          "css": "li.amenity::text",
          "get_all": true,
          "processors": [{"type": "join", "separator": ", "}]
        }
      }
    }
  }
}

Error Handling

Graceful Failures

strip, replace, lowercase, join: Return the original value if the input is not an applicable type
regex: Returns the original value if there is no match
cast: Returns None if conversion fails
parse_datetime: Returns None if parsing fails
Unknown processor type: Skipped, logs a warning

Chain Behavior

A failed processor does not stop the chain. It passes either its unchanged input (strip, replace, lowercase, join, regex) or None (cast, parse_datetime) to the next processor.

[
  {"type": "strip"},                     // "  abc  " → "abc"
  {"type": "cast", "to": "int"},         // "abc" → None (fails)
  {"type": "default", "default": 0}      // None → 0
]
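One consequence of these rules is that default only takes effect after a step that yields None or an empty value. The Python sketch below contrasts a regex miss, which passes the text through, with a cast failure, which yields None. The three helper functions are simplified, hypothetical stand-ins for the processors, not ScrapAI code.

```python
import re


def regex_step(value, pattern):
    """Return capture group 1, or the unchanged input when nothing matches."""
    if not isinstance(value, str):
        return value
    match = re.search(pattern, value)
    return match.group(1) if match else value


def cast_int(value):
    """Return an int, or None when the conversion fails."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def default_step(value, default):
    """Substitute the fallback for None, an empty string, or an empty list."""
    return default if value is None or value == "" or value == [] else value


# regex miss: the original text survives, so default does not apply
print(default_step(regex_step("abc", r"(\d+)"), 0))            # abc
# cast failure: None reaches default, which substitutes 0
print(default_step(cast_int("abc"), 0))                        # 0
# regex -> cast -> default: the miss becomes None at cast, then 0
print(default_step(cast_int(regex_step("abc", r"(\d+)")), 0))  # 0
```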

Best Practices

Always strip text fields

{"processors": [{"type": "strip"}]}

Use regex before cast

[
  {"type": "regex", "pattern": "([\\d.]+)"},
  {"type": "cast", "to": "float"}
]

Chain replace for complex cleaning

[
  {"type": "replace", "old": "$", "new": ""},
  {"type": "replace", "old": ",", "new": ""}
]

Default at the end

[
  {"type": "strip"},
  {"type": "cast", "to": "float"},
  {"type": "default", "default": 0.0}
]

Test selectors first

./scrapai analyze --test "selector"

Validate processor output

./scrapai crawl spider --limit 5 --project proj
./scrapai show 1 --project proj

Troubleshooting

Processor Returns None

Check processor type - Verify name is correct
Validate input type - regex (strings), join (lists), parse_datetime (strings)
Test without processors - See raw extracted value
Check logs - Look for processor warnings

Wrong Output Type

Add cast processor at the end:

{"processors": [{"type": "cast", "to": "float"}]}

Regex Not Matching

Test pattern - Use regex101.com
Check escaping - Double backslashes in JSON: {"pattern": "\\$([\\d.]+)"}
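Add fallback - regex returns the original value when nothing matches, so default alone will not trigger. Follow regex with cast (None on failure) and then default: [{"type": "regex", "pattern": "([\\d.]+)"}, {"type": "cast", "to": "float"}, {"type": "default", "default": 0.0}]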
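The escaping rule is easy to verify with Python's standard `json` module: the doubled backslashes in the JSON text decode to single backslashes, which is what the regex engine needs. The variable names below are only for illustration.

```python
import json
import re

# The text exactly as it would appear in a JSON config file.
raw = r'{"type": "regex", "pattern": "\\$([\\d.]+)"}'
spec = json.loads(raw)

print(spec["pattern"])                                       # \$([\d.]+)
print(re.search(spec["pattern"], "Price: $99.99").group(1))  # 99.99

# Single backslashes are not valid JSON escapes and are rejected by the parser.
try:
    json.loads(r'{"pattern": "\$([\d.]+)"}')
    single_backslash_ok = True
except json.JSONDecodeError:
    single_backslash_ok = False

print(single_backslash_ok)  # False
```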

Date Parsing Fails

Try without format - {"type": "parse_datetime"} (auto-detect)
Specify format - {"type": "parse_datetime", "format": "%Y-%m-%d"}
Check raw value - View extracted value to understand format

Related Guides

Custom Callbacks: Extract structured data with callbacks
Extractors: Content extraction strategies
