Custom CSS Extractors

Custom CSS extractors let you define precise selectors for extracting content from sites with unusual structures. Use them when the generic extractors (newspaper, trafilatura) fail.

When to Use

Generic extractors fail or extract the wrong content

Site has unique/non-semantic HTML structure

Need to extract custom fields (price, rating, category)

E-commerce, job boards, real estate, forums

Trade-off: requires manual selector discovery and testing

Trade-off: site changes may break selectors

Configuration

For Article Content

Use CUSTOM_SELECTORS in settings:

{
  "settings": {
    "EXTRACTOR_ORDER": ["custom", "newspaper", "trafilatura"],
    "CUSTOM_SELECTORS": {
      "title": "h1.article-title",
      "content": "div.article-body",
      "author": "span.author-name",
      "date": "time.published-date",
      "category": "a.category-link",
      "tags": "div.tags a"
    }
  }
}
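Before importing a spider, it can help to sanity-check the settings block. The following is a minimal Python sketch, not part of scrapai: `check_custom_config` is a hypothetical helper, and treating title and content as the required selectors is taken from the Validation section of this page.

```python
import json

# Spider settings mirroring the JSON above (trimmed to two selectors).
config = {
    "settings": {
        "EXTRACTOR_ORDER": ["custom", "newspaper", "trafilatura"],
        "CUSTOM_SELECTORS": {
            "title": "h1.article-title",
            "content": "div.article-body",
        },
    }
}

def check_custom_config(cfg):
    """Return a list of problems found in the settings (hypothetical helper)."""
    settings = cfg.get("settings", {})
    selectors = settings.get("CUSTOM_SELECTORS", {})
    order = settings.get("EXTRACTOR_ORDER", [])
    problems = []
    # Selectors are only used if the custom extractor is in the order list.
    if selectors and "custom" not in order:
        problems.append("CUSTOM_SELECTORS set but 'custom' missing from EXTRACTOR_ORDER")
    # The custom extractor validates title and content, so both need a selector.
    for field in ("title", "content"):
        if not selectors.get(field):
            problems.append(f"missing selector for required field '{field}'")
    return problems

problems = check_custom_config(config)
print(json.dumps(config, indent=2) if not problems else problems)
```

Running the check before `scrapai spiders import` catches the two most common configuration slips without crawling anything.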

For Structured Data

Use callbacks for non-article content:

{
  "callbacks": {
    "parse_product": {
      "extract": {
        "name": {"css": "h1.product-name::text"},
        "price": {"css": "span.price::text"},
        "rating": {"css": "span.rating::attr(data-rating)"}
      }
    }
  }
}

See Callbacks for the full schema.

CSS Selector Syntax

Basic Selectors

// Element
"css": "h1"

// Class
"css": "h1.article-title"

// ID
"css": "#main-title"

// Multiple classes
"css": "div.article.featured"

// Descendant
"css": "article div.content"

// Direct child
"css": "article > div.content"

Extract Text

// Text content
"css": "h1::text"

// All text (including nested elements)
"css": "div.content::text"

Extract Attributes

// Image src
"css": "img.main-image::attr(src)"

// Link href
"css": "a.external::attr(href)"

// Data attribute
"css": "span.rating::attr(data-rating)"

// Class name
"css": "div::attr(class)"

Multiple Matches

{
  "css": "li.feature::text",
  "get_all": true
}

Returns a list: ["WiFi", "Bluetooth", "GPS"]
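To make the `::text` / `::attr(name)` suffixes and `get_all` concrete, here is a minimal Python sketch. It is not scrapai's implementation: `split_selector` and `pick` are hypothetical helpers, and returning only the first match when `get_all` is absent is an assumption.

```python
import re

# Trailing ::text or ::attr(name) suffix, as used in the selectors above.
_SUFFIX = re.compile(r"::(text|attr\(([^)]+)\))$")

def split_selector(selector):
    """Split a selector into (base, kind, attribute_name)."""
    match = _SUFFIX.search(selector)
    if match is None:
        return selector, None, None          # no suffix: the element itself
    base = selector[:match.start()]
    if match.group(1) == "text":
        return base, "text", None
    return base, "attr", match.group(2)

def pick(matches, get_all=False):
    """Return every match as a list, or only the first one (assumed default)."""
    if get_all:
        return list(matches)
    return matches[0] if matches else None

print(split_selector("img.main-image::attr(src)"))       # ('img.main-image', 'attr', 'src')
print(pick(["WiFi", "Bluetooth", "GPS"], get_all=True))  # ['WiFi', 'Bluetooth', 'GPS']
print(pick(["WiFi", "Bluetooth", "GPS"]))                # WiFi
```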

Standard vs Custom Fields

Standard Fields (Database Columns)

These map directly to columns in the scraped_items table:

title → scraped_items.title
content → scraped_items.content
author → scraped_items.author
date → scraped_items.published_date

Custom Fields (Metadata JSON)

Any other field name is stored in metadata_json:

price, rating, category, brand, etc.
Displayed by the show command
Flattened in exports
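The split between columns and metadata can be pictured with a short Python sketch. `split_fields` is a hypothetical helper written for illustration, the sample item is made up, and storing the metadata as a JSON string is an assumption about how the column is populated.

```python
import json

# Standard fields and the scraped_items columns they map to (from the list above).
STANDARD_COLUMNS = {
    "title": "title",
    "content": "content",
    "author": "author",
    "date": "published_date",
}

def split_fields(extracted):
    """Route standard fields to columns and everything else to metadata_json."""
    row = {}
    metadata = {}
    for name, value in extracted.items():
        if name in STANDARD_COLUMNS:
            row[STANDARD_COLUMNS[name]] = value
        else:
            metadata[name] = value
    row["metadata_json"] = json.dumps(metadata, sort_keys=True)
    return row

row = split_fields(
    {"title": "Trail Shoe", "date": "2024-01-05", "price": 89.5, "brand": "Acme"}
)
print(row)
```

Note that `date` lands in `published_date`, while `price` and `brand` only appear inside `metadata_json`.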

Selector Discovery Workflow

Step 1: Inspect Page

scrapai inspect https://example.com/article --project proj

Saves the HTML to data/proj/spider/analysis/page.html

Step 2: Analyze Structure

scrapai analyze data/proj/spider/analysis/page.html

Shows:

h1/h2 elements with classes
Content containers by size
Date elements
Author elements

Step 3: Test Selectors

scrapai analyze page.html --test "h1.article-title"
scrapai analyze page.html --test "div.article-body"

Verifies that the selector matches the correct element.

Step 4: Search for Fields

scrapai analyze page.html --find "price"
scrapai analyze page.html --find "rating"

Finds elements containing the given text.

Step 5: Test on Multiple Pages

# Import spider with selectors
scrapai spiders import spider.json --project proj

# Test on 5 pages
scrapai crawl spider --limit 5 --project proj

# Verify results
scrapai show spider 1 --project proj
scrapai show spider 2 --project proj

Selector Best Practices

Target main content element (not navigation/sidebar/footer)

Selector should match ONE element per page

Prefer specific classes (.article-title) over generic (.title)

Test on multiple pages to verify consistency

Prefer semantic tags (<article>, <time>, <h1>)

Content selector should return >500 chars; title >10 chars

Avoid dynamic/random class names (e.g., class="css-abc123")

Don’t guess; always test against the actual HTML

Avoid overly generic selectors like div.text
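The advice about dynamic class names can be partly automated. The sketch below is a rough heuristic, not a scrapai feature: `risky_classes` is a hypothetical helper, and the pattern only recognises `css-` or `sc-` prefixes followed by a hash-like token containing a digit, so it will miss other naming schemes.

```python
import re

# Heuristic only: "css-" or "sc-" followed by a hash-like token containing a digit.
_GENERATED = re.compile(r"^(?:css|sc)-(?=[A-Za-z0-9]*\d)[A-Za-z0-9]{5,}$")

def risky_classes(selector):
    """Return class names in the selector that look auto-generated."""
    class_names = re.findall(r"\.([A-Za-z_][\w-]*)", selector)
    return [name for name in class_names if _GENERATED.match(name)]

print(risky_classes("div.css-abc123 > h1.article-title"))  # ['css-abc123']
print(risky_classes("article > div.content"))              # []
```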

Examples

News Article

{
  "settings": {
    "EXTRACTOR_ORDER": ["custom"],
    "CUSTOM_SELECTORS": {
      "title": "h1.article-title",
      "content": "div.article-body",
      "author": "span.author-name",
      "date": "time.published-date",
      "category": "a.category-link",
      "tags": "div.tags a"
    }
  }
}

E-commerce Product

{
  "callbacks": {
    "parse_product": {
      "extract": {
        "name": {
          "css": "h1.product-name::text",
          "processors": [{"type": "strip"}]
        },
        "price": {
          "css": "span.price::text",
          "processors": [
            {"type": "strip"},
            {"type": "regex", "pattern": "\\$([\\d,.]+)"},
            {"type": "replace", "old": ",", "new": ""},
            {"type": "cast", "to": "float"}
          ]
        },
        "rating": {
          "css": "span.rating::attr(data-rating)",
          "processors": [{"type": "cast", "to": "float"}]
        },
        "features": {
          "css": "li.feature::text",
          "get_all": true
        },
        "images": {
          "css": "img.product-image::attr(src)",
          "get_all": true
        }
      }
    }
  }
}
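The price processors run in order, each receiving the previous output. A minimal Python sketch of the same chain follows; it is an illustration rather than scrapai's processor code, `clean_price` is a hypothetical name, and keeping the first capture group from the regex step is an assumption.

```python
import re

def clean_price(raw):
    """Apply strip -> regex -> replace -> cast, mirroring the processors above."""
    text = raw.strip()                           # {"type": "strip"}
    match = re.search(r"\$([\d,.]+)", text)      # {"type": "regex", ...}
    if match is None:
        raise ValueError(f"no price found in {raw!r}")
    number = match.group(1)                      # assumption: first capture group is kept
    number = number.replace(",", "")             # {"type": "replace", "old": ",", "new": ""}
    return float(number)                         # {"type": "cast", "to": "float"}

print(clean_price("  $1,299.99 \n"))  # 1299.99
```

The order matters: casting before removing the thousands separator would fail on "1,299.99".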

Forum Thread

{
  "callbacks": {
    "parse_thread": {
      "extract": {
        "title": {"css": "h1.thread-title::text"},
        "author": {"css": "span.username::text"},
        "content": {"css": "div.post-content"},
        "date": {"css": "time.post-date::attr(datetime)"},
        "upvotes": {
          "css": "span.vote-count::text",
          "processors": [{"type": "cast", "to": "int"}]
        }
      }
    }
  }
}
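The `datetime` attribute of a `<time>` element normally holds a machine-readable timestamp, which is why the date selector above reads the attribute rather than the visible text. A small Python sketch of post-processing such values is shown here; `parse_post` is a hypothetical helper and the sample values are made up for illustration.

```python
from datetime import datetime

def parse_post(raw):
    """Convert extracted strings into typed values."""
    return {
        # ISO 8601 string taken from time.post-date::attr(datetime)
        "date": datetime.fromisoformat(raw["date"]),
        # Equivalent of {"type": "cast", "to": "int"} on the vote count
        "upvotes": int(raw["upvotes"].strip()),
    }

post = parse_post({"date": "2024-03-15T09:30:00+00:00", "upvotes": " 42 "})
print(post["date"].year, post["upvotes"])  # 2024 42
```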

Validation

The custom extractor requires a title (min 5 chars) and content (min 100 chars). If validation fails, it returns None and the next extractor in EXTRACTOR_ORDER, if one is configured, is tried.
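The validate-then-fall-through behaviour can be sketched in a few lines of Python. This is an illustration under assumptions, not scrapai's code: `validate` and `extract_with_fallback` are hypothetical names, the extractors are stubs, and stripping whitespace before measuring length and applying the same check to every extractor are both assumptions.

```python
MIN_TITLE_CHARS = 5      # from the rule above
MIN_CONTENT_CHARS = 100  # from the rule above

def validate(item):
    """Return the item if it passes the length checks, else None."""
    if item is None:
        return None
    title = (item.get("title") or "").strip()
    content = (item.get("content") or "").strip()
    if len(title) < MIN_TITLE_CHARS or len(content) < MIN_CONTENT_CHARS:
        return None
    return item

def extract_with_fallback(html, order, extractors):
    """Try each extractor in order; return (name, item) for the first valid result."""
    for name in order:
        item = validate(extractors[name](html))
        if item is not None:
            return name, item
    return None, None

# Stub extractors standing in for the real ones.
extractors = {
    "custom": lambda html: {"title": "Hi", "content": "too short"},
    "newspaper": lambda html: {"title": "A valid headline", "content": "x" * 150},
}
name, item = extract_with_fallback("<html></html>", ["custom", "newspaper"], extractors)
print(name)  # newspaper
```

Here the custom result fails both length checks, so the loop moves on to the next configured extractor.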

Fallback Strategy

Recommended: use custom first, with fallback to the generic extractors:

"EXTRACTOR_ORDER": ["custom", "newspaper", "trafilatura"]

Debugging

Test selectors using:

scrapai analyze page.html --test "h1.article-title"
scrapai analyze page.html --test "div.article-body"

Common issues:

Selector doesn’t match element (typo in class name)
Content too short (selector targets sidebar instead of main content)
Wrong content extracted (selector too generic, matches multiple elements)
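The three issues above map onto two observable numbers: how many elements the selector matched and how much text came back. A hedged Python sketch of that triage follows; `diagnose` is a hypothetical helper, and the 100-character threshold is borrowed from the Validation section.

```python
def diagnose(match_count, text_length, min_chars=100):
    """Map selector test results onto the common issues listed above."""
    if match_count == 0:
        return "no match: check the class name for typos"
    if match_count > 1:
        return "multiple matches: selector is too generic"
    if text_length < min_chars:
        return "content too short: selector may target a sidebar"
    return "ok"

print(diagnose(match_count=0, text_length=0))
print(diagnose(match_count=3, text_length=900))
print(diagnose(match_count=1, text_length=40))
print(diagnose(match_count=1, text_length=2400))
```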

Related

Callbacks - Complete callback schema with processors
Extractors Overview - Strategy selection
Newspaper Extractor - Generic news extractor
Trafilatura Extractor - Generic content extractor
Settings - Configuration options

