Custom CSS extractors let you define precise selectors for extracting content from sites with unique structures. Use them when the generic extractors (newspaper, trafilatura) fail.
When to Use
Use custom selectors when:
- Generic extractors fail or extract the wrong content
- The site has a unique or non-semantic HTML structure
- You need to extract custom fields (price, rating, category)
- The site is an e-commerce store, job board, real estate listing, or forum
Trade-offs:
- Requires manual selector discovery and testing
- Site changes may break selectors
Configuration
For Article Content
Use CUSTOM_SELECTORS in settings:
{
  "settings": {
    "EXTRACTOR_ORDER": ["custom", "newspaper", "trafilatura"],
    "CUSTOM_SELECTORS": {
      "title": "h1.article-title",
      "content": "div.article-body",
      "author": "span.author-name",
      "date": "time.published-date",
      "category": "a.category-link",
      "tags": "div.tags a"
    }
  }
}
For Structured Data
Use callbacks for non-article content:
{
  "callbacks": {
    "parse_product": {
      "extract": {
        "name": {"css": "h1.product-name::text"},
        "price": {"css": "span.price::text"},
        "rating": {"css": "span.rating::attr(data-rating)"}
      }
    }
  }
}
See Callbacks for the full schema.
CSS Selector Syntax
Basic Selectors
// Element
"css": "h1"
// Class
"css": "h1.article-title"
// ID
"css": "#main-title"
// Multiple classes
"css": "div.article.featured"
// Descendant
"css": "article div.content"
// Direct child
"css": "article > div.content"
// Text content
"css": "h1::text"
// All text (including nested elements)
"css": "div.content::text"
// Image src
"css": "img.main-image::attr(src)"
// Link href
"css": "a.external::attr(href)"
// Data attribute
"css": "span.rating::attr(data-rating)"
// Class name
"css": "div::attr(class)"
Multiple Matches
{
  "css": "li.feature::text",
  "get_all": true
}
Returns a list, for example: ["WiFi", "Bluetooth", "GPS"]
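The effect of get_all can be modelled in a few lines of Python. This is only a sketch: `pick` is a hypothetical helper, not part of the tool, and the behaviour without get_all (keeping only the first match) is an assumption, since the text above documents only the list case.

```python
from typing import List, Optional, Union


def pick(matches: List[str], get_all: bool = False) -> Union[List[str], Optional[str]]:
    """Hypothetical helper: reduce the matches of one selector to a field value."""
    if get_all:
        # Keep every match, in document order.
        return list(matches)
    # Assumption: without get_all only the first match is kept.
    return matches[0] if matches else None


features = ["WiFi", "Bluetooth", "GPS"]  # text of each li.feature on the page
print(pick(features, get_all=True))  # ['WiFi', 'Bluetooth', 'GPS']
print(pick(features))                # WiFi
```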
Standard vs Custom Fields
Standard Fields (Database Columns)
Map directly to scraped_items table:
title → scraped_items.title
content → scraped_items.content
author → scraped_items.author
date → scraped_items.published_date
Any other field name is stored in metadata_json:
price, rating, category, brand, etc.
- Displayed in the show command
- Flattened in exports
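As a rough illustration of the split between database columns and metadata_json, the following Python sketch routes extracted fields into a row. `split_fields` and the sample values are hypothetical; only the column mapping is taken from the list above, and the real storage code may differ.

```python
import json

# Standard field name -> scraped_items column, as listed above.
STANDARD_COLUMNS = {
    "title": "title",
    "content": "content",
    "author": "author",
    "date": "published_date",
}


def split_fields(extracted: dict) -> dict:
    """Hypothetical helper: build a scraped_items row from extracted fields."""
    row = {}
    custom = {}
    for name, value in extracted.items():
        if name in STANDARD_COLUMNS:
            row[STANDARD_COLUMNS[name]] = value
        else:
            custom[name] = value
    # Everything that is not a standard field goes into metadata_json.
    row["metadata_json"] = json.dumps(custom, sort_keys=True)
    return row


# Made-up sample values, for illustration only.
row = split_fields({
    "title": "Acme Phone",
    "date": "2024-01-15",
    "price": 1299.99,
    "brand": "Acme",
})
print(row["published_date"])  # 2024-01-15
print(row["metadata_json"])   # {"brand": "Acme", "price": 1299.99}
```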
Selector Discovery Workflow
Step 1: Inspect Page
scrapai inspect https://example.com/article --project proj
Saves HTML to data/proj/spider/analysis/page.html
Step 2: Analyze Structure
scrapai analyze data/proj/spider/analysis/page.html
Shows:
- h1/h2 elements with classes
- Content containers by size
- Date elements
- Author elements
Step 3: Test Selectors
scrapai analyze page.html --test "h1.article-title"
scrapai analyze page.html --test "div.article-body"
Verifies that the selector matches the correct element.
Step 4: Search for Fields
scrapai analyze page.html --find "price"
scrapai analyze page.html --find "rating"
Finds elements containing the given text.
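To show the idea behind searching for a field by its text, here is a small standard-library sketch that reports the innermost element around each matching text node. It is not how scrapai implements --find; `TextFinder`, `find_text`, and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TextFinder(HTMLParser):
    """Collect the innermost element around each text node containing a needle."""

    def __init__(self, needle: str) -> None:
        super().__init__()
        self.needle = needle.lower()
        self.stack = []  # (tag, "tag.class1.class2") for every open element
        self.hits = []   # (label, matching text)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        self.stack.append((tag, ".".join([tag] + classes)))

    def handle_endtag(self, tag):
        # Pop back to the nearest open element with this tag name.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # Case-insensitive match against the text of the current element.
        if self.stack and self.needle in data.lower():
            self.hits.append((self.stack[-1][1], data.strip()))


def find_text(html: str, needle: str):
    finder = TextFinder(needle)
    finder.feed(html)
    finder.close()
    return finder.hits


sample = """
<div class="product">
  <h1 class="product-name">Acme Phone</h1>
  <p class="cost">Price: <span class="price">$1,299.99</span></p>
</div>
"""
print(find_text(sample, "price"))  # [('p.cost', 'Price:')]
```

The label it prints (`p.cost`) is a starting point for a selector, which should still be checked with --test.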
Step 5: Test on Multiple Pages
# Import spider with selectors
scrapai spiders import spider.json --project proj
# Test on 5 pages
scrapai crawl spider --limit 5 --project proj
# Verify results
scrapai show spider 1 --project proj
scrapai show spider 2 --project proj
Selector Best Practices
- Target the main content element, not navigation, sidebar, or footer
- A selector should match ONE element per page (unless the field uses get_all)
- Prefer specific classes (.article-title) over generic ones (.title)
- Test on multiple pages to verify consistency
- Prefer semantic tags (<article>, <time>, <h1>)
- The content selector should return more than 500 characters, the title more than 10
- Avoid dynamic or random class names (e.g., class="css-abc123")
- Don’t guess: always test on actual HTML
- Avoid overly generic selectors like div.text
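Two of the rules above (one match per page, and the length thresholds) are easy to check mechanically. The sketch below assumes you already have the text of every element a selector matched on one page; `check_selector` is a hypothetical helper, not a scrapai command.

```python
from typing import List

# Thresholds from the checklist above: content > 500 chars, title > 10 chars.
MIN_CHARS = {"title": 10, "content": 500}


def check_selector(field: str, matches: List[str]) -> List[str]:
    """Return a list of problems for one selector on one page (empty if fine)."""
    problems = []
    if not matches:
        problems.append("selector matched nothing")
        return problems
    if len(matches) > 1:
        problems.append(f"selector matched {len(matches)} elements, expected 1")
    threshold = MIN_CHARS.get(field)
    length = len(matches[0].strip())
    if threshold is not None and length <= threshold:
        problems.append(f"{field} is {length} chars, expected more than {threshold}")
    return problems


# A sidebar snippet picked up by an overly generic content selector.
print(check_selector("content", ["Subscribe to our newsletter"]))
# A healthy title selector.
print(check_selector("title", ["Markets rally after rate decision"]))  # []
```

Running a check like this over several pages gives a quick signal that a selector is too generic or targets the wrong container.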
Examples
News Article
{
  "settings": {
    "EXTRACTOR_ORDER": ["custom"],
    "CUSTOM_SELECTORS": {
      "title": "h1.article-title",
      "content": "div.article-body",
      "author": "span.author-name",
      "date": "time.published-date",
      "category": "a.category-link",
      "tags": "div.tags a"
    }
  }
}
E-commerce Product
{
  "callbacks": {
    "parse_product": {
      "extract": {
        "name": {
          "css": "h1.product-name::text",
          "processors": [{"type": "strip"}]
        },
        "price": {
          "css": "span.price::text",
          "processors": [
            {"type": "strip"},
            {"type": "regex", "pattern": "\\$([\\d,.]+)"},
            {"type": "replace", "old": ",", "new": ""},
            {"type": "cast", "to": "float"}
          ]
        },
        "rating": {
          "css": "span.rating::attr(data-rating)",
          "processors": [{"type": "cast", "to": "float"}]
        },
        "features": {
          "css": "li.feature::text",
          "get_all": true
        },
        "images": {
          "css": "img.product-image::attr(src)",
          "get_all": true
        }
      }
    }
  }
}
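To see what the price processor chain does to a raw value, the following Python sketch applies the same four steps in order. It is an approximation under stated assumptions: `apply_processors` is hypothetical, the regex step is assumed to keep the first capture group, and a non-matching regex is assumed to yield None. The tool's actual processor semantics may differ.

```python
import re
from typing import Any, Dict, List

CASTS = {"float": float, "int": int}


def apply_processors(value: Any, processors: List[Dict[str, str]]) -> Any:
    """Hypothetical re-implementation of a processor chain, applied left to right."""
    for proc in processors:
        kind = proc["type"]
        if kind == "strip":
            value = value.strip()
        elif kind == "regex":
            match = re.search(proc["pattern"], value)
            if match is None:
                return None  # assumption: no match means no value
            # Assumption: keep the first capture group when the pattern has one.
            value = match.group(1) if match.groups() else match.group(0)
        elif kind == "replace":
            value = value.replace(proc["old"], proc["new"])
        elif kind == "cast":
            value = CASTS[proc["to"]](value)
        else:
            raise ValueError(f"unknown processor type: {kind}")
    return value


# Same chain as the "price" field above (JSON "\\$" becomes "\$" once decoded).
price_processors = [
    {"type": "strip"},
    {"type": "regex", "pattern": r"\$([\d,.]+)"},
    {"type": "replace", "old": ",", "new": ""},
    {"type": "cast", "to": "float"},
]
print(apply_processors("  $1,299.99 \n", price_processors))  # 1299.99
```

Order matters here: the comma has to be removed before the cast, otherwise `float("1,299.99")` raises a ValueError.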
Forum Thread
{
  "callbacks": {
    "parse_thread": {
      "extract": {
        "title": {"css": "h1.thread-title::text"},
        "author": {"css": "span.username::text"},
        "content": {"css": "div.post-content"},
        "date": {"css": "time.post-date::attr(datetime)"},
        "upvotes": {
          "css": "span.vote-count::text",
          "processors": [{"type": "cast", "to": "int"}]
        }
      }
    }
  }
}
Validation
The custom extractor requires a title (at least 5 characters) and content (at least 100 characters). If validation fails, it returns None and the next extractor in EXTRACTOR_ORDER, if one is configured, is tried.
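The validation rule and the hand-off to the next extractor can be sketched in Python as follows. The names `validate` and `extract_with_fallback` are hypothetical, the extractors are stubs that ignore their input, and measuring lengths after trimming whitespace is an assumption. Only the two minimum lengths and the "return None, try the next one" behaviour come from the text above.

```python
from typing import Callable, Dict, List, Optional, Tuple

MIN_TITLE_CHARS = 5      # from the rule above
MIN_CONTENT_CHARS = 100  # from the rule above


def validate(item: dict) -> Optional[dict]:
    """Return the item if it passes the length checks, otherwise None."""
    # Assumption: lengths are measured after trimming surrounding whitespace.
    title = (item.get("title") or "").strip()
    content = (item.get("content") or "").strip()
    if len(title) < MIN_TITLE_CHARS or len(content) < MIN_CONTENT_CHARS:
        return None
    return item


def extract_with_fallback(
    html: str,
    order: List[str],
    extractors: Dict[str, Callable[[str], Optional[dict]]],
) -> Optional[Tuple[str, dict]]:
    """Try extractors in order and return the first non-None result."""
    for name in order:
        item = extractors[name](html)
        if item is not None:
            return name, item
    return None


# Stub extractors standing in for the real ones; they ignore the HTML.
stubs = {
    "custom": lambda html: validate({"title": "Hi", "content": "too short"}),
    "newspaper": lambda html: {"title": "A valid headline", "content": "x" * 150},
    "trafilatura": lambda html: None,
}

result = extract_with_fallback(
    "<html></html>", ["custom", "newspaper", "trafilatura"], stubs
)
print(result[0])  # newspaper
```

With `["custom"]` as the only entry in the order, the same input yields None, which is why a fallback order is usually the safer configuration.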
Fallback Strategy
Recommended: use custom first, with fallback to the generic extractors:
"EXTRACTOR_ORDER": ["custom", "newspaper", "trafilatura"]
Debugging
Test selectors using:
scrapai analyze page.html --test "h1.article-title"
scrapai analyze page.html --test "div.article-body"
Common issues:
- Selector doesn’t match element (typo in class name)
- Content too short (selector targets sidebar instead of main content)
- Wrong content extracted (selector too generic, matches multiple elements)