Extractors Overview

ScrapAI supports multiple extraction strategies that can be chained with fallback. Each extractor tries to extract content, and if it fails, the next one is attempted.

Available Extractors

Newspaper4k

General-purpose article extractor for news and blogs

Trafilatura

Lightweight content extraction with high accuracy

Custom CSS

Site-specific CSS selectors for structured data

Playwright

Browser rendering for JavaScript-heavy sites

Extraction Order

Configure extraction order in spider settings:

{
  "settings": {
    "EXTRACTOR_ORDER": ["newspaper", "trafilatura"]
  }
}

Extractors are tried in sequence. If one fails or returns content too short, the next is attempted.

Strategy Selection

Scenario	Recommended Order
Generic news/blog (clean HTML)	`["newspaper", "trafilatura"]`
Generic extractors fail	`["custom", "newspaper", "trafilatura"]`
JavaScript-rendered content (SPA)	`["playwright", "trafilatura"]`
JS-rendered + custom structure	`["playwright", "custom"]`
E-commerce, jobs, listings	`["custom"]` (with callbacks)
Infinite scroll page	`["playwright"]` (single extractor)

Extractor Comparison

Feature	Newspaper	Trafilatura	Custom	Playwright
Speed	Fast	Fast	Fast	Slow
Accuracy	Good	Excellent	Perfect (if configured)	Good
Setup	None	None	Requires CSS selectors	Requires wait config
Use Case	News articles	Any content	Structured data	JS content
Metadata	Keywords, summary, top_image	Description, tags, fingerprint	Custom fields	None (uses trafilatura)

Content Validation

All extractors validate:

Title: min 5 characters
Content: min 100 characters

If validation fails, the next extractor is tried.

ScrapedArticle Schema

url

string

required

Page URL

title

string

required

Article/page title (min 5 chars)

content

string

required

Main content text (min 100 chars)

author

string

Author name (if available)

published_date

datetime

Publication date (if available)

source

string

required

Extractor used: "newspaper4k", "trafilatura", "custom", "playwright"

extracted_at

datetime

required

Extraction timestamp (UTC)

metadata

object

default:"{}"

Extractor-specific or custom fieldsNewspaper metadata:

top_image - Main image URL
keywords - Extracted keywords
summary - Auto-generated summary

Trafilatura metadata:

description - Meta description
sitename - Site name
categories, tags, fingerprint, license

Custom metadata:

Any fields from CUSTOM_SELECTORS (except title/content/author/date)
Any fields from callback extract config

html

string

Raw HTML (only if include_html=True in export)

Configuration Examples

Generic News Site

{
  "name": "bbc_co_uk",
  "settings": {
    "EXTRACTOR_ORDER": ["newspaper", "trafilatura"],
    "DOWNLOAD_DELAY": 1,
    "CONCURRENT_REQUESTS": 16
  }
}

Custom Selectors with Fallback

{
  "name": "custom_blog",
  "settings": {
    "EXTRACTOR_ORDER": ["custom", "newspaper", "trafilatura"],
    "CUSTOM_SELECTORS": {
      "title": "h1.post-title",
      "content": "div.post-content",
      "author": "span.author-name",
      "category": "a.category"
    }
  }
}

JavaScript-Rendered Site

{
  "name": "spa_site",
  "settings": {
    "EXTRACTOR_ORDER": ["playwright", "trafilatura"],
    "PLAYWRIGHT_WAIT_SELECTOR": ".article-content",
    "PLAYWRIGHT_DELAY": 3,
    "CONCURRENT_REQUESTS": 2
  }
}

E-commerce (Custom Only)

{
  "name": "example_shop",
  "settings": {
    "EXTRACTOR_ORDER": []
  },
  "callbacks": {
    "parse_product": {
      "extract": {
        "name": {"css": "h1.product-name::text"},
        "price": {"css": "span.price::text"}
      }
    }
  }
}

Fallback Behavior

Extraction fails when:

Selector returns no match
Content/title too short (< 100 chars / < 5 chars)
Parser exception

When all extractors fail, the page is skipped and error logged.

Performance Considerations

Fast extractors (news/blogs):

"EXTRACTOR_ORDER": ["newspaper", "trafilatura"]
"CONCURRENT_REQUESTS": 16

Playwright (slow, use low concurrency):

"EXTRACTOR_ORDER": ["playwright", "trafilatura"]
"CONCURRENT_REQUESTS": 2
"DOWNLOAD_DELAY": 2

Custom selectors (fast if configured correctly):

"EXTRACTOR_ORDER": ["custom"]
"CONCURRENT_REQUESTS": 16

Debugging Extraction

# Analyze HTML structure
scrapai analyze data/proj/spider/analysis/page.html

# Test custom selector
scrapai analyze page.html --test "h1.article-title"

# Crawl with limit
scrapai crawl spider --limit 5 --project proj

# View extracted data (check 'source' field for which extractor succeeded)
scrapai show spider 1 --project proj

Newspaper Extractor - Newspaper4k configuration
Trafilatura Extractor - Trafilatura options
Custom Extractors - CSS selector syntax
Playwright Extractor - Browser rendering
Settings - Complete settings reference

​Available Extractors

Newspaper4k

Trafilatura

Custom CSS

Playwright

​Extraction Order

​Strategy Selection

​Extractor Comparison

​Content Validation

​ScrapedArticle Schema

​Configuration Examples

​Generic News Site

​Custom Selectors with Fallback

​JavaScript-Rendered Site

​E-commerce (Custom Only)

​Fallback Behavior

​Performance Considerations

​Debugging Extraction

​Related

Available Extractors

Extraction Order

Strategy Selection

Extractor Comparison

Content Validation

ScrapedArticle Schema

Configuration Examples

Generic News Site

Custom Selectors with Fallback

JavaScript-Rendered Site

E-commerce (Custom Only)

Fallback Behavior

Performance Considerations

Debugging Extraction

Related