ScrapAI supports multiple extraction strategies that can be chained with fallback. Each extractor tries to extract content, and if it fails, the next one is attempted.

Available Extractors

Extraction Order

Configure extraction order in spider settings:
{
  "settings": {
    "EXTRACTOR_ORDER": ["newspaper", "trafilatura"]
  }
}
How it works:
  1. Try the first extractor (e.g., newspaper)
  2. If extraction fails or the content is too short → try the next extractor
  3. Continue until an extraction succeeds or all extractors are exhausted
  4. Return a ScrapedArticle, or None if every extractor failed
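
The steps above can be sketched in Python as a loop over an ordered list of extractor names. This only illustrates the fallback pattern and is not ScrapAI's implementation: `extract_with_fallback`, the `registry` mapping, and the stub extractors are hypothetical names introduced here.

```python
from typing import Callable, Optional

# Hypothetical extractor signature: takes raw HTML, returns a result dict or None.
Extractor = Callable[[str], Optional[dict]]


def extract_with_fallback(
    html: str, order: list[str], registry: dict[str, Extractor]
) -> Optional[dict]:
    """Try each extractor named in `order`; return the first non-None result."""
    for name in order:
        extractor = registry.get(name)
        if extractor is None:
            continue  # unknown name: skip it rather than abort the chain
        try:
            result = extractor(html)
        except Exception:
            result = None  # a parser exception counts as a failed extraction
        if result is not None:
            return result
    return None  # all extractors exhausted


# Stub extractors standing in for the real ones (assumptions for the demo).
def failing_extractor(html: str) -> Optional[dict]:
    return None


def working_extractor(html: str) -> Optional[dict]:
    return {"source": "trafilatura", "content": html}


registry = {"newspaper": failing_extractor, "trafilatura": working_extractor}
result = extract_with_fallback("<p>hello</p>", ["newspaper", "trafilatura"], registry)
print(result["source"])  # → trafilatura
```

Keeping the order as plain data means the same loop serves every spider; only the list in the settings changes.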

Strategy Selection

Scenario | Recommended Order
Generic news/blog (clean HTML) | ["newspaper", "trafilatura"]
Generic extractors fail | ["custom", "newspaper", "trafilatura"]
JavaScript-rendered content (SPA) | ["playwright", "trafilatura"]
JS-rendered + custom structure | ["playwright", "custom"]
E-commerce, jobs, listings | ["custom"] (with callbacks)
Infinite scroll page | ["playwright"] (single extractor)

Extractor Comparison

Feature | Newspaper | Trafilatura | Custom | Playwright
Speed | Fast | Fast | Fast | Slow
Accuracy | Good | Excellent | Perfect (if configured) | Good
Setup | None | None | Requires CSS selectors | Requires wait config
Use Case | News articles | Any content | Structured data | JS content
Metadata | Keywords, summary, top_image | Description, tags, fingerprint | Custom fields | None (uses trafilatura)

Content Validation

All extractors return ScrapedArticle, which validates:
Title:
  • Must exist
  • Min length: 5 characters
Content:
  • Must exist
  • Min length: 100 characters
If validation fails: Extractor returns None, next strategy is tried
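
The length rules above can be expressed as a small check. This is a minimal sketch under stated assumptions: `is_valid_article` is a hypothetical helper, and it counts raw characters because the text does not say whether ScrapAI strips whitespace before measuring.

```python
from typing import Optional

MIN_TITLE_LENGTH = 5      # from the title rule above
MIN_CONTENT_LENGTH = 100  # from the content rule above


def is_valid_article(title: Optional[str], content: Optional[str]) -> bool:
    """Return True only if both fields exist and meet the minimum lengths."""
    if not title or len(title) < MIN_TITLE_LENGTH:
        return False
    if not content or len(content) < MIN_CONTENT_LENGTH:
        return False
    return True


print(is_valid_article("Hello world", "x" * 100))  # → True
print(is_valid_article("Hi", "x" * 100))           # → False
```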

ScrapedArticle Schema

url (string, required): Page URL
title (string, required): Article/page title (min 5 chars)
content (string, required): Main content text (min 100 chars)
author (string): Author name (if available)
published_date (datetime): Publication date (if available)
source (string, required): Extractor used: "newspaper4k", "trafilatura", "custom", "playwright"
extracted_at (datetime, required): Extraction timestamp (UTC)
metadata (object, default: {}): Extractor-specific or custom fields
Newspaper metadata:
  • top_image - Main image URL
  • keywords - Extracted keywords
  • summary - Auto-generated summary
Trafilatura metadata:
  • description - Meta description
  • sitename - Site name
  • categories, tags, fingerprint, license
Custom metadata:
  • Any fields from CUSTOM_SELECTORS (except title/content/author/date)
  • Any fields from callback extract config
html (string): Raw HTML (only if include_html=True in export)
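
For readers who think in types, the field list above can be mirrored as a Python dataclass. This is an illustrative sketch, not the real ScrapedArticle class: the name `ArticleSketch` is hypothetical, and how ScrapAI actually defines or validates the model is not stated here.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Optional


def _utc_now() -> datetime:
    return datetime.now(timezone.utc)


@dataclass
class ArticleSketch:
    """Illustrative mirror of the fields listed above (not the real class)."""
    url: str
    title: str
    content: str
    source: str
    # Required in the schema; defaulted here only to keep the demo short.
    extracted_at: datetime = field(default_factory=_utc_now)
    author: Optional[str] = None
    published_date: Optional[datetime] = None
    metadata: dict[str, Any] = field(default_factory=dict)
    html: Optional[str] = None


article = ArticleSketch(
    url="https://example.com/post",
    title="Example post",
    content="x" * 120,
    source="trafilatura",
    metadata={"sitename": "Example"},
)
print(article.metadata["sitename"])  # → Example
print(article.author)                # → None
```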

Configuration Examples

Generic News Site

{
  "name": "bbc_co_uk",
  "settings": {
    "EXTRACTOR_ORDER": ["newspaper", "trafilatura"],
    "DOWNLOAD_DELAY": 1,
    "CONCURRENT_REQUESTS": 16
  }
}
Behavior:
  • Try newspaper first (fast, good for news)
  • Fallback to trafilatura if newspaper fails

Custom Selectors with Fallback

{
  "name": "custom_blog",
  "settings": {
    "EXTRACTOR_ORDER": ["custom", "newspaper", "trafilatura"],
    "CUSTOM_SELECTORS": {
      "title": "h1.post-title",
      "content": "div.post-content",
      "author": "span.author-name",
      "category": "a.category"
    }
  }
}
Behavior:
  • Try custom selectors first (highest accuracy)
  • Fallback to generic extractors if selectors fail
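
In the configuration above, `category` is not one of the core article fields, so according to the schema description its value ends up in `metadata`. A minimal sketch of that split follows. It assumes the core keys are named `title`, `content`, `author`, and `date`; the `split_selectors` helper is hypothetical.

```python
CORE_FIELDS = {"title", "content", "author", "date"}  # assumed key names


def split_selectors(
    selectors: dict[str, str],
) -> tuple[dict[str, str], dict[str, str]]:
    """Separate core-field selectors from those whose values go into metadata."""
    core = {k: v for k, v in selectors.items() if k in CORE_FIELDS}
    extra = {k: v for k, v in selectors.items() if k not in CORE_FIELDS}
    return core, extra


custom_selectors = {
    "title": "h1.post-title",
    "content": "div.post-content",
    "author": "span.author-name",
    "category": "a.category",
}
core, extra = split_selectors(custom_selectors)
print(sorted(core))   # → ['author', 'content', 'title']
print(sorted(extra))  # → ['category']
```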

JavaScript-Rendered Site

{
  "name": "spa_site",
  "settings": {
    "EXTRACTOR_ORDER": ["playwright", "trafilatura"],
    "PLAYWRIGHT_WAIT_SELECTOR": ".article-content",
    "PLAYWRIGHT_DELAY": 3,
    "CONCURRENT_REQUESTS": 2
  }
}
Behavior:
  • Render page with Playwright
  • Wait for .article-content to appear
  • Extract with trafilatura from rendered HTML

E-commerce (Custom Only)

{
  "name": "example_shop",
  "settings": {
    "EXTRACTOR_ORDER": []
  },
  "callbacks": {
    "parse_product": {
      "extract": {
        "name": {"css": "h1.product-name::text"},
        "price": {"css": "span.price::text"}
      }
    }
  }
}
Behavior:
  • No generic extractors (not article content)
  • Use callback-based extraction only

Fallback Behavior

Extraction fails when:
  • Selector returns no match
  • Content/title too short (< 100 chars / < 5 chars)
  • Parser exception
Next strategy is tried when:
  • Extractor returns None
  • Validation fails on returned ScrapedArticle
All extractors failed:
  • Page is skipped
  • Error logged
  • No item saved to database

Performance Considerations

Fast extractors (news/blogs):
"EXTRACTOR_ORDER": ["newspaper", "trafilatura"]
"CONCURRENT_REQUESTS": 16
Playwright (slow, use low concurrency):
"EXTRACTOR_ORDER": ["playwright", "trafilatura"]
"CONCURRENT_REQUESTS": 2
"DOWNLOAD_DELAY": 2
Custom selectors (fast if configured correctly):
"EXTRACTOR_ORDER": ["custom"]
"CONCURRENT_REQUESTS": 16
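
The three profiles above follow one rule of thumb: lower the concurrency whenever Playwright is in the chain. A hypothetical helper (`suggest_throttle` is not part of ScrapAI) that applies the values shown above:

```python
def suggest_throttle(extractor_order: list[str]) -> dict[str, int]:
    """Return the concurrency settings shown above for a given extractor order."""
    if "playwright" in extractor_order:
        # Browser rendering is slow: use low concurrency plus a delay.
        return {"CONCURRENT_REQUESTS": 2, "DOWNLOAD_DELAY": 2}
    return {"CONCURRENT_REQUESTS": 16}


print(suggest_throttle(["playwright", "trafilatura"]))
# → {'CONCURRENT_REQUESTS': 2, 'DOWNLOAD_DELAY': 2}
print(suggest_throttle(["newspaper", "trafilatura"]))
# → {'CONCURRENT_REQUESTS': 16}
```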

Debugging Extraction

Test extraction order:
# Analyze HTML structure
scrapai analyze data/proj/spider/analysis/page.html

# Test custom selector
scrapai analyze page.html --test "h1.article-title"

# Crawl with limit
scrapai crawl spider --limit 5 --project proj

# View extracted data
scrapai show spider 1 --project proj
Check which extractor succeeded by looking at the source field in scraped items:
scrapai show spider 1 --project proj
Output shows source: newspaper4k or source: trafilatura, etc.
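
To see how often each extractor wins across a crawl, the `source` values can be tallied. The sketch below assumes the items are already available as a list of dicts, for example loaded from an export. The loading step is left out, and the `count_sources` helper is hypothetical.

```python
from collections import Counter


def count_sources(items: list[dict]) -> Counter:
    """Count how many items each extractor produced, keyed by `source`."""
    return Counter(item.get("source", "unknown") for item in items)


items = [
    {"url": "https://example.com/a", "source": "newspaper4k"},
    {"url": "https://example.com/b", "source": "trafilatura"},
    {"url": "https://example.com/c", "source": "newspaper4k"},
]
counts = count_sources(items)
print(counts.most_common())  # → [('newspaper4k', 2), ('trafilatura', 1)]
```

If the first extractor in the chain rarely appears in the tally, it may be worth reordering EXTRACTOR_ORDER so that the extractor that usually succeeds runs first.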