Content Extractors

Test generic extractors (newspaper, trafilatura) first. Only use custom selectors if they fail.

Discovery Workflow

Inspect article page

./scrapai inspect https://example.com/article-url --project proj

Analyze HTML structure

./scrapai analyze data/proj/spider/analysis/page.html

Test selectors

./scrapai analyze data/proj/spider/analysis/page.html --test "h1.article-title"
./scrapai analyze data/proj/spider/analysis/page.html --test "div.article-body"

Search for specific fields

./scrapai analyze data/proj/spider/analysis/page.html --find "price"

Extractor Order Options

Config	When to use
`["newspaper", "trafilatura"]`	Generic extractors work (clean news/blog HTML)
`["custom", "newspaper", "trafilatura"]`	Generic extractors fail; custom selectors needed
`["playwright", "custom"]`	JS-rendered content (SPAs, dynamic loading)
`["playwright", "trafilatura"]`	JS-rendered, generic extractors work after rendering

Generic Extractors

Newspaper
Trafilatura

Best for: Clean news articles and blog posts

{
  "settings": {
    "EXTRACTOR_ORDER": ["newspaper", "trafilatura"]
  }
}

What it extracts:

Title
Author
Published date
Main content
Top image
Keywords
Summary

Best for: Wide variety of content types

{
  "settings": {
    "EXTRACTOR_ORDER": ["trafilatura", "newspaper"]
  }
}

What it extracts:

Title
Author
Published date
Main content
Description
Site name
Categories
Tags

Generic extractors work for ~80% of news/blog sites. Try them first before creating custom selectors.

Custom Selectors

Standard fields (title, author, content, date) → main DB columns. Any other field → stored in metadata JSON column.

News Article

spider.json

{
  "settings": {
    "EXTRACTOR_ORDER": ["custom", "trafilatura"],
    "CUSTOM_SELECTORS": {
      "title": "h1.article-title",
      "content": "div.article-body",
      "author": "span.author-name",
      "date": "time.published-date",
      "category": "a.category-link",
      "tags": "div.tags a"
    }
  }
}

E-commerce Product

spider.json

{
  "settings": {
    "EXTRACTOR_ORDER": ["custom"],
    "CUSTOM_SELECTORS": {
      "title": "h1.product-name",
      "content": "div.product-description",
      "price": "span.price-value",
      "rating": "div.star-rating",
      "stock": "span.availability",
      "brand": "div.brand-name"
    }
  }
}

Forum Thread

spider.json

{
  "settings": {
    "EXTRACTOR_ORDER": ["custom"],
    "CUSTOM_SELECTORS": {
      "title": "h1.thread-title",
      "author": "span.username",
      "content": "div.post-content",
      "date": "time.post-date",
      "upvotes": "span.vote-count"
    }
  }
}

Playwright Extractor

Basic Configuration

spider.json

{
  "settings": {
    "EXTRACTOR_ORDER": ["playwright", "trafilatura"],
    "PLAYWRIGHT_WAIT_SELECTOR": ".article-content",
    "PLAYWRIGHT_DELAY": 5
  }
}

Settings:

PLAYWRIGHT_WAIT_SELECTOR: CSS selector to wait for (max 30s)
PLAYWRIGHT_DELAY: Extra seconds after page load

Infinite Scroll

spider.json

{
  "settings": {
    "EXTRACTOR_ORDER": ["playwright", "trafilatura"],
    "PLAYWRIGHT_WAIT_SELECTOR": ".quote",
    "PLAYWRIGHT_DELAY": 2,
    "INFINITE_SCROLL": true,
    "MAX_SCROLLS": 10,
    "SCROLL_DELAY": 2.0
  }
}

Settings:

INFINITE_SCROLL: Enable scroll behavior (default: false)
MAX_SCROLLS: Max scrolls to perform (default: 5)
SCROLL_DELAY: Seconds between scrolls (default: 1.0)

Complete Playwright Example

spa_spider.json

{
  "name": "react_app",
  "allowed_domains": ["example.com"],
  "start_urls": ["https://example.com/app"],
  "rules": [
    {
      "allow": ["/article/.*"],
      "callback": "parse_article",
      "follow": false
    }
  ],
  "settings": {
    "EXTRACTOR_ORDER": ["playwright", "custom"],
    "PLAYWRIGHT_WAIT_SELECTOR": "#article-loaded",
    "PLAYWRIGHT_DELAY": 3,
    "CUSTOM_SELECTORS": {
      "title": "h1.title",
      "content": "div.content",
      "author": "span.author"
    }
  }
}

Selector Discovery Principles

Target main content element (not navigation, sidebar, footer)
Selector should match ONE unique element
Prefer specific classes (.article-title over .title)
Test on multiple pages
Prefer semantic tags (<article>, <time>, <h1>)
Validate content length (>500 chars for content, >10 for title)
Avoid dynamic/generated class names

Common mistakes:

Selector matches multiple elements
Targets sidebar/footer instead of main content
Overly generic selectors like div.text

Identifying JS-Rendered Sites

<body>
  <div id="app"></div>
  <script src="bundle.js"></script>
</body>

Playwright Wait: Common Selectors

Article sites:

.article-content
#main-content
article.post
[data-loaded="true"]

Product pages:

.product-details
.price-container
#product-info

Social/Forums:

.post-list
#posts
.loaded

Generic:

.content-loaded
[data-ready]
.main-container

Implementation Details

Extractor classes (from source: core/extractors.py):

NewspaperExtractor

extractors.py:36-78Uses newspaper4k library

TrafilaturaExtractor

extractors.py:80-128Uses trafilatura library

CustomExtractor

extractors.py:131-257Uses BeautifulSoup + CSS selectors

SmartExtractor

extractors.py:259-464Tries multiple strategies in order

PlaywrightExtractor

extractors.py:398-464Async browser rendering

Troubleshooting

Generic Extractors Return Empty Content

Check if JS-rendered (empty <div id="app"></div>)
Try Playwright: {"EXTRACTOR_ORDER": ["playwright", "trafilatura"]}
Use custom selectors: {"EXTRACTOR_ORDER": ["custom", "trafilatura"]}

Custom Selector Returns None

Test selector: ./scrapai analyze page.html --test "your-selector"
Check selector specificity and uniqueness
Verify element exists in HTML structure

Playwright Timeout

Increase delay: {"PLAYWRIGHT_DELAY": 10}
Use different wait selector that appears earlier
Verify selector exists in rendered page

Content Extracted But Wrong

Verify selector uniqueness (may match sidebar/footer)
Make selector more specific: {"content": "main article.post div.body"}
Test on multiple pages

Best Practices

Start with generic extractors: ["newspaper", "trafilatura"] (80% success rate)
Add custom selectors if needed: ["custom", "trafilatura"]
Use Playwright for JS-rendered sites: ["playwright", "trafilatura"]
Test on multiple pages
Monitor quality: ./scrapai show 1 --project proj

Custom Callbacks

Extract structured data with callbacks

Data Processors

Transform extracted data

​Discovery Workflow

​Extractor Order Options

​Generic Extractors

​Custom Selectors

​News Article

​E-commerce Product

​Forum Thread

​Playwright Extractor

​Basic Configuration

​Infinite Scroll

​Complete Playwright Example

​Selector Discovery Principles

​Identifying JS-Rendered Sites

​Playwright Wait: Common Selectors

​Implementation Details

NewspaperExtractor

TrafilaturaExtractor

CustomExtractor

SmartExtractor

PlaywrightExtractor

​Troubleshooting

​Generic Extractors Return Empty Content

​Custom Selector Returns None

​Playwright Timeout

​Content Extracted But Wrong

​Best Practices

​Related Guides

Custom Callbacks

Data Processors

Discovery Workflow

Extractor Order Options

Generic Extractors

Custom Selectors

News Article

E-commerce Product

Forum Thread

Playwright Extractor

Basic Configuration

Infinite Scroll

Complete Playwright Example

Selector Discovery Principles

Identifying JS-Rendered Sites

Playwright Wait: Common Selectors

Implementation Details

Troubleshooting

Generic Extractors Return Empty Content

Custom Selector Returns None

Playwright Timeout

Content Extracted But Wrong

Best Practices

Related Guides