Test generic extractors (newspaper, trafilatura) first. Only use custom selectors if they fail.
Discovery Workflow
Inspect article page
./scrapai inspect https://example.com/article-url --project proj
Fetches the page and saves HTML for analysis
Analyze HTML structure
./scrapai analyze data/proj/spider/analysis/page.html
Shows:
h1/h2 titles with classes
Content containers by size
Date elements
Author elements
Test selectors
./scrapai analyze data/proj/spider/analysis/page.html --test "h1.article-title"
./scrapai analyze data/proj/spider/analysis/page.html --test "div.article-body"
Validate that selectors extract correct content
Search for specific fields
./scrapai analyze data/proj/spider/analysis/page.html --find "price"
Find elements containing specific text
| Config | When to use |
| --- | --- |
| ["newspaper", "trafilatura"] | Generic extractors work (clean news/blog HTML) |
| ["custom", "newspaper", "trafilatura"] | Generic extractors fail; custom selectors needed |
| ["playwright", "custom"] | JS-rendered content (SPAs, dynamic loading) |
| ["playwright", "trafilatura"] | JS-rendered, generic extractors work after rendering |
Extractors are tried in order. First successful extraction wins.
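The "first success wins" rule can be pictured as a simple fallback loop. This is a minimal sketch, not the project's implementation: the `registry` mapping, the stand-in extractor functions, and the definition of "successful" (a non-empty `content` field) are all assumptions made for illustration.

```python
from typing import Callable, Optional

# Hypothetical extractor signature: takes HTML, returns a dict of fields or None.
Extractor = Callable[[str], Optional[dict]]


def extract_in_order(html: str, order: list[str],
                     registry: dict[str, Extractor]) -> Optional[dict]:
    """Try each named extractor in order and return the first usable result."""
    for name in order:
        extractor = registry.get(name)
        if extractor is None:
            continue  # assumption: unknown names are skipped, not fatal
        result = extractor(html)
        # Assumption: "successful" means the result has non-empty content.
        if result and result.get("content"):
            return {"extractor": name, **result}
    return None


# Stand-in extractors for illustration only; they do not call the real libraries.
def fake_newspaper(html: str) -> Optional[dict]:
    return None  # pretend newspaper could not find the article body


def fake_trafilatura(html: str) -> Optional[dict]:
    return {"title": "Hello", "content": html.strip()}


registry = {"newspaper": fake_newspaper, "trafilatura": fake_trafilatura}
winner = extract_in_order("<p>body text</p>", ["newspaper", "trafilatura"], registry)
print(winner["extractor"])  # trafilatura
```

Because the loop stops at the first usable result, putting the most precise extractor first matters: a later extractor only runs when every earlier one returned nothing.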
Newspaper
Best for: Clean news articles and blog posts
{
  "settings": {
    "EXTRACTOR_ORDER": ["newspaper", "trafilatura"]
  }
}
What it extracts:
Title
Author
Published date
Main content
Top image
Keywords
Summary
Trafilatura
Best for: Wide variety of content types
{
  "settings": {
    "EXTRACTOR_ORDER": ["trafilatura", "newspaper"]
  }
}
What it extracts:
Title
Author
Published date
Main content
Description
Site name
Categories
Tags
Generic extractors work for ~80% of news/blog sites. Try them first before creating custom selectors.
Custom Selectors
Use when generic extractors fail to extract content properly.
Standard fields (title, author, content, date) → main DB columns.
Any other field → stored in metadata JSON column.
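One way to picture this split is a helper that separates the four standard fields from everything else. This is a sketch under assumptions: the row layout and the `split_fields` name are hypothetical, and only the field names and the existence of a metadata JSON column come from the text above.

```python
import json

# The four standard fields named above; everything else goes to metadata.
STANDARD_FIELDS = ("title", "author", "content", "date")


def split_fields(extracted: dict) -> dict:
    """Build a row with one key per standard field plus a JSON metadata string."""
    row = {name: extracted.get(name) for name in STANDARD_FIELDS}
    extra = {k: v for k, v in extracted.items() if k not in STANDARD_FIELDS}
    row["metadata"] = json.dumps(extra, sort_keys=True)
    return row


row = split_fields({
    "title": "Widget",
    "content": "A fine widget.",
    "price": "19.99",
    "brand": "Acme",
})
print(row["metadata"])  # {"brand": "Acme", "price": "19.99"}
```

In practice this means custom fields such as `price` or `upvotes` need no schema change: they can be added to CUSTOM_SELECTORS freely and read back from the metadata column.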
News Article
{
  "settings": {
    "EXTRACTOR_ORDER": ["custom", "trafilatura"],
    "CUSTOM_SELECTORS": {
      "title": "h1.article-title",
      "content": "div.article-body",
      "author": "span.author-name",
      "date": "time.published-date",
      "category": "a.category-link",
      "tags": "div.tags a"
    }
  }
}
E-commerce Product
{
  "settings": {
    "EXTRACTOR_ORDER": ["custom"],
    "CUSTOM_SELECTORS": {
      "title": "h1.product-name",
      "content": "div.product-description",
      "price": "span.price-value",
      "rating": "div.star-rating",
      "stock": "span.availability",
      "brand": "div.brand-name"
    }
  }
}
Forum Thread
{
  "settings": {
    "EXTRACTOR_ORDER": ["custom"],
    "CUSTOM_SELECTORS": {
      "title": "h1.thread-title",
      "author": "span.username",
      "content": "div.post-content",
      "date": "time.post-date",
      "upvotes": "span.vote-count"
    }
  }
}
Use the playwright extractor for JavaScript-rendered or dynamically loaded content.
Basic Configuration
{
  "settings": {
    "EXTRACTOR_ORDER": ["playwright", "trafilatura"],
    "PLAYWRIGHT_WAIT_SELECTOR": ".article-content",
    "PLAYWRIGHT_DELAY": 5
  }
}
Settings:
PLAYWRIGHT_WAIT_SELECTOR: CSS selector to wait for (max 30s)
PLAYWRIGHT_DELAY: Extra seconds after page load
For single-page sites with dynamic scroll loading:
{
  "settings": {
    "EXTRACTOR_ORDER": ["playwright", "trafilatura"],
    "PLAYWRIGHT_WAIT_SELECTOR": ".quote",
    "PLAYWRIGHT_DELAY": 2,
    "INFINITE_SCROLL": true,
    "MAX_SCROLLS": 10,
    "SCROLL_DELAY": 2.0
  }
}
Settings:
INFINITE_SCROLL: Enable scroll behavior (default: false)
MAX_SCROLLS: Max scrolls to perform (default: 5)
SCROLL_DELAY: Seconds between scrolls (default: 1.0)
Requires playwright in EXTRACTOR_ORDER. Set follow: false in rules to avoid following dynamically loaded links.
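The interaction between MAX_SCROLLS and SCROLL_DELAY can be sketched as a bounded loop. This is an illustration only: the `FakePage` class stands in for a real browser page, and stopping early when the page height stops growing is an assumption, not documented behavior.

```python
import time


def scroll_to_bottom(page, max_scrolls=5, scroll_delay=1.0, sleep=time.sleep):
    """Scroll until the page stops growing or max_scrolls is reached.

    Returns the number of scrolls performed.
    """
    last_height = page.height()
    scrolls = 0
    while scrolls < max_scrolls:
        page.scroll_to_end()
        scrolls += 1
        sleep(scroll_delay)  # give newly requested content time to load
        new_height = page.height()
        if new_height == last_height:
            break  # assumption: no growth means nothing more to load
        last_height = new_height
    return scrolls


class FakePage:
    """Stand-in page that grows by 1000px for the first `batches` scrolls."""

    def __init__(self, batches):
        self._height = 1000
        self._remaining = batches

    def height(self):
        return self._height

    def scroll_to_end(self):
        if self._remaining > 0:
            self._height += 1000
            self._remaining -= 1


no_sleep = lambda seconds: None  # skip real waiting in this demo
print(scroll_to_bottom(FakePage(batches=3), max_scrolls=10, scroll_delay=0, sleep=no_sleep))  # 4
print(scroll_to_bottom(FakePage(batches=3), max_scrolls=2, scroll_delay=0, sleep=no_sleep))   # 2
```

The second call shows why MAX_SCROLLS matters: with a low cap, the loop stops before all batches have loaded, so the cap should be sized to the amount of content you expect.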
Complete Playwright Example
{
  "name": "react_app",
  "allowed_domains": ["example.com"],
  "start_urls": ["https://example.com/app"],
  "rules": [
    {
      "allow": ["/article/.*"],
      "callback": "parse_article",
      "follow": false
    }
  ],
  "settings": {
    "EXTRACTOR_ORDER": ["playwright", "custom"],
    "PLAYWRIGHT_WAIT_SELECTOR": "#article-loaded",
    "PLAYWRIGHT_DELAY": 3,
    "CUSTOM_SELECTORS": {
      "title": "h1.title",
      "content": "div.content",
      "author": "span.author"
    }
  }
}
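Before running a spider, a settings block like the one above can be sanity-checked with a few lines of Python. The `check_settings` helper is hypothetical and not part of scrapai. The first rule restates the infinite-scroll requirement from the settings notes, and the second rule is an assumption about what counts as a mistake.

```python
def check_settings(settings: dict) -> list[str]:
    """Return human-readable problems found in a spider's settings block."""
    problems = []
    order = settings.get("EXTRACTOR_ORDER", [])
    # Documented rule: infinite scroll requires playwright in EXTRACTOR_ORDER.
    if settings.get("INFINITE_SCROLL") and "playwright" not in order:
        problems.append("INFINITE_SCROLL requires playwright in EXTRACTOR_ORDER")
    # Assumption: listing "custom" without any selectors is a mistake.
    if "custom" in order and not settings.get("CUSTOM_SELECTORS"):
        problems.append("custom is in EXTRACTOR_ORDER but CUSTOM_SELECTORS is empty")
    return problems


good = {
    "EXTRACTOR_ORDER": ["playwright", "custom"],
    "PLAYWRIGHT_WAIT_SELECTOR": "#article-loaded",
    "PLAYWRIGHT_DELAY": 3,
    "CUSTOM_SELECTORS": {"title": "h1.title", "content": "div.content"},
}
bad = {"EXTRACTOR_ORDER": ["custom", "trafilatura"], "INFINITE_SCROLL": True}

print(check_settings(good))       # []
print(len(check_settings(bad)))   # 2
```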
Selector Discovery Principles
Target MAIN content element
Not navigation, sidebar, or footer
Selector should match ONE element
Verify uniqueness on multiple pages
Prefer specific classes
Use .article-title over generic .title
Test on multiple articles
Selector must work across different pages
Prefer semantic tags
Use <article>, <time>, <h1> when available
Validate content length
Content selector should return >500 chars
Title should return >10 chars
Avoid dynamic class names
Skip randomly generated or versioned classes
Common mistakes:
Selector matches multiple elements (gets first, often wrong)
Selector targets sidebar/footer instead of main content
Overly generic selectors like div.text
Guessing without testing on actual HTML
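The length thresholds from the principles above are easy to automate. The sketch below assumes extraction results arrive as a plain dict with `title` and `content` keys. The `validate_extraction` helper is hypothetical, and only the 500 and 10 character limits come from this page.

```python
MIN_CONTENT_CHARS = 500  # content selector should return more than this
MIN_TITLE_CHARS = 10     # title selector should return more than this


def validate_extraction(fields: dict) -> list[str]:
    """Return warnings for fields that look too short to be real content."""
    warnings = []
    title = (fields.get("title") or "").strip()
    content = (fields.get("content") or "").strip()
    if len(title) <= MIN_TITLE_CHARS:
        warnings.append(f"title too short ({len(title)} chars)")
    if len(content) <= MIN_CONTENT_CHARS:
        warnings.append(f"content too short ({len(content)} chars)")
    return warnings


ok = {"title": "A sufficiently long headline", "content": "word " * 200}
sidebar = {"title": "Menu", "content": "Home | About | Contact"}

print(validate_extraction(ok))       # []
print(validate_extraction(sidebar))
```

A short result often means the selector hit a sidebar, teaser, or navigation block rather than the article body, which is why the check is worth running on several pages.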
Identifying JS-Rendered Sites
Signs that page.html needs Playwright:
Empty Content
JS Data Objects
Loading Placeholders
<body>
  <div id="app"></div>
  <script src="bundle.js"></script>
</body>
If you see minimal/empty content or just <div id="app"></div>, use ["playwright", ...] in EXTRACTOR_ORDER.
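The "empty content" sign can be checked with a rough heuristic: measure how much visible text the saved HTML contains once scripts and styles are ignored. This sketch uses only the standard library. The `looks_js_rendered` name and the 200-character threshold are assumptions to tune per site, not values from scrapai.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collect text that is not inside script, style, noscript or template tags."""

    SKIP = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def looks_js_rendered(html: str, min_chars: int = 200) -> bool:
    """Return True when the page has very little visible text (assumed threshold)."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return len(" ".join(parser.chunks)) < min_chars


shell = '<body><div id="app"></div><script src="bundle.js"></script></body>'
article = "<body><article><p>" + "word " * 100 + "</p></article></body>"

print(looks_js_rendered(shell))    # True
print(looks_js_rendered(article))  # False
```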
Playwright Wait: Common Selectors
Article sites:
.article-content
#main-content
article.post
[data-loaded="true"]
Product pages:
.product-details
.price-container
#product-info
Social/Forums:
.post-list
#posts
.loaded
Generic:
.content-loaded
[data-ready]
.main-container
Implementation Details
Extractor classes (from source: core/extractors.py):
NewspaperExtractor (extractors.py:36-78): Uses newspaper4k library
TrafilaturaExtractor (extractors.py:80-128): Uses trafilatura library
CustomExtractor (extractors.py:131-257): Uses BeautifulSoup + CSS selectors
SmartExtractor (extractors.py:259-464): Tries multiple strategies in order
PlaywrightExtractor (extractors.py:398-464): Async browser rendering
Troubleshooting
Check if JS-rendered
Inspect HTML for empty <div id="app"></div> or minimal content
Try Playwright extractor
{"EXTRACTOR_ORDER": ["playwright", "trafilatura"]}
Use custom selectors
If generic extractors consistently fail: {"EXTRACTOR_ORDER": ["custom", "trafilatura"]}
Custom Selector Returns None
Test selector
./scrapai analyze page.html --test "your-selector"
Check selector specificity
Ensure selector matches the intended element uniquely
Verify HTML structure
Element might be in different location than expected
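When a selector returns nothing or the wrong element, counting its matches in the saved HTML narrows the problem quickly: 0 means the selector is wrong or the element is missing, 1 is the goal, and more than 1 means it is ambiguous. The sketch below is a standard-library stand-in that only understands `tag`, `.class` and `tag.class` selectors. It is not how CustomExtractor works; with BeautifulSoup, which the project uses, `len(soup.select(selector))` gives the same count for full CSS selectors.

```python
from html.parser import HTMLParser


class _MatchCounter(HTMLParser):
    """Count start tags matching an optional tag name and an optional class."""

    def __init__(self, tag: str, cls: str):
        super().__init__()
        self.tag = tag
        self.cls = cls
        self.count = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        tag_ok = not self.tag or tag == self.tag
        cls_ok = not self.cls or self.cls in classes
        if tag_ok and cls_ok:
            self.count += 1


def count_matches(html: str, selector: str) -> int:
    """Count elements matching a simple `tag`, `.class` or `tag.class` selector."""
    tag, _, cls = selector.partition(".")
    parser = _MatchCounter(tag.lower(), cls)
    parser.feed(html)
    parser.close()
    return parser.count


sample = (
    '<h1 class="article-title">Headline</h1>'
    '<div class="title">Related</div>'
    '<div class="title side">More</div>'
)
print(count_matches(sample, "h1.article-title"))  # 1
print(count_matches(sample, ".title"))            # 2
print(count_matches(sample, "span.missing"))      # 0
```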
Playwright Timeout
Use different wait selector
Find a selector that appears earlier in the page load: {"PLAYWRIGHT_WAIT_SELECTOR": ".loading-complete"}
Check selector exists
Verify wait selector actually appears in rendered page
Verify selector uniqueness
May be matching wrong element (sidebar, footer)
Make selector more specific
Add parent classes or IDs: {"content": "main article.post div.body"}
Test on multiple pages
Ensure selector works across different articles
Best Practices
Start with generic extractors
Works for ~80% of news/blog sites: {"EXTRACTOR_ORDER": ["newspaper", "trafilatura"]}
Add custom selectors if needed
When generic extractors fail: {"EXTRACTOR_ORDER": ["custom", "trafilatura"]}
Use Playwright for JS sites
Only when content is dynamically rendered: {"EXTRACTOR_ORDER": ["playwright", "trafilatura"]}
Test on multiple pages
Verify extraction works across different articles
Monitor extraction quality
Check scraped content with the show command: ./scrapai show 1 --project proj
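The escalation path in these practices, together with the config table near the top of this page, reduces to two questions: is the page JS-rendered, and do generic extractors work on it? A minimal sketch of that decision follows. The `choose_extractor_order` helper is hypothetical, and the returned lists are the four configurations from the table.

```python
def choose_extractor_order(js_rendered: bool, generic_works: bool) -> list[str]:
    """Map the two diagnostic answers onto the EXTRACTOR_ORDER values from the table."""
    if js_rendered:
        # Render first, then use a generic extractor if it copes with the rendered HTML.
        return ["playwright", "trafilatura"] if generic_works else ["playwright", "custom"]
    if generic_works:
        return ["newspaper", "trafilatura"]
    # Static HTML that generic extractors cannot handle: custom first, generic as fallback.
    return ["custom", "newspaper", "trafilatura"]


print(choose_extractor_order(js_rendered=False, generic_works=True))
print(choose_extractor_order(js_rendered=True, generic_works=False))
```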