Discovery Workflow
Extractor Order Options
| Config | When to use |
|---|---|
| `["newspaper", "trafilatura"]` | Generic extractors work (clean news/blog HTML) |
| `["custom", "newspaper", "trafilatura"]` | Generic extractors fail; custom selectors needed |
| `["playwright", "custom"]` | JS-rendered content (SPAs, dynamic loading) |
| `["playwright", "trafilatura"]` | JS-rendered, generic extractors work after rendering |
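The table above reduces to two questions: is the page JS-rendered, and do generic extractors produce usable output? A minimal Python sketch of that decision follows. The function name and its two boolean parameters are hypothetical and not part of the tool; only the returned lists come from the table.

```python
def choose_extractor_order(js_rendered: bool, generic_works: bool) -> list[str]:
    """Map the two questions from the table to an EXTRACTOR_ORDER value.

    Hypothetical helper for illustration; only the returned lists
    are taken from the table.
    """
    if js_rendered:
        # Rendering comes first; then either a generic or a custom extractor.
        if generic_works:
            return ["playwright", "trafilatura"]
        return ["playwright", "custom"]
    if generic_works:
        return ["newspaper", "trafilatura"]
    # Generic extractors fail on static HTML: try custom selectors first.
    return ["custom", "newspaper", "trafilatura"]


print(choose_extractor_order(js_rendered=False, generic_works=True))
```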
Generic Extractors
- Newspaper
- Trafilatura
Best for: Clean news articles and blog posts

What it extracts:
- Title
- Author
- Published date
- Main content
- Top image
- Keywords
- Summary
Generic extractors work for ~80% of news/blog sites. Try them first before creating custom selectors.
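Trying extractors in order and falling back when one returns nothing can be sketched as below. This is an illustration under assumptions, not the project's actual implementation: the registry of callables, the function name, and the 500-character acceptance threshold are all hypothetical, and the two extractors are stubs so the example runs without third-party libraries.

```python
from typing import Callable, Optional

# An extractor takes raw HTML and returns a dict of fields, or None on failure.
Extractor = Callable[[str], Optional[dict]]


def extract_with_fallback(html: str, order: list[str],
                          registry: dict[str, Extractor],
                          min_content_chars: int = 500):
    """Return (extractor_name, result) for the first extractor whose
    content is long enough, or (None, None) if every extractor fails."""
    for name in order:
        extractor = registry.get(name)
        if extractor is None:
            continue  # name not registered; skip it
        result = extractor(html)
        if result and len(result.get("content") or "") >= min_content_chars:
            return name, result
    return None, None


# Stub extractors standing in for the real ones (assumption for the demo).
def empty_extractor(html: str) -> Optional[dict]:
    return None


def stub_extractor(html: str) -> Optional[dict]:
    return {"title": "Example headline", "content": "x" * 600}


registry = {"newspaper": empty_extractor, "trafilatura": stub_extractor}
winner, article = extract_with_fallback(
    "<html></html>", ["newspaper", "trafilatura"], registry
)
print(winner)  # prints "trafilatura"
```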
Custom Selectors
Standard fields (title, author, content, date) → main DB columns. Any other field → stored in the metadata JSON column.
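The split between standard fields and metadata can be pictured with a short sketch. The function and constant names are hypothetical, and the real storage layer may behave differently (for example in how it handles missing fields); only the four standard field names come from the text above.

```python
import json

# The four standard fields named above; everything else goes to metadata.
STANDARD_FIELDS = ("title", "author", "content", "date")


def split_fields(extracted: dict) -> tuple[dict, str]:
    """Separate standard columns from extra fields.

    Returns (columns, metadata_json). Missing standard fields become None.
    """
    columns = {name: extracted.get(name) for name in STANDARD_FIELDS}
    extra = {k: v for k, v in extracted.items() if k not in STANDARD_FIELDS}
    return columns, json.dumps(extra, sort_keys=True)


columns, metadata = split_fields({
    "title": "Widget",
    "content": "A small widget.",
    "price": "19.99",  # custom field, ends up in metadata
    "sku": "W-1",      # custom field, ends up in metadata
})
print(columns)
print(metadata)  # prints {"price": "19.99", "sku": "W-1"}
```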
News Article
spider.json
E-commerce Product
spider.json
Forum Thread
spider.json
Playwright Extractor
Basic Configuration
spider.json
- PLAYWRIGHT_WAIT_SELECTOR: CSS selector to wait for (max 30s)
- PLAYWRIGHT_DELAY: Extra seconds after page load
Infinite Scroll
spider.json
- INFINITE_SCROLL: Enable scroll behavior (default: false)
- MAX_SCROLLS: Max scrolls to perform (default: 5)
- SCROLL_DELAY: Seconds between scrolls (default: 1.0)
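To see how the defaults combine, the sketch below merges a partial settings dict with the documented defaults and estimates the time spent waiting between scrolls. The helper names are hypothetical, and the estimate assumes the delay is applied once per scroll, which the source does not confirm.

```python
# Defaults as documented above.
SCROLL_DEFAULTS = {"INFINITE_SCROLL": False, "MAX_SCROLLS": 5, "SCROLL_DELAY": 1.0}


def resolve_scroll_settings(settings: dict) -> dict:
    """Fill in documented defaults for any scroll setting left unset."""
    overrides = {k: v for k, v in settings.items() if k in SCROLL_DEFAULTS}
    return {**SCROLL_DEFAULTS, **overrides}


def worst_case_scroll_seconds(settings: dict) -> float:
    """Upper bound on scroll wait time, assuming one delay per scroll."""
    resolved = resolve_scroll_settings(settings)
    if not resolved["INFINITE_SCROLL"]:
        return 0.0  # scrolling disabled, no extra wait
    return resolved["MAX_SCROLLS"] * resolved["SCROLL_DELAY"]


print(resolve_scroll_settings({"INFINITE_SCROLL": True, "MAX_SCROLLS": 10}))
print(worst_case_scroll_seconds({"INFINITE_SCROLL": True, "MAX_SCROLLS": 10}))
```

Raising MAX_SCROLLS or SCROLL_DELAY increases crawl time per page under this assumption, so it is worth keeping both as low as the site allows.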
Complete Playwright Example
spa_spider.json
Selector Discovery Principles
- Target main content element (not navigation, sidebar, footer)
- Selector should match ONE unique element
- Prefer specific classes (`.article-title` over `.title`)
- Test on multiple pages
- Prefer semantic tags (`<article>`, `<time>`, `<h1>`)
- Validate content length (>500 chars for content, >10 for title)
- Avoid dynamic/generated class names
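Two of the principles above lend themselves to automated checks: the length thresholds and the warning about generated class names. The sketch below is a heuristic illustration only. The function names are hypothetical, and the regular expression covers a few common CSS-in-JS naming patterns (an assumption); it will miss others and can misfire on hand-written names such as `sc-header`.

```python
import re

MIN_CONTENT_CHARS = 500  # threshold from the list above
MIN_TITLE_CHARS = 10     # threshold from the list above

# Heuristic: prefixes like "css-1q2w3e" or "sc-bdVaJa", or a hashed
# suffix containing a digit, like "Title__3xK9f".
GENERATED_CLASS = re.compile(
    r"^(?:css|sc|jsx)-[A-Za-z0-9]+$|__(?=[A-Za-z0-9]*\d)[A-Za-z0-9]{5,}$"
)


def looks_generated(class_name: str) -> bool:
    """Guess whether a class name was produced by a build tool."""
    return bool(GENERATED_CLASS.search(class_name))


def validate_extraction(title, content) -> list[str]:
    """Return a list of problems; an empty list means both checks pass."""
    problems = []
    if len((title or "").strip()) <= MIN_TITLE_CHARS:
        problems.append("title too short")
    if len((content or "").strip()) <= MIN_CONTENT_CHARS:
        problems.append("content too short")
    return problems


print(looks_generated("css-1q2w3e"))     # True
print(looks_generated("article-title"))  # False
print(validate_extraction("Short", "x" * 600))  # ['title too short']
```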
Identifying JS-Rendered Sites
Playwright Wait: Common Selectors
- Article sites: `.article-content`, `#main-content`, `article.post`, `[data-loaded="true"]`
- `.product-details`, `.price-container`, `#product-info`
- `.post-list`, `#posts.loaded`
- `.content-loaded`, `[data-ready]`, `.main-container`
Implementation Details
Extractor classes (from source:core/extractors.py):
- NewspaperExtractor (extractors.py:36-78): Uses newspaper4k library
- TrafilaturaExtractor (extractors.py:80-128): Uses trafilatura library
- CustomExtractor (extractors.py:131-257): Uses BeautifulSoup + CSS selectors
- SmartExtractor (extractors.py:259-464): Tries multiple strategies in order
- PlaywrightExtractor (extractors.py:398-464): Async browser rendering

Troubleshooting
Generic Extractors Return Empty Content
- Check if JS-rendered (empty `<div id="app"></div>`)
- Try Playwright: `{"EXTRACTOR_ORDER": ["playwright", "trafilatura"]}`
- Use custom selectors: `{"EXTRACTOR_ORDER": ["custom", "trafilatura"]}`
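A quick way to run the first check is to measure how much visible text the raw HTML contains before any JavaScript executes. The sketch below uses only the Python standard library. The class and function names are hypothetical, and the 200-character threshold is an arbitrary assumption that should be tuned per site.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text outside script/style/noscript/template elements."""

    SKIP = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def looks_js_rendered(html: str, min_text_chars: int = 200) -> bool:
    """True when the raw HTML carries almost no visible text."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.chunks).split())  # collapse whitespace
    return len(text) < min_text_chars


spa_page = (
    "<html><head><title>App</title>"
    "<script>var bundle = 'x'.repeat(1000);</script></head>"
    '<body><div id="app"></div></body></html>'
)
static_page = "<html><body><article>" + "word " * 100 + "</article></body></html>"

print(looks_js_rendered(spa_page))     # True
print(looks_js_rendered(static_page))  # False
```

If the check returns True for pages fetched without a browser, a Playwright-based extractor order is the likelier fix.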
Custom Selector Returns None
- Test selector: `./scrapai analyze page.html --test "your-selector"`
- Check selector specificity and uniqueness
- Verify element exists in HTML structure
Playwright Timeout
- Increase delay: `{"PLAYWRIGHT_DELAY": 10}`
- Use different wait selector that appears earlier
- Verify selector exists in rendered page
Content Extracted But Wrong
- Verify selector uniqueness (may match sidebar/footer)
- Make selector more specific: `{"content": "main article.post div.body"}`
- Test on multiple pages
Best Practices
- Start with generic extractors: `["newspaper", "trafilatura"]` (80% success rate)
- Add custom selectors if needed: `["custom", "trafilatura"]`
- Use Playwright for JS-rendered sites: `["playwright", "trafilatura"]`
- Test on multiple pages
- Monitor quality: `./scrapai show 1 --project proj`
Related Guides
- Custom Callbacks: Extract structured data with callbacks
- Data Processors: Transform extracted data