Available Extractors
Newspaper4k
General-purpose article extractor for news and blogs
Trafilatura
Lightweight content extraction with high accuracy
Custom CSS
Site-specific CSS selectors for structured data
Playwright
Browser rendering for JavaScript-heavy sites
Extraction Order
Configure the extraction order in spider settings. Extractors are then tried in that order:
- Try the first extractor (e.g., newspaper)
- If extraction fails or the content is too short, try the next extractor
- Continue until extraction succeeds or all extractors are exhausted
- Return ScrapedArticle or None
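The fallback loop described above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the Article class, the Extractor signature, the registry dict, and the toy extractors are all hypothetical stand-ins, and only the 100-character content threshold comes from this page.

```python
from dataclasses import dataclass
from typing import Callable, Optional

MIN_CONTENT_CHARS = 100  # threshold taken from the Content Validation section


@dataclass
class Article:
    """Hypothetical stand-in for ScrapedArticle, reduced to three fields."""
    title: str
    content: str
    source: str


# Assumed extractor signature: takes HTML, returns an Article or None.
Extractor = Callable[[str], Optional[Article]]


def extract_with_fallback(html: str, order: list[str],
                          registry: dict[str, Extractor]) -> Optional[Article]:
    """Try each extractor in order and return the first usable result."""
    for name in order:
        try:
            article = registry[name](html)
        except Exception:
            # A parser exception counts as a failure; move on to the next one.
            continue
        if article is not None and len(article.content) >= MIN_CONTENT_CHARS:
            return article
    return None  # all extractors exhausted


# Toy extractors standing in for the real ones.
def failing_newspaper(html: str) -> Optional[Article]:
    return None


def toy_trafilatura(html: str) -> Optional[Article]:
    return Article(title="Example title", content=html, source="trafilatura")


registry = {"newspaper": failing_newspaper, "trafilatura": toy_trafilatura}
result = extract_with_fallback("x" * 150, ["newspaper", "trafilatura"], registry)
```

Here the first extractor returns None, so the second one supplies the result; with input shorter than 100 characters every extractor is rejected and the function returns None.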
Strategy Selection
| Scenario | Recommended Order |
|---|---|
| Generic news/blog (clean HTML) | ["newspaper", "trafilatura"] |
| Generic extractors fail | ["custom", "newspaper", "trafilatura"] |
| JavaScript-rendered content (SPA) | ["playwright", "trafilatura"] |
| JS-rendered + custom structure | ["playwright", "custom"] |
| E-commerce, jobs, listings | ["custom"] (with callbacks) |
| Infinite scroll page | ["playwright"] (single extractor) |
Extractor Comparison
| Feature | Newspaper | Trafilatura | Custom | Playwright |
|---|---|---|---|---|
| Speed | Fast | Fast | Fast | Slow |
| Accuracy | Good | Excellent | Perfect (if configured) | Good |
| Setup | None | None | Requires CSS selectors | Requires wait config |
| Use Case | News articles | Any content | Structured data | JS content |
| Metadata | Keywords, summary, top_image | Description, tags, fingerprint | Custom fields | None (uses trafilatura) |
Content Validation
All extractors return a ScrapedArticle, which validates:
Title:
- Must exist
- Min length: 5 characters
Content:
- Must exist
- Min length: 100 characters
If validation fails, the extractor returns None and the next strategy is tried.
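As a rough sketch of these rules, the check below applies the two length thresholds and returns None on failure. ArticleDraft and validate are hypothetical names, and stripping surrounding whitespace before measuring length is an assumption; the real ScrapedArticle may validate differently.

```python
from dataclasses import dataclass
from typing import Optional

MIN_TITLE_CHARS = 5      # from the rules above
MIN_CONTENT_CHARS = 100  # from the rules above


@dataclass
class ArticleDraft:
    """Hypothetical stand-in for the title/content part of ScrapedArticle."""
    title: Optional[str]
    content: Optional[str]


def validate(draft: ArticleDraft) -> Optional[ArticleDraft]:
    """Return the draft if it passes both checks, otherwise None."""
    # Assumption: whitespace is stripped before lengths are compared.
    title = (draft.title or "").strip()
    content = (draft.content or "").strip()
    if len(title) < MIN_TITLE_CHARS:
        return None
    if len(content) < MIN_CONTENT_CHARS:
        return None
    return draft
```

A missing title or content is treated as an empty string, so the "must exist" and "min length" rules collapse into a single comparison.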
ScrapedArticle Schema
Page URL
Article/page title (min 5 chars)
Main content text (min 100 chars)
Author name (if available)
Publication date (if available)
Extractor used: "newspaper4k", "trafilatura", "custom", "playwright"
Extraction timestamp (UTC)
Extractor-specific or custom fields
Newspaper metadata:
- top_image - Main image URL
- keywords - Extracted keywords
- summary - Auto-generated summary
Trafilatura metadata:
- description - Meta description
- sitename - Site name
- categories, tags, fingerprint, license
Custom metadata:
- Any fields from CUSTOM_SELECTORS (except title/content/author/date)
- Any fields from callback extract config
Raw HTML (only if include_html=True in export)
Configuration Examples
Generic News Site
- Try newspaper first (fast, good for news)
- Fall back to trafilatura if newspaper fails
Custom Selectors with Fallback
- Try custom selectors first (highest accuracy when configured correctly)
- Fall back to generic extractors if the selectors fail
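A settings fragment for this setup might look like the sketch below. EXTRACTOR_ORDER is an assumed setting name, the dict shape of CUSTOM_SELECTORS is an assumption, and every selector string is a placeholder; check the Settings reference for the actual names.

```python
# Hypothetical spider settings: EXTRACTOR_ORDER is an assumed setting name,
# and the selector strings are placeholders for a site's real markup.
EXTRACTOR_ORDER = ["custom", "newspaper", "trafilatura"]

# Assumed shape: field name mapped to a CSS selector.
CUSTOM_SELECTORS = {
    "title": "h1.article-title",
    "content": "div.article-body",
    "author": "span.byline",
    "date": "time.published",
    # Any key other than title/content/author/date would be stored
    # as a custom metadata field.
    "category": "a.section-link",
}
```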
JavaScript-Rendered Site
- Render the page with Playwright
- Wait for .article-content to appear
- Extract with trafilatura from the rendered HTML
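Outside the spider, the same three steps can be approximated with Playwright's sync API and trafilatura directly. This is a standalone sketch rather than how the Playwright extractor is wired internally; render_and_extract, its parameters, and the 15-second timeout are hypothetical.

```python
from typing import Optional


def render_and_extract(url: str, wait_selector: str = ".article-content",
                       timeout_ms: int = 15000) -> Optional[str]:
    """Render a page in headless Chromium, then extract main text from the HTML."""
    # Imported lazily because both packages are heavy, optional dependencies.
    from playwright.sync_api import sync_playwright
    import trafilatura

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url)
            # Block until the article container appears on the page.
            page.wait_for_selector(wait_selector, timeout=timeout_ms)
            html = page.content()
        finally:
            browser.close()

    # trafilatura.extract returns the main text, or None if nothing usable is found.
    return trafilatura.extract(html)
```

Running it requires the playwright and trafilatura packages plus an installed Chromium build (playwright install chromium).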
E-commerce (Custom Only)
- No generic extractors (not article content)
- Use callback-based extraction only
Fallback Behavior
Extraction fails when:
- Selector returns no match
- Content or title is too short (< 100 chars / < 5 chars)
- Parser raises an exception
- Extractor returns None
- Validation fails on the returned ScrapedArticle
If all extractors fail:
- Page is skipped
- Error logged
- No item saved to database
Performance Considerations
Fast extractors (news/blogs): ["newspaper", "trafilatura"]
Debugging Extraction
To test an extraction order, check the source field in scraped items:
source: newspaper4k or source: trafilatura, etc.
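One quick way to see which extractor is actually doing the work is to tally the source field across scraped items. The items list below is made-up sample data; in practice it would come from an export or the database.

```python
from collections import Counter

# Hypothetical scraped items; only the source field matters here.
items = [
    {"url": "https://example.com/a", "source": "newspaper4k"},
    {"url": "https://example.com/b", "source": "trafilatura"},
    {"url": "https://example.com/c", "source": "newspaper4k"},
]

# Count how many items each extractor produced.
source_counts = Counter(item["source"] for item in items)

for source, count in source_counts.most_common():
    print(f"{source}: {count}")
```

If most items come from a later extractor in the list, moving it to the front may save the cost of failed attempts.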
Related
- Newspaper Extractor - Newspaper4k configuration
- Trafilatura Extractor - Trafilatura options
- Custom Extractors - CSS selector syntax
- Playwright Extractor - Browser rendering
- Settings - Complete settings reference