Overview
Best for:
- Any text content (articles, blogs, documentation)
- Sites with complex HTML structure
- Minimal metadata requirements
- High accuracy content extraction
Strengths:
- Excellent accuracy (removes boilerplate reliably)
- Fast extraction
- Works on varied HTML structures
- Lightweight
Limitations:
- Less metadata than newspaper4k
- Not suitable for non-text content (e-commerce, listings)
Configuration
Enable Trafilatura by adding it to EXTRACTOR_ORDER.
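A minimal sketch of what such a setting might look like, assuming settings live in a Python module and extractors are listed by lowercase name in priority order. Only the EXTRACTOR_ORDER name comes from this page; the list values and the file format are assumptions.

```python
# Hypothetical settings fragment: extractors are assumed to be tried left to right.
# The identifiers "trafilatura" and "newspaper" are assumed names, not confirmed ones.
EXTRACTOR_ORDER = ["trafilatura", "newspaper"]
```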
Extracted Fields
Standard Fields
Page title (extracted from <title>, <h1>, or meta tags). Falls back to title_hint if extraction fails.
Main content text (cleaned, boilerplate removed). Validation: minimum 100 characters. Trafilatura’s strength: excellent at separating main content from navigation and ads.
Author name (if available in metadata). Extracted from:
- <meta name="author">
- JSON-LD structured data
Publication date (if available). Extracted from:
- Structured data (JSON-LD, microdata)
- Common date patterns
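To make the author lookup concrete, here is a standard-library sketch that checks the two sources named above. It is an illustration only, not Trafilatura's own implementation; AuthorFinder and find_author are hypothetical names, and preferring the meta tag over JSON-LD is an arbitrary choice made for the example.

```python
import json
from html.parser import HTMLParser


class AuthorFinder(HTMLParser):
    """Collects author candidates from <meta name="author"> and JSON-LD."""

    def __init__(self):
        super().__init__()
        self.meta_author = None
        self.jsonld_author = None
        self._in_jsonld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "author":
            self.meta_author = attrs.get("content")
        elif tag == "script" and attrs.get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script bodies arrive here as raw text.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                doc = json.loads("".join(self._buffer))
            except json.JSONDecodeError:
                return  # malformed JSON-LD is ignored
            author = doc.get("author") if isinstance(doc, dict) else None
            if isinstance(author, dict):
                author = author.get("name")
            if isinstance(author, str) and self.jsonld_author is None:
                self.jsonld_author = author


def find_author(html):
    """Return the first author found, or None if neither source has one."""
    finder = AuthorFinder()
    finder.feed(html)
    return finder.meta_author or finder.jsonld_author
```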
Metadata Fields
Meta description
Site name (from meta tags or content)
Content categories (if available)
Content tags (if available)
Content fingerprint (for deduplication)
Content license (if specified)
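The fingerprint field exists for deduplication. This page does not describe how the fingerprint is computed, so the sketch below swaps in a plain SHA-256 hash of normalized text. It catches only exact duplicates after normalization, while the real fingerprint may tolerate near-duplicates. The function names are hypothetical.

```python
import hashlib
import re


def content_fingerprint(text):
    """Hash whitespace- and case-normalized text so trivial variants collide."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(documents):
    """Keep the first document seen for each fingerprint, preserving order."""
    seen = set()
    unique = []
    for doc in documents:
        fp = content_fingerprint(doc)
        if fp not in seen:
            seen.add(fp)
            unique.append(doc)
    return unique
```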
Example Output
Fallback Behavior
Trafilatura fails when:
- Content extraction returns empty text or fewer than 100 characters
- No text content found (e.g., image-only pages)
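The failure conditions above can be sketched as a small validation and fallback loop. This assumes that a failed extractor hands off to the next entry in EXTRACTOR_ORDER, which this page implies but does not spell out. The result keys "title" and "content" and the function names are assumptions; the "source" key mirrors the source: trafilatura marker mentioned under Debugging.

```python
MIN_CONTENT_CHARS = 100  # threshold stated on this page


def is_valid(result):
    """A result counts as a success only if it carries enough main text."""
    content = (result or {}).get("content") or ""
    return len(content.strip()) >= MIN_CONTENT_CHARS


def run_extractors(html, extractors, title_hint=None):
    """Try (name, function) pairs in order and return the first valid result."""
    for name, extract in extractors:
        result = extract(html)
        if is_valid(result):
            result = dict(result, source=name)  # record which extractor succeeded
            if not result.get("title"):
                result["title"] = title_hint  # fall back to the supplied hint
            return result
    return None  # every extractor failed
```

Each extractor here is any callable that takes HTML and returns a dict or None, so a Trafilatura wrapper and a Newspaper wrapper can be passed in the same list.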
Performance
Speed: Fast (Python library, in-memory processing)
Recommended concurrency:
When to Use
Use Trafilatura When:
- You need the highest-accuracy content extraction
- You are dealing with complex HTML layouts
- Metadata is not critical
- The content is text-based (not just news)
- You want lightweight, fast extraction
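Don’t Use Trafilatura When:
- You need rich metadata (newspaper4k extracts more)
- The page is non-text content (e-commerce, listings)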
Use with Playwright
Trafilatura is the recommended extractor after Playwright rendering:
- Playwright renders JavaScript → full HTML
- Trafilatura extracts clean content from rendered HTML
- Best accuracy for JS-heavy sites
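The render-then-extract flow can be sketched directly with the two libraries. This assumes the playwright and trafilatura packages are installed and a Chromium build is available. The function names are hypothetical, and this project's own pipeline may wire the steps differently.

```python
def render_html(url):
    """Render a page in headless Chromium and return the final HTML."""
    from playwright.sync_api import sync_playwright  # third-party, imported lazily

    with sync_playwright() as p:
        browser = p.chromium.launch()
        try:
            page = browser.new_page()
            page.goto(url, wait_until="networkidle")  # wait for JS-driven requests
            return page.content()
        finally:
            browser.close()


def extract_text(html):
    """Return the main text, or None if Trafilatura finds nothing."""
    import trafilatura  # third-party, imported lazily

    return trafilatura.extract(html)


def extract_rendered(url, render=render_html, extract=extract_text):
    """Render first, then extract from the rendered HTML."""
    return extract(render(url))
```

The render and extract parameters are there so either step can be replaced, for example with a stub in tests.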
Comparison with Newspaper
| Feature | Trafilatura | Newspaper |
|---|---|---|
| Accuracy | Excellent | Good |
| Speed | Fast | Fast |
| Metadata | Basic | Rich (keywords, summary, images) |
| HTML Tolerance | High (works on varied structures) | Medium (needs semantic HTML) |
| Best For | Any text content | News/blogs |
Used together, the two complement each other:
- Newspaper extracts rich metadata
- Trafilatura catches pages Newspaper misses
Debugging
Check if Trafilatura succeeded: look for source: trafilatura in the output.
Test extraction:
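One way to test extraction outside the pipeline is to call Trafilatura directly on the page HTML. The sketch below assumes the trafilatura package is installed. check_extraction and its return keys are hypothetical, and the 100-character default mirrors the validation rule on this page.

```python
def check_extraction(html, extract=None, min_chars=100):
    """Run the extractor on raw HTML and report whether it would pass validation."""
    if extract is None:
        import trafilatura  # third-party, imported lazily

        extract = trafilatura.extract
    text = extract(html) or ""  # extract() returns None when nothing is found
    return {"chars": len(text), "passed": len(text) >= min_chars, "preview": text[:80]}


# Example usage (needs network access; the URL is a placeholder):
#   import trafilatura
#   html = trafilatura.fetch_url("https://example.com/article")
#   print(check_extraction(html or ""))
```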
If extraction fails:
- Check that the page has text content (not image-only)
- Try with Playwright if the content is JS-rendered
- Use custom selectors if generic extraction doesn’t work
Configuration Examples
Primary Extractor
With Newspaper Fallback
After Playwright Rendering
Documentation Site
- Clean content extraction without navigation/sidebars
- Works on varied doc site structures
- Fast, reliable
Advanced Usage
Infinite Scroll with Trafilatura
- Playwright scrolls page to load all content
- Trafilatura extracts combined content from full page
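The scrolling step might look like the loop below, which keeps scrolling until the page height stops growing. It is a sketch under assumptions: page is expected to behave like a Playwright sync Page (only evaluate and wait_for_timeout are used), and scroll_to_bottom, max_rounds, and pause_ms are hypothetical names rather than settings from this project.

```python
def scroll_to_bottom(page, max_rounds=20, pause_ms=1000):
    """Scroll until the page height stops growing or max_rounds is reached.

    Returns the number of scrolls performed.
    """
    last_height = page.evaluate("document.body.scrollHeight")
    for round_no in range(1, max_rounds + 1):
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(pause_ms)  # give lazy-loaded content time to arrive
        height = page.evaluate("document.body.scrollHeight")
        if height == last_height:
            return round_no  # nothing new was loaded by this scroll
        last_height = height
    return max_rounds
```

Once the loop returns, page.content() holds the fully loaded HTML, which can then be passed to trafilatura.extract.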
Related
- Extractors Overview - Strategy selection
- Newspaper Extractor - Alternative generic extractor
- Playwright Extractor - Browser rendering
- Custom Extractors - Site-specific extraction
- Settings - Configuration options