Overview
Best for:
- Any text content (articles, blogs, documentation)
- Sites with complex HTML structure
- Minimal metadata requirements
- High accuracy content extraction
Strengths:
- Excellent accuracy (removes boilerplate reliably)
- Fast extraction
- Works on varied HTML structures
- Lightweight
Limitations:
- Less metadata than newspaper4k
- Not suitable for non-text content (e-commerce, listings)
Configuration
Enable in EXTRACTOR_ORDER:
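A minimal sketch of such a setting, assuming a Python settings module in which EXTRACTOR_ORDER is a list of extractor names tried in order. The file location and the names "trafilatura" and "newspaper" are assumptions, so check them against your project's settings reference:

```python
# settings.py (hypothetical location)
# Assumption: EXTRACTOR_ORDER lists extractor names in the order they are tried.
EXTRACTOR_ORDER = [
    "trafilatura",  # tried first
    "newspaper",    # assumed name of the fallback extractor
]
```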
Extracted Fields
Standard Fields
Page title (extracted from <title>, <h1>, or meta tags)
Main content text (cleaned, boilerplate removed). Validation: minimum 100 characters
Author name (if available in metadata)
Publication date (if available)
Metadata Fields
Meta description
Site name (from meta tags or content)
Content categories (if available)
Content tags (if available)
Content fingerprint (for deduplication)
Content license (if specified)
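As a rough illustration of where these fields can come from, the sketch below calls the trafilatura library directly with JSON output and metadata enabled, then maps the result to a flat record. The function name `extract_fields` and the record keys are hypothetical. The JSON key names (`excerpt`, `source-hostname`, and so on) are assumed from trafilatura's JSON output and may differ between versions, so missing keys simply yield `None`:

```python
import json
from typing import Callable, Optional


def extract_fields(
    html: str,
    extract_json: Optional[Callable[[str], Optional[str]]] = None,
) -> dict:
    """Map trafilatura's JSON output to a flat record (hypothetical key names)."""
    if extract_json is None:
        import trafilatura  # third-party: pip install trafilatura

        def extract_json(doc: str) -> Optional[str]:
            # Returns a JSON string, or None when nothing could be extracted.
            return trafilatura.extract(doc, output_format="json", with_metadata=True)

    raw = extract_json(html)
    data = json.loads(raw) if raw else {}
    return {
        "title": data.get("title"),
        "content": data.get("text"),
        "author": data.get("author"),
        "date": data.get("date"),
        "description": data.get("excerpt"),          # assumed key for meta description
        "sitename": data.get("source-hostname"),     # assumed key for site name
        "categories": data.get("categories"),
        "tags": data.get("tags"),
        "fingerprint": data.get("fingerprint"),
        "license": data.get("license"),
    }
```

The `extract_json` parameter exists only so the mapping can be exercised without the library installed; in normal use it is left as `None`.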
Example Output
Fallback Behavior
Trafilatura fails when content extraction returns empty content or fewer than 100 characters.
Performance
Speed: Fast (pure Python, in-memory processing)
Recommended concurrency:
When to Use
Use Trafilatura When:
- You need the highest-accuracy content extraction
- You are dealing with complex HTML layouts
- Metadata is not critical
- The content is text-based (not just news)
- You want lightweight, fast extraction
Don’t Use Trafilatura When:
Comparison with Newspaper
| Feature | Trafilatura | Newspaper |
|---|---|---|
| Accuracy | Excellent | Good |
| Speed | Fast | Fast |
| Metadata | Basic | Rich (keywords, summary, images) |
| HTML Tolerance | High (works on varied structures) | Medium (needs semantic HTML) |
| Best For | Any text content | News/blogs |
- Newspaper extracts rich metadata
- Trafilatura catches pages Newspaper misses
Debugging
Check if trafilatura succeeded: look for source: trafilatura in the output.
Test extraction:
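One way to test extraction outside the pipeline is to call the trafilatura library directly and compare the result against the 100-character minimum. This is a sketch, not the project's own debugging command. `diagnose` is a hypothetical helper, and the `extract` parameter is there only so the check can run without the library installed:

```python
from typing import Callable, Optional


def diagnose(html: str, extract: Optional[Callable[[str], Optional[str]]] = None) -> dict:
    """Run one extraction and report whether it clears the 100-character minimum."""
    if extract is None:
        import trafilatura  # third-party: pip install trafilatura

        extract = trafilatura.extract
    text = extract(html) or ""  # extract() returns None when nothing is found
    return {
        "chars": len(text),
        "passes": len(text) >= 100,
        "preview": text[:80],
    }
```

To try it on a live page, fetch the HTML first with `trafilatura.fetch_url(url)`, which returns `None` if the download fails, and pass the result (or an empty string) to `diagnose`.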
Configuration Examples
Primary Extractor
With Newspaper Fallback
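A minimal sketch of the fallback logic, assuming each extractor is a callable that takes HTML and returns text or `None`. `run_extractors`, `MIN_CONTENT_CHARS`, and the result keys are hypothetical names rather than the project's actual API. The 100-character threshold comes from the validation rule described earlier:

```python
from typing import Callable, Iterable, Optional, Tuple

MIN_CONTENT_CHARS = 100  # threshold taken from the content validation rule

Extractor = Callable[[str], Optional[str]]


def run_extractors(
    html: str, extractors: Iterable[Tuple[str, Extractor]]
) -> Optional[dict]:
    """Try each extractor in order; return the first result that passes validation."""
    for name, extract in extractors:
        try:
            text = extract(html)
        except Exception:
            text = None  # a crashing extractor counts as a failure
        if text and len(text.strip()) >= MIN_CONTENT_CHARS:
            return {"source": name, "content": text.strip()}
    return None  # every extractor failed
```

With trafilatura listed first and Newspaper second, a page where trafilatura returns too little text falls through to Newspaper, and the `source` key records which extractor produced the content.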
After Playwright Rendering
Documentation Site
Advanced Usage
Infinite Scroll with Trafilatura
- Playwright scrolls page to load all content
- Trafilatura extracts combined content from full page
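The two steps above can be sketched with Playwright's sync API and the trafilatura library. This is an illustration under assumptions, not the project's implementation: `scroll_until_stable` and `render_and_extract` are hypothetical helpers, and the one-second pause and ten-round cap are arbitrary choices:

```python
from typing import Callable, Optional


def scroll_until_stable(
    get_height: Callable[[], int],
    scroll_once: Callable[[], None],
    max_rounds: int = 10,
) -> int:
    """Scroll until the page height stops growing; return the number of scrolls."""
    last = get_height()
    rounds = 0
    for _ in range(max_rounds):
        scroll_once()
        rounds += 1
        new = get_height()
        if new == last:  # no new content was loaded
            break
        last = new
    return rounds


def render_and_extract(url: str, max_rounds: int = 10) -> Optional[str]:
    """Render a page in Chromium, scroll to the end, then extract the main text."""
    from playwright.sync_api import sync_playwright  # third-party
    import trafilatura  # third-party

    with sync_playwright() as p:
        browser = p.chromium.launch()
        try:
            page = browser.new_page()
            page.goto(url)

            def scroll_once() -> None:
                page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
                page.wait_for_timeout(1000)  # give lazy-loaded content time to arrive

            scroll_until_stable(
                lambda: page.evaluate("document.body.scrollHeight"),
                scroll_once,
                max_rounds,
            )
            html = page.content()  # full DOM after scrolling
        finally:
            browser.close()
    return trafilatura.extract(html)
```

Keeping the scroll loop separate from the browser code makes the stopping condition easy to check on its own.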
Related
- Extractors Overview - Strategy selection
- Newspaper Extractor - Alternative generic extractor
- Playwright Extractor - Browser rendering
- Custom Extractors - Site-specific extraction
- Settings - Configuration options