Trafilatura is a fast, accurate content extractor that works on any text-heavy page. It focuses on extracting main content while removing boilerplate, ads, and navigation.

Overview

Best for:
  • Any text content (articles, blogs, documentation)
  • Sites with complex HTML structure
  • Minimal metadata requirements
  • High accuracy content extraction
Strengths:
  • Excellent accuracy (removes boilerplate reliably)
  • Fast extraction
  • Works on varied HTML structures
  • Lightweight
Limitations:
  • Less metadata than newspaper4k
  • Not suitable for non-text content (e-commerce, listings)

Configuration

Enable in EXTRACTOR_ORDER:
{
  "settings": {
    "EXTRACTOR_ORDER": ["trafilatura", "newspaper"]
  }
}
No additional configuration required.

Extracted Fields

Standard Fields

title
string
Page title (extracted from <title>, <h1>, or meta tags)
Falls back to title_hint if extraction fails.
content
string
Main content text (cleaned, boilerplate removed)
Validation: Min 100 characters
Trafilatura’s strength: Excellent at identifying main content vs navigation/ads
author
string
Author name (if available in metadata)
Extracted from:
  • <meta name="author">
  • JSON-LD structured data
published_date
datetime
Publication date (if available)
Extracted from:
  • Structured data (JSON-LD, microdata)
  • Common date patterns

Metadata Fields

metadata.description
string
Meta description
metadata.sitename
string
Site name (from meta tags or content)
metadata.categories
string
Content categories (if available)
metadata.tags
string
Content tags (if available)
metadata.fingerprint
string
Content fingerprint (for deduplication)
metadata.license
string
Content license (if specified)
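
The fingerprint field is described as a deduplication aid. A minimal sketch of how a consumer might use it, assuming articles arrive as plain dicts shaped like the fields above and that tags are comma-separated (as in the example output). Both helpers are hypothetical and not part of ScrapAI or Trafilatura:

```python
from typing import Iterable


def split_tags(raw: str) -> list[str]:
    """Split a comma-separated tags string into a clean list (assumed format)."""
    return [tag.strip() for tag in raw.split(",") if tag.strip()]


def dedupe_by_fingerprint(articles: Iterable[dict]) -> list[dict]:
    """Keep the first article seen for each metadata fingerprint."""
    seen: set[str] = set()
    unique = []
    for article in articles:
        fingerprint = article.get("metadata", {}).get("fingerprint")
        # Articles without a fingerprint cannot be compared, so keep them.
        if fingerprint is None or fingerprint not in seen:
            unique.append(article)
        if fingerprint is not None:
            seen.add(fingerprint)
    return unique
```

Keeping the first occurrence is a design choice; a real pipeline might instead prefer the copy with the richest metadata.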

Example Output

ScrapedArticle(
    url="https://example.com/article",
    title="Article Title",
    content="Main article content with excellent accuracy...",
    author="Jane Doe",
    published_date=datetime(2024, 2, 24),
    source="trafilatura",
    metadata={
        "description": "Article meta description",
        "sitename": "Example News",
        "categories": "Technology",
        "tags": "web scraping, automation",
        "fingerprint": "abc123def456",
        "license": "CC BY-NC-SA 4.0"
    }
)
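
The actual ScrapedArticle class is not shown on this page. As a rough sketch, a container with the documented fields could look like the following. The field names and the 100-character check mirror this page, but the class layout, the defaults, and the name ScrapedArticleSketch are assumptions:

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

MIN_CONTENT_CHARS = 100  # validation threshold documented for `content`


@dataclass
class ScrapedArticleSketch:
    """Hypothetical stand-in for the real ScrapedArticle container."""

    url: str
    title: str
    content: str
    source: str  # name of the extractor that produced the result
    author: Optional[str] = None
    published_date: Optional[datetime] = None
    metadata: dict = field(default_factory=dict)

    def is_valid(self) -> bool:
        """Return True when the content meets the minimum length.

        Stripping whitespace before counting is an assumption.
        """
        return len(self.content.strip()) >= MIN_CONTENT_CHARS
```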

Fallback Behavior

Trafilatura fails when:
  • Extraction returns empty content or fewer than 100 characters
  • No text content found (e.g., image-only pages)
Common fallback patterns:
// Try trafilatura first (highest accuracy)
"EXTRACTOR_ORDER": ["trafilatura", "newspaper"]

// Try newspaper first (more metadata)
"EXTRACTOR_ORDER": ["newspaper", "trafilatura"]

// Playwright + trafilatura for JS content
"EXTRACTOR_ORDER": ["playwright", "trafilatura"]

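The fallback patterns above amount to trying each extractor in order until one yields enough content. A minimal sketch of that loop, with the extractors passed in as plain callables. The function name run_extractor_chain and the callable signature are assumptions, not ScrapAI's internal API:

```python
from typing import Callable, Optional

MIN_CONTENT_CHARS = 100  # threshold documented under "Fallback Behavior"

Extractor = Callable[[str], Optional[str]]


def run_extractor_chain(
    html: str,
    order: list[str],
    extractors: dict[str, Extractor],
) -> tuple[Optional[str], Optional[str]]:
    """Return (source, content) from the first extractor that succeeds."""
    for name in order:
        content = extractors[name](html)
        # Empty or too-short output counts as a failure, so fall through.
        if content and len(content) >= MIN_CONTENT_CHARS:
            return name, content
    return None, None
```

With stub extractors where "trafilatura" returns an empty string and "newspaper" returns 150 characters, the order ["trafilatura", "newspaper"] yields "newspaper" as the source.
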
Performance

Speed: Fast (Python library, in-memory processing)
Recommended concurrency:
{
  "settings": {
    "EXTRACTOR_ORDER": ["trafilatura"],
    "CONCURRENT_REQUESTS": 16,
    "DOWNLOAD_DELAY": 1
  }
}
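
For capacity planning, a back-of-the-envelope estimate can help. The sketch below assumes Scrapy-style semantics, where DOWNLOAD_DELAY spaces out consecutive requests to the same domain, so a single-domain crawl is bounded by the delay rather than by CONCURRENT_REQUESTS. Whether ScrapAI applies the settings this way is an assumption, and estimate_crawl_seconds is a hypothetical helper:

```python
def estimate_crawl_seconds(
    pages: int,
    download_delay: float,
    concurrent_requests: int,
    avg_response_seconds: float = 0.5,
) -> float:
    """Rough estimate of the time to crawl `pages` from a single domain."""
    if pages <= 0:
        return 0.0
    if download_delay > 0:
        # Assumed: one request is released per delay interval.
        return pages * download_delay
    # No delay: requests run in parallel batches of `concurrent_requests`.
    batches = -(-pages // concurrent_requests)  # ceiling division
    return batches * avg_response_seconds
```

Under these assumptions, 1,000 pages with the settings above take roughly 1,000 seconds, and raising CONCURRENT_REQUESTS alone would not shorten that.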

When to Use

Use Trafilatura When:

  • Need highest accuracy content extraction
  • Dealing with complex HTML layouts
  • Metadata is not critical
  • Any text-based content (not just news)
  • Want lightweight, fast extraction

Don’t Use Trafilatura When:

  • Need rich metadata (keywords, summaries, images) - use newspaper
  • E-commerce or structured data - use custom extractors
  • JavaScript-rendered content - use playwright first

Use with Playwright

Trafilatura is the recommended extractor after Playwright rendering:
{
  "settings": {
    "EXTRACTOR_ORDER": ["playwright", "trafilatura"],
    "PLAYWRIGHT_WAIT_SELECTOR": ".article-content",
    "PLAYWRIGHT_DELAY": 3
  }
}
Why:
  • Playwright renders JavaScript → full HTML
  • Trafilatura extracts clean content from rendered HTML
  • Best accuracy for JS-heavy sites
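
Outside of ScrapAI, the same render-then-extract flow can be approximated with the two libraries directly. This is a sketch, not ScrapAI's implementation: it assumes the playwright and trafilatura packages are installed, treats PLAYWRIGHT_DELAY as extra settle time after the selector appears, and uses illustrative helper names.

```python
from typing import Callable, Optional


def render_html(url: str, wait_selector: str = ".article-content",
                delay_seconds: float = 3.0) -> str:
    """Load the page in headless Chromium and return the rendered HTML."""
    from playwright.sync_api import sync_playwright  # imported lazily

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        page.wait_for_selector(wait_selector)
        page.wait_for_timeout(delay_seconds * 1000)  # settle time, in ms
        html = page.content()
        browser.close()
    return html


def extract_text(html: str) -> Optional[str]:
    """Run Trafilatura on HTML; returns None when nothing is extracted."""
    import trafilatura  # imported lazily

    return trafilatura.extract(html)


def render_then_extract(
    url: str,
    render: Callable[[str], str] = render_html,
    extract: Callable[[str], Optional[str]] = extract_text,
) -> Optional[str]:
    """Render JavaScript first, then extract main content from the result."""
    return extract(render(url))
```

Passing the two stages as callables keeps the pipeline testable without launching a browser.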

Comparison with Newspaper

Feature        | Trafilatura                       | Newspaper
Accuracy       | Excellent                         | Good
Speed          | Fast                              | Fast
Metadata       | Basic                             | Rich (keywords, summary, images)
HTML Tolerance | High (works on varied structures) | Medium (needs semantic HTML)
Best For       | Any text content                  | News/blogs
Recommendation: Use both with fallback
"EXTRACTOR_ORDER": ["newspaper", "trafilatura"]
  • Newspaper extracts rich metadata
  • Trafilatura catches pages newspaper misses

Debugging

Check if trafilatura succeeded:
scrapai show spider 1 --project proj
Look for source: trafilatura in output.
Test extraction:
# Download page
scrapai inspect https://example.com/article --project proj

# Test extraction
scrapai crawl spider --limit 1 --project proj

# View results
scrapai show spider 1 --project proj
If extraction fails:
  1. Check if page has text content (not image-only)
  2. Try with Playwright if content is JS-rendered
  3. Use custom selectors if generic extraction doesn’t work
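
Step 1 can be checked quickly without running a crawl. The sketch below counts visible text characters in raw HTML using only the standard library; a very low count suggests an image-only or JavaScript-rendered page. The 100-character cutoff reuses the validation threshold from this page, and the helper names are hypothetical.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collect text outside <script>, <style>, and <noscript> elements."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text_length(html: str) -> int:
    """Count characters of visible text in the raw (unrendered) HTML."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return len(" ".join(parser.chunks))


def looks_empty(html: str, min_chars: int = 100) -> bool:
    """True when the raw HTML carries too little text to extract."""
    return visible_text_length(html) < min_chars
```

If looks_empty returns True for a page that shows plenty of text in a browser, the content is probably injected by JavaScript, which points to step 2.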

Configuration Examples

Primary Extractor

{
  "name": "docs_site",
  "settings": {
    "EXTRACTOR_ORDER": ["trafilatura"],
    "DOWNLOAD_DELAY": 0.5,
    "CONCURRENT_REQUESTS": 16
  }
}

With Newspaper Fallback

{
  "settings": {
    "EXTRACTOR_ORDER": ["trafilatura", "newspaper"],
    "CONCURRENT_REQUESTS": 16
  }
}

After Playwright Rendering

{
  "name": "js_heavy_site",
  "settings": {
    "EXTRACTOR_ORDER": ["playwright", "trafilatura"],
    "PLAYWRIGHT_WAIT_SELECTOR": ".content-loaded",
    "PLAYWRIGHT_DELAY": 2,
    "CONCURRENT_REQUESTS": 2,
    "DOWNLOAD_DELAY": 2
  }
}

Documentation Site

{
  "name": "docs_python_org",
  "source_url": "https://docs.python.org/",
  "allowed_domains": ["docs.python.org"],
  "start_urls": ["https://docs.python.org/3/"],
  "rules": [
    {
      "allow": [".*"],
      "callback": "parse_article"
    }
  ],
  "settings": {
    "EXTRACTOR_ORDER": ["trafilatura"],
    "DOWNLOAD_DELAY": 1,
    "CONCURRENT_REQUESTS": 8,
    "ROBOTSTXT_OBEY": true
  }
}
Why trafilatura for docs:
  • Clean content extraction without navigation/sidebars
  • Works on varied doc site structures
  • Fast, reliable
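
The allowed_domains and rules allow keys look like Scrapy-style link filters. Assuming they behave that way (a domain matches the host or any of its subdomains, and each allow entry is a regular expression searched against the URL), the filtering could be sketched as follows. The helper should_follow is hypothetical:

```python
import re
from urllib.parse import urlparse


def should_follow(url: str, allowed_domains: list[str],
                  allow_patterns: list[str]) -> bool:
    """Return True when the URL passes both the domain and regex filters."""
    host = urlparse(url).hostname or ""
    in_domain = any(
        host == domain or host.endswith("." + domain)
        for domain in allowed_domains
    )
    matches_rule = any(re.search(pattern, url) for pattern in allow_patterns)
    return in_domain and matches_rule
```

Under this reading, ".*" follows every in-domain link. A narrower pattern such as "/3/library/" would keep the crawl from fetching every version of the docs.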

Advanced Usage

Infinite Scroll with Trafilatura

{
  "settings": {
    "EXTRACTOR_ORDER": ["playwright", "trafilatura"],
    "INFINITE_SCROLL": true,
    "MAX_SCROLLS": 10,
    "PLAYWRIGHT_WAIT_SELECTOR": ".quote"
  },
  "rules": [
    {"allow": [".*"], "follow": false}
  ]
}
Behavior:
  • Playwright scrolls page to load all content
  • Trafilatura extracts combined content from full page
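
A scroll loop of this kind is commonly written against Playwright's page API. The sketch below is an assumption about the general approach, not ScrapAI's code: it scrolls to the bottom up to max_scrolls times and stops early once the page height stops growing. The function name and the pause_ms parameter are illustrative.

```python
def scroll_until_stable(page, max_scrolls: int = 10, pause_ms: int = 500) -> int:
    """Scroll a Playwright page; return the number of scrolls performed."""
    last_height = page.evaluate("document.body.scrollHeight")
    scrolls = 0
    while scrolls < max_scrolls:
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(pause_ms)  # give lazy-loaded content time to arrive
        scrolls += 1
        new_height = page.evaluate("document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new loaded, so the feed is likely exhausted
        last_height = new_height
    return scrolls
```

After the loop, page.content() returns the fully expanded HTML, which can then be handed to the extractor.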