Skip to main content
Trafilatura is a fast, accurate content extractor that works on any text-heavy page. It focuses on extracting main content while removing boilerplate, ads, and navigation.

Overview

Best for:
  • Any text content (articles, blogs, documentation)
  • Sites with complex HTML structure
  • Minimal metadata requirements
  • High accuracy content extraction
Strengths:
  • Excellent accuracy (removes boilerplate reliably)
  • Fast extraction
  • Works on varied HTML structures
  • Lightweight
Limitations:
  • Less metadata than newspaper4k
  • Not suitable for non-text content (e-commerce, listings)

Configuration

Enable in EXTRACTOR_ORDER:
{
  "settings": {
    "EXTRACTOR_ORDER": ["trafilatura", "newspaper"]
  }
}
No additional configuration required.

Extracted Fields

Standard Fields

title
string
Page title (extracted from <title>, <h1>, or meta tags)
content
string
Main content text (cleaned, boilerplate removed)Validation: Min 100 characters
author
string
Author name (if available in metadata)
published_date
datetime
Publication date (if available)

Metadata Fields

metadata.description
string
Meta description
metadata.sitename
string
Site name (from meta tags or content)
metadata.categories
string
Content categories (if available)
metadata.tags
string
Content tags (if available)
metadata.fingerprint
string
Content fingerprint (for deduplication)
metadata.license
string
Content license (if specified)

Example Output

ScrapedArticle(
    url="https://example.com/article",
    title="Article Title",
    content="Main article content with excellent accuracy...",
    author="Jane Doe",
    published_date=datetime(2024, 2, 24),
    source="trafilatura",
    metadata={
        "description": "Article meta description",
        "sitename": "Example News",
        "categories": "Technology",
        "tags": "web scraping, automation",
        "fingerprint": "abc123def456",
        "license": "CC BY-NC-SA 4.0"
    }
)

Fallback Behavior

Trafilatura fails when content extraction returns empty or < 100 characters.

Performance

Speed: Fast (pure Python, in-memory processing) Recommended concurrency:
{
  "settings": {
    "EXTRACTOR_ORDER": ["trafilatura"],
    "CONCURRENT_REQUESTS": 16,
    "DOWNLOAD_DELAY": 1
  }
}

When to Use

Use Trafilatura When:

Need highest accuracy content extraction
Dealing with complex HTML layouts
Metadata is not critical
Any text-based content (not just news)
Want lightweight, fast extraction

Don’t Use Trafilatura When:

Need rich metadata (keywords, summaries, images) - use newspaper
E-commerce or structured data - use custom extractors
JavaScript-rendered content - use playwright first

Comparison with Newspaper

FeatureTrafilaturaNewspaper
AccuracyExcellentGood
SpeedFastFast
MetadataBasicRich (keywords, summary, images)
HTML ToleranceHigh (works on varied structures)Medium (needs semantic HTML)
Best ForAny text contentNews/blogs
Recommendation: Use both with fallback
"EXTRACTOR_ORDER": ["newspaper", "trafilatura"]
  • Newspaper extracts rich metadata
  • Trafilatura catches pages newspaper misses

Debugging

Check if trafilatura succeeded:
scrapai show spider 1 --project proj
Look for source: trafilatura in output. Test extraction:
# Download page
scrapai inspect https://example.com/article --project proj

# Test extraction
scrapai crawl spider --limit 1 --project proj

# View results
scrapai show spider 1 --project proj
If extraction fails: Try with Playwright for JS-rendered content, or use custom selectors.

Configuration Examples

Primary Extractor

{
  "name": "docs_site",
  "settings": {
    "EXTRACTOR_ORDER": ["trafilatura"],
    "DOWNLOAD_DELAY": 0.5,
    "CONCURRENT_REQUESTS": 16
  }
}

With Newspaper Fallback

{
  "settings": {
    "EXTRACTOR_ORDER": ["trafilatura", "newspaper"],
    "CONCURRENT_REQUESTS": 16
  }
}

After Playwright Rendering

{
  "name": "js_heavy_site",
  "settings": {
    "EXTRACTOR_ORDER": ["playwright", "trafilatura"],
    "PLAYWRIGHT_WAIT_SELECTOR": ".content-loaded",
    "PLAYWRIGHT_DELAY": 2,
    "CONCURRENT_REQUESTS": 2,
    "DOWNLOAD_DELAY": 2
  }
}

Documentation Site

{
  "name": "docs_python_org",
  "source_url": "https://docs.python.org/",
  "allowed_domains": ["docs.python.org"],
  "start_urls": ["https://docs.python.org/3/"],
  "rules": [
    {
      "allow": [".*"],
      "callback": "parse_article"
    }
  ],
  "settings": {
    "EXTRACTOR_ORDER": ["trafilatura"],
    "DOWNLOAD_DELAY": 1,
    "CONCURRENT_REQUESTS": 8,
    "ROBOTSTXT_OBEY": true
  }
}

Advanced Usage

Infinite Scroll with Trafilatura

{
  "settings": {
    "EXTRACTOR_ORDER": ["playwright", "trafilatura"],
    "INFINITE_SCROLL": true,
    "MAX_SCROLLS": 10,
    "PLAYWRIGHT_WAIT_SELECTOR": ".quote"
  },
  "rules": [
    {"allow": [".*"], "follow": false}
  ]
}
Behavior:
  • Playwright scrolls page to load all content
  • Trafilatura extracts combined content from full page