Trafilatura Extractor

Trafilatura is a fast, accurate content extractor that works on any text-heavy page. It focuses on extracting main content while removing boilerplate, ads, and navigation.

Overview

Best for:

Any text content (articles, blogs, documentation)
Sites with complex HTML structure
Minimal metadata requirements
High accuracy content extraction

Strengths:

Excellent accuracy (removes boilerplate reliably)
Fast extraction
Works on varied HTML structures
Lightweight

Limitations:

Less metadata than the Newspaper extractor (newspaper4k)
Not suitable for non-text content (e-commerce, listings)

Configuration

Enable in EXTRACTOR_ORDER:

{
  "settings": {
    "EXTRACTOR_ORDER": ["trafilatura", "newspaper"]
  }
}

No additional configuration required.

Extracted Fields

Standard Fields

title

string

Page title (extracted from <title>, <h1>, or meta tags)

content

string

Main content text (cleaned, boilerplate removed)

Validation: minimum 100 characters

author

string

Author name (if available in metadata)

published_date

datetime

Publication date (if available)

Metadata Fields

metadata.description

string

Meta description

metadata.sitename

string

Site name (from meta tags or content)

metadata.categories

string

Content categories (if available)

metadata.tags

string

Content tags (if available)

metadata.fingerprint

string

Content fingerprint (for deduplication)

metadata.license

string

Content license (if specified)
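The fingerprint field is described as being for deduplication. As a minimal sketch of how a downstream step might use it, assuming each scraped item is available as a dict with a metadata dict shaped like the field list above: the dedupe_by_fingerprint helper and the sample records are hypothetical and not part of the tool, and items without a fingerprint are passed through rather than dropped.

```python
from typing import Iterable, Iterator


def dedupe_by_fingerprint(items: Iterable[dict]) -> Iterator[dict]:
    """Yield items whose metadata.fingerprint has not been seen before.

    Hypothetical helper: items lacking a fingerprint are passed through,
    since there is nothing to compare them on.
    """
    seen: set[str] = set()
    for item in items:
        fingerprint = (item.get("metadata") or {}).get("fingerprint")
        if fingerprint is None:
            yield item
            continue
        if fingerprint in seen:
            continue  # same content was already emitted under another URL
        seen.add(fingerprint)
        yield item


# Made-up sample records, for illustration only.
records = [
    {"url": "https://example.com/a", "metadata": {"fingerprint": "abc123def456"}},
    {"url": "https://example.com/a?ref=home", "metadata": {"fingerprint": "abc123def456"}},
    {"url": "https://example.com/b", "metadata": {}},
]
unique_urls = [r["url"] for r in dedupe_by_fingerprint(records)]
```

Here the second record is skipped because it carries the same fingerprint as the first, while the third is kept because it has no fingerprint to compare.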

Example Output

ScrapedArticle(
    url="https://example.com/article",
    title="Article Title",
    content="Main article content with excellent accuracy...",
    author="Jane Doe",
    published_date=datetime(2024, 2, 24),
    source="trafilatura",
    metadata={
        "description": "Article meta description",
        "sitename": "Example News",
        "categories": "Technology",
        "tags": "web scraping, automation",
        "fingerprint": "abc123def456",
        "license": "CC BY-NC-SA 4.0"
    }
)
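Several of these fields are only filled "if available", so consuming code should not assume they are present. The sketch below illustrates one way to read them defensively. The ScrapedArticle dataclass here is a hypothetical stand-in that mirrors the fields shown above, not the tool's actual class definition.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class ScrapedArticle:
    # Hypothetical stand-in mirroring the fields in the example output.
    url: str
    title: str
    content: str
    source: str
    author: Optional[str] = None
    published_date: Optional[datetime] = None
    metadata: dict = field(default_factory=dict)


# An article where the page exposed no author, date, or license.
article = ScrapedArticle(
    url="https://example.com/article",
    title="Article Title",
    content="Main article content...",
    source="trafilatura",
    metadata={"sitename": "Example News"},
)

# Fall back explicitly instead of assuming optional values exist.
byline = article.author or "unknown author"
license_name = article.metadata.get("license", "unspecified")
published = article.published_date.isoformat() if article.published_date else None
```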

Fallback Behavior

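Trafilatura is considered to have failed when extraction returns empty content or fewer than 100 characters. When another extractor follows it in EXTRACTOR_ORDER, that extractor is tried next.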
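The rule above can be pictured as a loop over EXTRACTOR_ORDER that accepts the first result meeting the 100-character minimum. This is a rough sketch of that idea, not the tool's actual implementation: extract_with_fallback and the two stub extractors are hypothetical, and measuring length on the raw string is an assumption.

```python
from typing import Callable, Dict, List, Optional, Tuple

MIN_CONTENT_CHARS = 100  # minimum content length stated in this page

Extractor = Callable[[str], Optional[str]]


def extract_with_fallback(
    html: str,
    extractors: Dict[str, Extractor],
    order: List[str],
) -> Optional[Tuple[str, str]]:
    """Return (extractor_name, text) for the first acceptable result.

    A result is rejected when it is empty or shorter than
    MIN_CONTENT_CHARS; None is returned if every extractor is rejected.
    """
    for name in order:
        text = extractors[name](html)
        if text and len(text) >= MIN_CONTENT_CHARS:
            return name, text
    return None


# Stub extractors standing in for the real ones.
def short_extractor(html: str) -> Optional[str]:
    return "Too short."


def long_extractor(html: str) -> Optional[str]:
    return "x" * 250


result = extract_with_fallback(
    "<html>...</html>",
    {"trafilatura": short_extractor, "newspaper": long_extractor},
    ["trafilatura", "newspaper"],
)
```

In this run the first stub returns fewer than 100 characters, so the second extractor in the order supplies the result.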

Performance

Speed: Fast (Python library, in-memory processing)

Recommended concurrency:

{
  "settings": {
    "EXTRACTOR_ORDER": ["trafilatura"],
    "CONCURRENT_REQUESTS": 16,
    "DOWNLOAD_DELAY": 1
  }
}

When to Use

Use Trafilatura When:

Need highest accuracy content extraction

Dealing with complex HTML layouts

Metadata is not critical

Any text-based content (not just news)

Want lightweight, fast extraction

Don’t Use Trafilatura When:

Need rich metadata (keywords, summaries, images) - use newspaper

E-commerce or structured data - use custom extractors

JavaScript-rendered content - use playwright first

Comparison with Newspaper

Feature	Trafilatura	Newspaper
Accuracy	Excellent	Good
Speed	Fast	Fast
Metadata	Basic	Rich (keywords, summary, images)
HTML Tolerance	High (works on varied structures)	Medium (needs semantic HTML)
Best For	Any text content	News/blogs

Recommendation: Use both with fallback

"EXTRACTOR_ORDER": ["newspaper", "trafilatura"]

Newspaper extracts rich metadata
Trafilatura catches pages newspaper misses

Debugging

Check if trafilatura succeeded:

scrapai show spider 1 --project proj

Look for source: trafilatura in output.

Test extraction:

# Download page
scrapai inspect https://example.com/article --project proj

# Test extraction
scrapai crawl spider --limit 1 --project proj

# View results
scrapai show spider 1 --project proj

If extraction fails: Try with Playwright for JS-rendered content, or use custom selectors.
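To narrow down a failure outside the crawler, the same length check can be run on a saved HTML string. The sketch below assumes the trafilatura Python package is installed when no extractor is passed in; check_extraction is a hypothetical helper, and the default threshold simply mirrors the 100-character minimum mentioned on this page.

```python
from typing import Callable, Optional, Tuple


def check_extraction(
    html: str,
    extract: Optional[Callable[[str], Optional[str]]] = None,
    min_chars: int = 100,
) -> Tuple[bool, int]:
    """Return (passes_minimum, extracted_character_count) for an HTML string."""
    if extract is None:
        # Imported lazily so the helper can be exercised with a stub
        # extractor on machines where trafilatura is not installed.
        import trafilatura

        extract = trafilatura.extract
    text = extract(html) or ""  # a None result is treated as empty
    return len(text) >= min_chars, len(text)


# Exercised here with a stub that mimics an extractor finding no content.
ok, n_chars = check_extraction(
    "<html><body></body></html>",
    extract=lambda html: None,
)
```

A count near zero on a page that looks full in a browser is consistent with JavaScript-rendered content, which is the case where rendering with Playwright first is suggested.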

Configuration Examples

Primary Extractor

{
  "name": "docs_site",
  "settings": {
    "EXTRACTOR_ORDER": ["trafilatura"],
    "DOWNLOAD_DELAY": 0.5,
    "CONCURRENT_REQUESTS": 16
  }
}

With Newspaper Fallback

{
  "settings": {
    "EXTRACTOR_ORDER": ["trafilatura", "newspaper"],
    "CONCURRENT_REQUESTS": 16
  }
}

After Playwright Rendering

{
  "name": "js_heavy_site",
  "settings": {
    "EXTRACTOR_ORDER": ["playwright", "trafilatura"],
    "PLAYWRIGHT_WAIT_SELECTOR": ".content-loaded",
    "PLAYWRIGHT_DELAY": 2,
    "CONCURRENT_REQUESTS": 2,
    "DOWNLOAD_DELAY": 2
  }
}

Documentation Site

{
  "name": "docs_python_org",
  "source_url": "https://docs.python.org/",
  "allowed_domains": ["docs.python.org"],
  "start_urls": ["https://docs.python.org/3/"],
  "rules": [
    {
      "allow": [".*"],
      "callback": "parse_article"
    }
  ],
  "settings": {
    "EXTRACTOR_ORDER": ["trafilatura"],
    "DOWNLOAD_DELAY": 1,
    "CONCURRENT_REQUESTS": 8,
    "ROBOTSTXT_OBEY": true
  }
}

Advanced Usage

Infinite Scroll with Trafilatura

{
  "settings": {
    "EXTRACTOR_ORDER": ["playwright", "trafilatura"],
    "INFINITE_SCROLL": true,
    "MAX_SCROLLS": 10,
    "PLAYWRIGHT_WAIT_SELECTOR": ".quote"
  },
  "rules": [
    {"allow": [".*"], "follow": false}
  ]
}

Behavior:

Playwright scrolls page to load all content
Trafilatura extracts combined content from full page

Related

Extractors Overview - Strategy selection
Newspaper Extractor - Alternative generic extractor
Playwright Extractor - Browser rendering
Custom Extractors - Site-specific extraction
Settings - Configuration options
