Overview
Best for:
- News websites
- Blog posts
- Magazine articles
- Clean, semantic HTML
Strengths:
- Fast extraction
- Automatic article detection
- Extracts metadata (author, date, keywords, images)
- No configuration needed

Limitations:
- Requires semantic HTML structure
- May struggle with unusual layouts
- Not suitable for non-article content (e-commerce, listings)
Configuration
Enable in EXTRACTOR_ORDER:
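A minimal sketch of what this setting might look like, assuming it lives in a Python settings module. The file name is an assumption; the two list values are the extractor names that appear in the comparison table on this page.

```python
# settings.py (assumed location).
# Assumption: extractors are tried in the order they are listed.
EXTRACTOR_ORDER = [
    "newspaper",    # try Newspaper first
    "trafilatura",  # generic fallback when Newspaper fails
]
```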
Extracted Fields
Standard Fields
- Article title (extracted from <h1>, <title>, or meta tags). Falls back to title_hint if extraction fails.
- Main article text (cleaned, no ads/navigation). Validation: minimum 100 characters.
- Author name(s) (comma-separated if multiple). Extracted from:
  - <meta name="author">
  - rel="author"
  - Common author class names
- Publication date. Extracted from:
  - <time> elements
  - <meta property="article:published_time">
  - URL patterns
Metadata Fields
- Main article image URL. Extracted from:
  - <meta property="og:image">
  - <img> tags within article
- Extracted keywords (auto-generated from content)
- Auto-generated article summary
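To make the meta-tag sources named above concrete, here is a small standard-library sketch that reads the author, published-time, and og:image meta tags from a page. It is only an illustration of where those values sit in the HTML; it is not how the Newspaper extractor is implemented, and the names read_meta_tags, MetaTagReader, and the result labels are hypothetical.

```python
from html.parser import HTMLParser
from typing import Dict

# (attribute name, attribute value) -> label used in the result dict.
# The labels are illustrative, not the extractor's real field names.
META_SOURCES = {
    ("name", "author"): "author",
    ("property", "article:published_time"): "published_time",
    ("property", "og:image"): "image",
}


class MetaTagReader(HTMLParser):
    """Collect the content of the meta tags listed in META_SOURCES."""

    def __init__(self) -> None:
        super().__init__()
        self.found: Dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        content = attr_map.get("content")
        if not content:
            return
        for (key, value), label in META_SOURCES.items():
            if attr_map.get(key) == value:
                # Keep the first occurrence of each tag.
                self.found.setdefault(label, content)


def read_meta_tags(html: str) -> Dict[str, str]:
    reader = MetaTagReader()
    reader.feed(html)
    reader.close()
    return reader.found
```

A page that lacks these tags yields an empty dict, which is the situation where the other sources (rel="author", <time> elements, URL patterns, <img> tags) would have to be consulted.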
Example Output
Fallback Behavior
Newspaper fails when:
- Title extraction fails and no title_hint provided
- Content extraction returns empty or < 100 characters
- HTML structure is non-semantic (divs without semantic classes)
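The first two conditions above can be expressed as a simple check. This is a sketch under stated assumptions: the function name newspaper_result_usable is hypothetical, and stripping whitespace before counting characters is a guess, not documented behavior. The third condition (non-semantic HTML) is a cause of failure rather than a check, so it is not modeled here.

```python
from typing import Optional

MIN_CONTENT_CHARS = 100  # threshold stated in this section


def newspaper_result_usable(
    title: Optional[str],
    content: Optional[str],
    title_hint: Optional[str] = None,
) -> bool:
    """Return True if the result passes the title and content checks."""
    # A title must come from extraction or, failing that, from the hint.
    has_title = bool((title or "").strip()) or bool((title_hint or "").strip())
    # Content must be non-empty and at least MIN_CONTENT_CHARS long.
    # Assumption: surrounding whitespace does not count toward the length.
    has_content = len((content or "").strip()) >= MIN_CONTENT_CHARS
    return has_title and has_content
```

When this returns False, the next extractor in the configured order would be expected to take over.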
Title Hints
The spider can provide a title_hint from other sources:
Example: Link text from parent page
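One way a spider might gather such hints is to record the link text for each URL on the parent (listing) page. The sketch below uses only the standard library; the names LinkTextCollector and collect_title_hints are hypothetical, and how the hint is then handed to the extractor is not shown because that interface is not described here.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional


class LinkTextCollector(HTMLParser):
    """Collect the visible text of each <a href="..."> on a page."""

    def __init__(self) -> None:
        super().__init__()
        self.hints: Dict[str, str] = {}
        self._href: Optional[str] = None
        self._parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._parts = []

    def handle_data(self, data):
        # Only collect text while inside a link that has an href.
        if self._href is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace into single spaces.
            text = " ".join("".join(self._parts).split())
            if text:
                self.hints[self._href] = text
            self._href = None


def collect_title_hints(parent_html: str) -> Dict[str, str]:
    """Map each link target on the parent page to its link text."""
    parser = LinkTextCollector()
    parser.feed(parent_html)
    parser.close()
    return parser.hints
```

The resulting mapping can be looked up by article URL when the article page itself yields no title.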
Performance
Speed: Fast (processes HTML in-memory)
Recommended concurrency:
When to Use
Use Newspaper When:
- Well-structured news websites
- Blog platforms (WordPress, Medium, Ghost)
- Semantic HTML with proper tags
- Need automatic metadata extraction
Don’t Use Newspaper When:
Debugging
Check if newspaper succeeded: look for source: newspaper4k in the output.
Test on sample page:
- Check if content is in semantic tags (<article>, <main>, <p>)
- Try trafilatura as fallback
- Use custom selectors if generic extractors fail
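For the first check in the list above, a quick standard-library sketch can count how often the semantic tags appear in a saved copy of the page. The names SemanticTagCounter and count_semantic_tags are hypothetical, and a low count is only a hint that the page may be hard for a generic extractor, not a definitive diagnosis.

```python
from collections import Counter
from html.parser import HTMLParser
from typing import Dict

SEMANTIC_TAGS = ("article", "main", "p")


class SemanticTagCounter(HTMLParser):
    """Count opening tags that appear in SEMANTIC_TAGS."""

    def __init__(self) -> None:
        super().__init__()
        self.counts: Counter = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag in SEMANTIC_TAGS:
            self.counts[tag] += 1


def count_semantic_tags(html: str) -> Dict[str, int]:
    """Return the number of <article>, <main>, and <p> tags in html."""
    parser = SemanticTagCounter()
    parser.feed(html)
    parser.close()
    return {tag: parser.counts[tag] for tag in SEMANTIC_TAGS}
```

A page built only from nested <div> elements returns zero for every tag, which matches the non-semantic case described under Fallback Behavior.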
Comparison with Trafilatura
| Feature | Newspaper | Trafilatura |
|---|---|---|
| Speed | Fast | Fast |
| Accuracy | Good | Excellent |
| Metadata | Rich (keywords, summary, images) | Basic (description, tags) |
| Best For | News sites | Any text content |

Fallback: use the two together: ["newspaper", "trafilatura"]
Configuration Examples
News Site (Primary Strategy)
Blog with Custom Fallback
- Try newspaper first (works for most WordPress blogs)
- Fallback to custom selectors for non-standard templates
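The try-first, then-fall-back behavior described above can be sketched as a loop over named extractors. This is an illustration of the pattern, not the framework's actual dispatcher: the function extract_with_fallback, the Extractor type alias, and the use of a "source" key to record the winner are assumptions.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

# An extractor takes HTML and returns a dict of fields, or None on failure.
Extractor = Callable[[str], Optional[Dict[str, str]]]


def extract_with_fallback(
    html: str,
    extractors: Iterable[Tuple[str, Extractor]],
) -> Optional[Dict[str, str]]:
    """Run extractors in order and return the first non-empty result."""
    for name, extractor in extractors:
        result = extractor(html)
        if result:
            # Record which extractor produced the result.
            return {**result, "source": name}
    # Every extractor failed.
    return None
```

With a stand-in extractor that returns None listed first and a selector-based stand-in listed second, the call returns the second extractor's fields with its name recorded under "source".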
Related
- Extractors Overview - Strategy selection
- Trafilatura Extractor - Alternative generic extractor
- Custom Extractors - Site-specific extraction
- Settings - Configuration options