Configuration
Enable in EXTRACTOR_ORDER:
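A minimal sketch of what this setting might look like, assuming EXTRACTOR_ORDER is a list of extractor names tried in order. The file name and the values "newspaper" and "trafilatura" are assumptions based on the extractors mentioned on this page; check the Settings page for the exact names.

```python
# settings.py (hypothetical location and values)
# Assumption: extractors are tried left to right until one succeeds.
EXTRACTOR_ORDER = [
    "newspaper",    # assumed name of the newspaper4k-backed extractor
    "trafilatura",  # assumed name of the fallback extractor
]
```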
Extracted Fields
Standard Fields
Article title (extracted from <h1>, <title>, or meta tags)
Falls back to title_hint if extraction fails.
Main article text (cleaned, no ads/navigation)
Validation: Min 100 characters
Author name(s) (comma-separated if multiple)
Extracted from:
- <meta name="author">
- rel="author"
- Common author class names
Publication date
Extracted from:
- <time> elements
- <meta property="article:published_time">
- URL patterns
Metadata Fields
Main article image URL
Extracted from:
- <meta property="og:image">
- <img> tags within article
Extracted keywords (auto-generated from content)
Auto-generated article summary
Example Output
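As an illustration only, the fields described above could be carried in a record shaped roughly like the one below. Every field name except source is hypothetical, and the values are placeholders rather than real extractor output; the source value mirrors the "source: newspaper4k" marker mentioned under Debugging.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class ExtractedArticle:
    # Field names are assumptions chosen to mirror the descriptions above.
    title: str
    content: str
    author: Optional[str] = None        # comma-separated if multiple
    published: Optional[str] = None     # publication date, format not specified here
    image_url: Optional[str] = None     # main article image URL
    keywords: List[str] = field(default_factory=list)
    summary: Optional[str] = None
    source: str = "newspaper4k"         # which extractor produced the record


# Placeholder values only; not output from a real page.
record = ExtractedArticle(
    title="Placeholder headline",
    content="Placeholder body text. " * 10,
    author="First Author, Second Author",
)
print(asdict(record))
```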
Fallback Behavior
Newspaper extraction fails when the title cannot be extracted (and no title_hint is available), when the content is shorter than 100 characters, or when the HTML structure is non-semantic. Configure fallback extractors in EXTRACTOR_ORDER.
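The fallback rule can be sketched as a loop over extractors that accepts the first result with a title and at least 100 characters of content. This is an illustration under stated assumptions, not the project's actual code: the function names, the dict keys title and content, and the stub extractors are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

MIN_CONTENT_CHARS = 100  # validation threshold stated in Standard Fields

Result = Dict[str, str]
Extractor = Callable[[str], Optional[Result]]


def accept(result: Optional[Result], title_hint: Optional[str] = None) -> Optional[Result]:
    """Return a cleaned result, or None if it fails the checks."""
    if result is None:
        return None
    # Use the hint only when the extractor found no title.
    title = (result.get("title") or title_hint or "").strip()
    content = (result.get("content") or "").strip()
    if not title or len(content) < MIN_CONTENT_CHARS:
        return None
    return {**result, "title": title, "content": content}


def extract_with_fallback(
    html: str,
    extractors: List[Tuple[str, Extractor]],
    title_hint: Optional[str] = None,
) -> Optional[Result]:
    """Try each (name, extractor) pair in order; tag the winner in 'source'."""
    for name, extractor in extractors:
        accepted = accept(extractor(html), title_hint)
        if accepted is not None:
            return {**accepted, "source": name}
    return None


# Stub extractors standing in for real ones (hypothetical).
def short_extractor(html: str) -> Optional[Result]:
    return {"title": "", "content": "too short"}


def long_extractor(html: str) -> Optional[Result]:
    return {"title": "Stub title", "content": "x" * 150}


result = extract_with_fallback(
    "<html></html>",
    [("newspaper", short_extractor), ("trafilatura", long_extractor)],
)
```

Here the first stub is rejected on both checks, so the record comes from the second one.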
Title Hints
The spider can provide a title_hint from other sources:
Example: Link text from parent page
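One way to collect link text from a parent page is sketched below with the standard library's HTMLParser. How the spider actually passes title_hint to the extractor is not shown on this page, so the class and function names here are hypothetical and the result is just a mapping from href to link text.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional


class LinkTextCollector(HTMLParser):
    """Collects the visible text of each <a href=...> as a candidate title hint."""

    def __init__(self) -> None:
        super().__init__()
        self.hints: Dict[str, str] = {}
        self._href: Optional[str] = None
        self._parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._parts = []

    def handle_data(self, data):
        if self._href is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace in the link text.
            text = " ".join("".join(self._parts).split())
            if text:
                # Keep the first non-empty text seen for each href.
                self.hints.setdefault(self._href, text)
            self._href = None


def collect_title_hints(parent_html: str) -> Dict[str, str]:
    collector = LinkTextCollector()
    collector.feed(parent_html)
    collector.close()
    return collector.hints


parent = (
    '<ul><li><a href="/news/1">  City council <b>approves</b> budget </a></li>'
    '<li><a href="/news/2"></a></li></ul>'
)
hints = collect_title_hints(parent)
```

Links with no visible text are skipped, so an article reached through an image-only link gets no hint.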
When to Use
Use Newspaper When:
- Well-structured news websites
- Blog platforms (WordPress, Medium, Ghost)
- Semantic HTML with proper tags
- Need automatic metadata extraction
Don’t Use Newspaper When:
Debugging
Check if newspaper succeeded: look for source: newspaper4k in the output.
Test on sample page:
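A quick manual check can be done by calling newspaper4k directly on a single URL, outside the spider. This sketch assumes newspaper4k is installed and uses its Article class (download, parse, and the title, text, authors, publish_date, and top_image attributes). The helper names and the dict keys are hypothetical.

```python
def try_newspaper(url: str) -> dict:
    """Download and parse one page with newspaper4k and return the main fields."""
    # Imported lazily so this module loads even without newspaper4k installed.
    from newspaper import Article

    article = Article(url)
    article.download()
    article.parse()
    return {
        "title": article.title,
        "content": article.text,
        "authors": article.authors,
        "publish_date": article.publish_date,
        "top_image": article.top_image,
    }


def describe(result: dict) -> str:
    """One-line summary that makes missing titles or short content easy to spot."""
    return "title=%r content_chars=%d authors=%d" % (
        result.get("title") or "",
        len(result.get("content") or ""),
        len(result.get("authors") or []),
    )
```

Calling describe(try_newspaper(url)) with a URL of your choosing shows at a glance whether the title is empty or the content falls under the 100-character minimum.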
- Check if content is in semantic tags (<article>, <main>, <p>)
- Try trafilatura as fallback
- Use custom selectors if generic extractors fail
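The first check in the list above can be approximated with a small script that measures how much of a page's visible text sits inside <article>, <main>, or <p>. This is a rough heuristic written for illustration, not something newspaper4k provides. It assumes reasonably well-formed HTML with closed tags, and the names are hypothetical.

```python
from html.parser import HTMLParser

SEMANTIC_TAGS = {"article", "main", "p"}
IGNORED_TAGS = {"script", "style"}


class SemanticTextCounter(HTMLParser):
    """Counts visible characters overall and inside semantic tags."""

    def __init__(self) -> None:
        super().__init__()
        self.depth = 0            # number of semantic tags currently open
        self.ignored = 0          # number of script/style tags currently open
        self.semantic_chars = 0
        self.total_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in SEMANTIC_TAGS:
            self.depth += 1
        elif tag in IGNORED_TAGS:
            self.ignored += 1

    def handle_endtag(self, tag):
        if tag in SEMANTIC_TAGS and self.depth > 0:
            self.depth -= 1
        elif tag in IGNORED_TAGS and self.ignored > 0:
            self.ignored -= 1

    def handle_data(self, data):
        if self.ignored:
            return
        n = len(data.strip())
        self.total_chars += n
        if self.depth > 0:
            self.semantic_chars += n


def semantic_ratio(html: str) -> float:
    """Fraction of visible text found inside semantic tags (0.0 if no text)."""
    counter = SemanticTextCounter()
    counter.feed(html)
    counter.close()
    if counter.total_chars == 0:
        return 0.0
    return counter.semantic_chars / counter.total_chars
```

A ratio near zero suggests the page relies on generic <div> and <span> containers, which is the case where a fallback or custom selectors are more likely to help.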
Configuration Examples
News Site (Primary Strategy)
Blog with Custom Fallback
Related
- Extractors Overview - Strategy selection
- Trafilatura Extractor - Alternative generic extractor
- Custom Extractors - Site-specific extraction
- Settings - Configuration options