The Newspaper4k extractor is a fast, general-purpose content extractor optimized for news articles and blog posts. It automatically identifies article structure and extracts metadata (author, date, keywords, images) without configuration.

Configuration

Enable the extractor by listing "newspaper" in EXTRACTOR_ORDER:
{
  "settings": {
    "EXTRACTOR_ORDER": ["newspaper", "trafilatura"]
  }
}

Extracted Fields

Standard Fields

title
string
Article title, extracted from <h1>, <title>, or meta tags. Falls back to title_hint if extraction fails.
content
string
Main article text (cleaned, no ads/navigation). Validation: minimum 100 characters.
author
string
Author name(s), comma-separated if multiple. Extracted from:
  • <meta name="author">
  • rel="author"
  • Common author class names
published_date
datetime
Publication date. Extracted from:
  • <time> elements
  • <meta property="article:published_time">
  • URL patterns
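
As a rough illustration of the three date sources listed above, the sketch below tries them in the listed order using only the Python standard library. It is not Newspaper4k's actual implementation: the function name guess_published_date, the /YYYY/MM/DD/ URL pattern, and the assumption that the list is a priority order are all hypothetical.

```python
import re
from datetime import datetime
from html.parser import HTMLParser
from typing import Optional


class _DateTagCollector(HTMLParser):
    """Collects the first <time datetime=...> and article:published_time values."""

    def __init__(self) -> None:
        super().__init__()
        self.time_value: Optional[str] = None
        self.meta_value: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "time" and self.time_value is None:
            self.time_value = attr.get("datetime")
        elif tag == "meta" and attr.get("property") == "article:published_time":
            if self.meta_value is None:
                self.meta_value = attr.get("content")


# Hypothetical URL pattern: a /YYYY/MM/DD segment somewhere in the path.
_URL_DATE = re.compile(r"/(\d{4})/(\d{1,2})/(\d{1,2})(?:/|$)")


def _parse_iso(value: Optional[str]) -> Optional[datetime]:
    """Parse an ISO 8601 string, returning None when it is missing or malformed."""
    if not value:
        return None
    try:
        # Swap a trailing "Z" for an explicit offset so older Pythons accept it.
        return datetime.fromisoformat(value.strip().replace("Z", "+00:00"))
    except ValueError:
        return None


def guess_published_date(html: str, url: str) -> Optional[datetime]:
    """Try <time>, then the meta tag, then the URL; return the first valid date."""
    collector = _DateTagCollector()
    collector.feed(html)
    for candidate in (collector.time_value, collector.meta_value):
        parsed = _parse_iso(candidate)
        if parsed is not None:
            return parsed
    match = _URL_DATE.search(url)
    if match:
        year, month, day = (int(part) for part in match.groups())
        try:
            return datetime(year, month, day)
        except ValueError:  # e.g. /2024/13/45/ is not a real date
            return None
    return None
```

A date found only in the URL carries no time of day, so the result falls back to midnight in that case.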

Metadata Fields

metadata.top_image
string
Main article image URL. Extracted from:
  • <meta property="og:image">
  • <img> tags within article
metadata.keywords
string[]
Extracted keywords (auto-generated from content)
metadata.summary
string
Auto-generated article summary
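
The top_image lookup described above can be pictured as a two-step preference: the og:image meta tag first, then the first <img> inside an <article> element. The following is a minimal standard-library sketch under that assumption; guess_top_image is a hypothetical helper, not part of Newspaper4k, and the real extractor may rank candidates differently.

```python
from html.parser import HTMLParser
from typing import Optional


class _ImageCollector(HTMLParser):
    """Records the og:image URL and the first <img> nested inside <article>."""

    def __init__(self) -> None:
        super().__init__()
        self.og_image: Optional[str] = None
        self.article_image: Optional[str] = None
        self._article_depth = 0  # > 0 while the parser is inside <article>

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and attr.get("property") == "og:image":
            if self.og_image is None:
                self.og_image = attr.get("content")
        elif tag == "article":
            self._article_depth += 1
        elif tag == "img" and self._article_depth > 0:
            if self.article_image is None:
                self.article_image = attr.get("src")

    def handle_endtag(self, tag):
        if tag == "article" and self._article_depth > 0:
            self._article_depth -= 1


def guess_top_image(html: str) -> Optional[str]:
    """Prefer og:image; otherwise use the first image inside the article body."""
    collector = _ImageCollector()
    collector.feed(html)
    return collector.og_image or collector.article_image
```

Images outside <article>, such as navigation icons, are ignored by this sketch.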

Example Output

ScrapedArticle(
    url="https://www.bbc.co.uk/news/articles/abc123",
    title="Breaking News: Major Event Occurs",
    content="Full article text here...",
    author="John Smith, Jane Doe",
    published_date=datetime(2024, 2, 24, 10, 30),
    source="newspaper4k",
    metadata={
        "top_image": "https://example.com/image.jpg",
        "keywords": ["news", "event", "breaking"],
        "summary": "A brief summary of the article..."
    }
)

Fallback Behavior

The Newspaper extractor fails when title extraction fails and no title_hint is available, when the content is shorter than 100 characters, or when the HTML structure is non-semantic. To handle these cases, list fallback extractors after "newspaper" in EXTRACTOR_ORDER.
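
To make the fallback behavior concrete, here is a minimal sketch of how an ordered extractor chain could apply the title and 100-character checks described above. The names Extraction, is_valid, extract_with_fallback, and the registry mapping are hypothetical and only illustrate the idea; the project's real dispatch code is not shown in this page and may differ.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

MIN_CONTENT_CHARS = 100  # threshold quoted in the text above


@dataclass
class Extraction:
    """Hypothetical result type, loosely modeled on the example output."""
    title: Optional[str]
    content: str
    source: str


# Hypothetical extractor signature: HTML in, Extraction (or None) out.
Extractor = Callable[[str], Optional[Extraction]]


def is_valid(result: Optional[Extraction], title_hint: Optional[str] = None) -> bool:
    """A result passes if it has a title (or a hint) and enough content."""
    if result is None:
        return False
    has_title = bool(result.title or title_hint)
    return has_title and len(result.content.strip()) >= MIN_CONTENT_CHARS


def extract_with_fallback(
    html: str,
    order: List[str],
    registry: Dict[str, Extractor],
    title_hint: Optional[str] = None,
) -> Optional[Extraction]:
    """Try each extractor named in `order`; return the first valid result."""
    for name in order:
        extractor = registry.get(name)
        if extractor is None:
            continue  # unknown names are skipped in this sketch
        result = extractor(html)
        if is_valid(result, title_hint):
            if not result.title:
                result.title = title_hint  # fill the gap from the hint
            return result
    return None
```

Here order plays the role of EXTRACTOR_ORDER: the first extractor whose output passes validation wins, and later entries run only when earlier ones fail.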

Title Hints

The spider can provide a title_hint from other sources. For example, the link text on the parent page:
<a href="/article/123">This becomes title_hint</a>
If Newspaper fails to extract a title, it uses the hint.
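
One way to picture where such hints come from is a pass over the parent page that maps each link's absolute URL to its visible text. The sketch below does this with the standard library; collect_title_hints and resolve_title are hypothetical helpers for illustration and are not the spider's actual code.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional
from urllib.parse import urljoin


class _LinkTextCollector(HTMLParser):
    """Maps each absolute link URL to the text inside its <a> element."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.hints: Dict[str, str] = {}
        self._href: Optional[str] = None  # set while inside an <a href=...>
        self._chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            self._href = urljoin(self.base_url, href) if href else None
            self._chunks = []

    def handle_data(self, data):
        if self._href is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace in the collected link text.
            text = " ".join("".join(self._chunks).split())
            if text:
                self.hints.setdefault(self._href, text)  # keep the first text seen
            self._href = None


def collect_title_hints(parent_html: str, base_url: str) -> Dict[str, str]:
    """Return {absolute_url: link_text} for every link on the parent page."""
    collector = _LinkTextCollector(base_url)
    collector.feed(parent_html)
    return collector.hints


def resolve_title(extracted_title: Optional[str], title_hint: Optional[str]) -> Optional[str]:
    """Use the extracted title when present, otherwise fall back to the hint."""
    return extracted_title or title_hint
```

With the link shown above and a base URL of https://example.com/, the hint for https://example.com/article/123 would be "This becomes title_hint".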

When to Use

Use Newspaper When:

  • Well-structured news websites
  • Blog platforms (WordPress, Medium, Ghost)
  • Semantic HTML with proper tags
  • Pages that need automatic metadata extraction

Don’t Use Newspaper When:

  • E-commerce sites (use custom extractors)
  • Forums or discussion boards (use callbacks)
  • Heavily customized layouts (use custom selectors)
  • JavaScript-rendered content (use playwright first)

Debugging

Check if newspaper succeeded:
scrapai show spider 1 --project proj
Look for source: newspaper4k in the output.

Test on a sample page:
# Download page HTML
scrapai inspect https://example.com/article --project proj

# Analyze structure
scrapai analyze data/proj/spider/analysis/page.html

# Run extraction test
scrapai crawl spider --limit 1 --project proj
If extraction fails:
  1. Check if content is in semantic tags (<article>, <main>, <p>)
  2. Try trafilatura as fallback
  3. Use custom selectors if generic extractors fail
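
For the first check in the list above, a quick count of semantic tags in the saved HTML can show whether the page gives a generic extractor much to work with. This is a small standard-library sketch; semantic_tag_counts is a hypothetical helper and is separate from the scrapai analyze command.

```python
from collections import Counter
from html.parser import HTMLParser
from typing import Dict

SEMANTIC_TAGS = ("article", "main", "p")  # tags named in step 1 above


class _TagCounter(HTMLParser):
    """Counts opening tags that appear in SEMANTIC_TAGS."""

    def __init__(self) -> None:
        super().__init__()
        self.counts: Counter = Counter()

    def handle_starttag(self, tag, attrs):
        if tag in SEMANTIC_TAGS:
            self.counts[tag] += 1


def semantic_tag_counts(html: str) -> Dict[str, int]:
    """Return how many <article>, <main>, and <p> tags the page contains."""
    counter = _TagCounter()
    counter.feed(html)
    return {tag: counter.counts[tag] for tag in SEMANTIC_TAGS}
```

A page that reports zero for all three tags is likely built from generic <div> containers, which is when a fallback extractor or custom selectors become more useful.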

Configuration Examples

News Site (Primary Strategy)

{
  "name": "bbc_co_uk",
  "source_url": "https://www.bbc.co.uk/",
  "allowed_domains": ["bbc.co.uk"],
  "start_urls": ["https://www.bbc.co.uk/news"],
  "rules": [
    {
      "allow": ["/news/articles/.*"],
      "callback": "parse_article"
    }
  ],
  "settings": {
    "EXTRACTOR_ORDER": ["newspaper", "trafilatura"],
    "DOWNLOAD_DELAY": 1,
    "CONCURRENT_REQUESTS": 16
  }
}

Blog with Custom Fallback

{
  "settings": {
    "EXTRACTOR_ORDER": ["newspaper", "custom"],
    "CUSTOM_SELECTORS": {
      "title": "h1.entry-title",
      "content": "div.entry-content",
      "author": "a.author-link",
      "date": "time.published"
    }
  }
}