Newspaper4k Extractor

The Newspaper4k extractor is a fast, general-purpose content extractor optimized for news articles and blog posts. It automatically identifies article structure and extracts metadata (author, date, keywords, images) without configuration.

Configuration

Enable the extractor by listing it in EXTRACTOR_ORDER:

{
  "settings": {
    "EXTRACTOR_ORDER": ["newspaper", "trafilatura"]
  }
}

Extracted Fields

Standard Fields

title

string

Article title, extracted from <h1>, <title>, or meta tags. Falls back to title_hint if extraction fails.

content

string

Main article text, cleaned of ads and navigation. Validation: minimum 100 characters.

author

string

Author name(s), comma-separated if multiple. Extracted from:

<meta name="author">
rel="author"
Common author class names

published_date

datetime

Publication date. Extracted from:

<time> elements
<meta property="article:published_time">
URL patterns
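
To illustrate the "URL patterns" source, here is a minimal sketch of pulling a date out of a URL path. The regular expression and the function name date_from_url are assumptions made for illustration; the patterns the extractor actually recognizes are not documented here and may differ.

```python
import re
from datetime import datetime
from typing import Optional

# Hypothetical pattern: a /YYYY/MM/DD or /YYYY-MM-DD path segment.
_URL_DATE = re.compile(r"/(\d{4})[/-](\d{1,2})[/-](\d{1,2})(?:/|$)")


def date_from_url(url: str) -> Optional[datetime]:
    """Return the date embedded in the URL path, or None if there is none."""
    match = _URL_DATE.search(url)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return datetime(year, month, day)
    except ValueError:
        # Digits matched the pattern but do not form a real calendar date.
        return None
```

A URL such as https://example.com/2024/02/24/some-story carries a date in its path, while https://www.bbc.co.uk/news/articles/abc123 does not, so in the second case the date would have to come from <time> or meta tags.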

Metadata Fields

metadata.top_image

string

Main article image URL. Extracted from:

<meta property="og:image">
<img> tags within article

metadata.keywords

string[]

Extracted keywords (auto-generated from content)

metadata.summary

string

Auto-generated article summary

Example Output

ScrapedArticle(
    url="https://www.bbc.co.uk/news/articles/abc123",
    title="Breaking News: Major Event Occurs",
    content="Full article text here...",
    author="John Smith, Jane Doe",
    published_date=datetime(2024, 2, 24, 10, 30),
    source="newspaper4k",
    metadata={
        "top_image": "https://example.com/image.jpg",
        "keywords": ["news", "event", "breaking"],
        "summary": "A brief summary of the article..."
    }
)
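
For readers consuming these records downstream, the sketch below models the output as a dataclass with the same field names as the example above. The class definition is an assumption for illustration only; the real ScrapedArticle type may declare different defaults or extra fields. The helper author_list is hypothetical and shows one way to split the comma-separated author string.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, List, Optional


@dataclass
class ScrapedArticle:
    """Illustrative stand-in mirroring the fields shown in the example output."""
    url: str
    title: str
    content: str
    author: Optional[str] = None
    published_date: Optional[datetime] = None
    source: str = ""
    metadata: Dict[str, Any] = field(default_factory=dict)


def author_list(article: ScrapedArticle) -> List[str]:
    """Split the comma-separated author field into individual names."""
    if not article.author:
        return []
    return [name.strip() for name in article.author.split(",") if name.strip()]


article = ScrapedArticle(
    url="https://www.bbc.co.uk/news/articles/abc123",
    title="Breaking News: Major Event Occurs",
    content="Full article text here...",
    author="John Smith, Jane Doe",
    published_date=datetime(2024, 2, 24, 10, 30),
    source="newspaper4k",
    metadata={"keywords": ["news", "event", "breaking"]},
)
authors = author_list(article)               # individual author names
keywords = article.metadata.get("keywords", [])  # metadata keys may be absent
```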

Fallback Behavior

The Newspaper extractor fails when title extraction fails and no title_hint is available, when the content is shorter than 100 characters, or when the HTML structure is non-semantic. Configure fallback extractors in EXTRACTOR_ORDER to handle these cases.
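
The fallback logic can be pictured roughly as follows. This is a sketch under stated assumptions, not the project's implementation: the function extract_with_fallback, the Extractor callable type, and the dictionary-shaped results are all hypothetical. Only the 100-character minimum and the title_hint fallback come from the text above.

```python
from typing import Callable, Dict, List, Optional

MIN_CONTENT_CHARS = 100  # validation threshold stated in this page

# Hypothetical extractor signature: takes HTML, returns a dict of fields or None.
Extractor = Callable[[str], Optional[dict]]


def extract_with_fallback(
    html: str,
    order: List[str],
    extractors: Dict[str, Extractor],
    title_hint: Optional[str] = None,
) -> Optional[dict]:
    """Try each extractor named in `order`; return the first valid result."""
    for name in order:
        extractor = extractors.get(name)
        if extractor is None:
            continue  # unknown name in the order list: skip it
        result = extractor(html)
        if result is None:
            continue
        title = result.get("title") or title_hint
        content = result.get("content") or ""
        # A result counts only if it has a title (or hint) and enough content.
        if title and len(content) >= MIN_CONTENT_CHARS:
            return {**result, "title": title, "source": name}
    return None  # every extractor in the chain failed
```

With an order of ["newspaper", "trafilatura"], the second extractor runs only when the first returns nothing usable, which is why listing at least one fallback is worthwhile.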

Title Hints

The spider can provide a title_hint from other sources, for example the link text on the parent page:

<a href="/article/123">This becomes title_hint</a>

If Newspaper fails to extract a title, it uses the hint.
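
As a rough illustration of where such hints could come from, the sketch below collects link text from a parent page using only the standard library, then picks between an extracted title and the hint. The names LinkTextCollector, collect_title_hints, and resolve_title are hypothetical; the spider's own mechanism is not shown in this page.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional


class LinkTextCollector(HTMLParser):
    """Map each link's href to its visible text."""

    def __init__(self) -> None:
        super().__init__()
        self.hints: Dict[str, str] = {}
        self._href: Optional[str] = None
        self._parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._parts = []

    def handle_data(self, data):
        if self._href is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace inside the link text.
            text = " ".join("".join(self._parts).split())
            if text:
                self.hints[self._href] = text
            self._href = None


def collect_title_hints(html: str) -> Dict[str, str]:
    collector = LinkTextCollector()
    collector.feed(html)
    collector.close()
    return collector.hints


def resolve_title(extracted: Optional[str], hint: Optional[str]) -> Optional[str]:
    """Prefer the extracted title; fall back to the hint when it is empty."""
    return (extracted or "").strip() or hint
```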

When to Use

Use Newspaper When:

Well-structured news websites

Blog platforms (WordPress, Medium, Ghost)

Semantic HTML with proper tags

Need automatic metadata extraction

Don’t Use Newspaper When:

E-commerce sites (use custom extractors)

Forums or discussion boards (use callbacks)

Heavily customized layouts (use custom selectors)

JavaScript-rendered content (use playwright first)

Debugging

Check if newspaper succeeded:

scrapai show spider 1 --project proj

Look for source: newspaper4k in the output.

Test on a sample page:

# Download page HTML
scrapai inspect https://example.com/article --project proj

# Analyze structure
scrapai analyze data/proj/spider/analysis/page.html

# Run extraction test
scrapai crawl spider --limit 1 --project proj

If extraction fails:

Check if content is in semantic tags (<article>, <main>, <p>)
Try trafilatura as fallback
Use custom selectors if generic extractors fail
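
The first check can be done quickly on a saved HTML file. The sketch below counts the semantic tags named above using only the Python standard library; the helper name semantic_tag_counts is hypothetical and is not part of the scrapai CLI.

```python
from collections import Counter
from html.parser import HTMLParser
from typing import Dict

SEMANTIC_TAGS = ("article", "main", "p")


class TagCounter(HTMLParser):
    """Count how many times each start tag appears."""

    def __init__(self) -> None:
        super().__init__()
        self.counts: Counter = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def semantic_tag_counts(html: str) -> Dict[str, int]:
    """Return the number of <article>, <main>, and <p> tags in the page."""
    counter = TagCounter()
    counter.feed(html)
    counter.close()
    return {tag: counter.counts[tag] for tag in SEMANTIC_TAGS}
```

A page whose counts are all zero, or that has only a handful of <p> tags among many <div> elements, is a likely candidate for custom selectors rather than a generic extractor.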

Configuration Examples

News Site (Primary Strategy)

{
  "name": "bbc_co_uk",
  "source_url": "https://www.bbc.co.uk/",
  "allowed_domains": ["bbc.co.uk"],
  "start_urls": ["https://www.bbc.co.uk/news"],
  "rules": [
    {
      "allow": ["/news/articles/.*"],
      "callback": "parse_article"
    }
  ],
  "settings": {
    "EXTRACTOR_ORDER": ["newspaper", "trafilatura"],
    "DOWNLOAD_DELAY": 1,
    "CONCURRENT_REQUESTS": 16
  }
}
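
Before crawling, it can help to confirm that sample URLs match the allow pattern. The sketch below assumes the patterns are regular expressions searched anywhere in the URL, as Scrapy's link extractors do; whether this project applies them in exactly that way is an assumption, and url_allowed is a hypothetical helper.

```python
import re
from typing import Iterable


def url_allowed(url: str, allow: Iterable[str]) -> bool:
    """Return True if any allow pattern is found anywhere in the URL."""
    return any(re.search(pattern, url) for pattern in allow)


allow_patterns = ["/news/articles/.*"]
article_ok = url_allowed("https://www.bbc.co.uk/news/articles/abc123", allow_patterns)
sport_ok = url_allowed("https://www.bbc.co.uk/sport/football", allow_patterns)
```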

Blog with Custom Fallback

{
  "settings": {
    "EXTRACTOR_ORDER": ["newspaper", "custom"],
    "CUSTOM_SELECTORS": {
      "title": "h1.entry-title",
      "content": "div.entry-content",
      "author": "a.author-link",
      "date": "time.published"
    }
  }
}

Related

Extractors Overview - Strategy selection
Trafilatura Extractor - Alternative generic extractor
Custom Extractors - Site-specific extraction
Settings - Configuration options
