Test generic extractors (newspaper, trafilatura) first. Only use custom selectors if they fail.

Discovery Workflow

1

Inspect article page

./scrapai inspect https://example.com/article-url --project proj
Fetches the page and saves HTML for analysis
2

Analyze HTML structure

./scrapai analyze data/proj/spider/analysis/page.html
Shows:
  • h1/h2 titles with classes
  • Content containers by size
  • Date elements
  • Author elements
3

Test selectors

./scrapai analyze data/proj/spider/analysis/page.html --test "h1.article-title"
./scrapai analyze data/proj/spider/analysis/page.html --test "div.article-body"
Validate that selectors extract correct content
4

Search for specific fields

./scrapai analyze data/proj/spider/analysis/page.html --find "price"
Find elements containing specific text

Extractor Order Options

Config                                  When to use
["newspaper", "trafilatura"]            Generic extractors work (clean news/blog HTML)
["custom", "newspaper", "trafilatura"]  Generic extractors fail; custom selectors needed
["playwright", "custom"]                JS-rendered content (SPAs, dynamic loading)
["playwright", "trafilatura"]           JS-rendered, generic extractors work after rendering
Extractors are tried in order. First successful extraction wins.
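
The "first success wins" behavior can be pictured with a minimal Python sketch. This is not the tool's actual code: the function name extract_with_fallback, the registry mapping, and the rule that "success" means a non-empty content field are all assumptions made for illustration.

```python
from __future__ import annotations

from typing import Callable, Optional

# Hypothetical extractor signature: takes HTML, returns a dict of fields or None.
Extractor = Callable[[str], Optional[dict]]


def extract_with_fallback(
    html: str, order: list[str], registry: dict[str, Extractor]
) -> Optional[dict]:
    """Try each named extractor in order and return the first usable result."""
    for name in order:
        extractor = registry.get(name)
        if extractor is None:
            continue  # unknown name in the order list: skip it
        result = extractor(html)
        # Assumption: a result counts as successful when it has non-empty content.
        if result and result.get("content"):
            return {**result, "extractor": name}
    return None


# Stub extractors standing in for the real ones.
registry = {
    "newspaper": lambda html: None,
    "trafilatura": lambda html: {"title": "Hello", "content": "Body text"},
}
winner = extract_with_fallback("<html></html>", ["newspaper", "trafilatura"], registry)
```

Here the first extractor returns nothing, so the result comes from the second one in the list.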

Generic Extractors

Best for: Clean news articles and blog posts
{
  "settings": {
    "EXTRACTOR_ORDER": ["newspaper", "trafilatura"]
  }
}
What it extracts:
  • Title
  • Author
  • Published date
  • Main content
  • Top image
  • Keywords
  • Summary
Generic extractors work for ~80% of news/blog sites. Try them first before creating custom selectors.

Custom Selectors

Use when generic extractors fail to extract content properly. Standard fields (title, author, content, date) map to the main DB columns. Any other field is stored in the metadata JSON column.
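
As a rough illustration of that split, the sketch below separates an extracted record into standard columns and a JSON string for everything else. The helper name split_fields and the exact storage shape are assumptions; only the list of standard fields comes from the text above.

```python
from __future__ import annotations

import json

# Standard fields named in this document; everything else is treated as metadata.
STANDARD_FIELDS = ("title", "author", "content", "date")


def split_fields(extracted: dict) -> tuple[dict, str]:
    """Return (column values, metadata JSON string) for one extracted item."""
    columns = {key: extracted.get(key) for key in STANDARD_FIELDS}
    extras = {k: v for k, v in extracted.items() if k not in STANDARD_FIELDS}
    return columns, json.dumps(extras, sort_keys=True)


columns, metadata = split_fields(
    {"title": "Widget", "content": "A fine widget.", "price": "19.99", "brand": "Acme"}
)
```

With this input, price and brand end up in the metadata JSON, while author and date are present in the columns as None.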

News Article

spider.json
{
  "settings": {
    "EXTRACTOR_ORDER": ["custom", "trafilatura"],
    "CUSTOM_SELECTORS": {
      "title": "h1.article-title",
      "content": "div.article-body",
      "author": "span.author-name",
      "date": "time.published-date",
      "category": "a.category-link",
      "tags": "div.tags a"
    }
  }
}

E-commerce Product

spider.json
{
  "settings": {
    "EXTRACTOR_ORDER": ["custom"],
    "CUSTOM_SELECTORS": {
      "title": "h1.product-name",
      "content": "div.product-description",
      "price": "span.price-value",
      "rating": "div.star-rating",
      "stock": "span.availability",
      "brand": "div.brand-name"
    }
  }
}

Forum Thread

spider.json
{
  "settings": {
    "EXTRACTOR_ORDER": ["custom"],
    "CUSTOM_SELECTORS": {
      "title": "h1.thread-title",
      "author": "span.username",
      "content": "div.post-content",
      "date": "time.post-date",
      "upvotes": "span.vote-count"
    }
  }
}

Playwright Extractor

For JavaScript-rendered or dynamically loaded content.

Basic Configuration

spider.json
{
  "settings": {
    "EXTRACTOR_ORDER": ["playwright", "trafilatura"],
    "PLAYWRIGHT_WAIT_SELECTOR": ".article-content",
    "PLAYWRIGHT_DELAY": 5
  }
}
Settings:
  • PLAYWRIGHT_WAIT_SELECTOR: CSS selector to wait for (max 30s)
  • PLAYWRIGHT_DELAY: Extra seconds after page load

Infinite Scroll

For single-page sites with dynamic scroll loading:
spider.json
{
  "settings": {
    "EXTRACTOR_ORDER": ["playwright", "trafilatura"],
    "PLAYWRIGHT_WAIT_SELECTOR": ".quote",
    "PLAYWRIGHT_DELAY": 2,
    "INFINITE_SCROLL": true,
    "MAX_SCROLLS": 10,
    "SCROLL_DELAY": 2.0
  }
}
Settings:
  • INFINITE_SCROLL: Enable scroll behavior (default: false)
  • MAX_SCROLLS: Max scrolls to perform (default: 5)
  • SCROLL_DELAY: Seconds between scrolls (default: 1.0)
Requires playwright in EXTRACTOR_ORDER. Set follow: false in rules to avoid following dynamically loaded links.
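
How MAX_SCROLLS and SCROLL_DELAY interact can be sketched as a bounded loop. The code below is a simulation, not the tool's implementation: scroll_once is a hypothetical callback that scrolls the page and returns the number of items currently loaded, and stopping early when that count stops growing is an assumption that may not match the real behavior.

```python
import time
from typing import Callable


def scroll_until_stable(
    scroll_once: Callable[[], int],
    max_scrolls: int = 5,
    scroll_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Scroll at most max_scrolls times and return how many scrolls were performed."""
    last_count = -1
    performed = 0
    for _ in range(max_scrolls):
        count = scroll_once()
        performed += 1
        if count == last_count:
            break  # assumed early exit: nothing new loaded on this scroll
        last_count = count
        sleep(scroll_delay)  # give the page time to load the next batch
    return performed


# Simulated page: items load on the first two scrolls, then nothing new appears.
fake_counts = iter([10, 20, 20, 20])
scrolls = scroll_until_stable(
    lambda: next(fake_counts), max_scrolls=10, scroll_delay=0, sleep=lambda s: None
)
```

Under these assumptions, the worst-case time spent scrolling is about MAX_SCROLLS × SCROLL_DELAY seconds, which is worth keeping in mind when raising either value.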

Complete Playwright Example

spa_spider.json
{
  "name": "react_app",
  "allowed_domains": ["example.com"],
  "start_urls": ["https://example.com/app"],
  "rules": [
    {
      "allow": ["/article/.*"],
      "callback": "parse_article",
      "follow": false
    }
  ],
  "settings": {
    "EXTRACTOR_ORDER": ["playwright", "custom"],
    "PLAYWRIGHT_WAIT_SELECTOR": "#article-loaded",
    "PLAYWRIGHT_DELAY": 3,
    "CUSTOM_SELECTORS": {
      "title": "h1.title",
      "content": "div.content",
      "author": "span.author"
    }
  }
}
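
Before running a spider with a rule like the one above, it can help to check which URLs the allow pattern would accept. The sketch below assumes Scrapy-style semantics (the pattern is searched anywhere in the absolute URL, and subdomains of allowed_domains are accepted); the function matches_rule is hypothetical and the tool's real matching may differ.

```python
import re
from urllib.parse import urlparse


def matches_rule(url: str, allow: list, allowed_domains: list) -> bool:
    """Return True if the URL is on an allowed domain and matches an allow pattern."""
    host = urlparse(url).hostname or ""
    in_domain = any(host == d or host.endswith("." + d) for d in allowed_domains)
    # Assumption: patterns are regexes searched anywhere in the URL.
    return in_domain and any(re.search(pattern, url) for pattern in allow)


allow = ["/article/.*"]
domains = ["example.com"]
candidates = [
    "https://example.com/article/42",
    "https://example.com/about",
    "https://other.com/article/42",
]
accepted = [u for u in candidates if matches_rule(u, allow, domains)]
```

Only the first candidate is accepted: the second fails the pattern and the third is off-domain.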

Selector Discovery Principles

1

Target MAIN content element

Not navigation, sidebar, or footer
2

Selector should match ONE element

Verify uniqueness on multiple pages
3

Prefer specific classes

Use .article-title over generic .title
4

Test on multiple articles

Selector must work across different pages
5

Prefer semantic tags

Use <article>, <time>, <h1> when available
6

Validate content length

  • Content selector should return >500 chars
  • Title should return >10 chars
7

Avoid dynamic class names

Skip randomly generated or versioned classes
Common mistakes:
  • Selector matches multiple elements (gets first, often wrong)
  • Selector targets sidebar/footer instead of main content
  • Overly generic selectors like div.text
  • Guessing without testing on actual HTML
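
Two of the principles above (content length and dynamic class names) lend themselves to quick automated checks. The following is a heuristic sketch only: the helper names and the rules used to guess that a class is hashed or versioned are assumptions, while the 500 and 10 character thresholds come from the list above.

```python
import re


def check_extraction(title, content):
    """Return a list of problems based on the length thresholds above."""
    problems = []
    if len((title or "").strip()) <= 10:
        problems.append("title too short")
    if len((content or "").strip()) <= 500:
        problems.append("content too short")
    return problems


def dynamic_classes(selector):
    """Return class names in a CSS selector that look generated or versioned."""
    flagged = []
    for cls in re.findall(r"\.([A-Za-z0-9_-]+)", selector):
        tail = re.split(r"[-_]+", cls)[-1]  # last segment, e.g. "3xK9a"
        # Heuristic: a tail of 5+ chars mixing letters and digits looks hashed.
        hashed = (
            len(tail) >= 5
            and re.search(r"\d", tail) is not None
            and re.search(r"[A-Za-z]", tail) is not None
        )
        versioned = re.fullmatch(r"v\d+", tail) is not None  # e.g. "header-v2"
        if hashed or versioned:
            flagged.append(cls)
    return flagged


issues = check_extraction("Short", "x" * 600)
suspicious = dynamic_classes("div.Article_body__3xK9a p.lead")
```

A flagged class is a prompt to look for a more stable selector, not proof that the class will change.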

Identifying JS-Rendered Sites

Signs that page.html needs Playwright:
<body>
  <div id="app"></div>
  <script src="bundle.js"></script>
</body>
If you see minimal/empty content or just <div id="app"></div>, use ["playwright", ...] in EXTRACTOR_ORDER.
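
This check can be approximated in code by measuring how much visible text the saved HTML contains. The sketch below uses only the Python standard library; the function looks_js_rendered and the 200-character threshold are assumptions chosen for illustration, not values used by the tool.

```python
from html.parser import HTMLParser


class _TextCounter(HTMLParser):
    """Counts visible text characters and <script> tags."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.text_chars = 0
        self.script_tags = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_tags += 1
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:  # ignore text inside script/style/noscript
            self.text_chars += len(data.strip())


def looks_js_rendered(html, min_text_chars=200):
    """Guess that a page needs a browser: it loads scripts but has little text."""
    counter = _TextCounter()
    counter.feed(html)
    counter.close()
    return counter.script_tags > 0 and counter.text_chars < min_text_chars


shell = '<body><div id="app"></div><script src="bundle.js"></script></body>'
needs_browser = looks_js_rendered(shell)
```

A True result is only a hint; confirming by viewing the page in a browser is still worthwhile.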

Playwright Wait: Common Selectors

Article sites:
  • .article-content
  • #main-content
  • article.post
  • [data-loaded="true"]
Product pages:
  • .product-details
  • .price-container
  • #product-info
Social/Forums:
  • .post-list
  • #posts
  • .loaded
Generic:
  • .content-loaded
  • [data-ready]
  • .main-container

Implementation Details

Extractor classes (from source: core/extractors.py):

NewspaperExtractor

extractors.py:36-78. Uses the newspaper4k library

TrafilaturaExtractor

extractors.py:80-128. Uses the trafilatura library

CustomExtractor

extractors.py:131-257. Uses BeautifulSoup + CSS selectors

SmartExtractor

extractors.py:259-464. Tries multiple strategies in order

PlaywrightExtractor

extractors.py:398-464. Async browser rendering

Troubleshooting

Generic Extractors Return Empty Content

1

Check if JS-rendered

Inspect HTML for empty <div id="app"></div> or minimal content
2

Try Playwright extractor

{"EXTRACTOR_ORDER": ["playwright", "trafilatura"]}
3

Use custom selectors

If generic extractors consistently fail:
{"EXTRACTOR_ORDER": ["custom", "trafilatura"]}

Custom Selector Returns None

1

Test selector

./scrapai analyze page.html --test "your-selector"
2

Check selector specificity

Ensure selector matches the intended element uniquely
3

Verify HTML structure

Element might be in different location than expected

Playwright Timeout

1

Increase the post-load delay

{"PLAYWRIGHT_DELAY": 10}
2

Use different wait selector

Find selector that appears earlier:
{"PLAYWRIGHT_WAIT_SELECTOR": ".loading-complete"}
3

Check selector exists

Verify wait selector actually appears in rendered page

Content Extracted But Wrong

1

Verify selector uniqueness

May be matching wrong element (sidebar, footer)
2

Make selector more specific

Add parent classes or IDs:
{"content": "main article.post div.body"}
3

Test on multiple pages

Ensure selector works across different articles

Best Practices

1

Start with generic extractors

They work for ~80% of news/blog sites
{"EXTRACTOR_ORDER": ["newspaper", "trafilatura"]}
2

Add custom selectors if needed

When generic extractors fail:
{"EXTRACTOR_ORDER": ["custom", "trafilatura"]}
3

Use Playwright for JS sites

Only when content is dynamically rendered:
{"EXTRACTOR_ORDER": ["playwright", "trafilatura"]}
4

Test on multiple pages

Verify extraction works across different articles
5

Monitor extraction quality

Check scraped content with show command:
./scrapai show 1 --project proj