Test generic extractors (newspaper, trafilatura) first. Only use custom selectors if they fail.

Discovery Workflow

1. Inspect article page

./scrapai inspect https://example.com/article-url --project proj

2. Analyze HTML structure

./scrapai analyze data/proj/spider/analysis/page.html

3. Test selectors

./scrapai analyze data/proj/spider/analysis/page.html --test "h1.article-title"
./scrapai analyze data/proj/spider/analysis/page.html --test "div.article-body"

4. Search for specific fields

./scrapai analyze data/proj/spider/analysis/page.html --find "price"

Extractor Order Options

Config | When to use
["newspaper", "trafilatura"] | Generic extractors work (clean news/blog HTML)
["custom", "newspaper", "trafilatura"] | Generic extractors fail; custom selectors needed
["playwright", "custom"] | JS-rendered content (SPAs, dynamic loading)
["playwright", "trafilatura"] | JS-rendered, generic extractors work after rendering
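
The table above can be read as a two-question decision: is the page JS-rendered, and do the generic extractors produce usable output? The sketch below encodes that reading in Python. The function name and its two boolean parameters are hypothetical and are not part of scrapai; only the returned lists come from the table.

```python
from __future__ import annotations


def choose_extractor_order(js_rendered: bool, generic_works: bool) -> list[str]:
    """Return an EXTRACTOR_ORDER value based on the table above (hypothetical helper)."""
    if js_rendered:
        # Render with Playwright first, then extract from the rendered HTML.
        if generic_works:
            return ["playwright", "trafilatura"]
        return ["playwright", "custom"]
    if generic_works:
        return ["newspaper", "trafilatura"]
    # Custom selectors run first; generic extractors stay in the list after them.
    return ["custom", "newspaper", "trafilatura"]


print(choose_extractor_order(js_rendered=False, generic_works=True))
```

The returned list is what would go under "EXTRACTOR_ORDER" in the spider's settings.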

Generic Extractors

Best for: Clean news articles and blog posts
{
  "settings": {
    "EXTRACTOR_ORDER": ["newspaper", "trafilatura"]
  }
}
What it extracts:
  • Title
  • Author
  • Published date
  • Main content
  • Top image
  • Keywords
  • Summary
Generic extractors work for roughly 80% of news and blog sites, so try them before writing custom selectors.

Custom Selectors

Standard fields (title, author, content, date) are stored in the main DB columns. Any other field is stored in the metadata JSON column.
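
To make the split concrete, here is a minimal Python sketch of how an extracted item might be divided between the main columns and the metadata JSON value. The helper name `split_fields` and the sample item are illustrative assumptions. The actual storage code is not shown on this page and may differ.

```python
from __future__ import annotations

import json

# The four standard fields named above; everything else is treated as metadata.
STANDARD_FIELDS = ("title", "author", "content", "date")


def split_fields(extracted: dict) -> tuple[dict, str]:
    """Split one extracted item into (main columns, metadata JSON string)."""
    columns = {k: v for k, v in extracted.items() if k in STANDARD_FIELDS}
    extra = {k: v for k, v in extracted.items() if k not in STANDARD_FIELDS}
    return columns, json.dumps(extra, sort_keys=True)


# Hypothetical item, shaped like the output of product-style selectors.
item = {"title": "Widget", "content": "A sturdy widget.", "price": "19.99", "brand": "Acme"}
columns, metadata = split_fields(item)
print(columns)
print(metadata)
```

Here `title` and `content` land in the main columns, while `price` and `brand` end up in the metadata JSON.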

News Article

spider.json
{
  "settings": {
    "EXTRACTOR_ORDER": ["custom", "trafilatura"],
    "CUSTOM_SELECTORS": {
      "title": "h1.article-title",
      "content": "div.article-body",
      "author": "span.author-name",
      "date": "time.published-date",
      "category": "a.category-link",
      "tags": "div.tags a"
    }
  }
}

E-commerce Product

spider.json
{
  "settings": {
    "EXTRACTOR_ORDER": ["custom"],
    "CUSTOM_SELECTORS": {
      "title": "h1.product-name",
      "content": "div.product-description",
      "price": "span.price-value",
      "rating": "div.star-rating",
      "stock": "span.availability",
      "brand": "div.brand-name"
    }
  }
}

Forum Thread

spider.json
{
  "settings": {
    "EXTRACTOR_ORDER": ["custom"],
    "CUSTOM_SELECTORS": {
      "title": "h1.thread-title",
      "author": "span.username",
      "content": "div.post-content",
      "date": "time.post-date",
      "upvotes": "span.vote-count"
    }
  }
}

Playwright Extractor

Basic Configuration

spider.json
{
  "settings": {
    "EXTRACTOR_ORDER": ["playwright", "trafilatura"],
    "PLAYWRIGHT_WAIT_SELECTOR": ".article-content",
    "PLAYWRIGHT_DELAY": 5
  }
}
Settings:
  • PLAYWRIGHT_WAIT_SELECTOR: CSS selector to wait for (max 30s)
  • PLAYWRIGHT_DELAY: Extra seconds after page load

Infinite Scroll

spider.json
{
  "settings": {
    "EXTRACTOR_ORDER": ["playwright", "trafilatura"],
    "PLAYWRIGHT_WAIT_SELECTOR": ".quote",
    "PLAYWRIGHT_DELAY": 2,
    "INFINITE_SCROLL": true,
    "MAX_SCROLLS": 10,
    "SCROLL_DELAY": 2.0
  }
}
Settings:
  • INFINITE_SCROLL: Enable scroll behavior (default: false)
  • MAX_SCROLLS: Max scrolls to perform (default: 5)
  • SCROLL_DELAY: Seconds between scrolls (default: 1.0)
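
As a rough illustration of how MAX_SCROLLS and SCROLL_DELAY interact, the following Python sketch runs a scroll loop against injected callables instead of a real browser. It is not scrapai's implementation. The early exit when the page height stops growing is an assumption, and `scroll_page` and `FakePage` are hypothetical names.

```python
import time
from typing import Callable


def scroll_page(
    scroll_to_bottom: Callable[[], None],
    get_height: Callable[[], int],
    max_scrolls: int = 5,       # mirrors the MAX_SCROLLS default
    scroll_delay: float = 1.0,  # mirrors the SCROLL_DELAY default
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Scroll at most max_scrolls times and return the number of scrolls performed."""
    performed = 0
    last_height = get_height()
    for _ in range(max_scrolls):
        scroll_to_bottom()
        performed += 1
        sleep(scroll_delay)  # give lazily loaded items time to arrive
        new_height = get_height()
        if new_height == last_height:
            break  # assumption: no growth means there is nothing more to load
        last_height = new_height
    return performed


class FakePage:
    """Stand-in for a browser page that loads a fixed number of extra batches."""

    def __init__(self, batches: int) -> None:
        self.height = 1000
        self.batches = batches

    def scroll(self) -> None:
        if self.batches > 0:
            self.batches -= 1
            self.height += 1000


page = FakePage(batches=3)
count = scroll_page(page.scroll, lambda: page.height, max_scrolls=10,
                    scroll_delay=0, sleep=lambda seconds: None)
print(count)
```

With a real browser, the two callables would wrap whatever scroll and height calls the rendering layer provides.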

Complete Playwright Example

spa_spider.json
{
  "name": "react_app",
  "allowed_domains": ["example.com"],
  "start_urls": ["https://example.com/app"],
  "rules": [
    {
      "allow": ["/article/.*"],
      "callback": "parse_article",
      "follow": false
    }
  ],
  "settings": {
    "EXTRACTOR_ORDER": ["playwright", "custom"],
    "PLAYWRIGHT_WAIT_SELECTOR": "#article-loaded",
    "PLAYWRIGHT_DELAY": 3,
    "CUSTOM_SELECTORS": {
      "title": "h1.title",
      "content": "div.content",
      "author": "span.author"
    }
  }
}

Selector Discovery Principles

  • Target main content element (not navigation, sidebar, footer)
  • Selector should match ONE unique element
  • Prefer specific classes (.article-title over .title)
  • Test on multiple pages
  • Prefer semantic tags (<article>, <time>, <h1>)
  • Validate content length (>500 chars for content, >10 for title)
  • Avoid dynamic/generated class names
Common mistakes:
  • Selector matches multiple elements
  • Targets sidebar/footer instead of main content
  • Overly generic selectors like div.text
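
The uniqueness and length checks above are easy to automate once the matched text is available. The Python sketch below assumes the text of every element matched by a selector has already been collected into a list. The function name is hypothetical, and the thresholds are the ones given in the list.

```python
from __future__ import annotations

# Minimum lengths from the principles above: >500 chars for content, >10 for title.
MIN_LENGTH = {"content": 500, "title": 10}


def check_selector_result(field: str, matches: list[str]) -> list[str]:
    """Return a list of problems found for one field's selector matches."""
    if not matches:
        return ["selector matched nothing"]
    problems = []
    if len(matches) > 1:
        problems.append(f"selector matched {len(matches)} elements, expected 1")
    text = matches[0].strip()
    minimum = MIN_LENGTH.get(field)
    if minimum is not None and len(text) <= minimum:
        problems.append(f"{field} is {len(text)} chars, expected more than {minimum}")
    return problems


print(check_selector_result("title", ["A clear, descriptive headline"]))
print(check_selector_result("content", ["short", "short"]))
```

An empty result means the selector passed these checks for that page. Running it across several pages covers the "test on multiple pages" principle.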

Identifying JS-Rendered Sites

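If the raw HTML body contains little more than an empty mount element and a script bundle, the content is rendered client-side by JavaScript:
<body>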
  <div id="app"></div>
  <script src="bundle.js"></script>
</body>
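
A check like this can be approximated in code. The heuristic below, written with Python's standard `html.parser`, flags a page as likely JS-rendered when it loads at least one script but carries very little visible text. The 200-character threshold and all names are assumptions chosen for illustration, not scrapai behavior.

```python
from html.parser import HTMLParser


class ShellProbe(HTMLParser):
    """Counts <script> tags and visible text characters in an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.script_count = 0
        self.text_chars = 0
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_count += 1
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.text_chars += len(data.strip())


def looks_js_rendered(html: str, min_text_chars: int = 200) -> bool:
    """Heuristic: the page loads scripts but has almost no visible text."""
    probe = ShellProbe()
    probe.feed(html)
    probe.close()
    return probe.script_count > 0 and probe.text_chars < min_text_chars


shell = '<body>\n  <div id="app"></div>\n  <script src="bundle.js"></script>\n</body>'
print(looks_js_rendered(shell))
```

A heuristic like this can misfire on short pages that happen to include scripts, so treat a positive result as a prompt to inspect the page, not as proof.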

Playwright Wait: Common Selectors

Article sites:
  • .article-content
  • #main-content
  • article.post
  • [data-loaded="true"]
Product pages:
  • .product-details
  • .price-container
  • #product-info
Social/Forums:
  • .post-list
  • #posts
  • .loaded
Generic:
  • .content-loaded
  • [data-ready]
  • .main-container

Implementation Details

Extractor classes (from source: core/extractors.py):

NewspaperExtractor

extractors.py:36-78: Uses newspaper4k library

TrafilaturaExtractor

extractors.py:80-128: Uses trafilatura library

CustomExtractor

extractors.py:131-257: Uses BeautifulSoup + CSS selectors

SmartExtractor

extractors.py:259-464: Tries multiple strategies in order

PlaywrightExtractor

extractors.py:398-464: Async browser rendering
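
The "tries multiple strategies in order" idea behind SmartExtractor can be sketched as a simple fallback chain. The Python code below is not the code in core/extractors.py. The registry of callables, the acceptance rule based on content length, and every name in it are assumptions made for illustration.

```python
from __future__ import annotations

from typing import Callable, Optional

# Each extractor takes HTML and returns a dict of fields, or None on failure.
Extractor = Callable[[str], Optional[dict]]


def extract_with_fallback(
    html: str,
    order: list[str],
    registry: dict[str, Extractor],
    min_content_chars: int = 500,
) -> Optional[dict]:
    """Run extractors in the given order and return the first acceptable result."""
    for name in order:
        extractor = registry.get(name)
        if extractor is None:
            continue  # unknown name in the order list; skip it
        result = extractor(html)
        # Assumed acceptance rule: content must exceed a minimum length.
        if result and len(result.get("content") or "") > min_content_chars:
            return {**result, "extractor": name}
    return None


# Stand-in extractors: the first fails, the second succeeds.
registry = {
    "newspaper": lambda html: None,
    "trafilatura": lambda html: {"title": "T", "content": "x" * 600},
}
result = extract_with_fallback("<html></html>", ["newspaper", "trafilatura"], registry)
print(result["extractor"])
```

Under these assumptions, the order of the list decides which extractor's output wins, which is why EXTRACTOR_ORDER puts the preferred strategy first.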

Troubleshooting

Generic Extractors Return Empty Content

  1. Check if JS-rendered (empty <div id="app"></div>)
  2. Try Playwright: {"EXTRACTOR_ORDER": ["playwright", "trafilatura"]}
  3. Use custom selectors: {"EXTRACTOR_ORDER": ["custom", "trafilatura"]}

Custom Selector Returns None

  1. Test selector: ./scrapai analyze page.html --test "your-selector"
  2. Check selector specificity and uniqueness
  3. Verify element exists in HTML structure

Playwright Timeout

  1. Increase delay: {"PLAYWRIGHT_DELAY": 10}
  2. Use different wait selector that appears earlier
  3. Verify selector exists in rendered page

Content Extracted But Wrong

  1. Verify selector uniqueness (may match sidebar/footer)
  2. Make selector more specific: {"content": "main article.post div.body"}
  3. Test on multiple pages

Best Practices

  1. Start with generic extractors: ["newspaper", "trafilatura"] (80% success rate)
  2. Add custom selectors if needed: ["custom", "trafilatura"]
  3. Use Playwright for JS-rendered sites: ["playwright", "trafilatura"]
  4. Test on multiple pages
  5. Monitor quality: ./scrapai show 1 --project proj
