Playwright extractor uses a headless browser to render JavaScript-heavy pages before extraction. Essential for Single Page Applications (SPAs) and dynamically loaded content.
When to Use
Content loaded by JavaScript (React, Vue, Angular apps)
AJAX/fetch requests after page load
Content appears after delay
Dynamic rendering based on user interaction
Slower than HTML-based extractors (newspaper, trafilatura)
Higher resource usage (browser instance per request)
Use low concurrency (2-4 requests)
Configuration
Basic Playwright
{
"settings": {
"EXTRACTOR_ORDER": ["playwright", "trafilatura"],
"CONCURRENT_REQUESTS": 2,
"DOWNLOAD_DELAY": 2
}
}
Behavior:
- Playwright renders page in browser
- Waits for page load
- Extracts HTML
- Trafilatura processes rendered HTML
Wait for Selector
{
"settings": {
"EXTRACTOR_ORDER": ["playwright", "trafilatura"],
"PLAYWRIGHT_WAIT_SELECTOR": ".article-content",
"CONCURRENT_REQUESTS": 2
}
}
Behavior:
- Wait for
.article-content element to appear (max 30s)
- If selector doesn’t appear → extraction fails
Additional Delay
{
"settings": {
"EXTRACTOR_ORDER": ["playwright"],
"PLAYWRIGHT_WAIT_SELECTOR": ".loaded",
"PLAYWRIGHT_DELAY": 5
}
}
Behavior:
- Wait for
.loaded element
- Wait additional 5 seconds
- Then extract
{
"settings": {
"EXTRACTOR_ORDER": ["playwright", "trafilatura"],
"PLAYWRIGHT_WAIT_SELECTOR": ".quote",
"INFINITE_SCROLL": true,
"MAX_SCROLLS": 10,
"SCROLL_DELAY": 2.0
},
"rules": [
{"allow": [".*"], "follow": false}
]
}
Behavior:
- Wait for
.quote element
- Scroll to bottom
- Wait 2 seconds
- Repeat up to 10 times
- Extract all content from full page
Settings Reference
CSS selector to wait for before extractionTimeout: 30 secondsUse when: Content loads via AJAX/fetchExample:"PLAYWRIGHT_WAIT_SELECTOR": ".article-content"
Additional seconds to wait after page loadUse when: Content appears with unpredictable timingExample:
Enable infinite scroll behaviorRequires: "playwright" in EXTRACTOR_ORDERExample:
Maximum scroll iterationsExample:
Delay between scrolls (seconds)Example:
Identifying JS-Rendered Sites
Signs page needs Playwright:
-
Minimal HTML: View page source shows almost empty body
<div id="app"></div>
<script src="bundle.js"></script>
-
Content in script tags: Data exists only as JavaScript objects
-
Loading placeholders: “Loading…” text or spinner elements
-
Generic extractors fail: Newspaper/trafilatura return no content
Test without browser:
scrapai inspect https://example.com/article --project proj
scrapai analyze data/proj/spider/analysis/page.html
If HTML is minimal/empty → use Playwright
Common Wait Selectors
/* Article loaded */
.article-content
.post-body
#main-content
/* Generic "loaded" markers */
.loaded
[data-loaded="true"]
/* Specific content */
.quote
.product-info
#posts
/* List items */
ul.items li:nth-child(3) /* Wait for 3rd item */
Reduce Concurrency
{
"CONCURRENT_REQUESTS": 2, // Low concurrency for browser requests
"DOWNLOAD_DELAY": 2
}
Why:
- Browser instances use significant memory
- Too many concurrent browsers → system overload
Use Playwright Only When Needed
// Good: Try fast extractors first
"EXTRACTOR_ORDER": ["newspaper", "playwright"]
// Bad: Always use slow browser
"EXTRACTOR_ORDER": ["playwright"]
Hybrid Strategy
{
"settings": {
"EXTRACTOR_ORDER": ["trafilatura", "playwright"],
"CONCURRENT_REQUESTS": 8
}
}
Behavior:
- Most pages use fast trafilatura
- Only JS-heavy pages trigger Playwright
- Average concurrency remains high
Examples
SPA (React/Vue/Angular)
{
"name": "spa_blog",
"settings": {
"EXTRACTOR_ORDER": ["playwright", "trafilatura"],
"PLAYWRIGHT_WAIT_SELECTOR": "article.post",
"PLAYWRIGHT_DELAY": 2,
"CONCURRENT_REQUESTS": 2,
"DOWNLOAD_DELAY": 3
}
}
Delayed Content
{
"settings": {
"EXTRACTOR_ORDER": ["playwright", "custom"],
"PLAYWRIGHT_WAIT_SELECTOR": ".content-loaded",
"PLAYWRIGHT_DELAY": 5,
"CUSTOM_SELECTORS": {
"title": "h1.title",
"content": "div.main-content"
}
}
}
{
"name": "quotes_scroll",
"source_url": "https://quotes.toscrape.com/scroll",
"allowed_domains": ["quotes.toscrape.com"],
"start_urls": ["https://quotes.toscrape.com/scroll"],
"settings": {
"EXTRACTOR_ORDER": ["playwright", "trafilatura"],
"PLAYWRIGHT_WAIT_SELECTOR": ".quote",
"INFINITE_SCROLL": true,
"MAX_SCROLLS": 10,
"SCROLL_DELAY": 2.0,
"CONCURRENT_REQUESTS": 1,
"DOWNLOAD_DELAY": 0
},
"rules": [
{
"allow": [".*"],
"callback": "parse_article",
"follow": false
}
]
}
Note: Set follow: false to prevent following links during scroll.
Cloudflare + Playwright
{
"settings": {
"CLOUDFLARE_ENABLED": true,
"CLOUDFLARE_STRATEGY": "browser_only",
"EXTRACTOR_ORDER": ["playwright", "trafilatura"],
"PLAYWRIGHT_WAIT_SELECTOR": ".article",
"CONCURRENT_REQUESTS": 2,
"DOWNLOAD_DELAY": 5
}
}
Debugging
Check Playwright Logs
scrapai crawl spider --limit 1 --project proj
Look for:
Starting Playwright fetch for https://...
Will wait for selector: .article-content
Will wait additional 2 seconds
Browser navigated
Got HTML from browser: 45678 bytes
Test Wait Selector
# Inspect page
scrapai inspect https://example.com/spa-page --project proj
# Check if selector exists
scrapai analyze page.html --test ".article-content"
If selector not found → Playwright will timeout (30s)
Common Issues
Timeout waiting for selector:
- Selector doesn’t exist on page
- Selector appears after 30s (increase timeout not supported - use delay instead)
- JavaScript error prevents rendering
Solution: Use generic selector or remove PLAYWRIGHT_WAIT_SELECTOR
Content still empty:
- Delay too short (increase
PLAYWRIGHT_DELAY)
- Selector correct but content loads separately
- Page requires interaction (click, scroll) - use
INFINITE_SCROLL if applicable
Browser Client
Playwright uses BrowserClient from utils/browser.py:
Features:
- Headless Chrome
- Async context manager
- Wait for selector support
- Infinite scroll support
- Returns rendered HTML
Usage in extractors:
async with BrowserClient() as browser:
if await browser.goto(
url,
wait_for_selector,
additional_delay,
enable_scroll,
max_scrolls,
scroll_delay
):
html = await browser.get_html()