Common Settings
Extraction strategy order (tries each until success)
Allowed values:
- "newspaper" - Newspaper4k extractor
- "trafilatura" - Trafilatura extractor
- "custom" - Custom CSS selectors (requires CUSTOM_SELECTORS)
- "playwright" - Browser rendering (for JS content)
Common patterns:
- ["newspaper", "trafilatura"] - Generic news/blogs
- ["custom", "newspaper"] - Custom selectors with fallback
- ["playwright", "trafilatura"] - JS-rendered content
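
The "tries each until success" behaviour can be pictured with a small sketch. This is not the project's implementation: the function name `extract_with_fallback`, the extractor signature (HTML in, text or None out), and the stand-in lambdas are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical extractor signature: takes HTML, returns text or None on failure.
Extractor = Callable[[str], Optional[str]]


def extract_with_fallback(
    html: str, order: List[str], extractors: Dict[str, Extractor]
) -> Tuple[Optional[str], Optional[str]]:
    """Try each named extractor in order; return (name, text) for the first success."""
    for name in order:
        extractor = extractors.get(name)
        if extractor is None:
            raise ValueError(f"Unknown extractor: {name!r}")
        try:
            text = extractor(html)
        except Exception:
            text = None  # an exception counts as a failed attempt
        if text:
            return name, text
    return None, None  # every strategy failed


# Stand-in extractors: the first one fails, the second one succeeds.
demo_extractors = {
    "newspaper": lambda html: None,
    "trafilatura": lambda html: "article body",
}
print(extract_with_fallback("<html></html>", ["newspaper", "trafilatura"], demo_extractors))
# prints ('trafilatura', 'article body')
```

Putting the cheapest strategy first keeps the slower ones, such as browser rendering, for pages where the earlier attempts return nothing.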

CSS selectors for custom extraction
Required for: "custom" in EXTRACTOR_ORDER
Standard fields (map to DB columns):
- title → scraped_items.title
- content → scraped_items.content
- author → scraped_items.author
- date → scraped_items.published_date
Other fields: metadata_json column
See Custom Extractors for details.

Maximum concurrent requests
Validation:
- Min: 1
- Max: 32
Typical values:
- Small sites: 8-16
- Large sites: 16-32
- Playwright enabled: 2-4

Delay between requests (seconds)
Validation:
- Min: 0
- Max: 60
Typical values:
- Polite crawling: 1-2
- Aggressive: 0-0.5
- Rate-limited sites: 2-5
Cloudflare Settings
Enable Cloudflare bypass
Use when: Site returns “Checking your browser” or 403 errors

Cloudflare bypass strategy
Allowed values:
- "hybrid" - Try normal request first, fall back to browser
- "browser_only" - Always use browser
Playwright Settings
CSS selector to wait for before extracting
Use when: Content loads after page renders (AJAX, lazy loading)
Timeout: 30 seconds
See Playwright Extractor for details.

Additional seconds to wait after page load
Use when: Content loads with unpredictable timing
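
The two wait options above could be combined roughly as in the sketch below. It assumes `page` is a Playwright sync-API `Page` (or any object with the same `wait_for_selector`, `wait_for_timeout`, and `content` methods). The function name `render_html` and its parameter names are hypothetical, and the 30-second timeout comes from the description above.

```python
def render_html(page, wait_selector=None, extra_wait_seconds=0.0):
    """Wait for optional conditions on an already-loaded page, then return its HTML.

    `page` is assumed to behave like a Playwright sync-API Page.
    """
    if wait_selector:
        # Block until the selector appears, or fail after 30 seconds (milliseconds).
        page.wait_for_selector(wait_selector, timeout=30_000)
    if extra_wait_seconds > 0:
        # Fixed extra pause for content with unpredictable timing (milliseconds).
        page.wait_for_timeout(extra_wait_seconds * 1000)
    return page.content()
```

A selector wait is usually the better first choice, since it returns as soon as the content is present. A fixed pause always costs its full duration.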

Enable infinite scroll behavior
Use when: Single-page sites load content on scroll
Requires: "playwright" in EXTRACTOR_ORDER

Maximum scroll iterations
Used with: INFINITE_SCROLL: true

Delay between scrolls (seconds)
Used with: INFINITE_SCROLL: true

Scrapy Settings
Respect robots.txt

Maximum crawl depth
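
For orientation, Scrapy's own names for these two behaviours are `ROBOTSTXT_OBEY` and `DEPTH_LIMIT`. Whether this schema uses the same keys is not shown here, so the fragment below is a sketch in plain Scrapy terms and the depth value is only an illustration.

```python
# Plain Scrapy settings; the key names used by this schema may differ.
scrapy_settings = {
    "ROBOTSTXT_OBEY": True,  # honour the site's robots.txt rules
    "DEPTH_LIMIT": 3,        # stop following links more than 3 hops deep; 0 means no limit
}
```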
Advanced Features
Enable delta fetching (skip already-crawled URLs)
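
The idea behind delta fetching can be sketched with an in-memory set of URL fingerprints. This illustrates the concept only and is not how the feature is implemented here. The class name `SeenUrls` is hypothetical, and a real crawler would persist the fingerprints between runs.

```python
import hashlib


class SeenUrls:
    """Remember which URLs were already crawled and skip them next time."""

    def __init__(self):
        self._seen = set()

    @staticmethod
    def fingerprint(url: str) -> str:
        # Hash the trimmed URL so the stored keys have a fixed size.
        return hashlib.sha1(url.strip().encode("utf-8")).hexdigest()

    def should_fetch(self, url: str) -> bool:
        """Return True the first time a URL is offered, False afterwards."""
        fp = self.fingerprint(url)
        if fp in self._seen:
            return False
        self._seen.add(fp)
        return True


seen = SeenUrls()
print(seen.should_fetch("https://example.com/a"))  # prints True
print(seen.should_fetch("https://example.com/a"))  # prints False
```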
Configuration Examples
News Site (Generic Extractors)
E-commerce (Custom Selectors)
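
A sketch of what a custom-selector setup might look like, written as Python data. Only the `EXTRACTOR_ORDER` and `CUSTOM_SELECTORS` key names and the four standard field names come from this page. The CSS selectors, the `price` field, and the `split_fields` helper are hypothetical, and routing non-standard fields into `metadata_json` is an assumption based on the column list above.

```python
import json

config = {
    "EXTRACTOR_ORDER": ["custom", "newspaper"],
    "CUSTOM_SELECTORS": {
        "title": "h1.product-title",        # hypothetical selector
        "content": "div.product-description",  # hypothetical selector
        "price": "span.price",              # hypothetical non-standard field
    },
}

# Standard field name -> scraped_items column, as listed in Common Settings.
STANDARD_FIELDS = {
    "title": "title",
    "content": "content",
    "author": "author",
    "date": "published_date",
}


def split_fields(extracted):
    """Separate standard columns from extra fields (assumed to go to metadata_json)."""
    columns, metadata = {}, {}
    for field, value in extracted.items():
        if field in STANDARD_FIELDS:
            columns[STANDARD_FIELDS[field]] = value
        else:
            metadata[field] = value
    return columns, json.dumps(metadata, sort_keys=True)
```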
JS-Rendered Site
Cloudflare-Protected Site
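
The difference between the two bypass strategies can be sketched as follows. The function names, the `(status, body)` return shape, and the two fetcher callables are assumptions made for illustration. The block signals (a 403 status or a “Checking your browser” page) are the ones named under Cloudflare Settings.

```python
def looks_blocked(status, body):
    """Heuristic from the settings description: 403 or a browser-check page."""
    return status == 403 or "Checking your browser" in body


def fetch(url, strategy, plain_fetch, browser_fetch):
    """Fetch a URL with either strategy; both fetchers return (status, body)."""
    if strategy not in ("hybrid", "browser_only"):
        raise ValueError('strategy must be "hybrid" or "browser_only"')
    if strategy == "browser_only":
        return browser_fetch(url)
    # hybrid: try the cheap request first, use the browser only if blocked.
    status, body = plain_fetch(url)
    if looks_blocked(status, body):
        return browser_fetch(url)
    return status, body
```

"hybrid" keeps most requests cheap when only some pages are challenged, while "browser_only" avoids a wasted first request on sites that challenge everything.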
Infinite Scroll Page
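
The scroll options (enable, maximum iterations, delay between scrolls) suggest a loop along these lines. This is a sketch, not the project's code. `page` is assumed to be a Playwright sync-API `Page`, and the function name, parameter names, and default values are hypothetical.

```python
def scroll_to_bottom(page, max_scrolls=10, scroll_delay=1.0):
    """Scroll until the page stops growing or max_scrolls is reached.

    Returns the number of scrolls performed.
    """
    last_height = page.evaluate("document.body.scrollHeight")
    scrolls = 0
    for _ in range(max_scrolls):
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(scroll_delay * 1000)  # let new content load (ms)
        scrolls += 1
        new_height = page.evaluate("document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so stop early
        last_height = new_height
    return scrolls
```

The iteration cap matters on feeds that never end. Without it the loop would only stop when the page height stops changing.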
Custom Settings
The schema allows arbitrary key-value pairs for custom Scrapy settings.
Validation Errors
Unknown Extractor
Invalid Cloudflare Strategy
Allowed values: "hybrid" or "browser_only"
Out of Range
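
The three error classes above could be checked roughly as in this sketch. The allowed values and ranges come from this page, while the function name, its parameter names, and the error message wording are hypothetical.

```python
ALLOWED_EXTRACTORS = {"newspaper", "trafilatura", "custom", "playwright"}
ALLOWED_CF_STRATEGIES = {"hybrid", "browser_only"}


def validate(extractor_order, cf_strategy, concurrent_requests, delay_seconds,
             custom_selectors=None):
    """Return a list of human-readable validation errors (empty if valid)."""
    errors = []
    for name in extractor_order:
        if name not in ALLOWED_EXTRACTORS:
            errors.append(f"Unknown extractor: {name}")
    if "custom" in extractor_order and not custom_selectors:
        errors.append('"custom" requires CUSTOM_SELECTORS')
    if cf_strategy not in ALLOWED_CF_STRATEGIES:
        errors.append('Invalid Cloudflare strategy: use "hybrid" or "browser_only"')
    if not 1 <= concurrent_requests <= 32:
        errors.append("Out of range: concurrent requests must be 1-32")
    if not 0 <= delay_seconds <= 60:
        errors.append("Out of range: delay must be 0-60 seconds")
    return errors
```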
Related
- Extractors Overview - Extraction strategies
- Newspaper Extractor - Newspaper4k details
- Trafilatura Extractor - Trafilatura details
- Custom Extractors - CSS selector extraction
- Playwright Extractor - Browser rendering
- Spider Schema - Complete configuration