Skip to main content
Spider settings control extraction strategies, concurrency, delays, and feature flags. Settings are flexible key-value pairs with validation for common options.

Common Settings

EXTRACTOR_ORDER
string[]
default:"null"
Extraction strategy order (tries each until success)Allowed values:
  • "newspaper" - Newspaper4k extractor
  • "trafilatura" - Trafilatura extractor
  • "custom" - Custom CSS selectors (requires CUSTOM_SELECTORS)
  • "playwright" - Browser rendering (for JS content)
Example:
"EXTRACTOR_ORDER": ["newspaper", "trafilatura"]
Common patterns:
  • ["newspaper", "trafilatura"] - Generic news/blogs
  • ["custom", "newspaper"] - Custom selectors with fallback
  • ["playwright", "trafilatura"] - JS-rendered content
See Extractors Overview for strategy details.
CUSTOM_SELECTORS
object
default:"null"
CSS selectors for custom extractionRequired for: "custom" in EXTRACTOR_ORDERStandard fields (map to DB columns):
  • title → scraped_items.title
  • content → scraped_items.content
  • author → scraped_items.author
  • date → scraped_items.published_date
Any other fields → stored in metadata_json columnExample:
"CUSTOM_SELECTORS": {
  "title": "h1.article-title",
  "content": "div.article-body",
  "author": "span.author-name",
  "date": "time.published-date",
  "category": "a.category-link"
}
See Custom Extractors for details.
CONCURRENT_REQUESTS
integer
default:"null"
Maximum concurrent requestsValidation:
  • Min: 1
  • Max: 32
Recommended:
  • Small sites: 8-16
  • Large sites: 16-32
  • Playwright enabled: 2-4
Example:
"CONCURRENT_REQUESTS": 16
DOWNLOAD_DELAY
float
default:"null"
Delay between requests (seconds)Validation:
  • Min: 0
  • Max: 60
Recommended:
  • Polite crawling: 1-2
  • Aggressive: 0-0.5
  • Rate-limited sites: 2-5
Example:
"DOWNLOAD_DELAY": 1

Cloudflare Settings

CLOUDFLARE_ENABLED
boolean
default:"null"
Enable Cloudflare bypassUse when: Site returns “Checking your browser” or 403 errorsExample:
"CLOUDFLARE_ENABLED": true
CLOUDFLARE_STRATEGY
string
default:"null"
Cloudflare bypass strategyAllowed values:
  • "hybrid" - Try normal request first, fallback to browser
  • "browser_only" - Always use browser
Example:
"CLOUDFLARE_STRATEGY": "hybrid"

Playwright Settings

PLAYWRIGHT_WAIT_SELECTOR
string
default:"null"
CSS selector to wait for before extractingUse when: Content loads after page renders (AJAX, lazy loading)Timeout: 30 secondsExample:
"PLAYWRIGHT_WAIT_SELECTOR": ".article-content"
See Playwright Extractor for details.
PLAYWRIGHT_DELAY
float
default:"null"
Additional seconds to wait after page loadUse when: Content loads with unpredictable timingExample:
"PLAYWRIGHT_DELAY": 5
INFINITE_SCROLL
boolean
default:"null"
Enable infinite scroll behaviorUse when: Single-page sites load content on scrollRequires: "playwright" in EXTRACTOR_ORDERExample:
"INFINITE_SCROLL": true
MAX_SCROLLS
integer
default:"5"
Maximum scroll iterationsUsed with: INFINITE_SCROLL: trueExample:
"MAX_SCROLLS": 10
SCROLL_DELAY
float
default:"1.0"
Delay between scrolls (seconds)Used with: INFINITE_SCROLL: trueExample:
"SCROLL_DELAY": 2.0

Scrapy Settings

ROBOTSTXT_OBEY
boolean
default:"null"
Respect robots.txtExample:
"ROBOTSTXT_OBEY": true
DEPTH_LIMIT
integer
default:"null"
Maximum crawl depthExample:
"DEPTH_LIMIT": 3

Advanced Features

DELTAFETCH_ENABLED
boolean
default:"null"
Enable delta fetching (skip already-crawled URLs)Example:
"DELTAFETCH_ENABLED": true

Configuration Examples

News Site (Generic Extractors)

{
  "settings": {
    "EXTRACTOR_ORDER": ["newspaper", "trafilatura"],
    "DOWNLOAD_DELAY": 1,
    "CONCURRENT_REQUESTS": 16,
    "ROBOTSTXT_OBEY": true
  }
}

E-commerce (Custom Selectors)

{
  "settings": {
    "EXTRACTOR_ORDER": ["custom"],
    "CUSTOM_SELECTORS": {
      "title": "h1.product-name",
      "content": "div.description",
      "price": "span.price",
      "rating": "div.rating::attr(data-rating)"
    },
    "DOWNLOAD_DELAY": 2,
    "CONCURRENT_REQUESTS": 8
  }
}

JS-Rendered Site

{
  "settings": {
    "EXTRACTOR_ORDER": ["playwright", "trafilatura"],
    "PLAYWRIGHT_WAIT_SELECTOR": ".article-content",
    "PLAYWRIGHT_DELAY": 3,
    "DOWNLOAD_DELAY": 2,
    "CONCURRENT_REQUESTS": 2
  }
}

Cloudflare-Protected Site

{
  "settings": {
    "CLOUDFLARE_ENABLED": true,
    "CLOUDFLARE_STRATEGY": "hybrid",
    "EXTRACTOR_ORDER": ["newspaper", "trafilatura"],
    "DOWNLOAD_DELAY": 3,
    "CONCURRENT_REQUESTS": 4
  }
}

Infinite Scroll Page

{
  "settings": {
    "EXTRACTOR_ORDER": ["playwright", "trafilatura"],
    "PLAYWRIGHT_WAIT_SELECTOR": ".quote",
    "INFINITE_SCROLL": true,
    "MAX_SCROLLS": 10,
    "SCROLL_DELAY": 2.0,
    "DOWNLOAD_DELAY": 0,
    "CONCURRENT_REQUESTS": 1
  },
  "rules": [
    {
      "allow": [".*"],
      "follow": false
    }
  ]
}

Custom Settings

The schema allows arbitrary key-value pairs for custom Scrapy settings:
{
  "settings": {
    "USER_AGENT": "MyBot/1.0",
    "AUTOTHROTTLE_ENABLED": true,
    "HTTPCACHE_ENABLED": true
  }
}
Custom settings are passed directly to Scrapy without validation.

Validation Errors

Unknown Extractor

Unknown extractor: custom_parser. 
Allowed: newspaper, trafilatura, custom, playwright
Fix: Use valid extractor name

Invalid Cloudflare Strategy

Invalid Cloudflare strategy: browser. 
Allowed: hybrid, browser_only
Fix: Use "hybrid" or "browser_only"

Out of Range

CONCURRENT_REQUESTS must be >= 1 and <= 32
Fix: Use value within allowed range