Spider Settings

Spider settings control extraction strategies, concurrency, delays, and feature flags.

Common Settings

EXTRACTOR_ORDER

string[]

default:"null"

Extraction strategy order (tries each until success)Allowed values:

"newspaper" - Newspaper4k extractor
"trafilatura" - Trafilatura extractor
"custom" - Custom CSS selectors (requires CUSTOM_SELECTORS)
"playwright" - Browser rendering (for JS content)

Example:

"EXTRACTOR_ORDER": ["newspaper", "trafilatura"]

Common patterns:

["newspaper", "trafilatura"] - Generic news/blogs
["custom", "newspaper"] - Custom selectors with fallback
["playwright", "trafilatura"] - JS-rendered content

See Extractors Overview for strategy details.

CUSTOM_SELECTORS

object

default:"null"

CSS selectors for custom extractionRequired for: "custom" in EXTRACTOR_ORDERStandard fields (map to DB columns):

title → scraped_items.title
content → scraped_items.content
author → scraped_items.author
date → scraped_items.published_date

Any other fields → stored in metadata_json columnExample:

"CUSTOM_SELECTORS": {
  "title": "h1.article-title",
  "content": "div.article-body",
  "author": "span.author-name",
  "date": "time.published-date",
  "category": "a.category-link"
}

See Custom Extractors for details.

CONCURRENT_REQUESTS

integer

default:"null"

Maximum concurrent requestsValidation:

Min: 1
Max: 32

Recommended:

Small sites: 8-16
Large sites: 16-32
Playwright enabled: 2-4

Example:

"CONCURRENT_REQUESTS": 16

DOWNLOAD_DELAY

float

default:"null"

Delay between requests (seconds)Validation:

Min: 0
Max: 60

Recommended:

Polite crawling: 1-2
Aggressive: 0-0.5
Rate-limited sites: 2-5

Example:

"DOWNLOAD_DELAY": 1

Cloudflare Settings

CLOUDFLARE_ENABLED

boolean

default:"null"

Enable Cloudflare bypassExample:

"CLOUDFLARE_ENABLED": true

CLOUDFLARE_STRATEGY

string

default:"null"

Cloudflare bypass strategyAllowed values:

"hybrid" - Try normal request first, fallback to browser
"browser_only" - Always use browser

Example:

"CLOUDFLARE_STRATEGY": "hybrid"

Playwright Settings

PLAYWRIGHT_WAIT_SELECTOR

string

default:"null"

CSS selector to wait for before extractingTimeout: 30 secondsExample:

"PLAYWRIGHT_WAIT_SELECTOR": ".article-content"

See Playwright Extractor for details.

PLAYWRIGHT_DELAY

float

default:"null"

Additional seconds to wait after page loadExample:

"PLAYWRIGHT_DELAY": 5

INFINITE_SCROLL

boolean

default:"null"

Enable infinite scroll behaviorRequires: "playwright" in EXTRACTOR_ORDERExample:

"INFINITE_SCROLL": true

MAX_SCROLLS

integer

default:"5"

Maximum scroll iterationsUsed with: INFINITE_SCROLL: trueExample:

"MAX_SCROLLS": 10

SCROLL_DELAY

float

default:"1.0"

Delay between scrolls (seconds)Used with: INFINITE_SCROLL: trueExample:

"SCROLL_DELAY": 2.0

Scrapy Settings

ROBOTSTXT_OBEY

boolean

default:"null"

Respect robots.txtExample:

"ROBOTSTXT_OBEY": true

DEPTH_LIMIT

integer

default:"null"

Maximum crawl depthExample:

"DEPTH_LIMIT": 3

Advanced Features

DELTAFETCH_ENABLED

boolean

default:"null"

Enable delta fetching (skip already-crawled URLs)Example:

"DELTAFETCH_ENABLED": true

Configuration Examples

News Site (Generic Extractors)

{
  "settings": {
    "EXTRACTOR_ORDER": ["newspaper", "trafilatura"],
    "DOWNLOAD_DELAY": 1,
    "CONCURRENT_REQUESTS": 16,
    "ROBOTSTXT_OBEY": true
  }
}

E-commerce (Custom Selectors)

{
  "settings": {
    "EXTRACTOR_ORDER": ["custom"],
    "CUSTOM_SELECTORS": {
      "title": "h1.product-name",
      "content": "div.description",
      "price": "span.price",
      "rating": "div.rating::attr(data-rating)"
    },
    "DOWNLOAD_DELAY": 2,
    "CONCURRENT_REQUESTS": 8
  }
}

JS-Rendered Site

{
  "settings": {
    "EXTRACTOR_ORDER": ["playwright", "trafilatura"],
    "PLAYWRIGHT_WAIT_SELECTOR": ".article-content",
    "PLAYWRIGHT_DELAY": 3,
    "DOWNLOAD_DELAY": 2,
    "CONCURRENT_REQUESTS": 2
  }
}

Cloudflare-Protected Site

{
  "settings": {
    "CLOUDFLARE_ENABLED": true,
    "CLOUDFLARE_STRATEGY": "hybrid",
    "EXTRACTOR_ORDER": ["newspaper", "trafilatura"],
    "DOWNLOAD_DELAY": 3,
    "CONCURRENT_REQUESTS": 4
  }
}

Infinite Scroll Page

{
  "settings": {
    "EXTRACTOR_ORDER": ["playwright", "trafilatura"],
    "PLAYWRIGHT_WAIT_SELECTOR": ".quote",
    "INFINITE_SCROLL": true,
    "MAX_SCROLLS": 10,
    "SCROLL_DELAY": 2.0,
    "DOWNLOAD_DELAY": 0,
    "CONCURRENT_REQUESTS": 1
  },
  "rules": [
    {
      "allow": [".*"],
      "follow": false
    }
  ]
}

Custom Settings

Arbitrary key-value pairs for custom Scrapy settings (passed directly without validation):

{
  "settings": {
    "USER_AGENT": "MyBot/1.0",
    "AUTOTHROTTLE_ENABLED": true,
    "HTTPCACHE_ENABLED": true
  }
}

Validation Errors

Unknown Extractor

Unknown extractor: custom_parser. 
Allowed: newspaper, trafilatura, custom, playwright

Fix: Use valid extractor name

Invalid Cloudflare Strategy

Invalid Cloudflare strategy: browser. 
Allowed: hybrid, browser_only

Fix: Use "hybrid" or "browser_only"

Out of Range

CONCURRENT_REQUESTS must be >= 1 and <= 32

Fix: Use value within allowed range

Extractors Overview - Extraction strategies
Newspaper Extractor - Newspaper4k details
Trafilatura Extractor - Trafilatura details
Custom Extractors - CSS selector extraction
Playwright Extractor - Browser rendering
Spider Schema - Complete configuration

​Common Settings

​Cloudflare Settings

​Playwright Settings

​Scrapy Settings

​Advanced Features

​Configuration Examples

​News Site (Generic Extractors)

​E-commerce (Custom Selectors)

​JS-Rendered Site

​Cloudflare-Protected Site

​Infinite Scroll Page

​Custom Settings

​Validation Errors

​Unknown Extractor

​Invalid Cloudflare Strategy

​Out of Range

​Related

Common Settings

Cloudflare Settings

Playwright Settings

Scrapy Settings

Advanced Features

Configuration Examples

News Site (Generic Extractors)

E-commerce (Custom Selectors)

JS-Rendered Site

Cloudflare-Protected Site

Infinite Scroll Page

Custom Settings

Validation Errors

Unknown Extractor

Invalid Cloudflare Strategy

Out of Range

Related