sections is the recommended way to author a spider. You write one section per kind of page the crawl meets — one article layout, one product layout, the listing pages that only link onward — and that single list replaces hand-writing rules + callbacks + settings.FIELDS.
sections is desugared into the lower-level rules + callbacks + settings.FIELDS shape at import time. The runtime spiders are unchanged — sections are purely an authoring convenience. Configs that already use rules/callbacks/FIELDS pass through untouched.

Why sections

A section says two things: which URLs it matches and how to extract from them. One repeating concept gets one section. Same article layout everywhere → one section. Pages that differ in structure or fields → give each its own section.
A spider is not one function. Write as many sections as the site has kinds of pages — never force structurally-different pages through one extract spec.

The section record

Each entry is an object. Only match and extract carry the meaning; the rest are optional knobs.
{ "match": ["/articles/.*"], "extract": "auto", "follow": true, "priority": 100 }
match
string[]
List of URL regex patterns the section applies to. Absent = match all URLs. Becomes the rule’s allow.
extract
"auto" | object
How to extract. Exactly one of three forms — see Extract modes. Absent = follow-only (navigation).
follow
boolean
default: true
Whether to follow links found on matched pages.
priority
integer
Optional, 0 to 1000. Higher is evaluated first.
deny
string[]
URL regexes to exclude. Carried straight onto the rule.
restrict_xpaths
string[]
Restrict link extraction to regions matching these XPaths. Carried onto the rule.
restrict_css
string[]
Restrict link extraction to regions matching these CSS selectors. Carried onto the rule.
tags
string[]
HTML tags to extract links from. Carried onto the rule.
Unknown keys and wrong types are rejected at import — the section schema forbids extra keys. Transport, throughput, PDF_MODE, USE_SITEMAP, and DeltaFetch all stay in top-level settings, never per-section.
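The rejection rule above can be pictured with a small check. This is only a sketch: `validate_section` and `ALLOWED_KEYS` are hypothetical names, the real schema lives in the importer, and element types inside the lists are not checked here.

```python
# Hypothetical validator mirroring the section record described above.
# The key names and their types come from this page; everything else is assumed.
ALLOWED_KEYS = {
    "match": list,
    "extract": (str, dict),
    "follow": bool,
    "priority": int,
    "deny": list,
    "restrict_xpaths": list,
    "restrict_css": list,
    "tags": list,
}


def validate_section(section):
    """Return a list of problems; an empty list means the record looks valid."""
    problems = []
    for key, value in section.items():
        if key not in ALLOWED_KEYS:
            problems.append(f"unknown key: {key}")
        elif isinstance(value, bool) and ALLOWED_KEYS[key] is not bool:
            # bool is a subclass of int in Python, so reject it explicitly.
            problems.append(f"wrong type for {key}")
        elif not isinstance(value, ALLOWED_KEYS[key]):
            problems.append(f"wrong type for {key}")
    return problems
```

A top-level setting such as PDF_MODE placed inside a section would surface here as an unknown key.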

Extract modes

extract is exactly one of three things.
Omit extract entirely. The page is crawled for links but nothing is extracted from it. This is the listing / index / navigation section.
{ "match": ["/news$"], "follow": true }
Desugars to a rule with callback: null.
Set extract to the string "auto". The built-in article reader fills the four core fields: title, content, author, published_date. Use this for ordinary article / blog pages.
{ "match": ["/news/articles/.*"], "extract": "auto" }
Desugars to a rule with callback: "parse_article".
Set extract to a { field: value } dict with one entry per schema field. Each value is either:
  • "auto" — valid only for the four core fields (title, content, author, published_date); or
  • a directive { "css" | "xpath": "..." } (optionally get_all, to_text, to_markdown, processors) for any field, core or not.
{
  "match": ["/product/.*"],
  "extract": {
    "name":  { "css": "h1.title::text" },
    "price": { "css": "span.price::text" }
  }
}
The directive shape is identical to a FIELDS / callback directive — see Extractors and Data Processors.
"auto" on a non-core field is rejected at import. The core fields are exactly title, content, author, published_date. Give any other field an explicit selector.
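As an illustration of the three forms and the core-field restriction, the sketch below classifies a section and rejects "auto" on a non-core field. `extract_mode` is a hypothetical helper, not the importer's actual function, and its error messages are invented for the example.

```python
# The four core fields named on this page.
CORE_FIELDS = {"title", "content", "author", "published_date"}


def extract_mode(section):
    """Classify a section as 'follow-only', 'auto', or 'dict' (hypothetical helper)."""
    if "extract" not in section:
        return "follow-only"
    extract = section["extract"]
    if extract == "auto":
        return "auto"
    if not isinstance(extract, dict):
        raise ValueError('extract must be "auto" or a dict')
    for field, value in extract.items():
        if value == "auto":
            # "auto" is valid only for the four core fields.
            if field not in CORE_FIELDS:
                raise ValueError(f'"auto" is not allowed on non-core field {field!r}')
        elif not (isinstance(value, dict) and ("css" in value or "xpath" in value)):
            raise ValueError(f"field {field!r} needs a css or xpath directive")
    return "dict"
```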

The auto + override rule

Keep "auto" for the core fields the reader gets right; add a selector only for fields it can’t produce (anything non-core) or gets wrong. A non-core field like images does not mean you hand-write content — keep content on "auto" and just add the images selector.
{
  "match": ["/articles/.*"],
  "extract": {
    "title":   "auto",
    "content": "auto",
    "author":  { "css": ".byline a::text" },
    "images":  { "css": "figure img::attr(src)", "get_all": true }
  }
}
Mixing "auto" with a selector override writes to the spider-wide FIELDS dict, which every section shares, so at most one section per spider may mix "auto" with overrides. Other sections must give explicit selectors for every field. Violating this is rejected at import.
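A minimal sketch of that constraint, assuming a section counts as "mixed" when its extract dict holds at least one "auto" value and at least one directive. Both function names are hypothetical.

```python
def mixes_auto_with_overrides(section):
    """True when the extract dict has both an "auto" value and a selector directive."""
    extract = section.get("extract")
    if not isinstance(extract, dict):
        return False
    values = list(extract.values())
    has_auto = any(v == "auto" for v in values)
    has_selector = any(isinstance(v, dict) for v in values)
    return has_auto and has_selector


def check_single_mixed_section(sections):
    """Raise if more than one section mixes "auto" with overrides."""
    mixed = [i for i, s in enumerate(sections) if mixes_auto_with_overrides(s)]
    if len(mixed) > 1:
        raise ValueError(f"sections {mixed} mix auto with overrides; only one may")
```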

How it desugars

At import, core/sections.py translates each section into the legacy shape:
Section extract | Becomes
absent | a rule with callback: null (follow-only)
"auto" | a rule with callback: "parse_article"
dict with any "auto" field | a rule with callback: "parse_article"; selector overrides merge into settings.FIELDS
dict, all selectors | a rule with callback: "parse_section_<n>" + a matching callback holding those selectors
In every case match → the rule’s allow, and follow / priority / deny / restrict_xpaths / restrict_css / tags are carried straight onto the rule.
Because it is a pure translation, anything expressible as sections is also expressible as rules + callbacks + settings. A handful of features are still authored the legacy way (iterate listing→detail, ajax_nested_list, JS PAGINATED_LISTINGS) — write rules/callbacks directly for those. See the Spider Schema.
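The translation can be sketched in a few lines. This is not the code in core/sections.py: the function name `desugar`, the use of the section's list index for `<n>`, and the exact shape of the returned callbacks dict are all assumptions made for illustration.

```python
# Keys carried straight onto the rule, as listed above.
PASSTHROUGH = ("follow", "priority", "deny", "restrict_xpaths", "restrict_css", "tags")


def desugar(sections):
    """Translate sections into a rules + callbacks + settings.FIELDS shape (sketch)."""
    rules, callbacks, fields = [], {}, {}
    for n, section in enumerate(sections):
        rule = {k: section[k] for k in PASSTHROUGH if k in section}
        if "match" in section:
            rule["allow"] = section["match"]  # match becomes the rule's allow
        extract = section.get("extract")
        if extract is None:
            rule["callback"] = None  # follow-only section
        elif extract == "auto" or "auto" in extract.values():
            rule["callback"] = "parse_article"
            if isinstance(extract, dict):
                # Selector overrides merge into the spider-wide FIELDS dict.
                fields.update({f: d for f, d in extract.items() if d != "auto"})
        else:
            name = f"parse_section_{n}"  # assumed numbering: the section's index
            rule["callback"] = name
            callbacks[name] = dict(extract)
        rules.append(rule)
    return {"rules": rules, "callbacks": callbacks, "settings": {"FIELDS": fields}}
```

Because the function only rearranges what the sections already say, the same spider could be written by hand in the lower-level shape it returns.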

Complete example

Reading top to bottom: the article section pins a stubborn author while keeping the rest on "auto" (the one allowed auto + override section); the product section gives one directive per non-core field; the final { "match": [".*"], "follow": true } is the follow-only navigation section.
spider.json
{
  "name": "domain_com",
  "allowed_domains": ["domain.com"],
  "start_urls": ["https://domain.com/articles"],
  "sections": [
    {
      "match": ["/articles/.*"],
      "extract": {
        "title":   "auto",
        "content": "auto",
        "author":  { "css": ".byline a::text" }
      }
    },
    {
      "match": ["/products/.*"],
      "extract": {
        "name":  { "css": "h1::text" },
        "price": { "css": ".price::text" }
      }
    },
    { "match": [".*"], "follow": true }
  ],
  "settings": {
    "DOWNLOAD_DELAY": 0,
    "CONCURRENT_REQUESTS": 32,
    "CONCURRENT_REQUESTS_PER_DOMAIN": 16,
    "AUTOTHROTTLE_ENABLED": false
  }
}
Plain article site (core fields only) — one "auto" article section plus a follow-only navigation section:
spider.json
{
  "sections": [
    { "match": ["/blog/[^/]+$"], "extract": "auto", "follow": false, "priority": 100 },
    { "match": ["/blog$"], "follow": true, "priority": 50 }
  ]
}
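To see how the two sections in this example divide URLs, the sketch below picks a section for a URL. It assumes three things this page does not state: patterns are applied with regex search semantics, the first matching section in priority order wins, and a missing priority counts as 0. `first_matching_section` is a hypothetical helper.

```python
import re


def first_matching_section(url, sections):
    """Return the highest-priority section whose match patterns hit the URL, else None."""
    # Higher priority is evaluated first; sorted() is stable, so ties keep list order.
    ordered = sorted(sections, key=lambda s: s.get("priority", 0), reverse=True)
    for section in ordered:
        patterns = section.get("match")
        # An absent match means the section applies to all URLs.
        if patterns is None or any(re.search(p, url) for p in patterns):
            return section
    return None


sections = [
    {"match": ["/blog/[^/]+$"], "extract": "auto", "follow": False, "priority": 100},
    {"match": ["/blog$"], "follow": True, "priority": 50},
]
article = first_matching_section("https://domain.com/blog/my-post", sections)
listing = first_matching_section("https://domain.com/blog", sections)
```

Under those assumptions, `article` is the "auto" section and `listing` is the follow-only section.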
Every required: true field in project.json must be sourced by some section. Verify with a 5-article test crawl before importing the final spider.
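Before the test crawl, a quick static check can flag required fields that no section sources. The sketch assumes you have already collected the required field names from project.json into a set; the layout of that file is not shown here, so no parsing is attempted. `unsourced_required_fields` is a hypothetical helper.

```python
CORE_FIELDS = {"title", "content", "author", "published_date"}


def unsourced_required_fields(required, sections):
    """Return the required field names that no section extracts."""
    sourced = set()
    for section in sections:
        extract = section.get("extract")
        if extract == "auto":
            sourced |= CORE_FIELDS  # "auto" fills the four core fields
        elif isinstance(extract, dict):
            sourced |= set(extract)  # one entry per field, "auto" or directive
    return set(required) - sourced
```

An empty result means every required field has at least one source; it does not prove the selectors work, which is what the test crawl is for.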

Rules

The lower-level rule shape sections desugar to

Callbacks

Named callbacks and field directives

Spider Schema

Full spider config reference

Extractors

Selector and directive syntax