sections is the recommended way to author a spider. You write one section per kind of page the crawl meets — one article layout, one product layout, the listing pages that only link onward — and that single list replaces hand-writing rules + callbacks + settings.FIELDS.
Why sections
A section says two things: which URLs it matches and how to extract from them. One repeating concept gets one section. Same article layout everywhere → one section. Pages that differ in structure or fields → give each its own section.
The section record
Each entry is an object. Only match and extract carry the meaning; the rest are optional knobs.
- match: List of URL regex patterns the section applies to. Absent = match all URLs. Becomes the rule’s allow.
- extract: How to extract. Exactly one of three forms (see Extract modes). Absent = follow-only (navigation).
- follow: Whether to follow links found on matched pages.
- priority: Optional, 0–1000. Higher is evaluated first.
- deny: URL regexes to exclude. Carried straight onto the rule.
- restrict_xpaths: Restrict link extraction to regions matching these XPaths. Carried onto the rule.
- restrict_css: Restrict link extraction to regions matching these CSS selectors. Carried onto the rule.
- tags: HTML tags to extract links from. Carried onto the rule.
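As a rough illustration of the record described above, the sketch below writes one section with every field set and checks it against the constraints this page states. The field names come from this page; the `check_section` helper, the URL patterns, the list type for `tags`, and the selector values are hypothetical and not part of the tool.

```python
import re

# One section record with every documented field set.
# Patterns and selectors are made-up illustration values.
section = {
    "match": [r"/blog/\d{4}/.+"],    # URL regexes; becomes the rule's allow
    "extract": "auto",               # one of the three extract forms
    "follow": True,                  # follow links found on matched pages
    "priority": 500,                 # optional, 0-1000, higher is evaluated first
    "deny": [r"/blog/tag/"],         # URL regexes to exclude
    "restrict_xpaths": ["//main"],   # only take links from these XPath regions
    "restrict_css": ["article"],     # only take links from these CSS regions
    "tags": ["a"],                   # HTML tags to extract links from
}


def check_section(record):
    """Hypothetical helper: return a list of problems found in a section record."""
    problems = []
    for key in ("match", "deny"):
        for pattern in record.get(key, []):
            try:
                re.compile(pattern)
            except re.error as exc:
                problems.append(f"{key}: bad regex {pattern!r} ({exc})")
    priority = record.get("priority")
    if priority is not None and not 0 <= priority <= 1000:
        problems.append(f"priority {priority} is outside 0-1000")
    return problems


print(check_section(section))
print(check_section({"match": ["("], "priority": 2000}))
```

The first call should report no problems; the second passes an unbalanced regex and an out-of-range priority, so both checks fire.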
Extract modes
extract is exactly one of three things.
Absent → follow-only navigation
Nothing is extracted; the section exists only to navigate. Desugars to a rule with callback: null.
"auto" → built-in article reader
The built-in article reader fills the four core fields: title, content, author, published_date. Use this for ordinary article / blog pages. Desugars to a rule with callback: "parse_article".
Selector dict → per-field extraction
A { field: value } dict, one entry per schema field. Each value is either:
- "auto", valid only for the four core fields (title, content, author, published_date); or
- a directive { "css" | "xpath": "..." } (optionally get_all, to_text, to_markdown, processors) for any field, core or not.
The directive shape is identical to a FIELDS / callback directive; see Extractors and Data Processors.
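The three forms can be told apart mechanically. The sketch below is one way to do it, assuming a section is a plain dict; `extract_mode` is a hypothetical helper written for this page, not a function the tool is known to expose.

```python
CORE_FIELDS = {"title", "content", "author", "published_date"}


def extract_mode(section):
    """Hypothetical helper: name the extract form a section uses."""
    extract = section.get("extract")
    if extract is None:
        return "follow-only"          # absent: navigation only
    if extract == "auto":
        return "auto"                 # built-in article reader
    if isinstance(extract, dict):
        for field, value in extract.items():
            if value == "auto":
                # "auto" is only valid for the four core fields
                if field not in CORE_FIELDS:
                    raise ValueError(f'"auto" is not valid for non-core field {field!r}')
            elif not (isinstance(value, dict) and value.keys() & {"css", "xpath"}):
                # anything else must be a directive with a css or xpath key
                raise ValueError(f"{field!r} needs a css or xpath directive")
        return "selector-dict"
    raise ValueError(f"unsupported extract value: {extract!r}")


print(extract_mode({"match": [".*"], "follow": True}))
print(extract_mode({"match": ["/blog/.+"], "extract": "auto"}))
print(extract_mode({"extract": {"title": "auto", "price": {"css": ".price"}}}))
```

A dict such as `{"price": "auto"}` raises `ValueError` here, since `price` is not one of the four core fields.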
The auto + override rule
Keep "auto" for the core fields the reader gets right; add a selector only for fields it can’t produce (anything non-core) or gets wrong. A non-core field like images does not mean you hand-write content: keep content on "auto" and just add the images selector.
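To make the rule concrete, here are two `extract` dicts for the same article page, written as Python literals. The selectors are invented for illustration, and the boolean values given to `get_all` and `to_markdown` are an assumption about how those options are spelled.

```python
# Over-specified: content is hand-written only because images needed a selector.
over_specified = {
    "title": "auto",
    "content": {"css": "div.article-body", "to_markdown": True},
    "author": "auto",
    "published_date": "auto",
    "images": {"css": "div.article-body img", "get_all": True},
}

# Preferred: every core field stays on "auto"; only the non-core field gets a selector.
preferred = {
    "title": "auto",
    "content": "auto",
    "author": "auto",
    "published_date": "auto",
    "images": {"css": "div.article-body img", "get_all": True},
}

# Fields that carry a hand-written directive in the preferred form.
overrides = sorted(field for field, value in preferred.items() if value != "auto")
print(overrides)  # ['images']
```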
How it desugars
At import, core/sections.py translates each section into the legacy shape:
| Section extract | Becomes |
|---|---|
| absent | a rule with callback: null (follow-only) |
| "auto" | a rule with callback: "parse_article" |
| dict with any "auto" field | a rule with callback: "parse_article"; selector overrides merge into settings.FIELDS |
| dict, all selectors | a rule with callback: "parse_section_<n>" + a matching callback holding those selectors |
match → the rule’s allow, and follow / priority / deny / restrict_xpaths / restrict_css / tags are carried straight onto the rule.
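The translation above can be sketched in a few lines. This is not the code in core/sections.py; it is a minimal illustration, and the output shapes (a list of rule dicts, a callbacks dict, a FIELDS dict) and the choice of the section's list index for `<n>` are assumptions.

```python
# Keys copied onto the rule unchanged.
CARRIED = ("follow", "priority", "deny", "restrict_xpaths", "restrict_css", "tags")


def desugar(sections):
    """Translate sections into (rules, callbacks, fields).

    Illustrative only: the output shapes are assumptions, not core/sections.py.
    """
    rules, callbacks, fields = [], {}, {}
    for n, section in enumerate(sections):
        rule = {key: section[key] for key in CARRIED if key in section}
        if "match" in section:
            rule["allow"] = section["match"]      # match becomes the rule's allow
        extract = section.get("extract")
        if extract is None:
            rule["callback"] = None               # follow-only navigation
        elif extract == "auto" or "auto" in extract.values():
            rule["callback"] = "parse_article"
            if isinstance(extract, dict):
                # selector overrides merge into the FIELDS mapping
                fields.update({f: d for f, d in extract.items() if d != "auto"})
        else:
            name = f"parse_section_{n}"           # assumed: n is the list index
            rule["callback"] = name
            callbacks[name] = dict(extract)       # the callback holds the selectors
        rules.append(rule)
    return rules, callbacks, fields


sections = [
    {"match": ["/news/.+"],
     "extract": {"title": "auto", "content": "auto", "author": {"css": ".byline"}}},
    {"match": ["/shop/.+"], "extract": {"price": {"css": ".price"}}, "priority": 800},
    {"match": [".*"], "follow": True},
]
rules, callbacks, fields = desugar(sections)
for rule in rules:
    print(rule)
print(callbacks)
print(fields)
```

In this sketch the first section lands on `parse_article` with its `author` selector moved into `fields`, the second gets its own `parse_section_1` callback, and the third becomes a rule whose callback is `None`.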
Because it is a pure translation, anything expressible as sections is also expressible as rules + callbacks + settings. A handful of features are still authored the legacy way (iterate listing→detail, ajax_nested_list, JS PAGINATED_LISTINGS); write rules/callbacks directly for those. See the Spider Schema.
Complete example
Reading top to bottom: the article section pins a stubborn author while keeping the rest on "auto" (the one allowed auto + override section); the product section gives one directive per non-core field; the final { "match": [".*"], "follow": true } is the follow-only navigation section.
spider.json
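A spider.json matching the description above might look like the sketch below, embedded as a string and parsed in Python so that its validity as JSON can be checked. Only the final navigation section is taken verbatim from this page; the top-level "sections" key, the URL patterns, the product field names, and all selectors are assumptions made for illustration.

```python
import json

# Hypothetical spider.json contents (illustration values only).
SPIDER_JSON = r"""
{
  "sections": [
    {
      "match": ["/news/\\d+"],
      "extract": {
        "title": "auto",
        "content": "auto",
        "published_date": "auto",
        "author": {"css": "span.byline a"}
      }
    },
    {
      "match": ["/products/[^/]+$"],
      "extract": {
        "name": {"css": "h1.product-title"},
        "price": {"css": "span.price", "to_text": true},
        "sku": {"xpath": "//*[@itemprop='sku']/text()"}
      }
    },
    {"match": [".*"], "follow": true}
  ]
}
"""

spider = json.loads(SPIDER_JSON)
for section in spider["sections"]:
    kind = "follow-only" if "extract" not in section else "extracts"
    print(section["match"], kind)
```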
"auto" article section plus a follow-only navigation section:
spider.json
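A sketch of that minimal shape, again as a JSON string parsed in Python: one "auto" section for articles and one section that only follows links. The "sections" key and the /blog/ pattern are assumptions.

```python
import json

# Hypothetical minimal spider.json: one article section, one navigation section.
MINIMAL_SPIDER_JSON = r"""
{
  "sections": [
    {"match": ["/blog/.+"], "extract": "auto"},
    {"match": [".*"], "follow": true}
  ]
}
"""

minimal = json.loads(MINIMAL_SPIDER_JSON)
print(len(minimal["sections"]))  # 2
```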
Every required: true field in project.json must be sourced by some section. Verify with a 5-article test crawl before importing the final spider.
Related
- Rules: The lower-level rule shape sections desugar to
- Callbacks: Named callbacks and field directives
- Spider Schema: Full spider config reference
- Extractors: Selector and directive syntax