# Sections

> The recommended spider authoring format: one section per kind of page, desugared into rules, callbacks, and FIELDS at import

`sections` is the recommended way to author a spider. You write one **section** per *kind of page* the crawl meets (one article layout, one product layout, the listing pages that only link onward), and that single list replaces hand-writing `rules` + `callbacks` + `settings.FIELDS`.

`sections` is desugared into the lower-level [rules](/api/rules) + [callbacks](/api/callbacks) + `settings.FIELDS` shape at import time. The runtime spiders are unchanged; sections are purely an authoring convenience. Configs that already use `rules`/`callbacks`/`FIELDS` pass through untouched.

## Why sections

A section says two things: *which URLs it matches* and *how to extract from them*. One repeating concept gets one section. Same article layout everywhere → one section. Pages that differ in structure or fields → give each its own section.

A spider is not one function. Write as many sections as the site has kinds of pages; never force structurally-different pages through one `extract` spec.

## The section record

Each entry is an object. Only `match` and `extract` carry the meaning; the rest are optional knobs.

```json theme={null}
{
  "match": ["/articles/.*"],
  "extract": "auto",
  "follow": true,
  "priority": 100
}
```

* `match`: List of URL regex patterns the section applies to. **Absent = match all URLs.** Becomes the rule's `allow`.
* `extract`: How to extract. Exactly one of three forms; see [Extract modes](#extract-modes). **Absent = follow-only** (navigation).
* `follow`: Whether to follow links found on matched pages.
* `priority`: Optional, `0`–`1000`. Higher is evaluated first.
* `deny`: URL regexes to exclude. Carried straight onto the rule.
* `restrict_xpaths`: Restrict link extraction to regions matching these XPaths. Carried onto the rule.
* `restrict_css`: Restrict link extraction to regions matching these CSS selectors. Carried onto the rule.
* `tags`: HTML tags to extract links from. Carried onto the rule.

Unknown keys and wrong types are **rejected at import**: the section schema forbids extra keys. Transport, throughput, `PDF_MODE`, `USE_SITEMAP`, and DeltaFetch all stay in top-level `settings`, never per-section.

## Extract modes

`extract` is exactly one of three things.

**Absent (follow-only).** Omit `extract` entirely. The page is crawled for links but nothing is extracted from it. This is the listing / index / navigation section.

```json theme={null}
{
  "match": ["/news$"],
  "follow": true
}
```

Desugars to a rule with `callback: null`.

**`"auto"`.** The built-in article reader fills the four **core fields**: `title`, `content`, `author`, `published_date`. Use this for ordinary article / blog pages.

```json theme={null}
{
  "match": ["/news/articles/.*"],
  "extract": "auto"
}
```

Desugars to a rule with `callback: "parse_article"`.

**Dict.** A `{ field: value }` dict, one entry per schema field. Each value is either:

* **`"auto"`**, valid **only** for the four core fields (`title`, `content`, `author`, `published_date`); or
* a **directive** `{ "css" | "xpath": "..." }` (optionally `get_all`, `to_text`, `to_markdown`, `processors`) for any field, core or not.

```json theme={null}
{
  "match": ["/product/.*"],
  "extract": {
    "name": { "css": "h1.title::text" },
    "price": { "css": "span.price::text" }
  }
}
```

The directive shape is identical to a [FIELDS / callback](/api/callbacks) directive; see [Extractors](/guides/extractors) and [Data Processors](/guides/data-processors).

`"auto"` on a **non-core** field is rejected at import. The core fields are exactly `title`, `content`, `author`, `published_date`. Give any other field an explicit selector.

## The auto + override rule

Keep `"auto"` for the core fields the reader gets right; add a selector only for fields it can't produce (anything non-core) or gets wrong.
A non-core field like `images` does **not** mean you hand-write `content`: keep `content` on `"auto"` and just add the `images` selector.

```json theme={null}
{
  "match": ["/articles/.*"],
  "extract": {
    "title": "auto",
    "content": "auto",
    "author": { "css": ".byline a::text" },
    "images": { "css": "figure img::attr(src)", "get_all": true }
  }
}
```

Mixing `"auto"` with a selector override writes the **spider-wide** global `FIELDS` dict, so **at most one section per spider** may mix `"auto"` with overrides. Other sections must give explicit selectors for every field. Violating this is rejected at import.

## How it desugars

At import, [`core/sections.py`](/api/rules) translates each section into the legacy shape:

| Section `extract`            | Becomes                                                                                  |
| ---------------------------- | ---------------------------------------------------------------------------------------- |
| absent                       | a rule with `callback: null` (follow-only)                                               |
| `"auto"`                     | a rule with `callback: "parse_article"`                                                  |
| dict with any `"auto"` field | a rule with `callback: "parse_article"`; selector overrides merge into `settings.FIELDS` |
| dict, all selectors          | a rule with `callback: "parse_section_"` + a matching callback holding those selectors   |

In every case `match` → the rule's `allow`, and `follow` / `priority` / `deny` / `restrict_xpaths` / `restrict_css` / `tags` are carried straight onto the rule.

Because it is a pure translation, anything expressible as sections is also expressible as [rules](/api/rules) + [callbacks](/api/callbacks) + [settings](/api/settings). A handful of features are still authored the legacy way (`iterate` listing→detail, `ajax_nested_list`, JS `PAGINATED_LISTINGS`); write `rules`/`callbacks` directly for those. See the [Spider Schema](/api/spider-schema).
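The four desugaring cases and the two import-time rejections described above can be sketched as a small classifier. This is an illustrative sketch only, not the project's `core/sections.py`: the function names (`classify_extract`, `mixes_auto_with_overrides`, `check_sections`) and the case labels are hypothetical, and the real importer may validate more than this.

```python
# Hypothetical sketch of the desugaring decision; names and labels are
# illustrative assumptions, not the actual importer.
CORE_FIELDS = {"title", "content", "author", "published_date"}


def classify_extract(section: dict) -> str:
    """Return which desugaring case a section falls into."""
    extract = section.get("extract")
    if extract is None:
        return "follow-only"      # rule with callback: null
    if extract == "auto":
        return "auto"             # rule with callback: "parse_article"
    if not isinstance(extract, dict):
        raise ValueError("extract must be absent, 'auto', or a dict")
    auto_fields = [name for name, value in extract.items() if value == "auto"]
    non_core = [name for name in auto_fields if name not in CORE_FIELDS]
    if non_core:
        raise ValueError(f"'auto' is only valid for core fields: {non_core}")
    # Any "auto" field routes the section through the article reader.
    return "dict-with-auto" if auto_fields else "selectors"


def mixes_auto_with_overrides(section: dict) -> bool:
    """True when a dict extract has both "auto" fields and selector fields."""
    extract = section.get("extract")
    if not isinstance(extract, dict):
        return False
    values = list(extract.values())
    return any(v == "auto" for v in values) and any(v != "auto" for v in values)


def check_sections(sections: list) -> list:
    """Classify every section and enforce the one-mixed-section limit."""
    kinds = [classify_extract(s) for s in sections]
    if sum(mixes_auto_with_overrides(s) for s in sections) > 1:
        raise ValueError("at most one section may mix 'auto' with overrides")
    return kinds


sections = [
    {"match": ["/articles/.*"],
     "extract": {"title": "auto", "content": "auto",
                 "author": {"css": ".byline a::text"}}},
    {"match": ["/products/.*"],
     "extract": {"name": {"css": "h1::text"}, "price": {"css": ".price::text"}}},
    {"match": [".*"], "follow": True},
]
print(check_sections(sections))
```

Running a check like this over a draft `sections` list before import is one way to catch a non-core `"auto"` or a second mixed section early; the import itself remains the authoritative validator.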
## Complete example

Reading top to bottom: the article section pins a stubborn `author` while keeping the rest on `"auto"` (the one allowed `auto` + override section); the product section gives one directive per non-core field; the final `{ "match": [".*"], "follow": true }` is the follow-only navigation section.

```json spider.json theme={null}
{
  "name": "domain_com",
  "allowed_domains": ["domain.com"],
  "start_urls": ["https://domain.com/articles"],
  "sections": [
    {
      "match": ["/articles/.*"],
      "extract": {
        "title": "auto",
        "content": "auto",
        "author": { "css": ".byline a::text" }
      }
    },
    {
      "match": ["/products/.*"],
      "extract": {
        "name": { "css": "h1::text" },
        "price": { "css": ".price::text" }
      }
    },
    { "match": [".*"], "follow": true }
  ],
  "settings": {
    "DOWNLOAD_DELAY": 0,
    "CONCURRENT_REQUESTS": 32,
    "CONCURRENT_REQUESTS_PER_DOMAIN": 16,
    "AUTOTHROTTLE_ENABLED": false
  }
}
```

**Plain article site (core fields only).** One `"auto"` article section plus a follow-only navigation section:

```json spider.json theme={null}
{
  "sections": [
    { "match": ["/blog/[^/]+$"], "extract": "auto", "follow": false, "priority": 100 },
    { "match": ["/blog$"], "follow": true, "priority": 50 }
  ]
}
```

Every `required: true` field in `project.json` must be sourced by some section. Verify with a 5-article test crawl before importing the final spider.

## Related

* [Rules](/api/rules): The lower-level rule shape sections desugar to
* [Callbacks](/api/callbacks): Named callbacks and field directives
* [Spider Schema](/api/spider-schema): Full spider config reference
* [Extractors](/guides/extractors): Selector and directive syntax