# Sections

> The recommended spider authoring format — one section per kind of page, desugared into rules, callbacks, and FIELDS at import

`sections` is the recommended way to author a spider. You write one **section** per *kind of page* the crawl meets — one article layout, one product layout, the listing pages that only link onward — and that single list replaces hand-writing `rules` + `callbacks` + `settings.FIELDS`.

<Note>
  `sections` is desugared into the lower-level [rules](/api/rules) + [callbacks](/api/callbacks) + `settings.FIELDS` shape at import time. The runtime spiders are unchanged — sections are purely an authoring convenience. Configs that already use `rules`/`callbacks`/`FIELDS` pass through untouched.
</Note>

## Why sections

A section says two things: *which URLs it matches* and *how to extract from them*. One repeating concept gets one section. Same article layout everywhere → one section. Pages that differ in structure or fields → give each its own section.

<Tip>
  A spider is not one function. Write as many sections as the site has kinds of pages — never force structurally-different pages through one `extract` spec.
</Tip>

## The section record

Each entry is an object. `match` and `extract` carry the meaning, and each has a defined behavior when absent; the remaining keys are optional knobs that tune link following.

```json theme={null}
{ "match": ["/articles/.*"], "extract": "auto", "follow": true, "priority": 100 }
```

<ParamField path="match" type="string[]">
  List of URL regex patterns the section applies to. **Absent = match all URLs.** Becomes the rule's `allow`.
</ParamField>

<ParamField path="extract" type="&#x22;auto&#x22; | object">
  How to extract: left absent, set to `"auto"`, or given as a selector object. See [Extract modes](#extract-modes). **Absent = follow-only** (navigation).
</ParamField>

<ParamField path="follow" type="boolean" default="true">
  Whether to follow links found on matched pages.
</ParamField>

<ParamField path="priority" type="integer">
  Optional, from `0` to `1000`. Sections with a higher value are evaluated first.
</ParamField>

<ParamField path="deny" type="string[]">
  URL regexes to exclude. Carried straight onto the rule.
</ParamField>

<ParamField path="restrict_xpaths" type="string[]">
  Restrict link extraction to regions matching these XPaths. Carried onto the rule.
</ParamField>

<ParamField path="restrict_css" type="string[]">
  Restrict link extraction to regions matching these CSS selectors. Carried onto the rule.
</ParamField>

<ParamField path="tags" type="string[]">
  HTML tags to extract links from. Carried onto the rule.
</ParamField>
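
To get a feel for how `match` patterns select URLs, here is a minimal sketch. It assumes patterns are applied with search semantics (found anywhere in the URL, as Scrapy-style `allow` patterns usually are); this page does not state that, so treat it as an assumption. `matches` is a hypothetical helper, not part of the library, and treating an empty list like an absent `match` is also an assumption.

```python
import re

def matches(url, patterns=None):
    """Return True if any pattern is found anywhere in the URL.

    Assumption: search semantics, so "/blog$" matches any URL ending in "/blog".
    An absent (or, by assumption, empty) pattern list means "match all URLs".
    """
    if not patterns:
        return True
    return any(re.search(p, url) for p in patterns)

# A single path segment after /blog/ matches; a deeper path does not.
print(matches("https://domain.com/blog/my-post", ["/blog/[^/]+$"]))       # True
print(matches("https://domain.com/blog/2024/my-post", ["/blog/[^/]+$"]))  # False
print(matches("https://domain.com/blog", ["/blog$"]))                     # True
print(matches("https://domain.com/anything"))                             # True
```

Anchoring with `$` is what keeps the article pattern from also catching nested paths.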

<Warning>
  Unknown keys and wrong types are **rejected at import** — the section schema forbids extra keys. Transport, throughput, `PDF_MODE`, `USE_SITEMAP`, and DeltaFetch all stay in top-level `settings`, never per-section.
</Warning>
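
If you want to catch stray keys before running an import, a pre-check along these lines may help. It is a sketch built only from the keys and types listed above; `check_section_keys` is a hypothetical helper, and the real importer's error messages and full type checks will differ.

```python
# Keys listed in "The section record"; anything else is treated as unknown.
ALLOWED_KEYS = {
    "match", "extract", "follow", "priority",
    "deny", "restrict_xpaths", "restrict_css", "tags",
}

def check_section_keys(section):
    """Hypothetical pre-import check: return a list of problems (empty = OK)."""
    problems = [f"unknown key: {k}" for k in sorted(set(section) - ALLOWED_KEYS)]

    priority = section.get("priority")
    if priority is not None:
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(priority, bool) or not isinstance(priority, int):
            problems.append("priority must be an integer")
        elif not 0 <= priority <= 1000:
            problems.append("priority must be between 0 and 1000")

    if "follow" in section and not isinstance(section["follow"], bool):
        problems.append("follow must be a boolean")
    return problems

# A crawl setting placed on a section is flagged; it belongs in top-level settings.
print(check_section_keys({"match": ["/a"], "DOWNLOAD_DELAY": 1}))
```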

## Extract modes

`extract` is exactly one of three things.

<AccordionGroup>
  <Accordion title="Absent → follow-only navigation" icon="arrow-right">
    Omit `extract` entirely. The page is crawled for links but nothing is extracted from it. This is the listing / index / navigation section.

    ```json theme={null}
    { "match": ["/news$"], "follow": true }
    ```

    Desugars to a rule with `callback: null`.
  </Accordion>

  <Accordion title="&#x22;auto&#x22; → built-in article reader" icon="wand-magic-sparkles">
    The built-in article reader fills the four **core fields**: `title`, `content`, `author`, `published_date`. Use this for ordinary article / blog pages.

    ```json theme={null}
    { "match": ["/news/articles/.*"], "extract": "auto" }
    ```

    Desugars to a rule with `callback: "parse_article"`.
  </Accordion>

  <Accordion title="Selector dict → per-field extraction" icon="crosshairs">
    A `{ field: value }` dict, one entry per schema field. Each value is either:

    * **`"auto"`** — valid **only** for the four core fields (`title`, `content`, `author`, `published_date`); or
    * a **directive** `{ "css" | "xpath": "..." }` (optionally `get_all`, `to_text`, `to_markdown`, `processors`) for any field, core or not.

    ```json theme={null}
    {
      "match": ["/product/.*"],
      "extract": {
        "name":  { "css": "h1.title::text" },
        "price": { "css": "span.price::text" }
      }
    }
    ```

    The directive shape is identical to a [FIELDS / callback](/api/callbacks) directive — see [Extractors](/guides/extractors) and [Data Processors](/guides/data-processors).
  </Accordion>
</AccordionGroup>

<Warning>
  `"auto"` on a **non-core** field is rejected at import. The core fields are exactly `title`, `content`, `author`, `published_date`. Give any other field an explicit selector.
</Warning>

## The auto + override rule

Keep `"auto"` for the core fields the reader gets right; add a selector only for fields it can't produce (anything non-core) or gets wrong. A non-core field like `images` does **not** mean you hand-write `content` — keep `content` on `"auto"` and just add the `images` selector.

```json theme={null}
{
  "match": ["/articles/.*"],
  "extract": {
    "title":   "auto",
    "content": "auto",
    "author":  { "css": ".byline a::text" },
    "images":  { "css": "figure img::attr(src)", "get_all": true }
  }
}
```

<Warning>
  Mixing `"auto"` with a selector override writes to the **spider-wide** `settings.FIELDS` dict, so **at most one section per spider** may mix `"auto"` with overrides. Every other section must give explicit selectors for every field. Violating this is rejected at import.
</Warning>
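
The two `"auto"` constraints (core fields only, and at most one mixed section) can be checked mechanically. The sketch below is one reading of those rules, not the importer's actual code: `check_auto_rules` is a hypothetical helper, and it counts a section as "mixed" only when its `extract` dict holds at least one `"auto"` value and at least one selector directive.

```python
CORE_FIELDS = {"title", "content", "author", "published_date"}

def mixes_auto_with_overrides(extract):
    """True for a dict with at least one "auto" value and at least one selector."""
    if not isinstance(extract, dict):
        return False
    values = list(extract.values())
    has_auto = any(v == "auto" for v in values)
    has_selector = any(isinstance(v, dict) for v in values)
    return has_auto and has_selector

def check_auto_rules(sections):
    """Hypothetical check; returns a list of problems (empty = OK)."""
    problems = []
    mixed = 0
    for i, section in enumerate(sections):
        extract = section.get("extract")
        if isinstance(extract, dict):
            for field, value in extract.items():
                if value == "auto" and field not in CORE_FIELDS:
                    problems.append(
                        f"section {i}: 'auto' is not allowed on non-core field '{field}'"
                    )
        if mixes_auto_with_overrides(extract):
            mixed += 1
    if mixed > 1:
        problems.append(f"{mixed} sections mix 'auto' with overrides; at most one may")
    return problems
```

Run it over the `sections` list before import; an empty result means neither rule is violated under this interpretation.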

## How it desugars

At import, [`core/sections.py`](/api/rules) translates each section into the legacy shape:

| Section `extract`            | Becomes                                                                                   |
| ---------------------------- | ----------------------------------------------------------------------------------------- |
| absent                       | a rule with `callback: null` (follow-only)                                                |
| `"auto"`                     | a rule with `callback: "parse_article"`                                                   |
| dict with any `"auto"` field | a rule with `callback: "parse_article"`; selector overrides merge into `settings.FIELDS`  |
| dict, all selectors          | a rule with `callback: "parse_section_<n>"` + a matching callback holding those selectors |

In every case `match` → the rule's `allow`, and `follow` / `priority` / `deny` / `restrict_xpaths` / `restrict_css` / `tags` are carried straight onto the rule.
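
The table and the carry-over rule can be pictured as a small function. This is an illustrative sketch of the translation as described on this page, not the contents of `core/sections.py`: `desugar_section` is a hypothetical name, the numbering used for `parse_section_<n>` is an assumption, and the returned selector dict stands in for whatever shape the real callbacks take.

```python
# Keys carried straight onto the rule, per the text above.
RULE_KEYS = ("follow", "priority", "deny", "restrict_xpaths", "restrict_css", "tags")

def desugar_section(section, n):
    """Translate one section into (rule, callback_selectors, fields_overrides)."""
    rule = {k: section[k] for k in RULE_KEYS if k in section}
    if "match" in section:
        rule["allow"] = section["match"]

    extract = section.get("extract")
    callback_selectors = None  # selectors for a generated callback, if any
    fields_overrides = {}      # overrides destined for settings.FIELDS

    if extract is None:
        rule["callback"] = None                      # follow-only
    elif extract == "auto":
        rule["callback"] = "parse_article"
    elif any(v == "auto" for v in extract.values()):
        rule["callback"] = "parse_article"           # auto + overrides
        fields_overrides = {f: v for f, v in extract.items() if v != "auto"}
    else:
        rule["callback"] = f"parse_section_{n}"      # all selectors
        callback_selectors = dict(extract)
    return rule, callback_selectors, fields_overrides

rule, _, _ = desugar_section({"match": ["/news$"], "follow": True}, 0)
print(rule)
```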

<Note>
  Because it is a pure translation, anything expressible as sections is also expressible as [rules](/api/rules) + [callbacks](/api/callbacks) + [settings](/api/settings). A handful of features are still authored the legacy way (`iterate` listing→detail, `ajax_nested_list`, JS `PAGINATED_LISTINGS`) — write `rules`/`callbacks` directly for those. See the [Spider Schema](/api/spider-schema).
</Note>

## Complete example

Reading top to bottom: the article section pins a stubborn `author` while keeping the rest on `"auto"` (the one allowed `auto` + override section); the product section gives one directive per non-core field; the final `{ "match": [".*"], "follow": true }` is the follow-only navigation section.

```json spider.json theme={null}
{
  "name": "domain_com",
  "allowed_domains": ["domain.com"],
  "start_urls": ["https://domain.com/articles"],
  "sections": [
    {
      "match": ["/articles/.*"],
      "extract": {
        "title":   "auto",
        "content": "auto",
        "author":  { "css": ".byline a::text" }
      }
    },
    {
      "match": ["/products/.*"],
      "extract": {
        "name":  { "css": "h1::text" },
        "price": { "css": ".price::text" }
      }
    },
    { "match": [".*"], "follow": true }
  ],
  "settings": {
    "DOWNLOAD_DELAY": 0,
    "CONCURRENT_REQUESTS": 32,
    "CONCURRENT_REQUESTS_PER_DOMAIN": 16,
    "AUTOTHROTTLE_ENABLED": false
  }
}
```

**Plain article site (core fields only)** — one `"auto"` article section plus a follow-only navigation section:

```json spider.json theme={null}
{
  "sections": [
    { "match": ["/blog/[^/]+$"], "extract": "auto", "follow": false, "priority": 100 },
    { "match": ["/blog$"], "follow": true, "priority": 50 }
  ]
}
```

<Check>
  Every `required: true` field in `project.json` must be sourced by some section. Verify with a 5-article test crawl before importing the final spider.
</Check>
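
A coarse version of the required-field check can be scripted before the test crawl. The sketch assumes you have already pulled the names of the `required: true` fields out of `project.json` (its layout is not shown on this page, so no parsing is attempted here); `sourced_fields` and `missing_required` are hypothetical helpers.

```python
CORE_FIELDS = {"title", "content", "author", "published_date"}

def sourced_fields(sections):
    """Collect every field name that at least one section extracts."""
    fields = set()
    for section in sections:
        extract = section.get("extract")
        if extract == "auto":
            fields |= CORE_FIELDS        # "auto" fills the four core fields
        elif isinstance(extract, dict):
            fields |= set(extract)       # one entry per field in a selector dict
    return fields

def missing_required(required, sections):
    """Return required field names that no section sources, sorted."""
    return sorted(set(required) - sourced_fields(sections))

sections = [
    {"match": ["/blog/[^/]+$"], "extract": "auto", "follow": False},
    {"match": ["/blog$"], "follow": True},
]
print(missing_required(["title", "content", "images"], sections))  # ['images']
```

This only confirms that a field is named somewhere; whether the selector actually returns data is what the test crawl verifies.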

## Related

<CardGroup cols={2}>
  <Card title="Rules" icon="list-check" href="/api/rules">
    The lower-level rule shape sections desugar to
  </Card>

  <Card title="Callbacks" icon="code" href="/api/callbacks">
    Named callbacks and field directives
  </Card>

  <Card title="Spider Schema" icon="file-code" href="/api/spider-schema">
    Full spider config reference
  </Card>

  <Card title="Extractors" icon="filter" href="/guides/extractors">
    Selector and directive syntax
  </Card>
</CardGroup>
