Sections - scrapai

sections is the recommended way to author a spider. You write one section per kind of page the crawl meets — one article layout, one product layout, the listing pages that only link onward — and that single list replaces hand-writing rules + callbacks + settings.FIELDS.

sections is desugared into the lower-level rules + callbacks + settings.FIELDS shape at import time. The runtime spiders are unchanged — sections are purely an authoring convenience. Configs that already use rules/callbacks/FIELDS pass through untouched.

Why sections

A section says two things: which URLs it matches and how to extract from them. One repeating concept gets one section. Same article layout everywhere → one section. Pages that differ in structure or fields → give each its own section.

A spider is not one function. Write as many sections as the site has kinds of pages — never force structurally-different pages through one extract spec.

The section record

Each entry is an object. Only match and extract carry the meaning; the rest are optional knobs.

{ "match": ["/articles/.*"], "extract": "auto", "follow": true, "priority": 100 }

match

string[]

List of URL regex patterns the section applies to. Absent = match all URLs. Becomes the rule’s allow.

extract

"auto" | object

How to extract. One of three forms: absent (follow-only navigation), "auto", or a selector dict. See Extract modes.

follow

boolean

default: true

Whether to follow links found on matched pages.

priority

integer

Optional, 0–1000. Higher is evaluated first.

deny

string[]

URL regexes to exclude. Carried straight onto the rule.

restrict_xpaths

string[]

Restrict link extraction to regions matching these XPaths. Carried onto the rule.

restrict_css

string[]

Restrict link extraction to regions matching these CSS selectors. Carried onto the rule.
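To make the interplay of match, deny and priority concrete, here is a minimal sketch of how a URL could be assigned to a section. It is not scrapai's actual code: the helper name pick_section is hypothetical, and the sketch assumes patterns are applied with re.search (unanchored, so "/news$" can match the tail of a full URL), that a missing priority counts as 0, and that a deny hit simply makes the section skip the URL.

```python
import re

def pick_section(url, sections):
    """Return the first section that claims `url`, or None (hypothetical helper)."""
    # Higher priority is evaluated first. Assumption: a missing priority counts as 0.
    # sorted() is stable, so sections with equal priority keep their file order.
    ordered = sorted(sections, key=lambda s: s.get("priority", 0), reverse=True)
    for section in ordered:
        allow = section.get("match")      # absent = match all URLs
        deny = section.get("deny", [])
        # Assumption: patterns are unanchored regex searches against the URL.
        if allow is not None and not any(re.search(p, url) for p in allow):
            continue
        if any(re.search(p, url) for p in deny):
            continue
        return section
    return None

sections = [
    {"match": [r"/blog/[^/]+$"], "extract": "auto", "follow": False, "priority": 100},
    {"match": [r"/blog$"], "follow": True, "priority": 50},
]

article = pick_section("https://domain.com/blog/hello", sections)
listing = pick_section("https://domain.com/blog", sections)
```

Under these assumptions the article URL lands in the "auto" section and the listing URL in the follow-only one; a URL that no pattern matches is claimed by no section.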

Extract modes

extract is exactly one of three things.

Absent → follow-only navigation

Omit extract entirely. The page is crawled for links but nothing is extracted from it. This is the listing / index / navigation section.

{ "match": ["/news$"], "follow": true }

Desugars to a rule with callback: null.

"auto" → built-in article reader

The built-in article reader fills the four core fields: title, content, author, published_date. Use this for ordinary article / blog pages.

{ "match": ["/news/articles/.*"], "extract": "auto" }

Desugars to a rule with callback: "parse_article".

Selector dict → per-field extraction

A { field: value } dict, one entry per schema field. Each value is either:

"auto" — valid only for the four core fields (title, content, author, published_date); or
a directive { "css" | "xpath": "..." } (optionally get_all, to_text, to_markdown, processors) for any field, core or not.

{
  "match": ["/product/.*"],
  "extract": {
    "name":  { "css": "h1.title::text" },
    "price": { "css": "span.price::text" }
  }
}

The directive shape is identical to a FIELDS / callback directive — see Extractors and Data Processors.
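The three forms, together with the restriction of "auto" to core fields, can be sketched as a small classifier. This is an illustrative sketch rather than the importer's real validation: the function name classify_extract, the returned labels and the error messages are all assumptions.

```python
CORE_FIELDS = {"title", "content", "author", "published_date"}

def classify_extract(section):
    """Label a section's extract form (hypothetical helper, not scrapai's API)."""
    if "extract" not in section:
        return "follow-only"              # navigation: links only, nothing extracted
    extract = section["extract"]
    if extract == "auto":
        return "auto"                     # built-in article reader
    if not isinstance(extract, dict):
        raise ValueError(f"extract must be 'auto' or a dict, got {extract!r}")
    # Fields whose value is the literal string "auto" (directives are dicts).
    auto_fields = {name for name, value in extract.items() if value == "auto"}
    bad = sorted(auto_fields - CORE_FIELDS)
    if bad:
        raise ValueError(f'"auto" is only valid for core fields, not {bad}')
    return "dict-with-auto" if auto_fields else "selectors"
```

For example, classify_extract({"extract": {"images": "auto"}}) raises ValueError in this sketch, mirroring the import-time rejection of "auto" on a non-core field.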

"auto" on a non-core field is rejected at import. The core fields are exactly title, content, author, published_date. Give any other field an explicit selector.

The auto + override rule

Keep "auto" for the core fields the reader gets right; add a selector only for fields it can’t produce (anything non-core) or gets wrong. A non-core field like images does not mean you hand-write content — keep content on "auto" and just add the images selector.

{
  "match": ["/articles/.*"],
  "extract": {
    "title":   "auto",
    "content": "auto",
    "author":  { "css": ".byline a::text" },
    "images":  { "css": "figure img::attr(src)", "get_all": true }
  }
}

Mixing "auto" with a selector override writes the override into settings.FIELDS, a single dict shared across the whole spider, so at most one section per spider may mix "auto" with overrides. Other sections must give explicit selectors for every field. A config that violates this is rejected at import.
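A rough sketch of that constraint as a pre-import check follows. The helper names and the error message are hypothetical, and the real importer may detect the conflict differently; the sketch only assumes that "mixing" means a dict that holds at least one "auto" value and at least one selector directive.

```python
def mixes_auto_with_overrides(section):
    """True if the extract dict holds both "auto" values and selector directives."""
    extract = section.get("extract")
    if not isinstance(extract, dict):
        return False                      # absent or plain "auto": nothing to mix
    values = list(extract.values())
    has_auto = any(v == "auto" for v in values)
    has_selector = any(isinstance(v, dict) for v in values)
    return has_auto and has_selector

def check_single_mixed_section(sections):
    """Return indexes of mixed sections; raise if there is more than one."""
    mixed = [i for i, s in enumerate(sections) if mixes_auto_with_overrides(s)]
    if len(mixed) > 1:
        raise ValueError(f"sections {mixed} mix 'auto' with overrides; at most one may")
    return mixed
```

Running a check like this over the sections list before import surfaces the conflict early instead of at import time.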

How it desugars

At import, core/sections.py translates each section into the legacy shape:

| Section `extract` | Becomes |
| --- | --- |
| absent | a rule with `callback: null` (follow-only) |
| `"auto"` | a rule with `callback: "parse_article"` |
| dict with any `"auto"` field | a rule with `callback: "parse_article"`; selector overrides merge into `settings.FIELDS` |
| dict, all selectors | a rule with `callback: "parse_section_<n>"` + a matching callback holding those selectors |

In every case match → the rule’s allow, and follow / priority / deny / restrict_xpaths / restrict_css / tags are carried straight onto the rule.
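The translation described in this section can be sketched in a few lines. This is not the code in core/sections.py: the function name desugar, the shape of the generated callback entries, and the use of the section's list position for <n> are assumptions made for illustration.

```python
# Keys copied unchanged from a section onto its rule.
CARRIED = ("follow", "priority", "deny", "restrict_xpaths", "restrict_css", "tags")

def desugar(sections):
    """Translate sections into (rules, callbacks, fields). Illustrative sketch only."""
    rules, callbacks, fields = [], {}, {}
    for n, section in enumerate(sections):
        rule = {key: section[key] for key in CARRIED if key in section}
        if "match" in section:
            rule["allow"] = section["match"]          # match becomes the rule's allow
        extract = section.get("extract")
        if extract is None:
            rule["callback"] = None                   # follow-only navigation
        elif extract == "auto" or "auto" in extract.values():
            rule["callback"] = "parse_article"
            if isinstance(extract, dict):
                # Selector overrides merge into the spider-wide FIELDS dict.
                fields.update({k: v for k, v in extract.items() if v != "auto"})
        else:
            name = f"parse_section_{n}"               # assumption: n is the list index
            rule["callback"] = name
            callbacks[name] = dict(extract)           # assumption: callback holds the selectors
        rules.append(rule)
    return rules, callbacks, fields
```

Because the overrides of every dict that contains "auto" land in the same fields dict, two such sections would overwrite each other's selectors, which is one way to see why only one section may mix "auto" with overrides.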

Because it is a pure translation, anything expressible as sections is also expressible as rules + callbacks + settings. A handful of features are still authored the legacy way (iterate listing→detail, ajax_nested_list, JS PAGINATED_LISTINGS) — write rules/callbacks directly for those. See the Spider Schema.

Complete example

Reading top to bottom: the article section pins a stubborn author while keeping the rest on "auto" (the one allowed auto + override section); the product section gives one directive per non-core field; the final { "match": [".*"], "follow": true } is the follow-only navigation section.

spider.json

{
  "name": "domain_com",
  "allowed_domains": ["domain.com"],
  "start_urls": ["https://domain.com/articles"],
  "sections": [
    {
      "match": ["/articles/.*"],
      "extract": {
        "title":   "auto",
        "content": "auto",
        "author":  { "css": ".byline a::text" }
      }
    },
    {
      "match": ["/products/.*"],
      "extract": {
        "name":  { "css": "h1::text" },
        "price": { "css": ".price::text" }
      }
    },
    { "match": [".*"], "follow": true }
  ],
  "settings": {
    "DOWNLOAD_DELAY": 0,
    "CONCURRENT_REQUESTS": 32,
    "CONCURRENT_REQUESTS_PER_DOMAIN": 16,
    "AUTOTHROTTLE_ENABLED": false
  }
}

Plain article site (core fields only) — one "auto" article section plus a follow-only navigation section:

spider.json

{
  "sections": [
    { "match": ["/blog/[^/]+$"], "extract": "auto", "follow": false, "priority": 100 },
    { "match": ["/blog$"], "follow": true, "priority": 50 }
  ]
}

Every required: true field in project.json must be sourced by some section. Verify with a 5-article test crawl before importing the final spider.
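A quick way to check the coverage requirement before a test crawl is to compare the required field names against what the sections source. The sketch below is hypothetical: it takes the required names as a plain list because the layout of project.json is not shown here, and it assumes a plain "auto" section sources the four core fields while a dict sources exactly the fields it lists.

```python
CORE_FIELDS = {"title", "content", "author", "published_date"}

def sourced_fields(sections):
    """Collect every field name that at least one section extracts."""
    found = set()
    for section in sections:
        extract = section.get("extract")
        if extract == "auto":
            found |= CORE_FIELDS          # assumption: the reader fills all four core fields
        elif isinstance(extract, dict):
            found |= set(extract)         # assumption: a dict sources only the fields it lists
    return found

def missing_required(required, sections):
    """Return the required field names that no section sources, sorted."""
    return sorted(set(required) - sourced_fields(sections))
```

An empty result means every required field has a source; anything else names the fields that still need a section or a selector.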

Related

Rules: the lower-level rule shape sections desugar to

Callbacks: named callbacks and field directives

Spider Schema: full spider config reference

Extractors: selector and directive syntax
