# Authenticated Sessions

> Scrape behind a login by capturing a hand-performed session and replaying it on later crawls

Scrape content behind a login (paywalls, members-only areas, social sites) by reusing a login **you** perform once. scrapai **never types your password** and stores no credentials. It captures the resulting browser session (cookies + localStorage) and replays it so later crawls start already logged in.

<Note>
  The capture is a by-hand step: `./scrapai session login` opens a browser, **you** log in (including 2FA or SSO), then close the window. scrapai types nothing.
</Note>

## The Model

<CardGroup cols={2}>
  <Card title="Global and per-site" icon="globe">
    One `nytimes_com` login serves every spider and project scraping that site. A session is not project-specific.
  </Card>

  <Card title="Only one is loaded" icon="lock">
    A crawl loads only the session it names — never all of them. Each scraped site's browser context holds only its own login.
  </Card>
</CardGroup>

**Name a session after its domain** (`nytimes_com`, `thefga_org`). Other names work (e.g. `nytimes_com_work` for a second account on the same site), but the domain is the convention. Names must be letters, digits, underscores, or hyphens only.
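The naming convention and character rule above can be checked before running any command. This is a minimal sketch, not scrapai code: `session_name_for` and `is_valid_session_name` are hypothetical helpers, and the regular expression assumes "letters" means ASCII letters. scrapai's own validation may differ.

```python
import re
from urllib.parse import urlsplit

# Assumption: mirrors the documented rule (letters, digits, underscores, hyphens),
# read as ASCII only. scrapai's real validator is not shown in this guide.
_NAME_RE = re.compile(r"[A-Za-z0-9_-]+")


def is_valid_session_name(name: str) -> bool:
    """Return True if the name uses only the documented characters."""
    return _NAME_RE.fullmatch(name) is not None


def session_name_for(url: str) -> str:
    """Hypothetical helper: derive the conventional name from a URL's host."""
    host = urlsplit(url).hostname or ""
    if host.startswith("www."):
        host = host[len("www."):]
    return host.replace(".", "_")


print(session_name_for("https://www.nytimes.com/"))  # nytimes_com
print(is_valid_session_name("nytimes_com_work"))      # True
print(is_valid_session_name("nytimes.com"))           # False (dots are not allowed)
```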

## Storage

* Sessions are saved to `~/.scrapai/sessions/<name>.json` with mode `0600` (readable and writable by the owner only).
* Override the directory with the `SCRAPAI_SESSIONS_DIR` env var.

<Warning>
  These files hold **live login cookies** — treat them like passwords.
</Warning>
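Since these files are as sensitive as passwords, it can be worth auditing their permissions from time to time. The sketch below is an illustration under assumptions: `sessions_dir` and `loose_session_files` are hypothetical helpers, and the way an empty `SCRAPAI_SESSIONS_DIR` is handled here may not match scrapai's behaviour. The permission check is only meaningful on POSIX systems.

```python
import os
import stat
from pathlib import Path


def sessions_dir() -> Path:
    """Resolve the sessions directory, honouring SCRAPAI_SESSIONS_DIR if set."""
    override = os.environ.get("SCRAPAI_SESSIONS_DIR")
    # Assumption: an unset or empty variable falls back to the documented default.
    return Path(override) if override else Path.home() / ".scrapai" / "sessions"


def loose_session_files(directory: Path) -> list:
    """Return session files whose mode grants any access to group or others."""
    loose = []
    for path in sorted(directory.glob("*.json")):
        mode = stat.S_IMODE(path.stat().st_mode)
        if mode & 0o077:  # any group/other bit set means wider than 0600
            loose.append(path)
    return loose


if __name__ == "__main__":
    directory = sessions_dir()
    if directory.is_dir():
        for path in loose_session_files(directory):
            print(f"too permissive: {path}")
```

A file reported here can be tightened with `chmod 600 <file>`.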

## Commands

```bash theme={null}
./scrapai session login <name> [url]              # capture a login (human-in-the-loop)
./scrapai session check <name> <url> [--wait N]   # confirm: open a gated URL, save a PNG
./scrapai session list                            # list saved sessions
./scrapai session remove <name>                   # delete a session
```

### `session login` — Capture (by hand)

```bash theme={null}
./scrapai session login nytimes_com https://www.nytimes.com/
```

A browser window opens. **Log in by hand**: navigate to the site's login page if needed and finish any 2FA or "Sign in with Google" step. Once you're logged in, **close the window**; that captures the session (cookies + localStorage). No password is typed and no keyboard input is needed to trigger the save, so the same flow works over a remote display. The browser window does still need a display of some kind; see On a Server below.

<Tip>
  Re-running `./scrapai session login <name>` overwrites the file — that is how you **refresh** a session once its cookies expire.
</Tip>

<ParamField path="url" type="string" default="demo login page">
  Optional. The page the browser opens on. Pass the site you want to log in to; if omitted, a built-in demo login page loads.
</ParamField>
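Because refreshing a session is a manual step, it can help to know roughly when a saved login will lapse. The sketch below assumes the session file follows Playwright's `storage_state` JSON layout (a top-level `cookies` list whose entries carry `expires` as Unix seconds, with `-1` for session cookies). This guide does not confirm the file format, so treat that layout and the helper names as assumptions.

```python
import json
import time
from pathlib import Path
from typing import Optional


def soonest_cookie_expiry(session_file: Path) -> Optional[float]:
    """Return the earliest cookie expiry (Unix seconds), or None if no cookie has one.

    Assumption: the file uses Playwright's storage_state layout.
    """
    state = json.loads(session_file.read_text(encoding="utf-8"))
    expiries = [
        cookie["expires"]
        for cookie in state.get("cookies", [])
        if cookie.get("expires", -1) > 0  # -1 marks a cookie with no fixed expiry
    ]
    return min(expiries) if expiries else None


def days_left(expiry: float, now: Optional[float] = None) -> float:
    """Convert an expiry timestamp into days remaining (negative if already past)."""
    now = time.time() if now is None else now
    return (expiry - now) / 86400
```

The earliest expiry is only a hint: it may belong to a cookie unrelated to the login, and a site can invalidate a session server-side before any cookie expires. `session check` remains the reliable test.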

### `session check` — Confirm with a screenshot

```bash theme={null}
./scrapai session check nytimes_com https://www.nytimes.com/account
./scrapai session check x_com https://x.com/home --wait 8
```

Loads the session, opens the gated URL in its own browser context, and saves a PNG to `~/.scrapai/sessions/<name>_check.png`. Open it to verify you're actually logged in — you should see the logged-in page (account name in the corner, full article), not the paywall.

<ParamField path="--wait" type="int" default="3">
  Seconds to let the page render before the screenshot. Raise it for JS-heavy SPAs (e.g. `--wait 8` for x.com), whose timeline takes several seconds to paint — otherwise you screenshot a loading spinner.
</ParamField>

### `session list` / `session remove`

```bash theme={null}
./scrapai session list
./scrapai session remove nytimes_com
```

`list` prints saved session names (without the `.json` extension). `remove` deletes the named session file.

## Using a Session in Inspect and Crawl

<Steps>
  <Step title="QA the paywall anonymously first">
    Inspect a gated article **without** a session. If the full text is already in the HTML (a *soft* paywall), you need no login at all. If you get a login/paywall stub (a *hard* paywall), continue.

    ```bash theme={null}
    ./scrapai inspect https://www.nytimes.com/section/world --project news
    ```
  </Step>

  <Step title="Capture the login">
    ```bash theme={null}
    ./scrapai session login nytimes_com https://www.nytimes.com/
    ```

    Log in by hand, then close the window.
  </Step>

  <Step title="Verify it took">
    ```bash theme={null}
    ./scrapai session check nytimes_com https://www.nytimes.com/account
    ```

    Read the PNG — confirm you're logged in.
  </Step>

  <Step title="Inspect gated pages with the session">
    Pass `--session <name>` **and** `--browser`:

    ```bash theme={null}
    ./scrapai inspect https://www.nytimes.com/section/world --session nytimes_com --browser --project news
    ```
  </Step>

  <Step title="Configure the spider">
    Set the `SESSION` setting plus a browser transport so every fetch is authenticated:

    ```json spider.json theme={null}
    {
      "settings": {
        "SESSION": "nytimes_com",
        "CLOUDFLARE_ENABLED": true
      }
    }
    ```
  </Step>
</Steps>

<Warning>
  `inspect` has a `--session` CLI flag. **Crawls and the warm browser service do not** — a production crawl reads the session from the spider's `"SESSION"` setting, not a command-line flag.
</Warning>
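Since a crawl takes the session from the spider's settings, a quick pre-flight check on the config can catch a `SESSION` that has no browser transport to pair with. This is a minimal sketch: `session_transport_problems` is a hypothetical helper, not a scrapai command, and it only knows the two transport settings named in this guide.

```python
import json


def session_transport_problems(spider: dict) -> list:
    """Return human-readable problems with the SESSION/transport pairing."""
    settings = spider.get("settings", {})
    if "SESSION" not in settings:
        return []  # no session configured, nothing to pair
    if settings.get("CLOUDFLARE_ENABLED") or settings.get("BROWSER_ENABLED"):
        return []  # a browser transport is enabled
    return [
        '"SESSION" is set but neither "CLOUDFLARE_ENABLED" nor '
        '"BROWSER_ENABLED" is true'
    ]


# Example config that forgets the browser transport.
spider = json.loads('{"settings": {"SESSION": "nytimes_com"}}')
for problem in session_transport_problems(spider):
    print(problem)
```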

### The session only applies in the browser path

A session is a *browser* storage\_state — it takes effect only when the browser fetches the page. This bites in two ways:

<AccordionGroup>
  <Accordion title="inspect without --browser silently skips the session">
    `inspect` escalates HTTP → curl\_cffi → browser and stops at the first transport that returns a 200. On a paywalled site, plain HTTP returns the **paywall page** with a valid 200, so `inspect` stops there — never reaching the browser, never applying the session. Force it with `--session <name> --browser`.
  </Accordion>

  <Accordion title="SESSION needs a browser setting to pair with">
    * `"CLOUDFLARE_ENABLED": true` — **hybrid, preferred.** The browser authenticates and solves Cloudflare once per host, then curl\_cffi replays the login cookies over fast HTTP. Works when the paywall is enforced server-side by cookie (common).
    * `"BROWSER_ENABLED": true` — renders **every** page in the browser. Correct but slow; use only when the paywall is enforced client-side by JS and the hybrid path returns the stub.

    Test hybrid first: crawl with `--limit 5`. If articles come back with full text, keep it. If they come back as the paywall stub, switch to `BROWSER_ENABLED`.
  </Accordion>
</AccordionGroup>
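The first pitfall is easier to see as a toy model of the escalation ladder. Everything below is a simulation with made-up stand-ins: `plain_http`, `curl_cffi_fetch`, `browser_fetch`, and `simulate_inspect` are hypothetical names, not scrapai APIs, and the sketch assumes `--browser` goes straight to the browser transport.

```python
from dataclasses import dataclass
from typing import Optional

STUB = "<h1>Subscribe to continue</h1>"


@dataclass
class Response:
    status: int
    body: str
    transport: str


def plain_http(url: str, session: Optional[str]) -> Response:
    # A browser session cannot apply here, so a hard paywall answers 200 + stub.
    return Response(200, STUB, "http")


def curl_cffi_fetch(url: str, session: Optional[str]) -> Response:
    return Response(200, STUB, "curl_cffi")


def browser_fetch(url: str, session: Optional[str]) -> Response:
    # Only the browser path loads the saved session.
    body = "<article>Full text</article>" if session else STUB
    return Response(200, body, "browser")


def simulate_inspect(url: str, session: Optional[str] = None,
                     force_browser: bool = False) -> Response:
    ladder = [browser_fetch] if force_browser else [plain_http, curl_cffi_fetch, browser_fetch]
    response = None
    for fetch in ladder:
        response = fetch(url, session)
        if response.status == 200:  # first 200 wins, whatever the body contains
            break
    return response


url = "https://www.nytimes.com/section/world"
print(simulate_inspect(url, session="nytimes_com").transport)                      # http
print(simulate_inspect(url, session="nytimes_com", force_browser=True).transport)  # browser
```

In the first call the session is named but never used, because the ladder stops at the paywall page's 200.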

## Session Expiry: Fail Loud

A saved login eventually expires. When it does, every fetch silently falls back to the site's auth-wall page, which has no article and no date. Under a mandatory-date schema those rows are quarantined, so a crawl can run for hours "succeeding" while collecting nothing.

Guard against it with `SESSION_EXPIRED_SIGNAL` — a string that appears on the site's auth-wall page:

```json spider.json theme={null}
{
  "settings": {
    "SESSION": "ladiaria_com_uy",
    "SESSION_EXPIRED_SIGNAL": "Muro de pago"
  }
}
```

On a `SESSION` crawl, if a fetched page contains that string, the login has expired — so the crawl **stops on the first hit** with an ERROR naming the fix:

```text theme={null}
[ladiaria_com_uy] SESSION expired (auth-wall detected). Stopping crawl.
Re-run: scrapai session login ladiaria_com_uy
```

<Info>
  * **Off by default** — no signal set, no check, behaviour unchanged.
  * **Stop-on-first** — a valid session never shows the auth-wall, so the first hit is already proof it's dead. No threshold, no tolerance window.
  * **No auto-recovery** — unlike a Cloudflare block (which the crawl re-solves itself), a dead login needs a human. Re-run `./scrapai session login <name>` to refresh the file, then restart the crawl.
</Info>
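The stop-on-first behaviour can be pictured as follows. This is a sketch, not scrapai's implementation: `SessionExpired`, `check_page`, and `crawl` are hypothetical names, and the sketch assumes a plain case-sensitive substring match, which this guide does not confirm.

```python
from typing import Iterable, Optional


class SessionExpired(RuntimeError):
    """Raised when a fetched page contains the configured auth-wall signal."""


def check_page(html: str, signal: Optional[str], session: str) -> None:
    # No signal configured means no check at all (off by default).
    if signal and signal in html:
        raise SessionExpired(
            f"[{session}] SESSION expired (auth-wall detected). Stopping crawl.\n"
            f"Re-run: scrapai session login {session}"
        )


def crawl(pages: Iterable[str], signal: Optional[str], session: str) -> int:
    """Process pages in order and return how many were kept."""
    kept = 0
    for html in pages:
        check_page(html, signal, session)  # raises on the first hit, ending the loop
        kept += 1
    return kept


pages = ["<article>ok</article>", "<title>Muro de pago</title>", "<article>unreached</article>"]
try:
    crawl(pages, "Muro de pago", "ladiaria_com_uy")
except SessionExpired as exc:
    print(exc)
```

The third page is never processed: one hit is enough to end the run.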

<Tip>
  **Finding the signal:** inspect a gated article *without* a session (or after logout) and note a stable string unique to the wall — a page `<title>` like `Muro de pago`, a "Suscribite para continuar" banner, etc. Pick something that never appears on a real logged-in article.
</Tip>
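A candidate signal can be sanity-checked against saved copies of the wall page and a few logged-in articles. The helper below is a hypothetical illustration, the HTML snippets are invented samples, and it assumes the signal is matched as a plain substring.

```python
from typing import Iterable


def is_usable_signal(candidate: str, wall_html: str, logged_in_pages: Iterable[str]) -> bool:
    """True if the candidate appears on the wall and on none of the logged-in samples."""
    if not candidate or candidate not in wall_html:
        return False
    return all(candidate not in page for page in logged_in_pages)


# Invented sample pages, for illustration only.
wall = "<title>Muro de pago</title><p>Suscribite para continuar</p>"
articles = ["<title>Noticia</title><footer>Suscribite al newsletter</footer>"]

print(is_usable_signal("Muro de pago", wall, articles))  # True
print(is_usable_signal("Suscribite", wall, articles))    # False: also in the article footer
```

A string that is too short or too generic can match real articles and stop a healthy crawl, so prefer the longest stable phrase the wall offers.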

## On a Server

`session login` opens a real browser window, so it needs a display. On a headless server, capture the session on a machine that has a display and copy `~/.scrapai/sessions/<name>.json` to the server (re-copy when it expires).

## Security

<Check>
  * Files are `0600` and hold live login cookies — treat them like passwords.
  * Only the session a crawl names is loaded, so each scraped site sees only its own login.
  * scrapai never stores or types your username/password.
</Check>

## Related Guides

<CardGroup cols={2}>
  <Card title="Cloudflare Bypass" icon="shield-halved" href="/guides/cloudflare-bypass">
    Pair a session with hybrid CF bypass
  </Card>

  <Card title="Crawl Command" icon="spider" href="/cli/crawl">
    Run the production crawl
  </Card>
</CardGroup>
