
# Authenticated Sessions

> Scrape behind a login by capturing a hand-performed session and replaying it on later crawls

Scrape content behind a login (paywalls, members-only areas, social sites) by reusing a login **you** perform once. scrapai **never types your password** and stores no credentials. It captures the resulting browser session (cookies + localStorage) and replays it so later crawls start already logged in.

The capture is a by-hand step: `./scrapai session login` opens a browser, **you** log in (including 2FA or SSO), then close the window. scrapai types nothing.

## The Model

One `nytimes_com` login serves every spider and project scraping that site; a session is not project-specific. A crawl loads only the session it names, never all of them, so each scraped site's browser context holds only its own login.

**Name a session after its domain** (`nytimes_com`, `thefga_org`). Other names work (e.g. `nytimes_com_work` for a second account on the same site), but the domain is the convention. Names may contain only letters, digits, underscores, and hyphens.

## Storage

* Sessions are saved to `~/.scrapai/sessions/<name>.json` with mode `0600` (readable and writable by the owner only).
* Override the directory with the `SCRAPAI_SESSIONS_DIR` env var.

These files hold **live login cookies**. Treat them like passwords.

## Commands

```bash
./scrapai session login <name> [url]   # capture a login (human-in-the-loop)
./scrapai session check <name> <url>   # confirm: open a gated URL, save a PNG
./scrapai session list                 # list saved sessions
./scrapai session remove <name>        # delete a session
```

### `session login`: Capture (by hand)

```bash
./scrapai session login nytimes_com https://www.nytimes.com/
```

A browser window opens. **Log in by hand**: navigate to the site's login page if needed and finish any 2FA or "Sign in with Google" step.
When you're logged in, just **close the window**; that captures the session (cookies + localStorage). No password is typed and no keyboard input is needed to trigger the save, so the same flow works remotely, as long as the browser has a display (see On a Server).

Re-running `./scrapai session login <name>` overwrites the file. That is how you **refresh** a session once its cookies expire.

`[url]` is optional: it is the page the browser opens on. Pass the site you want to log in to; if omitted, a built-in demo login page loads.

### `session check`: Confirm with a screenshot

```bash
./scrapai session check nytimes_com https://www.nytimes.com/account
./scrapai session check x_com https://x.com/home --wait 8
```

Loads the session, opens the gated URL in its own browser context, and saves a PNG to `~/.scrapai/sessions/<name>_check.png`. Open it to verify you're actually logged in: you should see the logged-in page (account name in the corner, full article), not the paywall.

`--wait` sets how many seconds the page gets to render before the screenshot. Raise it for JS-heavy SPAs (e.g. `--wait 8` for x.com), whose timeline takes several seconds to paint; otherwise you screenshot a loading spinner.

### `session list` / `session remove`

```bash
./scrapai session list
./scrapai session remove nytimes_com
```

`list` prints saved session names (without the `.json` extension). `remove` deletes the named session file.

## Using a Session in Inspect and Crawl

First, inspect a gated article **without** a session. If the full text is already in the HTML (a *soft* paywall), you need no login at all. If you get a login/paywall stub (a *hard* paywall), continue.

```bash
./scrapai inspect https://www.nytimes.com/section/world --project news
```

Next, capture a session:

```bash
./scrapai session login nytimes_com https://www.nytimes.com/
```

Log in by hand, then close the window.

Then check it:

```bash
./scrapai session check nytimes_com https://www.nytimes.com/account
```

Read the PNG to confirm you're logged in.
To inspect with the session, pass `--session <name>` **and** `--browser`:

```bash
./scrapai inspect https://www.nytimes.com/section/world --session nytimes_com --browser --project news
```

To crawl with the session, set the `SESSION` setting plus a browser transport so every fetch is authenticated:

```json spider.json
{
  "settings": {
    "SESSION": "nytimes_com",
    "CLOUDFLARE_ENABLED": true
  }
}
```

`inspect` has a `--session` CLI flag. **Crawls and the warm browser service do not**: a production crawl reads the session from the spider's `"SESSION"` setting, not from a command-line flag.

### The session only applies in the browser path

A session is a *browser* `storage_state`: it takes effect only when the browser fetches the page. This bites in two ways.

**In `inspect`.** `inspect` escalates HTTP → `curl_cffi` → browser and stops at the first transport that returns a 200. On a paywalled site, plain HTTP returns the **paywall page** with a valid 200, so `inspect` stops there, never reaching the browser and never applying the session. Force it with `--session <name> --browser`.

**In a crawl.** The spider needs a browser transport alongside `SESSION`:

* `"CLOUDFLARE_ENABLED": true` is the **hybrid, preferred** option. The browser authenticates and solves Cloudflare once per host, then `curl_cffi` replays the login cookies over fast HTTP. Works when the paywall is enforced server-side by cookie (common).
* `"BROWSER_ENABLED": true` renders **every** page in the browser. Correct but slow; use it only when the paywall is enforced client-side by JS and the hybrid path returns the stub.

Test hybrid first: crawl with `--limit 5`. If articles come back with full text, keep it. If they come back as the paywall stub, switch to `BROWSER_ENABLED`.

## Session Expiry: Fail Loud

A saved login eventually expires. When it does, every fetch silently reverts to the site's auth-wall page, which has no article and no date. Under a mandatory-date schema those rows are quarantined, so a crawl can run for hours "succeeding" while collecting nothing.
Guard against it with `SESSION_EXPIRED_SIGNAL`, a string that appears on the site's auth-wall page:

```json spider.json
{
  "settings": {
    "SESSION": "ladiaria_com_uy",
    "SESSION_EXPIRED_SIGNAL": "Muro de pago"
  }
}
```

On a `SESSION` crawl, if a fetched page contains that string, the login has expired, so the crawl **stops on the first hit** with an ERROR naming the fix:

```
[ladiaria_com_uy] SESSION expired (auth-wall detected). Stopping crawl. Re-run: scrapai session login ladiaria_com_uy
```

* **Off by default**: no signal set, no check, behaviour unchanged.
* **Stop-on-first**: a valid session never shows the auth-wall, so the first hit is already proof it's dead. No threshold, no tolerance window.
* **No auto-recovery**: unlike a Cloudflare block (which the crawl re-solves itself), a dead login needs a human. Re-run `./scrapai session login <name>` to refresh the file, then restart the crawl.

<Tip>
  **Finding the signal:** inspect a gated article *without* a session (or after logout) and note a stable string unique to the wall, such as a page element like `Muro de pago` or a "Suscribite para continuar" banner. Pick something that never appears on a real logged-in article.
</Tip>

## On a Server

`session login` opens a real browser window, so it needs a display. On a headless server, capture the session on a machine that has a display and copy `~/.scrapai/sessions/<name>.json` to the server (re-copy when it expires).

## Security

<Check>
  * Files are `0600` and hold live login cookies. Treat them like passwords.
  * Only the session a crawl names is loaded, so each scraped site sees only its own login.
  * scrapai never stores or types your username/password.
</Check>

## Related Guides

<CardGroup cols={2}>
  <Card title="Cloudflare Bypass" icon="shield-halved" href="/guides/cloudflare-bypass">
    Pair a session with hybrid CF bypass
  </Card>

  <Card title="Crawl Command" icon="spider" href="/cli/crawl">
    Run the production crawl
  </Card>
</CardGroup>
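The storage and security notes on this page say session files are mode `0600`. Depending on the tool used to copy a session to a server, that mode may not survive the transfer, so a quick audit can be useful. The sketch below is a hypothetical helper, not part of scrapai: the function names `file_mode` and `audit_sessions` are made up for illustration, and only the default directory and the `SCRAPAI_SESSIONS_DIR` override come from this page.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of scrapai): flag session files whose mode is not 0600.

# Same default directory and env-var override as described under Storage.
sessions_dir="${SCRAPAI_SESSIONS_DIR:-$HOME/.scrapai/sessions}"

# Print the octal permission bits of a file.
# Tries GNU stat first, then falls back to the BSD/macOS syntax.
file_mode() {
  stat -c '%a' "$1" 2>/dev/null || stat -f '%Lp' "$1"
}

# Return 0 if every *.json file in the directory is mode 600, 1 otherwise.
audit_sessions() {
  local dir="$1" f mode status=0
  for f in "$dir"/*.json; do
    [ -e "$f" ] || continue          # the glob matched nothing
    mode="$(file_mode "$f")"
    if [ "$mode" != "600" ]; then
      echo "loose permissions ($mode): $f"
      status=1
    fi
  done
  return "$status"
}
```

Calling `audit_sessions "$sessions_dir"` after copying a session lists any file that needs `chmod 600` and returns a non-zero status if one is found, which makes it easy to run before starting a crawl.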