Scrape content behind a login (paywalls, members-only areas, social sites) by reusing a login you perform once. scrapai never types your password and stores no credentials. It captures the resulting browser session (cookies + localStorage) and replays it so later crawls start already logged in.
The capture is a by-hand step: ./scrapai session login opens a browser, you log in (including 2FA or SSO), then close the window. scrapai types nothing.

The Model

Global and per-site

One nytimes_com login serves every spider and project scraping that site. A session is not project-specific.

Only one is loaded

A crawl loads only the session it names — never all of them. Each scraped site’s browser context holds only its own login.
Name a session after its domain (nytimes_com, thefga_org). Other names work (e.g. nytimes_com_work for a second account on the same site), but the domain is the convention. Names must be letters, digits, underscores, or hyphens only.
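The naming rule and the domain convention can be checked before a session is captured. The sketch below is illustrative only and is not scrapai's own validator: both function names are hypothetical, and stripping a leading "www." is an assumption inferred from the example names above.

```python
import re

# Letters, digits, underscores, or hyphens only (the rule stated above).
_NAME_RE = re.compile(r"[A-Za-z0-9_-]+")


def is_valid_session_name(name: str) -> bool:
    """Return True when the whole name uses only the allowed characters."""
    return _NAME_RE.fullmatch(name) is not None


def name_from_domain(domain: str) -> str:
    """Turn a host such as 'www.nytimes.com' into 'nytimes_com'.

    Assumption: a leading 'www.' is dropped and dots become underscores.
    """
    host = domain.lower()
    if host.startswith("www."):
        host = host[len("www."):]
    return host.replace(".", "_")
```

For a second account on the same site, append a suffix by hand (for example nytimes_com_work); the result still passes is_valid_session_name.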

Storage

  • Sessions are saved to ~/.scrapai/sessions/<name>.json, mode 0600 (readable and writable by the owner only).
  • Override the directory with the SCRAPAI_SESSIONS_DIR env var.
These files hold live login cookies — treat them like passwords.
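To audit where a session file lives and whether its permissions are still tight, something like the following may help. It is a sketch under assumptions, not scrapai's code: the function names are hypothetical, and treating an empty SCRAPAI_SESSIONS_DIR as "not set" is a guess about how the override is read.

```python
import os
import stat
from pathlib import Path


def sessions_dir() -> Path:
    """Directory holding session files: the env override, else the default."""
    override = os.environ.get("SCRAPAI_SESSIONS_DIR")
    if override:  # assumption: an empty value falls back to the default
        return Path(override).expanduser()
    return Path.home() / ".scrapai" / "sessions"


def session_path(name: str) -> Path:
    """Path of the JSON file for a named session."""
    return sessions_dir() / f"{name}.json"


def is_owner_only(path: Path) -> bool:
    """True when group and others have no permission bits set (e.g. 0600)."""
    mode = stat.S_IMODE(path.stat().st_mode)
    return mode & 0o077 == 0
```

Running is_owner_only(session_path("nytimes_com")) after copying a file between machines is a cheap way to notice that a copy step loosened the mode.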

Commands

./scrapai session login <name> [url]    # capture a login (human-in-the-loop)
./scrapai session check <name> <url>    # confirm: open a gated URL, save a PNG
./scrapai session list                  # list saved sessions
./scrapai session remove <name>         # delete a session

session login — Capture (by hand)

./scrapai session login nytimes_com https://www.nytimes.com/
A browser window opens. Log in by hand: navigate to the site’s login if needed and finish any 2FA or “Sign in with Google”. When you’re logged in, just close the window; that captures the session (cookies + localStorage). No password is typed and no keystroke in the terminal is needed to trigger the save, so the flow also works when the command is launched remotely, as long as the browser has a display to open on.
Re-running ./scrapai session login <name> overwrites the file — that is how you refresh a session once its cookies expire.
url (string, default: "demo login page")
Optional. The page the browser opens on. Pass the site you want to log in to; if omitted, a built-in demo login page loads.
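Because a session has to be refreshed once its cookies expire, it can be useful to read the earliest expiry out of a saved file. The sketch below assumes the file follows Playwright's storage_state layout (a top-level "cookies" list whose entries carry an "expires" Unix timestamp in seconds, with -1 for session cookies); the source only says the file holds cookies and localStorage, so treat that layout as an assumption. The function name is hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional


def earliest_cookie_expiry(state: dict) -> Optional[datetime]:
    """Return the soonest cookie expiry in UTC, or None if none is dated.

    Assumes Playwright-style storage_state: cookies with 'expires' in
    seconds since the epoch, and -1 meaning "expires with the browser".
    """
    dated = [
        cookie["expires"]
        for cookie in state.get("cookies", [])
        if cookie.get("expires", -1) > 0
    ]
    if not dated:
        return None
    return datetime.fromtimestamp(min(dated), tz=timezone.utc)
```

Load the file with json.load and pass the resulting dict. The date is only an upper bound: a site can invalidate a login server-side well before its cookies expire.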

session check — Confirm with a screenshot

./scrapai session check nytimes_com https://www.nytimes.com/account
./scrapai session check x_com https://x.com/home --wait 8
Loads the session, opens the gated URL in its own browser context, and saves a PNG to ~/.scrapai/sessions/<name>_check.png. Open it to verify you’re actually logged in — you should see the logged-in page (account name in the corner, full article), not the paywall.
--wait (int, default: 3)
Seconds to let the page render before the screenshot. Raise it for JS-heavy SPAs (e.g. --wait 8 for x.com), whose timeline takes several seconds to paint; otherwise you screenshot a loading spinner.
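When several sessions need checking, the command line can be assembled in a script. This sketch uses only the arguments documented above; the helper names are hypothetical, and the screenshot path shown is the documented default location (whether it follows SCRAPAI_SESSIONS_DIR is not stated here, so the helper does not assume it).

```python
from pathlib import Path
from typing import List


def build_check_command(name: str, url: str, wait: int = 3) -> List[str]:
    """Argument list for './scrapai session check <name> <url> [--wait N]'."""
    cmd = ["./scrapai", "session", "check", name, url]
    if wait != 3:  # 3 is the documented default, so the flag is omitted
        cmd += ["--wait", str(wait)]
    return cmd


def default_screenshot_path(name: str) -> Path:
    """Where the PNG is saved by default, per the description above."""
    return Path.home() / ".scrapai" / "sessions" / f"{name}_check.png"
```

Pass the list to subprocess.run from the directory that contains scrapai. The PNG still has to be looked at by a person; nothing here decides whether the page is logged in.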

session list / session remove

./scrapai session list
./scrapai session remove nytimes_com
list prints saved session names (without the .json extension). remove deletes the named session file.

Using a Session in Inspect and Crawl

1. QA the paywall anonymously first

Inspect a gated article without a session. If the full text is already in the HTML (a soft paywall), you need no login at all. If you get a login/paywall stub (a hard paywall), continue.
./scrapai inspect https://www.nytimes.com/section/world --project news
2. Capture the login

./scrapai session login nytimes_com https://www.nytimes.com/
Log in by hand, then close the window.
3. Verify it took

./scrapai session check nytimes_com https://www.nytimes.com/account
Read the PNG — confirm you’re logged in.
4. Inspect gated pages with the session

Pass --session <name> and --browser:
./scrapai inspect https://www.nytimes.com/section/world --session nytimes_com --browser --project news
5. Configure the spider

Set the SESSION setting plus a browser transport so every fetch is authenticated:
spider.json
{
  "settings": {
    "SESSION": "nytimes_com",
    "CLOUDFLARE_ENABLED": true
  }
}
inspect has a --session CLI flag. Crawls and the warm browser service do not — a production crawl reads the session from the spider’s "SESSION" setting, not a command-line flag.

The session only applies in the browser path

A session is a browser storage_state — it takes effect only when the browser fetches the page. This bites in two ways:
inspect escalates HTTP → curl_cffi → browser and stops at the first transport that returns a 200. On a paywalled site, plain HTTP returns the paywall page with a valid 200, so inspect stops there — never reaching the browser, never applying the session. Force it with --session <name> --browser.
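The stop-at-first-200 behaviour is easy to see in a toy model. The code below is a simulation, not scrapai's implementation: the transports are stand-in functions with hypothetical names, the curl_cffi stage is left out for brevity, and the page bodies are made up.

```python
from typing import Callable, List, Optional, Tuple

Fetch = Callable[[str], Tuple[int, str]]


def plain_http(url: str) -> Tuple[int, str]:
    # A paywalled site answers anonymous HTTP with a stub, but still 200.
    return 200, "<title>Subscribe to continue</title>"


def browser_with_session(url: str) -> Tuple[int, str]:
    # Only the browser carries the saved login, so only it sees the article.
    return 200, "<article>Full text</article>"


def first_ok(url: str, transports: List[Tuple[str, Fetch]]) -> Optional[Tuple[str, str]]:
    """Try transports in order and return the first one that answers 200."""
    for name, fetch in transports:
        status, body = fetch(url)
        if status == 200:
            return name, body
    return None


URL = "https://www.nytimes.com/section/world"
default_chain = [("http", plain_http), ("browser", browser_with_session)]
forced_browser = [("browser", browser_with_session)]
```

With default_chain the loop returns the paywall stub from the "http" stage and never reaches the browser; with forced_browser, which mirrors passing --browser, the session-backed fetch is the only candidate.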
In a crawl, the session likewise applies only through a browser transport. Choose one:
  • "CLOUDFLARE_ENABLED": true (hybrid, preferred). The browser authenticates and solves Cloudflare once per host, then curl_cffi replays the login cookies over fast HTTP. Works when the paywall is enforced server-side by cookie (common).
  • "BROWSER_ENABLED": true (full browser). Renders every page in the browser. Correct but slow; use only when the paywall is enforced client-side by JS and the hybrid path returns the stub.
Test hybrid first: crawl with --limit 5. If articles come back with full text, keep it. If they come back as the paywall stub, switch to BROWSER_ENABLED.
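Switching between the two transports is a one-key change in spider.json, which a small helper can make explicit. This is a sketch: the function name is hypothetical, and setting exactly one of the two transport keys at a time is an assumption, since the source does not say how the settings interact when both are present.

```python
import json


def session_settings(session: str, full_browser: bool = False) -> dict:
    """Build the 'settings' block for a session crawl.

    full_browser=False -> hybrid path via CLOUDFLARE_ENABLED (try this first)
    full_browser=True  -> render every page via BROWSER_ENABLED
    """
    transport = "BROWSER_ENABLED" if full_browser else "CLOUDFLARE_ENABLED"
    return {"settings": {"SESSION": session, transport: True}}


def to_spider_json(session: str, full_browser: bool = False) -> str:
    """Serialize the settings block as indented JSON text."""
    return json.dumps(session_settings(session, full_browser), indent=2)
```

Start with to_spider_json("nytimes_com"), run the crawl with --limit 5, and regenerate with full_browser=True only if the articles come back as the paywall stub.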

Session Expiry: Fail Loud

A saved login eventually expires. When it does, every fetch silently reverts to the site’s auth-wall page — which has no article, no date. Under a mandatory-date schema those rows are quarantined, so a crawl can run for hours “succeeding” while collecting nothing. Guard against it with SESSION_EXPIRED_SIGNAL — a string that appears on the site’s auth-wall page:
spider.json
{
  "settings": {
    "SESSION": "ladiaria_com_uy",
    "SESSION_EXPIRED_SIGNAL": "Muro de pago"
  }
}
On a SESSION crawl, if a fetched page contains that string, the login has expired — so the crawl stops on the first hit with an ERROR naming the fix:
[ladiaria_com_uy] SESSION expired (auth-wall detected). Stopping crawl.
Re-run: scrapai session login ladiaria_com_uy
  • Off by default — no signal set, no check, behaviour unchanged.
  • Stop-on-first — a valid session never shows the auth-wall, so the first hit is already proof it’s dead. No threshold, no tolerance window.
  • No auto-recovery — unlike a Cloudflare block (which the crawl re-solves itself), a dead login needs a human. Re-run ./scrapai session login <name> to refresh the file, then restart the crawl.
Finding the signal: inspect a gated article without a session (or after logout) and note a stable string unique to the wall — a page <title> like Muro de pago, a “Suscribite para continuar” banner, etc. Pick something that never appears on a real logged-in article.
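The stop-on-first rule amounts to a substring test on every fetched page. The sketch below models that rule over an in-memory list of pages; it is not scrapai's code, the names are hypothetical, and treating the match as a case-sensitive substring is an assumption based on the wording "contains that string".

```python
from typing import Iterable, List, Optional, Tuple


class SessionExpired(RuntimeError):
    """Raised on the first page that shows the auth-wall signal."""


def crawl_pages(
    pages: Iterable[Tuple[str, str]],
    session: str,
    signal: Optional[str] = None,
) -> List[str]:
    """Return the URLs kept, stopping at the first auth-wall hit.

    With no signal set there is no check, matching the off-by-default rule.
    """
    kept = []
    for url, html in pages:
        if signal and signal in html:  # assumption: case-sensitive substring
            raise SessionExpired(
                f"[{session}] SESSION expired (auth-wall detected) at {url}"
            )
        kept.append(url)
    return kept
```

Note that no threshold appears anywhere: a single matching page ends the run, which is why the signal must be a string that never occurs on a real logged-in article.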

On a Server

session login opens a real browser window, so it needs a display. On a headless server, capture the session on a machine that has a display and copy ~/.scrapai/sessions/<name>.json to the server (re-copy when it expires).

Security

  • Files are 0600 and hold live login cookies — treat them like passwords.
  • Only the session a crawl names is loaded, so each scraped site sees only its own login.
  • scrapai never stores or types your username/password.
