The capture is a by-hand step:
./scrapai session login opens a browser; you log in (including 2FA or SSO) and then close the window. scrapai types nothing.
The Model
Global and per-site
One nytimes_com login serves every spider and project scraping that site. A session is not project-specific.
Only one is loaded
A crawl loads only the session it names — never all of them. Each scraped site’s browser context holds only its own login.
Name a session after the site’s domain (e.g. nytimes_com, thefga_org). Other names work (e.g. nytimes_com_work for a second account on the same site), but the domain is the convention. Names may contain only letters, digits, underscores, and hyphens.
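The naming rule above can be checked mechanically. The following is a minimal sketch, assuming the rule means "one or more ASCII letters, digits, underscores, or hyphens" and that the convention applies to a bare domain such as nytimes.com. Both helper names are hypothetical and are not part of scrapai.

```python
import re

# Assumption: a valid name is one or more ASCII letters, digits,
# underscores, or hyphens, and nothing else.
_SESSION_NAME = re.compile(r"[A-Za-z0-9_-]+")


def is_valid_session_name(name: str) -> bool:
    """Return True if the whole name matches the allowed character set."""
    return _SESSION_NAME.fullmatch(name) is not None


def name_from_domain(domain: str) -> str:
    """Hypothetical helper: turn 'nytimes.com' into 'nytimes_com'."""
    return domain.lower().replace(".", "_")
```

A raw domain such as nytimes.com fails the check because of the dot, which is why the convention swaps dots for underscores.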
Storage
- Sessions are saved to ~/.scrapai/sessions/<name>.json, mode 0600 (readable and writable by the owner only).
- Override the directory with the SCRAPAI_SESSIONS_DIR environment variable.
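To make the storage rules concrete, here is a minimal sketch of how a script might locate a session file and confirm its permissions. It assumes only what is stated above (the default directory, the SCRAPAI_SESSIONS_DIR override, and mode 0600). The function names are hypothetical, and scrapai's own lookup logic may differ.

```python
import os
from pathlib import Path


def session_path(name: str) -> Path:
    """Resolve <dir>/<name>.json, honouring SCRAPAI_SESSIONS_DIR if set."""
    override = os.environ.get("SCRAPAI_SESSIONS_DIR")
    directory = Path(override) if override else Path.home() / ".scrapai" / "sessions"
    return directory / f"{name}.json"


def is_owner_only(path: Path) -> bool:
    """True if the permission bits are exactly 0600 (POSIX systems)."""
    return (path.stat().st_mode & 0o777) == 0o600
```

A check like is_owner_only is useful after copying a session file between machines, since a copy may not preserve the original mode.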
Commands
session login — Capture (by hand)
Optional. The page the browser opens on. Pass the site you want to log in to; if omitted, a built-in demo login page loads.
session check — Confirm with a screenshot
The screenshot is saved to ~/.scrapai/sessions/<name>_check.png. Open it to verify you’re actually logged in: you should see the logged-in page (account name in the corner, full article), not the paywall.
Seconds to let the page render before the screenshot. Raise it for JS-heavy SPAs (e.g. --wait 8 for x.com), whose timeline takes several seconds to paint; otherwise you screenshot a loading spinner.
session list / session remove
list prints saved session names (without the .json extension). remove deletes the named session file.
Using a Session in Inspect and Crawl
QA the paywall anonymously first
Inspect a gated article without a session. If the full text is already in the HTML (a soft paywall), you need no login at all. If you get a login/paywall stub (a hard paywall), continue.
The session only applies in the browser path
A session is a browser storage_state, so it takes effect only when the browser fetches the page. This bites in two ways:
inspect without --browser silently skips the session
inspect escalates HTTP → curl_cffi → browser and stops at the first transport that returns a 200. On a paywalled site, plain HTTP returns the paywall page with a valid 200, so inspect stops there, never reaching the browser and never applying the session. Force it with --session <name> --browser.
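The escalation behaviour is easier to see in a toy model. The sketch below is not scrapai's implementation: the three transport functions are hypothetical stand-ins that return canned (status, body) pairs, and only the "stop at the first 200" rule is taken from the description above.

```python
from typing import Callable, List, Optional, Tuple

# A transport is a (name, fetch) pair; fetch returns (status_code, body).
Transport = Tuple[str, Callable[[str], Tuple[int, str]]]


def fetch_with_escalation(url: str, transports: List[Transport]) -> Optional[Tuple[str, str]]:
    """Try transports in order and return the first one that answers 200."""
    for name, fetch in transports:
        status, body = fetch(url)
        if status == 200:
            return name, body
    return None


# Hypothetical stand-ins for a paywalled site.
def plain_http(url: str) -> Tuple[int, str]:
    return 200, "<html>Subscribe to continue reading</html>"  # paywall stub, valid 200


def curl_cffi_fetch(url: str) -> Tuple[int, str]:
    return 200, "<html>Subscribe to continue reading</html>"


def browser_with_session(url: str) -> Tuple[int, str]:
    return 200, "<html>Full article text</html>"  # only this path carries the login


default_chain = [("http", plain_http), ("curl_cffi", curl_cffi_fetch), ("browser", browser_with_session)]
forced_browser = [("browser", browser_with_session)]
```

With default_chain the loop returns at the "http" entry and the browser is never called, which mirrors why the session is silently skipped. With forced_browser the only transport is the one that holds the login.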
SESSION needs a browser setting to pair with
- "CLOUDFLARE_ENABLED": true (hybrid, preferred). The browser authenticates and solves Cloudflare once per host, then curl_cffi replays the login cookies over fast HTTP. Works when the paywall is enforced server-side by cookie (common).
- "BROWSER_ENABLED": true renders every page in the browser. Correct but slow; use it only when the paywall is enforced client-side by JS and the hybrid path returns the stub.
Start with the hybrid setting and test it on a short crawl with --limit 5. If articles come back with full text, keep it. If they come back as the paywall stub, switch to BROWSER_ENABLED.
Session Expiry: Fail Loud
A saved login eventually expires. When it does, every fetch silently reverts to the site’s auth-wall page, which has no article and no date. Under a mandatory-date schema those rows are quarantined, so a crawl can run for hours “succeeding” while collecting nothing. Guard against it with SESSION_EXPIRED_SIGNAL, a string that appears on the site’s auth-wall page:
spider.json
In a SESSION crawl, if a fetched page contains that string, the login has expired, so the crawl stops on the first hit with an ERROR that names the fix.
- Off by default: no signal set, no check, behaviour unchanged.
- Stop-on-first: a valid session never shows the auth-wall, so the first hit is already proof it’s dead. No threshold, no tolerance window.
- No auto-recovery: unlike a Cloudflare block (which the crawl re-solves itself), a dead login needs a human. Re-run ./scrapai session login <name> to refresh the file, then restart the crawl.
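The three properties above (off by default, stop on first hit, no auto-recovery) can be sketched in a few lines. This is an illustration under stated assumptions, not scrapai's code: SessionExpiredError, check_page, crawl, and the wording of the error message are all hypothetical, and the pages are plain strings rather than real responses.

```python
from typing import Iterable, List, Optional


class SessionExpiredError(RuntimeError):
    """Raised when a fetched page contains the auth-wall signal."""


def check_page(html: str, signal: Optional[str], session: str) -> None:
    # Off by default: with no signal configured, nothing is checked.
    if signal and signal in html:
        raise SessionExpiredError(
            f"session '{session}' has expired; re-run: ./scrapai session login {session}"
        )


def crawl(pages: Iterable[str], signal: Optional[str], session: str) -> List[str]:
    """Collect pages, aborting on the first one that shows the auth wall."""
    kept: List[str] = []
    for html in pages:
        check_page(html, signal, session)  # stop-on-first: no threshold, no retry
        kept.append(html)
    return kept
```

Raising an exception instead of skipping the page is the point of the design: a skipped page would look like a quiet success, whereas an aborted crawl cannot be mistaken for one.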
On a Server
session login opens a real browser window, so it needs a display. On a headless server, capture the session on a machine that has a display and copy ~/.scrapai/sessions/<name>.json to the server (re-copy when it expires).
Security
- Files are 0600 and hold live login cookies; treat them like passwords.
- Only the session a crawl names is loaded, so each scraped site sees only its own login.
- scrapai never stores or types your username/password.
Related Guides
Cloudflare Bypass
Pair a session with hybrid CF bypass
Crawl Command
Run the production crawl