Crawl commands execute spiders from the database. ScrapAI supports two modes: test mode (limited items, saved to database) and production mode (full crawl, exported to files with checkpoint support).

crawl

Run a spider by name.

Syntax

./scrapai crawl <spider> --project <name> [options]

Arguments

spider
string
required
Spider name (from database).

Options

--project
string
required
Project name containing the spider.
--limit, -l
integer
Limit number of items to scrape. Enables test mode when specified.
--output, -o
string
Custom output file path. If not specified, uses timestamped filename in data/<project>/<spider>/crawls/.
--timeout, -t
integer
Maximum runtime in seconds. Triggers graceful shutdown when exceeded.
--proxy-type
choice
default:"auto"
Proxy strategy:
  • auto: Smart escalation with expert-in-the-loop (default)
  • datacenter: Use datacenter proxy explicitly
  • residential: Use residential proxy explicitly
--browser
boolean
Use browser automation for JavaScript-rendered sites and Cloudflare bypass. Enables hybrid mode: browser for initial challenge, then fast HTTP requests.
--reset-deltafetch
boolean
Clear DeltaFetch cache to re-crawl all URLs. Useful when you need to refresh previously scraped content.
--save-html
boolean
Save raw HTML content in output files. Increases file size but useful for debugging or post-processing. Only applies to JSONL exports (production mode).
--scrapy-args
string
Additional Scrapy command-line arguments to pass through. Format: "-s SETTING=value -L DEBUG". Advanced users only.

Test Mode (with --limit)

Limited crawl for testing and verification:
./scrapai crawl bbc_co_uk --project news --limit 5
Behavior:
  • Stops after scraping N items
  • Saves items to database (scraped_items table)
  • No HTML content stored (smaller database)
  • Use show command to view results
  • No checkpoint (starts fresh each time)
Output:
🚀 Running DB spider: bbc_co_uk
🔄 Proxy mode: auto (smart escalation with expert-in-the-loop)
🧪 Test mode: Saving to database (limit: 5 items)
   Use './scrapai show bbc_co_uk' to verify results

[Scrapy crawl output...]

2026-02-28 15:30:42 [scrapy.core.engine] INFO: Spider closed (closespider_itemcount)
Use test mode to verify spider configuration before running a full production crawl.

Production Mode (no limit)

Full crawl with checkpoint support:
./scrapai crawl bbc_co_uk --project news
Behavior:
  • Crawls all matching URLs
  • Exports to timestamped JSONL file: data/news/bbc_co_uk/crawls/crawl_28022026_153042.jsonl
  • Includes full HTML content
  • Database writes disabled (performance)
  • Checkpoint enabled: Press Ctrl+C to pause, run same command to resume
Output:
🚀 Running DB spider: bbc_co_uk
🔄 Proxy mode: auto (smart escalation with expert-in-the-loop)
📁 Production mode: Exporting to files (database disabled)
💾 Checkpoint enabled: data/news/bbc_co_uk/checkpoint
   Press Ctrl+C to pause, run same command to resume
   Output: data/news/bbc_co_uk/crawls/crawl_28022026_153042.jsonl (includes HTML)

[Scrapy crawl output...]

2026-02-28 17:45:30 [scrapy.core.engine] INFO: Spider closed (finished)
 Checkpoint cleaned up (successful completion)

Checkpoint Pause/Resume

Press Ctrl+C during a production crawl to pause:
# Start crawl
./scrapai crawl bbc_co_uk --project news

# Press Ctrl+C after 30 minutes
^C
2026-02-28 16:00:15 [scrapy.core.engine] INFO: Spider closed (shutdown)
Checkpoint is saved at data/news/bbc_co_uk/checkpoint/. Resume by running the same command:
./scrapai crawl bbc_co_uk --project news
🚀 Running DB spider: bbc_co_uk
💾 Checkpoint enabled: data/news/bbc_co_uk/checkpoint
   Resuming from previous crawl...
Checkpoint stores:
  • URL queue state
  • Visited URLs (for deduplication)
  • Spider state variables
  • Proxy type used (if proxy type changes, checkpoint is cleared)
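The checkpoint behaves much like Scrapy's JOBDIR persistence: a pending request queue plus a set of visited URLs, written to disk so a later run can pick up where the previous one stopped. A minimal Python sketch of the idea — the file name and structure here are illustrative, not ScrapAI's actual on-disk format:

```python
import json
import os

# Hypothetical checkpoint file; ScrapAI's real layout may differ.
CHECKPOINT = "checkpoint_state.json"

def save_checkpoint(queue, seen):
    """Persist the pending URL queue and the visited-URL set."""
    with open(CHECKPOINT, "w", encoding="utf-8") as f:
        json.dump({"queue": queue, "seen": sorted(seen)}, f)

def load_checkpoint():
    """Resume from a previous run, or start fresh if no checkpoint exists."""
    if not os.path.exists(CHECKPOINT):
        return [], set()
    with open(CHECKPOINT, encoding="utf-8") as f:
        state = json.load(f)
    return state["queue"], set(state["seen"])

# Simulated pause: two URLs already crawled, one still queued.
save_checkpoint(["https://example.com/page3"],
                {"https://example.com/page1", "https://example.com/page2"})

# Simulated resume: the pending queue and dedup set come back intact.
queue, seen = load_checkpoint()
print(queue)
print(len(seen))
os.remove(CHECKPOINT)  # analogous to cleanup after successful completion
```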

Proxy Modes

Auto (Smart Escalation)

Default behavior. Starts with direct connections, escalates to proxy if blocked:
./scrapai crawl myspider --project myproject --proxy-type auto
  • First request: Direct connection
  • If blocked (403/429): Retry via datacenter proxy
  • Domain remembered for subsequent crawls
  • Residential proxy requires explicit opt-in

Datacenter Proxy

Use datacenter proxy for all requests:
./scrapai crawl myspider --project myproject --proxy-type datacenter
Requires proxy configuration in .env:
DATACENTER_PROXY_USERNAME=your_username
DATACENTER_PROXY_PASSWORD=your_password
DATACENTER_PROXY_HOST=proxy.example.com
DATACENTER_PROXY_PORT=10000
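These values are typically combined into a standard http://user:pass@host:port proxy URL. How ScrapAI composes them internally is an assumption here; for illustration:

```python
import os

# Hypothetical sketch: composing the .env values into a conventional
# proxy URL. ScrapAI's actual handling of these variables may differ.
os.environ.setdefault("DATACENTER_PROXY_USERNAME", "your_username")
os.environ.setdefault("DATACENTER_PROXY_PASSWORD", "your_password")
os.environ.setdefault("DATACENTER_PROXY_HOST", "proxy.example.com")
os.environ.setdefault("DATACENTER_PROXY_PORT", "10000")

proxy_url = "http://{}:{}@{}:{}".format(
    os.environ["DATACENTER_PROXY_USERNAME"],
    os.environ["DATACENTER_PROXY_PASSWORD"],
    os.environ["DATACENTER_PROXY_HOST"],
    os.environ["DATACENTER_PROXY_PORT"],
)
print(proxy_url)
```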

Residential Proxy

Use residential proxy for all requests (expensive, use sparingly):
./scrapai crawl myspider --project myproject --proxy-type residential
Requires residential proxy configuration in .env.
Changing proxy type mid-crawl clears the checkpoint to ensure all URLs are retried with the new proxy.

Timeout

Set maximum runtime for long crawls:
# Max 2 hours (7200 seconds)
./scrapai crawl myspider --project myproject --timeout 7200
Output:
⏱️  Max runtime: 2.0 hours (graceful stop)
When timeout is reached, Scrapy triggers graceful shutdown (finishes in-flight requests, saves checkpoint).

Cloudflare Bypass

For spiders with CLOUDFLARE_ENABLED: true in settings:
./scrapai crawl cloudflare_spider --project myproject
Linux (headless server):
🖥️  Headless environment detected - using xvfb for Cloudflare bypass
Requires xvfb installed:
sudo apt-get install xvfb
ScrapAI automatically wraps the command with xvfb-run -a when needed.
macOS/Windows:
🖥️  Display available - using native browser for Cloudflare bypass
Uses system display for browser automation.

Sitemap Spider

For spiders with USE_SITEMAP: true in settings:
🗺️  Using sitemap spider
Crawls from XML sitemap instead of following links. Faster for sites with good sitemaps.

crawl-all

Run all active spiders in a project sequentially.

Syntax

./scrapai crawl-all --project <name> [--limit <N>]

Options

--project
string
required
Project name.
--limit, -l
integer
Limit items per spider (test mode).

Example

./scrapai crawl-all --project news --limit 10
Output:
🚀 Running all spiders for project: news
🕷️  Spiders: bbc_co_uk, cnn_com, reuters_com

==================================================
Running: bbc_co_uk
==================================================
🚀 Running DB spider: bbc_co_uk
[...]

==================================================
Running: cnn_com
==================================================
🚀 Running DB spider: cnn_com
[...]

==================================================
Running: reuters_com
==================================================
🚀 Running DB spider: reuters_com
[...]
For parallel execution, use GNU parallel with bin/parallel-crawl <project>.

Output Formats

JSONL (Production)

Each line is a JSON object:
{"url": "https://bbc.co.uk/news/article-1", "title": "Article Title", "content": "...", "html": "<!DOCTYPE html>...", "scraped_at": "2026-02-28T15:30:42"}
{"url": "https://bbc.co.uk/news/article-2", "title": "Another Article", "content": "...", "html": "<!DOCTYPE html>...", "scraped_at": "2026-02-28T15:31:15"}
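Because each line is an independent JSON object, exports can be processed line by line without loading the whole file into memory. A minimal reader sketch (field names follow the sample records above):

```python
import json
import os

def iter_crawl_items(path):
    """Yield one dict per non-empty line of a JSONL crawl export."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Usage with a tiny sample file (abbreviated records for illustration):
sample = (
    '{"url": "https://bbc.co.uk/news/article-1", "title": "Article Title"}\n'
    '{"url": "https://bbc.co.uk/news/article-2", "title": "Another Article"}\n'
)
with open("sample_crawl.jsonl", "w", encoding="utf-8") as f:
    f.write(sample)

titles = [item["title"] for item in iter_crawl_items("sample_crawl.jsonl")]
print(titles)  # ['Article Title', 'Another Article']
os.remove("sample_crawl.jsonl")
```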

Database (Test Mode)

Stored in scraped_items table:
SELECT id, url, title, scraped_at 
FROM scraped_items 
WHERE spider_id = (SELECT id FROM spiders WHERE name = 'bbc_co_uk');

Performance Tips

Concurrent Requests

Adjust in spider settings:
{
  "settings": {
    "CONCURRENT_REQUESTS": 16
  }
}
Default: 8. Higher values = faster crawls, but risk IP bans.

Download Delay

Polite crawling delay between requests:
{
  "settings": {
    "DOWNLOAD_DELAY": 2
  }
}
Default: 0 (no delay). Recommended: 1-3 seconds for most sites, 5+ seconds for polite crawling of rate-sensitive sites.

Incremental Crawling

Enable DeltaFetch to skip already-scraped URLs:
{
  "settings": {
    "DELTAFETCH_ENABLED": true
  }
}
Typically reduces bandwidth by 80-90% on routine re-crawls, since only new or changed URLs are fetched.
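DeltaFetch works by fingerprinting each request and skipping any fingerprint recorded in a previous run. A simplified sketch of that filtering idea (not the extension's actual implementation):

```python
import hashlib

def fingerprint(url):
    """Stable key for a request; DeltaFetch uses a similar request hash."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()

# Fingerprints persisted from an earlier crawl (illustrative).
seen = {fingerprint("https://example.com/old-article")}

def should_fetch(url):
    """Skip URLs whose fingerprint was stored by a previous run."""
    return fingerprint(url) not in seen

urls = ["https://example.com/old-article", "https://example.com/new-article"]
to_fetch = [u for u in urls if should_fetch(u)]
print(to_fetch)  # only the new article is fetched
```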

Troubleshooting

Spider Not Found

 Spider 'myspider' not found in database.
Solution: Import the spider first:
./scrapai spiders import myspider.json --project myproject
./scrapai spiders list --project myproject

Permission Denied (Output File)

PermissionError: [Errno 13] Permission denied: 'data/news/spider/crawls/crawl.jsonl'
Solution: Check DATA_DIR permissions:
ls -la data/
chmod -R u+w data/

Cloudflare Bypass Failed

⚠️  WARNING: Cloudflare bypass enabled but no display available and xvfb not installed
Solution (Linux):
sudo apt-get install xvfb

Checkpoint Corruption

If resume fails with errors:
rm -rf data/<project>/<spider>/checkpoint
./scrapai crawl <spider> --project <project>

Next Steps