Optional queue system for managing multiple website crawls. Use it only when explicitly needed, and always specify --project when working with queues.

Commands Reference

Add Single URL

./scrapai queue add <url> --project NAME [-m "instruction"] [--priority N]
Example:
./scrapai queue add https://example.com --project myproject -m "Focus on research articles" --priority 5

Bulk Add URLs

./scrapai queue bulk <file.csv|file.json> --project NAME [--priority N]
Example:
./scrapai queue bulk sites.json --project myproject

List Queue Items

./scrapai queue list --project NAME
Shows 5 pending/processing items by default

Claim Next Item

./scrapai queue next --project NAME
Claims next highest-priority pending item and marks it as processing.

Update Item Status

IDs are globally unique, so no --project is needed for status updates:
./scrapai queue complete <id>
./scrapai queue fail <id> -m "error description"

Cleanup Queue

Cleanup commands require the --force flag to prevent accidental deletion:
./scrapai queue cleanup --completed --force --project NAME

Bulk File Formats

JSON Format

queue.json
[
  {
    "url": "https://site1.com",
    "custom_instruction": "Focus on research articles",
    "priority": 5
  },
  {
    "url": "https://site2.com",
    "priority": 10
  },
  {
    "url": "https://site3.com",
    "custom_instruction": "Include all news sections",
    "priority": 1
  }
]
Template available: templates/queue-template.json
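The bulk JSON file can also be generated programmatically. A minimal Python sketch (the entries mirror the example above; only "url" is required, while "custom_instruction" and "priority" are optional):

```python
import json
import os
import tempfile

# Entries in the bulk-add JSON format shown above.
entries = [
    {"url": "https://site1.com", "custom_instruction": "Focus on research articles", "priority": 5},
    {"url": "https://site2.com", "priority": 10},
]

# Write the queue file (a temp directory is used here for illustration).
path = os.path.join(tempfile.gettempdir(), "queue.json")
with open(path, "w") as f:
    json.dump(entries, f, indent=2)

# Sanity check before running `queue bulk`: every entry needs a "url".
with open(path) as f:
    loaded = json.load(f)
assert all("url" in e for e in loaded)
```

The resulting file is then passed to ./scrapai queue bulk as shown above.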

CSV Format

queue.csv
url,custom_instruction,priority
https://site1.com,Focus on research articles,5
https://site2.com,,10
https://site3.com,Include all news sections,1
Template available: templates/queue-template.csv
Required columns:
  • url - Website URL (required)
  • custom_instruction - Special instructions for this site (optional)
  • priority - Higher number = higher priority (optional, default: 0)
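The same column rules can be checked programmatically. A small sketch that writes a bulk CSV and validates it (url required, blank priority falling back to the documented default of 0):

```python
import csv
import os
import tempfile

# Rows in the bulk-add CSV format shown above; csv values are strings.
rows = [
    {"url": "https://site1.com", "custom_instruction": "Focus on research articles", "priority": "5"},
    {"url": "https://site2.com", "custom_instruction": "", "priority": "10"},
    {"url": "https://site3.com", "custom_instruction": "Include all news sections", "priority": ""},
]

path = os.path.join(tempfile.gettempdir(), "queue.csv")
with open(path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["url", "custom_instruction", "priority"])
    writer.writeheader()
    writer.writerows(rows)

# Validate: url is required; a blank priority defaults to 0.
with open(path, newline="") as f:
    parsed = [(r["url"], int(r["priority"] or 0)) for r in csv.DictReader(f)]
for url, _ in parsed:
    assert url, "url column is required"
```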

Queue Processing Workflow

Step 1: Claim next item

./scrapai queue next --project NAME
Note the returned values:
  • ID (for status updates)
  • URL (target website)
  • Project name
  • Custom instruction (if exists)
Step 2: Process custom instruction

If custom_instruction exists, use it to override the default analysis behavior. Example: the instruction “Focus only on research articles” means:
  • Only create rules for research article URL patterns
  • Skip other content sections
Step 3: Run analysis phases

Execute the standard analysis workflow (Phases 1-4). Include "source_url": "<queue_url>" in final_spider.json.
Step 4: Import spider

./scrapai import final_spider.json --project myproject
Always use a --project matching the queue item’s project! Without it, the spider defaults to the “default” project, mixing data across projects.
Step 5: Run crawl

./scrapai crawl spider_name --project myproject
Step 6: Update status on success

./scrapai queue complete <id>
Step 7: Update status on failure

./scrapai queue fail <id> -m "error description"
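The lifecycle above (pending → processing → completed/failed, highest priority claimed first) can be sketched as a small in-memory model. This is not the real scrapai implementation; it only mirrors the documented command semantics:

```python
import heapq
import itertools

class Queue:
    """Minimal model of the queue lifecycle: add -> next -> complete/fail."""

    def __init__(self):
        self._heap = []                 # (-priority, insertion order, id): higher priority pops first
        self._order = itertools.count()
        self.items = {}                 # id -> item dict with a "status" field

    def add(self, item_id, url, priority=0, instruction=None):
        self.items[item_id] = {"url": url, "priority": priority,
                               "instruction": instruction, "status": "pending"}
        heapq.heappush(self._heap, (-priority, next(self._order), item_id))

    def next(self):
        """`queue next`: claim the highest-priority pending item, mark it processing."""
        while self._heap:
            _, _, item_id = heapq.heappop(self._heap)
            item = self.items[item_id]
            if item["status"] == "pending":
                item["status"] = "processing"
                return item_id, item
        return None

    def complete(self, item_id):
        """`queue complete <id>`"""
        self.items[item_id]["status"] = "completed"

    def fail(self, item_id, message):
        """`queue fail <id> -m "..."`"""
        self.items[item_id].update(status="failed", error=message)

q = Queue()
q.add(1, "https://techcrunch.com", priority=10, instruction="Focus on AI and startup news")
q.add(2, "https://arstechnica.com", priority=3)

claimed_id, item = q.next()   # highest priority claimed first
q.complete(claimed_id)        # analysis + crawl succeeded

failed_id, _ = q.next()
q.fail(failed_id, "Site requires authentication")
print(claimed_id, q.items[1]["status"], q.items[2]["status"])  # 1 completed failed
```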

Custom Instruction Handling

When queue next returns a custom_instruction, use it to override default analysis behavior. Examples:
Instruction: “Focus only on research articles”
Action:
  • Only create rules for research article URL patterns
  • Skip other content sections (news, blog posts, etc.)
  • Narrow down link-following rules
Custom instructions allow per-site customization without modifying the base analysis workflow.
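One way instruction handling can narrow link-following rules is sketched below: instruction keywords select which URL patterns are kept. The keyword table here is invented for illustration; the real analysis derives patterns from the site itself.

```python
import re

# Hypothetical mapping from instruction keywords to URL patterns.
INSTRUCTION_PATTERNS = {
    "research articles": r"/research/",
    "news": r"/news/",
}

def allow_patterns(instruction):
    """Keep only the URL patterns mentioned by the instruction (all, if none given)."""
    if not instruction:
        return list(INSTRUCTION_PATTERNS.values())
    return [pat for key, pat in INSTRUCTION_PATTERNS.items()
            if key in instruction.lower()]

patterns = allow_patterns("Focus only on research articles")
urls = ["https://example.com/research/paper-1",
        "https://example.com/news/today",
        "https://example.com/blog/post"]
followed = [u for u in urls if any(re.search(p, u) for p in patterns)]
print(followed)  # ['https://example.com/research/paper-1']
```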

Direct vs Queue Processing

Understand when to use queue vs direct processing:
User says: “Add this website: X” or “Process X”
Action:
  • Process immediately (run Phases 1-4)
  • No queue involved
  • Import and optionally test crawl
# Example flow
./scrapai inspect https://example.com --project proj
# ... run analysis phases ...
./scrapai import final_spider.json --project proj
./scrapai crawl spider_name --project proj --limit 10

Project Isolation

CRITICAL: NEVER omit --project when importing spiders from the queue. Without it, the spider defaults to the “default” project, mixing data across projects. Always match the queue item’s project.
Correct:
./scrapai import final_spider.json --project myproject
./scrapai crawl spider_name --project myproject
Incorrect:
./scrapai import final_spider.json  # Defaults to "default" project!
./scrapai crawl spider_name  # Defaults to "default" project!

Complete Example

1. Create Queue File

sites.json
[
  {
    "url": "https://techcrunch.com",
    "custom_instruction": "Focus on AI and startup news",
    "priority": 10
  },
  {
    "url": "https://theverge.com",
    "custom_instruction": "All technology articles",
    "priority": 5
  },
  {
    "url": "https://arstechnica.com",
    "priority": 3
  }
]

2. Bulk Add to Queue

./scrapai queue bulk sites.json --project tech_news

3. Check Queue

./scrapai queue list --project tech_news
Output:
Queue items for project 'tech_news':

ID: 1
URL: https://techcrunch.com
Status: pending
Priority: 10
Custom instruction: Focus on AI and startup news
Created: 2024-02-24 10:30:00

... (2 more items)

4. Process Queue

# Claim next (highest priority)
./scrapai queue next --project tech_news

# Output:
# Claimed queue item #1
# Project: tech_news
# URL: https://techcrunch.com
# Custom instruction: Focus on AI and startup news
# Priority: 10

# Run analysis workflow (Phases 1-4)
./scrapai inspect https://techcrunch.com --project tech_news
# ... complete analysis phases ...

# Import spider
./scrapai import final_spider.json --project tech_news

# Test crawl
./scrapai crawl techcrunch --project tech_news --limit 10

# Mark as complete
./scrapai queue complete 1

5. Handle Failure

# If crawl fails
./scrapai queue fail 1 -m "Site requires authentication"

# Check failed items
./scrapai queue list --project tech_news --status failed

# Retry if needed
./scrapai queue retry 1

6. Cleanup

# Remove completed items
./scrapai queue cleanup --completed --force --project tech_news

# Or remove all
./scrapai queue cleanup --all --force --project tech_news

Statistics

Check queue statistics:
# Count by status
./scrapai queue list --project tech_news --status pending --count
./scrapai queue list --project tech_news --status completed --count
./scrapai queue list --project tech_news --status failed --count

# View all items
./scrapai queue list --project tech_news --all --limit 100

Best Practices

1. Use descriptive custom instructions

Help guide the analysis process with clear instructions:
./scrapai queue add https://example.com --project proj -m "Focus on research papers, ignore blog posts"
2. Set meaningful priorities

Higher numbers processed first:
  • 10: Urgent, high-value sites
  • 5: Standard priority
  • 1: Low priority, process when available
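The ordering these numbers imply is a simple descending sort, sketched below (the item values are illustrative):

```python
# "Higher number = higher priority": sort descending to get processing order.
items = [
    {"url": "https://a.com", "priority": 1},
    {"url": "https://b.com", "priority": 10},
    {"url": "https://c.com", "priority": 5},
]
processing_order = sorted(items, key=lambda i: i["priority"], reverse=True)
print([i["priority"] for i in processing_order])  # [10, 5, 1]
```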
3. Always specify project

Prevent data mixing across projects:
./scrapai import spider.json --project myproject
./scrapai crawl spider --project myproject
4. Document failures

Help debug issues later:
./scrapai queue fail <id> -m "Requires JavaScript rendering - use --browser flag"
5. Regular cleanup

Remove old completed items:
./scrapai queue cleanup --completed --force --project myproject