Data commands allow you to inspect scraped items from test crawls and export them to various formats (CSV, JSON, JSONL, Parquet).

show

View scraped items from the database.

Syntax

./scrapai show <spider> --project <name> [options]

Arguments

spider
string
required
Spider name.

Options

--project
string
required
Project name.
--limit, -l
integer
default: 5
Number of items to display.
--url
string
Filter by URL pattern (case-insensitive substring match).
--title
string
Search in titles only (case-insensitive).
--text, -t
string
Search in both title and content (case-insensitive).

Examples

# Show last 5 items
./scrapai show bbc_co_uk --project news

# Show last 10 items
./scrapai show bbc_co_uk --project news --limit 10

# Filter by URL pattern
./scrapai show bbc_co_uk --project news --url "/technology/"

# Search titles
./scrapai show bbc_co_uk --project news --title "climate"

# Search title and content
./scrapai show bbc_co_uk --project news --text "election"

Output

Article Items (Generic Extractors)

$ ./scrapai show bbc_co_uk --project news --limit 3
📰 Showing 3 articles from 'bbc_co_uk':

🔸 [1] UK economy grows 0.4% in February
   📅 Published: 2026-02-28 | Scraped: 2026-02-28 15:30
   🔗 https://bbc.co.uk/news/business-123456
   ✍️  By Economics Reporter
   📝 The UK economy grew by 0.4% in February, official figures show, 
        beating economists' expectations of 0.2% growth...

🔸 [2] NASA announces Mars mission timeline
   📅 Published: 2026-02-28 | Scraped: 2026-02-28 15:32
   🔗 https://bbc.co.uk/news/science-789012
   📝 NASA has unveiled an ambitious timeline for its first crewed mission 
        to Mars, targeting a 2035 launch date...

🔸 [3] New AI regulations announced
   📅 Published: 2026-02-27 | Scraped: 2026-02-28 15:35
   🔗 https://bbc.co.uk/news/technology-345678
   ✍️  By Tech Correspondent
   📝 The government has announced new regulations for artificial intelligence 
        systems, focusing on transparency and accountability...

Callback Items (Custom Extractors)

$ ./scrapai show ecommerce_spider --project shop --limit 2
📰 Showing 2 articles from 'ecommerce_spider':

🔸 [1] parse_product item
   📅 Scraped: 2026-02-28 16:10
   🔗 https://shop.example.com/products/widget-pro-3000
 title: Widget Pro 3000
 price: $299.99
 rating: 4.5
 reviews_count: 1,247
 availability: In Stock

export

Export scraped items to a file in CSV, JSON, JSONL, or Parquet format.

Syntax

./scrapai export <spider> --project <name> --format <fmt> [options]

Arguments

spider
string
required
Spider name.

Options

--project
string
required
Project name.
--format, -f
choice
required
Export format: csv, json, jsonl, parquet.
--output, -o
string
Custom output file path. If not specified, uses timestamped filename in data/<project>/<spider>/exports/.
--limit, -l
integer
Limit number of items to export.
--url
string
Filter by URL pattern.
--title
string
Filter by title.
--text, -t
string
Filter by title or content.

Examples

# Export to CSV (default location)
./scrapai export bbc_co_uk --project news --format csv

# Export to JSON with custom path
./scrapai export bbc_co_uk --project news --format json --output my_data.json

# Export to Parquet (requires pandas)
./scrapai export bbc_co_uk --project news --format parquet

# Export filtered items
./scrapai export bbc_co_uk --project news --format jsonl --title "climate"

# Export limited items
./scrapai export bbc_co_uk --project news --format csv --limit 100

Output

$ ./scrapai export bbc_co_uk --project news --format csv
 Exported 247 articles to CSV: data/news/bbc_co_uk/exports/export_28022026_153042.csv

Export Formats

CSV

id,url,title,content,author,published_date,scraped_at
1,https://bbc.co.uk/news/business-123456,"UK economy grows 0.4%","The UK economy...","Economics Reporter",2026-02-28,2026-02-28T15:30:42
2,https://bbc.co.uk/news/science-789012,"NASA announces Mars mission","NASA has unveiled...","",2026-02-28,2026-02-28T15:32:15
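
A CSV export in this layout can be read back with Python's standard csv module. A minimal sketch using an inline sample instead of a real export file:

```python
import csv
import io

# Inline sample matching the CSV export layout shown above
sample = (
    "id,url,title,content,author,published_date,scraped_at\n"
    '1,https://bbc.co.uk/news/business-123456,"UK economy grows 0.4%",'
    '"The UK economy...","Economics Reporter",2026-02-28,2026-02-28T15:30:42\n'
)

# DictReader maps each row to a dict keyed by the header fields
rows = list(csv.DictReader(io.StringIO(sample)))
print(rows[0]["title"])  # access any column by name
```

Note that csv yields every value as a string, so numeric fields like id need an explicit conversion.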

JSON

Pretty-printed JSON array:
[
  {
    "id": 1,
    "url": "https://bbc.co.uk/news/business-123456",
    "title": "UK economy grows 0.4%",
    "content": "The UK economy...",
    "author": "Economics Reporter",
    "published_date": "2026-02-28",
    "scraped_at": "2026-02-28T15:30:42"
  },
  {
    "id": 2,
    "url": "https://bbc.co.uk/news/science-789012",
    "title": "NASA announces Mars mission",
    "content": "NASA has unveiled...",
    "author": null,
    "published_date": "2026-02-28",
    "scraped_at": "2026-02-28T15:32:15"
  }
]

JSONL (JSON Lines)

{"id": 1, "url": "https://bbc.co.uk/news/business-123456", "title": "UK economy grows 0.4%", "scraped_at": "2026-02-28T15:30:42"}
{"id": 2, "url": "https://bbc.co.uk/news/science-789012", "title": "NASA announces Mars mission", "scraped_at": "2026-02-28T15:32:15"}
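
Because JSONL stores one record per line, exports in this format can be parsed incrementally without loading the whole file. A minimal sketch with an inline sample:

```python
import io
import json

# Two lines matching the JSONL sample above (fields trimmed for brevity)
sample = (
    '{"id": 1, "title": "UK economy grows 0.4%"}\n'
    '{"id": 2, "title": "NASA announces Mars mission"}\n'
)

# One json.loads call per line; this pattern also works on an open file handle
items = [json.loads(line) for line in io.StringIO(sample) if line.strip()]
print(items[1]["title"])
```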

Parquet

Requires pandas and pyarrow:
.venv/bin/pip install pandas pyarrow
./scrapai export bbc_co_uk --project news --format parquet

Default Export Location

data/<project>/<spider>/exports/export_<timestamp>.<format>
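
A script that consumes exports can locate the newest file under this default layout. A sketch (the latest_export helper is illustrative, not part of scrapai):

```python
import os
import tempfile
from pathlib import Path
from typing import Optional

def latest_export(project: str, spider: str, base: str = "data") -> Optional[Path]:
    """Return the most recently modified file in the exports directory, or None."""
    exports = Path(base) / project / spider / "exports"
    if not exports.is_dir():
        return None
    files = sorted(exports.glob("export_*"), key=lambda p: p.stat().st_mtime)
    return files[-1] if files else None

# Demo against a throwaway tree mirroring the default layout
base = tempfile.mkdtemp()
exports = Path(base) / "news" / "bbc_co_uk" / "exports"
exports.mkdir(parents=True)
(exports / "export_old.csv").touch()
(exports / "export_new.csv").touch()
os.utime(exports / "export_old.csv", (0, 0))  # force an older mtime
print(latest_export("news", "bbc_co_uk", base=base))
```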

Database Storage

scraped_items Table

Items are stored in this table during test crawls:
CREATE TABLE scraped_items (
    id INTEGER PRIMARY KEY,
    spider_id INTEGER NOT NULL,
    url VARCHAR(2048) NOT NULL,
    title TEXT,
    content TEXT,
    author VARCHAR(255),
    published_date TIMESTAMP,
    scraped_at TIMESTAMP NOT NULL,
    metadata_json JSON,
    FOREIGN KEY (spider_id) REFERENCES spiders(id) ON DELETE CASCADE
);
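
Assuming the database is SQLite (the DDL above is SQLite-compatible), items can also be queried directly. This sketch uses an in-memory database with a trimmed copy of the schema, not the tool's real database file:

```python
import sqlite3

# In-memory database with a trimmed copy of the documented schema
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE scraped_items (
        id INTEGER PRIMARY KEY,
        spider_id INTEGER NOT NULL,
        url VARCHAR(2048) NOT NULL,
        title TEXT,
        scraped_at TIMESTAMP NOT NULL
    )"""
)
conn.execute(
    "INSERT INTO scraped_items (spider_id, url, title, scraped_at) "
    "VALUES (?, ?, ?, ?)",
    (1, "https://bbc.co.uk/news/business-123456",
     "UK economy grows 0.4%", "2026-02-28T15:30:42"),
)

# Mirror what `show` does by default: newest items first, capped at 5
row = conn.execute(
    "SELECT title FROM scraped_items WHERE spider_id = ? "
    "ORDER BY scraped_at DESC LIMIT 5",
    (1,),
).fetchone()
print(row[0])
```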

Standard vs. Custom Fields

Standard fields (from newspaper/trafilatura extractors):
  • Stored in dedicated columns: title, content, author, published_date
Custom fields (from callback extractors):
  • Stored in metadata_json column as JSON
  • Includes _callback key to identify which callback was used
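
Custom fields can be recovered by decoding the metadata_json column. In this sketch the product fields are illustrative, while the _callback key follows the convention above:

```python
import json

# A metadata_json payload as a callback extractor might store it;
# the product fields are illustrative, _callback is the documented key
raw = '{"_callback": "parse_product", "price": "$299.99", "rating": 4.5}'

meta = json.loads(raw)
custom_fields = {k: v for k, v in meta.items() if not k.startswith("_")}
print(meta["_callback"], custom_fields)
```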

Data Retention

Test Crawls: Data is stored in the database until the spider is deleted (cascading delete).
Production Crawls: Data is exported to JSONL files in data/<project>/<spider>/crawls/ (not stored in the database).

Troubleshooting

Spider Not Found: Verify the project and spider name with ./scrapai spiders list --project <name>.
No Items Found: Run a test crawl first; production crawls save to JSONL files, not the database.
Parquet Export Error: Install the dependencies: .venv/bin/pip install pandas pyarrow

Next Steps

Database Commands

Advanced queries and database management

Inspection

Analyze websites before scraping