The inspect command fetches and analyzes a website to help you understand its structure and build scraper configurations. It supports three modes: lightweight HTTP, browser-based (for JavaScript sites), and Cloudflare bypass.

inspect

Inspect a website URL.

Syntax

./scrapai inspect <url> [options]

Arguments

url
string
required
Website URL to inspect.

Options

--project
string
default:"default"
Project name (used for saving analysis files).
--output-dir
string
Directory to save analysis files. Defaults to data/<project>/inspect/.
--proxy-type
choice
default:"auto"
Proxy type: none, static, residential, auto.
--no-save-html
flag
Do not save the full HTML to disk.
--browser
flag
Use browser automation for JavaScript-rendered sites and Cloudflare bypass. Automatically handles browser challenges and renders dynamic content.
--log-level
choice
default:"info"
Logging level: debug, info, warning, error, critical.
--log-file
string
Path to log file.

Modes

HTTP Mode (Default)

Lightweight HTTP fetch using the requests library:
./scrapai inspect https://example.com --project myproject
Use for:
  • Simple websites with server-side rendering
  • Static HTML sites
  • Fastest inspection method
Output:
 Using lightweight HTTP fetch

Fetching: https://example.com
Status: 200
Content-Type: text/html; charset=utf-8
Content-Length: 45,231 bytes

Analysis saved to: data/myproject/inspect/example_com/
 page.html (full HTML)
 metadata.json (headers, status, timing)

Browser Mode

Uses Playwright for JavaScript-heavy sites:
./scrapai inspect https://spa-site.com --project myproject --browser
Use for:
  • Single-page applications (React, Vue, Angular)
  • Sites with JavaScript-rendered content
  • Dynamic content loading
Output:
🌐 Using browser for JS-rendered content

Launching browser...
Navigating to: https://spa-site.com
Waiting for page load...
Content rendered!

Analysis saved to: data/myproject/inspect/spa_site_com/
 page.html (rendered HTML after JS execution)
 screenshot.png (page screenshot)
 metadata.json
Browser mode waits for JavaScript to execute and renders the final DOM. This is the HTML you should analyze for extraction selectors.
The --browser flag automatically handles both JavaScript rendering and Cloudflare bypass when needed. Additional browser mode features:
  • Automatic Cloudflare challenge detection and bypass
  • Cookie extraction for session persistence
  • Browser fingerprinting resistance
Linux (headless server): Requires xvfb for browser automation:
xvfb-run -a ./scrapai inspect https://protected-site.com --project myproject --browser
Install xvfb if needed:
sudo apt-get install xvfb
Output when Cloudflare is detected:
🖥️  Browser mode enabled

Solving Cloudflare challenge...
Challenge solved!
Extracting cookies...

Analysis saved to: data/myproject/inspect/protected_site_com/
 page.html (final HTML after bypass)
 cookies.json (Cloudflare session cookies)
 metadata.json

Saved Files

Inspection saves files to data/<project>/inspect/<domain>/:
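The directory name is derived from the URL's domain. As a minimal sketch (the slug scheme of replacing dots and hyphens with underscores is inferred from the examples on this page, e.g. `spa-site.com` becoming `spa_site_com`; the `inspect_dir` helper is hypothetical, not part of scrapai):

```python
import re
from pathlib import Path
from urllib.parse import urlparse

def inspect_dir(url: str, project: str = "default", base: str = "data") -> Path:
    """Guess the analysis directory for a URL. The slug scheme (dots and
    hyphens replaced by underscores) is inferred from the examples above."""
    domain = urlparse(url).netloc
    slug = re.sub(r"[^A-Za-z0-9]", "_", domain)
    return Path(base) / project / "inspect" / slug

print(inspect_dir("https://example.com", "myproject").as_posix())
# data/myproject/inspect/example_com
```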

page.html

Full HTML content:
  • HTTP mode: Raw HTML from server
  • Browser mode: Rendered HTML after JavaScript execution, with automatic Cloudflare bypass when detected
Use this file with the analyze command to discover CSS selectors.

metadata.json

Request metadata:
{
  "url": "https://example.com",
  "status_code": 200,
  "headers": {
    "content-type": "text/html; charset=utf-8",
    "server": "nginx",
    "content-length": "45231"
  },
  "fetch_time": "2026-02-28T15:30:42",
  "elapsed_seconds": 0.823,
  "mode": "http"
}
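Because the metadata records status code and headers, you can script a quick check of saved inspections to decide which sites warrant a `--browser` re-run. A sketch (the `needs_browser` heuristic is illustrative, not scrapai's own escalation logic):

```python
import json

def needs_browser(metadata: dict) -> bool:
    """Hypothetical helper: decide from saved metadata whether the site
    probably needs --browser. A 403/503 served by Cloudflare is the usual
    challenge signature; this is a heuristic, not scrapai's own logic."""
    status = metadata.get("status_code", 0)
    server = metadata.get("headers", {}).get("server", "").lower()
    return status in (403, 503) and "cloudflare" in server

meta = json.loads('{"status_code": 200, "headers": {"server": "nginx"}}')
print(needs_browser(meta))  # False
```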

cookies.json (Browser mode with Cloudflare)

Cloudflare session cookies:
{
  "cf_clearance": "abc123...",
  "__cf_bm": "xyz789...",
  "expires_at": "2026-02-28T15:40:42"
}
These cookies can be reused in spider settings to bypass Cloudflare without launching a browser on every request.
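For example, a saved cookies.json can be folded into a Cookie request header for plain HTTP requests. A sketch, assuming the file shape shown above (the `expires_at` field is metadata, not a cookie, so it is skipped):

```python
import json

def cookie_header(saved: dict) -> str:
    """Build a Cookie request header from a saved cookies.json. Assumes the
    file shape shown above; "expires_at" is metadata, not a cookie."""
    return "; ".join(f"{k}={v}" for k, v in saved.items() if k != "expires_at")

saved = json.loads(
    '{"cf_clearance": "abc123", "__cf_bm": "xyz789", "expires_at": "2026-02-28T15:40:42"}'
)
print(cookie_header(saved))  # cf_clearance=abc123; __cf_bm=xyz789
```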

screenshot.png (Browser mode)

Full-page screenshot for visual verification.

Proxy Support

Specify proxy type for inspection:
# No proxy
./scrapai inspect https://example.com --proxy-type none

# Datacenter proxy
./scrapai inspect https://example.com --proxy-type static

# Residential proxy
./scrapai inspect https://example.com --proxy-type residential

# Auto (smart escalation)
./scrapai inspect https://example.com --proxy-type auto
Requires proxy configuration in .env.

Skip HTML Saving

For quick inspection without saving files:
./scrapai inspect https://example.com --no-save-html
Prints the analysis to the console only; nothing is written to disk.

Logging

Control logging verbosity:
# Debug logging
./scrapai inspect https://example.com --log-level debug

# Save logs to file
./scrapai inspect https://example.com --log-file inspect.log

analyze

Analyze saved HTML for CSS selector discovery (separate command, not a subcommand of inspect).

Syntax

./scrapai analyze <html_file> [options]

Arguments

html_file
string
required
Path to HTML file to analyze.

Options

--test
string
Test a specific CSS selector.
--find
string
Find elements by keyword (searches classes and IDs).

Examples

# Analyze HTML structure
./scrapai analyze data/news/inspect/example_com/page.html

# Test a CSS selector
./scrapai analyze page.html --test "article.post h1.title"

# Find elements with keyword
./scrapai analyze page.html --find "author"

Output (Analysis Mode)

$ ./scrapai analyze page.html
📄 Analyzing: page.html
📊 HTML size: 45231 bytes

💡 TIP: Use --find 'keyword' to search for specific elements

============================================================
🏷️  HEADERS (h1, h2)
============================================================

H1 - Found 1:
  [1] h1.article-headline
      Text: UK economy grows 0.4% in February

H2 - Found 5:
  [1] h2.section-title
      Text: Economic Growth
  [2] h2.section-title
      Text: Market Response
  ...

============================================================
📝 CONTENT CONTAINERS
============================================================

  [1] article.main-article
      Size: 3,245 chars
      Preview: The UK economy grew by 0.4% in February, official figures show...

  [2] div.article-body
      Size: 2,891 chars
      Preview: Economists had expected growth of 0.2%, making this a positive...

============================================================
📅 DATES
============================================================
  time.published-date: February 28, 2026
  span.updated-time: Updated 2 hours ago

============================================================
✍️  AUTHORS
============================================================
  span.author-name: Economics Reporter
  a.byline: By John Smith

============================================================

Test a Selector

$ ./scrapai analyze page.html --test "h1.article-headline::text"

🔍 Testing selector: h1.article-headline::text
============================================================
 Found 1 element(s)

[1] h1
    Classes: ['article-headline']
    Text (62 chars): UK economy grows 0.4% in February
Use ::text pseudo-selector to extract text content instead of HTML.
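The idea behind selector testing can be sketched with the stdlib. This rough stand-in supports only the `tag.class::text` shape and non-nested matches; scrapai itself presumably uses a full CSS engine, so treat it as an illustration, not the tool's implementation:

```python
from html.parser import HTMLParser

class SelectorTester(HTMLParser):
    """Collect the text of elements matching a "tag.class::text" selector.
    Handles only that shape and non-nested matches: a deliberate sketch."""

    def __init__(self, selector: str):
        super().__init__()
        tag_part = selector.replace("::text", "")
        self.tag, _, self.cls = tag_part.partition(".")
        self.inside = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and (not self.cls or self.cls in classes):
            self.inside = True
            self.texts.append("")

    def handle_data(self, data):
        if self.inside:
            self.texts[-1] += data

    def handle_endtag(self, tag):
        if tag == self.tag:
            self.inside = False

def match_selector(html: str, selector: str):
    tester = SelectorTester(selector)
    tester.feed(html)
    return [t.strip() for t in tester.texts]

html = '<h1 class="article-headline">UK economy grows 0.4% in February</h1>'
print(match_selector(html, "h1.article-headline::text"))
```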

Find by Keyword

$ ./scrapai analyze page.html --find "author"

🔎 Finding elements with keyword: 'author'
============================================================

  span.author-name
    Text: Economics Reporter

  div.author-bio
    Text: Economics Reporter specializes in UK economic policy and analysis.

  a.author-profile
    Text: View profile

 Found 3 elements
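Keyword search amounts to scanning every element's class and id attributes for a substring. A minimal sketch of that idea (the `find_elements` helper and its output format are illustrative, not scrapai internals):

```python
from html.parser import HTMLParser

class KeywordFinder(HTMLParser):
    """Report elements whose class or id attribute contains a keyword,
    mirroring what --find appears to do."""

    def __init__(self, keyword: str):
        super().__init__()
        self.keyword = keyword.lower()
        self.matches = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("class", "id") and value and self.keyword in value.lower():
                sep = "." if name == "class" else "#"
                self.matches.append(f"{tag}{sep}{value}")

def find_elements(html: str, keyword: str):
    finder = KeywordFinder(keyword)
    finder.feed(html)
    return finder.matches

html = '<span class="author-name">Economics Reporter</span><div class="content">x</div>'
print(find_elements(html, "author"))  # ['span.author-name']
```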

extract-urls

Extract all URLs from a saved HTML file. Useful for understanding URL patterns on a site during analysis.

Syntax

./scrapai extract-urls --file <html_file> [options]

Arguments

--file
string
required
Path to HTML file to extract URLs from.

Options

--output
string
Output file path. If not specified, URLs are printed to console.
-o
string
Short form of --output.

Examples

# Extract URLs to console
./scrapai extract-urls --file data/news/inspect/example_com/page.html

# Extract URLs to file
./scrapai extract-urls --file page.html --output urls.txt

# Short form
./scrapai extract-urls --file page.html -o urls.txt

Output

$ ./scrapai extract-urls --file page.html
https://example.com/
https://example.com/about
https://example.com/contact
https://example.com/articles/2026/02/story-1
https://example.com/articles/2026/02/story-2
...
When using --output, URLs are written one per line to the specified file:
$ ./scrapai extract-urls --file page.html -o urls.txt
 Extracted 147 URLs to urls.txt
Use extract-urls after inspect to analyze URL patterns and design spider rules. Look for common patterns like /articles/[year]/[month]/[slug] to write effective regex rules.
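The extract-then-filter step can be sketched with the stdlib: collect every `<a href>` resolved against the page URL, then keep only the URLs matching a candidate pattern. This is a minimal illustration, assuming the real command behaves like a plain link collector (it may deduplicate or normalize further):

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect every <a href>, resolved against the page URL."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.urls.append(urljoin(self.base, href))

collector = LinkCollector("https://example.com")
collector.feed('<a href="/about">About</a><a href="/articles/2026/02/story-1">S</a>')

# Keep only article-like URLs, using the pattern spotted during review.
article = re.compile(r"/articles/\d{4}/\d{2}/[^/]+$")
print([u for u in collector.urls if article.search(u)])
```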

Workflow: Inspect to Spider Config

1. Inspect the Site

./scrapai inspect https://example.com --project myproject --browser

2. Extract URLs (Optional)

./scrapai extract-urls --file data/myproject/inspect/example_com/page.html -o urls.txt
Review URL patterns to design effective spider rules.

3. Analyze HTML Structure

./scrapai analyze data/myproject/inspect/example_com/page.html
Note the CSS selectors for title, content, author, etc.

4. Test Selectors

./scrapai analyze page.html --test "h1.article-title::text"
./scrapai analyze page.html --test "div.article-body"
./scrapai analyze page.html --test "span.author-name::text"

5. Create Spider Config

Write spider.json using discovered selectors:
{
  "name": "example_com",
  "allowed_domains": ["example.com"],
  "start_urls": ["https://example.com/articles"],
  "rules": [
    {
      "allow": ["/article/[^/]+$"],
      "callback": "parse_article",
      "follow": false
    }
  ],
  "callbacks": {
    "parse_article": {
      "extract": {
        "title": {"css": "h1.article-title::text"},
        "content": {"css": "div.article-body", "get": "all_text"},
        "author": {"css": "span.author-name::text"}
      }
    }
  }
}
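Before importing, it can pay to sanity-check the config: that the required top-level keys are present and the `allow` pattern matches the URLs you expect. A sketch using the key names from the example above (whether scrapai validates the same way is an assumption):

```python
import json
import re

config = json.loads("""
{
  "name": "example_com",
  "allowed_domains": ["example.com"],
  "start_urls": ["https://example.com/articles"],
  "rules": [{"allow": ["/article/[^/]+$"], "callback": "parse_article", "follow": false}],
  "callbacks": {"parse_article": {"extract": {"title": {"css": "h1.article-title::text"}}}}
}
""")

# Required keys are taken from the example config, not a published schema.
required = {"name", "allowed_domains", "start_urls", "rules", "callbacks"}
missing = required - config.keys()
print("missing keys:", missing)  # missing keys: set()

# Dry-run the rule pattern against URLs you expect to match and to skip.
allow = re.compile(config["rules"][0]["allow"][0])
print(bool(allow.search("https://example.com/article/uk-economy")))  # True
print(bool(allow.search("https://example.com/article/2026/02/x")))   # False
```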

6. Import and Test

./scrapai spiders import spider.json --project myproject
./scrapai crawl example_com --project myproject --limit 5
./scrapai show example_com --project myproject

Platform Notes

Linux (Headless Servers)

For Cloudflare bypass on headless servers:
# Install xvfb
sudo apt-get install xvfb

# Run with xvfb
xvfb-run -a ./scrapai inspect https://site.com --browser
Or check if xvfb is available:
which xvfb-run

macOS/Windows

Browser mode uses the native display automatically (no xvfb needed).

Troubleshooting

Browser Mode: No Display Available

 ERROR: Browser mode requires a display
   No display available and xvfb not installed
Solution:
sudo apt-get install xvfb
xvfb-run -a ./scrapai inspect https://site.com --browser

Browser Launch Failed

Playwright Error: Executable doesn't exist at /path/to/chromium
Solution:
.venv/bin/python -m playwright install chromium

Invalid URL

Invalid URL scheme: must be http or https
Solution: Ensure URL includes protocol:
./scrapai inspect https://example.com  # Correct
./scrapai inspect example.com          # Wrong

Next Steps