See complete examples of the agent’s output for different types of websites. Each example shows the analysis notes, test config, and production config.

BBC News (bbc.co.uk)

Site Type: Major news organization with 10M+ articles across multiple sections
The agent’s Phase 1 analysis document (sections.md):
# BBC.co.uk - Comprehensive Site Structure Analysis

**Source URL:** https://bbc.co.uk/
**Project:** news
**Date:** 2026-02-26

---

## Executive Summary

BBC (British Broadcasting Corporation) is one of the world's largest news organizations
with extensive content across news, sport, entertainment, education, and lifestyle categories.
The site contains **6 primary content sections** with hundreds of subsections and categories.

**Key Findings:**
- No Cloudflare protection (regular HTTP works fine)
- Semantic HTML with `<article>` tags, `<h1>` titles, `<time>` dates
- Consistent structure across all content types
- New URL format: `/section/articles/<id>` (hashed IDs)
- Legacy URL format: `/section/<number>` (numeric IDs)
- Regional editions: England, Scotland, Wales, Northern Ireland
- Massive content volume (millions of articles)

---

## Content Sections (Detailed Analysis)

### 1. News Articles
**URL Pattern:** `/news/articles/<id>` (primary) + `/news/<number>` (legacy)
**Homepage:** `/news`
**Category Structure:** `/news/<category>`

**Article Examples (New Format):**
- https://www.bbc.co.uk/news/articles/cgjz1x5e1xyo
- https://www.bbc.co.uk/news/articles/c1mjx1grj3yo
- https://www.bbc.co.uk/news/articles/c2k8zyq0qgzo

**News Categories/Subsections:**
- `/news/business` - Business and finance news
- `/news/politics` - UK and international politics
- `/news/technology` - Technology and digital news
- `/news/health` - Health and medical news

**HTML Structure Confirmed:**
- `<h1 class="ssrcss-zwdxc1-Heading">` for title
- `<article class="ssrcss-hmqe3h-ArticleWrapper">` container
- `<time>` tag with readable dates
- `<div class="ssrcss-nqezkk-RichTextContainer">` for content
- Clean semantic HTML structure

**Volume:** Millions of articles

---

### 2. Sport Content
**URL Pattern:** `/sport/<sport>/articles/<id>`
**Homepage:** `/sport`

**Sport Categories:**
- `/sport/football` - Football/soccer (largest category)
- `/sport/cricket` - Cricket
- `/sport/rugby-union` - Rugby union
- `/sport/formula1` - Formula 1 racing

**Volume:** Hundreds of thousands of articles

---

### 3. Food Articles
**URL Pattern:** `/food/articles/<id>`
**Volume:** Thousands of articles

### 4. Bitesize (Educational Content)
**URL Pattern:** `/bitesize/articles/<id>`
**Volume:** Tens of thousands of educational articles

### 5. Newsround (Children's News)
**URL Pattern:** `/newsround/articles/<id>`
**Volume:** Thousands of children's news articles

### 6. Media Centre
**URL Pattern:** `/mediacentre/articles/<year>/<slug>`
**Volume:** Thousands of press releases

---

## Technical Requirements

**HTTP Access:**
- Status: No protection (HTTP works fine)
- Bypass Method: Not needed
- Settings Required: None (default Scrapy settings work)

**Extractor Strategy:**
Generic extractors (newspaper/trafilatura) recommended - site uses clean semantic HTML

**Crawl Strategy:**
1. Start URL: https://www.bbc.co.uk/
2. Follow links: Yes
3. Respect robots.txt: Yes
4. Concurrent requests: 16
5. Politeness: DOWNLOAD_DELAY = 1 second

---

**Analysis completed:** 2026-02-26
**Status:** ✅ Ready for Phase 2
This is an abbreviated version showing the key findings. The full analysis also includes URL matching rules, exclusion patterns, and more detailed volume estimates.
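As a rough illustration of what URL matching rules derived from this analysis might look like, the sketch below turns the six "URL Pattern" lines into regular expressions and classifies a URL by section. The `SECTION_PATTERNS` table and the `classify` helper are hypothetical names, and the character classes for IDs and slugs are assumptions (the analysis only says "hashed IDs" and "numeric IDs"); this is not the agent's actual rule format.

```python
import re
from urllib.parse import urlparse

# Hypothetical rule table transcribed from the "URL Pattern" lines above.
# The ID/slug character classes are assumptions, not confirmed by the analysis.
SECTION_PATTERNS = [
    ("news", re.compile(r"^/news/articles/[a-z0-9]+$")),
    ("news-legacy", re.compile(r"^/news/\d+$")),
    ("sport", re.compile(r"^/sport/[a-z0-9-]+/articles/[a-z0-9]+$")),
    ("food", re.compile(r"^/food/articles/[a-z0-9]+$")),
    ("bitesize", re.compile(r"^/bitesize/articles/[a-z0-9]+$")),
    ("newsround", re.compile(r"^/newsround/articles/[a-z0-9]+$")),
    ("mediacentre", re.compile(r"^/mediacentre/articles/\d{4}/[a-z0-9-]+$")),
]


def classify(url):
    """Return the section name for an article URL, or None for non-article pages."""
    path = urlparse(url).path.rstrip("/")
    for name, pattern in SECTION_PATTERNS:
        if pattern.match(path):
            return name
    return None


if __name__ == "__main__":
    print(classify("https://www.bbc.co.uk/news/articles/cgjz1x5e1xyo"))
    print(classify("https://www.bbc.co.uk/news/business"))
```

Category pages such as `/news/business` deliberately match nothing here, so a crawler using rules like these would follow them for link discovery without treating them as articles.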

What This Shows

Across all examples, notice how the agent:
  1. Analyzes thoroughly: documents URL patterns, content sections, and HTML structure
  2. Tests first: creates a test config with sample URLs before production
  3. Adapts strategy: uses generic extractors for articles, custom callbacks for products/forums
  4. Handles edge cases: deny patterns, pagination, comment sections, utility pages
  5. Respects sites: sets appropriate DOWNLOAD_DELAY, CONCURRENT_REQUESTS, and ROBOTSTXT_OBEY values
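
For the BBC example, the politeness values from the Crawl Strategy list could be expressed as Scrapy settings roughly as follows. This is a minimal sketch: `CRAWL_SETTINGS` and `START_URLS` are hypothetical names, and the agent's real config format may differ. The setting keys themselves (`ROBOTSTXT_OBEY`, `CONCURRENT_REQUESTS`, `DOWNLOAD_DELAY`) are standard Scrapy settings.

```python
# Hypothetical settings fragment; values are taken from the Crawl Strategy
# list in the BBC analysis (start URL, robots.txt, concurrency, delay).
START_URLS = ["https://www.bbc.co.uk/"]

CRAWL_SETTINGS = {
    "ROBOTSTXT_OBEY": True,      # "Respect robots.txt: Yes"
    "CONCURRENT_REQUESTS": 16,   # "Concurrent requests: 16"
    "DOWNLOAD_DELAY": 1,         # "DOWNLOAD_DELAY = 1 second"
}

if __name__ == "__main__":
    for key, value in CRAWL_SETTINGS.items():
        print(f"{key} = {value}")
```

A dictionary like this could, for instance, be assigned to a spider's `custom_settings` attribute so that the values apply to that spider only.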