# Blog Content Extraction Guide

**Last Updated:** 2026-01-09

Guide for extracting full blog post content and images for migration preparation.

## Overview

This guide explains how to extract full blog post content with the automated extraction script. The script fetches each post's HTML, extracts the main content area, identifies all images, and prepares the data for migration.

## Prerequisites

- Python 3.7+
- Required packages: `requests`, `beautifulsoup4`, `lxml`
- Network access to the blog URLs (the posts must be publicly reachable)

## Installation

Install required packages:

```bash
pip install requests beautifulsoup4 lxml
```

## Usage

### Basic Extraction

Run the content extraction script:

```bash
python3 scripts/blog/extract-content.py
```

The script will:

1. Load post URLs from `docs/data/blog-posts-metadata.json`
2. Fetch HTML content for each post
3. Extract main content area
4. Identify all images (featured + content)
5. Preserve embedded content (iframes, scripts, videos)
6. Save results to JSON files

### Output Files

**Content Extraction:**

- `docs/data/blog-posts-content-full.json` - Full content for all posts
  - HTML content (including embeds: iframes, scripts, videos)
  - Plain text content
  - Word counts
  - Image references
  - Embedded content (YouTube, Podigee, Vimeo, etc.)

**Images List:**

- `docs/data/blog-images-list.json` - Complete list of all images
  - Image URLs
  - Image metadata (alt text, dimensions)
  - Source post information
  - Image type (featured/content)

## Content Structure

### Extracted Content Format

```json
{
  "url": "https://www.ordio.com/insights/...",
  "title": "Post Title",
  "category": "lexikon",
  "publication_date": "2025-01-01",
  "extracted_at": "2026-01-09T12:00:00",
  "content": {
    "html": "<div>...</div>",
    "text": "Plain text content...",
    "word_count": 1234
  },
  "images": [
    {
      "url": "https://www.ordio.com/wp-content/uploads/...",
      "type": "featured",
      "alt": "Image description",
      "source": "meta"
    }
  ],
  "content_images_count": 5,
  "total_images_count": 6
}
```

## Image Extraction

### Image Types

1. **Featured Images**

   - Extracted from Open Graph meta tags
   - Usually displayed at top of post
   - One per post

2. **Content Images**

   - Images within post content
   - Multiple per post possible
   - Includes inline images, diagrams, screenshots

3. **Embedded Content** (not images; preserved in the extracted HTML)
   - YouTube video embeds (iframe)
   - Podigee podcast players (script)
   - Vimeo video embeds (iframe)
   - Other video embeds (video tag)
   - Demo embeds (iframe)
   - WordPress oEmbed blocks

### Image Metadata

Each image includes:

- Full URL (absolute)
- Alt text (if available)
- Dimensions (if available)
- Image type (featured/content)
- Source post information
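
As a rough illustration of how the image types and metadata above could be collected, the sketch below reads the featured image from the `og:image` meta tag and gathers `<img>` tags from the content area. The function name `extract_images`, the `html.parser` backend, the `.entry-content` selector, the `width`/`height` keys, and the `"source": "content"` value are assumptions made for this example; the actual script may work differently.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_images(html: str, page_url: str) -> list:
    """Collect featured and content images with absolute URLs (illustrative)."""
    soup = BeautifulSoup(html, "html.parser")
    images = []

    # Featured image: taken from the Open Graph meta tag, if present.
    og_tag = soup.find("meta", attrs={"property": "og:image"})
    if og_tag is not None and og_tag.get("content"):
        images.append({
            "url": urljoin(page_url, og_tag["content"]),
            "type": "featured",
            "alt": "",
            "source": "meta",
        })

    # Content images: every <img> inside the content area (assumed selector).
    container = soup.select_one(".entry-content")
    if container is None:
        container = soup
    for img in container.find_all("img"):
        src = img.get("src")
        if not src:
            continue
        images.append({
            "url": urljoin(page_url, src),  # make relative paths absolute
            "type": "content",
            "alt": img.get("alt", ""),
            "width": img.get("width"),    # None when the attribute is absent
            "height": img.get("height"),
            "source": "content",          # hypothetical value
        })
    return images
```

Resolving every `src` against the page URL with `urljoin` keeps the "Full URL (absolute)" guarantee even when the HTML uses relative paths.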

## Content Processing

### Main Content Extraction

The script identifies main content using:

1. WordPress `entry-content` class (primary)
2. `<article>` tag (fallback)
3. `<main>` tag (fallback)
4. `<body>` tag (last resort)
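
The fallback order above can be sketched with BeautifulSoup as follows. This is a minimal illustration rather than the script's actual code: the function name `find_main_content` and the CSS selectors are assumptions, and `html.parser` is used here only to keep the example free of the `lxml` dependency.

```python
from bs4 import BeautifulSoup


def find_main_content(html: str):
    """Return (element, selector) for the first container that matches."""
    soup = BeautifulSoup(html, "html.parser")
    # Try the most specific container first, then progressively broader ones.
    for selector in (".entry-content", "article", "main", "body"):
        element = soup.select_one(selector)
        if element is not None:
            return element, selector
    return None, None


sample = "<html><body><article><p>Hello</p></article></body></html>"
element, matched = find_main_content(sample)
print(matched)  # article
```

Returning the matched selector alongside the element makes it easy to log which posts fell through to a broad fallback such as `<body>`, since those are the ones most likely to contain navigation or footer markup.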

### Content Cleaning

The script cleans the extracted content automatically:

- **Author names removed**: Author bylines (e.g., "Autor: [Name]", "Von [Name]") are removed because the post header already displays the author
- **CTA sections removed**: Marketing CTAs (e.g., "7 Tage kostenlos testen...") are removed
- **Embeds preserved**: All embedded content (iframes, scripts, videos) is kept
- **HTML cleaned up**: WordPress-specific classes are removed, but embed containers are kept
- **Image paths updated**
- **Link paths updated**
- **Formatting normalized**

**Note**: Author and CTA removal runs at three points: during extraction, during cleanup, and defensively at runtime in the `PostContent` component. Embeds are preserved throughout the entire process.
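
To make the author removal concrete, here is one way a byline filter could look when applied to plain text. The regular expression and the function name `strip_author_lines` are hypothetical and not taken from the script; the pattern only drops lines that consist entirely of a byline, so ordinary sentences starting with "Von" are left alone.

```python
import re

# A line that is nothing but "Autor: Name" or "Von Name" (1 to 4 capitalized words).
AUTHOR_LINE = re.compile(
    r"^\s*(?:Autor:\s*|Von\s+)"
    r"[A-ZÄÖÜ][\w.\-]*(?:\s+[A-ZÄÖÜ][\w.\-]*){0,3}\s*$"
)


def strip_author_lines(text: str) -> str:
    """Remove byline-only lines and keep everything else unchanged."""
    kept = [line for line in text.splitlines() if not AUTHOR_LINE.match(line)]
    return "\n".join(kept)
```

A filter like this is heuristic: a short line such as "Von Montag" would also match, which is one reason a defensive second pass at runtime can be useful.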

### CTA Removal

CTA sections are removed using multiple patterns:

- Divs with `bg-ordio-sand` or `bg-ordio-sand-dark` classes containing CTA text
- Divs with `order-3` class containing CTA text
- Any div/section containing CTA text patterns ("7 Tage kostenlos", "Abwesenheiten einfach", "Jetzt kostenlos testen")
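
The three patterns above might be combined roughly as in the following sketch. It is an assumption-laden example, not the script's implementation: `remove_ctas`, `CTA_PHRASES`, and `CTA_CLASSES` are illustrative names, and the "innermost block only" rule in the second pass is a design choice made here so that a wrapper around the whole post is not deleted just because it contains a CTA somewhere inside.

```python
from bs4 import BeautifulSoup

CTA_PHRASES = ("7 Tage kostenlos", "Abwesenheiten einfach", "Jetzt kostenlos testen")
CTA_CLASSES = {"bg-ordio-sand", "bg-ordio-sand-dark", "order-3"}


def _has_cta_text(tag) -> bool:
    text = tag.get_text(" ", strip=True)
    return any(phrase in text for phrase in CTA_PHRASES)


def remove_ctas(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")

    # Pass 1: divs with a known CTA class that also contain CTA text.
    for div in soup.find_all("div"):
        classes = set(div.get("class") or [])
        if classes & CTA_CLASSES and _has_cta_text(div):
            div.extract()  # detach from the tree

    # Pass 2: any remaining div/section with CTA text, innermost block only.
    for block in soup.find_all(["div", "section"]):
        if not _has_cta_text(block):
            continue
        nested = block.find_all(["div", "section"])
        if any(_has_cta_text(child) for child in nested):
            continue  # a nested block will be removed instead
        block.extract()

    return str(soup)
```

Note that a block carrying a CTA class but no CTA text is kept, which matches the "containing CTA text" condition in the list.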

**Cleanup Script**: If CTAs are found in existing post files, run:

```bash
python3 scripts/blog/remove-ctas-from-posts.py
```

This script scans all post JSON files and removes CTA sections.

## Next Steps After Extraction

### 1. Review Extracted Content

Check `blog-posts-content-full.json`:

- Verify that all posts were extracted
- Check content quality
- Identify any missing content
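
Part of this review can be automated. The sketch below assumes the content file holds a list of post objects shaped like the example in "Extracted Content Format"; the function name `review_posts` and the `min_words` threshold are illustrative choices, not part of the existing tooling.

```python
def review_posts(posts, expected_urls, min_words=100):
    """Report URLs that were never extracted and posts with very little text."""
    extracted = {post.get("url") for post in posts}
    missing = sorted(set(expected_urls) - extracted)
    thin = sorted(
        post.get("url", "")
        for post in posts
        if post.get("content", {}).get("word_count", 0) < min_words
    )
    return {"missing": missing, "thin": thin}


# In practice `posts` would come from json.load() on
# docs/data/blog-posts-content-full.json (assumed to be a list of objects).
sample_posts = [
    {"url": "https://example.com/a", "content": {"word_count": 1234}},
    {"url": "https://example.com/b", "content": {"word_count": 12}},
]
expected = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
report = review_posts(sample_posts, expected)
print(report)
```

Posts listed under `thin` are good candidates for a manual look, since a very low word count often means the content selector matched the wrong element.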

### 2. Download Images

Use `blog-images-list.json` to:

- Generate a download script
- Download all images
- Convert to WebP format
- Optimize file sizes
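
A first step toward a download script is mapping each image URL to a local file path. The following sketch assumes each entry in the images list has a `url` key, as in the content format example; `build_download_plan` and the `images` target directory are hypothetical names. It covers only the planning step and leaves the actual download and WebP conversion to whatever tooling is chosen.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlsplit


def build_download_plan(images, dest_dir="images"):
    """Map each unique image URL to a collision-free local path."""
    plan = {}
    used_names = set()
    for image in images:
        url = image.get("url")
        if not url or url in plan:
            continue  # skip empty entries and duplicates
        # Use the last path segment as the file name; query strings are ignored.
        name = PurePosixPath(unquote(urlsplit(url).path)).name or "image"
        stem, suffix = PurePosixPath(name).stem, PurePosixPath(name).suffix
        candidate, counter = name, 1
        while candidate in used_names:
            candidate = f"{stem}-{counter}{suffix}"
            counter += 1
        used_names.add(candidate)
        plan[url] = f"{dest_dir}/{candidate}"
    return plan
```

Deduplicating by URL matters because the same image can appear both as a featured image and inside the content of a post.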

### 3. Content Migration

Prepare content for static pages:

- Clean HTML content
- Update image paths
- Update internal links
- Generate static page templates

## Troubleshooting

### Failed Posts

If posts fail to extract:

- Check URL accessibility
- Verify network connection
- Review error messages in output
- Manually extract if needed
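
When transient network errors are the cause, retrying before falling back to manual extraction can help. The sketch below is a generic pattern, not the script's behavior: `fetch_all` is a hypothetical helper, and `fetch` is any callable that returns the page content or raises an exception.

```python
import time


def fetch_all(urls, fetch, attempts=3, delay=1.0):
    """Fetch each URL with retries; return (results, failures)."""
    results, failures = {}, {}
    for url in urls:
        for attempt in range(1, attempts + 1):
            try:
                results[url] = fetch(url)
                break
            except Exception as exc:  # broad on purpose: record any failure
                if attempt == attempts:
                    failures[url] = str(exc)  # give up and keep the reason
                else:
                    time.sleep(delay)  # wait before the next attempt
    return results, failures
```

With `requests`, `fetch` could be a small function that calls `requests.get(url, timeout=10)`, then `raise_for_status()`, and returns `response.text`. The `failures` dictionary then gives a list of posts to inspect or extract by hand.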

### Missing Content

If content appears incomplete:

- Check HTML structure
- Verify content selectors
- Review WordPress template changes
- Adjust extraction logic

### Image Issues

If images are missing:

- Check image URLs
- Verify image accessibility
- Review image source tags
- Check for lazy loading (the real URL may be stored in an attribute other than `src`)
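
For the lazy-loading case, a helper like the one below can pick the real image URL from a tag's attributes. The attribute names `data-src`, `data-lazy-src`, and `data-original` are common conventions among lazy-loading plugins, but whether the blog uses any of them is an assumption to verify against the actual HTML; `resolve_img_src` is a hypothetical name.

```python
LAZY_ATTRIBUTES = ("data-src", "data-lazy-src", "data-original")


def resolve_img_src(attrs: dict):
    """Return the most likely real image URL from an <img> tag's attributes."""
    # Lazy-loading plugins usually keep the real URL in a data-* attribute.
    for key in LAZY_ATTRIBUTES:
        value = attrs.get(key)
        if value:
            return value
    src = attrs.get("src", "")
    # An inline data: URI in src is typically just a placeholder.
    if not src or src.startswith("data:"):
        return None
    return src
```

With BeautifulSoup, the attribute dictionary of a tag is available as `img.attrs`, so the helper can be called as `resolve_img_src(img.attrs)`.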

## Related Documentation

- [Migration Requirements](MIGRATION_REQUIREMENTS.md) - Content requirements
- [Migration Strategy](MIGRATION_STRATEGY.md) - Migration approach
- [Migration Inventory](MIGRATION_INVENTORY.md) - URL mappings
- [Next Steps](NEXT_STEPS.md) - Implementation guide

## Script Reference

**Location:** `scripts/blog/extract-content.py`

**Dependencies:**

- `docs/data/blog-posts-metadata.json` (input)
- Python packages: `requests`, `beautifulsoup4`, `lxml`

**Output:**

- `docs/data/blog-posts-content-full.json`
- `docs/data/blog-images-list.json`
