# Blog Content Extraction Fix Summary

**Last Updated:** 2026-01-10

## Problem Fixed

The blog content extraction script (`scripts/blog/extract-content.py`) had four issues:

1. **CTAs and Author Names Showing**: Content that should be removed was appearing in extracted posts
2. **Border/Container Divs Showing**: WordPress wrapper divs with container classes were not being removed
3. **Embeds/Videos Not Extracted**: Embedded content (iframes, scripts, videos) was not being explicitly preserved
4. **Links Not Being Added**: Internal links were not being inserted into content after extraction

## Solution Implemented

### 1. Improved CTA Removal (`remove_cta_sections()`)

**Changes:**
- Fixed pattern matching to handle nested divs
- Added defensive node removal (collect nodes first, then remove)
- Improved CTA text detection patterns
- Added "Demo buchen" to CTA patterns

**Patterns Handled:**
- Divs with `bg-ordio-sand` classes containing CTA text
- Divs with `order-3` class containing CTA text
- Any div containing CTA text with CTA-like classes
- Aggressive fallback for divs/sections with CTA text and button-like content
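
The "collect nodes first, then remove" approach can be sketched as follows. This is a minimal illustration, not the script's code: this summary does not say which HTML parser the script uses, so the sketch relies on the standard library's `xml.etree.ElementTree` and assumes well-formed markup. The function name `remove_cta_divs`, the exact-match class set, and the one-entry phrase list are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Assumption: exact class names. The real script may match them differently.
CTA_CLASSES = {"bg-ordio-sand", "order-3"}
# Only "Demo buchen" is named in this summary; the full phrase list is not shown.
CTA_PHRASES = ("demo buchen",)


def remove_cta_divs(root):
    """Remove <div>s that carry a CTA class and contain CTA text; return the count."""
    # ElementTree has no parent pointers, so build a child -> parent map up front.
    parent_of = {child: parent for parent in root.iter() for child in parent}

    # Pass 1: collect matches without modifying the tree.
    matches = []
    for div in root.iter("div"):
        classes = set(div.get("class", "").split())
        text = "".join(div.itertext()).lower()
        if classes & CTA_CLASSES and any(p in text for p in CTA_PHRASES):
            matches.append(div)

    # Pass 2: remove. Iteration has finished, so no element can be skipped.
    removed = 0
    for div in matches:
        parent = parent_of.get(div)
        if parent is not None:  # the root itself has no parent and is kept
            parent.remove(div)
            removed += 1
    return removed
```

Requiring a CTA class in addition to CTA text keeps an outer wrapper that merely contains a CTA from being deleted along with the article body. Nested CTA divs are both collected in pass 1 and detached safely in pass 2.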

### 2. Improved Author Removal (`remove_author_elements()`)

**Changes:**
- Enhanced regex patterns to catch more variations
- Added pattern for short paragraphs starting with author info
- Added removal of any element containing only author info (divs, spans, etc.)

**Patterns Handled:**
- Paragraphs with `author-name` class
- Standalone paragraphs containing "Autor: [Name]" or "Von [Name]"
- Short paragraphs (< 50 chars) starting with author patterns
- Any element containing only author info
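
A short sketch of the "short paragraph starting with an author pattern" check, assuming the patterns look roughly like the two quoted above. The regex, the helper name `is_author_line`, and the requirement that a capitalized name follows are assumptions for illustration; the script's actual patterns are not reproduced in this summary.

```python
import re

# Assumed pattern: "Autor: Name" or "Von Name" at the start of the text,
# followed by a capital letter (including German umlauts).
AUTHOR_RE = re.compile(r"^(?:Autor\s*:\s*|Von\s+)[A-ZÄÖÜ]")

# Taken from the "< 50 chars" rule in the list above.
MAX_AUTHOR_LEN = 50


def is_author_line(text):
    """Return True if text looks like a standalone author byline."""
    stripped = text.strip()
    return len(stripped) < MAX_AUTHOR_LEN and bool(AUTHOR_RE.match(stripped))
```

The length limit is what keeps ordinary sentences that happen to begin with "Von" from being removed. A short sentence such as "Von Anfang an klar." would still match this simplified pattern, so a real implementation likely needs stricter checks.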

### 3. Added Container Removal (`remove_container_divs()`)

**New Function:**
- Removes WordPress wrapper/container divs while preserving content
- Unwraps container divs (replaces with children) instead of deleting
- Leaves a container div in place if it contains embeds

**Container Classes Removed:**
- `entry-content`
- `shadow-xl`
- `rounded-[25px]`
- `shadow`
- `rounded-lg`
- `container`
- `wrapper`
- `grid-cols`
- `order-*`
- `py-10`, `px-6`, `px-10`, `xl:py-12`
- `rounded-b-[25px]`, `rounded-t-none`
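
The class list above can be expressed as a small predicate. This is a hedged sketch: the helper names `is_container_class` and `should_unwrap` are hypothetical, `grid-cols` is assumed to be a prefix (as in `grid-cols-2`), and the rule "unwrap if any class matches" is an assumption about how the script decides.

```python
# Classes matched exactly, copied from the list above.
EXACT_CONTAINER_CLASSES = frozenset({
    "entry-content", "shadow-xl", "rounded-[25px]", "shadow", "rounded-lg",
    "container", "wrapper", "py-10", "px-6", "px-10", "xl:py-12",
    "rounded-b-[25px]", "rounded-t-none",
})

# Classes matched by prefix: `order-*` from the list, `grid-cols` by assumption.
CONTAINER_PREFIXES = ("grid-cols", "order-")


def is_container_class(name):
    """Return True if a single class name marks a wrapper/container div."""
    return name in EXACT_CONTAINER_CLASSES or name.startswith(CONTAINER_PREFIXES)


def should_unwrap(class_attr, contains_embed):
    """Decide whether a div should be replaced by its children."""
    if contains_embed:
        return False  # embed containers stay intact
    return any(is_container_class(c) for c in class_attr.split())
```

Plain string comparison is used instead of glob matching because names such as `rounded-[25px]` contain square brackets, which glob libraries like `fnmatch` interpret as character classes.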

### 4. Added Embed Preservation (`preserve_embeds()`)

**New Function:**
- Explicitly marks embeds as protected
- Ensures iframes, scripts, videos are not accidentally removed
- Preserves WordPress embed blocks (`wp-block-embed`)

**Embed Types Preserved:**
- `<iframe>` tags (YouTube, Vimeo, etc.)
- `<script>` tags with `src` attribute (Podigee, etc.)
- `<video>` tags
- WordPress embed blocks (`wp-block-embed`)
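
One way to "mark embeds as protected" is to tag each embed with an attribute that later cleanup steps can check. The sketch below assumes that approach; the attribute name `data-preserve` and the function names are hypothetical, and the standard library's `xml.etree.ElementTree` stands in for whatever parser the script uses.

```python
import xml.etree.ElementTree as ET

# Hypothetical marker attribute; the script's real mechanism is not documented here.
PROTECTED_ATTR = "data-preserve"


def is_embed(el):
    """Return True for the embed types listed above."""
    if el.tag in ("iframe", "video"):
        return True
    if el.tag == "script" and el.get("src"):
        return True  # inline scripts without src are not treated as embeds
    return "wp-block-embed" in el.get("class", "").split()


def mark_embeds(root):
    """Tag every embed element under root; return how many were tagged."""
    marked = 0
    for el in root.iter():
        if is_embed(el):
            # Setting an attribute does not change tree structure,
            # so it is safe to do while iterating.
            el.set(PROTECTED_ATTR, "1")
            marked += 1
    return marked
```

A removal function could then skip any element that carries the marker or has a marked descendant.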

### 5. Updated Extraction Workflow (`extract_main_content()`)

**Processing Order:**
1. Find main content area
2. Remove container/border divs (unwrap them)
3. Remove CTA sections
4. Remove author elements
5. Preserve embeds (mark as protected)
6. Extract images and content

## Files Modified

1. **`scripts/blog/extract-content.py`**
   - Fixed `remove_cta_sections()` function
   - Fixed `remove_author_elements()` function
   - Added `remove_container_divs()` function
   - Added `preserve_embeds()` function
   - Updated `extract_main_content()` to use all functions in correct order

2. **`scripts/blog/test-extraction-fixes.py`** (Created)
   - Comprehensive test suite for all extraction fixes
   - Tests CTA removal, author removal, container removal, embed preservation
   - Tests full workflow integration

## Link Insertion Verification

**Verified:**
- `v2/scripts/blog/add-links-to-json.php` correctly includes `link_utils.php`
- `insertLink()` function uses `findFullWordByContext()` correctly
- Links wrap the words as they already appear in the content rather than substituting the link's `anchor_text`, so the post text itself is unchanged

## Testing

The test suite (`scripts/blog/test-extraction-fixes.py`) verifies:

1. ✅ CTA removal works correctly
2. ✅ Author removal works correctly
3. ✅ Container removal works correctly
4. ✅ Embed preservation works correctly
5. ✅ Full workflow integration works correctly

## Usage

### Run Extraction

```bash
python3 scripts/blog/extract-content.py
```

The script will:
1. Load post URLs from `docs/data/blog-posts-metadata.json`
2. Fetch HTML content for each post
3. Extract main content area
4. Unwrap container divs, then remove CTA sections and author elements
5. Preserve embeds
6. Save cleaned content to `docs/data/blog-posts-content-full.json`

### Run Tests

```bash
python3 scripts/blog/test-extraction-fixes.py
```

## Success Criteria

1. ✅ CTAs completely removed from extracted content
2. ✅ Author names completely removed from extracted content
3. ✅ Container/border divs removed from extracted content
4. ✅ All embeds (iframes, scripts, videos) preserved in extracted content
5. ✅ Internal links added correctly without changing content words
6. ✅ All tests pass
7. ✅ Browser rendering shows clean content with embeds and links

## Related Documentation

- [Content Extraction Guide](CONTENT_EXTRACTION_GUIDE.md)
- [Embed Handling Guide](EMBED_HANDLING_GUIDE.md)
- [Content Preservation Fix Complete](CONTENT_PRESERVATION_FIX_COMPLETE.md)
