# Blog Content Extraction Fix Summary

**Last Updated:** 2026-01-10

## Problem Fixed

The blog content extraction script (`scripts/blog/extract-content.py`) had multiple critical issues:

1. **CTAs and Author Names Showing**: Content that should be removed was appearing in extracted posts
2. **Border/Container Divs Showing**: WordPress wrapper divs with container classes were not being removed
3. **Embeds/Videos Not Extracted**: Embedded content (iframes, scripts, videos) was not being explicitly preserved
4. **Links Not Being Added**: Internal links were not being inserted into content after extraction

## Solution Implemented

### 1. Improved CTA Removal (`remove_cta_sections()`)

**Changes:**

- Fixed pattern matching to handle nested divs
- Added defensive node removal (collect nodes first, then remove)
- Improved CTA text detection patterns
- Added "Demo buchen" to CTA patterns

**Patterns Handled:**

- Divs with `bg-ordio-sand` classes containing CTA text
- Divs with `order-3` class containing CTA text
- Any div containing CTA text with CTA-like classes
- Aggressive fallback for divs/sections with CTA text and button-like content

### 2. Improved Author Removal (`remove_author_elements()`)

**Changes:**

- Enhanced regex patterns to catch more variations
- Added pattern for short paragraphs starting with author info
- Added removal of any element containing only author info (divs, spans, etc.)

**Patterns Handled:**

- Paragraphs with `author-name` class
- Standalone paragraphs containing "Autor: [Name]" or "Von [Name]"
- Short paragraphs (< 50 chars) starting with author patterns
- Any element containing only author info

### 3. Added Container Removal (`remove_container_divs()`)

**New Function:**

- Removes WordPress wrapper/container divs while preserving content
- Unwraps container divs (replaces with children) instead of deleting
- Preserves embed containers (doesn't unwrap if contains embeds)

**Container Classes Removed:**

- `entry-content`
- `shadow-xl`
- `rounded-[25px]`
- `shadow`
- `rounded-lg`
- `container`
- `wrapper`
- `grid-cols`
- `order-*`
- `py-10`, `px-6`, `px-10`, `xl:py-12`
- `rounded-b-[25px]`, `rounded-t-none`

### 4. Added Embed Preservation (`preserve_embeds()`)

**New Function:**

- Explicitly marks embeds as protected
- Ensures iframes, scripts, videos are not accidentally removed
- Preserves WordPress embed blocks (`wp-block-embed`)

**Embed Types Preserved:**

- `