` tag (fallback)
4. `` tag (last resort)

### Content Cleaning

After extraction, content automatically has:

- **Author names removed**: Author information (e.g., "Autor: [Name]", "Von [Name]") is automatically removed from content since it's displayed in the post header
- **CTA sections removed**: Marketing CTAs (e.g., "7 Tage kostenlos testen...") are automatically removed from content
- **Embeds preserved**: All embedded content (iframes, scripts, videos) is preserved during extraction
- HTML cleanup (remove WordPress-specific classes, but preserve embed containers)
- Image path updates
- Link path updates
- Formatting normalization

**Note**: Author and CTA removal happens during extraction, cleanup, and at runtime in the PostContent component for defensive cleanup. Embeds are preserved throughout the entire process.

### CTA Removal

CTA sections are removed using multiple patterns:

- Divs with `bg-ordio-sand` or `bg-ordio-sand-dark` classes containing CTA text
- Divs with `order-3` class containing CTA text
- Any div/section containing CTA text patterns ("7 Tage kostenlos", "Abwesenheiten einfach", "Jetzt kostenlos testen")

**Cleanup Script**: If CTAs are found in existing post files, run:

```bash
python3 scripts/blog/remove-ctas-from-posts.py
```

This script scans all post JSON files and removes CTA sections.

## Next Steps After Extraction

### 1. Review Extracted Content

Check `blog-posts-content-full.json`:

- Verify all posts extracted
- Check content quality
- Identify any missing content

### 2. Download Images

Use `blog-images-list.json` to:

- Generate download script
- Download all images
- Convert to WebP format
- Optimize file sizes

### 3. Content Migration

Prepare content for static pages:

- Clean HTML content
- Update image paths
- Update internal links
- Generate static page templates

## Troubleshooting

### Failed Posts

If posts fail to extract:

- Check URL accessibility
- Verify network connection
- Review error messages in output
- Manually extract if needed

### Missing Content

If content appears incomplete:

- Check HTML structure
- Verify content selectors
- Review WordPress template changes
- Adjust extraction logic

### Image Issues

If images are missing:

- Check image URLs
- Verify image accessibility
- Review image source tags
- Check for lazy loading

## Related Documentation

- [Migration Requirements](MIGRATION_REQUIREMENTS.md) - Content requirements
- [Migration Strategy](MIGRATION_STRATEGY.md) - Migration approach
- [Migration Inventory](MIGRATION_INVENTORY.md) - URL mappings
- [Next Steps](NEXT_STEPS.md) - Implementation guide

## Script Reference

**Location:** `scripts/blog/extract-content.py`

**Dependencies:**

- `docs/data/blog-posts-metadata.json` (input)
- Python packages: `requests`, `beautifulsoup4`, `lxml`

**Output:**

- `docs/data/blog-posts-content-full.json`
- `docs/data/blog-images-list.json`
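
**Example: checking for lazy-loaded images**

When entries are missing from `blog-images-list.json`, lazy loading is one possible cause: the real URL may sit in an attribute such as `data-src` while `src` holds a placeholder. The following is a minimal sketch of that check using only the Python standard library. It is not taken from `extract-content.py`; the names `ImageCollector` and `collect_image_urls` are hypothetical, and the attribute names `data-src` and `data-lazy-src` are assumptions based on common lazy-loading conventions, so adjust them to what the actual pages use.

```python
from html.parser import HTMLParser

# Lazy-loading attributes to check before `src`.
# These names are assumptions, not taken from extract-content.py.
LAZY_ATTRS = ("data-src", "data-lazy-src")


class ImageCollector(HTMLParser):
    """Hypothetical helper that collects image URLs from an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        # Prefer a lazy-loading attribute, since `src` may be a placeholder.
        for name in LAZY_ATTRS + ("src",):
            url = attr_map.get(name)
            # Skip missing values and inline data: URI placeholders.
            if url and not url.startswith("data:"):
                self.urls.append(url)
                break


def collect_image_urls(html):
    """Return image URLs in document order, one per <img> tag."""
    parser = ImageCollector()
    parser.feed(html)
    parser.close()
    return parser.urls


sample = (
    '<p><img src="data:image/gif;base64,AAAA" data-src="/wp-content/uploads/a.png">'
    '<img src="/b.jpg"/></p>'
)
print(collect_image_urls(sample))
```

Comparing the result of such a check against the entries in `blog-images-list.json` for a single post can show whether the extraction picked up placeholder sources instead of the real image URLs.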