# Next Steps: Re-extract Blog Content

**Last Updated:** 2026-01-10

## Current Status

The extraction script has been fixed and tested, but the existing extraction file (`docs/data/blog-posts-content-full.json`) predates the fixes (January 9th). The posts in it still contain CTAs, author lines, and container wrappers.

## Required Action

**Re-run the extraction script** to get cleaned content for all posts:

```bash
python3 scripts/blog/extract-content.py
```

This will:

- Fetch content from all 99 blog posts
- Apply the fixed extraction logic (remove CTAs, authors, containers)
- Preserve embeds
- Save cleaned content to `docs/data/blog-posts-content-full.json`

**Estimated time:** 2-3 minutes (99 posts with a 1-second delay between requests, plus fetch time)
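
The estimate can be sanity-checked with simple arithmetic. The sketch below is illustrative only: the post count and delay come from this document, while the per-request fetch times are assumed values, not measurements.

```python
# Rough run-time estimate for the extraction run.
POST_COUNT = 99        # from this document
DELAY_SECONDS = 1.0    # from this document: pause between requests


def estimate_seconds(posts: int, delay: float, fetch_seconds: float) -> float:
    """Total time if every post costs one fetch plus one delay."""
    return posts * (delay + fetch_seconds)


# The fetch times below are assumptions chosen for illustration.
for fetch_seconds in (0.25, 0.5, 0.75):
    total = estimate_seconds(POST_COUNT, DELAY_SECONDS, fetch_seconds)
    print(f"{fetch_seconds:.2f}s per fetch -> {total / 60:.1f} min")
```

The delay alone accounts for roughly 99 seconds, so the 2-3 minute figure holds as long as each request takes well under a second.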

## After Extraction Completes

### 1. Update Post JSON Files

```bash
python3 scripts/blog/update-posts-from-extraction.py
```

This updates all post JSON files with cleaned content.
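
As a minimal sketch of what "update with cleaned content" could look like, the function below replaces only the `content` field of a post record and leaves every other field alone. It is not the code of `update-posts-from-extraction.py`; the `slug` and `title` fields and the sample records are hypothetical, and only the `content` / `html` nesting is taken from the verification command in this document.

```python
import copy


def apply_cleaned_content(post: dict, extracted: dict) -> dict:
    """Return a copy of `post` whose content is replaced by the extracted one.

    All other keys (the post's metadata) are carried over unchanged.
    """
    updated = copy.deepcopy(post)
    updated["content"] = copy.deepcopy(extracted["content"])
    return updated


# Hypothetical records for illustration only.
post = {
    "slug": "example-post",
    "title": "Example",
    "content": {"html": "<div class='entry-content'><p>Hi</p></div>"},
}
extracted = {"slug": "example-post", "content": {"html": "<p>Hi</p>"}}

updated = apply_cleaned_content(post, extracted)
print(updated["title"])
print(updated["content"]["html"])
```

Because the function overwrites a single field with the same extracted value each time, applying it twice gives the same result as applying it once.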

### 2. Re-add Links

```bash
php v2/scripts/blog/add-links-to-json.php
```

This re-adds internal links using the corrected word-boundary logic.
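
To illustrate what word-boundary matching means here, the sketch below links a keyword only when it appears as a whole word. It is written in Python for illustration and is not a port of `add-links-to-json.php`; the function name, keyword, and URL are hypothetical. It also works on plain text only, so a real implementation would need to avoid matching inside HTML tags and attributes.

```python
import re


def link_first_occurrence(text: str, keyword: str, url: str) -> str:
    """Wrap the first whole-word match of `keyword` in an anchor tag."""
    # Lookarounds reject matches that touch another word character,
    # so "plan" does not match inside "planning".
    pattern = re.compile(r"(?<!\w)" + re.escape(keyword) + r"(?!\w)")
    return pattern.sub(
        lambda m: f'<a href="{url}">{m.group(0)}</a>', text, count=1
    )


text = "Our planning tool beats any plan."
print(link_first_occurrence(text, "plan", "/blog/plan"))
# -> Our planning tool beats any <a href="/blog/plan">plan</a>.
```

A plain substring replacement would have linked the first four letters of "planning" instead.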

### 3. Verify

Check a sample post to verify:

- No CTAs
- No authors
- No container wrappers
- Embeds preserved
- Links working

## Verification Command

After re-extraction, verify content is cleaned:

```bash
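python3 -c "
import json

# Load the re-extracted file; the with-block closes it afterwards.
with open('docs/data/blog-posts-content-full.json', encoding='utf-8') as f:
    d = json.load(f)

# Only the first post is checked; change the index to spot-check others.
html = d['posts'][0]['content']['html']
print('CTAs removed:', '7 Tage kostenlos' not in html)
print('Author removed:', 'Autor:' not in html)
print('Containers removed:', 'entry-content' not in html)
# Only meaningful if this post contained an embed to begin with.
print('Embeds preserved:', '<iframe' in html)
"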
```
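
The command above inspects only the first post. A sketch of the same checks applied to every post follows; it reuses the marker strings and the `posts` / `content` / `html` structure from that command, while the function names and the in-memory sample data are made up for illustration.

```python
# Marker strings taken from the verification command in this document.
MARKERS = {
    "CTA": "7 Tage kostenlos",
    "author": "Autor:",
    "container": "entry-content",
}


def find_leftovers(data: dict) -> list:
    """Return (post index, marker label) pairs for every leftover found."""
    leftovers = []
    for index, post in enumerate(data["posts"]):
        html = post["content"]["html"]
        for label, marker in MARKERS.items():
            if marker in html:
                leftovers.append((index, label))
    return leftovers


def count_posts_with_embeds(data: dict) -> int:
    """Count posts whose HTML still contains at least one iframe."""
    return sum("<iframe" in post["content"]["html"] for post in data["posts"])


# Made-up sample data; load the real file with json.load() instead.
sample = {
    "posts": [
        {"content": {"html": "<p>Clean</p><iframe src='x'></iframe>"}},
        {"content": {"html": "<div class='entry-content'>Autor: A</div>"}},
    ]
}
print(find_leftovers(sample))           # -> [(1, 'author'), (1, 'container')]
print(count_posts_with_embeds(sample))  # -> 1
```

An empty list from `find_leftovers` means no post contains any of the three markers. The embed count is best compared against the count from the old extraction file, since not every post has an embed.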

## Notes

- The extraction script includes a 1-second delay between requests to be polite to the server
- All 99 posts will be extracted
- The process is idempotent - safe to re-run
- Existing post JSON files will be updated, preserving metadata