# Link Preservation During Content Extraction

**Last Updated:** 2026-01-10

Guide for preserving internal links when extracting or updating blog post content.

## Overview

When extracting blog post content from WordPress or updating content, it's critical to preserve existing internal links. This guide explains how to extract links before content updates and re-apply them correctly using German-aware word boundaries.

## Process

### Step 1: Extract Links Before Content Update

Before updating blog post content, extract all existing links:

```python
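import json

from preserve_links_during_extraction import preserve_links_before_extraction

# Extract all existing links from the post JSON before touching its content
preserved_links = preserve_links_before_extraction('v2/data/blog/posts/lexikon/post-slug.json')

# Save the preserved links; ensure_ascii=False keeps umlauts readable in the file
with open('preserved-links.json', 'w', encoding='utf-8') as f:
    json.dump(preserved_links, f, indent=2, ensure_ascii=False)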
```

### Step 2: Update Content

Update the blog post content, for example by re-extracting it from WordPress or editing the HTML.

### Step 3: Re-apply Links After Update

After content update, re-apply links using German-aware word boundaries:

```python
import json

from preserve_links_during_extraction import reapply_links_after_extraction

# Load the links saved in Step 1
with open('preserved-links.json', 'r', encoding='utf-8') as f:
    preserved_links = json.load(f)

# Re-apply the links to the updated post
result = reapply_links_after_extraction('v2/data/blog/posts/lexikon/post-slug.json', preserved_links)

# Write the post back only if a result was returned
if result:
    with open('v2/data/blog/posts/lexikon/post-slug.json', 'w', encoding='utf-8') as f:
        json.dump(result['updated_data'], f, indent=2, ensure_ascii=False)
```

## Key Principles

### 1. German-Aware Word Boundaries

Always use German-aware word boundary patterns when finding anchor text:

```python
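import re

from link_utils import german_word_boundary_pattern

anchor_text = "Schichtplanung"  # example anchor text

# Escape the anchor first so regex metacharacters in it are matched literally
pattern = german_word_boundary_pattern(re.escape(anchor_text))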
```

This keeps ä, ö, ü, and ß inside the word, so a match never starts or ends in the middle of a German word.
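
The implementation of `german_word_boundary_pattern` is not shown in this guide. As a rough illustration of the idea, the sketch below wraps an escaped anchor in lookarounds that reject a neighbouring letter. The function name `sketch_german_word_boundary` and the letter set are assumptions made for this example, not the contents of `link_utils`.

```python
import re

# Letters treated as part of a German word (an assumption for this sketch).
GERMAN_LETTERS = "A-Za-zÄÖÜäöüß"


def sketch_german_word_boundary(escaped_anchor: str) -> str:
    """Wrap an already-escaped anchor in lookarounds that reject adjacent letters."""
    return rf"(?<![{GERMAN_LETTERS}]){escaped_anchor}(?![{GERMAN_LETTERS}])"


pattern = re.compile(sketch_german_word_boundary(re.escape("Schichtplanung")))

# Matches the standalone word ...
standalone = pattern.search("Eine digitale Schichtplanung hilft.")
# ... but not the same letters inside a longer compound.
inside_compound = pattern.search("Eine Schichtplanungssoftware hilft.")
```

With this sketch, `standalone` is a match object and `inside_compound` is `None`, because the anchor is followed directly by another letter.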

### 2. Link Full Words

When re-applying links, always link the **entire word**, not a substring:

- ✅ **Correct:** `<a href="...">Schichtplanungssoftware</a>`
- ❌ **Incorrect:** `<a href="...">Schichtplanung</a>ssoftware`
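
One possible way to honour this rule is to extend a match to the edges of the word that contains the anchor text. The sketch below does that for plain text; `link_full_word` is a hypothetical helper, and the project's own code may instead skip anchors that sit inside a longer word. It does not guard against matches inside existing HTML tags or attributes.

```python
import re
from html import escape

WORD_CHARS = "A-Za-zÄÖÜäöüß"  # assumed definition of a German word character


def link_full_word(text: str, anchor_text: str, href: str) -> str:
    """Link the first whole word containing anchor_text (plain-text input only)."""
    # Optional letters before and after the anchor stretch the match to the word edges.
    pattern = re.compile(
        rf"[{WORD_CHARS}]*{re.escape(anchor_text)}[{WORD_CHARS}]*"
    )
    safe_href = escape(href, quote=True)
    return pattern.sub(
        lambda m: f'<a href="{safe_href}">{m.group(0)}</a>', text, count=1
    )


linked = link_full_word(
    "Eine Schichtplanungssoftware hilft.", "Schichtplanung", "/schichtplan"
)
```

Here the anchor text `Schichtplanung` only locates the word; the link wraps the full compound `Schichtplanungssoftware`.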

### 3. Preserve Content Integrity

Never change words when adding links:

- ✅ **Correct:** Link "Checklisten" → `<a href="...">Checklisten</a>`
- ❌ **Incorrect:** Change "Checklisten" to "Checkliste" and link that
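
A simple way to check this principle is to compare the visible text before and after linking: if only `<a>` tags were added, the text with all tags removed should be identical. The sketch below uses the standard library's `html.parser`; the helper names `visible_text` and `text_unchanged` are made up for this example.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def visible_text(html: str) -> str:
    """Return the fragment's text with all tags removed."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()  # flush any buffered text
    return "".join(collector.parts)


def text_unchanged(before_html: str, after_html: str) -> bool:
    """True if both fragments contain exactly the same visible text."""
    return visible_text(before_html) == visible_text(after_html)


before = "<p>Nutzen Sie Checklisten im Alltag.</p>"
linked_ok = '<p>Nutzen Sie <a href="/checklisten">Checklisten</a> im Alltag.</p>'
linked_bad = '<p>Nutzen Sie <a href="/checklisten">Checkliste</a> im Alltag.</p>'
```

In this example `text_unchanged(before, linked_ok)` holds, while `linked_bad` fails the check because "Checklisten" was shortened to "Checkliste".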

## Link Metadata Format

Preserved links are stored in this format:

```json
{
  "post_url": "/insights/lexikon/post-slug/",
  "slug": "post-slug",
  "links": [
    {
      "anchor_text": "Schichtplanung",
      "href": "/schichtplan",
      "context": "Eine digitale Schichtplanung hilft...",
      "position_in_text": 1234
    }
  ],
  "extracted_at": "2026-01-10T16:00:00"
}
```
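
Before re-applying links from a saved file, it can help to confirm that the file still has this shape. The following is a minimal sketch of such a check, based only on the fields shown above; `check_preserved_links` is a hypothetical helper, and the assumption that `extracted_at` is an ISO 8601 timestamp comes from the sample value.

```python
from datetime import datetime

# Expected type of each field in a link entry, taken from the format above.
LINK_FIELDS = {"anchor_text": str, "href": str, "context": str, "position_in_text": int}


def check_preserved_links(record: dict) -> list:
    """Return a list of human-readable problems; an empty list means the shape is fine."""
    problems = []
    for key in ("post_url", "slug", "extracted_at"):
        if not isinstance(record.get(key), str):
            problems.append(f"missing or non-string field: {key}")

    extracted_at = record.get("extracted_at")
    if isinstance(extracted_at, str):
        try:
            datetime.fromisoformat(extracted_at)
        except ValueError:
            problems.append("extracted_at is not an ISO 8601 timestamp")

    links = record.get("links")
    if not isinstance(links, list):
        problems.append("links must be a list")
        return problems

    for index, link in enumerate(links):
        if not isinstance(link, dict):
            problems.append(f"links[{index}] must be an object")
            continue
        for field, expected in LINK_FIELDS.items():
            if not isinstance(link.get(field), expected):
                problems.append(f"links[{index}].{field} should be {expected.__name__}")
    return problems


sample = {
    "post_url": "/insights/lexikon/post-slug/",
    "slug": "post-slug",
    "links": [
        {
            "anchor_text": "Schichtplanung",
            "href": "/schichtplan",
            "context": "Eine digitale Schichtplanung hilft...",
            "position_in_text": 1234,
        }
    ],
    "extracted_at": "2026-01-10T16:00:00",
}
problems = check_preserved_links(sample)
```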

## Validation

After re-applying links, validate content integrity:

```bash
python3 v2/scripts/blog/validate-content-integrity.py
```

This checks for:

- Split words (links that cover only part of a word)
- Content changes
- Invalid HTML structure
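
The internals of `validate-content-integrity.py` are not described here. To illustrate what a split-word check might look like, the sketch below flags any `<a>` tag that opens directly after a letter or closes directly before one. The function `find_split_words` and the letter set are assumptions for this example, not the script's actual logic.

```python
import re

LETTER = "A-Za-zÄÖÜäöüß"  # assumed set of German word characters

# A letter touching the outside of an <a> element means the link splits a word.
SPLIT_WORD = re.compile(rf"[{LETTER}]<a\b[^>]*>|</a>[{LETTER}]")


def find_split_words(html: str) -> list:
    """Return the character offsets where a link boundary falls inside a word."""
    return [match.start() for match in SPLIT_WORD.finditer(html)]


ok_html = 'Eine <a href="/schichtplan">Schichtplanungssoftware</a> hilft.'
bad_html = 'Eine <a href="/schichtplan">Schichtplanung</a>ssoftware hilft.'

ok_hits = find_split_words(ok_html)
bad_hits = find_split_words(bad_html)
```

For the two sample strings, `ok_hits` is empty and `bad_hits` contains one offset, pointing at the `</a>` that is followed by `ssoftware`.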

## Related Documentation

- [Word Boundary Guidelines](WORD_BOUNDARY_GUIDELINES.md) - German word boundary patterns
- [Content Extraction Guide](CONTENT_EXTRACTION_GUIDE.md) - Content extraction process
- [Internal Linking Guide](INTERNAL_LINKING_GUIDE.md) - Internal linking best practices
