# Ahrefs Internal Link Opportunities - Process Documentation

**Last Updated:** 2026-01-27

## Quick Start (For Recurring Imports)

**For future Ahrefs CSV imports, use the automated workflow:**

```bash
python3 v2/scripts/blog/process-ahrefs-csv.py /path/to/ahrefs-export.csv
```

See `docs/seo/AHREFS_RECURRING_WORKFLOW.md` for the complete recurring workflow guide.

## Overview

This document outlines the complete process for analyzing, filtering, and implementing internal link opportunities from Ahrefs CSV exports. This process ensures high-quality, contextual internal links that improve SEO while maintaining natural content flow.

**Enhanced Version:** This documentation covers the improved filtering system with enhanced priority scoring, context quality assessment, topical relevance scoring, and automated classification.

**Automated Workflow:** The process is now fully automated via the `process-ahrefs-csv.py` script. See Quick Start above.

## Process Workflow

### Phase 1: Data Import and Initial Analysis

**Input:** Ahrefs CSV export (`ordio_*.csv`)

**Scripts:**
- `v2/scripts/blog/analyze-ahrefs-opportunities.py` - Original analysis script
- `v2/scripts/blog/analyze-ahrefs-opportunities-enhanced.py` - Enhanced analysis with comprehensive statistics
- `v2/scripts/blog/analyze-ahrefs-opportunities-jan-2026.py` - January 2026 CSV analysis script (UTF-16 LE handling)
- `v2/scripts/blog/audit-existing-links.py` - Audit existing internal links

**Steps:**

1. **Parse CSV file**
   - Handle UTF-16 LE encoding
   - Tab-separated values
   - Extract: Source page, Keyword, Keyword context, Target page, PR, Traffic data

2. **Normalize URLs**
   - Remove protocol and domain
   - Remove trailing slashes
   - Remove query parameters and fragments
   - Standardize for comparison

3. **Validate opportunities**
   - Check source page is not `noindex`
   - Verify target page exists
   - Check if link already exists
   - Identify blog post JSON files

4. **Calculate priority score**
   - Source PR (weight: 0.3)
   - Source traffic (weight: 0.2)
   - Keyword search volume (weight: 0.2)
   - Target traffic (weight: 0.3)
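
The weighting in step 4 amounts to a weighted sum. The sketch below is illustrative rather than the code in the analysis scripts: it assumes every signal has already been scaled to the 0-1 range (this document does not say how raw values are scaled), and the names `PHASE1_WEIGHTS`, `priority_score`, and `prioritize` are hypothetical.

```python
# Weights from step 4 of Phase 1; they sum to 1.0.
PHASE1_WEIGHTS = {
    "source_pr": 0.3,
    "source_traffic": 0.2,
    "keyword_volume": 0.2,
    "target_traffic": 0.3,
}


def priority_score(signals):
    """Weighted sum of signals that are assumed to be pre-scaled to 0-1."""
    return sum(weight * signals.get(name, 0.0)
               for name, weight in PHASE1_WEIGHTS.items())


def prioritize(opportunities):
    """Return opportunities sorted from highest to lowest priority."""
    return sorted(opportunities, key=priority_score, reverse=True)
```

Because the weights sum to 1.0, an opportunity with every signal at its maximum scores 1.0, which keeps scores comparable across imports.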

**Output:** `ahrefs-analysis/full-analysis.json`, `valid-opportunities.json`, `prioritized-opportunities.json`
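
The parsing and URL normalization steps of this phase (steps 1 and 2) can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in the analysis scripts: `parse_ahrefs_csv` and `normalize_url` are hypothetical names, and the export is assumed to carry a header row.

```python
import csv
from urllib.parse import urlsplit


def parse_ahrefs_csv(path):
    """Read a tab-separated Ahrefs export and return a list of row dicts."""
    # The "utf-16" codec reads the byte order mark to pick the byte order
    # and drops it from the data, so the first header name stays clean.
    with open(path, encoding="utf-16", newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def normalize_url(url):
    """Reduce a URL to a bare path so two spellings of one page compare equal."""
    parts = urlsplit(url)  # separates scheme, domain, path, query, fragment
    path = parts.path.rstrip("/")  # query and fragment are simply not used
    return path or "/"  # keep the site root as "/"
```

Comparing normalized paths rather than raw URLs means `https://www.ordio.com/insights/dienstplan/?utm=1` and `/insights/dienstplan` are treated as the same page when checking for existing links.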

### Phase 2: Quality Filtering (Enhanced)

**Scripts:**
- `v2/scripts/blog/filter-ahrefs-opportunities.py` - Original filtering script
- `v2/scripts/blog/filter-ahrefs-opportunities-enhanced.py` - Enhanced filtering with all improvements

**Enhanced Filter Criteria:**

1. **Enhanced Priority Scoring**
   - Source PR (30%)
   - Source URL Rating (20%)
   - Source traffic (10%, capped at 1000)
   - Keyword search volume (15%, capped at 50K)
   - Keyword difficulty (inverse, 10%)
   - Target traffic potential (10%, capped at 1000)
   - Content placement bonus (5% for first 30% of content)
   - Pillar/cluster boost multiplier (1.5x for pillar pages)
   - Target ranking position factor (boost for top 10/50)

2. **Enhanced Context Quality Assessment**
   - Minimum context length: 50 characters
   - Keyword must appear in context
   - **Protected Areas (Never Add Links):**
     - Headers (h1-h6) - Links in headers appear spammy and don't improve SEO
     - FAQ questions - Questions are structural elements, not content
     - Related content carousel - Already has links (check for duplicates)
     - Script/style tags - Already protected
     - HTML tag attributes - Already protected
     - Existing links - Already checked
   - **Safe Areas (Can Add Links):**
     - Paragraphs (`<p>`) in `content.html` with natural context
     - List items (`<li>`) with sufficient context (minimum 15 chars)
     - FAQ answers (`faqs[].answer`) - HTML content separate from questions
     - Table cells (`<td>`) - Only if natural sentence context (rare)
   - Validate context is in natural paragraph (not header/list fragment)
   - Check header proximity (minimum 50 characters from headers)
   - Check content placement (prefer first 30% of content)
   - Ensure context forms complete sentence
   - Verify keyword appears naturally (not in generic template text)

3. **Enhanced Anchor Text Validation**
   - Expanded generic anchor blacklist (German-specific)
   - Check anchor text diversity per source page
   - Validate grammatical correctness in context
   - Ensure anchor matches actual word form (case, pluralization)
   - German word boundary detection

4. **Topical Relevance Scoring**
   - Extract topics from source page (categories, tags)
   - Extract topics from target page
   - Calculate semantic similarity score (0-1)
   - Validate pillar-cluster relationships
   - Check keyword matches target page topic
   - Ensure source category aligns with target category

5. **Dynamic Link Density Management**
   - Dynamic limit based on content length (1 link per 200 words, max 20)
   - Ensure minimum 200 characters between links
   - Check link distribution (avoid clustering)
   - Validate links aren't too close to section headers

6. **Automated Classification**
   - **Auto-Approve:** High-value opportunities (pillar links, high PR+traffic, high volume+low difficulty)
   - **Review:** Medium-value opportunities (passed all filters but need manual review)
   - **Auto-Reject:** Low-value opportunities (failed quality checks)
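
The enhanced priority scoring in criterion 1 can be sketched as a capped, weighted sum. This is an illustration only: it assumes PR, URL Rating, and keyword difficulty are on a 0-100 scale, the field names are hypothetical, and the ranking position factor is left out because its size is not specified here.

```python
def capped(value, cap):
    """Scale a value to 0-1, treating anything at or above the cap as 1."""
    return min(max(value, 0), cap) / cap


def enhanced_priority(opp):
    """Combine the weighted signals from criterion 1 into a single score."""
    score = (
        0.30 * capped(opp["source_pr"], 100)            # assumed 0-100 scale
        + 0.20 * capped(opp["source_ur"], 100)          # assumed 0-100 scale
        + 0.10 * capped(opp["source_traffic"], 1000)
        + 0.15 * capped(opp["keyword_volume"], 50_000)
        + 0.10 * (1 - capped(opp["keyword_difficulty"], 100))  # inverse
        + 0.10 * capped(opp["target_traffic"], 1000)
    )
    # Placement bonus: the keyword sits in the first 30% of the content.
    if opp["relative_position"] <= 0.30:
        score += 0.05
    # Pillar targets receive the 1.5x multiplier.
    if opp["target_is_pillar"]:
        score *= 1.5
    return score
```

The weighted terms plus the placement bonus add up to 1.0, so a non-pillar opportunity tops out at 1.0 and a pillar opportunity at 1.5.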

**Output:** 
- `ahrefs-analysis/filtered-opportunities-enhanced.json` - Enhanced filtered opportunities
- `ahrefs-analysis/review-report.json` - Review report with classifications
- `ahrefs-analysis/manual-review-report.md` - Markdown review report
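
Criterion 4 scores topical relevance between 0 and 1 from each page's categories and tags. How the filter script computes that score is not documented here, so the sketch below substitutes a plain Jaccard overlap of the two topic sets as a stand-in. The function names and the `categories` / `tags` field names are assumptions.

```python
def topic_set(page):
    """Collect the lower-cased categories and tags of a page into one set."""
    topics = page.get("categories", []) + page.get("tags", [])
    return {topic.strip().lower() for topic in topics}


def topical_relevance(source_page, target_page):
    """Jaccard overlap of the two topic sets, between 0.0 and 1.0."""
    source_topics = topic_set(source_page)
    target_topics = topic_set(target_page)
    union = source_topics | target_topics
    if not union:
        return 0.0  # no topics on either page: treat as unrelated
    return len(source_topics & target_topics) / len(union)
```

A set overlap ignores synonyms and related terms, so it would understate relevance compared with a truly semantic measure.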

### Phase 3: Link Implementation

**Scripts:**
- `v2/scripts/blog/add-ahrefs-links.py` - Original implementation script
- `v2/scripts/blog/add-ahrefs-links-enhanced.py` - Enhanced implementation with safe placement validation

**Enhanced Process with Safe Area Protection:**

1. **Group by source page**
   - Process all opportunities for one page together
   - Track links per page to enforce minimum distance (200 chars)
   - Check link density per page

2. **Check protected areas**
   - **Never add links to:**
     - Headers (h1-h6) - Detected via `is_inside_header()`
     - FAQ questions - Detected via `check_faq_question_contains_keyword()`
     - Related content carousel - Checked via `check_related_posts_duplicate()`
     - Script/style tags - Already protected
     - HTML tag attributes - Already protected
     - Existing links - Already checked

3. **Find safe insertion point**
   - Use `german_word_boundary_pattern()` for accurate matching
   - Use `find_safe_match_positions()` with enhanced protection:
     - Header protection (`is_inside_header()`)
     - Paragraph detection (`is_in_safe_paragraph()`)
     - Header proximity check (`is_too_close_to_header()`, minimum 50 chars)
     - Minimum distance from existing links (200 chars)
   - Preserve original word form (capitalization, pluralization)

4. **Process FAQ answers separately**
   - If keyword found in FAQ answer (not question), use `process_faq_answer_link()`
   - FAQ answers can contain links (they're separate from questions)
   - Update FAQ answer HTML while preserving structure

5. **Check carousel duplicates**
   - If target URL is in `related_posts` carousel:
     - Only add if high-value (pillar, high PR+volume, different anchor)
     - Use `should_add_despite_carousel()` decision logic
     - Log decision for review

6. **Insert link**
   - Wrap actual text found in HTML with `<a>` tag
   - Use absolute URL: `https://www.ordio.com/...`
   - Update both `content.html` and `content.text` fields
   - Track link position for minimum distance enforcement

7. **Update metadata**
   - Add link entry to `internal_links` array
   - Include: URL, normalized URL, anchor text, target type, priority, timestamp, reasoning
   - Log skipped reasons (header protection, carousel duplicate, etc.)
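
Steps 6 and 7 can be sketched as below, assuming a safe match position has already been found. The function names and the metadata field names are hypothetical; the actual schema of the `internal_links` array may differ.

```python
import html
from datetime import datetime, timezone

BASE_URL = "https://www.ordio.com"


def insert_link(content_html, start, end, target_path):
    """Wrap content_html[start:end] in an <a> tag pointing at target_path."""
    anchor_text = content_html[start:end]  # keep the word form as written
    href = html.escape(BASE_URL + target_path, quote=True)
    link = f'<a href="{href}">{anchor_text}</a>'
    return content_html[:start] + link + content_html[end:], anchor_text


def link_metadata(target_path, anchor_text, priority):
    """Build an entry for the internal_links array (field names assumed)."""
    return {
        "url": BASE_URL + target_path,
        "normalized_url": target_path,
        "anchor_text": anchor_text,
        "priority": priority,
        "added_at": datetime.now(timezone.utc).isoformat(),
        "reasoning": "Ahrefs opportunity",
    }
```

Slicing the anchor text out of the content, rather than inserting the Ahrefs keyword, is what preserves the original capitalization and word form.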

**Options:**

- `--dry-run`: Test without modifying files
- `--limit N`: Process only first N opportunities
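
A minimal sketch of how these options might be wired up with `argparse`. The flags `--dry-run` and `--limit` come from the list above and `--input` from the usage example in the Scripts Reference; the function names, help texts, and defaults are assumptions.

```python
import argparse


def build_parser():
    """Command-line options as documented; defaults are assumptions."""
    parser = argparse.ArgumentParser(description="Add internal links")
    parser.add_argument("--input",
                        help="path to the filtered opportunities JSON")
    parser.add_argument("--dry-run", action="store_true",
                        help="report planned changes without modifying files")
    parser.add_argument("--limit", type=int, default=None, metavar="N",
                        help="process only the first N opportunities")
    return parser


def select_opportunities(opportunities, limit):
    """Apply --limit; None means no limit."""
    return opportunities if limit is None else opportunities[:limit]
```

Combining `--dry-run` with a small `--limit` gives a quick preview of what the script would change before a full run.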

**Output:** `ahrefs-analysis/implementation-results.json`

### Phase 4: Validation

**Script:** `v2/scripts/blog/validate-added-links.py`

**Checks:**

1. **HTML structure**
   - Well-formed HTML (no unclosed tags)
   - No empty `href` attributes
   - Valid link syntax

2. **Link density**
   - Count internal links per page
   - Flag if exceeds 20 links/page

3. **Link functionality**
   - Verify URLs are accessible
   - Check anchor text is present
   - Ensure links are properly formatted
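
The structure and density checks above can be approximated with the standard library's `html.parser`. This is a sketch, not the logic of `validate-added-links.py`: the class and function names are hypothetical, links are treated as internal when their `href` starts with `/` or `https://www.ordio.com` (a simplification), and URL accessibility is not tested.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
INTERNAL_PREFIXES = ("/", "https://www.ordio.com")


class LinkChecker(HTMLParser):
    """Collect structural problems and count internal links in a fragment."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.problems = []
        self.internal_links = 0

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)
        if tag == "a":
            href = (dict(attrs).get("href") or "").strip()
            if not href:
                self.problems.append("missing or empty href")
            elif href.startswith(INTERNAL_PREFIXES):
                self.internal_links += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return  # e.g. the implicit end of a self-closing <br/>
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            self.problems.append(f"unexpected </{tag}>")


def check_html(fragment, max_links=20):
    """Return a list of problems; an empty list means the fragment passed."""
    checker = LinkChecker()
    checker.feed(fragment)
    checker.close()
    problems = checker.problems + [f"unclosed <{tag}>" for tag in checker.open_tags]
    if checker.internal_links > max_links:
        problems.append("link density exceeded")
    return problems
```

The check is stricter than the HTML specification, which allows some end tags such as `</p>` to be omitted; that strictness matches the "no unclosed tags" rule above.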

**Output:** `ahrefs-analysis/validation-results.json`, `link-test-results.json`

## Quality Standards

### Link Quality Checklist

- [ ] Context is meaningful (50+ characters)
- [ ] Keyword appears naturally in context
- [ ] Anchor text is keyword-relevant (not generic)
- [ ] Link supports pillar-cluster model
- [ ] Source and target are topically related
- [ ] Link density ≤ 20 per page
- [ ] Original word form preserved (capitalization, pluralization)
- [ ] Link is not inside HTML tags, scripts, or existing links
- [ ] **Link is NOT in header (h1-h6)** - Protected area
- [ ] **Link is NOT in FAQ question** - Protected area
- [ ] **Link is in safe paragraph** - With natural context
- [ ] **Link is minimum 50 chars from headers** - Header proximity check
- [ ] **Link is minimum 200 chars from other links** - Prevents clustering
- [ ] **Carousel duplicates handled** - Checked and logged
- [ ] URL is absolute and correct
- [ ] Link flows naturally in content

### Anchor Text Best Practices

**Good anchor text:**
- Natural keyword usage: "Tarifverträge", "Lohnabrechnung", "HACCP-Kontrolle"
- Contextual phrases: "digitale Schichtplanung", "Personaleinsatzplanung"
- Varied and descriptive

**Bad anchor text:**
- Generic: "hier", "mehr", "klicken", "lesen"
- Over-optimized: exact match keyword stuffing
- Unnatural: forced keyword placement
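
A generic-anchor check along these lines can be sketched as below. The set contains only the four generic words named above; the scripts' expanded German blacklist is not reproduced here, and `is_generic_anchor` is a hypothetical name.

```python
# Generic anchors named in this document (not the scripts' full blacklist).
GENERIC_ANCHORS = {"hier", "mehr", "klicken", "lesen"}


def is_generic_anchor(anchor_text):
    """True if the anchor is empty or consists only of generic words."""
    words = [word.strip(".,:!?") for word in anchor_text.lower().split()]
    return not words or all(word in GENERIC_ANCHORS for word in words)
```

Checking every word rather than the whole string also catches combinations such as "mehr lesen" without listing each one.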

## Scripts Reference

### Enhanced Analysis Scripts

#### analyze-ahrefs-opportunities-enhanced.py

```bash
python3 v2/scripts/blog/analyze-ahrefs-opportunities-enhanced.py
```

**Outputs:**
- `comprehensive-statistics.json`: Detailed statistics (PR distribution, traffic, keyword volumes, duplicates)
- `full-analysis.json`: All parsed opportunities with metadata
- `valid-opportunities.json`: Opportunities passing initial validation
- `prioritized-opportunities.json`: Valid opportunities sorted by priority

#### audit-existing-links.py

```bash
python3 v2/scripts/blog/audit-existing-links.py
```

**Outputs:**
- `existing-links-inventory.json`: Complete inventory of all existing internal links
- `existing-links-summary.json`: Summary statistics and link density analysis

### Enhanced Filtering Scripts

#### filter-ahrefs-opportunities-enhanced.py

```bash
python3 v2/scripts/blog/filter-ahrefs-opportunities-enhanced.py
```

**Outputs:**
- `filtered-opportunities-enhanced.json`: Enhanced filtered opportunities with classifications
- `review-report.json`: Review report with auto-approve/review/reject classifications

#### generate-review-report.py

```bash
python3 v2/scripts/blog/generate-review-report.py
```

**Outputs:**
- `manual-review-report.md`: Markdown report for manual review

### Original Scripts (Still Available)

#### analyze-ahrefs-opportunities.py

```bash
python3 v2/scripts/blog/analyze-ahrefs-opportunities.py
```

**Outputs:**
- `full-analysis.json`: All parsed opportunities with metadata
- `valid-opportunities.json`: Opportunities passing initial validation
- `prioritized-opportunities.json`: Valid opportunities sorted by priority

#### filter-ahrefs-opportunities.py

```bash
python3 v2/scripts/blog/filter-ahrefs-opportunities.py
```

**Outputs:**
- `filtered-opportunities.json`: High-quality opportunities ready for implementation

### add-ahrefs-links.py

```bash
# Dry run first
python3 v2/scripts/blog/add-ahrefs-links.py \
  --input v2/scripts/blog/ahrefs-analysis/filtered-opportunities.json \
  --dry-run

# Actual implementation
python3 v2/scripts/blog/add-ahrefs-links.py \
  --input v2/scripts/blog/ahrefs-analysis/filtered-opportunities.json
```

**Outputs:**
- `implementation-results.json`: Results of link addition (success/failure per opportunity)

### validate-added-links.py

```bash
python3 v2/scripts/blog/validate-added-links.py
```

**Outputs:**
- `validation-results.json`: HTML structure and link density validation
- `link-test-results.json`: Link functionality test results

### test-added-links.py

```bash
python3 v2/scripts/blog/test-added-links.py
```

**Outputs:**
- `link-test-results.json`: Detailed test results for all added links

## Safe Areas and Protected Areas

### Protected Areas (Never Add Links)

**Headers (h1-h6):**
- Links in headers appear spammy and don't improve SEO
- Headers are structural elements, not content
- Detection: `is_inside_header()` function
- Minimum distance: 50 characters from headers
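
One way a header check of this kind could work is sketched below. It is not the implementation of `is_inside_header()` or `is_too_close_to_header()`: the function names are stand-ins, and measuring the 50-character distance in raw HTML characters is an assumption.

```python
import re

# Matches a whole <h1>...</h1> through <h6>...</h6> element.
HEADER_RE = re.compile(r"<h([1-6])\b[^>]*>.*?</h\1\s*>",
                       re.IGNORECASE | re.DOTALL)
MIN_HEADER_DISTANCE = 50  # characters, per the rule above


def header_spans(content_html):
    """Return (start, end) offsets of every header element."""
    return [match.span() for match in HEADER_RE.finditer(content_html)]


def inside_header(content_html, position):
    """True if the offset falls within a header element."""
    return any(start <= position < end
               for start, end in header_spans(content_html))


def too_close_to_header(content_html, position):
    """True if the offset is inside a header or within 50 characters of one."""
    return any(start - MIN_HEADER_DISTANCE < position < end + MIN_HEADER_DISTANCE
               for start, end in header_spans(content_html))
```

A keyword match that fails either check would be skipped, and the next occurrence in the content tried instead.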

**FAQ Questions:**
- Questions are structural elements displayed separately
- Questions should never contain links
- Detection: `check_faq_question_contains_keyword()` function
- Questions are stored in `faqs[].question` field (plain text)

**Related Content Carousel:**
- Posts in `related_posts` array already have links via carousel component
- Duplicate inline links should be avoided unless high-value
- Detection: `check_related_posts_duplicate()` function
- Decision logic: `should_add_despite_carousel()` (pillar, high PR, different anchor)

**Script/Style Tags:**
- Already protected by `is_inside_script_or_style()` function

**HTML Tag Attributes:**
- Already protected by `is_inside_html_tag()` function

**Existing Links:**
- Already protected by `is_inside_existing_link()` function

### Safe Areas (Can Add Links)

**Paragraphs (`<p>`) in `content.html`:**
- Natural paragraph content with sufficient context (minimum 20 chars)
- Must be actual paragraph tag, not header/list fragment
- Detection: `is_in_safe_paragraph()` function

**List Items (`<li>`) in `content.html`:**
- List items with sufficient context (minimum 15 chars)
- Must have natural sentence context, not just keyword
- Detection: `is_in_safe_paragraph()` function

**FAQ Answers (`faqs[].answer`):**
- FAQ answers are HTML content separate from questions
- Answers can contain contextual links
- Processing: `process_faq_answer_link()` function
- Structure preserved: question/answer separation maintained

**Table Cells (`<td>`):**
- Only if natural sentence context (rare)
- Must have sufficient text (minimum 20 chars)

### Best Practices

**Always Add:**
- Pillar page links (even if in carousel)
- High PR (35+) to high-traffic targets
- Contextual mentions in paragraphs

**Review Before Adding:**
- Links to posts already in carousel
- Links near the 50-character header proximity minimum
- Links in list items without paragraph context

**Never Add:**
- Links in headers (h1-h6)
- Links in FAQ questions
- Links in carousel component HTML
- Duplicate links with same anchor text
- Links too close to headers (<50 chars)
- Links too close to other links (<200 chars)
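
The 200-character spacing rule can be sketched as a greedy selection over candidate offsets. The function names are hypothetical, and measuring distance between link start offsets is an assumption.

```python
MIN_LINK_DISTANCE = 200  # characters between links


def far_enough_from_links(position, existing_positions,
                          min_distance=MIN_LINK_DISTANCE):
    """True if position keeps the minimum spacing from every other link."""
    return all(abs(position - other) >= min_distance
               for other in existing_positions)


def pick_positions(candidates, existing_positions):
    """Greedily accept candidate offsets that keep the minimum spacing."""
    accepted = []
    taken = list(existing_positions)
    for position in sorted(candidates):
        if far_enough_from_links(position, taken):
            accepted.append(position)
            taken.append(position)  # later candidates must clear this one too
    return accepted
```

Each accepted position is added to the tracked list, so a second link on the same page is measured against links added earlier in the same run as well as pre-existing ones.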

## Common Issues and Solutions

### Issue: Keyword not found

**Cause:** Keyword form mismatch (plural vs singular, capitalization)

**Solution:** The script uses German word boundary detection and preserves the original word form. Check whether the keyword appears in a different form in the content.
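
German word boundary matching of the kind mentioned in the solution above can be sketched with explicit lookarounds. This is an illustration, not the implementation of `german_word_boundary_pattern()`: the names below are hypothetical, and the sketch does not handle singular/plural mismatches.

```python
import re

GERMAN_LETTERS = "A-Za-zÄÖÜäöüß"


def word_boundary_pattern(keyword):
    """Match keyword only when it is not embedded in a longer German word."""
    return re.compile(
        rf"(?<![{GERMAN_LETTERS}]){re.escape(keyword)}(?![{GERMAN_LETTERS}])",
        re.IGNORECASE,
    )


def find_keyword(text, keyword):
    """Return the keyword as it is written in the text, or None if absent."""
    match = word_boundary_pattern(keyword).search(text)
    return match.group(0) if match else None
```

Returning the matched text instead of the search keyword keeps the capitalization used in the content, so "dienstplan" from the CSV still yields the anchor "Dienstplan".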

### Issue: Link already exists

**Cause:** Link was added previously or exists in different form

**Solution:** The script checks for existing links before adding new ones. Review the `internal_links` array in the post's JSON file.
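
A quick manual check against the `internal_links` array might look like the sketch below. It assumes each entry stores its target under a `url` key, which is a guess at the schema; the function names are hypothetical.

```python
from urllib.parse import urlsplit


def path_of(url):
    """Normalize a URL to its path without a trailing slash."""
    return urlsplit(url).path.rstrip("/") or "/"


def existing_targets(post):
    """Normalized paths of links already recorded in internal_links."""
    return {path_of(entry["url"]) for entry in post.get("internal_links", [])}


def already_linked(post, target_url):
    """True if the post already records a link to the same page."""
    return path_of(target_url) in existing_targets(post)
```

Comparing normalized paths catches links that exist "in a different form", such as the same target with a trailing slash.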

### Issue: Link density exceeded

**Cause:** Page already has 20+ internal links

**Solution:** Prioritize highest-value opportunities. Consider removing lower-value existing links if needed.
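
To see how much room a page has left, the dynamic limit from Phase 2 (1 link per 200 words, max 20) can be computed as below. Rounding down and counting words by whitespace are assumptions, and the function names are hypothetical.

```python
def max_links_for(content_text, words_per_link=200, hard_cap=20):
    """Allow one link per 200 words, never more than 20 per page."""
    word_count = len(content_text.split())
    return min(word_count // words_per_link, hard_cap)


def remaining_link_budget(content_text, existing_link_count):
    """Number of links that can still be added; never negative."""
    return max(max_links_for(content_text) - existing_link_count, 0)
```

A budget of zero means new opportunities for that page can only be implemented by replacing lower-value existing links.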

### Issue: Link skipped - Keyword in header

**Cause:** Keyword found in header tag (h1-h6)

**Solution:** This is expected behavior. Headers should never contain links. Check if keyword appears elsewhere in content (paragraphs, FAQ answers).

### Issue: Link skipped - Keyword in FAQ question

**Cause:** Keyword found in FAQ question field

**Solution:** This is expected behavior. FAQ questions should never contain links. Check if keyword appears in FAQ answer instead.

### Issue: Link skipped - Target in carousel

**Cause:** Target URL already in `related_posts` carousel

**Solution:** Check the decision logic in `should_add_despite_carousel()`. The link is added only if it is high-value (pillar, high PR+volume, different anchor); otherwise the carousel link is sufficient.

### Issue: Link skipped - Too close to header

**Cause:** Keyword position is less than 50 characters from header

**Solution:** This is expected behavior. Links should maintain minimum distance from headers for natural flow. Check if keyword appears elsewhere with more distance.

### Issue: HTML validation errors

**Cause:** Pre-existing HTML issues (external links without nofollow, etc.)

**Solution:** These are false positives if they relate to existing external links. Focus on validating newly added internal links.

## Best Practices

1. **Always run dry-run first** before implementing links
2. **Review filtered opportunities** before implementation
3. **Test link functionality** after implementation
4. **Validate HTML structure** to ensure no broken tags
5. **Monitor link density** to stay within limits
6. **Preserve natural content flow** - links should enhance, not disrupt readability
7. **Use varied anchor text** - avoid exact match keyword stuffing
8. **Support pillar-cluster model** - link from detailed posts to pillar pages
9. **Maintain topical relevance** - only link related content
10. **Document reasoning** - include "Ahrefs opportunity" in link metadata

## Future Improvements

- [ ] Automated link health monitoring (broken links, 404s)
- [ ] Performance tracking (click-through rates, engagement)
- [ ] Periodic review of link quality
- [ ] Integration with SEO dashboard for metrics
- [ ] Automated link density alerts
- [ ] Anchor text variation analysis

## Recent Implementation (January 2026)

**CSV Source:** `ordio_24-jan-2026_link-opportunities_2026-01-27_07-39-05.csv`

**Results:**
- Total opportunities analyzed: 43
- Approved & implemented: 8 (18.6%)
- Rejected: 35 (81.4%)
  - 17 already linked
  - 13 unsafe placement
  - 3 source pages not found (pillar pages)
  - 2 keywords in FAQ questions

**Implementation Report:** `v2/scripts/blog/ahrefs-analysis/implementation-report-jan-2026.md`

**Key Learnings:**
1. Many opportunities rejected due to existing links (good - shows content already well-linked)
2. Pillar pages (`/insights/dienstplan`, `/insights/zeiterfassung`) are not blog posts - handle separately
3. Keywords in headers are common - consider expanding content or finding alternative placements
4. All implemented links validated for safe placement, duplicates, and SEO quality

## Related Documentation

- `docs/seo/ahrefs-link-opportunities-implementation-report.md` - Implementation report template
- `v2/scripts/blog/ahrefs-analysis/implementation-report-jan-2026.md` - January 2026 implementation report
- `v2/scripts/blog/link_utils.py` - Utility functions for link operations
- `.cursor/rules/shared-patterns.mdc` - Universal validation checklist
- `docs/content/blog/SAFE_LINK_PLACEMENT_GUIDE.md` - Safe link placement guidelines
